[jira] [Commented] (HDFS-13117) Proposal to support writing replications to HDFS asynchronously

xuchuanyin (JIRA) Sat, 10 Feb 2018 19:59:15 -0800

    [ 
https://issues.apache.org/jira/browse/HDFS-13117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16359764#comment-16359764
 ]


xuchuanyin commented on HDFS-13117:
-----------------------------------

[~jojochuang] Actually I've tested it in a 3-node cluster. The time to copy 
file from local disk to HDFS with 3 replication is about 
{color:#FF0000}*300ms*{color} and the time to change the HDFS file from 
1-replica to 3-replica costs about {color:#FF0000}*10ms*{color} or less. (This 
does not contain the time to write local disk and the time to write first 
replica to HDFS).

 

Besides, skip writing to local disk will {color:#FF0000}save about 33% amount 
of disk write I/O{color}.

> Proposal to support writing replications to HDFS asynchronously
> ---------------------------------------------------------------
>
>                 Key: HDFS-13117
>                 URL: https://issues.apache.org/jira/browse/HDFS-13117
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: xuchuanyin
>            Priority: Major
>
> My initial question was as below:
> ```
> I've learned that When We write data to HDFS using the interface provided by 
> HDFS such as 'FileSystem.create', our client will block until all the blocks 
> and their replications are done. This will cause efficiency problem if we use 
> HDFS as our final data storage. And many of my colleagues write the data to 
> local disk in the main thread and copy it to HDFS in another thread. 
> Obviously, it increases the disk I/O.
>  
>    So, is there a way to optimize this usage? I don't want to increase the 
> disk I/O, neither do I want to be blocked during the writing of extra 
> replications.
>   How about writing to HDFS by specifying only one replication in the main 
> thread and set the actual number of replication in another thread? Or is 
> there any better way to do this?
> ```
>  
> So my proposal here is to support writing extra replications to HDFS 
> asynchronously. User can set a minimum replicator as acceptable number of 
> replications ( < default or expected replicator). When writing to HDFS, user 
> will only be blocked until the minimum replicator has been finished and HDFS 
> will continue to complete the extra replications in background.Since HDFS 
> will periodically check the integrity of all the replications, we can also 
> leave this work to HDFS itself.
>  
> There are ways to provide the interfaces：
> 1. Creating a series of interfaces by adding `acceptableReplication` 
> parameter to the current interfaces as below:
> ```
> Before：
> FSDataOutputStream create(Path f,
>   boolean overwrite,
>   int bufferSize,
>   short replication,
>   long blockSize
> ) throws IOException
>  
> After：
> FSDataOutputStream create(Path f,
>   boolean overwrite,
>   int bufferSize,
>   short replication,
>   short acceptableReplication, // minimum number of replication to finish 
> before return
>   long blockSize
> ) throws IOException
> ```
>  
> 2. Adding the `acceptableReplication` and `asynchronous` to the runtime (or 
> default) configuration, so user will not have to change any interface and 
> will benefit from this feature.
>  
> How do you think about this?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Commented] (HDFS-13117) Proposal to support writing replications to HDFS asynchronously

Reply via email to