[ https://issues.apache.org/jira/browse/HDFS-13117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356722#comment-16356722 ]
Wei-Chiu Chuang commented on HDFS-13117:
----------------------------------------

{quote}at least the time between the last block of the first replication and the last block of the last replication can be saved.{quote}

Maybe. But the latency is less than 1 ms. Note that data are written in 512-byte chunks, not in blocks, so you would save almost nothing. If you really want to test the performance of your approach, try creating a file with replication=1 and then use FileSystem.setReplication() to make it 3-replica. The NameNode will then schedule the extra replication asynchronously. I don't think you'll notice much difference compared to writing a file with replication=3.

> Proposal to support writing replications to HDFS asynchronously
> ---------------------------------------------------------------
>
>                 Key: HDFS-13117
>                 URL: https://issues.apache.org/jira/browse/HDFS-13117
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: xuchuanyin
>            Priority: Major
>
> My initial question was as below:
> ```
> I've learned that when we write data to HDFS using an interface provided by
> HDFS such as 'FileSystem.create', our client will block until all the blocks
> and their replications are done. This causes an efficiency problem if we use
> HDFS as our final data storage. Many of my colleagues write the data to
> local disk in the main thread and copy it to HDFS in another thread.
> Obviously, this increases the disk I/O.
>
> So, is there a way to optimize this usage? I don't want to increase the
> disk I/O, nor do I want to be blocked while the extra replications are
> written.
> How about writing to HDFS with only one replication in the main thread
> and setting the actual number of replications in another thread? Or is
> there a better way to do this?
> ```
>
> So my proposal here is to support writing extra replications to HDFS
> asynchronously. The user can set a minimum replication factor as the
> acceptable number of replications (less than the default or expected
> replication factor).
> When writing to HDFS, the user will only be blocked until the minimum
> number of replications has been written, and HDFS will complete the extra
> replications in the background. Since HDFS periodically checks the
> integrity of all replications, we can also leave this work to HDFS itself.
>
> There are two ways to provide the interfaces:
> 1. Create a variant of each current interface that adds an
> `acceptableReplication` parameter, as below:
> ```
> Before:
> FSDataOutputStream create(Path f,
>     boolean overwrite,
>     int bufferSize,
>     short replication,
>     long blockSize
> ) throws IOException
>
> After:
> FSDataOutputStream create(Path f,
>     boolean overwrite,
>     int bufferSize,
>     short replication,
>     short acceptableReplication, // minimum number of replications to finish before return
>     long blockSize
> ) throws IOException
> ```
>
> 2. Add `acceptableReplication` and `asynchronous` to the runtime (or
> default) configuration, so the user will not have to change any interface
> calls and will still benefit from this feature.
>
> What do you think about this?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
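The workaround Wei-Chiu describes above (write with replication=1, then raise the factor with FileSystem.setReplication so the NameNode replicates in the background) can be sketched as follows. This is a minimal illustration, not code from the patch: it assumes a reachable HDFS cluster configured via the default Configuration, and the path /tmp/async-replication-demo is an arbitrary example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AsyncReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/async-replication-demo");

        // Write with a single replica: the client only waits on a
        // one-DataNode pipeline instead of three.
        short initialReplication = 1;
        try (FSDataOutputStream out = fs.create(
                file,
                true,                                        // overwrite
                conf.getInt("io.file.buffer.size", 4096),    // buffer size
                initialReplication,
                fs.getDefaultBlockSize(file))) {
            out.writeBytes("some payload");
        }

        // Raise the replication factor to 3. This call returns quickly:
        // the NameNode schedules the extra replicas asynchronously, which
        // is the behavior the proposal wants to expose at write time.
        fs.setReplication(file, (short) 3);
    }
}
```

As Wei-Chiu notes, benchmarking this against a plain replication=3 write is a simple way to measure how much latency the proposal could actually save.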