Thanks, Miklos.

To achieve the goal, we need to combine two or more of the interfaces HDFS currently provides. So, what do you think about providing another interface that takes care of the extra replicas for us, to make this simpler?

Before:

    FSDataOutputStream create(Path f,
                              boolean overwrite,
                              int bufferSize,
                              short replication,
                              long blockSize) throws IOException

After:

    FSDataOutputStream create(Path f,
                              boolean overwrite,
                              int bufferSize,
                              short replication,
                              short acceptableReplication, // block only until this minimum replication is reached
                              long blockSize) throws IOException
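In the meantime, here is a minimal sketch of the pattern as I understand it from your links, using only the existing FileSystem API; the class and method names are my own, not from FrameworkUploader:

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AsyncReplicationWrite {
      /**
       * Hypothetical helper: write with a low replication factor so the
       * client only blocks on 'acceptableReplication' replicas, then raise
       * the factor and let HDFS create the rest in the background.
       */
      public static void writeWithLateReplication(FileSystem fs, Path path,
          byte[] data, short acceptableReplication, short finalReplication)
          throws IOException {
        int bufferSize = fs.getConf().getInt("io.file.buffer.size", 4096);
        long blockSize = fs.getDefaultBlockSize(path);

        // 1. Create the file with the low replication factor, so the write
        //    pipeline only spans that many datanodes (FrameworkUploader#L193).
        try (FSDataOutputStream out =
            fs.create(path, true, bufferSize, acceptableReplication, blockSize)) {
          // 2. Write straight to HDFS instead of a local copy (#L195).
          out.write(data);
        }

        // 3. Raise the replication factor; the extra replicas are created
        //    asynchronously in the background (#L250).
        fs.setReplication(path, finalReplication);
      }
    }

The only blocking part is then the pipeline write with acceptableReplication replicas; step 3 returns immediately.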
-----Original Message-----
From: "Miklos Szegedi" <szege...@apache.org>
Sent: 2018-01-30 01:50:23 (Tuesday)
To: "徐传印" <xuchuan...@hust.edu.cn>
Cc: Hdfs-dev <hdfs-...@hadoop.apache.org>, "Hadoop Common" <common-...@hadoop.apache.org>, "common-u...@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: performance about writing data to HDFS

Hello,

Here is an example. You can set an initial low replication like this code does:
https://github.com/apache/hadoop/blob/56feaa40bb94fcaa96ae668eebfabec4611928c0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/main/java/org/apache/hadoop/mapred/uploader/FrameworkUploader.java#L193

Create and write to a stream instead of dealing with a local copy:
https://github.com/apache/hadoop/blob/56feaa40bb94fcaa96ae668eebfabec4611928c0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/main/java/org/apache/hadoop/mapred/uploader/FrameworkUploader.java#L195

Once you are done, you can set a final replication count and HDFS will replicate in the background:
https://github.com/apache/hadoop/blob/56feaa40bb94fcaa96ae668eebfabec4611928c0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/main/java/org/apache/hadoop/mapred/uploader/FrameworkUploader.java#L250

You can optionally even wait until an acceptable replication count is reached:
https://github.com/apache/hadoop/blob/56feaa40bb94fcaa96ae668eebfabec4611928c0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/main/java/org/apache/hadoop/mapred/uploader/FrameworkUploader.java#L256

Thanks,
Miklos

On Sun, Jan 28, 2018 at 10:57 PM, 徐传印 <xuchuan...@hust.edu.cn> wrote:

Hi community:

I have a question about the performance of writing to HDFS.

I've learned that when we write data to HDFS through an interface such as FileSystem.create, the client blocks until all the blocks and their replicas are written. This becomes an efficiency problem when HDFS is the final data store, so many of my colleagues write the data to local disk in the main thread and copy it to HDFS in another thread. Obviously, that adds an extra round of disk I/O.

So, is there a way to optimize this usage? I don't want to increase the disk I/O, nor do I want to be blocked while the extra replicas are written. How about writing to HDFS with a replication factor of one in the main thread and setting the actual replication factor in another thread? Or is there a better way to do this?
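P.S. For the optional waiting step in the last link, here is a minimal sketch of such a wait loop. It assumes that polling the block locations is an acceptable way to observe the achieved replica count, which is similar in spirit to what FrameworkUploader does; the class and method names are again my own:

    import java.io.IOException;

    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationWaiter {
      /** Poll until every block of 'path' has at least 'acceptable' replicas. */
      public static void waitForReplication(FileSystem fs, Path path,
          short acceptable, long pollMillis)
          throws IOException, InterruptedException {
        while (true) {
          FileStatus status = fs.getFileStatus(path);
          BlockLocation[] blocks =
              fs.getFileBlockLocations(status, 0, status.getLen());
          boolean done = true;
          for (BlockLocation block : blocks) {
            // getHosts() lists the datanodes currently holding this block.
            if (block.getHosts().length < acceptable) {
              done = false;
              break;
            }
          }
          if (done) {
            return;
          }
          // Replication happens in the background, so just poll again later.
          Thread.sleep(pollMillis);
        }
      }
    }

Calling waitForReplication(fs, path, acceptableReplication, 1000) after setReplication() would then give the same blocking guarantee as the acceptableReplication parameter proposed above.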