Hello,

Here is an example.
You can set an initial low replication like this code does:
https://github.com/apache/hadoop/blob/56feaa40bb94fcaa96ae668eebfabec4611928c0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/main/java/org/apache/hadoop/mapred/uploader/FrameworkUploader.java#L193

Create and write to a stream instead of dealing with a local copy:
https://github.com/apache/hadoop/blob/56feaa40bb94fcaa96ae668eebfabec4611928c0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/main/java/org/apache/hadoop/mapred/uploader/FrameworkUploader.java#L195

Once you are done, you can set a final replication count and HDFS will replicate in the background:
https://github.com/apache/hadoop/blob/56feaa40bb94fcaa96ae668eebfabec4611928c0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/main/java/org/apache/hadoop/mapred/uploader/FrameworkUploader.java#L250

You can optionally even wait until an acceptable replication count is reached:
https://github.com/apache/hadoop/blob/56feaa40bb94fcaa96ae668eebfabec4611928c0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/main/java/org/apache/hadoop/mapred/uploader/FrameworkUploader.java#L256

Thanks,
Miklos

On Sun, Jan 28, 2018 at 10:57 PM, 徐传印 <xuchuan...@hust.edu.cn> wrote:

> Hi community:
> I have a question about the performance of writing to HDFS.
>
> I've learned that when we write data to HDFS using the interface
> provided by HDFS, such as 'FileSystem.create', our client will block until
> all the blocks and their replicas are written. This will cause an efficiency
> problem if we use HDFS as our final data storage. Many of my colleagues
> write the data to local disk in the main thread and copy it to HDFS in
> another thread. Obviously, this increases the disk I/O.
>
> So, is there a way to optimize this usage? I don't want to increase the
> disk I/O, nor do I want to be blocked while the extra replicas are
> written.
>
> How about writing to HDFS with only one replica in the main thread and
> setting the actual replication count in another thread? Or is there a
> better way to do this?
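
For readers who want the flow from the reply in one place (low initial replication, write through a stream, request the final replication, optionally wait), here is a minimal sketch. It is not the FrameworkUploader code. To stay runnable without a cluster, it is written against a small hypothetical interface, ReplicatedFs, with an in-memory stand-in; the interface, the stand-in, the path and the timeout values are all assumptions made for illustration. On a real cluster the three methods would roughly correspond to FileSystem.create(Path, short), FileSystem.setReplication(Path, short), and counting the hosts returned by FileSystem.getFileBlockLocations.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class LowReplicationWrite {

    /** Hypothetical minimal view of a replicated file system (not a Hadoop type). */
    interface ReplicatedFs {
        OutputStream create(String path, short replication) throws IOException;
        boolean setReplication(String path, short replication) throws IOException;
        int currentReplication(String path) throws IOException;
    }

    /**
     * Writes data through a stream at a low initial replication, then requests
     * the target replication and polls until an acceptable count is visible.
     * Returns false if the acceptable count is not reached before the timeout.
     */
    static boolean writeThenReplicate(ReplicatedFs fs, String path, byte[] data,
            short initial, short target, short acceptable, long timeoutMillis)
            throws IOException, InterruptedException {
        // Steps 1 and 2: create with few replicas and write directly to the stream.
        try (OutputStream out = fs.create(path, initial)) {
            out.write(data);
        }
        // Step 3: ask for the final count; extra replicas are made in the background.
        fs.setReplication(path, target);
        // Step 4 (optional): wait until enough replicas exist.
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (fs.currentReplication(path) < acceptable) {
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            Thread.sleep(10);
        }
        return true;
    }

    /** In-memory stand-in: gains one replica per poll to imitate background replication. */
    static class InMemoryFs implements ReplicatedFs {
        final Map<String, byte[]> files = new HashMap<>();
        final Map<String, Integer> current = new HashMap<>();
        final Map<String, Integer> requested = new HashMap<>();

        public OutputStream create(String path, short replication) {
            current.put(path, (int) replication);
            requested.put(path, (int) replication);
            return new ByteArrayOutputStream() {
                @Override
                public void close() throws IOException {
                    super.close();
                    files.put(path, toByteArray()); // file becomes visible on close
                }
            };
        }

        public boolean setReplication(String path, short replication) {
            if (!files.containsKey(path)) {
                return false;
            }
            requested.put(path, (int) replication);
            return true;
        }

        public int currentReplication(String path) {
            int now = current.get(path);
            if (now < requested.get(path)) {
                current.put(path, now + 1); // one more replica by the next poll
            }
            return now;
        }
    }

    public static void main(String[] args) throws Exception {
        InMemoryFs fs = new InMemoryFs();
        byte[] data = "hello hdfs".getBytes(StandardCharsets.UTF_8);
        boolean ok = writeThenReplicate(fs, "/tmp/example.txt", data,
                (short) 1, (short) 3, (short) 2, 5_000L);
        System.out.println("acceptable replication reached: " + ok);
    }
}
```

The wait step is separate from the write, so the writing thread can skip it entirely or hand it to another thread, which matches the idea raised in the original question.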