Thanks, Miklos.

To achieve the goal, we need to combine two or more of the interfaces HDFS currently provides. So, what do you think about providing another interface that takes care of the extra replicas for us, to make this simpler?

Before:

    FSDataOutputStream create(Path f,
                              boolean overwrite,
                              int bufferSize,
                              short replication,
                              long blockSize) throws IOException

After:

    FSDataOutputStream create(Path f,
                              boolean overwrite,
                              int bufferSize,
                              short replication,
                              short acceptableReplication, // block only until this minimum replication is reached
                              long blockSize) throws IOException
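In the meantime, here is a minimal sketch of the pattern as I understand it from your links, using only the existing FileSystem API; the class and method names are my own, not from FrameworkUploader:

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AsyncReplicationWrite {
      /**
       * Hypothetical helper: write with a low replication factor so the
       * client only blocks on 'acceptableReplication' replicas, then raise
       * the factor and let HDFS create the rest in the background.
       */
      public static void writeWithLateReplication(FileSystem fs, Path path,
          byte[] data, short acceptableReplication, short finalReplication)
          throws IOException {
        int bufferSize = fs.getConf().getInt("io.file.buffer.size", 4096);
        long blockSize = fs.getDefaultBlockSize(path);

        // 1. Create the file with the low replication factor, so the write
        //    pipeline only spans that many datanodes (FrameworkUploader#L193).
        try (FSDataOutputStream out =
            fs.create(path, true, bufferSize, acceptableReplication, blockSize)) {
          // 2. Write straight to HDFS instead of a local copy (#L195).
          out.write(data);
        }

        // 3. Raise the replication factor; the extra replicas are created
        //    asynchronously in the background (#L250).
        fs.setReplication(path, finalReplication);
      }
    }

The only blocking part is then the pipeline write with acceptableReplication replicas; step 3 returns immediately.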
-----Original Message-----
From: "Miklos Szegedi" <szege...@apache.org>
Sent: 2018-01-30 01:50:23 (Tuesday)
To: "徐传印" <xuchuan...@hust.edu.cn>
Cc: Hdfs-dev <hdfs-...@hadoop.apache.org>, "Hadoop Common" <common-...@hadoop.apache.org>, "common-u...@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: performance about writing data to HDFS

Hello,

Here is an example. You can set an initial low replication like this code does:
https://github.com/apache/hadoop/blob/56feaa40bb94fcaa96ae668eebfabec4611928c0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/main/java/org/apache/hadoop/mapred/uploader/FrameworkUploader.java#L193

Create and write to a stream instead of dealing with a local copy:
https://github.com/apache/hadoop/blob/56feaa40bb94fcaa96ae668eebfabec4611928c0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/main/java/org/apache/hadoop/mapred/uploader/FrameworkUploader.java#L195

Once you are done, you can set a final replication count and HDFS will replicate in the background:
https://github.com/apache/hadoop/blob/56feaa40bb94fcaa96ae668eebfabec4611928c0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/main/java/org/apache/hadoop/mapred/uploader/FrameworkUploader.java#L250

You can optionally even wait until an acceptable replication count is reached:
https://github.com/apache/hadoop/blob/56feaa40bb94fcaa96ae668eebfabec4611928c0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/main/java/org/apache/hadoop/mapred/uploader/FrameworkUploader.java#L256

Thanks,
Miklos

On Sun, Jan 28, 2018 at 10:57 PM, 徐传印 <xuchuan...@hust.edu.cn> wrote:

Hi community:

I have a question about the performance of writing to HDFS.

I've learned that when we write data to HDFS through an interface such as FileSystem.create, the client blocks until all the blocks and their replicas are written. This becomes an efficiency problem when HDFS is the final data store, so many of my colleagues write the data to local disk in the main thread and copy it to HDFS in another thread. Obviously, that adds an extra round of disk I/O.

So, is there a way to optimize this usage? I don't want to increase the disk I/O, nor do I want to be blocked while the extra replicas are written. How about writing to HDFS with a replication factor of one in the main thread and setting the actual replication factor in another thread? Or is there a better way to do this?
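P.S. For the optional waiting step in the last link, here is a minimal sketch of such a wait loop. It assumes that polling the block locations is an acceptable way to observe the achieved replica count, which is similar in spirit to what FrameworkUploader does; the class and method names are again my own:

    import java.io.IOException;

    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationWaiter {
      /** Poll until every block of 'path' has at least 'acceptable' replicas. */
      public static void waitForReplication(FileSystem fs, Path path,
          short acceptable, long pollMillis)
          throws IOException, InterruptedException {
        while (true) {
          FileStatus status = fs.getFileStatus(path);
          BlockLocation[] blocks =
              fs.getFileBlockLocations(status, 0, status.getLen());
          boolean done = true;
          for (BlockLocation block : blocks) {
            // getHosts() lists the datanodes currently holding this block.
            if (block.getHosts().length < acceptable) {
              done = false;
              break;
            }
          }
          if (done) {
            return;
          }
          // Replication happens in the background, so just poll again later.
          Thread.sleep(pollMillis);
        }
      }
    }

Calling waitForReplication(fs, path, acceptableReplication, 1000) after setReplication() would then give the same blocking guarantee as the acceptableReplication parameter proposed above.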