Re: performance about writing data to HDFS

Miklos Szegedi Mon, 29 Jan 2018 10:04:59 -0800

Hello,

Here is an example from Hadoop's FrameworkUploader.

You can start with a low initial replication factor, as this code does:
https://github.com/apache/hadoop/blob/56feaa40bb94fcaa96ae668eebfabec4611928c0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/main/java/org/apache/hadoop/mapred/uploader/FrameworkUploader.java#L193

Create a stream and write to it directly instead of going through a local copy:
https://github.com/apache/hadoop/blob/56feaa40bb94fcaa96ae668eebfabec4611928c0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/main/java/org/apache/hadoop/mapred/uploader/FrameworkUploader.java#L195
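
A minimal sketch of the streaming idea, not taken from the linked
FrameworkUploader: the producer writes its records straight into an
OutputStream, so no temporary local file is involved. The names
StreamingWriter and writeRecords are made up for illustration. On HDFS the
stream would be the one returned by FileSystem.create(...), which has
overloads that accept a replication factor; the sketch takes any
OutputStream so that it runs without a cluster.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, named for illustration only.
final class StreamingWriter {

    /**
     * Writes each record as one line directly to the target stream and
     * returns the number of records written.
     */
    static int writeRecords(Iterable<String> records, OutputStream target)
            throws IOException {
        BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(target, StandardCharsets.UTF_8));
        int count = 0;
        for (String record : records) {
            writer.write(record);
            writer.write('\n');
            count++;
        }
        // The caller owns the stream: flush buffered data, but do not close.
        writer.flush();
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for an HDFS stream. With Hadoop this would be the
        // FSDataOutputStream returned by FileSystem.create(...).
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int written = writeRecords(Arrays.asList("a", "b", "c"), out);
        System.out.println(written + " records, " + out.size() + " bytes");
    }
}
```

Leaving the stream open inside writeRecords keeps the decision about when
to close the file, and when to raise its replication, with the caller.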

Once you are done, you can set a final replication count and HDFS will
replicate in the background:
https://github.com/apache/hadoop/blob/56feaa40bb94fcaa96ae668eebfabec4611928c0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/main/java/org/apache/hadoop/mapred/uploader/FrameworkUploader.java#L250

You can optionally even wait until an acceptable replication count is
reached:
https://github.com/apache/hadoop/blob/56feaa40bb94fcaa96ae668eebfabec4611928c0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/main/java/org/apache/hadoop/mapred/uploader/FrameworkUploader.java#L256
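
The optional wait can be sketched as a polling loop with a timeout. This is
an illustration under assumptions, not the linked code: ReplicationWaiter
and waitForReplication are hypothetical names, and the current replica
count is supplied by the caller as an IntSupplier. With Hadoop, one would
first call FileSystem.setReplication(path, finalReplication), and the
supplier could, for example, take the smallest getHosts().length over the
BlockLocation entries returned by FileSystem.getFileBlockLocations.

```java
import java.util.function.IntSupplier;

// Hypothetical helper, named for illustration only.
final class ReplicationWaiter {

    /**
     * Polls currentReplication until it reaches acceptable or the timeout
     * elapses. Returns true if the acceptable count was reached in time.
     */
    static boolean waitForReplication(IntSupplier currentReplication,
                                      int acceptable,
                                      long timeoutMillis,
                                      long pollMillis)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (currentReplication.getAsInt() >= acceptable) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out before enough replicas appeared
            }
            Thread.sleep(pollMillis);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Fake probe: pretend one more replica appears on every poll.
        int[] replicas = {1};
        boolean ok = waitForReplication(() -> replicas[0]++, 3, 5_000, 10);
        System.out.println("acceptable replication reached: " + ok);
    }
}
```

Because the file is already readable at the lower replication, a timeout
here does not have to be fatal; the caller can log it and continue while
HDFS keeps replicating in the background.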

Thanks,
Miklos

On Sun, Jan 28, 2018 at 10:57 PM, 徐传印 <[email protected]> wrote:

>
> Hi community:
>   I have a question about the performance of writing to HDFS.
>
>   I've learned that when we write data to HDFS through an interface such
> as 'FileSystem.create', the client blocks until all the blocks and their
> replicas are written. This becomes an efficiency problem if we use HDFS as
> our final data storage. Many of my colleagues therefore write the data to
> local disk in the main thread and copy it to HDFS in another thread, which
> obviously increases disk I/O.
>
>   So, is there a way to optimize this usage? I want neither to increase
> disk I/O nor to be blocked while the extra replicas are written.
>
>   How about writing to HDFS with a replication factor of one in the main
> thread and setting the actual replication factor in another thread? Or is
> there a better way to do this?
>
