performance about writing data to HDFS

徐传印 Mon, 29 Jan 2018 09:35:33 -0800

Hi community:
  I have a question about the performance of writing to HDFS.


  I've learned that when we write data to HDFS through the interface HDFS 
provides, such as 'FileSystem.create', the client blocks until all the blocks 
and their replicas have been written. This becomes an efficiency problem if we 
use HDFS as our final data storage. Many of my colleagues work around it by 
writing the data to local disk in the main thread and copying it to HDFS in 
another thread. Obviously, this increases disk I/O.

  So, is there a way to optimize this usage? I don't want to increase disk 
I/O, nor do I want to be blocked while the extra replicas are written.



  How about writing to HDFS with a replication factor of one in the main 
thread and then setting the actual replication factor in another thread? Or is 
there a better way to do this?
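
  To make the idea concrete, here is a minimal sketch of "write with one 
replica, raise the replication factor in the background". It is only an 
illustration, not a tested HDFS client: the 'Store' interface and the 
'InMemoryStore' class are hypothetical stand-ins so the example runs without a 
cluster. With a real client, the two methods would correspond to 
'FileSystem.create(path, (short) 1)' followed by writing and closing the 
stream, and to 'FileSystem.setReplication(path, replication)'; both of those 
throw IOException, which the stand-in leaves out.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class DeferredReplication {

    /** Hypothetical stand-in for the two HDFS client calls involved. */
    interface Store {
        // Assumed to map to FileSystem.create(path, replication) + write + close.
        void write(String path, byte[] data, short replication);

        // Assumed to map to FileSystem.setReplication(path, replication).
        boolean setReplication(String path, short replication);
    }

    /** In-memory fake that only records the replication factor per path. */
    static final class InMemoryStore implements Store {
        private final Map<String, Short> replication = new ConcurrentHashMap<>();

        @Override
        public void write(String path, byte[] data, short r) {
            replication.put(path, r);
        }

        @Override
        public boolean setReplication(String path, short r) {
            // Succeeds only if the file already exists.
            return replication.replace(path, r) != null;
        }

        short replicationOf(String path) {
            return replication.getOrDefault(path, (short) 0);
        }
    }

    /**
     * Writes the file with a single replica on the calling thread, then
     * requests the target replication factor on a background executor.
     */
    static CompletableFuture<Boolean> writeThenRaise(
            Store store, String path, byte[] data,
            short target, ExecutorService background) {
        store.write(path, data, (short) 1);
        return CompletableFuture.supplyAsync(
                () -> store.setReplication(path, target), background);
    }

    public static void main(String[] args) {
        InMemoryStore store = new InMemoryStore();
        ExecutorService background = Executors.newSingleThreadExecutor();
        try {
            boolean raised = writeThenRaise(
                    store, "/data/out.bin", new byte[] {1, 2, 3},
                    (short) 3, background).join();
            System.out.println("raised=" + raised
                    + " replication=" + store.replicationOf("/data/out.bin"));
        } finally {
            background.shutdown();
        }
    }
}
```

  One trade-off to keep in mind: until the extra replicas actually exist, the 
data lives on a single DataNode, so losing that node in that window would lose 
the file. As far as I know, 'setReplication' only records the new target and 
the NameNode schedules the copies asynchronously, so the call itself may be 
cheap enough that a separate thread is not strictly needed.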
