Hi, all,

I am trying to understand the Trafodion bulk loader better. One thing I noticed is 
that the bulk loader generates HFiles of about 10 GB into the staging area, and then 
incrementally adds them into the corresponding HBase regions. My question is: how is 
this 10 GB determined? Is there any way I can change it to a smaller value?
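
To make it concrete, what I am hoping for is something like the following before 
the LOAD statement (the CQD name and the table names here are only my guesses / 
placeholders, and I am assuming the unit would be MB):

    -- hypothetical CQD: cap each staged HFile at ~500 MB instead of 10 GB
    cqd TRAF_LOAD_MAX_HFILE_SIZE '500';
    load into trafodion.seabase.target_table
      select * from hive.hive.source_table;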

The purpose is this: 10 GB is rather big, so I assume that is why the SORT operator 
needs to overflow to scratch files. I am wondering, if each HFile were 500 MB for 
example, whether the SORT could be done entirely in RAM, avoiding the 'write 
amplification' of writing 10 GB into scratch files and then reading the same data 
back out to write it into the HFile. And the scratch content is not compressed, so 
that is quite a lot of IO cost. Wouldn't this improve the overall bulk loading 
speed by avoiding the scratch files?
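
Here is my back-of-envelope estimate of what that could save, assuming each HFile's 
worth of data spills to scratch exactly once and ignoring compression of the final 
HFiles:

    today (with spill): 10 GB scratch write + 10 GB scratch read + 10 GB HFile write ~= 30 GB of IO per HFile
    sort fits in RAM:   10 GB HFile write only                                       ~= 10 GB of IO per HFile

So roughly two thirds of the disk traffic in the sort/write phase might be avoided, 
if my understanding of the spill pattern is right.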

Although the bulk load speed is not bad compared to HBase's importtsv utility in my 
tests these days (Trafodion is up to 3x faster than importtsv), I am wondering if 
there is still room to improve. So if this 10 GB can be changed so that I can do 
some more tests, that would be very helpful.

Of course, that would create a bunch of small HFiles, and HBase would need to do a 
major compaction. So it just postpones the dirty work, but I want to try it out and 
see if it could help the loading.
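
For example, I suppose the compaction could be kicked off manually afterwards from 
the HBase shell (the table name below is just a placeholder for whatever name 
Trafodion gives the underlying HBase table):

    hbase shell
    > major_compact 'TRAFODION.SEABASE.TARGET_TABLE'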

So if there is a CQD I can try, that would be super.

Thanks,
Ming
