On Fri, Dec 18, 2015 at 4:31 PM, Riesland, Zack
<zack.riesl...@sensus.com> wrote:
> We are able to ingest MUCH larger sets of data (hundreds of GB) using the 
> CSVBulkLoadTool.
>
> However, we have found it to be a huge memory hog.
>
> We dug into the source a bit and found that 
> HFileOutputFormat.configureIncrementalLoad(), in using TotalOrderPartitioner 
> and KeyValueSortReducer, ultimately keeps a TreeSet of all the key/value pairs 
> before finally writing the HFiles.
>
> So if the size of your data exceeds the memory allocated on the client 
> calling the MapReduce job, it will eventually fail.


I think (or at least hope!) that the situation isn't quite as bad as that.

The HFileOutputFormat.configureIncrementalLoad call will load the
start keys of all regions, and configure those for use by the
TotalOrderPartitioner. This will grow as the number of regions for the
output table grows.
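
To make that concrete, the driver side of a bulk load looks roughly like
the sketch below (simplified, not the actual CsvBulkLoadTool code; the
class name and "MY_TABLE" are just placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class BulkLoadDriverSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "csv-bulk-load-sketch");

    // "MY_TABLE" is only a placeholder for the output table name.
    HTable table = new HTable(conf, "MY_TABLE");

    // configureIncrementalLoad() inspects the region boundaries of the
    // output table and:
    //  - sets the number of reduce tasks to the number of regions
    //  - writes the region start keys to a partitions file and points
    //    TotalOrderPartitioner at it, so each reducer covers one
    //    region's key range
    //  - sets the sorting reducer (KeyValueSortReducer for KeyValue output)
    HFileOutputFormat.configureIncrementalLoad(job, table);

    // (Mapper, input/output paths, etc. omitted -- this only shows the
    // part whose footprint depends on the region count.)
  }
}

So what gets held in memory at configuration time is one start key per
region, not the data itself.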

The KeyValueSortReducer does indeed use a TreeSet to store KeyValues,
but this is a TreeSet per distinct row key. The size of this TreeSet
will grow as the number of columns per row grows. The memory usage is
typically higher than expected because each single column value is
stored as a full KeyValue, which carries the complete row key of the row
along with the value.
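
For what it's worth, the reduce() method has roughly this shape (a
simplified sketch, not a verbatim copy of the HBase class; I've replaced
its clone call with a plain byte copy):

import java.io.IOException;
import java.util.Arrays;
import java.util.TreeSet;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.mapreduce.Reducer;

// reduce() is called once per distinct row key, so the TreeSet only ever
// holds the cells of one row, not the whole data set.
public class RowSortReducerSketch
    extends Reducer<ImmutableBytesWritable, KeyValue, ImmutableBytesWritable, KeyValue> {

  @Override
  protected void reduce(ImmutableBytesWritable row, Iterable<KeyValue> kvs, Context context)
      throws IOException, InterruptedException {
    TreeSet<KeyValue> sorted = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
    for (KeyValue kv : kvs) {
      // Copy the cell before storing it: Hadoop reuses the KeyValue object
      // it hands out while iterating over the values.
      byte[] copy = Arrays.copyOfRange(
          kv.getBuffer(), kv.getOffset(), kv.getOffset() + kv.getLength());
      sorted.add(new KeyValue(copy));
    }
    for (KeyValue kv : sorted) {
      // Each KeyValue carries the full row key, family and qualifier along
      // with the value, which is where the extra memory per cell goes.
      context.write(row, kv);
    }
  }
}

So the reducer's heap usage is bounded by the widest single row rather
than by the total size of the input.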

- Gabriel
