We are able to ingest much larger data sets (hundreds of GB) using the 
CsvBulkLoadTool. 

However, we have found it to be a huge memory hog.

We dug into the source a bit and found that 
HFileOutputFormat.configureIncrementalLoad(), by wiring up TotalOrderPartitioner 
and KeyValueSortReducer, ultimately buffers the key/value pairs in an in-memory 
TreeSet before finally writing the HFiles.

So if the size of your data exceeds the memory allocated to the client calling 
the MapReduce job, it will eventually fail with an out-of-memory error.
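
For what it's worth, the buffering we're describing looks roughly like the 
sketch below. It is modeled on HBase's KeyValueSortReducer as we read it, not 
the exact Phoenix reducer, so treat the class name and details as illustrative:

import java.io.IOException;
import java.util.Arrays;
import java.util.TreeSet;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of the sort-reducer pattern: the KeyValues for a row are buffered in
// an in-memory TreeSet before anything is handed to HFileOutputFormat.
public class SortingKeyValueReducer
    extends Reducer<ImmutableBytesWritable, KeyValue, ImmutableBytesWritable, KeyValue> {

  @Override
  protected void reduce(ImmutableBytesWritable row, Iterable<KeyValue> kvs, Context context)
      throws IOException, InterruptedException {
    // KeyValue.COMPARATOR puts cells into the order HFiles require.
    TreeSet<KeyValue> sorted = new TreeSet<>(KeyValue.COMPARATOR);
    for (KeyValue kv : kvs) {
      // Hadoop reuses the value object between iterations, so deep-copy the
      // backing bytes before stashing them; this buffer is where the memory goes.
      byte[] copy = Arrays.copyOfRange(kv.getBuffer(), kv.getOffset(),
          kv.getOffset() + kv.getLength());
      sorted.add(new KeyValue(copy, 0, copy.length));
    }
    // Only after everything is buffered and sorted are the cells emitted.
    for (KeyValue kv : sorted) {
      context.write(row, kv);
    }
  }
}

With the local job runner, all of that buffering happens inside the single JVM 
that launched the job, which is why the client heap matters so much.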

Again, though, that data set doesn't seem anywhere near large enough to be an 
issue.

-----Original Message-----
From: Gabriel Reid [mailto:gabriel.r...@gmail.com] 
Sent: Friday, December 18, 2015 10:17 AM
To: user@phoenix.apache.org
Subject: Re: Java Out of Memory Errors with CsvBulkLoadTool

Hi Jonathan,

Sounds like something is very wrong here.

Are you running the job on an actual cluster, or are you using the local job 
tracker (i.e. running the import job on a single computer)?

Normally an import job, regardless of the size of the input, should run with 
map and reduce tasks that have a standard (e.g. 2GB) heap size per task 
(although there will typically be multiple tasks started on the cluster). There 
shouldn't be any need to have anything like a 48GB heap.
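
For reference, the per-task heap normally comes from the standard MapReduce 
properties rather than from the client JVM's -Xmx. A rough sketch of wiring 
those up when launching the bulk load (assuming the Phoenix 4.x class name 
org.apache.phoenix.mapreduce.CsvBulkLoadTool; the property names are the usual 
Hadoop 2.x ones, and the 2GB/2.5GB figures are just examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.phoenix.mapreduce.CsvBulkLoadTool;

public class BulkLoadLauncher {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Heap available to each map/reduce task; this, not the client's -Xmx,
    // is what the import tasks actually run with.
    conf.set("mapreduce.map.java.opts", "-Xmx2048m");
    conf.set("mapreduce.reduce.java.opts", "-Xmx2048m");
    // YARN container sizes, kept a bit larger than the task heap.
    conf.setInt("mapreduce.map.memory.mb", 2560);
    conf.setInt("mapreduce.reduce.memory.mb", 2560);
    System.exit(ToolRunner.run(conf, new CsvBulkLoadTool(), args));
  }
}

The same properties can also be passed on the hadoop command line with -D.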

If you are running this on an actual cluster, could you elaborate on where/how 
you're setting the 48GB heap size?

- Gabriel


On Fri, Dec 18, 2015 at 1:46 AM, Cox, Jonathan A <ja...@sandia.gov> wrote:
> I am trying to ingest a 575MB CSV file with 192,444 lines using the 
> CsvBulkLoadTool MapReduce job. When running this job, I find that I 
> have to boost the max Java heap space to 48GB (24GB fails with Java 
> out of memory errors).
>
>
>
> I’m concerned about scaling issues. It seems like it shouldn’t require 
> 24-48GB of memory to ingest a 575MB file. However, I am pretty new to 
> Hadoop/HBase/Phoenix, so maybe I am off base here.
>
>
>
> Can anybody comment on this observation?
>
>
>
> Thanks,
>
> Jonathan
