Got the same problem here. My bulk load job failed due to lack of memory at the reduce phase: 15M rows each day into a Phoenix table with an additional index.
In the end, I recreated my tables with salting. It helped a lot, because the bulk load job launched the same number of reducers as there are salt buckets (a rough sketch of a salted table definition is at the end of this mail). Still, I believe that if you're bulk loading a very large dataset it could fail again; AFAIK, the current MR-based bulk loading requires a lot of memory for writing the target files.

- Youngwoo

On Sat, Dec 19, 2015 at 5:35 AM, Cox, Jonathan A <ja...@sandia.gov> wrote:
> Hi Gabriel,
>
> The Hadoop version is 2.6.2.
>
> -Jonathan
>
> -----Original Message-----
> From: Gabriel Reid [mailto:gabriel.r...@gmail.com]
> Sent: Friday, December 18, 2015 11:58 AM
> To: user@phoenix.apache.org
> Subject: Re: [EXTERNAL] Re: Java Out of Memory Errors with CsvBulkLoadTool
>
> Hi Jonathan,
>
> Which Hadoop version are you using? I'm actually wondering if
> mapred.child.java.opts is still supported in Hadoop 2.x (I think it has
> been replaced by mapreduce.map.java.opts and mapreduce.reduce.java.opts).
>
> The HADOOP_CLIENT_OPTS won't make a difference if you're running in
> (pseudo) distributed mode, as separate JVMs will be started up for the
> tasks.
>
> - Gabriel
>
>
> On Fri, Dec 18, 2015 at 7:33 PM, Cox, Jonathan A <ja...@sandia.gov> wrote:
> > Gabriel,
> >
> > I am running the job on a single machine in pseudo distributed mode.
> > I've set the max Java heap size in two different ways (just to be sure):
> >
> > export HADOOP_CLIENT_OPTS="$HADOOP_CLIENT_OPTS -Xmx48g"
> >
> > and also in mapred-site.xml:
> > <property>
> >   <name>mapred.child.java.opts</name>
> >   <value>-Xmx48g</value>
> > </property>
> >
> > -----Original Message-----
> > From: Gabriel Reid [mailto:gabriel.r...@gmail.com]
> > Sent: Friday, December 18, 2015 8:17 AM
> > To: user@phoenix.apache.org
> > Subject: [EXTERNAL] Re: Java Out of Memory Errors with CsvBulkLoadTool
> >
> > Hi Jonathan,
> >
> > Sounds like something is very wrong here.
> >
> > Are you running the job on an actual cluster, or are you using the local
> > job tracker (i.e. running the import job on a single computer)?
> >
> > Normally an import job, regardless of the size of the input, should run
> > with map and reduce tasks that have a standard (e.g. 2GB) heap size per
> > task (although there will typically be multiple tasks started on the
> > cluster). There shouldn't be any need to have anything like a 48GB heap.
> >
> > If you are running this on an actual cluster, could you elaborate on
> > where/how you're setting the 48GB heap size setting?
> >
> > - Gabriel
> >
> >
> > On Fri, Dec 18, 2015 at 1:46 AM, Cox, Jonathan A <ja...@sandia.gov> wrote:
> >> I am trying to ingest a 575MB CSV file with 192,444 lines using the
> >> CsvBulkLoadTool MapReduce job. When running this job, I find that I
> >> have to boost the max Java heap space to 48GB (24GB fails with Java
> >> out of memory errors).
> >>
> >> I’m concerned about scaling issues. It seems like it shouldn’t
> >> require between 24-48GB of memory to ingest a 575MB file. However, I
> >> am pretty new to Hadoop/HBase/Phoenix, so maybe I am off base here.
> >>
> >> Can anybody comment on this observation?
> >>
> >> Thanks,
> >>
> >> Jonathan
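
To make the salting suggestion concrete, here is a minimal sketch of a salted table plus a secondary index. The table name, columns, and bucket count are made up for illustration; the SALT_BUCKETS table option and CREATE INDEX are standard Phoenix DDL.

    -- Salted table: Phoenix pre-splits it into SALT_BUCKETS regions up front,
    -- which is why the bulk load job above ran one reducer per bucket.
    CREATE TABLE event_log (
        event_time TIMESTAMP NOT NULL,
        event_id   VARCHAR   NOT NULL,
        payload    VARCHAR,
        CONSTRAINT pk PRIMARY KEY (event_time, event_id)
    ) SALT_BUCKETS = 16;

    -- The "additional index" mentioned above, as a global secondary index.
    CREATE INDEX event_id_idx ON event_log (event_id) INCLUDE (payload);

Whether 16 buckets is the right number depends on the cluster; the Phoenix docs suggest keeping the bucket count roughly in line with the number of region servers.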