If you google for such memory failures, you'll find the map/reduce tunable
that'll help you: mapred.job.shuffle.input.buffer.percent. It is well known
that the default values in the Hadoop config don't work well for large data
systems.
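
A minimal sketch of what that might look like in mapred-site.xml (the 0.30
below is only an illustrative value, not a number from this thread; the stock
default is 0.70, and the same property can also be passed per job with -D):

  <!-- Fraction of the reduce task's heap used to buffer map outputs in
       memory during the shuffle; lowering it trades speed for heap headroom. -->
  <property>
    <name>mapred.job.shuffle.input.buffer.percent</name>
    <value>0.30</value>
  </property>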

-Rahul


On Wed, Feb 16, 2011 at 10:36 AM, James Seigel <ja...@tynt.com> wrote:

> Good luck.
>
> Let me know how it goes.
>
> James
>
> Sent from my mobile. Please excuse the typos.
>
> On 2011-02-16, at 11:11 AM, Kelly Burkhart <kelly.burkh...@gmail.com>
> wrote:
>
> > OK, the job was preferring the config file on my local machine, which
> > is not part of the cluster, over the cluster config files.  That seems
> > completely broken to me; my config was basically empty other than
> > containing the location of the cluster, and my job apparently used
> > defaults rather than the cluster config.  It doesn't make sense to me
> > to have to keep configuration files synchronized on every machine that
> > may access the cluster.
> >
> > I'm running again; we'll see if it completes this time.
> >
> > -K
> >
> > On Wed, Feb 16, 2011 at 10:30 AM, James Seigel <ja...@tynt.com> wrote:
> >> Hrmmm. Well, as you've pointed out, 200m is quite small and is probably
> >> the cause.
> >>
> >> Now there might be some overriding settings in whatever you are using
> >> to launch the job.
> >>
> >> You could mark those values as final in the main conf so they can't be
> >> overridden, then see in the logs what tries to override them.
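> >>
> >> A rough sketch of what I mean, assuming the standard <final> mechanism in
> >> the cluster-side mapred-site.xml (the -Xmx1280m mirrors the admin's value
> >> mentioned elsewhere in this thread; note the JVM flag needs the trailing m):
> >>
> >>   <!-- final=true stops job/client configs from silently overriding this -->
> >>   <property>
> >>     <name>mapred.child.java.opts</name>
> >>     <value>-Xmx1280m</value>
> >>     <final>true</final>
> >>   </property>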
> >>
> >> Cheers
> >> James
> >>
> >> Sent from my mobile. Please excuse the typos.
> >>
> >> On 2011-02-16, at 9:21 AM, Kelly Burkhart <kelly.burkh...@gmail.com> wrote:
> >>
> >>> I should have mentioned this in my last email: I thought of that, so I
> >>> logged into every machine in the cluster; each machine's
> >>> mapred-site.xml has the same md5sum.
> >>>
> >>> On Wed, Feb 16, 2011 at 10:15 AM, James Seigel <ja...@tynt.com> wrote:
> >>>> He might not have that conf distributed out to each machine
> >>>>
> >>>>
> >>>> Sent from my mobile. Please excuse the typos.
> >>>>
> >>>> On 2011-02-16, at 9:10 AM, Kelly Burkhart <kelly.burkh...@gmail.com> wrote:
> >>>>
> >>>>> Our cluster admin (who's out of town today) has mapred.child.java.opts
> >>>>> set to -Xmx1280 in mapred-site.xml.  However, if I go to the job
> >>>>> configuration page for a job I'm running right now, it claims this
> >>>>> option is set to -Xmx200m.  There are other settings in
> >>>>> mapred-site.xml that are different too.  Why would map/reduce jobs
> >>>>> not respect the mapred-site.xml file?
> >>>>>
> >>>>> -K
> >>>>>
> >>>>> On Wed, Feb 16, 2011 at 9:43 AM, Jim Falgout <jim.falg...@pervasive.com> wrote:
> >>>>>> You can set the amount of memory used by the reducer using the
> >>>>>> mapreduce.reduce.java.opts property. Set it in mapred-site.xml or
> >>>>>> override it in your job. You can set it to something like -Xmx512M to
> >>>>>> increase the amount of memory used by the JVM spawned for the reducer
> >>>>>> task.
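> >>>>>>
> >>>>>> For example, a hedged sketch of that override in a job's configuration
> >>>>>> (the 512M is only an illustrative value; on Hadoop 0.20.x the equivalent
> >>>>>> knob is mapred.child.java.opts, which covers both the map and reduce
> >>>>>> child JVMs):
> >>>>>>
> >>>>>>   <!-- heap size for each spawned child task JVM -->
> >>>>>>   <property>
> >>>>>>     <name>mapred.child.java.opts</name>
> >>>>>>     <value>-Xmx512M</value>
> >>>>>>   </property>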
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Kelly Burkhart [mailto:kelly.burkh...@gmail.com]
> >>>>>> Sent: Wednesday, February 16, 2011 9:12 AM
> >>>>>> To: common-user@hadoop.apache.org
> >>>>>> Subject: Re: Reduce java.lang.OutOfMemoryError
> >>>>>>
> >>>>>> I have had it fail with a single reducer and with 100 reducers.
> >>>>>> Ultimately it needs to be funneled to a single reducer though.
> >>>>>>
> >>>>>> -K
> >>>>>>
> >>>>>> On Wed, Feb 16, 2011 at 9:02 AM, real great..
> >>>>>> <greatness.hardn...@gmail.com> wrote:
> >>>>>>> Hi,
> >>>>>>> How many reducers are you using currently?
> >>>>>>> Try increasing the number of reducers.
> >>>>>>> Let me know if it helps.
> >>>>>>>
> >>>>>>> On Wed, Feb 16, 2011 at 8:30 PM, Kelly Burkhart <kelly.burkh...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Hello, I'm seeing frequent failures in reduce jobs with errors
> >>>>>>>> similar to this:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 2011-02-15 15:21:10,163 INFO org.apache.hadoop.mapred.ReduceTask:
> >>>>>>>> header: attempt_201102081823_0175_m_002153_0, compressed len: 172492,
> >>>>>>>> decompressed len: 172488
> >>>>>>>> 2011-02-15 15:21:10,163 FATAL org.apache.hadoop.mapred.TaskRunner:
> >>>>>>>> attempt_201102081823_0175_r_000034_0 : Map output copy failure :
> >>>>>>>> java.lang.OutOfMemoryError: Java heap space
> >>>>>>>>        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
> >>>>>>>>        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
> >>>>>>>>        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
> >>>>>>>>        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
> >>>>>>>>
> >>>>>>>> 2011-02-15 15:21:10,163 INFO org.apache.hadoop.mapred.ReduceTask:
> >>>>>>>> Shuffling 172488 bytes (172492 raw bytes) into RAM from
> >>>>>>>> attempt_201102081823_0175_m_002153_0
> >>>>>>>> 2011-02-15 15:21:10,424 INFO org.apache.hadoop.mapred.ReduceTask:
> >>>>>>>> header: attempt_201102081823_0175_m_002118_0, compressed len: 161944,
> >>>>>>>> decompressed len: 161940
> >>>>>>>> 2011-02-15 15:21:10,424 INFO org.apache.hadoop.mapred.ReduceTask:
> >>>>>>>> header: attempt_201102081823_0175_m_001704_0, compressed len: 228365,
> >>>>>>>> decompressed len: 228361
> >>>>>>>> 2011-02-15 15:21:10,424 INFO org.apache.hadoop.mapred.ReduceTask:
> >>>>>>>> Task attempt_201102081823_0175_r_000034_0: Failed fetch #1 from
> >>>>>>>> attempt_201102081823_0175_m_002153_0
> >>>>>>>> 2011-02-15 15:21:10,424 FATAL org.apache.hadoop.mapred.TaskRunner:
> >>>>>>>> attempt_201102081823_0175_r_000034_0 : Map output copy failure :
> >>>>>>>> java.lang.OutOfMemoryError: Java heap space
> >>>>>>>>        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
> >>>>>>>>        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
> >>>>>>>>        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
> >>>>>>>>        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
> >>>>>>>>
> >>>>>>>> Some also show this:
> >>>>>>>>
> >>>>>>>> Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
> >>>>>>>>        at sun.net.www.http.ChunkedInputStream.<init>(ChunkedInputStream.java:63)
> >>>>>>>>        at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:811)
> >>>>>>>>        at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
> >>>>>>>>        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1072)
> >>>>>>>>        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1447)
> >>>>>>>>        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1349)
> >>>>>>>>        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
> >>>>>>>>        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
> >>>>>>>>
> >>>>>>>> The particular job I'm running is an attempt to merge multiple time
> >>>>>>>> series files into a single file.  The job tracker shows the following:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Kind     Num Tasks    Complete    Killed    Failed/Killed Task Attempts
> >>>>>>>> map      15795        15795       0         0 / 29
> >>>>>>>> reduce   100          30          70        17 / 29
> >>>>>>>>
> >>>>>>>> All of the files I'm reading have records with a timestamp key similar to:
> >>>>>>>>
> >>>>>>>> 2011-01-03 08:30:00.457000<tab><record>
> >>>>>>>>
> >>>>>>>> My map job is a simple python program that ignores rows with
> >>>>>>>> times < 08:30:00 and > 15:00:00, determines the type of input row
> >>>>>>>> and writes it to stdout with very minor modification.  It maintains
> >>>>>>>> no state and should not use any significant memory.  My reducer is
> >>>>>>>> the IdentityReducer.  The input files are individually gzipped then
> >>>>>>>> put into hdfs.  The total uncompressed size of the output should be
> >>>>>>>> around 150G.  Our cluster is 32 nodes, each of which has 16G RAM and
> >>>>>>>> most of which have two 2T drives.  We're running hadoop 0.20.2.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Can anyone provide some insight on how we can eliminate this issue?
> >>>>>>>> I'm certain this email does not provide enough info; please let me
> >>>>>>>> know what further information is needed to troubleshoot.
> >>>>>>>>
> >>>>>>>> Thanks in advance,
> >>>>>>>>
> >>>>>>>> -Kelly
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Regards,
> >>>>>>> R.V.
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>
>
