Hi Devaraj,

We don't have any special configuration on the job conf...

We only allow 3 map tasks and 3 reduce tasks on *one* node at any time, so
we are puzzled about why there are 572 job confs on *one* node.  From the
heap dump, we see there are 569 MapTask and 3 ReduceTask objects (which
correspond to 1138 MapTaskStatus and 6 ReduceTaskStatus objects).
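
(For reference, a minimal sketch of how that per-node limit can be
confirmed from code.  This is against the 0.16-era API, and the property
names are our assumption about what the TaskTracker reads, so please
correct us if they are wrong.)

    import org.apache.hadoop.mapred.JobConf;

    public class SlotCheck {
        public static void main(String[] args) {
            // JobConf picks up hadoop-site.xml from the classpath.
            JobConf conf = new JobConf();
            // These two limits are what we mean by "3 map tasks and
            // 3 reduce tasks on one node at any time".
            System.out.println("map slots per node:    "
                + conf.getInt("mapred.tasktracker.map.tasks.maximum", 2));
            System.out.println("reduce slots per node: "
                + conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2));
        }
    }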

We *think* many map tasks were stuck in the COMMIT_PENDING stage, because in
the heap dump we saw a lot of MapTaskStatus objects in either the
"UNASSIGNED" or "COMMIT_PENDING" state (the runState variable in
MapTaskStatus).  We then took a look at another node in the UI just now: for
a given task tracker, under "Non-running tasks", there are at least 200 or
300 COMMIT_PENDING tasks.  It appears they are stuck too.
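
(In case it helps anyone reproduce that count: a rough sketch of how the
numbers on the task tracker status page could be tallied.  The host name
is a placeholder, 50060 is the default task tracker HTTP port in our
setup, and tasktracker.jsp is simply the page our UI serves; adjust as
needed.)

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class CommitPendingCount {
        public static void main(String[] args) throws Exception {
            // Placeholder host; point this at a real task tracker node.
            URL page = new URL("http://tracker-node-01:50060/tasktracker.jsp");
            BufferedReader in =
                new BufferedReader(new InputStreamReader(page.openStream()));
            int pending = 0;
            String line;
            while ((line = in.readLine()) != null) {
                // Count lines whose status shows COMMIT_PENDING.
                if (line.indexOf("COMMIT_PENDING") >= 0) {
                    pending++;
                }
            }
            in.close();
            System.out.println("COMMIT_PENDING rows: " + pending);
        }
    }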

Thanks a lot for your help!

Lili


On Wed, Apr 30, 2008 at 2:14 PM, Devaraj Das <[EMAIL PROTECTED]> wrote:

> Hi Lili, the jobconf memory consumption seems quite high. Could you please
> let us know if you pass anything in the jobconf of jobs that you run? I
> think you are seeing the 572 objects since a job is running and the
> TaskInProgress objects for tasks of the running job are kept in memory
> (but
> I need to double check this).
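>
> To make the question concrete, here is the kind of thing I mean by
> "pass anything in the jobconf": purely an illustrative driver with
> made-up key names, not your actual code.  If, as your heap dump
> suggests, each in-memory task object on the TaskTracker carries its
> own copy of the job's conf, then a large value set like this gets
> duplicated once per task:
>
>   import org.apache.hadoop.mapred.JobClient;
>   import org.apache.hadoop.mapred.JobConf;
>
>   public class MyDriver {                      // hypothetical driver
>       public static void main(String[] args) throws Exception {
>           JobConf conf = new JobConf(MyDriver.class);
>           // A big custom string like this ends up in every task's conf.
>           conf.set("my.app.lookup.table", loadBigLookupTable());
>           conf.setNumMapTasks(20000);
>           conf.setNumReduceTasks(10);
>           JobClient.runJob(conf);
>       }
>       // Placeholder for whatever large data might be stuffed in the conf.
>       private static String loadBigLookupTable() { return ""; }
>   }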
> Regarding COMMIT_PENDING, yes, it means that the tasktracker has finished
> executing the task but the jobtracker hasn't committed the output yet. In
> 0.16 all tasks necessarily take the transition
> RUNNING->COMMIT_PENDING->SUCCEEDED. This has been improved in 0.17
> (HADOOP-3140) so that only tasks that generate output go through
> COMMIT_PENDING; a task is marked SUCCEEDED directly if it doesn't generate
> any output in its output path.
>
> Devaraj
>
> > -----Original Message-----
> > From: Lili Wu [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, May 01, 2008 2:09 AM
> > To: core-user@hadoop.apache.org
> > Cc: [EMAIL PROTECTED]
> > Subject: OOM error with large # of map tasks
> >
> > We are using Hadoop 0.16 and are seeing a consistent problem:
> > out-of-memory errors when we have a large # of map tasks.
> > The specifics of what is submitted when we reproduce this:
> >
> > three large jobs:
> > 1. 20,000 map tasks and 10 reduce tasks
> > 2. 17,000 map tasks and 10 reduce tasks
> > 3. 10,000 map tasks and 10 reduce tasks
> >
> > These run at normal priority, and periodically we swap the
> > priorities around to get some tasks started by each job and
> > let them complete.
> > Other smaller jobs come and go every hour or so (no more
> > than 200 map tasks, 4-10 reducers).
> >
> > Our cluster consists of 23 nodes, with capacity for 69 concurrent
> > map tasks and 69 concurrent reduce tasks (3 of each per node).
> > Eventually, we see consistent OOM errors in the task logs, and
> > the task tracker itself goes down on as many as 14 of our nodes.
> >
> > We examined a heap dump after one of these TaskTracker crashes and
> > found something interesting--there were 572 instances of JobConf,
> > which accounted for 940 MB of String objects.  It seems quite odd
> > that there are so many instances of JobConf.  It appears to
> > correlate with tasks in the COMMIT_PENDING state as shown on the
> > status page for a task tracker node.  Has anyone observed something
> > like this?  Can anyone explain what would cause tasks to remain in
> > this state (which also apparently is kept in memory rather than
> > serialized to disk...)?  In general, what does COMMIT_PENDING mean?
> > (Job done, but output not committed to DFS?)
> >
> > Thanks!
> >
>
>
