Hi Lili, the jobconf memory consumption seems quite high. Could you please
let us know if you pass anything in the jobconf of the jobs that you run? I
think you are seeing the 572 objects because a job is running and the
TaskInProgress objects for the running job's tasks are kept in memory (but
I need to double-check this).
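As a point of reference, one common way jobconf memory balloons is when a
large payload gets serialized straight into the configuration. The snippet
below is only a hypothetical sketch (the class name, the
"my.app.lookup.table" key, and the sizes are made up, not taken from your
jobs): since the tracker keeps a JobConf for each in-memory task, a
multi-megabyte config value multiplied across a few hundred resident
JobConf instances quickly adds up to heap numbers like the ones you saw.

// Hypothetical example only; class name and config key are made up.
import org.apache.hadoop.mapred.JobConf;

public class BigConfExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf(BigConfExample.class);
    conf.setJobName("big-conf-demo");

    // Build a multi-megabyte string and store it as a single config value.
    // If a few hundred JobConf copies are resident on a tracker at once,
    // this one key alone accounts for hundreds of MB of Strings.
    StringBuilder table = new StringBuilder();
    for (int i = 0; i < 100000; i++) {
      table.append(i).append(",value-").append(i).append(';');
    }
    conf.set("my.app.lookup.table", table.toString());
  }
}

If anything like this is happening in your jobs, shipping the payload via
the DistributedCache or reading it from HDFS instead of the jobconf usually
helps.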
Regarding COMMIT_PENDING, yes, it means that the tasktracker has finished
executing the task but the jobtracker hasn't committed its output yet. In
0.16 all tasks necessarily take the transition
RUNNING -> COMMIT_PENDING -> SUCCEEDED. This behavior has been improved in
0.17 (HADOOP-3140) so that only tasks that generate output take that route,
i.e., a task is marked SUCCEEDED directly if it doesn't generate any output
in its output path.
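To make that transition concrete, here is a rough sketch of the decision.
It is not the actual Hadoop source, just an illustration of the before and
after behavior around HADOOP-3140:

// Illustrative sketch only, not the real JobTracker/TaskTracker code.
public class CommitPendingSketch {
  enum TaskState { RUNNING, COMMIT_PENDING, SUCCEEDED }

  static TaskState nextStateOnFinish(boolean producedOutput, boolean post3140) {
    if (!post3140) {
      // 0.16: every finishing task waits in COMMIT_PENDING until the
      // jobtracker commits its output, then it is promoted to SUCCEEDED.
      return TaskState.COMMIT_PENDING;
    }
    // 0.17+: only tasks that actually wrote to their output path need the
    // commit handshake; the rest are marked SUCCEEDED directly.
    return producedOutput ? TaskState.COMMIT_PENDING : TaskState.SUCCEEDED;
  }

  public static void main(String[] args) {
    System.out.println(nextStateOnFinish(false, false)); // COMMIT_PENDING in 0.16
    System.out.println(nextStateOnFinish(false, true));  // SUCCEEDED in 0.17
  }
}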

Devaraj

> -----Original Message-----
> From: Lili Wu [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, May 01, 2008 2:09 AM
> To: core-user@hadoop.apache.org
> Cc: [EMAIL PROTECTED]
> Subject: OOM error with large # of map tasks
> 
> We are using Hadoop 0.16 and are seeing a consistent problem:
> out-of-memory errors when we have a large # of map tasks.
> The specifics of what is submitted when we reproduce this:
> 
> three large jobs:
> 1. 20,000 map tasks and 10 reduce tasks
> 2. 17,000 map tasks and 10 reduce tasks
> 3. 10,000 map tasks and 10 reduce tasks
> 
> These are at normal priority, and periodically we swap the
> priorities around to get some tasks started by each job and let
> them complete. Other smaller jobs come and go every hour or so
> (no more than 200 map tasks, 4-10 reducers).
> 
> Our cluster consists of 23 nodes, with 69 map task slots and
> 69 reduce task slots.
> Eventually, we see consistent OOM errors in the task logs, and
> the TaskTracker itself goes down on as many as 14 of our nodes.
> 
> We examined a heap dump after one of these TaskTracker crashes
> and found something interesting: there were 572 instances of
> JobConf that accounted for 940 MB of String objects. It seems
> quite odd that there are so many instances of JobConf. The count
> seems to correlate with the number of tasks in the COMMIT_PENDING
> state as shown on the status page for a tasktracker node. Has
> anyone observed something like this? Can anyone explain what
> would cause tasks to remain in this state (which also apparently
> is kept in memory rather than serialized to disk...)? In general,
> what does COMMIT_PENDING mean? (Job done, but output not
> committed to DFS?)
> 
> Thanks!
> 
