hey all

I have about 40 jobs in a batch I'm running, but one particular MapReduce job consistently hangs at the tail of the copy phase or at the beginning of the sort (it 'looks' like it's still copying, but it isn't).

This job is a little bigger than the previous successful ones. The mapper dumped about 21,450,689,962 bytes, and combine output records jibes with map output records (I'm not using a special combiner).

The namenode log shows this:

2008-03-13 13:48:29,789 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor timed out block blk_7813471133156061911
2008-03-13 13:48:33,310 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 10.251.70.210:50010 to replicate blk_7813471133156061911 to datanode(s) 10.251.65.207:50010 10.251.126.6:50010
2008-03-13 13:58:29,809 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor timed out block blk_7813471133156061911
2008-03-13 13:58:35,589 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 10.251.127.198:50010 to replicate blk_7813471133156061911 to datanode(s) 10.251.127.228:50010 10.251.69.162:50010
2008-03-13 14:08:29,729 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor timed out block blk_7813471133156061911
2008-03-13 14:08:34,869 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 10.251.73.223:50010 to replicate blk_7813471133156061911 to datanode(s) 10.251.26.223:50010 10.251.65.207:50010
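
For anyone who wants to tally entries like these across a larger log, here is a rough sketch. It assumes one log entry per line in exactly the two message formats quoted above; the names summarize, TIMEOUT_RE, ASK_RE, and SAMPLE are made up for illustration and are not part of Hadoop.

```python
import re
from collections import Counter

# Hypothetical patterns, written only against the two message shapes quoted above.
TIMEOUT_RE = re.compile(r"PendingReplicationMonitor timed out block (blk_-?\d+)")
ASK_RE = re.compile(
    r"ask (\S+) to replicate (blk_-?\d+) to datanode\(s\) ((?:\S+:\d+\s*)+)"
)

# The six entries from the post, one per line.
SAMPLE = [
    "2008-03-13 13:48:29,789 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor timed out block blk_7813471133156061911",
    "2008-03-13 13:48:33,310 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 10.251.70.210:50010 to replicate blk_7813471133156061911 to datanode(s) 10.251.65.207:50010 10.251.126.6:50010",
    "2008-03-13 13:58:29,809 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor timed out block blk_7813471133156061911",
    "2008-03-13 13:58:35,589 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 10.251.127.198:50010 to replicate blk_7813471133156061911 to datanode(s) 10.251.127.228:50010 10.251.69.162:50010",
    "2008-03-13 14:08:29,729 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor timed out block blk_7813471133156061911",
    "2008-03-13 14:08:34,869 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 10.251.73.223:50010 to replicate blk_7813471133156061911 to datanode(s) 10.251.26.223:50010 10.251.65.207:50010",
]


def summarize(lines):
    """Count replication timeouts per block and list the requested transfers."""
    timeouts = Counter()
    transfers = []  # tuples of (block, source datanode, [target datanodes])
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            timeouts[m.group(1)] += 1
            continue
        m = ASK_RE.search(line)
        if m:
            transfers.append((m.group(2), m.group(1), m.group(3).split()))
    return timeouts, transfers


if __name__ == "__main__":
    timeouts, transfers = summarize(SAMPLE)
    for block, count in timeouts.most_common():
        print(block, "timed out", count, "times")
    for block, source, targets in transfers:
        print(block, "from", source, "to", ", ".join(targets))
```

To run it against a real log, pass an open file object to summarize in place of SAMPLE.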

I should add that there have been periodic connection resets among the nodes (20 slaves), but my hang happens consistently on this job at this point. I also run a fresh cluster every time I execute this batch, so there isn't any cruft in the DFS.

Also, this job has completed fine in the past, though I don't remember seeing so much network static back then. Historically I have had block compression enabled; for the last two hangs, compression was disabled. I'm unsure whether it ever hung with compression enabled (I will try a fresh cluster with compression enabled to confirm).

Any ideas on how to unjam or debug this?

ckw

Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/


