A map/reduce job running on 3 TB of input data has been stuck at map 57% reduce 19% for an hour without making any progress.

The same error occurs millions of times in the huge syslog file. I also got a huge stderr file, which contains the following:

java.io.IOException: Spill failed
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:573)
        at edu.uchicago.naivetagger.fivetokens.FiveTokens_step2$Map.map(FiveTokens_step2.java:68)
        at edu.uchicago.naivetagger.fivetokens.FiveTokens_step2$Map.map(FiveTokens_step2.java:36)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
        at org.apache.hadoop.mapred.Child.main(Child.java:158)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_201106031106_0001/attempt_201106031106_0001_m_000015_0/output/spill1176.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
        at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:930)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:401)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:886)
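
If I read the trace correctly, the spill files are written to the TaskTracker's local directories (mapred.local.dir) rather than into HDFS, so I suppose I should also check free space on those partitions on each node, with something like this (the path below is just a placeholder for whatever mapred.local.dir points to on my machines):

        # placeholder path -- substitute whatever mapred.local.dir is set to on each node
        df -h /data/mapred/local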

The userlogs directory has grown to 125 GB. I wondered whether I had run out of disk space on any node, so I ran "bin/hadoop dfsadmin -report", which returns:

Configured Capacity: 17001199828992 (15.46 TB)
Present Capacity: 5039523250176 (4.58 TB)
DFS Remaining: 3148881514496 (2.86 TB)
DFS Used: 1890641735680 (1.72 TB)
DFS Used%: 37.52%

Name: 192.168.136.15:50010
Decommission Status : Normal
Configured Capacity: 1000070578176 (931.39 GB)
DFS Used: 106267807744 (98.97 GB)
Non DFS Used: 791830847488 (737.45 GB)
DFS Remaining: 101971922944(94.97 GB)
DFS Used%: 10.63%
DFS Remaining%: 10.2%
Last contact: Fri Jun 03 16:09:14 GMT-06:00 2011


Name: 192.168.136.33:50010
Decommission Status : Normal
Configured Capacity: 1000070578176 (931.39 GB)
DFS Used: 100771901440 (93.85 GB)
Non DFS Used: 899197136896 (837.44 GB)
DFS Remaining: 101539840(96.84 MB)
DFS Used%: 10.08%
DFS Remaining%: 0.01%
Last contact: Fri Jun 03 16:09:15 GMT-06:00 2011


Name: 192.168.136.35:50010
Decommission Status : Normal
Configured Capacity: 1000070578176 (931.39 GB)
DFS Used: 116891889664 (108.86 GB)
Non DFS Used: 426219978752 (396.95 GB)
DFS Remaining: 456958709760(425.58 GB)
DFS Used%: 11.69%
DFS Remaining%: 45.69%
Last contact: Fri Jun 03 16:09:14 GMT-06:00 2011


Name: 192.168.136.22:50010
Decommission Status : Normal
Configured Capacity: 1000070578176 (931.39 GB)
DFS Used: 117867053056 (109.77 GB)
Non DFS Used: 662408974336 (616.92 GB)
DFS Remaining: 219794550784(204.7 GB)
DFS Used%: 11.79%
DFS Remaining%: 21.98%
Last contact: Fri Jun 03 16:09:14 GMT-06:00 2011


Name: 192.168.136.34:50010
Decommission Status : Normal
Configured Capacity: 1000070578176 (931.39 GB)
DFS Used: 102359646208 (95.33 GB)
Non DFS Used: 568364900352 (529.33 GB)
DFS Remaining: 329346031616(306.73 GB)
DFS Used%: 10.24%
DFS Remaining%: 32.93%
Last contact: Fri Jun 03 16:09:14 GMT-06:00 2011


Name: 192.168.136.20:50010
Decommission Status : Normal
Configured Capacity: 1000070578176 (931.39 GB)
DFS Used: 120821141504 (112.52 GB)
Non DFS Used: 650393731072 (605.73 GB)
DFS Remaining: 228855705600(213.14 GB)
DFS Used%: 12.08%
DFS Remaining%: 22.88%
Last contact: Fri Jun 03 16:09:14 GMT-06:00 2011


Name: 192.168.136.19:50010
Decommission Status : Normal
Configured Capacity: 1000070578176 (931.39 GB)
DFS Used: 107614384128 (100.22 GB)
Non DFS Used: 754391699456 (702.58 GB)
DFS Remaining: 138064494592(128.58 GB)
DFS Used%: 10.76%
DFS Remaining%: 13.81%
Last contact: Fri Jun 03 16:09:14 GMT-06:00 2011


Name: 192.168.136.31:50010
Decommission Status : Normal
Configured Capacity: 1000070578176 (931.39 GB)
DFS Used: 76825395200 (71.55 GB)
Non DFS Used: 923142823936 (859.74 GB)
DFS Remaining: 102359040(97.62 MB)
DFS Used%: 7.68%
DFS Remaining%: 0.01%
Last contact: Fri Jun 03 16:09:14 GMT-06:00 2011


Name: 192.168.136.17:50010
Decommission Status : Normal
Configured Capacity: 1000070578176 (931.39 GB)
DFS Used: 106580185088 (99.26 GB)
Non DFS Used: 806645080064 (751.25 GB)
DFS Remaining: 86845313024(80.88 GB)
DFS Used%: 10.66%
DFS Remaining%: 8.68%
Last contact: Fri Jun 03 16:09:14 GMT-06:00 2011


Name: 192.168.136.36:50010
Decommission Status : Normal
Configured Capacity: 1000070578176 (931.39 GB)
DFS Used: 119845072896 (111.61 GB)
Non DFS Used: 784374362112 (730.51 GB)
DFS Remaining: 95851143168(89.27 GB)
DFS Used%: 11.98%
DFS Remaining%: 9.58%
Last contact: Fri Jun 03 16:09:14 GMT-06:00 2011


Name: 192.168.136.14:50010
Decommission Status : Normal
Configured Capacity: 1000070578176 (931.39 GB)
DFS Used: 123510685696 (115.03 GB)
Non DFS Used: 807602118656 (752.14 GB)
DFS Remaining: 68957773824(64.22 GB)
DFS Used%: 12.35%
DFS Remaining%: 6.9%
Last contact: Fri Jun 03 16:09:14 GMT-06:00 2011


Name: 192.168.136.23:50010
Decommission Status : Normal
Configured Capacity: 1000070578176 (931.39 GB)
DFS Used: 110193819648 (102.63 GB)
Non DFS Used: 644950650880 (600.66 GB)
DFS Remaining: 244926107648(228.11 GB)
DFS Used%: 11.02%
DFS Remaining%: 24.49%
Last contact: Fri Jun 03 16:09:14 GMT-06:00 2011


Name: 192.168.136.12:50010
Decommission Status : Normal
Configured Capacity: 1000070578176 (931.39 GB)
DFS Used: 105652305920 (98.4 GB)
Non DFS Used: 464804843520 (432.88 GB)
DFS Remaining: 429613428736(400.11 GB)
DFS Used%: 10.56%
DFS Remaining%: 42.96%
Last contact: Fri Jun 03 16:09:14 GMT-06:00 2011


Name: 192.168.136.11:50010
Decommission Status : Normal
Configured Capacity: 1000070578176 (931.39 GB)
DFS Used: 148825817088 (138.6 GB)
Non DFS Used: 485232361472 (451.91 GB)
DFS Remaining: 366012399616(340.88 GB)
DFS Used%: 14.88%
DFS Remaining%: 36.6%
Last contact: Fri Jun 03 16:09:15 GMT-06:00 2011


Name: 192.168.136.24:50010
Decommission Status : Normal
Configured Capacity: 1000070578176 (931.39 GB)
DFS Used: 109966073856 (102.41 GB)
Non DFS Used: 641134641152 (597.1 GB)
DFS Remaining: 248969863168(231.87 GB)
DFS Used%: 11%
DFS Remaining%: 24.9%
Last contact: Fri Jun 03 16:09:16 GMT-06:00 2011


Name: 192.168.136.30:50010
Decommission Status : Normal
Configured Capacity: 1000070578176 (931.39 GB)
DFS Used: 109333385216 (101.82 GB)
Non DFS Used: 841253269504 (783.48 GB)
DFS Remaining: 49483923456(46.09 GB)
DFS Used%: 10.93%
DFS Remaining%: 4.95%
Last contact: Fri Jun 03 16:09:15 GMT-06:00 2011


Name: 192.168.136.29:50010
Decommission Status : Normal
Configured Capacity: 1000070578176 (931.39 GB)
DFS Used: 107315171328 (99.95 GB)
Non DFS Used: 809729159168 (754.12 GB)
DFS Remaining: 83026247680(77.32 GB)
DFS Used%: 10.73%
DFS Remaining%: 8.3%
Last contact: Fri Jun 03 16:09:16 GMT-06:00 2011

Two of the nodes have only 0.01% DFS remaining. Is that the exact reason the job is halting? If so, is there a balancing option to avoid this: when one node runs out of storage, is it possible to automatically switch the output to other nodes that have more free space?
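
I also noticed "bin/hadoop balancer" in the documentation; would running something like the following help, or does it not apply here since the spill files live on local disk? (The threshold value below is just a guess on my part.)

        # spread HDFS blocks so no datanode deviates more than 10% from the cluster average
        bin/hadoop balancer -threshold 10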

Thanks!

Shi
