Does your DN log show any errors when you run into this? This happens
with just two jobs reading how many files? And how many DNs are these
spread across?

I'm thinking it's probably something to do with the ulimits for the
running DN processes, but I can't say for sure without taking a look
at the logs.
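
In case it helps, here's a quick way to check what limits a running DN
actually got (just a sketch, assuming a Linux box with 'jps' on the
PATH; use whatever you normally use to find the DataNode PID):

  # Effective limits of the live DataNode process (not just the shell's ulimit)
  DN_PID=$(jps | awk '/DataNode/ {print $1}')
  grep -i 'open files' /proc/$DN_PID/limits

  # For comparison, the limit a freshly started shell inherits
  ulimit -n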

Some other things I can think of, a little blindly:
- What's your dfs.datanode.max.xcievers setting?
- Can you ensure 'hadoop classpath' produces the same output on all
nodes, with no accidental jar mix-ups? (Rough sketch for checking both below.)
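
For both of the above, something along these lines run from a box that
can ssh to the slaves should spot differences quickly (a rough sketch;
the hostnames are placeholders and I'm assuming $HADOOP_CONF_DIR points
at your conf directory on each node):

  # Configured xcievers value per node (the 0.20 default is a lowish 256, IIRC)
  for h in dn01 dn02 dn03; do
    echo "== $h =="
    ssh $h "grep -A1 dfs.datanode.max.xcievers \$HADOOP_CONF_DIR/hdfs-site.xml"
  done

  # Fingerprint of 'hadoop classpath' per node; differing hashes mean mixed-up jars
  for h in dn01 dn02 dn03; do
    echo -n "$h: "
    ssh $h "hadoop classpath" | md5sum
  done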

On Thu, Dec 29, 2011 at 11:48 PM, Markus Jelsma
<markus.jel...@openindex.io> wrote:
> We just reproduced it (somehow) by running jobs that concurrently read the same
> data. Two out of three similar jobs died early in the map phase with "Could not
> obtain block" errors; one finished completely.
>
> java.io.IOException: Could not obtain block: blk_119146860335302651_13067 file=/user/systems/crawl/crawldb/current/part-00000/data
>        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:2093)
>        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1897)
>        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2048)
>        at java.io.DataInputStream.readFully(DataInputStream.java:178)
>        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2069)
>        at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.nextKeyValue(SequenceFileRecordReader.java:68)
>        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532)
>        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at javax.security.auth.Subject.doAs(Subject.java:396)
>        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>        at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
> Another job (different but reading the same data) finished the map phase, but
> then half of its reducers died and the whole job froze.
>
> 2011-12-29 18:07:58,899 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
> 2011-12-29 18:07:58,899 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_4748641522370871094_13532
> 2011-12-29 18:07:58,900 INFO org.apache.hadoop.hdfs.DFSClient: Excluding datanode 141.105.120.154:50010
> 2011-12-29 18:07:58,902 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
> 2011-12-29 18:07:58,902 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-1454920600140944030_13532
> 2011-12-29 18:07:58,903 INFO org.apache.hadoop.hdfs.DFSClient: Excluding datanode 141.105.120.152:50010
> 2011-12-29 18:07:58,907 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 141.105.120.153:50010
> 2011-12-29 18:07:58,907 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_3551418605384221738_13532
> 2011-12-29 18:07:58,908 INFO org.apache.hadoop.hdfs.DFSClient: Excluding datanode 141.105.120.153:50010
> 2011-12-29 18:07:58,910 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
> 2011-12-29 18:07:58,910 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-1826030182013954555_13532
> 2011-12-29 18:07:58,911 INFO org.apache.hadoop.hdfs.DFSClient: Excluding datanode 141.105.120.150:50010
> 2011-12-29 18:07:58,911 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3213)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2300(DFSClient.java:2406)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2646)
>
> 2011-12-29 18:07:58,912 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-1826030182013954555_13532 bad datanode[0] nodes == null
> 2011-12-29 18:07:58,912 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/user/systems/generate-temp-1325180944829/_temporary/_attempt_201112290956_0012_r_000004_0/fetchlist-13/part-00004" - Aborting...
> 2011-12-29 18:07:59,049 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
> 2011-12-29 18:07:59,062 WARN org.apache.hadoop.mapred.Child: Error running child
> java.io.EOFException
>        at java.io.DataInputStream.readShort(DataInputStream.java:298)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:3272)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3196)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2300(DFSClient.java:2406)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2646)
> 2011-12-29 18:07:59,064 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
>
> It smells like the datanodes in 20.205.0 don't deal well with concurrent jobs,
> especially when they handle the same data.
>
> Is there any advice for this? Again, this does not happen on 20.203.0.
> Many thanks
>
>
>> I should add that the failing tasks that ran concurrently all read the same
>> map files from HDFS.
>>
>> > Hi,
>> >
>> > We just ran large-scale Apache Nutch jobs in our evaluation of
>> > 20.205.0 and they all failed. Some of these jobs ran concurrently with
>> > the fair scheduler enabled. These were simple jobs consuming little RAM.
>> > I double-checked and there were certainly no RAM issues.
>> >
>> > All jobs failed and most tasks had a less-than-descriptive message. A few
>> > reported that they dealt with I/O errors reading task output. However, the
>> > data they read is fine. When we ran the same jobs manually (some of them
>> > concurrently), some finished fine and others died with I/O errors reading
>> > task output again!
>> >
>> > The heap allocation for the reducers is not high, but no OOMs were
>> > reported. Besides the occasional I/O error, which I think is strange
>> > enough, most tasks did not write anything to the logs that I can link to
>> > this problem.
>> >
>> > We do not see this happening on our 20.203.0 cluster, although resources
>> > and settings are different. The 205 cluster is new, high-end hardware with
>> > similarly conservative settings, just with more mappers/reducers per node.
>> > Resource settings are almost identical. The 203 cluster has three times as
>> > many machines, so also more open file descriptors and threads overall.
>> >
>> > Any thoughts to share?
>> > Thanks,



-- 
Harsh J
