Hello,

We are running a job that makes use of AvroMultipleOutputs (Avro 1.7.5). All maps complete without issue; the multiple-output writing is done in the reducers, and it is the reducers that fail when there are lots of output files.
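For context, the reducers use AvroMultipleOutputs roughly along these lines. This is a heavily simplified sketch, not our real code: the class name, schema, record building, and base-path naming below are placeholders.

    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.mapred.AvroKey;
    import org.apache.avro.mapreduce.AvroMultipleOutputs;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MultiOutputReducer
        extends Reducer<Text, Text, AvroKey<GenericRecord>, NullWritable> {

      // Placeholder schema; ours is more complex.
      private static final Schema SCHEMA = new Schema.Parser().parse(
          "{\"type\":\"record\",\"name\":\"Rec\","
          + "\"fields\":[{\"name\":\"line\",\"type\":\"string\"}]}");

      private AvroMultipleOutputs amos;

      @Override
      protected void setup(Context context) {
        amos = new AvroMultipleOutputs(context);
      }

      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        for (Text value : values) {
          GenericRecord record = new GenericData.Record(SCHEMA);
          record.put("line", value.toString());
          // Deriving the base output path from the key is what fans the
          // output out into many directories and files.
          amos.write(new AvroKey<GenericRecord>(record), NullWritable.get(),
              key.toString() + "/part");
        }
      }

      @Override
      protected void cleanup(Context context)
          throws IOException, InterruptedException {
        // Close all the underlying writers.
        amos.close();
      }
    }

(The driver registers the output schema via AvroJob.setOutputKeySchema; each distinct base path ends up as its own set of Avro files, which is where the large number of output files comes from.)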
When there are lots of output directories, the job was failing with the following error, which I believed caused the failure:

    hc1hdfs2p.thecarlylegroup.local:50010:DataXceiverServer: java.io.IOException: Xceiver count 4097 exceeds the limit of concurrent xcievers: 4096
        at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:137)
        at java.lang.Thread.run(Thread.java:744)

I went ahead and increased dfs.datanode.max.xcievers to 8192 and reran the job. In Cloudera Manager I saw the transceiver count across nodes max out at 5376, so raising the limit to 8192 did resolve that first error. Unfortunately, the job still failed. Checking the HDFS logs, the xciever-limit error was gone, but I now saw lots of the following:

    hc1hdfs3p.thecarlylegroup.local:50010:DataXceiver error processing WRITE_BLOCK operation src: /10.14.5.83:53280 dest: /10.14.5.81:50010
    java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:711)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:229)
        at java.lang.Thread.run(Thread.java:744)

These errors occur frequently, and I am not sure whether they are related to the job failure. The pattern seems to match exactly the one described in this thread (which unfortunately got no responses):

http://mail-archives.apache.org/mod_mbox/hadoop-user/201408.mbox/%3CCAJOOh6E1D1bx_9NrAUPPzAb6x1=fxd52rgqwxfzwy5tpjiw...@mail.gmail.com%3E

The only other warnings I see occurring around the time of the job failure are:

    WARN Failed to place enough replicas, still in need of 1 to reach 3. For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy

    WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor Exit code from container container_1417712817932_31879_01_002464 is : 143

(Exit code 143 is, as I understand it, 128 + 15, i.e. the container was killed with SIGTERM.)

Does anyone have any ideas about what could be causing the job to fail? I did not see anything obvious in the Cloudera Manager charts or logs: open file counts were below the limit, and memory usage was well within what the nodes have (6 nodes with 90 GB each). No errors in YARN either.

Thank you!

Best,
Ed
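P.S. For reference, the xciever increase I made corresponds to this hdfs-site.xml setting:

    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>8192</value>
    </property>

I can also turn on the DEBUG logging that the replica-placement warning suggests and report back; I believe the relevant log4j line would be something like:

    log4j.logger.org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy=DEBUG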