Hello, we are using Hadoop 2.2.0 (HDP 2.0) and Avro 1.7.4, running on CentOS 6.3.
I am facing the following issue when using AvroMultipleOutputs with dynamic output files. My M/R job works fine for smaller amounts of data, or at least the error has not appeared there so far. With a bigger amount of data I get the following error back on the console:

Error: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /data/pif-dob-categorize/2014/08/26/14/_temporary/1/_temporary/attempt_1409147867302_0090_r_000000_0/HISTORY/20140216/64619-r-00000.avro could only be replicated to 0 nodes instead of minReplication (=1). There are 2 datanode(s) running and no node(s) are excluded in this operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1384)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2503)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:555)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:387)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59582)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2053)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2047)
    at org.apache.hadoop.ipc.Client.call(Client.java:1347)
    at org.apache.hadoop.ipc.Client.call(Client.java:1300)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:330)
    at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy11.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1231)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1078)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)

I checked that:
- the cluster is not partitioned
- the network is fine
- the HDFS cluster has enough capacity (only about 8% used)
- the dfs.reserved setting is 1, and 80 GB is still free

Just to clarify, our cluster is small, for development purposes only.
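For context, the reducer writes each record through AvroMultipleOutputs with a per-record base output path, roughly as in the simplified sketch below. The class name, record fields, and the basePathFor() helper are illustrative placeholders, not our exact code:

import java.io.IOException;

import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapreduce.AvroMultipleOutputs;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Simplified sketch, not the exact production job: the record type,
// field names, and basePathFor() are illustrative placeholders.
public class CategorizeReducer
    extends Reducer<Text, AvroValue<GenericRecord>, AvroKey<GenericRecord>, NullWritable> {

  private AvroMultipleOutputs amos;

  @Override
  protected void setup(Context context) {
    amos = new AvroMultipleOutputs(context);
  }

  @Override
  protected void reduce(Text key, Iterable<AvroValue<GenericRecord>> values, Context context)
      throws IOException, InterruptedException {
    for (AvroValue<GenericRecord> value : values) {
      GenericRecord record = value.datum();
      // Dynamic base path such as "HISTORY/20140216/64619"; every distinct
      // path gets its own record writer, i.e. its own open HDFS output
      // stream, which stays open until cleanup().
      amos.write(new AvroKey<GenericRecord>(record), NullWritable.get(), basePathFor(record));
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    amos.close(); // flushes and closes all per-path writers
  }

  // Illustrative helper: derives the category/date/id layout seen in the error paths.
  private String basePathFor(GenericRecord record) {
    return record.get("category") + "/" + record.get("date") + "/" + record.get("id");
  }
}

As far as I understand, each distinct base path keeps its own writer (and HDFS output stream) open until cleanup(), so bigger inputs mean many more concurrently open files per reduce task; maybe that is relevant here.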
The DataNode log contains a lot of errors like the following:

2014-08-28 06:57:22,585 ERROR datanode.DataNode (DataXceiver.java:run(225)) - bd-prg-dev1-dn1.corp.ncr.com:50010:DataXceiver error processing WRITE_BLOCK operation src: /153.86.209.223:47123 dest: /153.86.209.223:50010
java.io.IOException: Premature EOF from inputStream
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:435)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:693)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:569)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
    at java.lang.Thread.run(Thread.java:744)

The NameNode log contains plenty of errors of this type:

2014-08-28 06:58:19,904 WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(295)) - Not able to place enough replicas, still in need of 2 to reach 2
For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
2014-08-28 06:58:19,905 WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(295)) - Not able to place enough replicas, still in need of 2 to reach 2
For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
2014-08-28 06:58:19,905 ERROR security.UserGroupInformation (UserGroupInformation.java:doAs(1494)) - PriviledgedActionException as:jobsubmit (auth:SIMPLE) cause:java.io.IOException: File /data/pif-dob-categorize/2014/08/26/14/_temporary/1/_temporary/attempt_1409147867302_0090_r_000000_3/HISTORY/20130128/64619-r-00000.avro could only be replicated to 0 nodes instead of minReplication (=1). There are 2 datanode(s) running and no node(s) are excluded in this operation.
2014-08-28 06:58:19,905 INFO ipc.Server (Server.java:run(2075)) - IPC Server handler 56 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 153.86.209.223:47263 Call#185 Retry#0: error: java.io.IOException: File /data/pif-dob-categorize/2014/08/26/14/_temporary/1/_temporary/attempt_1409147867302_0090_r_000000_3/HISTORY/20130128/64619-r-00000.avro could only be replicated to 0 nodes instead of minReplication (=1). There are 2 datanode(s) running and no node(s) are excluded in this operation.
java.io.IOException: File /data/pif-dob-categorize/2014/08/26/14/_temporary/1/_temporary/attempt_1409147867302_0090_r_000000_3/HISTORY/20130128/64619-r-00000.avro could only be replicated to 0 nodes instead of minReplication (=1). There are 2 datanode(s) running and no node(s) are excluded in this operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1384)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2503)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:555)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:387)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59582)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2053)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2047)

Other logs don't provide any relevant info, or at least I haven't noticed anything suspicious.

Our cluster has the following parameters:

# hadoop - yarn-site.xml
yarn.nodemanager.resource.memory-mb: 2048
yarn.scheduler.minimum-allocation-mb: 256
yarn.scheduler.maximum-allocation-mb: 2048

# hadoop - mapred-site.xml
mapreduce.map.memory.mb: 768
mapreduce.map.java.opts: -Xmx512m
mapreduce.reduce.memory.mb: 1024
mapreduce.reduce.java.opts: -Xmx768m
mapreduce.task.io.sort.mb: 100
yarn.app.mapreduce.am.resource.mb: 1024
yarn.app.mapreduce.am.command-opts: -Xmx768m

# hadoop - hdfs-site.xml
dfs.replication: 3

I tried to manipulate the following ones; the commented-out values are the originals (see also the P.S. below for how the M/R-side ones were applied per job):

<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>4096</value> <!--<value>1024</value>-->
</property>
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>300</value> <!--<value>100</value>-->
</property>
<property>
  <name>mapreduce.reduce.merge.inmem.threshold</name>
  <value>0</value> <!--<value>1000</value>-->
</property>
<property>
  <name>mapreduce.task.io.sort.factor</name>
  <value>300</value> <!--<value>100</value>-->
</property>
<property>
  <name>mapreduce.reduce.input.limit</name>
  <value>-1</value> <!--<value>10737418240</value>-->
</property>

Several times the job succeeded, but that was not a stable solution. Any guess at what might have gone wrong, what to check, or where to look? I am pretty new to Hadoop, so any hints are warmly appreciated.

Thanks,
Jakub
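P.S. In case it matters, the M/R-side values above were also tried per job from the driver rather than cluster-wide, roughly like this (a simplified sketch; the class and job names are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Simplified sketch of the per-job overrides tried; class and job names
// are placeholders. Note that dfs.datanode.max.transfer.threads is a
// DataNode daemon setting and cannot be overridden here -- that one was
// changed in hdfs-site.xml and the DataNodes were restarted.
public class CategorizeDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.task.io.sort.mb", 300);             // was 100
    conf.setInt("mapreduce.task.io.sort.factor", 300);         // was 100
    conf.setInt("mapreduce.reduce.merge.inmem.threshold", 0);  // was 1000
    conf.setLong("mapreduce.reduce.input.limit", -1L);         // was 10737418240

    Job job = Job.getInstance(conf, "pif-dob-categorize");
    // ... input/output paths, formats, mapper/reducer, named outputs, etc.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}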