Re-sending the post. Any help is highly appreciated.
---------- Forwarded message ----------
From: Ashwanth Kumar <ashwanthku...@googlemail.com>
Date: Sun, Nov 15, 2015 at 9:24 AM
Subject: Unable to submit jobs to a Hadoop cluster after a while
To: user@hadoop.apache.org

We're running Hadoop 2.6.0 via CDH 5.4.4, and after the cluster has been up for a while, new job submissions fail with the following error:

    15/10/08 00:33:31 WARN security.UserGroupInformation: PriviledgedActionException as:hadoop (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /data/hadoopfs/mapred/staging/hadoop/.staging/job_201510050004_0388/job.jar could only be replicated to 0 nodes instead of minReplication (=1). There are 161 datanode(s) running and no node(s) are excluded in this operation.

At that time we had 161 DataNodes running in the cluster. From the NameNode logs I see:

    2015-10-08 01:00:26,889 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to choose remote rack (location = ~/default-rack), fallback to local rack
    org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException:
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:691)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRemoteRack(BlockPlacementPolicyDefault.java:580)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:357)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:419)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:214)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:111)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.chooseTargets(BlockManager.java:3746)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.access$200(BlockManager.java:3711)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1400)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1306)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3682)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3634)
        at java.lang.Thread.run(Thread.java:722)
    2015-10-08 01:00:26,890 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false)

From one of the 160+ live DataNode logs, we saw:

    Node /default-rack/10.181.8.222:50010 [ Storage [DISK]DS-2d39f3c3-2e67-48ad-871b-632f66b277d7:NORMAL: 10.181.8.222:50010 is not chosen since the node is too busy (load: 2 > 1.8370786516853932). ]
    Node /default-rack/10.181.25.147:50010 [ Storage [DISK]DS-60b511b0-62aa-4c0f-92d9-6d90ff32ee49:NORMAL: 10.181.25.147:50010 is not chosen since the node is too busy (load: 2 > 1.8370786516853932). ]
    Node /default-rack/10.181.8.152:50010 [ Storage [DISK]DS-7e0bf761-86f2-4748-9eda-fbfd9c69e127:NORMAL: 10.181.8.152:50010 is not chosen since the node is too busy (load: 2 > 1.8370786516853932). ]
    Node /default-rack/10.181.25.67:50010 [ Storage [DISK]DS-5849e4d8-4ab6-4392-aee2-7a354c82c19d:NORMAL: 10.181.25.67:50010 is not chosen since the node is too busy (load: 2 > 1.8370786516853932). ]

A few things we observed from our end:
- If we restart the NameNode, we're able to submit jobs again without any issues.
- We run this Hadoop cluster on AWS.
- The DataNode and TaskTracker processes run together on a single EC2 instance, backed by an Auto Scaling group.
- We have another cluster that doesn't autoscale, and it doesn't exhibit this behaviour.

Any pointers or ideas on how to solve this for good would be really appreciated.

--
Ashwanth Kumar / ashwanthkumar.in
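[Editor's note: the "node is too busy" messages come from the default block placement policy's load check. In Hadoop 2.x, when dfs.namenode.replication.considerLoad is true (the default), the NameNode skips any DataNode whose active-connection (xceiver) count exceeds roughly twice the cluster average, which matches the "load: 2 > 1.837..." figures above on a mostly idle cluster. A sketch of how that check could be switched off in hdfs-site.xml, offered only as a diagnostic experiment rather than a confirmed fix:]

```xml
<!-- hdfs-site.xml on the NameNode (restart required).
     Disabling this stops the placement policy from excluding
     DataNodes it considers "too busy". Trade-off: writes may
     then land on genuinely overloaded nodes. -->
<property>
  <name>dfs.namenode.replication.considerLoad</name>
  <value>false</value>
</property>
```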