Re-sending the post. Any help is highly appreciated.
---------- Forwarded message ----------
From: Ashwanth Kumar <ashwanthku...@googlemail.com>
Date: Sun, Nov 15, 2015 at 9:24 AM
Subject: Unable to submit jobs to a Hadoop cluster after a while
To: user@hadoop.apache.org

We're running Hadoop 2.6.0 via CDH 5.4.4, and after the cluster has been up for a while, new job submissions fail with the following error:

    15/10/08 00:33:31 WARN security.UserGroupInformation: PriviledgedActionException as:hadoop (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /data/hadoopfs/mapred/staging/hadoop/.staging/job_201510050004_0388/job.jar could only be replicated to 0 nodes instead of minReplication (=1). There are 161 datanode(s) running and no node(s) are excluded in this operation.

At that time we had 161 DataNodes running in the cluster. From the NameNode logs I see:

    2015-10-08 01:00:26,889 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to choose remote rack (location = ~/default-rack), fallback to local rack
    org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException:
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:691)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRemoteRack(BlockPlacementPolicyDefault.java:580)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:357)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:419)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:214)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:111)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.chooseTargets(BlockManager.java:3746)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.access$200(BlockManager.java:3711)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1400)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1306)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3682)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3634)
        at java.lang.Thread.run(Thread.java:722)
    2015-10-08 01:00:26,890 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false)

From one of the 160+ live DataNode logs, we saw:

    Node /default-rack/10.181.8.222:50010 [ Storage [DISK]DS-2d39f3c3-2e67-48ad-871b-632f66b277d7:NORMAL: 10.181.8.222:50010 is not chosen since the node is too busy (load: 2 > 1.8370786516853932). ]
    Node /default-rack/10.181.25.147:50010 [ Storage [DISK]DS-60b511b0-62aa-4c0f-92d9-6d90ff32ee49:NORMAL: 10.181.25.147:50010 is not chosen since the node is too busy (load: 2 > 1.8370786516853932). ]
    Node /default-rack/10.181.8.152:50010 [ Storage [DISK]DS-7e0bf761-86f2-4748-9eda-fbfd9c69e127:NORMAL: 10.181.8.152:50010 is not chosen since the node is too busy (load: 2 > 1.8370786516853932). ]
    Node /default-rack/10.181.25.67:50010 [ Storage [DISK]DS-5849e4d8-4ab6-4392-aee2-7a354c82c19d:NORMAL: 10.181.25.67:50010 is not chosen since the node is too busy (load: 2 > 1.8370786516853932). ]

A few things we observed from our end:
- If we restart the NameNode, we're able to submit jobs again without any issues.
- We run this Hadoop cluster on AWS.
- The DataNode and TaskTracker processes run together on a single EC2 instance, backed by an Auto Scaling group.
- We have another cluster that doesn't autoscale, and it doesn't exhibit this behaviour.

Any pointers or ideas on how to solve this for good would be really appreciated.

--
Ashwanth Kumar / ashwanthkumar.in
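[Editor's note: the "node is too busy" messages come from the default block placement policy's load check. In Hadoop 2.x, when dfs.namenode.replication.considerLoad is true (the default), the NameNode skips any DataNode whose active-connection (xceiver) count exceeds roughly twice the cluster average, which matches the "load: 2 > 1.837..." figures above on a mostly idle cluster. A sketch of how that check could be switched off in hdfs-site.xml, offered only as a diagnostic experiment rather than a confirmed fix:]

```xml
<!-- hdfs-site.xml on the NameNode (restart required).
     Disabling this stops the placement policy from excluding
     DataNodes it considers "too busy". Trade-off: writes may
     then land on genuinely overloaded nodes. -->
<property>
  <name>dfs.namenode.replication.considerLoad</name>
  <value>false</value>
</property>
```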