Fwd: Unable to submit jobs to a Hadoop cluster after a while
Re-sending the post. Any help is highly appreciated.

-- Forwarded message --
From: Ashwanth Kumar
Date: Sun, Nov 15, 2015 at 9:24 AM
Subject: Unable to submit jobs to a Hadoop cluster after a while
To: user@hadoop.apache.org

We're running Hadoop 2.6.0 via CDH 5.4.4 and we get the following error while submitting a new job:

15/10/08 00:33:31 WARN security.UserGroupInformation: PriviledgedActionException as:hadoop (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /data/hadoopfs/mapred/staging/hadoop/.staging/job_201510050004_0388/job.jar could only be replicated to 0 nodes instead of minReplication (=1). There are 161 datanode(s) running and no node(s) are excluded in this operation.

At that time we had 161 DNs running in the cluster. From the NN logs I see:

2015-10-08 01:00:26,889 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to choose remote rack (location = ~/default-rack), fallback to local rack
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException:
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:691)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRemoteRack(BlockPlacementPolicyDefault.java:580)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:357)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:419)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:214)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:111)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.chooseTargets(BlockManager.java:3746)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.access$200(BlockManager.java:3711)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1400)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1306)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3682)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3634)
    at java.lang.Thread.run(Thread.java:722)
2015-10-08 01:00:26,890 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false)

From one of the live 160+ DN logs, we saw:

Node /default-rack/10.181.8.222:50010 [ Storage [DISK]DS-2d39f3c3-2e67-48ad-871b-632f66b277d7:NORMAL: 10.181.8.222:50010 is not chosen since the node is too busy (load: 2 > 1.8370786516853932). ]
Node /default-rack/10.181.25.147:50010 [ Storage [DISK]DS-60b511b0-62aa-4c0f-92d9-6d90ff32ee49:NORMAL: 10.181.25.147:50010 is not chosen since the node is too busy (load: 2 > 1.8370786516853932). ]
Node /default-rack/10.181.8.152:50010 [ Storage [DISK]DS-7e0bf761-86f2-4748-9eda-fbfd9c69e127:NORMAL: 10.181.8.152:50010 is not chosen since the node is too busy (load: 2 > 1.8370786516853932). ]
Node /default-rack/10.181.25.67:50010 [ Storage [DISK]DS-5849e4d8-4ab6-4392-aee2-7a354c82c19d:NORMAL: 10.181.25.67:50010 is not chosen since the node is too busy (load: 2 > 1.8370786516853932). ]

A few things we observed from our end:
- If we restart the NN, we're able to submit jobs without any issues.
- We run this Hadoop cluster on AWS.
- The DN and TT processes run on a single EC2 machine which is backed by an Auto Scaling Group.
- We have another cluster which doesn't autoscale and doesn't exhibit this behaviour.

Any pointers or ideas on how to solve this for good would be really appreciated.

-- Ashwanth Kumar / ashwanthkumar.in
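The "node is too busy (load: 2 > 1.837...)" rejections in the DN log match the NameNode's load-based placement check, which skips DataNodes whose xceiver load exceeds roughly twice the cluster average. On an autoscaling cluster the average can drop low enough that nearly every node looks "too busy". A hedged sketch of turning that check off (property name as in Hadoop 2.6; verify against your CDH 5.4 docs, and note it trades away load balancing):

```xml
<!-- hdfs-site.xml on the NameNode (illustrative; restart the NN to apply) -->
<property>
  <name>dfs.namenode.replication.considerLoad</name>
  <!-- true (the default) makes the NN reject DataNodes whose load exceeds
       ~2x the cluster average, producing the "too busy" messages above -->
  <value>false</value>
</property>
```

That the problem clears on a NameNode restart is consistent with this: the restart resets the NN's view of per-node load until the averages drift again.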
map task frozen from master(s) perspective, but no process is there, and task log reports completion
Hi,

I have a map task "slot" occupied by a task that has not made progress for hours, and in fact is seen by YARN as NEW and STARTING. (Since we use YARN / Hadoop 2, it is not a slot per se, but the resource mechanism works out to dynamically computed slots - for instance I have a total of 5 map+reduce tasks running in the current config. I cannot change this while the job is still running, right?)

I have found a log of the task showing completion:

2015-11-19 04:01:14,719 INFO [main] org.apache.hadoop.mapred.MapTask: Starting flush of map output
2015-11-19 04:01:14,719 INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
2015-11-19 04:01:14,719 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 63496; bufvoid = 104857600
2015-11-19 04:01:14,719 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 26214396(104857584); kvend = 26201248(104804992); length = 13149/6553600
2015-11-19 04:01:14,851 INFO [main] org.apache.hadoop.mapred.MapTask: Finished spill 0
2015-11-19 04:01:14,858 INFO [main] org.apache.hadoop.mapred.Task: Task:attempt_1447872797537_0001_m_002241_0 is done. And is in the process of committing
2015-11-19 04:01:14,889 INFO [main] org.apache.hadoop.mapred.Task: Task 'attempt_1447872797537_0001_m_002241_0' done.
2015-11-19 04:01:14,889 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping MapTask metrics system...
2015-11-19 04:01:14,890 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system stopped.
2015-11-19 04:01:14,890 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system shutdown complete.

My hypothesis is that the task could not report its progress or completion to the application master, but in that case the master should have timed it out, I believe? Can I kill the task attempt in any way to allow it to restart?

Please advise,
Nicu
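On the kill question: the MR2 CLI can kill or fail an individual task attempt, after which the application master schedules a fresh attempt. A sketch using the attempt ID from the log above (check `mapred job -help` on your version for the exact options):

```shell
# Kill the stuck attempt; the AM reschedules it, and the kill does NOT
# count against the job's allowed-failures limit.
mapred job -kill-task attempt_1447872797537_0001_m_002241_0

# Alternatively, fail it explicitly; this DOES count as a failed attempt.
mapred job -fail-task attempt_1447872797537_0001_m_002241_0
```

And yes, per-job resource settings such as mapper/reducer container sizes are fixed at submission time; you cannot change them for a job that is already running.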
Re: failed to start namenode
It's surprising that no logs are created. How are you trying to start the NameNode? If you are starting it using Cloudera Manager, then the logs can be seen on screen as well.

On Thu, Nov 19, 2015 at 11:36 AM, siva kumar wrote:
> Hi Sandeep,
> The log is not getting generated for the name node.
>
> On Wed, Nov 18, 2015 at 5:53 PM, sandeep das wrote:
>
>> At least share some excerpts from the name node log file.
>>
>> On Wed, Nov 18, 2015 at 5:46 PM, siva kumar wrote:
>>
>>> Hi Folks,
>>> I'm trying to install a fresh hadoop cluster. But then, the namenode is not starting up, because of which the hdfs service is not started during my first run. Can anyone help me out?
>>> I'm trying this using parcels (CDH-5).
>>>
>>> Any help?
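When Cloudera Manager shows nothing useful, it can help to look at the role's log directories directly, or to start the NameNode in the foreground so the failure prints to the terminal. A hedged sketch (paths are typical CDH parcel defaults and may differ on your install):

```shell
# Typical CDH log location for HDFS roles (assumption: default layout)
ls -lt /var/log/hadoop-hdfs/

# stdout/stderr of the last attempted start, captured by the CM agent
# (assumption: default agent process directory)
ls -lt /var/run/cloudera-scm-agent/process/*NAMENODE*/logs/

# Or run the NameNode in the foreground as the hdfs user; a bad or
# unformatted name directory will be reported immediately on stderr.
sudo -u hdfs hdfs namenode
```

On a fresh install, a NameNode that dies with no log is often one that never got past formatting; on a CDH first run that step is `sudo -u hdfs hdfs namenode -format` (destructive, so only on a truly empty cluster).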
Re: failed to start namenode
Hi Sandeep,
The log is not getting generated for the name node.

On Wed, Nov 18, 2015 at 5:53 PM, sandeep das wrote:
> At least share some excerpts from the name node log file.
>
> On Wed, Nov 18, 2015 at 5:46 PM, siva kumar wrote:
>
>> Hi Folks,
>> I'm trying to install a fresh hadoop cluster. But then, the namenode is not starting up, because of which the hdfs service is not started during my first run. Can anyone help me out?
>> I'm trying this using parcels (CDH-5).
>>
>> Any help?
Yarn application reading from Data node using short-circuit.
Hi,

I was going through some benchmarking and realized that lots of TCP connections are initiated while running my Pig jobs over YARN (MR2). These TCP connections go to the data nodes. Although short-circuit reads are enabled on my data nodes, a lot of TCP connections are still being created. I wanted to check how we can enable the YARN application's tasks to read data from the data node using short-circuit reads, i.e. Unix domain sockets. I believe that will improve the performance of our jobs. Can someone please help me understand how I can make sure that the MR2 jobs created by my Pig scripts read data from the data node via short-circuit reads instead of TCP connections?

Regards,
Sandeep
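Worth noting: short-circuit reads are negotiated per HDFS *client*, so the settings must be visible to the MR task JVMs (i.e. in the client-side hdfs-site.xml the job picks up), not only in the DataNode configuration. A minimal sketch of the usual pair of properties (the socket path below is illustrative; it must match what the DataNodes use):

```xml
<!-- hdfs-site.xml, deployed to DataNodes AND client/gateway hosts -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <!-- Unix domain socket shared between DN and local clients;
       the directory must exist and be owned appropriately -->
  <name>dfs.domain.socket.path</name>
  <value>/var/run/hdfs-sockets/dn</value>
</property>
```

Also remember short-circuit only applies when the task runs on the same host as the block replica; non-local tasks will always read over TCP, so some TCP traffic is expected even with short-circuit working.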
Re: yarn uses nodes non-symmetrically
Nicolae,

It depends on how big your AM container is compared to the task containers. By default, the AM container size is 1.5GB and the map/reduce containers are 1GB. You can adjust these by setting yarn.app.mapreduce.am.resource.mb, mapreduce.map.memory.mb, and mapreduce.reduce.memory.mb. If you make them smaller, make sure you also adjust the -Xmx values for the mapreduce.*.java.opts properties as well.

Thanks,
-Eric

From: Nicolae Marasoiu
To: "user@hadoop.apache.org"
Sent: Tuesday, November 17, 2015 8:01 AM
Subject: yarn uses nodes non-symmetrically

Hi,

My nodes are identical, and the yarn-site.xml files are identical too. However, between slaves, one is used to the full but the other only around half, meaning: one gets 4 containers, the other gets 3 (and one of them is the app master, which is quite idle), and I don't know why.

Thanks,
Nicu
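The knobs mentioned above can be sketched in mapred-site.xml; the values below are illustrative, and the -Xmx settings are kept below the container sizes (roughly 80%) so the JVM leaves headroom for non-heap memory:

```xml
<!-- mapred-site.xml (illustrative values, not recommendations) -->
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>1024</value>
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1024</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>1024</value>
</property>
<property>
  <!-- Heap below container size to leave room for JVM overhead -->
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx820m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx820m</value>
</property>
```

With identical nodes, a 1.5GB AM plus 1GB tasks explains the 4-vs-3 split: the node hosting the AM has less memory left for task containers.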
RE: Does MapReduceApplicationMaster prevent the data node from spawning YarnChild?
Thank you very much. If that is the case, then the ApplicationMaster alone takes almost everything my datanode can offer - that would explain why the task itself can't start. Thank you very much again.

On 17 November 2015 at 20:40, Bikas Saha wrote:
You can check for "yarn.app.mapreduce.am.resource.mb" in your configs. Its default is 1.5GB and the other MR task defaults are 1GB.

From: darekg11 [mailto:darek...@tlen.pl]
Sent: Tuesday, November 17, 2015 11:34 AM
To: user@hadoop.apache.org
Subject: RE: Does MapReduceApplicationMaster prevent the data node from spawning YarnChild?

Thank you very much, I will check, but the problem is that the machines aren't really powerful - only 2GB of RAM and 2 cores each at 3.3GHz. Do you maybe know the approximate resources required for the AppMaster? And can we check the current resource consumption of the AppMaster?

On 17 November 2015 at 18:40, Bikas Saha wrote:
No. In general, App Masters and their containers can be launched on any machine, and both can be launched on the same machine. If your case happens repeatedly, then you could check the RM UI, while the job is running, to see the maximum resource on a node manager and the resource currently assigned. Perhaps your node managers don't have enough resources to run multiple containers?

From: darekg11 [mailto:darek...@tlen.pl]
Sent: Tuesday, November 17, 2015 7:55 AM
To: user@hadoop.apache.org
Subject: Does MapReduceApplicationMaster prevent the data node from spawning YarnChild?

Hello again, dear users. Today I ran into the following problem: my mini cluster consists of 1 NameNode and 2 slave nodes. I ran my MapReduce program, written in Java, with the number of reducers set to two. As a result, the first slave node got the MRAppMaster task and only the second slave launched a YarnChild, which actually produced the output results. I understand that MRAppMaster is the essential process responsible for managing the life of a given job. Is that why a single slave node can't launch MRAppMaster and YarnChild at the same time, or am I misunderstanding something?
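The arithmetic behind this thread: on a 2GB machine, once the node manager's advertised capacity is sized to fit the host, a 1.5GB default AM container can fill the node by itself, leaving no room for a task container. A hedged sketch of making both fit (values are illustrative, not tuned recommendations):

```xml
<!-- yarn-site.xml: advertise what a 2GB node can realistically offer -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>1536</value>
</property>

<!-- mapred-site.xml: shrink the AM so a task container fits beside it -->
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>512</value>
</property>
<property>
  <!-- AM heap kept below its container size -->
  <name>yarn.app.mapreduce.am.command-opts</name>
  <value>-Xmx400m</value>
</property>
```

The AM's actual resource usage is visible in the RM web UI under the running application's attempt page, alongside each node manager's used/total capacity.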
Data spilling on disk from MR jobs
Hi,

I'm running my Pig script over YARN (MR2). I was going through some tuning parameters and found that the value of the parameter "mapreduce.task.io.sort.mb" should be tuned properly. By default it is configured to 256 MB in my Cloudera setup. I would like to know how I can find out whether my MR jobs are spilling data to disk or not. Are there any logs which can help me find how much data was spilled to disk? Is there any parameter which can be configured to enable such logging?

CDH: CDH-5.4.4-1.cdh5.4.4.p0.4
Hadoop: 2.6.0-cdh5.4.4

Let me know in case more information is required.

Regards,
Sandeep
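Spilling shows up in the standard per-job counters rather than a dedicated log: compare "Spilled Records" against "Map output records" - a ratio above 1 means records were written to local disk more than once and a larger sort buffer may help. A sketch of pulling this from the CLI (the job ID is a placeholder; the same counters appear in the job history web UI):

```shell
# Full status and counters for a job (job ID is a placeholder)
mapred job -status job_1447872797537_0001

# Or query the spill counter directly from the TaskCounter group
mapred job -counter job_1447872797537_0001 \
    org.apache.hadoop.mapreduce.TaskCounter SPILLED_RECORDS
```

The bytes actually written to local disk during spills are reflected in the FILE_BYTES_WRITTEN counter of the FileSystemCounter group.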
Re: failed to start namenode
At least share some excerpts from the name node log file.

On Wed, Nov 18, 2015 at 5:46 PM, siva kumar wrote:
> Hi Folks,
> I'm trying to install a fresh hadoop cluster. But then, the namenode is not starting up, because of which the hdfs service is not started during my first run. Can anyone help me out?
> I'm trying this using parcels (CDH-5).
>
> Any help?
failed to start namenode
Hi Folks,

I'm trying to install a fresh hadoop cluster. But then, the namenode is not starting up, because of which the hdfs service is not started during my first run. Can anyone help me out? I'm trying this using parcels (CDH-5).

Any help?