Re: [ANNOUNCE] Apache Hadoop 3.2.1 release
Updated twitter message: `` Apache Hadoop 3.2.1 is released: https://s.apache.org/96r4h Announcement: https://s.apache.org/jhnpe Overview: https://s.apache.org/tht6a Changes: https://s.apache.org/pd6of Release notes: https://s.apache.org/ta50b Thanks to our community of developers, operators, and users. -Rohith Sharma K S On Wed, 25 Sep 2019 at 14:15, Sunil Govindan wrote: > Here the link of Overview URL is old. > We should ideally use https://hadoop.apache.org/release/3.2.1.html > > Thanks > Sunil > > On Wed, Sep 25, 2019 at 2:10 PM Rohith Sharma K S < > rohithsharm...@apache.org> wrote: > >> Can someone help to post this in twitter account? >> >> Apache Hadoop 3.2.1 is released: https://s.apache.org/mzdb6 >> Overview: https://s.apache.org/tht6a >> Changes: https://s.apache.org/pd6of >> Release notes: https://s.apache.org/ta50b >> >> Thanks to our community of developers, operators, and users. >> >> -Rohith Sharma K S >> >> On Wed, 25 Sep 2019 at 13:44, Rohith Sharma K S < >> rohithsharm...@apache.org> wrote: >> >>> Hi all, >>> >>> It gives us great pleasure to announce that the Apache Hadoop >>> community has >>> voted to release Apache Hadoop 3.2.1. >>> >>> Apache Hadoop 3.2.1 is the stable release of Apache Hadoop 3.2 line, >>> which >>> includes 493 fixes since Hadoop 3.2.0 release: >>> >>> - For major changes included in Hadoop 3.2 line, please refer Hadoop >>> 3.2.1 main page[1]. >>> - For more details about fixes in 3.2.1 release, please read >>> CHANGELOG[2] and RELEASENOTES[3]. >>> >>> The release news is posted on the Hadoop website too, you can go to the >>> downloads section directly[4]. >>> >>> Thank you all for contributing to the Apache Hadoop! 
>>> >>> Cheers, >>> Rohith Sharma K S >>> >>> >>> [1] https://hadoop.apache.org/docs/r3.2.1/index.html >>> [2] >>> https://hadoop.apache.org/docs/r3.2.1/hadoop-project-dist/hadoop-common/release/3.2.1/CHANGELOG.3.2.1.html >>> [3] >>> https://hadoop.apache.org/docs/r3.2.1/hadoop-project-dist/hadoop-common/release/3.2.1/RELEASENOTES.3.2.1.html >>> [4] https://hadoop.apache.org >>> >>
Re: [ANNOUNCE] Apache Hadoop 3.2.1 release
Updated announcement:

Hi all,

It gives us great pleasure to announce that the Apache Hadoop community has voted to release Apache Hadoop 3.2.1.

Apache Hadoop 3.2.1 is the stable release of the Apache Hadoop 3.2 line, which includes 493 fixes since the Hadoop 3.2.0 release:

- For major changes included in the Hadoop 3.2 line, please refer to the Hadoop 3.2.1 main page [1].
- For more details about fixes in the 3.2.1 release, please read the CHANGELOG [2] and RELEASENOTES [3].

The release news is posted on the Hadoop website too; you can go to the downloads section directly [4]. This announcement itself is also up on the website [0].

Thank you all for contributing to Apache Hadoop!

Cheers,
Rohith Sharma K S

[0] Announcement: https://hadoop.apache.org/release/3.2.1.html
[1] Overview of major changes: https://hadoop.apache.org/docs/r3.2.1/index.html
[2] Detailed change-log: https://hadoop.apache.org/docs/r3.2.1/hadoop-project-dist/hadoop-common/release/3.2.1/CHANGELOG.3.2.1.html
[3] Detailed release-notes: https://hadoop.apache.org/docs/r3.2.1/hadoop-project-dist/hadoop-common/release/3.2.1/RELEASENOTES.3.2.1.html
[4] Project Home: https://hadoop.apache.org

On Wed, 25 Sep 2019 at 13:44, Rohith Sharma K S wrote: > Hi all, > > It gives us great pleasure to announce that the Apache Hadoop > community has > voted to release Apache Hadoop 3.2.1. > > Apache Hadoop 3.2.1 is the stable release of Apache Hadoop 3.2 line, which > includes 493 fixes since Hadoop 3.2.0 release: > > - For major changes included in Hadoop 3.2 line, please refer Hadoop 3.2.1 > main page[1]. > - For more details about fixes in 3.2.1 release, please read CHANGELOG[2] > and RELEASENOTES[3]. > > The release news is posted on the Hadoop website too, you can go to the > downloads section directly[4]. > > Thank you all for contributing to the Apache Hadoop! 
> > Cheers, > Rohith Sharma K S > > > [1] https://hadoop.apache.org/docs/r3.2.1/index.html > [2] > https://hadoop.apache.org/docs/r3.2.1/hadoop-project-dist/hadoop-common/release/3.2.1/CHANGELOG.3.2.1.html > [3] > https://hadoop.apache.org/docs/r3.2.1/hadoop-project-dist/hadoop-common/release/3.2.1/RELEASENOTES.3.2.1.html > [4] https://hadoop.apache.org >
Re: [ANNOUNCE] Apache Hadoop 3.2.1 release
Can someone help post this on the Twitter account?

Apache Hadoop 3.2.1 is released: https://s.apache.org/mzdb6
Overview: https://s.apache.org/tht6a
Changes: https://s.apache.org/pd6of
Release notes: https://s.apache.org/ta50b

Thanks to our community of developers, operators, and users.

-Rohith Sharma K S

On Wed, 25 Sep 2019 at 13:44, Rohith Sharma K S wrote: > Hi all, > > It gives us great pleasure to announce that the Apache Hadoop > community has > voted to release Apache Hadoop 3.2.1. > > Apache Hadoop 3.2.1 is the stable release of Apache Hadoop 3.2 line, which > includes 493 fixes since Hadoop 3.2.0 release: > > - For major changes included in Hadoop 3.2 line, please refer Hadoop 3.2.1 > main page[1]. > - For more details about fixes in 3.2.1 release, please read CHANGELOG[2] > and RELEASENOTES[3]. > > The release news is posted on the Hadoop website too, you can go to the > downloads section directly[4]. > > Thank you all for contributing to the Apache Hadoop! > > Cheers, > Rohith Sharma K S > > > [1] https://hadoop.apache.org/docs/r3.2.1/index.html > [2] > https://hadoop.apache.org/docs/r3.2.1/hadoop-project-dist/hadoop-common/release/3.2.1/CHANGELOG.3.2.1.html > [3] > https://hadoop.apache.org/docs/r3.2.1/hadoop-project-dist/hadoop-common/release/3.2.1/RELEASENOTES.3.2.1.html > [4] https://hadoop.apache.org >
[ANNOUNCE] Apache Hadoop 3.2.1 release
Hi all,

It gives us great pleasure to announce that the Apache Hadoop community has voted to release Apache Hadoop 3.2.1.

Apache Hadoop 3.2.1 is the stable release of the Apache Hadoop 3.2 line, which includes 493 fixes since the Hadoop 3.2.0 release:

- For major changes included in the Hadoop 3.2 line, please refer to the Hadoop 3.2.1 main page[1].
- For more details about fixes in the 3.2.1 release, please read the CHANGELOG[2] and RELEASENOTES[3].

The release news is posted on the Hadoop website too; you can go to the downloads section directly[4].

Thank you all for contributing to Apache Hadoop!

Cheers,
Rohith Sharma K S

[1] https://hadoop.apache.org/docs/r3.2.1/index.html
[2] https://hadoop.apache.org/docs/r3.2.1/hadoop-project-dist/hadoop-common/release/3.2.1/CHANGELOG.3.2.1.html
[3] https://hadoop.apache.org/docs/r3.2.1/hadoop-project-dist/hadoop-common/release/3.2.1/RELEASENOTES.3.2.1.html
[4] https://hadoop.apache.org
Re: Question about Yarn rolling upgrade
The JIRAs mentioned above do describe breaking changes, but those are fixed in 2.6 itself. The only JIRA I see is YARN-8310, which is fixed in 2.10. Looking at the stack trace you mentioned, it doesn't seem related to your issue. Maybe try applying the patch and running a job. Otherwise, let's create a JIRA and discuss there in detail.

-Rohith Sharma K S

On Thu, 7 Feb 2019 at 22:52, Aihua Xu wrote: > Hi Rohith, > > Thanks for your suggestion. I was tracing the issue and found out it's > caused by the incompatibility from these two changes. The tokens have been > changed. > > YARN-668. Changed > NMTokenIdentifier/AMRMTokenIdentifier/ContainerTokenIdentifier to use > protobuf object as the payload. Contributed by Junping Du. > > YARN-2615. Changed > ClientToAMTokenIdentifier/RM(Timeline)DelegationTokenIdentifier to use > protobuf as payload. Contributed by Junping Du > > > I was testing new RM with old NM. > > Followup on the the order of Yarn upgrade. I checked the HWX blog > <https://hortonworks.com/blog/introducing-rolling-upgrades-downgrades-apache-hadoop-yarn-cluster/> > about > rolling upgrade and it's suggesting to upgrade RM first. But you are > saying we should NM first and RM second? Can you confirm? > > Thanks, > Aihua > > > > On Wed, Feb 6, 2019 at 8:26 PM Rohith Sharma K S < > rohithsharm...@apache.org> wrote: > >> Hi Aihua, >> >> Could you give more clarity on when job is submitted like a) before >> starting upgrade b) after RM upgrade and before NM upgrade c) after YARN >> upgrade fully? >> Typically, order of upgrade suggested is NM's first and RM second. >> >> Reg the NM warn messages you might be hitting >> https://issues.apache.org/jira/browse/HADOOP-11692. >> >> Doesn't any subsequent jobs succeeded post upgrade? >> -Rohith Sharma K S >> >> On Thu, 7 Feb 2019 at 03:20, Aihua Xu wrote: >> >>> Hi all, >>> >>> I'm investigating the rolling upgrade process from Hadoop 2.6 to Hadoop >>> 2.9.1. 
I'm trying to upgrade ResourceManager first and then try to upgrade >>> NodeManager. When I submit a yarn job, RM fails with the following >>> exception: >>> >>> Application application_1549408943468_0001 failed 2 times due to Error >>> launching appattempt_1549408943468_0001_02. Got exception: >>> java.io.IOException: Failed on local exception: java.io.IOException: >>> java.io.EOFException; Host Details : local host is: >>> "hadoopbenchaqjm01-sjc1/10.67.2.171"; destination host is: >>> "hadoopbencha22-sjc1.prod.uber.internal":8041; >>> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:805) >>> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1497) >>> at org.apache.hadoop.ipc.Client.call(Client.java:1439) >>> at org.apache.hadoop.ipc.Client.call(Client.java:1349) >>> at >>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227) >>> at >>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) >>> at com.sun.proxy.$Proxy87.startContainers(Unknown Source) >>> at >>> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128) >>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >>> at >>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) >>> at >>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >>> at java.lang.reflect.Method.invoke(Method.java:498) >>> at >>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) >>> at >>> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) >>> at >>> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) >>> at >>> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) >>> at >>> 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) >>> at com.sun.proxy.$Proxy88.startContainers(Unknown Source) >>> at >>> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:122) >>> at >>> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:307) >>> at >>> java.util.concurrent.Thr
Re: Question about Yarn rolling upgrade
Hi Aihua,

Could you give more clarity on when the job is submitted: a) before starting the upgrade, b) after the RM upgrade and before the NM upgrade, or c) after the YARN upgrade is fully complete? Typically, the suggested order of upgrade is NMs first and RM second.

Regarding the NM warn messages, you might be hitting https://issues.apache.org/jira/browse/HADOOP-11692.

Do any subsequent jobs succeed post-upgrade?

-Rohith Sharma K S

On Thu, 7 Feb 2019 at 03:20, Aihua Xu wrote: > Hi all, > > I'm investigating the rolling upgrade process from Hadoop 2.6 to Hadoop > 2.9.1. I'm trying to upgrade ResourceManager first and then try to upgrade > NodeManager. When I submit a yarn job, RM fails with the following > exception: > > Application application_1549408943468_0001 failed 2 times due to Error > launching appattempt_1549408943468_0001_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.io.EOFException; Host Details : local host is: > "hadoopbenchaqjm01-sjc1/10.67.2.171"; destination host is: > "hadoopbencha22-sjc1.prod.uber.internal":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:805) > at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1497) > at org.apache.hadoop.ipc.Client.call(Client.java:1439) > at org.apache.hadoop.ipc.Client.call(Client.java:1349) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at 
java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy88.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:122) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:307) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.IOException: java.io.EOFException > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:757) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1889) > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:720) > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:813) > at org.apache.hadoop.ipc.Client$Connection.access$3500(Client.java:411) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1554) > at org.apache.hadoop.ipc.Client.call(Client.java:1385) > ... 
20 more > Caused by: java.io.EOFException > at java.io.DataInputStream.readInt(DataInputStream.java:392) > at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1798) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:365) > at > org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:615) > at org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:411) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:800) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:796) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1889) > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:795) > ... 23 more > > > and NM with > > 2019-02-06 00:29:20,214 WARN SecurityLogger.org.apache.hadoop.ipc.Server: > Auth failed for 10.67.2.171:54588:null (DIGEST-MD5: IO error acquiring > password) with true cause: (null) > > > I'm wondering if it's a known issue and anybody has an insight for it. > > Thanks, > Aihua > > >
Re: Get information of containers - running/killed/complete
Hi Ajay,

For running containers, you can get the container report from the ResourceManager. For completed/killed containers, you need to start the ApplicationHistoryServer daemon and use the same API, i.e. yarnClient.getContainerReport(), to get the container report. Basically, this API contacts the RM first for the container report. If the RM does not have that container Id, then yarnClient contacts the ApplicationHistoryServer to get the container report.

Thanks & Regards
Rohith Sharma K S

On 15 November 2016 at 11:14, AJAY GUPTA wrote: > Hi > > For monitoring purposes, I need to capture some container information for > my application deployed on Yarn, specially for containers getting killed. > This also included the finishTime of the container i.e., the time when the > container got killed. Is there any API which will provide this information. > Currently, I am able to get information of only RUNNING containers via > yarnClient.getContainerReport(). > > > Thanks, > Ajay > >
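For the completed/killed-container path to work, the timeline service backing the ApplicationHistoryServer has to be enabled. A minimal yarn-site.xml sketch (property names as documented for the Hadoop 2.x timeline service; the hostname is a placeholder):

```xml
<!-- yarn-site.xml: enable the timeline service so the ApplicationHistoryServer
     can serve reports for completed/killed containers -->
<property>
  <name>yarn.timeline-service.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.timeline-service.generic-application-history.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.timeline-service.hostname</name>
  <value>history-host.example.com</value>
</property>
```

The daemon is then started with `yarn-daemon.sh start timelineserver` (Hadoop 2.x) on the configured host, after which yarnClient.getContainerReport() can fall through to it for container Ids the RM no longer tracks.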
Re: ACCEPTED: waiting for AM container to be allocated, launched and register with RM
Hi,

From the discussion below and the AM logs, I see that the AM container has launched but is not able to connect to the RM. This looks like a configuration issue. Would you check in your job.xml whether yarn.resourcemanager.scheduler.address has been configured? Essentially, this address is required by the MRAppMaster for connecting to the RM for heartbeats. If you do not configure it, the default value is taken, i.e. port 8030.

Thanks & Regards
Rohith Sharma K S

> On Aug 20, 2016, at 7:02 AM, rammohan ganapavarapu > wrote: > > Even if the cluster dont have enough resources it should connect to " > /0.0.0.0:8030 <http://0.0.0.0:8030/>" right? it should connect to my > , not sure why its trying to connect to 0.0.0.0:8030 > <http://0.0.0.0:8030/>. > I have verified the config and i removed traces of 0.0.0.0 still no luck. > org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at > /0.0.0.0:8030 <http://0.0.0.0:8030/> > > If an one has any clue please share. > > Thanks, > Ram > > > On Fri, Aug 19, 2016 at 2:32 PM, rammohan ganapavarapu > mailto:rammohanga...@gmail.com>> wrote: > When i submit a job using yarn its seems working only with oozie its failing > i guess, not sure what is missing. > > yarn jar > /uap/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 20 > 1000 > Number of Maps = 20 > Samples per Map = 1000 > . > . > . > Job Finished in 19.622 seconds > Estimated value of Pi is 3.1428 > > Ram > > On Fri, Aug 19, 2016 at 11:46 AM, rammohan ganapavarapu > mailto:rammohanga...@gmail.com>> wrote: > Ok, i have used yarn-utils.py to get the correct values for my cluster and > update those properties and restarted RM and NM but still no luck not sure > what i am missing, any other insights will help me. > > Below are my properties from yarn-site.xml and map-site.xml. 
> > python yarn-utils.py -c 24 -m 63 -d 3 -k False > Using cores=24 memory=63GB disks=3 hbase=False > Profile: cores=24 memory=63488MB reserved=1GB usableMem=62GB disks=3 > Num Container=6 > Container Ram=10240MB > Used Ram=60GB > Unused Ram=1GB > yarn.scheduler.minimum-allocation-mb=10240 > yarn.scheduler.maximum-allocation-mb=61440 > yarn.nodemanager.resource.memory-mb=61440 > mapreduce.map.memory.mb=5120 > mapreduce.map.java.opts=-Xmx4096m > mapreduce.reduce.memory.mb=10240 > mapreduce.reduce.java.opts=-Xmx8192m > yarn.app.mapreduce.am <http://yarn.app.mapreduce.am/>.resource.mb=5120 > yarn.app.mapreduce.am <http://yarn.app.mapreduce.am/>.command-opts=-Xmx4096m > mapreduce.task.io.sort.mb=1024 > > > > mapreduce.map.memory.mb > 5120 > > > mapreduce.map.java.opts > -Xmx4096m > > > mapreduce.reduce.memory.mb > 10240 > > > mapreduce.reduce.java.opts > -Xmx8192m > > > yarn.app.mapreduce.am > <http://yarn.app.mapreduce.am/>.resource.mb > 5120 > > > yarn.app.mapreduce.am > <http://yarn.app.mapreduce.am/>.command-opts > -Xmx4096m > > > mapreduce.task.io.sort.mb > 1024 > > > > > > yarn.scheduler.minimum-allocation-mb > 10240 > > > > yarn.scheduler.maximum-allocation-mb > 61440 > > > > yarn.nodemanager.resource.memory-mb > 61440 > > > > Ram > > On Thu, Aug 18, 2016 at 11:14 PM, tkg_cangkul <mailto:yuza.ras...@gmail.com>> wrote: > maybe this link can be some reference to tune up the cluster: > > http://jason4zhu.blogspot.co.id/2014/10/memory-configuration-in-hadoop.html > <http://jason4zhu.blogspot.co.id/2014/10/memory-configuration-in-hadoop.html> > > > On 19/08/16 11:13, rammohan ganapavarapu wrote: >> Do you know what properties to tune? >> >> Thanks, >> Ram >> >> On Thu, Aug 18, 2016 at 9:11 PM, tkg_cangkul > <mailto:yuza.ras...@gmail.com>> wrote: >> i think that's because you don't have enough resource. u can tune your >> cluster config to maximize your resource. 
>> >> >> On 19/08/16 11:03, rammohan ganapavarapu wrote: >>> I dont see any thing odd except this not sure if i have to worry about it >>> or not. >>> >>> 2016-08-19 03:29:26,621 INFO [main] org.apache.hadoop.yarn.client.RMProxy: >>> Connecting to ResourceManager at /0.0.0.0:8030 <http://0.0.0.0:8030/> >>> 2016-08-19 03:29:27,646 INFO [main] org.apache.hadoop.ipc.Client: Retrying >>&
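The fix Rohith suggests in this thread — pinning yarn.resourcemanager.scheduler.address so the MRAppMaster stops falling back to 0.0.0.0:8030 — looks like this in yarn-site.xml (a minimal sketch; `rm-host.example.com` is a placeholder for the actual ResourceManager host):

```xml
<!-- yarn-site.xml: explicit scheduler address; without it the MRAppMaster
     uses the default and tries to heartbeat to 0.0.0.0:8030 -->
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>rm-host.example.com:8030</value>
</property>
```

This property has to be visible in the configuration that ends up in the submitted job's job.xml, which may explain why jobs submitted directly with `yarn jar` work while Oozie-launched jobs fail: the Oozie action builds its own configuration.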
Re: Issue with Hadoop Job History Server
MR jobs and the JHS should have the same configuration for done-dir, if configured. Otherwise, the staging-dir should be the same for both. Make sure both the jobs and the JHS have the same configuration values. Usually what happens is that the MRApp writes the job history file to one location while the HistoryServer tries to read it from a different location. This causes the JHS to display no jobs.

Thanks & Regards
Rohith Sharma K S

> On Aug 18, 2016, at 12:35 PM, Gao, Yunlong wrote: > > To whom it may concern, > > I am using Hadoop 2.7.1.2.3.6.0-3796, with the Hortonworks distribution of > HDP-2.3.6.0-3796. I have a question with the Hadoop Job History sever. > > After I set up everything, the resource manager/name nodes/data nodes seem to > be running fine. But the job history server is not working correctly. The > issue with it is that the UI of the job history server does not show any > jobs. And all the rest calls to the job history server do not work either. > Also notice that there is no logs in HDFS under the directory of > "mapreduce.jobhistory.done-dir" > > I have tried with different things, including restarting the job history > server and monitor the log -- no error/exceptions is observed. I also rename > the /hadoop/mapreduce/jhs/mr-jhs-state for the state recovery of job history > server, and then restart it again, but no particular error happens. I tried > with some other random stuff that I borrowed from online blogs/documents but > got no luck. > > > Any help would be very much appreciated. > > Thanks, > Yunlong >
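A concrete way to apply this advice is to keep the history directories identical in the mapred-site.xml seen by both the jobs and the JHS. A sketch (the paths are placeholders; the property names are the standard mapreduce.jobhistory.* ones):

```xml
<!-- mapred-site.xml: these must match on the cluster nodes and the JHS host,
     otherwise the MRApp writes history where the JHS never looks -->
<property>
  <name>mapreduce.jobhistory.done-dir</name>
  <value>/mr-history/done</value>
</property>
<property>
  <name>mapreduce.jobhistory.intermediate-done-dir</name>
  <value>/mr-history/tmp</value>
</property>
```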
Re: Connecting JConsole to ResourceManager
Hi,

Have you enabled the JMX remote connection parameters at RM startup? For a remote connection, these parameters are supposed to be passed in the Hadoop opts. You need to enable remote JMX by configuring these parameters at RM JVM startup:

-Dcom.sun.management.jmxremote.port= \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false

-Regards
Rohith Sharma K S

> On Aug 9, 2016, at 12:32 PM, Atri Sharma wrote: > > Hi All, > > I am trying to connect to a running ResourceManager process on Windows. I ran > jconsole and it shows the ResourceManager process. When I try connecting, it > immediately fails saying that it cannot connect. > > I verified that the cluster is running fine by running the wordcount example. > > Please advise. > > Regards, > > Atri >
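These flags are typically injected via YARN_RESOURCEMANAGER_OPTS in yarn-env.sh; a sketch (port 1099 is an arbitrary placeholder, and disabling authentication/SSL is only acceptable on a trusted network):

```shell
# yarn-env.sh: expose the ResourceManager JVM over remote JMX.
# Port 1099 is a placeholder; auth/SSL are disabled for brevity only.
export YARN_RESOURCEMANAGER_OPTS="$YARN_RESOURCEMANAGER_OPTS \
  -Dcom.sun.management.jmxremote.port=1099 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false"
```

After restarting the RM, point JConsole at the RM host and the chosen port as a remote process rather than attaching to the local PID.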
RE: Securely discovering Application Master's metadata or sending a secret to Application Master at submission
Hi,

Basically I see you have multiple questions:

1. How to get the AM RPC port?
>>> You can get this via YarnClient#getApplicationReport(). This gives common/generic application-specific details. Note that the RM does not maintain any custom details for applications.

2. How can you get the metadata of the AM?
>>> Basically, the AM should be designed to bind an interface to the AM RPC port, and the AM RPC host and port can be obtained from the ResourceManager. Using the host:port of the AM, the application submitter connects to the AM and gets the required details from the AM only. YARN does not provide any interface to achieve this, since AMs are written by users. Essentially, users can design the AM to expose a client interface to their clients. For better understanding, see the MapReduce framework's MRAppMaster.

3. About the authenticity of the job-submitter to the AM
>>> Use a secured Hadoop cluster with Kerberos enabled. Note that the AM should also be implemented to handle Kerberos.

Thanks & Regards
Rohith Sharma K S

From: Mingyu Kim [mailto:m...@palantir.com] Sent: 10 June 2016 03:47 To: Rohith Sharma K S; user@hadoop.apache.org Cc: Matt Cheah Subject: Re: Securely discovering Application Master's metadata or sending a secret to Application Master at submission Hi Rohith, Thanks for the pointers. I checked the Hadoop documentation you linked, but it’s not clear how I can expose client interface for providing metadata. By “YARN internal communications”, I was referring to the endpoints that are exposed by AM on the RPC port as reported in ApplicationReport. I assume either RM or containers will communicate with AM through these endpoints. I believe your suggestion is to expose additional endpoints to the AM RPC port. Can you clarify how I can do that? Is there an interface/class I need to extend? How can I register the extra endpoints for providing metadata on the existing AM RPC port? 
Mingyu From: Rohith Sharma K S mailto:rohithsharm...@huawei.com>> Date: Wednesday, June 8, 2016 at 11:15 PM To: Mingyu Kim mailto:m...@palantir.com>>, "user@hadoop.apache.org<mailto:user@hadoop.apache.org>" mailto:user@hadoop.apache.org>> Cc: Matt Cheah mailto:mch...@palantir.com>> Subject: RE: Securely discovering Application Master's metadata or sending a secret to Application Master at submission Hi Do you know how I can extend the client interface of the RPC port? >>> YARN provides YARNClIent library that uses ApplicationClientProtocol. For >>> your more understanding refer >>> https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html#Writing_a_simple_Client I know AM has some endpoints exposed through the RPC port for internal YARN communications, but was not sure how I can extend it to expose a custom endpoint. >>> I am not sure what you mean here internal YARN communication? AM can >>> connect to RM only via AM-RM interface for register/unregister and >>> heartbeat and details sent to RM are limited. It is up to the AM’s to >>> expose client interface for providing metadata. Thanks & Regards Rohith Sharma K S From: Mingyu Kim [mailto:m...@palantir.com] Sent: 09 June 2016 11:21 To: Rohith Sharma K S; user@hadoop.apache.org<mailto:user@hadoop.apache.org> Cc: Matt Cheah Subject: Re: Securely discovering Application Master's metadata or sending a secret to Application Master at submission Hi Rohith, Thanks for the quick response. That sounds promising. Do you know how I can extend the client interface of the RPC port? 
I know AM has some endpoints exposed through the RPC port for internal YARN communications, but was not sure how I can extend it to expose a custom endpoint. Any pointer would be appreciated! Mingyu From: Rohith Sharma K S mailto:rohithsharm...@huawei.com>> Date: Wednesday, June 8, 2016 at 10:39 PM To: Mingyu Kim mailto:m...@palantir.com>>, "user@hadoop.apache.org<mailto:user@hadoop.apache.org>" mailto:user@hadoop.apache.org>> Cc: Matt Cheah mailto:mch...@palantir.com>> Subject: RE: Securely discovering Application Master's metadata or sending a secret to Application Master at submission Hi Apart from AM address and tracking URL, no other meta data of applicationMaster are stored in YARN. May be AM can expose client interface so th
RE: Securely discovering Application Master's metadata or sending a secret to Application Master at submission
Hi,

Do you know how I can extend the client interface of the RPC port?
>>> YARN provides the YarnClient library, which uses ApplicationClientProtocol. For more understanding, refer to https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html#Writing_a_simple_Client

I know AM has some endpoints exposed through the RPC port for internal YARN communications, but was not sure how I can extend it to expose a custom endpoint.
>>> I am not sure what you mean here by internal YARN communication. The AM can connect to the RM only via the AM-RM interface for register/unregister and heartbeats, and the details sent to the RM are limited. It is up to the AMs to expose a client interface for providing metadata.

Thanks & Regards
Rohith Sharma K S

From: Mingyu Kim [mailto:m...@palantir.com] Sent: 09 June 2016 11:21 To: Rohith Sharma K S; user@hadoop.apache.org Cc: Matt Cheah Subject: Re: Securely discovering Application Master's metadata or sending a secret to Application Master at submission Hi Rohith, Thanks for the quick response. That sounds promising. Do you know how I can extend the client interface of the RPC port? I know AM has some endpoints exposed through the RPC port for internal YARN communications, but was not sure how I can extend it to expose a custom endpoint. Any pointer would be appreciated! Mingyu From: Rohith Sharma K S mailto:rohithsharm...@huawei.com>> Date: Wednesday, June 8, 2016 at 10:39 PM To: Mingyu Kim mailto:m...@palantir.com>>, "user@hadoop.apache.org<mailto:user@hadoop.apache.org>" mailto:user@hadoop.apache.org>> Cc: Matt Cheah mailto:mch...@palantir.com>> Subject: RE: Securely discovering Application Master's metadata or sending a secret to Application Master at submission Hi Apart from AM address and tracking URL, no other meta data of applicationMaster are stored in YARN. May be AM can expose client interface so th
RPC port of AM can be get from YARN client interface such as ApplicationClientProtocol# getApplicationReport() OR ApplicationClientProtocol #getApplicationAttemptReport(). Thanks & Regards Rohith Sharma K S From: Mingyu Kim [mailto:m...@palantir.com] Sent: 09 June 2016 10:36 To: user@hadoop.apache.org<mailto:user@hadoop.apache.org> Cc: Matt Cheah Subject: Securely discovering Application Master's metadata or sending a secret to Application Master at submission Hi all, To provide a bit of background, I’m trying to deploy a REST server on Application Master and discover the randomly assigned port number securely. I can easily discover the host name of AM through YARN REST API, but the port number needs to be discovered separately. (Port number is assigned within a specified range with retries to avoid port conflicts) An easy solution would be to have Application Master make a callback with the port number, but I’d like to design it such that YARN nodes don’t talk back to the node that submitted the YARN application. So, this problem reduces to securely discovering a small metadata of Application Master. To be clear, by being secure, I’m less concerned about exposing the information to others, but more concerned about the integrity of data (e.g. the metadata actually originated from the Application Master.) I was hoping that there is a way to register some Application Master metadata to Resource Manager, but there doesn’t seem to be a way. Another option I considered was to write the information to a HDFS file, but in order to verify the integrity of the content, I need a way to securely send a private key to Application Master, which I’m not sure what the best is. To recap, does anyone know if there is a way • To register small metadata securely from Application Master to Resource Manager so that it can be discovered by the YARN application submitter? • Or, to securely send a private key to Application Master at the application submission time? Thanks a lot, Mingyu
RE: Securely discovering Application Master's metadata or sending a secret to Application Master at submission
Hi

Apart from the AM address and tracking URL, no other metadata of the ApplicationMaster is stored in YARN. The AM can expose a client interface so that AM clients can interact with the running AM to retrieve specific AM details. The RPC port of the AM can be obtained from the YARN client interface, e.g. ApplicationClientProtocol#getApplicationReport() or ApplicationClientProtocol#getApplicationAttemptReport().

Thanks & Regards
Rohith Sharma K S

From: Mingyu Kim [mailto:m...@palantir.com]
Sent: 09 June 2016 10:36
To: user@hadoop.apache.org
Cc: Matt Cheah
Subject: Securely discovering Application Master's metadata or sending a secret to Application Master at submission

Hi all,

To provide a bit of background, I'm trying to deploy a REST server on the Application Master and discover its randomly assigned port number securely. I can easily discover the host name of the AM through the YARN REST API, but the port number needs to be discovered separately. (The port number is assigned within a specified range, with retries, to avoid port conflicts.) An easy solution would be to have the Application Master make a callback with the port number, but I'd like to design it such that YARN nodes don't talk back to the node that submitted the YARN application. So, this problem reduces to securely discovering a small piece of Application Master metadata. To be clear, by being secure, I'm less concerned about exposing the information to others, and more concerned about the integrity of the data (e.g. that the metadata actually originated from the Application Master).

I was hoping that there is a way to register some Application Master metadata with the Resource Manager, but there doesn't seem to be one. Another option I considered was to write the information to an HDFS file, but in order to verify the integrity of the content, I need a way to securely send a private key to the Application Master, and I'm not sure what the best way to do that is.

To recap, does anyone know if there is a way
• To register small metadata securely from the Application Master to the Resource Manager so that it can be discovered by the YARN application submitter?
• Or, to securely send a private key to the Application Master at application submission time?

Thanks a lot,
Mingyu
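[Archive note] As a concrete illustration of the getApplicationReport() approach suggested in this thread, a minimal client sketch might look like the following. This is a sketch only: it assumes a Hadoop 2.x client with the cluster's configuration on the classpath, and the class name and argument handling are hypothetical.

```java
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.ConverterUtils;

public class AmEndpointLookup {
    public static void main(String[] args) throws Exception {
        // e.g. args[0] = "application_1439867352386_0025"
        ApplicationId appId = ConverterUtils.toApplicationId(args[0]);

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();
        try {
            // The RM only stores the AM host, RPC port, and tracking URL;
            // any richer metadata must come from a client interface the AM itself exposes.
            ApplicationReport report = yarnClient.getApplicationReport(appId);
            System.out.println("AM endpoint: " + report.getHost() + ":" + report.getRpcPort());
            System.out.println("Tracking URL: " + report.getTrackingUrl());
        } finally {
            yarnClient.stop();
        }
    }
}
```

Note this only covers discovery; it does not by itself give the integrity guarantee Mingyu asks about, since the RPC port reported is whatever the AM registered at attempt start.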
RE: Leak in RM Capacity scheduler leading to OOM
I think you might be hitting YARN-2997. That issue fixes the sending of duplicated completed containers to the RM.

Thanks & Regards
Rohith Sharma K S

-Original Message-
From: Sharad Agarwal [mailto:sha...@apache.org]
Sent: 24 March 2016 08:58
To: Sharad Agarwal
Cc: yarn-...@hadoop.apache.org; user@hadoop.apache.org
Subject: Re: Leak in RM Capacity scheduler leading to OOM

Ticket for this is here -> https://issues.apache.org/jira/browse/YARN-4852

On Wed, Mar 23, 2016 at 5:50 PM, Sharad Agarwal wrote:
> Taking a dump of the 8 GB heap shows about 18 million
> org.apache.hadoop.yarn.proto.YarnProtos$ApplicationIdProto
>
> Similar counts are there for ApplicationAttempt and ContainerId. All
> seem to be linked via
> org.apache.hadoop.yarn.proto.YarnProtos$ContainerStatusProto, the
> count of which is also about 18 million.
>
> On further debugging, looking at the CapacityScheduler code:
>
> It seems to add duplicated entries of UpdatedContainerInfo objects for
> the completed containers. In the same dump I am seeing about 0.5
> million UpdatedContainerInfo objects.
>
> This issue only surfaces if the scheduler thread is not able to drain
> the UpdatedContainerInfo objects fast enough; it happens only in a big cluster.
>
> Has anyone noticed the same? We are running Hadoop 2.6.0
>
> Sharad
RE: Concurrency control
Hi Laxman,

In Hadoop 2.8 (not released yet), CapacityScheduler provides a configuration for the ordering policy. By configuring FAIR_ORDERING_POLICY in CS, you should probably be able to achieve your goal, i.e. avoid starving applications of resources.

org.apache.hadoop.yarn.server.resourcemanager.scheduler.policy.FairOrderingPolicy
> An OrderingPolicy which orders SchedulableEntities for fairness (see FairScheduler FairSharePolicy); generally, entities with lesser usage are scheduled first. If sizeBasedWeight is set to true, then an application with high demand may be prioritized ahead of an application with less usage. This is to offset the tendency to favor small apps, which could result in starvation for large apps if many small ones enter and leave the queue continuously (optional, default false).

Community Issue Id: https://issues.apache.org/jira/browse/YARN-3463

Thanks & Regards
Rohith Sharma K S

From: Laxman Ch [mailto:laxman@gmail.com]
Sent: 29 September 2015 13:36
To: user@hadoop.apache.org
Subject: Re: Concurrency control

Bouncing this thread again. Any other thoughts please?

On 17 September 2015 at 23:21, Laxman Ch wrote:

No Naga. That won't help. I am running two applications (app1 - 100 vcores, app2 - 100 vcores) with the same user in the same queue (capacity=100 vcores). In this scenario, if app1 triggers first, occupies all the slots, and runs long, then app2 will starve longer. Let me reiterate my problem statement: I want "to control the amount of resources (vcores, memory) used by an application SIMULTANEOUSLY".

On 17 September 2015 at 22:28, Naganarasimha Garla wrote:

Hi Laxman,
For the example you have stated, maybe we can do the following things:
1. Create/modify the queue with capacity and max capacity set such that it is equivalent to 100 vcores. As there is no elasticity, a given application will not use resources beyond the configured capacity.
2. Set yarn.scheduler.capacity.<queue-path>.minimum-user-limit-percent so that each active user is assured the minimum guaranteed resources. The default value of 100 implies that no user limits are imposed.

Additionally, we can consider "yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage", which will enforce strict cpu usage for a given container if required.

+ Naga

On Thu, Sep 17, 2015 at 4:42 PM, Laxman Ch wrote:

Yes. I'm already using cgroups. Cgroups help in controlling the resources at the container level. But my requirement is more about controlling the concurrent resource usage of an application at the whole cluster level. And yes, we do configure queues properly. But that won't help. For example, I have an application with a requirement of 1000 vcores. But I want to control this application so it does not go beyond 100 vcores at any point of time in the cluster/queue. This makes the application run longer even when my cluster is free, but I will be able to meet the guaranteed SLAs of other applications. Hope this helps to understand my question. And thanks Narasimha for the quick response.

On 17 September 2015 at 16:17, Naganarasimha Garla wrote:

Hi Laxman,
Yes, if cgroups are enabled and "yarn.scheduler.capacity.resource-calculator" is configured to DominantResourceCalculator, then cpu and memory can be controlled. Please further refer to the official documentation http://hadoop.apache.org/docs/r1.2.1/capacity_scheduler.html
But maybe if you say more about the problem, then we can suggest an ideal configuration; it seems like the capacity configuration and splitting of the queues is not rightly done, or you might look at the Fair Scheduler if you want more fairness in container allocation across apps.

On Thu, Sep 17, 2015 at 4:10 PM, Laxman Ch wrote:

Hi,

In YARN, do we have any way to control the amount of resources (vcores, memory) used by an application SIMULTANEOUSLY?
- In my cluster, I noticed some large and long-running mr-app occupied all the slots of the queue, blocking other apps from getting started.
- I'm using the Capacity Scheduler (using hierarchical queues, preemption disabled)
- Using Hadoop version 2.6.0
- Did some googling around this and went through the configuration docs, but I'm not able to find anything that matches my requirement.

If needed, I can provide more details on the usecase and problem.

--
Thanks,
Laxman
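[Archive note] The FAIR ordering policy discussed in this thread is enabled per queue in capacity-scheduler.xml. A sketch for Hadoop 2.8+ follows; the queue path root.default is an assumption:

```xml
<!-- capacity-scheduler.xml (sketch; queue path "root.default" is assumed) -->
<property>
  <name>yarn.scheduler.capacity.root.default.ordering-policy</name>
  <value>fair</value>
</property>
<property>
  <!-- Optional: weigh demand so large apps are not starved by a stream of small ones -->
  <name>yarn.scheduler.capacity.root.default.ordering-policy.fair.enable-size-based-weight</name>
  <value>true</value>
</property>
```

After editing the file, the queues can be refreshed at runtime with `yarn rmadmin -refreshQueues`.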
RE: How to auto relaunch a YARN Application Master on a failure?
It is possible. You can set the number of attempts to be launched in case of AM failures with yarn.resourcemanager.am.max-attempts. The default is 2; you can increase it. This is at the global level. At the per-application level, you need to set ApplicationSubmissionContext#setMaxAppAttempts.

Thanks & Regards
Rohith Sharma K S

From: Sridhar Chellappa [mailto:schellap2...@gmail.com]
Sent: 19 August 2015 14:55
To: user@hadoop.apache.org
Subject: How to auto relaunch a YARN Application Master on a failure?

Is this possible? If yes, can someone get back to me as to how?
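[Archive note] The global knob mentioned above lives in yarn-site.xml; a sketch (the value 4 is just an example, not a recommendation):

```xml
<!-- yarn-site.xml: cluster-wide cap on AM launch attempts (default is 2) -->
<property>
  <name>yarn.resourcemanager.am.max-attempts</name>
  <value>4</value>
</property>
```

Per application, an AM writer can request a value via ApplicationSubmissionContext#setMaxAppAttempts(int); the RM uses the global setting as an upper bound.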
RE: Confusing Yarn RPC Configuration
>>> I believe it is the same issue for node manager connection
This would probably be related to the below issues:
https://issues.apache.org/jira/browse/YARN-3944
https://issues.apache.org/jira/browse/YARN-3238

Thanks & Regards
Rohith Sharma K S

From: Jeff Zhang [mailto:zjf...@gmail.com]
Sent: 18 August 2015 09:11
To: user@hadoop.apache.org
Subject: Confusing Yarn RPC Configuration

I use yarn.resourcemanager.connect.max-wait.ms to control how much time to wait for setting up the RM connection. But the weird thing I found is that this configuration is not the real max wait time. Actually, Yarn will convert it to a retry count using the configuration yarn.resourcemanager.connect.retry-interval.ms. Let's say yarn.resourcemanager.connect.max-wait.ms=10000 and yarn.resourcemanager.connect.retry-interval.ms=2000; then Yarn will create RetryUpToMaximumCountWithFixedSleep with max count = 5 (10000/2000). Because for each RM connection there is a retry policy inside of Hadoop RPC: let's say ipc.client.connect.retry.interval=1000 and ipc.client.connect.max.retries=10, so each RM connection will try 10 times and in total cost 10 seconds (1000*10). So overall the RM connection would cost 50 seconds (10 * 5), and this number is not consistent with yarn.resourcemanager.connect.max-wait.ms, which confuses users. I am not sure of the purpose of 2 rounds of retry policy (Yarn side and RPC internal side); should there be only 1 round of retry policy, with the Yarn-related configuration just overriding the RPC configuration?

BTW, I believe it is the same issue for node manager connection.

--
Best Regards
Jeff Zhang
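[Archive note] The arithmetic Jeff describes can be sketched as follows; the concrete values are assumptions taken from his example, not defaults:

```java
public class RmRetryMath {
    public static void main(String[] args) {
        // YARN layer: max-wait is converted into an outer retry count
        long maxWaitMs = 10_000;      // yarn.resourcemanager.connect.max-wait.ms (assumed)
        long retryIntervalMs = 2_000; // yarn.resourcemanager.connect.retry-interval.ms (assumed)
        long outerAttempts = maxWaitMs / retryIntervalMs; // RetryUpToMaximumCountWithFixedSleep count = 5

        // RPC layer: each outer attempt internally retries the socket connect
        long ipcRetries = 10;         // ipc.client.connect.max.retries (assumed)
        long ipcIntervalMs = 1_000;   // ipc.client.connect.retry.interval (assumed)
        long perAttemptMs = ipcRetries * ipcIntervalMs; // 10 seconds per outer attempt

        // Worst-case wall-clock wait is the product of the two layers,
        // not the configured max-wait
        long totalMs = outerAttempts * perAttemptMs;
        System.out.println(totalMs); // prints 50000, i.e. 50 seconds
    }
}
```

This is exactly the 10 s vs 50 s mismatch the thread complains about: the two retry layers multiply.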
RE: Remotely submit a job to Yarn on CDH5.4
Are you trying to submit a job from Windows to a Linux server? If yes, try submitting the job with mapreduce.app-submission.cross-platform=true.

Thanks & Regards
Rohith Sharma K S

From: Fei Hu [mailto:hufe...@gmail.com]
Sent: 18 August 2015 21:11
To: user@hadoop.apache.org
Subject: Remotely submit a job to Yarn on CDH5.4

Hi,

I want to remotely submit a job to Yarn on CDH5.4. The following is the code for the WordCount job and the error report. Does anyone know how to solve it?

Thanks in advance,
Fei

INFO: Job job_1439867352386_0025 failed with state FAILED due to: Application application_1439867352386_0025 failed 2 times due to AM Container for appattempt_1439867352386_0025_02 exited with exitCode: 1
For more detailed output, check the application tracking page: http://compute-04:8088/proxy/application_1439867352386_0025/ Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1439867352386_0025_02_01
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    System.setProperty("HADOOP_USER_NAME", "hdfs");
    conf.set("hadoop.job.ugi", "supergroup");
    conf.set("mapreduce.framework.name", "yarn");
    conf.set("fs.defaultFS", "hdfs://compute-04:8020");
    conf.set("mapreduce.map.java.opts", "-Xmx1024M");
    conf.set("mapreduce.reduce.java.opts", "-Xmx1024M");
    conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
    conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
    conf.set("yarn.resourcemanager.address", "199.25.200.134:8032");
    conf.set("yarn.resourcemanager.resource-tracker.address", "199.25.200.134:8031");
    conf.set("yarn.resourcemanager.scheduler.address", "199.25.200.134:8030");
    conf.set("yarn.resourcemanager.admin.address", "199.25.200.134:8033");
    conf.set("yarn.nodemanager.aux-services", "mapreduce_shuffle");
    conf.set("yarn.application.classpath",
        "/etc/hadoop/conf.cloudera.hdfs," +
        "/etc/hadoop/conf.cloudera.yarn," +
        "/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/*," +
        "/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/lib/*," +
        "/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/*," +
        "/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-hdfs/lib/*," +
        "/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/*," +
        "/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop-yarn/lib/*");
    GenericOptionsParser optionParser = new GenericOptionsParser(conf, args);
    String[] remainingArgs = optionParser.getRemainingArgs();
    // Note: the condition as originally posted was always false; 2 and 4 are the valid argument counts
    if (remainingArgs.length != 2 && remainingArgs.length != 4) {
        System.err.println("Usage: wordcount <in> <out> [-skip skipPatternFile]");
        System.exit(2);
    }
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount2.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    List<String> otherArgs = new ArrayList<String>();
    for (int i = 0; i < remainingArgs.length; ++i) {
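[Archive note] The cross-platform property suggested in the reply can be set either in mapred-site.xml on the submitting client or directly on the Configuration object; a sketch:

```xml
<!-- mapred-site.xml on the submitting (e.g. Windows) client -->
<property>
  <name>mapreduce.app-submission.cross-platform</name>
  <value>true</value>
</property>
```

Equivalently, in the code above: `conf.set("mapreduce.app-submission.cross-platform", "true");`. Without it, a job submitted from Windows generates Windows-style classpath and command lines that fail on Linux NodeManagers with AM exit code 1, much like the error shown.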
RE: Application Master waits a long time after Mapper/Reducers finish
Hi

From the thread dump, it seems to be waiting on an HDFS operation. Can you attach the AM logs, and do you see any client retries for connecting to HDFS?

"CommitterEvent Processor #4" prio=10 tid=0x0199a800 nid=0x18df in Object.wait() [0x7f4f12aa4000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:503)
….
at org.apache.hadoop.hdfs.DFSClient.rename(DFSClient.java:1864)
at org.apache.hadoop.hdfs.DistributedFileSystem.rename(DistributedFileSystem.java:575)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:345)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:362)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.handleJobCommit(CommitterEventHandler.java:274)
at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.run(CommitterEventHandler.java:237)

Maybe you can also check from the HDFS side whether it is healthy?

Thanks & Regards
Rohith Sharma K S

From: Ashish Kumar Singh [mailto:ashish23...@gmail.com]
Sent: 20 July 2015 14:16
To: user@hadoop.apache.org
Subject: Application Master waits a long time after Mapper/Reducers finish

Hello Users,

I am facing a problem running Mapreduce jobs on Hadoop 2.6. I am observing that the Application Master waits for a long time after all the Mappers and Reducers are completed, before the job is completed. This wait time sometimes exceeds 20-25 mins, which is very strange, as our mappers and reducers complete in less than 10 minutes for the job.
Below are some observations:
a) Job completion status stands at 95% when the wait begins
b) JOB_COMMIT is initiated just before this wait time (logs: 2015-07-14 01:54:46,636 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1436854849540_0123Job Transitioned from RUNNING to COMMITTING)
c) Job success happens after 20-25 minutes (logs: 2015-07-14 02:15:06,634 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1436854849540_0123Job Transitioned from COMMITTING to SUCCEEDED)

Appreciate any help on this. A thread dump taken while the Application Master hangs is attached.

Regards,
Ashish
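[Archive note] Since the stack trace in this thread shows the commit time going into FileOutputCommitter.mergePaths(), one mitigation worth noting (available from Hadoop 2.7, so not applicable to 2.6 itself) is the v2 commit algorithm, which moves the merge work into task commit instead of a long sequential pass at job commit:

```xml
<!-- mapred-site.xml: use the v2 FileOutputCommitter algorithm (Hadoop 2.7+) -->
<property>
  <name>mapreduce.fileoutputcommitter.algorithm.version</name>
  <value>2</value>
</property>
```

Whether this is safe depends on the job's output semantics; it changes failure-recovery behavior during commit, so it is a trade-off rather than a drop-in fix.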
RE: Lost mapreduce applications displayed in UI
Hi,

Do you remember the steps after which applications were no longer displayed in the RM web UI? I mean, after which actions in the RM web UI are applications not displayed? Is any filtering applied in the UI, like "Showing 0 to 0 of 0 entries (filtered from 4 total entries)" at the bottom of the RM applications page?

Thanks & Regards
Rohith Sharma K S

From: Zhijie Shen [mailto:zs...@hortonworks.com]
Sent: 13 May 2015 05:00
To: user@hadoop.apache.org
Subject: Re: Lost mapreduce applications displayed in UI

Maybe you have hit the completed-app limit (10000 by default). Once the limit is hit, the oldest completed app will be removed from the cache.

- Zhijie

From: hitarth trivedi [mailto:t.hita...@gmail.com]
Sent: Tuesday, May 12, 2015 3:32 PM
To: user@hadoop.apache.org
Subject: Lost mapreduce applications displayed in UI

Hi,

My cluster suddenly stopped displaying application information in the UI (http://localhost:8088/cluster/apps), although the counters like 'Apps Submitted', 'Apps Completed', 'Apps Running' etc. all seem to increment accurately and display the right information whenever I start a new mapreduce job. Any help is appreciated.

Thanks,
Hitrix
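[Archive note] The completed-app cache limit referred to above is controlled in yarn-site.xml; a sketch (the value shown is the usual default):

```xml
<!-- yarn-site.xml: how many completed applications the RM retains in memory/UI -->
<property>
  <name>yarn.resourcemanager.max-completed-applications</name>
  <value>10000</value>
</property>
```

Raising it keeps more finished applications visible in the web UI at the cost of RM memory.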
RE: YARN Exceptions
Are you running a secured Hadoop cluster (Kerberos), with the YARN container executor set to LinuxContainerExecutor?

Thanks & Regards
Rohith Sharma K S

From: Kumar Jayapal [mailto:kjayapa...@gmail.com]
Sent: 25 April 2015 20:10
To: user@hadoop.apache.org
Subject: Re: YARN Exceptions

Yes. Here is the complete log and the sqoop import command to get the data from Oracle.

[root@sqpcdh01094p001 ~]# sqoop import --connect "jdbc:oracle:thin:@lorct101094t01a.qat.np.costco.com:1521/CT1" --username "edhdtaesvc" --password "" --table "SAPSR3.AUSP" --target-dir "/data/crmdq/CT1" --table "SAPSR3.AUSP" --split-by PARTNER_GUID --as-avrodatafile --compression-codec org.apache.hadoop.io.compress.SnappyCodec --m 1
Warning: /opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p654.326/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail. Please set $ACCUMULO_HOME to the root of your Accumulo installation.
15/04/25 13:37:19 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.3.2
15/04/25 13:37:19 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
15/04/25 13:37:20 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled.
15/04/25 13:37:20 INFO manager.SqlManager: Using default fetchSize of 1000
15/04/25 13:37:20 INFO tool.CodeGenTool: Beginning code generation
15/04/25 13:37:20 INFO manager.OracleManager: Time zone has been set to GMT
15/04/25 13:37:20 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM SAPSR3.AUSP t WHERE 1=0
15/04/25 13:37:20 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
Note: /tmp/sqoop-root/compile/dbe5b6d69507ee60c249062c54813557/SAPSR3_AUSP.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
15/04/25 13:37:22 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/dbe5b6d69507ee60c249062c54813557/SAPSR3.AUSP.jar
15/04/25 13:37:22 INFO mapreduce.ImportJobBase: Beginning import of SAPSR3.AUSP
15/04/25 13:37:22 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
15/04/25 13:37:22 INFO manager.OracleManager: Time zone has been set to GMT
15/04/25 13:37:23 INFO manager.OracleManager: Time zone has been set to GMT
15/04/25 13:37:23 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM SAPSR3.AUSP t WHERE 1=0
15/04/25 13:37:23 INFO mapreduce.DataDrivenImportJob: Writing Avro schema file: /tmp/sqoop-root/compile/dbe5b6d69507ee60c249062c54813557/sqoop_import_SAPSR3_AUSP.avsc
15/04/25 13:37:23 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
15/04/25 13:37:23 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 14047 for edhdtaesvc on ha-hdfs:nameservice1
15/04/25 13:37:23 ERROR hdfs.KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
15/04/25 13:37:23 INFO security.TokenCache: Got dt for hdfs://nameservice1; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:nameservice1, Ident: (HDFS_DELEGATION_TOKEN token 14047 for edhdtaesvc)
15/04/25 13:37:23 ERROR hdfs.KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
15/04/25 13:37:23 ERROR hdfs.KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
15/04/25 13:37:25 ERROR hdfs.KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
15/04/25 13:37:25 INFO db.DBInputFormat: Using read commited transaction isolation
15/04/25 13:37:25 INFO mapreduce.JobSubmitter: number of splits:1
15/04/25 13:37:26 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1429968417065_0004
15/04/25 13:37:26 INFO mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:nameservice1, Ident: (HDFS_DELEGATION_TOKEN token 14047 for edhdtaesvc)
15/04/25 13:37:26 INFO impl.YarnClientImpl: Submitted application application_1429968417065_0004
15/04/25 13:37:26 INFO mapreduce.Job: The url to track the job: http://yrncdh01094p001.corp.costco.com:8088/proxy/application_1429968417065_0004/
15/04/25 13:37:26 INFO mapreduce.Job: Running job: job_1429968417065_0004
15/04/25 13:37:40 INFO mapreduce.Job: Job job_1429968417065_0004 running in uber mode : false
15/04/25 13:37:40 INFO mapreduce.Job: map 0% reduce 0%
15/04/25 13:37:40 INFO mapreduce.Job: Job job_1429968417065_0004 failed with state FAILED due to: Application application_1429968417065_0004 failed 2 times due to AM Container for appattempt_1429968417065_0004_02 exited with exitCode: -1000 due to: Application application_1429968417065_0004 initialization failed (exitCode=255) with output: User edhdtaesvc not found .Failing this a
RE: YARN HA Active ResourceManager failover when machine is stopped
Hi

I had seen this issue in my cluster without HA configured, when the process was halted. I assume your scenario has a similar issue when the Active RM machine is shut down abruptly. Maybe you can verify by taking a thread dump of the NM and comparing with the JIRAs below. Open JIRAs in the community regarding this problem are:
https://issues.apache.org/jira/browse/YARN-1061 (Without HA)
https://issues.apache.org/jira/browse/YARN-2578 (With HA)

Thanks & Regards
Rohith Sharma K S

From: Matt Narrell [mailto:matt.narr...@gmail.com]
Sent: 24 April 2015 23:28
To: user@hadoop.apache.org
Subject: Re: YARN HA Active ResourceManager failover when machine is stopped

Also, another observation is that when the VMs are halted, it seems like the NodeManagers do not consider this a scenario to round-robin among the configured ResourceManagers? Is there some timeout that I've missed to instruct the NodeManagers to do this round-robining in the case of the machine not responding (to distinguish it from a network blip)?

mn

On Apr 24, 2015, at 1:50 AM, Drake민영근 wrote:

Hi, Matt

The second log file looks like the node manager's log, not the standby resource manager's.

Thanks.

Drake 민영근 Ph.D
kt NexR

On Fri, Apr 24, 2015 at 11:39 AM, Matt Narrell wrote:

Active ResourceManager: http://pastebin.com/hE0ppmnb
Standby ResourceManager: http://pastebin.com/DB8VjHqA

Oppressively chatty and not much valuable info contained therein.

On Apr 23, 2015, at 4:25 PM, Vinod Kumar Vavilapalli wrote:

I have run into this offline with someone else too but couldn't root-cause it. Will you be able to share your active/standby ResourceManager logs via pastebin or something?

+Vinod

On Apr 23, 2015, at 9:41 AM, Matt Narrell wrote:

I'm using Hadoop 2.6.0 from HDP 2.2.4 installed via Ambari 2.0. I'm testing the YARN HA ResourceManager failover.
If I STOP the active ResourceManager (shut the machine off), the standby ResourceManager is elected to active, but the NodeManagers do not register themselves with the newly elected active ResourceManager. If I restart the machine (but DO NOT resume the YARN services) the NodeManagers register with the newly elected ResourceManager and my jobs resume. I assume I have some bad configuration, as this produces a SPOF, and is not HA in the sense I’m expecting. Thanks, mn
RE: how to delete logs automatically from hadoop yarn
That's an interesting use-case!!

>>>> Let's say I want to delete container logs which are older than a week or so. So is there any configuration to do that?
I don't think such a configuration currently exists in YARN. I think it should be possible to handle this from the log4j properties. Also, by enabling log aggregation, the disk-filling issue can be overcome. I think in Hadoop 2.6 or later (yet to be released), handling of long-running services on YARN is done in JIRA https://issues.apache.org/jira/browse/YARN-2443.

>>> Because of these continuous logs, we are running out of the Linux file limit and thereafter containers are not launched because of an exception while creating the log directory inside the application ID directory
I could not understand how continuous logs cause the Linux resource limit to be exceeded. How many containers are running in the cluster, and per machine? I would think each container holds one file handle for logging.

Thanks & Regards
Rohith Sharma K S

From: Smita Deshpande [mailto:smita.deshpa...@cumulus-systems.com]
Sent: 20 April 2015 10:23
To: user@hadoop.apache.org
Subject: RE: how to delete logs automatically from hadoop yarn

Hi Rohith,

Thanks for your solution. The actual problem we are looking at is: we have a lifelong-running application, so configurations by which logs are deleted right after the application finishes will not help us. Because of these continuous logs, we are running out of the Linux file limit, and thereafter containers are not launched because of an exception while creating the log directory inside the application ID directory. During the job execution itself, let's say I want to delete container logs which are older than a week or so. So is there any configuration to do that?

Thanks,
Smita

From: Rohith Sharma K S [mailto:rohithsharm...@huawei.com]
Sent: Monday, April 20, 2015 10:09 AM
To: user@hadoop.apache.org
Subject: RE: how to delete logs automatically from hadoop yarn

Hi

With the below configuration, log deletion should be triggered. You can see from the NM log that deletion has been scheduled, like below. Maybe you can check the NM logs for this line, which gives debug information:
"INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.NonAggregatingLogHandler: Scheduling Log Deletion for application: application_1428298081702_0008, with delay of 10800 seconds"

But there is another configuration which affects the deletion task: "yarn.nodemanager.delete.debug-delay-sec". The default value is zero, which means deletion is triggered immediately. Check whether this is configured.

Number of seconds after an application finishes before the nodemanager's DeletionService will delete the application's localized file directory and log directory. To diagnose Yarn application problems, set this property's value large enough (for example, to 600 = 10 minutes) to permit examination of these directories. After changing the property's value, you must restart the nodemanager in order for it to have an effect. The roots of Yarn applications' work directories are configurable with the yarn.nodemanager.local-dirs property (see below), and the roots of the Yarn applications' log directories are configurable with the yarn.nodemanager.log-dirs property (see also below).
yarn.nodemanager.delete.debug-delay-sec
0

Thanks & Regards
Rohith Sharma K S

From: Sunil Garg [mailto:sunil.g...@cumulus-systems.com]
Sent: 20 April 2015 09:52
To: user@hadoop.apache.org
Subject: how to delete logs automatically from hadoop yarn

How to delete logs from Hadoop yarn automatically? I have tried the following settings but it is not working. Is there any other way we can do this, or am I doing something wrong?

yarn.log-aggregation-enable
false
yarn.nodemanager.log.retain-seconds
3600

Thanks
Sunil Garg
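[Archive note] For reference, the knobs discussed in this thread, reconstructed as a yarn-site.xml sketch (values are the ones from the thread, not recommendations):

```xml
<!-- yarn-site.xml: non-aggregated NM log retention; only takes effect after an app finishes -->
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>false</value>
</property>
<property>
  <name>yarn.nodemanager.log.retain-seconds</name>
  <value>3600</value>
</property>
<!-- A non-zero value delays deletion for debugging; keep 0 in production -->
<property>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>0</value>
</property>
```

Note the caveat from the thread: retain-seconds is applied when an application finishes, so it does not help a lifelong-running application whose containers keep writing logs.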
RE: how to delete logs automatically from hadoop yarn
Hi

With the below configuration, log deletion should be triggered. You can see from the NM log that deletion has been scheduled, like below. Maybe you can check the NM logs for this line, which gives debug information:
"INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.NonAggregatingLogHandler: Scheduling Log Deletion for application: application_1428298081702_0008, with delay of 10800 seconds"

But there is another configuration which affects the deletion task: "yarn.nodemanager.delete.debug-delay-sec". The default value is zero, which means deletion is triggered immediately. Check whether this is configured.

Number of seconds after an application finishes before the nodemanager's DeletionService will delete the application's localized file directory and log directory. To diagnose Yarn application problems, set this property's value large enough (for example, to 600 = 10 minutes) to permit examination of these directories. After changing the property's value, you must restart the nodemanager in order for it to have an effect. The roots of Yarn applications' work directories are configurable with the yarn.nodemanager.local-dirs property (see below), and the roots of the Yarn applications' log directories are configurable with the yarn.nodemanager.log-dirs property (see also below).
yarn.nodemanager.delete.debug-delay-sec
0

Thanks & Regards
Rohith Sharma K S

From: Sunil Garg [mailto:sunil.g...@cumulus-systems.com]
Sent: 20 April 2015 09:52
To: user@hadoop.apache.org
Subject: how to delete logs automatically from hadoop yarn

How to delete logs from Hadoop yarn automatically? I have tried the following settings but it is not working. Is there any other way we can do this, or am I doing something wrong?

yarn.log-aggregation-enable
false
yarn.nodemanager.log.retain-seconds
3600

Thanks
Sunil Garg
RE: Mapreduce job got stuck
Hi,

On the master machine, the NodeManager is not running because of "Caused by: java.net.BindException: Problem binding to [kirti:8040]" from the logs. Port 8040 is in use!!! Configure an available port number.

Thanks & Regards
Rohith Sharma K S

From: Vandana kumari [mailto:kvandana1...@gmail.com]
Sent: 15 April 2015 16:29
To: user@hadoop.apache.org; Rohith Sharma K S
Subject: Re: Mapreduce job got stuck

When I made the changes as specified by Rohith, my job is running, but it runs only on the slave nodes (amit & yashbir), not on the master node (kirti), and still no nodemanager is running on the master node.

On Wed, Apr 15, 2015 at 6:39 AM, Vandana kumari wrote:

I have attached the nodemanager log of the master node and the modified yarn-site.xml file.

On Wed, Apr 15, 2015 at 6:21 AM, Rohith Sharma K S wrote:

Hi Vandana

From the configurations, it looks like none of the NodeManagers are registered with the RM because of the "yarn.resourcemanager.resource-tracker.address" configuration issue. Maybe you can confirm whether any NMs are registered with the RM. In the configuration below, there is a space after "resource-", but "resource-tracker" is a single word without any space. Check after removing the space.

yarn.resourcemanager.resource- tracker.address

Similarly, I see the same issue in "yarn.nodemanager.aux- services.mapreduce.shuffle.class", where there is a space after "aux-"!!!

Hope this helps you to resolve the issue.

Thanks & Regards
Rohith Sharma K S

From: Vandana kumari [mailto:kvandana1...@gmail.com]
Sent: 15 April 2015 15:33
To: user@hadoop.apache.org
Subject: Mapreduce job got stuck

I had set up a 3-node hadoop cluster on CentOS 6.5, but the nodemanager is not running on the master and is running on the slave nodes. Also, when I submit a job, the job gets stuck. The same job runs well on a single-node setup. I am unable to figure out the problem. Attaching all the configuration files. Any help will be highly appreciated.
-- Thanks and regards Vandana kumari
RE: Change in fair-scheduler.xml
Hi
1 - Is there a document on what should be the default settings in the XML file for, say, a 96 GB, 48-core system with say 4 queues?
You can refer to the doc below for configuring the fair scheduler: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
2 - When we change the file, does the yarn service need to be bounced for the changed values to be reflected?
YARN admin supports refreshing queues at runtime without restarting the ResourceManager. It can be done using the "$HADOOP_HOME/bin/yarn rmadmin -refreshQueues" CLI command.
Thanks & Regards Rohith Sharma K S
From: Manish Maheshwari [mailto:mylogi...@gmail.com] Sent: 15 April 2015 15:43 To: user@hadoop.apache.org Subject: Change in fair-scheduler.xml
Hi, We are trying to change properties of the fair scheduler settings. 1 - Is there a document on what should be the default settings in the XML file for, say, a 96 GB, 48-core system with say 4 queues? 2 - When we change the file, does the yarn service need to be bounced for the changed values to be reflected? Thanks Manish
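As a rough starting point for a 96 GB / 48-core node with four queues, a fair-scheduler allocation file might look like the sketch below. The queue names, weights, and resource figures are purely illustrative assumptions, not recommendations; tune them to your actual workload:

```xml
<?xml version="1.0"?>
<!-- fair-scheduler.xml: illustrative only; queue names and numbers are hypothetical -->
<allocations>
  <queue name="etl">
    <weight>2.0</weight>
    <minResources>24576 mb, 12 vcores</minResources>
  </queue>
  <queue name="adhoc">
    <weight>1.0</weight>
    <minResources>12288 mb, 6 vcores</minResources>
  </queue>
  <queue name="reporting">
    <weight>1.0</weight>
    <minResources>12288 mb, 6 vcores</minResources>
  </queue>
  <queue name="default">
    <weight>1.0</weight>
  </queue>
  <!-- cap concurrent apps per queue so one queue cannot starve the others -->
  <queueMaxAppsDefault>20</queueMaxAppsDefault>
</allocations>
```

After editing the file, the rmadmin -refreshQueues command mentioned in the reply picks up the change without a ResourceManager restart.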
RE: Mapreduce job got stuck
Hi Vandana
From the configurations, it looks like none of the NodeManagers are registered with the RM because of an issue in the "yarn.resourcemanager.resource-tracker.address" configuration. You may want to confirm whether any NMs are registered with the RM. In your file there is a space after "resource-", but "resource-tracker" should be a single token without any space. Check after removing the space: yarn.resourcemanager.resource-tracker.address
I see the same issue in "yarn.nodemanager.aux-services.mapreduce.shuffle.class", where there is a space after "aux-".
Hope this helps you resolve the issue.
Thanks & Regards Rohith Sharma K S
From: Vandana kumari [mailto:kvandana1...@gmail.com] Sent: 15 April 2015 15:33 To: user@hadoop.apache.org Subject: Mapreduce job got stuck
I had set up a 3-node Hadoop cluster on CentOS 6.5, but the NodeManager is not running on the master while it is running on the slave nodes. Also, when I submit a job, the job gets stuck. The same job runs well on a single-node setup. I am unable to figure out the problem. Attaching all the configuration files. Any help will be highly appreciated. -- Thanks and regards Vandana kumari
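For reference, the two properties called out above would look like this in yarn-site.xml once the embedded spaces are removed. The host and port value is a placeholder for this thread's master node (8031 is the default resource-tracker port), and the shuffle property names follow the ones used in this thread; newer releases name the aux-service mapreduce_shuffle:

```xml
<!-- yarn-site.xml: property names must contain no spaces; host/port is a placeholder -->
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>kirti:8031</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
```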
RE: How to stop a mapreduce job from terminal running on Hadoop Cluster?
In addition to the options below, Hadoop 2.7 (to be released in a couple of weeks) provides a user-friendly option for killing applications from the web UI: a 'Kill Application' button has been added in the application block.
Thanks & Regards Rohith Sharma K S
From: Pradeep Gollakota [mailto:pradeep...@gmail.com] Sent: 12 April 2015 23:41 To: user@hadoop.apache.org Subject: Re: How to stop a mapreduce job from terminal running on Hadoop Cluster?
Also, mapred job -kill
On Sun, Apr 12, 2015 at 11:07 AM, Shahab Yunus <shahab.yu...@gmail.com> wrote: You can kill it by using the following yarn command: yarn application -kill https://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-site/YarnCommands.html Or use the old hadoop job command: http://stackoverflow.com/questions/11458519/how-to-kill-hadoop-jobs Regards, Shahab
On Sun, Apr 12, 2015 at 2:03 PM, Answer Agrawal <yrsna.tse...@gmail.com> wrote: To run a job we use the command $ hadoop jar example.jar inputpath outputpath If the job is taking too long and we want to stop it midway, which command is used? Or is there any other way to do that? Thanks,
RE: Pin Map/Reduce tasks to specific cores
Hi George
In MRv2, YARN supports a CGroups implementation. Using CGroups, it is possible to run containers on specific cores. For detailed reference, some useful links:
http://dev.hortonworks.com.s3.amazonaws.com/HDPDocuments/HDP2/HDP-2-trunk/bk_system-admin-guide/content/ch_cgroups.html
http://blog.cloudera.com/blog/2013/12/managing-multiple-resources-in-hadoop-2-with-yarn/
http://riccomini.name/posts/hadoop/2013-06-14-yarn-with-cgroups/
P.S.: I could not find any related document in the Hadoop YARN docs; I will raise a ticket for this in the community. Hope the above information helps your use case!
Thanks & Regards Rohith Sharma K S
From: George Ioannidis [mailto:giorgio...@gmail.com] Sent: 07 April 2015 01:55 To: user@hadoop.apache.org Subject: Pin Map/Reduce tasks to specific cores
Hello. My question, which can be found on Stack Overflow<http://stackoverflow.com/questions/29283213/core-affinity-of-map-tasks-in-hadoop> as well, regards pinning map/reduce tasks to specific cores, either on hadoop v.1.2.1 or hadoop v.2. Specifically, I would like to know whether the end user can have any control over which core executes a specific map/reduce task. To pin an application on Linux there is the "taskset" command, but is anything similar provided by Hadoop? If not, is the Linux scheduler in charge of allocating tasks to specific cores? -- Below I provide two cases to better illustrate my question: Case #1: 2 GiB input size, HDFS block size of 64 MiB and 2 compute nodes available, with 32 cores each. As follows, 32 map tasks will be called; let's suppose that mapred.tasktracker.map.tasks.maximum = 16, so 16 map tasks will be allocated to each node. Can I guarantee that each map task will run on a specific core, or is it up to the Linux scheduler? -- Case #2: The same as case #1, but now the input size is 8 GiB, so there are not enough slots for all map tasks (128), so multiple tasks will share the same cores.
Can I control how much "time" each task will spend on a specific core and if it will be reassigned to the same core in the future? Any information on the above would be highly appreciated. Kind Regards, George
RE: Does Hadoop 2.6.0 have job level blacklisting?
Hi Chris
Is there still job level blacklisting as there was in earlier versions?
>> Yes, job-level blacklisting support is there. The ApplicationMaster has to identify the nodes it wants to blacklist and send those node details to the ResourceManager via an ApplicationMasterProtocol#allocate request. Thereafter, containers will not be assigned on the blacklisted nodes.
Java doc:
https://hadoop.apache.org/docs/r2.6.0/api/org/apache/hadoop/yarn/api/ApplicationMasterProtocol.html#allocate(org.apache.hadoop.yarn.api.protocolrecords.AllocateRequest)
https://hadoop.apache.org/docs/r2.6.0/api/org/apache/hadoop/yarn/api/protocolrecords/AllocateRequest.html
Thanks & Regards Rohith Sharma K S
From: Chris Mawata [mailto:chris.maw...@gmail.com] Sent: 29 March 2015 01:10 To: user@hadoop.apache.org Subject: Does Hadoop 2.6.0 have job level blacklisting?
At http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html#Monitoring_Health_of_NodeManagers is a description of how you can have a script check the health of a node and indicate to the ResourceManager that it is unhealthy. This seems to be at the cluster level. Is there still job level blacklisting as there was in earlier versions? Chris Mawata
RE: How to troubleshoot failed or stuck jobs
Hi
1. For failed jobs, you can directly check the MRAppMaster logs; there you will find the reason the job failed.
2. For a stuck job, you need to do some groundwork to identify what is going wrong. It can be either a YARN issue or a MapReduce issue.
2.1 Recently I have seen jobs get stuck many times when the headroom calculation goes wrong. Headroom is sent by the RM to the ApplicationMaster, and the AM uses it as a deciding factor (https://issues.apache.org/jira/i#browse/YARN-1680). The corresponding parent JIRA is https://issues.apache.org/jira/i#browse/YARN-1198
2.2 When the job is stuck:
YARN - try to get ClusterMemory Used, ClusterMemory Reserved, Total Memory, how many NodeManagers there are, and what headroom is sent to the AM.
MapReduce - are any NMs blacklisted? Are all the reducer tasks using the cluster memory? By default, reducers start before mapper completion. If a mapper fails because of an unstable node, the reducers can take over the cluster; in that case the reducers are expected to be preempted, so you need to identify whether the reducers are actually getting preempted. The MRAppMaster log would help to some extent in analyzing the issue.
Thanks & Regards Rohith Sharma K S
From: Krish Donald [mailto:gotomyp...@gmail.com] Sent: 02 March 2015 11:09 To: user@hadoop.apache.org Subject: Re: How to troubleshoot failed or stuck jobs
Thanks for the link Ted. However, I wanted to understand the approach that should be taken when troubleshooting failed or stuck jobs.
On Sun, Mar 1, 2015 at 8:52 PM, Ted Yu <yuzhih...@gmail.com> wrote: Here are some related discussions and JIRA: http://search-hadoop.com/m/LgpTk2gxrGx http://search-hadoop.com/m/LgpTk2YLArE https://issues.apache.org/jira/browse/MAPREDUCE-6190 Cheers
On Sun, Mar 1, 2015 at 8:41 PM, Krish Donald <gotomyp...@gmail.com> wrote: Hi, Wanted to understand, how to troubleshoot failed or stuck jobs? Thanks Krish
RE: about the jobid
Hi
YARN application id allocation is based on the start time of the ResourceManager daemon (assuming the cluster is MR2; otherwise the JobTracker start time). Say you have 3 job clients submitting jobs to YARN; then the application ids are application_<cluster-timestamp>_0001, application_<cluster-timestamp>_0002 and application_<cluster-timestamp>_0003, and the corresponding job ids are job_<cluster-timestamp>_0001, job_<cluster-timestamp>_0002 and job_<cluster-timestamp>_0003 respectively.
>>>> Is the jobid should be job_201502281500_ ? what is the problem?
No, this is the expected behaviour. In your case, 201502271057 is the start time of the ResourceManager, so all applications submitted to YARN start with application_201502271057_<counter>, and the corresponding job id is job_201502271057_<counter>. The counter is incremented for every job submission.
Thanks & Regards Rohith Sharma K S
-Original Message- From: lujinhong [mailto:lujinh...@yahoo.com] Sent: 01 March 2015 19:40 To: User Hadoop Subject: about the jobid
Hi, all. I run nutch in deploy mode at about 3pm, 02/28/2015, but the jobid is job_201502271057_0251. I found that 201502271057 is the time I start hadoop (by start-all.sh). Is the jobid should be job_201502281500_ ? what is the problem?
system date:
[jediael@master history]$ date
Sat Feb 28 15:39:00 CST 2015
log files of hadoop: /mnt/jediael/hadoop-1.2.1/logs/history
[jediael@master history]$ ls
done  job_201502271057_0245_conf.xml  job_201502271057_0248_conf.xml  job_201502271057_0251_1425107493248_jediael_%5BFeb2815%5Dfetch  job_201502271057_0243_conf.xml  job_201502271057_0246_conf.xml  job_201502271057_0249_conf.xml  job_201502271057_0251_conf.xml  job_201502271057_0244_conf.xml  job_201502271057_0247_conf.xml  job_201502271057_0250_conf.xml
stdout of fetcher job:
15/02/28 15:11:32 INFO zookeeper.ClientCnxn: EventThread shut down
15/02/28 15:11:32 INFO zookeeper.ZooKeeper: Session: 0x4bc8f7c30a031b closed
15/02/28 15:11:33 INFO mapred.JobClient: Running job: job_201502271057_0251
15/02/28 15:11:34 INFO mapred.JobClient: map 0% reduce 0%
15/02/28 15:11:51 INFO mapred.JobClient: map 100% reduce 0%
15/02/28 15:12:00 INFO mapred.JobClient: map 100% reduce 16%
15/02/28 15:12:03 INFO mapred.JobClient: map 100% reduce 53%
RE: YarnClient to get the running applications list in java
A simple way to meet your goal: add the Hadoop jars to the project classpath, i.e. if you have a Hadoop package, extract it and add all of its jars to the project classpath. Then change the Java code as below:

YarnConfiguration conf = new YarnConfiguration();
conf.set("yarn.resourcemanager.address", "rm-ip:port"); // running RM address
YarnClient yarnClient = YarnClient.createYarnClient();
yarnClient.init(conf);
yarnClient.start(); // you need to start the YarnClient service
// ... then call yarnClient.getApplications()

Thanks & Regards Rohith Sharma K S
From: Mouzzam Hussain [mailto:monibab...@gmail.com] Sent: 26 February 2015 16:23 To: user@hadoop.apache.org Subject: YarnClient to get the running applications list in java
I am working with YarnClient for the 1st time. My goal is to get and display the applications running on YARN using Java. My project setup is as follows:

public static void main(String[] args) throws IOException, YarnException {
    // Create yarnClient
    YarnConfiguration conf = new YarnConfiguration();
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    try {
        List<ApplicationReport> applications = yarnClient.getApplications();
        System.err.println("yarn client : " + applications.size());
    } catch (YarnException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

I get the following exception when I run the program:

java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:293)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NoClassDefFoundError: org/apache/hadoop/HadoopIllegalArgumentException
    at projects.HelloWorld.main(HelloWorld.java:16)
    ...
6 more Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.HadoopIllegalArgumentException at java.net.URLClassLoader$1.run(URLClassLoader.java:372) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:360) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) The POM file is as follows: http://maven.apache.org/POM/4.0.0"; xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"; xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd";> 4.0.0 BigContent ManagementServer 1.0-SNAPSHOT UTF-8 2.4.0 1.2.1 org.apache.maven.plugins maven-compiler-plugin 3.2 1.7 1.7 org.apache.maven.plugins maven-war-plugin 2.3 default-war none war-exploded prepare-package exploded custom-war package war src/main/webapp/WEB-INF/web.xml resource2 org.apache.spark spark-streaming_2.10 ${spark.version} provided com.sun.jersey jersey-core 1.9.1 org.apache.hadoop hadoop-client ${hadoop.version} javax.servlet * org.apache.hadoop hadoop-yarn-common ${hadoop.version} org.apache.hadoop hadoop-common ${hadoop.version} provided
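Since the Caused by is a NoClassDefFoundError for org.apache.hadoop.HadoopIllegalArgumentException, a class that lives in hadoop-common, the missing piece is almost certainly the Hadoop client jars on the runtime classpath. If you prefer declaring Maven dependencies over adding the extracted jars by hand, a minimal dependency set might look like the sketch below; the ${hadoop.version} property is an assumption and should match your cluster's version:

```xml
<!-- hypothetical minimal dependencies for the YarnClient example above -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>${hadoop.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-yarn-client</artifactId>
  <version>${hadoop.version}</version>
</dependency>
```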
RE: Node manager contributing to one queue's resources
Hi
If you are using the CapacityScheduler, you can try using the DominantResourceCalculator, i.e. configure the below property value in the capacity-scheduler.xml file: yarn.scheduler.capacity.resource-calculator = org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
The basic idea is: 'if user A runs CPU-heavy tasks and user B runs memory-heavy tasks, it attempts to equalize the CPU share of user A with the memory share of user B'.
See the Java doc: https://apache.googlesource.com/hadoop-common/+/60e3b885ba8344d9f448202f5f2c290b5606ff8f/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/resource/DominantResourceCalculator.java
I think this may help you!
Thanks & Regards Rohith Sharma K S
From: twinkle sachdeva [mailto:twinkle.sachd...@gmail.com] Sent: 26 February 2015 14:05 To: USers Hadoop Subject: Node manager contributing to one queue's resources
Hi, I have to run two kinds of applications: one requiring fewer cores but more memory (Application_High_Mem), and another which requires more cores but less memory (Application_High_Core). I can use specific queues to submit them to, but that can lead to one node contributing to only one such application and having some part of its resources idle. Is there a way, let's say extending the concept of queues to the node manager level, or some other way, in which I can achieve this in YARN? Thanks, Twinkle
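As a properly formatted snippet, the property from the reply above goes into capacity-scheduler.xml like this:

```xml
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
```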
RE: Time out after 600 for YARN mapreduce application
Looking at the attempt ID, this is a mapper task timing out in a MapReduce job. The configuration that can be used to increase the value is 'mapreduce.task.timeout'. The task times out when there is no heartbeat from the mapper task (YarnChild) to the MRAppMaster for 10 minutes (600 seconds). Is the MR job a custom job? If so, are you doing any operations in the Mapper's cleanup()? It is possible that the Mapper's cleanup() is taking longer than the configured timeout, which results in the task timing out.
Thanks & Regards Rohith Sharma K S
From: Alexandru Pacurar [mailto:alexandru.pacu...@propertyshark.com] Sent: 11 February 2015 15:34 To: user@hadoop.apache.org Subject: Time out after 600 for YARN mapreduce application
Hello, I keep encountering an error when running nutch on hadoop YARN: AttemptID:attempt_1423062241884_9970_m_09_0 Timed out after 600 secs
Some info on my setup: I'm running a 64-node cluster with hadoop 2.4.1. Each node has 4 cores, 1 disk and 24 GB of RAM, and the namenode/resourcemanager has the same specs, only with 8 cores. I am pretty sure one of these parameters is tied to the threshold I'm hitting: yarn.am.liveness-monitor.expiry-interval-ms yarn.nm.liveness-monitor.expiry-interval-ms yarn.resourcemanager.nm.liveness-monitor.interval-ms but I would like to understand why. The issue usually appears under heavier load, and most of the time the next attempts are successful. Also, if I restart the Hadoop cluster the error goes away for some time. Thanks, Alex
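For example, to raise the task timeout from the 600-second default to 20 minutes, one could set the following in mapred-site.xml. The 20-minute figure is only an illustration; pick a value that comfortably exceeds your slowest legitimate task:

```xml
<property>
  <name>mapreduce.task.timeout</name>
  <!-- value is in milliseconds; 1200000 ms = 20 minutes -->
  <value>1200000</value>
</property>
```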
RE: Error with winutils.sln
Download the patch from the JIRA: https://issues.apache.org/jira/i#browse/HADOOP-9922
Thanks & Regards Rohith Sharma K S
From: Venkat Ramakrishnan [mailto:venkat.archit...@gmail.com] Sent: 10 February 2015 17:06 To: user@hadoop.apache.org Subject: Re: Error with winutils.sln
Thank you Rohit. Could you please point me to the documentation/information/location related to Hadoop's 9922 patch? Thx, Venkat.
On Tue, Feb 10, 2015 at 4:51 PM, Rohith Sharma K S <rohithsharm...@huawei.com> wrote:
There are some issues compiling Hadoop on the Win32 platform; even I am facing the same issues. I think the support was explicitly removed. But it is possible to compile successfully by tweaking some of the files. Follow the instructions below:
1. Apply the patch HADOOP-9922.patch to your 2.6 version: patch -p1 < HADOOP-9922.patch
2. Replace "Release|x64" with "Release|Win32" in $HADOOP_HOME\hadoop-common-project\hadoop-common\src\main\winutils\winutils.sln
3. Replace "x64" with "Win32" in $HADOOP_HOME\hadoop-common-project\hadoop-common\src\main\winutils\winutils.vcxproj and $HADOOP_HOME\hadoop-common-project\hadoop-common\src\main\winutils\libwinutils.vcxproj
If native compilation does not happen on your machine because cmake is not installed, or for any other reason, then you will face an issue while compiling the HDFS project. So, for the sake of compiling, you can skip native compilation for HDFS.
4. To skip native compilation, add "${skipTests}" or "true" in $HADOOP_HOME\hadoop-hdfs-project\hadoop-hdfs\pom.xml. ${skipTests} Note: there are 2 occurrences; add it at both.
Then compile using "mvn clean install -DskipTests". Hope this helps you compile. Enjoy Hadoop!
Thanks & Regards Rohith Sharma K S From: Venkat Ramakrishnan [mailto:venkat.archit...@gmail.com<mailto:venkat.archit...@gmail.com>] Sent: 10 February 2015 16:22 To: user@hadoop.apache.org<mailto:user@hadoop.apache.org> Subject: Error with winutils.sln Hello, I'm getting the following error while compiling with Windows 7 (32 bit). I have set the Platform as Win32. The error complains about solution configuration being different from winutils.sln: . . . . [DEBUG] Configuring mojo org.codehaus.mojo:exec-maven-plugin:1.2:exec from plugin realm ClassRealm[plugin>org.codehaus.mojo:exec-maven-plugin:1.2, parent: sun.misc.Launcher$AppClassLoader@647e05<mailto:sun.misc.Launcher$AppClassLoader@647e05>] [DEBUG] Configuring mojo 'org.codehaus.mojo:exec-maven-plugin:1.2:exec' with basic configurator --> [DEBUG] (f) arguments = [D:\h\hadoop-2.6.0-src\hadoop-common-project\hadoop-common/src/main/winutils/winutils.sln, /nologo, /p:Configuration=Release, /p:OutDir=D:\h\hadoop-2.6.0-src\hadoop-common-project\hadoop-common\target/bin/, /p:IntermediateOutputPath=D:\h\hadoop-2.6.0-src\hadoop-common-project\hadoop-common\target/winutils/, /p:WsceConfigDir=../etc/hadoop, /p:WsceConfigFile=wsce-site.xml] [DEBUG] (f) basedir = D:\h\hadoop-2.6.0-src\hadoop-common-project\hadoop-common [DEBUG] (f) classpathScope = runtime [DEBUG] (f) executable = msbuild [DEBUG] (f) longClasspath = false [DEBUG] (f) project = MavenProject: org.apache.hadoop:hadoop-common:2.6.0 @ D:\h\hadoop-2.6.0-src\hadoop-common-project\hadoop-common\pom.xml [DEBUG] (f) session = org.apache.maven.execution.MavenSession@157dc72<mailto:org.apache.maven.execution.MavenSession@157dc72> [DEBUG] (f) skip = false [DEBUG] -- end configuration -- [DEBUG] Executing command line: msbuild D:\h\hadoop-2.6.0-src\hadoop-common-project\hadoop-common/src/main/winutils/winutils.sln /nologo /p:Configuration=Release /p:OutDir=D:\h\hadoop-2.6.0-src\hadoop-common-project\hadoop-common\target/bin/ 
/p:IntermediateOutputPath=D:\h\hadoop-2.6.0-src\hadoop-common-project\hadoop-common\target/winutils/ /p:WsceConfigDir=../etc/hadoop /p:WsceConfigFile=wsce-site.xml Build started 07-02-2015 09:55:21. Project "D:\h\hadoop-2.6.0-src\hadoop-common-project\hadoop-common\src\main\winutils\winutils.sln" on node 1 (default targets). D:\h\hadoop-2.6.0-src\hadoop-common-project\hadoop-common\src\main\winutils\winutils.sln.metaproj : error MSB4126: The specified solution configuration "Release|Win32" is invalid. Please specify a valid solution configuration using the Configuration and Platform properties (e.g. MSBuild.exe Solution.sln /p:Configuration=Debug /p:Platform="Any CPU") or leave those properties blank to use the default solution configuration. [D:\h\hadoop-2.6.0-src\hadoop-common-project\hadoop-common\src\main\winutils\winutils.sln] Done Building Project "D:\h\hadoop-2.6.0-src\hadoop-common-project\hadoop-common\src\main\winutils\winutils.sln" (default targets) -- FAILED. Build FAILED. "D:\h\hadoop-2.6.0-src\hadoop-common-project\hadoop-common\src\mai
RE: Error with winutils.sln
There are some issues compiling Hadoop on the Win32 platform; even I am facing the same issues. I think the support was explicitly removed. But it is possible to compile successfully by tweaking some of the files. Follow the instructions below:
1. Apply the patch HADOOP-9922.patch to your 2.6 version: patch -p1 < HADOOP-9922.patch
2. Replace "Release|x64" with "Release|Win32" in $HADOOP_HOME\hadoop-common-project\hadoop-common\src\main\winutils\winutils.sln
3. Replace "x64" with "Win32" in $HADOOP_HOME\hadoop-common-project\hadoop-common\src\main\winutils\winutils.vcxproj and $HADOOP_HOME\hadoop-common-project\hadoop-common\src\main\winutils\libwinutils.vcxproj
If native compilation does not happen on your machine because cmake is not installed, or for any other reason, then you will face an issue while compiling the HDFS project. So, for the sake of compiling, you can skip native compilation for HDFS.
4. To skip native compilation, add "${skipTests}" or "true" in $HADOOP_HOME\hadoop-hdfs-project\hadoop-hdfs\pom.xml. ${skipTests} Note: there are 2 occurrences; add it at both.
Then compile using "mvn clean install -DskipTests". Hope this helps you compile. Enjoy Hadoop!
Thanks & Regards Rohith Sharma K S
From: Venkat Ramakrishnan [mailto:venkat.archit...@gmail.com] Sent: 10 February 2015 16:22 To: user@hadoop.apache.org Subject: Error with winutils.sln
Hello, I'm getting the following error while compiling with Windows 7 (32 bit). I have set the Platform as Win32. The error complains about the solution configuration being different from winutils.sln: . . . .
[DEBUG] Configuring mojo org.codehaus.mojo:exec-maven-plugin:1.2:exec from plugin realm ClassRealm[plugin>org.codehaus.mojo:exec-maven-plugin:1.2, parent: sun.misc.Launcher$AppClassLoader@647e05<mailto:sun.misc.Launcher$AppClassLoader@647e05>] [DEBUG] Configuring mojo 'org.codehaus.mojo:exec-maven-plugin:1.2:exec' with basic configurator --> [DEBUG] (f) arguments = [D:\h\hadoop-2.6.0-src\hadoop-common-project\hadoop-common/src/main/winutils/winutils.sln, /nologo, /p:Configuration=Release, /p:OutDir=D:\h\hadoop-2.6.0-src\hadoop-common-project\hadoop-common\target/bin/, /p:IntermediateOutputPath=D:\h\hadoop-2.6.0-src\hadoop-common-project\hadoop-common\target/winutils/, /p:WsceConfigDir=../etc/hadoop, /p:WsceConfigFile=wsce-site.xml] [DEBUG] (f) basedir = D:\h\hadoop-2.6.0-src\hadoop-common-project\hadoop-common [DEBUG] (f) classpathScope = runtime [DEBUG] (f) executable = msbuild [DEBUG] (f) longClasspath = false [DEBUG] (f) project = MavenProject: org.apache.hadoop:hadoop-common:2.6.0 @ D:\h\hadoop-2.6.0-src\hadoop-common-project\hadoop-common\pom.xml [DEBUG] (f) session = org.apache.maven.execution.MavenSession@157dc72<mailto:org.apache.maven.execution.MavenSession@157dc72> [DEBUG] (f) skip = false [DEBUG] -- end configuration -- [DEBUG] Executing command line: msbuild D:\h\hadoop-2.6.0-src\hadoop-common-project\hadoop-common/src/main/winutils/winutils.sln /nologo /p:Configuration=Release /p:OutDir=D:\h\hadoop-2.6.0-src\hadoop-common-project\hadoop-common\target/bin/ /p:IntermediateOutputPath=D:\h\hadoop-2.6.0-src\hadoop-common-project\hadoop-common\target/winutils/ /p:WsceConfigDir=../etc/hadoop /p:WsceConfigFile=wsce-site.xml Build started 07-02-2015 09:55:21. Project "D:\h\hadoop-2.6.0-src\hadoop-common-project\hadoop-common\src\main\winutils\winutils.sln" on node 1 (default targets). D:\h\hadoop-2.6.0-src\hadoop-common-project\hadoop-common\src\main\winutils\winutils.sln.metaproj : error MSB4126: The specified solution configuration "Release|Win32" is invalid. 
Please specify a valid solution configuration using the Configuration and Platform properties (e.g. MSBuild.exe Solution.sln /p:Configuration=Debug /p:Platform="Any CPU") or leave those properties blank to use the default solution configuration. [D:\h\hadoop-2.6.0-src\hadoop-common-project\hadoop-common\src\main\winutils\winutils.sln] Done Building Project "D:\h\hadoop-2.6.0-src\hadoop-common-project\hadoop-common\src\main\winutils\winutils.sln" (default targets) -- FAILED. Build FAILED. "D:\h\hadoop-2.6.0-src\hadoop-common-project\hadoop-common\src\main\winutils\winutils.sln" (default target) (1) -> (ValidateSolutionConfiguration target) -> D:\h\hadoop-2.6.0-src\hadoop-common-project\hadoop-common\src\main\winutils\winutils.sln.metaproj : error MSB4126: The specified solution configuration "Release|Win32" is invalid. Please specify a valid solution configuration using the Configuration and Platform properties (e.g. MSBuild.exe Solution.sln /p:Configuration=Debug /p:Platform="Any CPU") or leave those properties blank to use the default solution configuration. [D:\h\hadoop-2.6.0-src\hadoop-com
RE: Can not execute failover for RM HA
Currently, automatic failover is not supported by YARN. This is an open issue in YARN; refer to https://issues.apache.org/jira/i#browse/YARN-1177
Thanks & Regards Rohith Sharma K S
From: 郝东 [mailto:donhof...@163.com] Sent: 10 February 2015 16:12 To: user@hadoop.apache.org Subject: Can not execute failover for RM HA
I just set up ResourceManager HA. Both of the resourcemanagers started correctly. When I killed the active one, the other became active. But when I used the following command to do a manual failover, I got exceptions. I don't know what caused this problem. Could anyone help me? Many thanks!
Command: yarn rmadmin -failover rm1 rm2
Exceptions:
Exception in thread "main" java.lang.UnsupportedOperationException: RMHAServiceTarget doesn't have a corresponding ZKFC address
    at org.apache.hadoop.yarn.client.RMHAServiceTarget.getZKFCAddress(RMHAServiceTarget.java:51)
    at org.apache.hadoop.ha.HAServiceTarget.getZKFCProxy(HAServiceTarget.java:94)
    at org.apache.hadoop.ha.HAAdmin.gracefulFailoverThroughZKFCs(HAAdmin.java:315)
    at org.apache.hadoop.ha.HAAdmin.failover(HAAdmin.java:286)
    at org.apache.hadoop.ha.HAAdmin.runCmd(HAAdmin.java:453)
    at org.apache.hadoop.ha.HAAdmin.run(HAAdmin.java:382)
    at org.apache.hadoop.yarn.client.cli.RMAdminCLI.run(RMAdminCLI.java:318)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.hadoop.yarn.client.cli.RMAdminCLI.main(RMAdminCLI.java:434)
RE: QueueMetrics.AppsKilled/Failed metrics and failure reasons
There are several ways to confirm from YARN the total number of killed/failed applications in the cluster:
1. Get the lists from the RM web UI, OR
2. As admin, try using this to get the numbers of failed and killed applications: ./yarn application -list -appStates FAILED,KILLED
3. Use the client APIs.
Since the metric values displayed in Ganglia are incorrect, I have a few doubts:
1. Is Ganglia pointing at the correct RM cluster?
2. What method does Ganglia use to retrieve QueueMetrics?
3. If a client program you have written retrieves the apps and does the calculation, how does it calculate them?
Thanks & Regards Rohith Sharma K S
-Original Message- From: Suma Shivaprasad [mailto:sumasai.shivapra...@gmail.com] Sent: 04 February 2015 11:03 To: user@hadoop.apache.org Cc: yarn-...@hadoop.apache.org Subject: Re: QueueMetrics.AppsKilled/Failed metrics and failure reasons
Using hadoop 2.4.0. The number of applications running on average is small, ~40-60. The metrics in Ganglia show around 10-30 apps killed every 5 minutes, which is very high relative to the apps running at any given time (40-60). The RM logs, though, show 0 failed apps in the audit logs during that hour. The RM UI also doesn't show any apps in the Applications->Failed tab. The logs are getting rolled over at a slower rate, every 1-2 hours. I am searching for "Application Finished - Failed" to find the failed apps. Please let me know if I am missing something here. Thanks Suma
On Wed, Feb 4, 2015 at 10:03 AM, Rohith Sharma K S < rohithsharm...@huawei.com> wrote: > Hi > > > > Could you give more information, which version of hadoop are you using? > > > > >> QueueMetrics.AppsKilled/Failed metrics shows much higher nos i.e ~100. > However RMAuditLogger shows 1 or 2 Apps as Killed/Failed in the logs. > > May be I suspect that Logs might be rolled out. Does more applications > are running? > > > > All the applications history will be displayed on RM web UI (provided > RM is not restarted or RM recovery enabled). May be you can check > these applications lists.
> > > > For finding reasons for application killed/failed, one way is you can > check in NodeManager logs also. Here you need to check using > container_id for corresponding application. > > > > Thanks & Regards > > Rohith Sharma K S > > > > *From:* Suma Shivaprasad [mailto:sumasai.shivapra...@gmail.com] > *Sent:* 03 February 2015 21:35 > *To:* user@hadoop.apache.org; yarn-...@hadoop.apache.org > *Subject:* QueueMetrics.AppsKilled/Failed metrics and failure reasons > > > > Hello, > > > Was trying to debug reasons for Killed/Failed apps and was checking > for the applications that were killed/failed in RM logs - from RMAuditLogger. > > QueueMetrics.AppsKilled/Failed metrics shows much higher nos i.e ~100. > However RMAuditLogger shows 1 or 2 Apps as Killed/Failed in the logs. > Is it possible that some logs are missed by AuditLogger or is it the > other way round and metrics are being reported higher ? > > Thanks > > Suma >
RE: QueueMetrics.AppsKilled/Failed metrics and failure reasons
Hi
Could you give more information: which version of hadoop are you using?
>> QueueMetrics.AppsKilled/Failed metrics shows much higher nos i.e ~100. However RMAuditLogger shows 1 or 2 Apps as Killed/Failed in the logs.
I suspect the logs might have been rolled over. Are many applications running?
All the application history will be displayed on the RM web UI (provided the RM has not been restarted, or RM recovery is enabled). You can check these application lists.
To find the reasons an application was killed/failed, one way is to also check the NodeManager logs. There you need to search using the container_id of the corresponding application.
Thanks & Regards Rohith Sharma K S
From: Suma Shivaprasad [mailto:sumasai.shivapra...@gmail.com] Sent: 03 February 2015 21:35 To: user@hadoop.apache.org; yarn-...@hadoop.apache.org Subject: QueueMetrics.AppsKilled/Failed metrics and failure reasons
Hello, I was trying to debug the reasons for killed/failed apps and was checking for the applications that were killed/failed in the RM logs, from RMAuditLogger. The QueueMetrics.AppsKilled/Failed metrics show much higher numbers, i.e. ~100, while RMAuditLogger shows 1 or 2 apps as killed/failed in the logs. Is it possible that some logs are missed by the AuditLogger, or is it the other way round and the metrics are being reported higher? Thanks Suma
RE: hadoop yarn
Refer to the link below:

http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html

Thanks & Regards
Rohith Sharma K S

From: siva kumar [mailto:siva165...@gmail.com]
Sent: 20 January 2015 11:24
To: user@hadoop.apache.org
Subject: hadoop yarn

Hi All,

Can anyone suggest a few links for writing MR2 programs on YARN?

Thanks and regards,
siva
RE: node manager ports during mapreduce job
Hi,

Could you give more information about the problem? I did not follow this statement:

>> Upon submitting the mapreduce job to the resource manager, it is getting stuck while at getResources() for 10 min, timing out and then it is trying other node manager.

If the MRAppMaster does not communicate with the RM for 10 minutes, the RM expires that application attempt and tries to relaunch it. But you mention that it tries another node manager — which daemon is trying the other node manager?

Whenever something gets stuck like this, I suggest taking a thread dump with jstack; that helps analyze the issue faster.

Any free ports in the range 1024 <= x <= 65535 should work fine.

Thanks & Regards
Rohith Sharma K S

From: hitarth trivedi [mailto:t.hita...@gmail.com]
Sent: 12 January 2015 07:01
To: user@hadoop.apache.org
Subject: node manager ports during mapreduce job

Hi,

We have a resource manager with 4 node managers. Upon submitting a mapreduce job to the resource manager, it gets stuck at getResources() for 10 minutes, times out, and then tries another node manager. With only one nodemanager running, everything is fine. After turning off the firewall on all node managers, everything seems to work. Looking at netstat, the nodemanagers/resourcemanager were communicating over a wide range of ports between 3 and 61000, so I opened the TCP ports in the range 3:61000 and turned the firewall back on, but that does not seem to work. Any idea what needs to be done here?

Thx
-Hitarth
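As a convenience for the jstack suggestion above, a small helper can pick the target daemon's pid out of `jps` output. A sketch only; the daemon names are whatever `jps` reports on your nodes, and the output file name is made up for illustration:

```shell
# pid_of: read `jps` output on stdin and print the pid of the named
# Java process (e.g. NodeManager, MRAppMaster). Prints nothing if absent.
pid_of() {
  awk -v name="$1" '$2 == name { print $1; exit }'
}

# Typical use: dump the NodeManager's threads to a file for analysis.
# jps | pid_of NodeManager | xargs -r jstack > nm-threads.txt
```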
RE: Question about shuffle/merge/sort phrase
>> whose responsibility is it that brings each key with all its values together

You can set a combiner class in your job. For more information, refer to:
http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html

Thanks & Regards
Rohith Sharma K S

This e-mail and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!

From: Todd [mailto:bit1...@163.com]
Sent: 21 December 2014 19:29
To: user@hadoop.apache.org
Subject: Question about shuffle/merge/sort phrase

Hi Hadoopers,

I have a question about the shuffle/sort/merge phases. My understanding is that the shuffle phase transfers the mapper output (key/value pairs) from the mapper nodes to the reducer nodes, the merge phase merges the mapper output from all mapper nodes, and the sort phase sorts the key/value pairs by key. My question: whose responsibility is it to bring each key together with all of its values (the reducer's input is a key and an iterable of values)?

Thanks.
RE: How do I enable debug mode
You can use the configuration below at the client to change the log level:

MR ApplicationMaster: yarn.app.mapreduce.am.log.level=DEBUG
Mapper: mapreduce.map.log.level=DEBUG
Reducer: mapreduce.reduce.log.level=DEBUG

Thanks & Regards
Rohith Sharma K S

From: Gino Gu01 [mailto:gino_g...@infosys.com]
Sent: 04 December 2014 13:32
To: user@hadoop.apache.org
Subject: How do I enable debug mode

Hello,

I have the below code in a mapreduce program:

if(logger.isDebugEnabled()){ logger.info("Mapper value =" + value); }

How do I enable debug mode so that "Mapper value =" is printed in the logs? I tried modifying hadoop-2.5.1/etc/hadoop/log4j.properties, and it still doesn't work.

Thanks

CAUTION - Disclaimer * This e-mail contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended solely for the use of the addressee(s). If you are not the intended recipient, please notify the sender by e-mail and delete the original message. Further, you are not to copy, disclose, or distribute this e-mail or its contents to any other person and any such actions are unlawful. This e-mail may contain viruses. Infosys has taken every reasonable precaution to minimize this risk, but is not liable for any damage you may sustain as a result of any virus in this e-mail. You should carry out your own virus checks before opening the e-mail or attachment. Infosys reserves the right to monitor and review the content of all messages sent to or from this e-mail address. Messages sent to or from this e-mail address may be stored on the Infosys e-mail system.
***INFOSYS End of Disclaimer INFOSYS***
RE: Job object toString() is throwing an exception
Could you share the error message or stack trace?

From: Corey Nolet [mailto:cjno...@gmail.com]
Sent: 26 November 2014 07:54
To: user@hadoop.apache.org
Subject: Job object toString() is throwing an exception

I was playing around in the Spark shell and newing up an instance of Job that I could use to configure the input format for a job. By default, the Scala shell println's the result of every command typed. It throws an exception when it printlns the newly created instance of Job: the Job appears to set a state upon allocation, and it is not happy with the state it is in when toString() is called before the job is submitted. I'm using Hadoop 2.5.1, and I don't see any tickets for this in 2.6. Has anyone else run into this?
RE: Hadoop Installation Path problem
The problem is your JAVA_HOME setting. There is a . (dot) before /usr, which makes the path relative, so the current directory is prepended:

export JAVA_HOME=./usr/lib64/jdk1.7.0_71/jdk7u71

Do not use the . (dot) before /usr.

Thanks & Regards
Rohith Sharma K S

From: Anand Murali [mailto:anand_vi...@yahoo.com]
Sent: 24 November 2014 17:44
To: user@hadoop.apache.org; user@hadoop.apache.org
Subject: Hadoop Installation Path problem

Hi All:

I have done the following in hadoop-env.sh:

export JAVA_HOME=./usr/lib64/jdk1.7.0_71/jdk7u71
export HADOOP_HOME=/home/anand_vihar/hadoop
export PATH=:$PATH:$JAVA_HOME:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Now when I run hadoop-env.sh and type hadoop version, I get this error:

/home/anand_vihar/hadoop/bin/hadoop: line 133: /home/anand_vihar/hadoop/etc/hadoop/usr/lib64/jdk1.7.0_71/jdk7u71/bin/java: No such file or directory
/home/anand_vihar/hadoop/bin/hadoop: line 133: exec: /home/anand_vihar/hadoop/etc/hadoop/usr/lib64/jdk1.7.0_71/jdk7u71/bin/java: cannot execute: No such file or directory

Can somebody advise? I have asked many people; they all say it is the obvious path problem, but I cannot see where to debug. This has become a show stopper for me. Help most welcome.

Thanks

Regards
Anand Murali
11/7, 'Anand Vihar', Kandasamy St, Mylapore
Chennai - 600 004, India
Ph: (044)- 28474593/ 43526162 (voicemail)
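The root cause above — a relative JAVA_HOME being resolved against whatever directory the hadoop script runs from — can be caught with a quick sanity check. A sketch; the function name is made up for illustration:

```shell
# java_home_ok: succeed only if the given JAVA_HOME is an absolute path
# and actually contains bin/java. A leading "." makes the path relative,
# which is exactly the mistake in this thread.
java_home_ok() {
  case "$1" in
    /*) [ -x "$1/bin/java" ] ;;
    *)  echo "JAVA_HOME is not absolute: $1" >&2; return 1 ;;
  esac
}
```

Running `java_home_ok "$JAVA_HOME"` before starting the daemons would have flagged `./usr/lib64/jdk1.7.0_71/jdk7u71` immediately.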
RE: Resource Manager's container allocation behavior !
Hi Hamza Zafar,

First, note that ApplicationMasterProtocol#allocate() is not only for requesting containers — it also doubles as a heartbeat to let the ResourceManager know that the ApplicationMaster is alive. So your ApplicationMaster should keep sending heartbeats to the RM via allocate() calls.

Container allocation happens when NodeManagers send heartbeats to the RM. That is why allocation time dropped when you decreased heartbeat-interval-ms.

>> Why the application is not provided with all requested containers in first allocate call?

On the first call, the RM only records the request; allocation happens when the NodeManagers heartbeat to the RM. So containers are received by the AM from the second call onward.

Thanks & Regards
Rohith Sharma K S

From: Hamza Zafar [mailto:11bscshza...@seecs.edu.pk]
Sent: 22 November 2014 00:45
To: user@hadoop.apache.org
Subject: Resource Manager's container allocation behavior !

My Hadoop cluster has 52 GB memory and 56 virtual cores.

Scenario: I submit an application to the default queue while no other application is running on the cluster. I create a request for 32 containers with the same priority, 512 MB memory, and 1 virtual core. In the first allocate call I receive 0 containers from the RM; in further allocate calls I start receiving containers. I keep sending allocate calls until all the containers have been allocated. Why is the application not provided with all the requested containers in the first allocate call?

I changed the configuration property "yarn.resourcemanager.nodemanagers.heartbeat-interval-ms" from 1000 ms to 100 ms. At the 100 ms heartbeat interval the container allocation time is reduced, but the AM still has to make the same number of allocate calls as it did at 1000 ms.
RE: Change the blocksize in 2.5.1
It seems HADOOP_CONF_DIR is pointing to a different location. Check that hdfs-site.xml is on the classpath when you execute the hdfs command.

Thanks & Regards
Rohith Sharma K S

-Original Message-
From: Tomás Fernández Pena [mailto:tf.p...@gmail.com] On Behalf Of Tomás Fernández Pena
Sent: 20 November 2014 15:41
To: user@hadoop.apache.org
Subject: Change the blocksize in 2.5.1

Hello everyone,

I've just installed Hadoop 2.5.1 from source code, and I have problems changing the default block size. In my hdfs-site.xml I've set the property dfs.blocksize to 67108864 to get 64 MB blocks, but the system seems to ignore this setting. When I copy a new file, it uses a block size of 128 MB. Only if I specify the block size when the file is created (i.e. hdfs dfs -Ddfs.blocksize=$((64*1024*1024)) -put file .) does it use a block size of 64 MB. Any idea?

Best regards

Tomas
--
Tomás Fernández Pena
Centro de Investigacións en Tecnoloxías da Información, CITIUS. Univ. Santiago de Compostela
Tel: +34 881816439, Fax: +34 881814112, https://citius.usc.es/equipo/persoal-adscrito/?tf.pena
Pubkey 1024D/81F6435A, Fprint=D140 2ED1 94FE 0112 9D03 6BE7 2AFF EDED 81F6 435A
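The arithmetic in the workaround above (64*1024*1024 = 67108864) is easy to mistype; a tiny helper keeps it readable. The function name is made up for illustration:

```shell
# blocksize_bytes: convert a block size in megabytes to the byte value
# expected by dfs.blocksize (e.g. 64 -> 67108864).
blocksize_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

# e.g. hdfs dfs -Ddfs.blocksize=$(blocksize_bytes 64) -put file .
```

Note that in 2.x releases dfs.blocksize also accepts suffixed values such as 64m (see hdfs-default.xml), which avoids the arithmetic entirely.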
RE: MR job fails with too many mappers
If log aggregation is enabled, the local log folder is deleted after aggregation. So I suggest you disable "yarn.log-aggregation-enable" and run the job again; all the logs then remain in the local log folder, where you can find the container logs.

Thanks & Regards
Rohith Sharma K S

From: francexo83 [mailto:francex...@gmail.com]
Sent: 18 November 2014 22:15
To: user@hadoop.apache.org
Subject: Re: MR job fails with too many mappers

Hi,

thank you for your quick response, but I was not able to see the logs for the container. I get a "no such file or directory" when I try to access the container logs from the shell:

cd /var/log/hadoop-yarn/containers/application_1416304409718_0032

It seems that the container has never been created.

thanks

2014-11-18 16:43 GMT+01:00 Rohith Sharma K S <rohithsharm...@huawei.com>:

Hi

Could you get the syserr and sysout logs for the container? These logs are in the same location as the container's syslog: ${yarn.nodemanager.log-dirs}// This helps to find the problem!

Thanks & Regards
Rohith Sharma K S

From: francexo83 [mailto:francex...@gmail.com]
Sent: 18 November 2014 20:53
To: user@hadoop.apache.org
Subject: MR job fails with too many mappers

Hi All,

I have a small hadoop cluster with three nodes and HBase 0.98.1 installed on it. The hadoop version is 2.3.0, and below is my use case scenario.

I wrote a map reduce program that reads data from an hbase table and does some transformations on the data. The jobs are very simple, so they don't need a reduce phase. I also wrote a TableInputFormat extension in order to maximize the number of concurrent maps on the cluster; in other words, each row should be processed by a single map task. Everything goes well until the number of rows, and consequently mappers, exceeds 30. This is the only exception I see when the job fails:

Application application_1416304409718_0032 failed 2 times due to AM Container for appattempt_1416304409718_0032_02 exited with exitCode: 1 due to: Exception from container-launch:
org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:511)
at org.apache.hadoop.util.Shell.run(Shell.java:424)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:656)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1

Cluster configuration details:
Node1: 12 GB, 4 core
Node2: 6 GB, 4 core
Node3: 6 GB, 4 core
yarn.scheduler.minimum-allocation-mb=2048
yarn.scheduler.maximum-allocation-mb=4096
yarn.nodemanager.resource.memory-mb=6144

Regards
RE: Starting YARN in HA mode Hadoop 2.5.1
You need to start the second ResourceManager manually on the standby host (sbin/yarn-daemon.sh start resourcemanager); YARN's start-yarn.sh does not start all the RMs in the cluster.

Thanks & Regards
Rohith Sharma K S

From: Jogeshwar Karthik Akundi [mailto:ajkart...@gmail.com]
Sent: 18 November 2014 19:15
To: user@hadoop.apache.org
Subject: Starting YARN in HA mode Hadoop 2.5.1

Hi,

I am using Hadoop 2.5.1 and trying to enable HA mode for NN and RM. The start-dfs.sh script starts both namenodes and all the datanodes. However, start-yarn.sh starts only one RM and all the NodeManagers. Until now, these scripts let me start the entire cluster from a single machine (the primary node), but the secondary RM is not starting up. I tried to google around but couldn't find any information, and reading through the yarn-daemon*.sh scripts gives no hint on how to start both RMs in one shot. Any pointers? Am I missing something?

--
There is no charge for awesomeness
RE: MR job fails with too many mappers
Hi

Could you get the syserr and sysout logs for the container? These logs are in the same location as the container's syslog: ${yarn.nodemanager.log-dirs}// This helps to find the problem!

Thanks & Regards
Rohith Sharma K S

From: francexo83 [mailto:francex...@gmail.com]
Sent: 18 November 2014 20:53
To: user@hadoop.apache.org
Subject: MR job fails with too many mappers

Hi All,

I have a small hadoop cluster with three nodes and HBase 0.98.1 installed on it. The hadoop version is 2.3.0, and below is my use case scenario.

I wrote a map reduce program that reads data from an hbase table and does some transformations on the data. The jobs are very simple, so they don't need a reduce phase. I also wrote a TableInputFormat extension in order to maximize the number of concurrent maps on the cluster; in other words, each row should be processed by a single map task. Everything goes well until the number of rows, and consequently mappers, exceeds 30. This is the only exception I see when the job fails:

Application application_1416304409718_0032 failed 2 times due to AM Container for appattempt_1416304409718_0032_02 exited with exitCode: 1 due to: Exception from container-launch:
org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:511)
at org.apache.hadoop.util.Shell.run(Shell.java:424)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:656)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1

Cluster configuration details:
Node1: 12 GB, 4 core
Node2: 6 GB, 4 core
Node3: 6 GB, 4 core
yarn.scheduler.minimum-allocation-mb=2048
yarn.scheduler.maximum-allocation-mb=4096
yarn.nodemanager.resource.memory-mb=6144

Regards
RE: How to set job-priority on a hadoop job
Hi Sunil,

In MRv2 there is no job priority. There are open JIRAs for application priority that are still in progress:

https://issues.apache.org/jira/browse/YARN-1963
https://issues.apache.org/jira/browse/MAPREDUCE-5870

You will need to wait until this feature comes out.

Thanks & Regards
Rohith Sharma K S

From: Sunil S Nandihalli [mailto:sunil.nandiha...@gmail.com]
Sent: 03 November 2014 10:02
To: user@hadoop.apache.org
Subject: How to set job-priority on a hadoop job

Hi Everybody,

I see that we can set job priority on a hadoop job, and I have been trying to do it using the following command:

hadoop job -set-priority job-id VERY_LOW

It does not seem to be working. I then noticed that http://archive.cloudera.com/cdh/3/hadoop/capacity_scheduler.html says job priority on a queue is disabled by default. I would like to enable it, but no amount of googling has given me an actionable way to enable priorities on job queues. Can somebody help?

Thanks,
Sunil
RE: YarnChild didn't be killed after running mapreduce
This is strange! Can you get the ps -aef | grep output for this process? What is the application status in the RM UI?

Thanks & Regards
Rohith Sharma K S

From: dwld0...@gmail.com [mailto:dwld0...@gmail.com]
Sent: 31 October 2014 13:05
To: user@hadoop.apache.org
Subject: YarnChild didn't be killed after running mapreduce

Hi all,

I ran the mapreduce example successfully, but an invalid process always appears on the nodemanager nodes afterwards:

27398 DataNode
27961 Jps
13669 QuorumPeerMain
27822 -- process information unavailable
18349 ThriftServer
27557 NodeManager

I deleted this invalid process entry under /tmp/hsperfdata_yarn, but it is back after running mapreduce (yarn) again. I have modified many parameters in yarn-site.xml and mapred-site.xml:

yarn-site.xml:
yarn.nodemanager.resource.memory-mb = 4096
yarn.nodemanager.resource.cpu-vcores = 2
yarn.scheduler.minimum-allocation-mb = 256
yarn.scheduler.maximum-allocation-mb = 2048
yarn.scheduler.minimum-allocation-vcores = 1
yarn.scheduler.maximum-allocation-vcores = 2

mapred-site.xml:
mapreduce.map.memory.mb = 512
mapreduce.map.cpu.vcores = 2
mapreduce.reduce.memory.mb = 512
mapreduce.reduce.cpu.vcores = 2

None of it worked; the process has been there for a long time.
There were no error logs, only some suspicious lines, as follows:

2014-10-31 14:35:59,306 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1414736576842_0001_01_08
2014-10-31 14:35:59,350 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 27818 for container-id container_1414736576842_0001_01_08: 107.9 MB of 1 GB physical memory used; 1.5 GB of 2.1 GB virtual memory used
2014-10-31 14:36:01,068 INFO org.apache.hadoop.mapred.ShuffleHandler: Setting connection close header...
2014-10-31 14:36:01,702 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Stopping container with container Id: container_1414736576842_0001_01_08
2014-10-31 14:36:01,702 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root IP=192.168.200.128 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1414736576842_0001 CONTAINERID=container_1414736576842_0001_01_08
2014-10-31 14:36:01,703 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1414736576842_0001_01_08 transitioned from RUNNING to KILLING
2014-10-31 14:36:01,703 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1414736576842_0001_01_08
2014-10-31 14:36:01,724 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1414736576842_0001_01_08 is : 143
2014-10-31 14:36:01,791 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1414736576842_0001_01_08 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2014-10-31 14:36:01,791 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /hadoop/yarn/local/usercache/root/appcache/application_1414736576842_0001/container_1414736576842_0001_01_08
2014-10-31 14:36:01,792 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1414736576842_0001 CONTAINERID=container_1414736576842_0001_01_08
2014-10-31 14:36:01,792 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1414736576842_0001_01_08 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
2014-10-31 14:36:01,792 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1414736576842_0001_01_08 from application application_1414736576842_0001
2014-10-31 14:36:01,792 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Considering container container_1414736576842_0001_01_08 for log-aggregation
2014-10-31 14:36:01,793 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1414736576842_0001

dwld0...@gmail.com
RE: mapred job pending at "Starting scan to move intermediate done files"
Hi,

This is a problem with the memory configuration of your cluster. You have configured "yarn.nodemanager.resource.memory-mb" as 64MB, which is far too low.

1. The ApplicationMaster requires 2GB to launch its container, but the cluster memory itself is only 64MB, so the container never gets assigned.
2. Further, the map memory is 64MB while map.opts is 1024MB in mapred-site.xml. Again this is contradictory.

Change the NodeManager memory to 8GB and the map/reduce memory to 2GB, then try running the job.

Thanks & Regards
Rohith Sharma K S

From: mail list [mailto:louis.hust...@gmail.com]
Sent: 23 October 2014 07:55
To: user@hadoop.apache.org
Subject: mapred job pending at "Starting scan to move intermediate done files"

hi, all,

I am new to hadoop, and I installed hadoop-2.5.1 on Ubuntu in pseudo-distributed mode. When I run a mapred job, it outputs the following logs and then halts:

louis@ubuntu:~/src/hadoop-book$ hadoop jar hadoop-examples.jar v3.MaxTemperatureDriver input/ncdc/all max-temp
14/10/22 19:09:56 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/10/22 19:09:57 INFO input.FileInputFormat: Total input paths to process : 2
14/10/22 19:09:58 INFO mapreduce.JobSubmitter: number of splits:2
14/10/22 19:09:58 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1414030015373_0001
14/10/22 19:09:58 INFO impl.YarnClientImpl: Submitted application application_1414030015373_0001
14/10/22 19:09:58 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1414030015373_0001/
14/10/22 19:09:58 INFO mapreduce.Job: Running job: job_1414030015373_0001

Then I checked the jps output:

louis@ubuntu:~/src/hadoop-2.5.1$ jps
22433 SecondaryNameNode
22716 NodeManager
22240 DataNode
22577 ResourceManager
23083 JobHistoryServer
23148 Jps
22080 NameNode

Nothing seems wrong there, so I checked mapred-louis-historyserver-ubuntu.log:

2014-10-22 19:09:03,831 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory: History Cleaner started
2014-10-22 19:09:03,837 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory: History Cleaner complete
2014-10-22 19:11:33,830 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory: Starting scan to move intermediate done files
2014-10-22 19:14:33,830 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory: Starting scan to move intermediate done files
2014-10-22 19:17:33,832 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory: Starting scan to move intermediate done files
2014-10-22 19:20:33,830 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory: Starting scan to move intermediate done files

The web UI shows the job as pending. The attachment contains some configuration files from etc/hadoop/. Any idea will be appreciated!
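The mismatch Rohith describes (a 2 GB AM container request against a 64 MB NodeManager) can be expressed as a quick sanity check. The function name and usage are illustrative, not part of Hadoop:

```shell
# fits_in_nm: verify that a requested container size (MB) can ever be
# satisfied by a NodeManager advertising the given memory (MB). A request
# larger than yarn.nodemanager.resource.memory-mb will never be allocated.
fits_in_nm() {
  local nm_mb="$1" req_mb="$2"
  if [ "$req_mb" -gt "$nm_mb" ]; then
    echo "request ${req_mb}MB exceeds NodeManager capacity ${nm_mb}MB" >&2
    return 1
  fi
}
```

With the values from this thread, `fits_in_nm 64 2048` fails, which is exactly why the job stays pending forever instead of erroring out.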
RE: Reduce fails always
Hi

How much data is the wordcount job processing? What is the disk space ("df -h") available on the node where it always fails?

>> The point I didn't understand is why it uses only one datanode disc space?

For the running reduce tasks, containers can be allocated on any node. I think the disk space on one of the machines in your cluster is very low, so whichever task runs on that particular node fails.

Thanks & Regards
Rohith Sharma K S

From: Abdul Navaz [mailto:navaz@gmail.com]
Sent: 06 October 2014 08:21
To: user@hadoop.apache.org
Subject: Reduce fails always

Hi All,

I am running the sample word count job in a 9-node cluster and I am getting the error message below:

hadoop jar chiu-wordcount2.jar WordCount /user/hduser/getty/file1.txt /user/hduser/getty/out10 -D mapred.reduce.tasks=2

14/10/05 18:08:45 INFO mapred.JobClient: map 99% reduce 26%
14/10/05 18:08:48 INFO mapred.JobClient: map 99% reduce 28%
14/10/05 18:08:51 INFO mapred.JobClient: map 100% reduce 28%
14/10/05 18:08:57 INFO mapred.JobClient: map 98% reduce 0%
14/10/05 18:08:58 INFO mapred.JobClient: Task Id : attempt_201410051754_0003_r_00_0, Status : FAILED
FSError: java.io.IOException: No space left on device
14/10/05 18:08:59 WARN mapred.JobClient: Error reading task output http://pcvm1-10.utahddc.geniracks.net:50060/tasklog?plaintext=true&attemptid=attempt_201410051754_0003_r_00_0&filter=stdout
14/10/05 18:08:59 WARN mapred.JobClient: Error reading task output http://pcvm1-10.utahddc.geniracks.net:50060/tasklog?plaintext=true&attemptid=attempt_201410051754_0003_r_00_0&filter=stderr
14/10/05 18:08:59 INFO mapred.JobClient: Task Id : attempt_201410051754_0003_m_15_0, Status : FAILED
FSError: java.io.IOException: No space left on device
14/10/05 18:09:02 INFO mapred.JobClient: map 99% reduce 0%
14/10/05 18:09:07 INFO mapred.JobClient: map 99% reduce 1%

I can see it uses all the disk space on one of the datanodes when shuffling starts. As soon as that node's disk space reaches zero, it throws this error and the job aborts. The point I don't understand is why it uses only one datanode's disk space. I changed the number of reducers to 4, and it still uses only one datanode's disk and throws the above error. How can I fix this issue?

Thanks & Regards,
Navaz
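To spot the nearly-full node before the job dies, the "df -h" check suggested above can be scripted per node. This sketch parses portable `df -P` output from stdin; the function name and 90% threshold are made up for illustration:

```shell
# flag_full_mounts: read `df -P` output on stdin and print the mount
# points whose use% is at or above the given threshold.
flag_full_mounts() {
  awk -v lim="$1" 'NR > 1 { sub(/%/, "", $5); if ($5 + 0 >= lim) print $6 }'
}

# e.g. run on each node: df -P | flag_full_mounts 90
```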
RE: Cannot fine profiling log file
Hi

Have you enabled log aggregation?

1. If log aggregation is enabled, you can get the logs from HDFS under the path below; the aggregated logs and the profiler output will be in the same file:
${yarn.nodemanager.remote-app-log-dir}/${user}/logs//

2. If it is not enabled, check inside ${yarn.nodemanager.log-dirs}///profile.out (default name).

Thanks & Regards
Rohith Sharma K S

From: Jakub Stransky [mailto:stransky...@gmail.com]
Sent: 23 September 2014 16:27
To: user@hadoop.apache.org
Subject: Cannot fine profiling log file

Hello experienced users,

I tried to enable profiling of tasks during mapreduce with:

mapreduce.task.profile = true
mapreduce.task.profile.maps = 0-5
mapreduce.task.profile.params = -agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s

The file got generated — I can see that through the Resource Manager console — but I can't find where to download it from. Where is that file, or how do I download it?

Thanks for any advice!

Jakub
RE: About extra containers being allocated in distributed shell example.
This looks to be an open issue: https://issues.apache.org/jira/i#browse/YARN-1902.

Thanks & Regards
Rohith Sharma K S

From: Smita Deshpande [mailto:smita.deshpa...@cumulus-systems.com]
Sent: 22 September 2014 10:45
To: user@hadoop.apache.org
Subject: RE: About extra containers being allocated in distributed shell example.

Any suggestion/workaround on this one?
-Smita

From: Smita Deshpande
Sent: Tuesday, September 16, 2014 3:00 PM
To: 'user@hadoop.apache.org'
Subject: About extra containers being allocated in distributed shell example.

Hi,

In the YARN distributed shell example, I set up my request for containers to the RM using the following call (I am asking for 9 containers here):

private ContainerRequest setupContainerAskForRM(Resource capability) {}

But when the RMCallbackHandler actually allocates containers in the following call, I get 23 containers:

@Override public void onContainersAllocated(List allocatedContainers) {}

The extra containers expire after 600 seconds. Will these extra launched containers, which are not doing anything, cause any performance issue in my application? At one point in my application, 12K out of 19K containers expired because they were not used. Can anybody suggest a workaround, or is this a bug?

-Smita
Why 2 different approach for deleting localized resources and aggregated logs?
Hi,

I see two different approaches for deleting localized resources and aggregated logs:

1. Localized resources are deleted based on the size of the localizer cache, per local directory.
2. Aggregated logs are deleted based on time (if enabled).

Is there a specific reason why there are two different implementations? Could aggregated logs also be deleted based on size?

Thanks & Regards
Rohith Sharma K S
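For context, these are the yarn-site.xml properties behind the two policies in Hadoop 2.x; the values shown are what I believe the defaults to be, so treat them as assumptions:

```xml
<!-- Localized resources: cleaned when the cache exceeds a size target -->
<property>
  <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
  <value>10240</value>
</property>
<property>
  <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
  <value>600000</value>
</property>
<!-- Aggregated logs: cleaned by age; -1 disables deletion -->
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>-1</value>
</property>
```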
RE: change yarn application priority
Hi,

Currently there is no provision for changing an application's priority within the same queue. Follow https://issues.apache.org/jira/browse/YARN-1963 for this new feature.

One way you can achieve this is by enabling the scheduler monitor for the CapacityScheduler (the monitor enables preemption across queues). The steps to follow are:

1. Configure 2 queues; see http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
2. Enable the scheduler monitor: yarn.resourcemanager.scheduler.monitor.enable = true

Submit the job that runs for 2 hours to queue 1, and the other job to queue 2.

Hope this helps.

Thanks & Regards
Rohith Sharma K S

From: Henry Hung [mailto:ythu...@winbond.com]
Sent: 30 May 2014 11:53
To: user@hadoop.apache.org
Subject: change yarn application priority

Hi All,

I have an application that consumes all of the NodeManager capacity (30 map and 1 reduce) and needs 4 hours to finish. Let's say I need to run another application that finishes quickly (30 minutes) and needs only 1 map and 1 reduce. If I just execute the new application, it will wait in the queue for the 1st application to finish. Is there a way to raise the 2nd application's priority above the 1st and have the ResourceManager execute it immediately? I'm using Hadoop 2.2.0.

Best regards,
Henry
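A minimal sketch of the two-queue setup from step 1, as capacity-scheduler.xml entries; the queue names (long, short) and the 80/20 split are illustrative assumptions, not values from the thread:

```xml
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>long,short</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.long.capacity</name>
  <value>80</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.short.capacity</name>
  <value>20</value>
</property>
```

The quick job can then be directed to the second queue at submission time, e.g. by setting mapreduce.job.queuename=short.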
RE: Cleanup activity on YARN containers
> Is there something like shutdown hook for containers?

There is no container-specific shutdown hook. I was referring to the Java shutdown hook, i.e. Runtime.getRuntime().addShutdownHook(Thread hook) (http://docs.oracle.com/javase/7/docs/api/java/lang/Thread.html), registered during the start of the container JVM. The cleanup can be done in the hook.

Thanks & Regards
Rohith Sharma K S

From: Krishna Kishore Bonagiri [mailto:write2kish...@gmail.com]
Sent: 09 April 2014 10:49
To: user@hadoop.apache.org
Subject: Re: Cleanup activity on YARN containers

Hi Rohith,

Is there something like a shutdown hook for containers? Can you please also tell me how to use that?

Thanks, Kishore

On Wed, Apr 9, 2014 at 8:34 AM, Rohith Sharma K S <rohithsharm...@huawei.com> wrote:

For local container cleanup, it can be done in a shutdown hook.

Thanks & Regards
Rohith Sharma K S

From: Krishna Kishore Bonagiri [mailto:write2kish...@gmail.com]
Sent: 08 April 2014 20:01
To: user@hadoop.apache.org
Subject: Re: Cleanup activity on YARN containers

Hi Rohith,

Thanks for the reply. Mine is a YARN application. I have some files local to the nodes my containers run on, and I want to clean them up at the end of each container's execution, on the same node the container ran on. With what you are suggesting, I can't delete the files local to the container. Is there any other way?

Thanks, Kishore

On Tue, Apr 8, 2014 at 8:55 AM, Rohith Sharma K S <rohithsharm...@huawei.com> wrote:

Hi Kishore,

Are the jobs submitted through MapReduce, or is it a YARN application?

1. For the MapReduce framework, the framework itself provides per-task cleanup.

> Is there any callback kind of facility, in which I can write some code to be executed on my container at the end of my application or at the end of that particular container execution?

You can override setup() and cleanup() to do initialization and cleanup for your task. This facility is provided by the MapReduce framework. The call flow of task execution is: the framework first calls setup(org.apache.hadoop.mapreduce.Mapper.Context), followed by map(Object, Object, Context) / reduce(Object, Iterable, Context) for each key/value pair; finally, cleanup(Context) is called.

Note: in cleanup, do not hold the container for more than "mapreduce.task.timeout". Once map/reduce is completed, progress is no longer sent to the ApplicationMaster (a ping is not considered a status update). If your cleanup takes longer than the value configured for "mapreduce.task.timeout", the ApplicationMaster considers the task timed out; in that case, increase "mapreduce.task.timeout" to cover your cleanup time.

2. For a YARN application, the list of completed containers is sent to the ApplicationMaster in the heartbeat. There you can do cleanup activities for the containers.

Hope this helps.

Thanks & Regards
Rohith Sharma K S

From: Krishna Kishore Bonagiri [mailto:write2kish...@gmail.com]
Sent: 07 April 2014 16:41
To: user@hadoop.apache.org
Subject: Cleanup activity on YARN containers

Hi,

Is there any callback kind of facility in which I can write some code to be executed on my container at the end of my application, or at the end of that particular container's execution? I want to do some cleanup at the end of my application, and the cleanup is not related to the localized resources downloaded from HDFS.

Thanks, Kishore
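The JVM shutdown hook Rohith suggests can be sketched in plain Java. This is an illustrative sketch, not a YARN API: the scratch directory used here is a hypothetical stand-in for whatever container-local files your application creates.

```java
import java.io.File;
import java.io.IOException;

public class ContainerCleanup {

    // Register a hook that runs when the container JVM exits and
    // removes the container-local scratch files.
    public static Thread registerCleanupHook(final File scratchDir) {
        Thread hook = new Thread(new Runnable() {
            public void run() {
                deleteRecursively(scratchDir);
            }
        });
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    // Delete a file, or a directory and everything under it.
    static void deleteRecursively(File f) {
        File[] children = f.listFiles();
        if (children != null) {
            for (File c : children) {
                deleteRecursively(c);
            }
        }
        f.delete();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical container-local scratch directory.
        File scratch = new File(System.getProperty("java.io.tmpdir"), "container-scratch");
        scratch.mkdirs();
        new File(scratch, "work.tmp").createNewFile();

        registerCleanupHook(scratch);
        // ... container work ...
        // The hook fires on normal JVM exit and removes the directory.
    }
}
```

Keep the hook short: as noted above for MapReduce tasks, cleanup that outlives "mapreduce.task.timeout" will get the task marked as timed out.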
RE: Cleanup activity on YARN containers
For local container cleanup, it can be done in a shutdown hook.

Thanks & Regards
Rohith Sharma K S

From: Krishna Kishore Bonagiri [mailto:write2kish...@gmail.com]
Sent: 08 April 2014 20:01
To: user@hadoop.apache.org
Subject: Re: Cleanup activity on YARN containers

Hi Rohith,

Thanks for the reply. Mine is a YARN application. I have some files local to the nodes my containers run on, and I want to clean them up at the end of each container's execution, on the same node the container ran on. With what you are suggesting, I can't delete the files local to the container. Is there any other way?

Thanks, Kishore

On Tue, Apr 8, 2014 at 8:55 AM, Rohith Sharma K S <rohithsharm...@huawei.com> wrote:

Hi Kishore,

Are the jobs submitted through MapReduce, or is it a YARN application?

1. For the MapReduce framework, the framework itself provides per-task cleanup.

> Is there any callback kind of facility, in which I can write some code to be executed on my container at the end of my application or at the end of that particular container execution?

You can override setup() and cleanup() to do initialization and cleanup for your task. This facility is provided by the MapReduce framework. The call flow of task execution is: the framework first calls setup(org.apache.hadoop.mapreduce.Mapper.Context), followed by map(Object, Object, Context) / reduce(Object, Iterable, Context) for each key/value pair; finally, cleanup(Context) is called.

Note: in cleanup, do not hold the container for more than "mapreduce.task.timeout". Once map/reduce is completed, progress is no longer sent to the ApplicationMaster (a ping is not considered a status update). If your cleanup takes longer than the value configured for "mapreduce.task.timeout", the ApplicationMaster considers the task timed out; in that case, increase "mapreduce.task.timeout" to cover your cleanup time.

2. For a YARN application, the list of completed containers is sent to the ApplicationMaster in the heartbeat. There you can do cleanup activities for the containers.

Hope this helps.

Thanks & Regards
Rohith Sharma K S

From: Krishna Kishore Bonagiri [mailto:write2kish...@gmail.com]
Sent: 07 April 2014 16:41
To: user@hadoop.apache.org
Subject: Cleanup activity on YARN containers

Hi,

Is there any callback kind of facility in which I can write some code to be executed on my container at the end of my application, or at the end of that particular container's execution? I want to do some cleanup at the end of my application, and the cleanup is not related to the localized resources downloaded from HDFS.

Thanks, Kishore
RE: Cleanup activity on YARN containers
Hi Kishore,

Are the jobs submitted through MapReduce, or is it a YARN application?

1. For the MapReduce framework, the framework itself provides per-task cleanup.

> Is there any callback kind of facility, in which I can write some code to be executed on my container at the end of my application or at the end of that particular container execution?

You can override setup() and cleanup() to do initialization and cleanup for your task. This facility is provided by the MapReduce framework. The call flow of task execution is: the framework first calls setup(org.apache.hadoop.mapreduce.Mapper.Context), followed by map(Object, Object, Context) / reduce(Object, Iterable, Context) for each key/value pair; finally, cleanup(Context) is called.

Note: in cleanup, do not hold the container for more than "mapreduce.task.timeout". Once map/reduce is completed, progress is no longer sent to the ApplicationMaster (a ping is not considered a status update). If your cleanup takes longer than the value configured for "mapreduce.task.timeout", the ApplicationMaster considers the task timed out; in that case, increase "mapreduce.task.timeout" to cover your cleanup time.

2. For a YARN application, the list of completed containers is sent to the ApplicationMaster in the heartbeat. There you can do cleanup activities for the containers.

Hope this helps.

Thanks & Regards
Rohith Sharma K S

From: Krishna Kishore Bonagiri [mailto:write2kish...@gmail.com]
Sent: 07 April 2014 16:41
To: user@hadoop.apache.org
Subject: Cleanup activity on YARN containers

Hi,

Is there any callback kind of facility in which I can write some code to be executed on my container at the end of my application, or at the end of that particular container's execution? I want to do some cleanup at the end of my application, and the cleanup is not related to the localized resources downloaded from HDFS.

Thanks, Kishore
RE: Job fails if I change HADOOP_USER_NAME
Hi Ashwin,

> How do I enable debug for AM container logs?

Set the configurations below to change the log level for the AM, map, and reduce tasks. The default values are INFO.

  yarn.app.mapreduce.am.log.level
  mapreduce.map.log.level
  mapreduce.reduce.log.level

> And to which location are they written?

While the job is executing, they are written under {yarn.nodemanager.log-dirs}. Once the application is finished:

1. If log aggregation is enabled, all container logs are aggregated to HDFS. The log path in HDFS is {yarn.nodemanager.remote-app-log-dir}/${user}.
2. If log aggregation is disabled, all container logs remain on the local machines where the containers ran, i.e. under {yarn.nodemanager.log-dirs}.

Thanks & Regards
Rohith Sharma K S

From: Ashwin Shankar [mailto:ashwinshanka...@gmail.com]
Sent: 22 March 2014 03:38
To: user@hadoop.apache.org
Subject: Re: Job fails if I change HADOOP_USER_NAME

Hi Rohith,

How do I enable debug for AM container logs, and to which location are they written? I tried changing log4j.properties and can see DEBUG for the RM, NM, etc., but I don't see AM-related debug logs.

Thanks, Ashwin

On Fri, Mar 21, 2014 at 3:05 AM, Rohith Sharma K S <rohithsharm...@huawei.com> wrote:

Hi,

The stack trace below is generic for any AM that fails to launch. Can you debug with the AM container logs to get the proper stack trace?

Thanks & Regards
Rohith Sharma K S

From: Ashwin Shankar [mailto:ashwinshanka...@gmail.com]
Sent: 21 March 2014 14:02
To: user@hadoop.apache.org
Subject: Job fails if I change HADOOP_USER_NAME

Hi,

I'm writing a new feature in the Fair Scheduler and wanted to test it by running jobs submitted by different users from my laptop. My sleep job runs fine as long as the user name is my Mac user name. If I change my Hadoop user name by setting HADOOP_USER_NAME, my jobs fail with the exception org.apache.hadoop.util.Shell$ExitCodeException. I also tried creating a new user account on my laptop and running a job as that user, but I get the same exception. Please let me know if any of you have come across this. I tried raising the ulimit max processes (to 1024), but that doesn't solve the problem. Here is the stack trace:

Job job_1395389889916_0001 failed with state FAILED due to: Application application_1395389889916_0001 failed 3 times due to AM Container for appattempt_1395389889916_0001_03 exited with exitCode: 1 due to: Exception from container-launch:
org.apache.hadoop.util.Shell$ExitCodeException:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)
        at org.apache.hadoop.util.Shell.run(Shell.java:418)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)

-- Thanks, Ashwin
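The three log-level properties named above can be set per job or cluster-wide; a mapred-site.xml sketch, with DEBUG chosen here as an illustrative value:

```xml
<property>
  <name>yarn.app.mapreduce.am.log.level</name>
  <value>DEBUG</value>
</property>
<property>
  <name>mapreduce.map.log.level</name>
  <value>DEBUG</value>
</property>
<property>
  <name>mapreduce.reduce.log.level</name>
  <value>DEBUG</value>
</property>
```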
RE: Job fails if I change HADOOP_USER_NAME
Hi,

The stack trace below is generic for any AM that fails to launch. Can you debug with the AM container logs to get the proper stack trace?

Thanks & Regards
Rohith Sharma K S

From: Ashwin Shankar [mailto:ashwinshanka...@gmail.com]
Sent: 21 March 2014 14:02
To: user@hadoop.apache.org
Subject: Job fails if I change HADOOP_USER_NAME

Hi,

I'm writing a new feature in the Fair Scheduler and wanted to test it by running jobs submitted by different users from my laptop. My sleep job runs fine as long as the user name is my Mac user name. If I change my Hadoop user name by setting HADOOP_USER_NAME, my jobs fail with the exception org.apache.hadoop.util.Shell$ExitCodeException. I also tried creating a new user account on my laptop and running a job as that user, but I get the same exception. Please let me know if any of you have come across this. I tried raising the ulimit max processes (to 1024), but that doesn't solve the problem. Here is the stack trace:

Job job_1395389889916_0001 failed with state FAILED due to: Application application_1395389889916_0001 failed 3 times due to AM Container for appattempt_1395389889916_0001_03 exited with exitCode: 1 due to: Exception from container-launch:
org.apache.hadoop.util.Shell$ExitCodeException:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)
        at org.apache.hadoop.util.Shell.run(Shell.java:418)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)

-- Thanks, Ashwin
RE: NodeHealthReport local-dirs turned bad
Hi,

There is no relation to the NameNode format.

Is the NodeManager started with the default configuration? If not, is any NodeManager health script configured? Likely suspects:
1. /hadoop does not have the right permissions, or
2. the disk is full.

Thanks & Regards
Rohith Sharma K S

-----Original Message-----
From: Margusja [mailto:mar...@roo.ee]
Sent: 19 March 2014 17:04
To: user@hadoop.apache.org
Subject: NodeHealthReport local-dirs turned bad

Hi,

I have one node in unhealthy status:

  Total Vmem allocated for Containers: 4.20 GB
  Vmem enforcement enabled: false
  Total Pmem allocated for Containers: 2 GB
  Pmem enforcement enabled: false
  NodeHealthyStatus: false
  LastNodeHealthTime: Wed Mar 19 13:31:24 EET 2014
  NodeHealthReport: 1/1 local-dirs turned bad: /hadoop/yarn/local; 1/1 log-dirs turned bad: /hadoop/yarn/log

Node Manager Version: 2.2.0.2.0.6.0-101 from b07b2906c36defd389c8b5bd22bebc1bead8115b by jenkins source checksum 82bd166aa0ada92b44f8a154836b92 on 2014-01-09T05:24Z
Hadoop Version: 2.2.0.2.0.6.0-101 from b07b2906c36defd389c8b5bd22bebc1bead8115b by jenkins source checksum 704f1e463ebc4fb89353011407e965 on 2014-01-09T05:18Z

I tried deleting /hadoop/* and running namenode -format again, then restarted the NodeManager, but it is still unhealthy. Is there any guideline for what I should do?

--
Best regards, Margus (Margusja) Roo
+372 51 48 780
http://margus.roo.ee
http://ee.linkedin.com/in/margusroo
skype: margusja
ldapsearch -x -h ldap.sk.ee -b c=EE "(serialNumber=37303140314)"
RE: How to configure nodemanager.health-checker.script.path
Hi,

The health script itself should execute successfully (exit 0). If you want the health check to report failure, print a line containing "ERROR" to the console instead. The NodeManager treats a non-zero exit as a script failure rather than an unhealthy node, because a script may fail for reasons unrelated to node health: a syntax error, command not found (IOException), or several other reasons. So for the health script to work, do not add "exit -1":

  #!/bin/bash
  echo "ERROR disk full"

Thanks & Regards
Rohith Sharma K S

From: Anfernee Xu [mailto:anfernee...@gmail.com]
Sent: 19 March 2014 10:32
To: user
Subject: How to configure nodemanager.health-checker.script.path

Hello,

I'm running MR on the 2.2.0 release. I noticed we can configure "nodemanager.health-checker.script.path" in yarn-site.xml to customize NM health checking, so I added the properties below to yarn-site.xml:

  yarn.nodemanager.health-checker.script.path = /scratch/software/hadoop2/hadoop-dc/node_health.sh
  yarn.nodemanager.health-checker.interval-ms = 1

To get a feel for this, /scratch/software/hadoop2/hadoop-dc/node_health.sh simply prints an ERROR message:

  #!/bin/bash
  echo "ERROR disk full"
  exit -1

But it seems not to be working; the node is still in the healthy state. Did I miss something? Thanks for your help.

-- --Anfernee
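The two properties from the question as yarn-site.xml entries; the script path is the one from the mail, while the interval is an illustrative assumption (the value in the original mail was garbled):

```xml
<property>
  <name>yarn.nodemanager.health-checker.script.path</name>
  <value>/scratch/software/hadoop2/hadoop-dc/node_health.sh</value>
</property>
<property>
  <name>yarn.nodemanager.health-checker.interval-ms</name>
  <value>600000</value>
</property>
```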
RE: issue of "Log aggregation has not completed or is not enabled."
Just to confirm:

1. Was the NodeManager restarted after enabling log aggregation? If yes, check the NodeManager startup logs to confirm the log aggregation service started successfully.

Thanks & Regards
Rohith Sharma K S

From: ch huang [mailto:justlo...@gmail.com]
Sent: 18 March 2014 13:09
To: user@hadoop.apache.org
Subject: issue of "Log aggregation has not completed or is not enabled."

hi, maillist:

I try to look at an application log using the following process:

  # yarn application -list
  Application-Id                  Application-Name                                        User  Queue  State     Final-State  Tracking-URL
  application_1395126130647_0014  select user_id as userid, adverti...stattime(Stage-1)   hive  hive   FINISHED  SUCCEEDED    ch18:19888/jobhistory/job/job_1395126130647_0014

  # yarn logs -applicationId application_1395126130647_0014
  Logs not available at /var/log/hadoop-yarn/apps/root/logs/application_1395126130647_0014
  Log aggregation has not completed or is not enabled.

But I did enable the log aggregation function. Here is my yarn-site.xml configuration for log aggregation:

  yarn.log-aggregation-enable = true
  yarn.nodemanager.remote-app-log-dir = /var/log/hadoop-yarn/apps   (where to aggregate logs to)

The application logs are not put on HDFS successfully. Why?

  # hadoop fs -ls /var/log/hadoop-yarn/apps/root/logs/application_1395126130647_0014
  ls: `/var/log/hadoop-yarn/apps/root/logs/application_1395126130647_0014': No such file or directory
RE: ResourceManager shutting down
Hi Hitesh,

Yes, it is an issue. This is handled in https://issues.apache.org/jira/browse/YARN-713, which fixes the DNS issue. The fix is available in hadoop-2.4 (unreleased).

Thanks & Regards
Rohith Sharma K S

-----Original Message-----
From: Hitesh Shah [mailto:hit...@apache.org]
Sent: 14 March 2014 09:03
To: user@hadoop.apache.org
Subject: Re: ResourceManager shutting down

Hi John,

Would you mind filing a jira with more details? The RM going down just because a host was not resolvable or DNS timed out is something that should be addressed.

thanks
-- Hitesh

On Mar 13, 2014, at 2:29 PM, John Lilley wrote:

> Never mind... we figured out its DNS entry was going missing.
> john
>
> From: John Lilley [mailto:john.lil...@redpoint.net]
> Sent: Thursday, March 13, 2014 2:52 PM
> To: user@hadoop.apache.org
> Subject: ResourceManager shutting down
>
> We have this erratic behavior where every so often the RM will shut down with an UnknownHostException. The odd thing is, the host it complains about has been in use for days at that point without problems. Any ideas?
> Thanks,
> John
>
> 2014-03-13 14:38:14,746 INFO rmapp.RMAppImpl (RMAppImpl.java:handle(578)) - application_1394204725813_0220 State change from ACCEPTED to RUNNING
> 2014-03-13 14:38:15,794 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(449)) - Error in handling event type NODE_UPDATE to the scheduler
> java.lang.IllegalArgumentException: java.net.UnknownHostException: skitzo.office.datalever.com
>         at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:418)
>         at org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerToken(BuilderUtils.java:247)
>         at org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager.createContainerToken(RMContainerTokenSecretManager.java:195)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.createContainerToken(LeafQueue.java:1297)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainer(LeafQueue.java:1345)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignOffSwitchContainers(LeafQueue.java:1211)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainersOnNode(LeafQueue.java:1170)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:871)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:645)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:559)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:690)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:734)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:86)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.net.UnknownHostException: skitzo.office.datalever.com
>         ... 15 more
> 2014-03-13 14:38:15,794 INFO resourcemanager.ResourceManager (ResourceManager.java:run(453)) - Exiting, bbye..
> 2014-03-13 14:38:15,911 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped selectchannelconnec...@metallica.office.datalever.com:8088
> 2014-03-13 14:38:16,013 ERROR delegation.AbstractDelegationTokenSecretManager (AbstractDelegationTokenSecretManager.java:run(557)) - InterruptedExcpetion recieved for ExpiredTokenRemover thread java.lang.InterruptedException: sleep interrupted
> 2014-03-13 14:38:16,013 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(200)) - Stopping ResourceManager metrics system...
> 2014-03-13 14:38:16,014 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(206)) - ResourceManager metrics system stopped.
> 2014-03-13 14:38:16,014 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:shutdown(572)) - ResourceManager metrics system shutdown complete.
> 2014-03-13 14:38:16,015 WARN amlauncher.ApplicationMasterLauncher (ApplicationMasterLauncher.java:run(98)) - org.apache.hadoop.yarn.server.resourcemanager.amlauncher.ApplicationMasterLauncher$LauncherThread interrupted. Returning.
> 2014-03-13 14:38:16,015 INFO ipc.Server (Server.java:stop(2442)) - Stopping server on 8141
> 2014-03-13 14:38:16,017 INFO ipc.Server (Server.java:stop(2442)) - Stopping server on 8050
> ... and so on, it shuts down
RE: NodeManager health Question
Hi,

A few things you can verify while troubleshooting:

1. Check the RM web UI (http://<yarn.resourcemanager.webapp.address>/cluster): are there any "Active Nodes" in the YARN cluster? Also check for "Lost Nodes", "Unhealthy Nodes", and "Rebooted Nodes". If there are active nodes, cross-check "Memory Total"; it should be: Memory Total = number of active nodes * value of {yarn.nodemanager.resource.memory-mb}.
2. The NodeManager logs give more information; check those too.

> In Yarn, my Hive queries are "Accepted" but are "Unassigned" and do not run

This may mean your YARN cluster does not have enough memory to launch a container. Possible reasons:

1. None of the NMs are sending heartbeats to the RM (check the RM web UI for unhealthy nodes).
2. All the NMs are lost/unhealthy.
3. The full cluster capacity is in use, so the YARN scheduler is waiting for some container to finish before it can assign the released memory to other containers.

Looking at your DataNode socket timeout exception (8 minutes!), I suspect the Hadoop cluster's network is unstable. Better to debug the network.

Thanks & Regards
Rohith Sharma K S

From: Clay McDonald [mailto:stuart.mcdon...@bateswhite.com]
Sent: 14 March 2014 01:30
To: 'user@hadoop.apache.org'
Subject: NodeManager health Question

Hello all,

I have laid out my POC in a project plan and have HDP 2.0 installed. HDFS is running fine, and I have loaded about 6 TB of data to run my tests on. I have a series of SQL queries that I will run in Hive 0.12.0. I had to manually install Hue and still have a few issues I'm working on there. But at the moment, my most pressing issue is Hive jobs not running: in YARN, my Hive queries are "Accepted" but are "Unassigned" and do not run. See attached.

In Ambari, the datanodes all have the following error:

  NodeManager health  CRIT for 20 days  CRITICAL: NodeManager unhealthy

From the datanode logs I found the following:

  ERROR datanode.DataNode (DataXceiver.java:run(225)) - dc-bigdata1.bateswhite.com:50010:DataXceiver error processing READ_BLOCK operation src: /172.20.5.147:51299 dest: /172.20.5.141:50010
  java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/172.20.5.141:50010 remote=/172.20.5.147:51299]
          at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
          at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
          at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
          at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:546)
          at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:710)
          at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:340)
          at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:101)
          at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
          at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
          at java.lang.Thread.run(Thread.java:662)

Also, in the namenode log I see the following:

  2014-03-13 13:50:57,204 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1355)) - No groups available for user dr.who

If anyone can point me in the right direction to troubleshoot this, I would really appreciate it! Thanks!
Clay
RE: class org.apache.hadoop.yarn.proto.YarnProtos$ApplicationIdProto overrides final method getUnknownFields
Hi,

The reason for "class org.apache.hadoop.yarn.proto.YarnProtos$ApplicationIdProto overrides final method getUnknownFields()Lcom/google/protobuf/UnknownFieldSet" is that Hadoop is compiled with protoc 2.5.0, but a lower version of protobuf is present on the classpath.

1. Check the MRAppMaster classpath to see which protobuf version is there. It is expected to be 2.5.0.

Thanks & Regards
Rohith Sharma K S

-----Original Message-----
From: Margusja [mailto:mar...@roo.ee]
Sent: 03 March 2014 22:45
To: user@hadoop.apache.org
Subject: Re: class org.apache.hadoop.yarn.proto.YarnProtos$ApplicationIdProto overrides final method getUnknownFields

Hi,

2.2.0 and 2.3.0 gave me the same container log. A little more detail: I am trying to use an external Java client that submits the job.

Some lines from the Maven pom.xml file:

  org.apache.hadoop : hadoop-client : 2.3.0
  org.apache.hadoop : hadoop-core : 1.2.1

Lines from the external client:

  ...
  2014-03-03 17:36:01 INFO FileInputFormat:287 - Total input paths to process : 1
  2014-03-03 17:36:02 INFO JobSubmitter:396 - number of splits:1
  2014-03-03 17:36:03 INFO JobSubmitter:479 - Submitting tokens for job: job_1393848686226_0018
  2014-03-03 17:36:04 INFO YarnClientImpl:166 - Submitted application application_1393848686226_0018
  2014-03-03 17:36:04 INFO Job:1289 - The url to track the job: http://vm38.dbweb.ee:8088/proxy/application_1393848686226_0018/
  2014-03-03 17:36:04 INFO Job:1334 - Running job: job_1393848686226_0018
  2014-03-03 17:36:10 INFO Job:1355 - Job job_1393848686226_0018 running in uber mode : false
  2014-03-03 17:36:10 INFO Job:1362 - map 0% reduce 0%
  2014-03-03 17:36:10 INFO Job:1375 - Job job_1393848686226_0018 failed with state FAILED due to: Application application_1393848686226_0018 failed 2 times due to AM Container for appattempt_1393848686226_0018_02 exited with exitCode: 1 due to: Exception from container-launch:
  org.apache.hadoop.util.Shell$ExitCodeException:
          at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
          at org.apache.hadoop.util.Shell.run(Shell.java:379)
          at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
          at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
          at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
          at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
          at java.util.concurrent.FutureTask.run(FutureTask.java:262)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          at java.lang.Thread.run(Thread.java:744)
  ...

Lines from the namenode:

  ...
  14/03/03 19:12:42 INFO namenode.FSEditLog: Number of transactions: 900 Total time for transactions(ms): 69 Number of transactions batched in Syncs: 0 Number of syncs: 542 SyncTimes(ms): 9783
  14/03/03 19:12:42 INFO BlockStateChange: BLOCK* addToInvalidates: blk_1073742050_1226 90.190.106.33:50010
  14/03/03 19:12:42 INFO hdfs.StateChange: BLOCK* allocateBlock: /user/hduser/input/data666.noheader.data. BP-802201089-90.190.106.33-1393506052071 blk_1073742056_1232{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[90.190.106.33:50010|RBW]]}
  14/03/03 19:12:44 INFO hdfs.StateChange: BLOCK* InvalidateBlocks: ask 90.190.106.33:50010 to delete [blk_1073742050_1226]
  14/03/03 19:12:53 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 90.190.106.33:50010 is added to blk_1073742056_1232{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[90.190.106.33:50010|RBW]]} size 0
  14/03/03 19:12:53 INFO hdfs.StateChange: DIR* completeFile: /user/hduser/input/data666.noheader.data is closed by DFSClient_NONMAPREDUCE_-915999412_15
  14/03/03 19:12:54 INFO BlockStateChange: BLOCK* addToInvalidates: blk_1073742051_1227 90.190.106.33:50010
  14/03/03 19:12:54 INFO hdfs.StateChange: BLOCK* allocateBlock: /user/hduser/input/data666.noheader.data.info. BP-802201089-90.190.106.33-1393506052071 blk_1073742057_1233{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[90.190.106.33:50010|RBW]]}
  14/03/03 19:12:54 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 90.190.106.33:50010 is added to blk_1073742057_1233{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[90.190.106.33:50010|RBW]]} size 0
  14/03/03 19:12:54 INFO hdfs.StateChange: DIR* completeFile: /user/hduser/input/data666.noheader.data.info is closed by DFSClient_NONMAPREDUCE_-915999412_15
  14/03/03 19:12:55 INFO hdfs.StateChange: BLOCK* allocateBlock: /user/hduser/.staging/job_1393848686226_0019/job
RE: Problem in Submitting a Map-Reduce Job to Remote Hadoop Cluster
One more configuration needs to be added:

    config.set("mapreduce.framework.name", "yarn");

Thanks
Rohith

From: Rohith Sharma K S [mailto:rohithsharm...@huawei.com]
Sent: 03 March 2014 09:02
To: user@hadoop.apache.org
Subject: RE: Problem in Submitting a Map-Reduce Job to Remote Hadoop Cluster

Hi,

Set the configuration below in your word count job:

    Configuration config = new Configuration();
    config.set("fs.default.name", "hdfs://xyz-hostname:9000");
    config.set("mapred.job.tracker", "xyz-hostname:9001");
    config.set("yarn.application.classpath",
        "$HADOOP_CONF_DIR, $HADOOP_COMMON_HOME/share/hadoop/common/*, "
        + "$HADOOP_COMMON_HOME/share/hadoop/common/lib/*, "
        + "$HADOOP_HDFS_HOME/share/hadoop/hdfs/*, "
        + "$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*, "
        + "$YARN_HOME/share/hadoop/mapreduce/*, "
        + "$YARN_HOME/share/hadoop/mapreduce/lib/*, "
        + "$YARN_HOME/share/hadoop/yarn/*, "
        + "$YARN_HOME/share/hadoop/yarn/lib/*");

Thanks & Regards
Rohith Sharma K S

From: Senthil Sekar [mailto:senthil...@gmail.com]
Sent: 01 March 2014 19:41
To: user@hadoop.apache.org
Subject: Problem in Submitting a Map-Reduce Job to Remote Hadoop Cluster

Hi,

I have a remote server (CentOS 6.3) with CDH-4.0.1 installed. I have another Windows 7 machine from which I am trying to submit a simple WordCount MapReduce job (I have included the Hadoop 2.0.0 lib jars in my Eclipse environment).

I am getting the exception below when I try to run it from Eclipse on my Windows 7 machine:

//---
Exception in thread "main" java.io.IOException: Cannot initialize Cluster.
Please check your configuration for mapreduce.framework.name and the correspond server addresses.
        at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
        at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:83)
        at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:76)
        at org.apache.hadoop.mapred.JobClient.init(JobClient.java:487)
        at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:466)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:879)
        at com.pss.WordCount.main(WordCount.java:79)
//-

Please find the code below:

//-
public class WordCount {

    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException {
        Configuration config = new Configuration();
        config.set("fs.default.name", "hdfs://xyz-hostname:9000")
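As a sketch of the settings suggested in this thread, the same values can live in the client's config files on the classpath instead of being hard-coded (hostnames are the placeholders from the thread; note that fs.default.name is the old, deprecated name of fs.defaultFS):

```xml
<!-- core-site.xml: point the client at the remote NameNode -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://xyz-hostname:9000</value>
</property>

<!-- mapred-site.xml: required on Hadoop 2.x so the client submits to
     YARN instead of trying the local/classic framework, which is what
     triggers "Cannot initialize Cluster" -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
```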
RE: Problem in Submitting a Map-Reduce Job to Remote Hadoop Cluster
Hi,

Set the configuration below in your word count job:

    Configuration config = new Configuration();
    config.set("fs.default.name", "hdfs://xyz-hostname:9000");
    config.set("mapred.job.tracker", "xyz-hostname:9001");
    config.set("yarn.application.classpath",
        "$HADOOP_CONF_DIR, $HADOOP_COMMON_HOME/share/hadoop/common/*, "
        + "$HADOOP_COMMON_HOME/share/hadoop/common/lib/*, "
        + "$HADOOP_HDFS_HOME/share/hadoop/hdfs/*, "
        + "$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*, "
        + "$YARN_HOME/share/hadoop/mapreduce/*, "
        + "$YARN_HOME/share/hadoop/mapreduce/lib/*, "
        + "$YARN_HOME/share/hadoop/yarn/*, "
        + "$YARN_HOME/share/hadoop/yarn/lib/*");

Thanks & Regards
Rohith Sharma K S

From: Senthil Sekar [mailto:senthil...@gmail.com]
Sent: 01 March 2014 19:41
To: user@hadoop.apache.org
Subject: Problem in Submitting a Map-Reduce Job to Remote Hadoop Cluster

Hi,

I have a remote server (CentOS 6.3) with CDH-4.0.1 installed. I have another Windows 7 machine from which I am trying to submit a simple WordCount MapReduce job (I have included the Hadoop 2.0.0 lib jars in my Eclipse environment).

I am getting the exception below when I try to run it from Eclipse on my Windows 7 machine:

//---
Exception in thread "main" java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
        at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
        at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:83)
        at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:76)
        at org.apache.hadoop.mapred.JobClient.init(JobClient.java:487)
        at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:466)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:879)
        at com.pss.WordCount.main(WordCount.java:79)
//-

Please find the code below:

//-
public class WordCount {

    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException {
        Configuration config = new Configuration();
        config.set("fs.default.name", "hdfs://xyz-hostname:9000");
        config.set("mapred.job.tracker", "xyz-hostname:9001");
        JobConf conf = new JobConf(config);
        conf.setJarByClass(WordCount.class);
        //conf.setJar(jar);
        conf.setJobName("WordCount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
RE: RM AM_RESYNC signal to AM
Hi Gaurav

If the NodeManager is killed, the containers running on that NM are not killed immediately. The RM holds the node's information for 10 minutes (the default node expiry). Two cases are then possible:
1. After 10 minutes, the containers are killed.
2. The NM is killed and restarted within the 10 minutes.

1. In what all scenarios does the RM send the AM_RESYNC signal to the AM?
>>> The RM sends AM_RESYNC to the AM in two scenarios:
a. When there is a responseId mismatch. The AM sends a responseId to the RM at registration and in every heartbeat, and the RM validates the responseId in every heartbeat sent by the AM.
b. When the application attempt does not exist in the RM cache. In your case this is probably what is occurring: when the NM was killed, the RM removed all the attempt data, but the ApplicationMaster was still trying to connect to the RM.

2. Should the RM not send the AM_SHUTDOWN signal to the AM when the node manager is killed?
>> As such, AM_SHUTDOWN is NOT sent from the RM. The community may be planning an improvement on this.

Thanks & Regards
Rohith Sharma K S

From: Gaurav Gupta [mailto:gau...@datatorrent.com]
Sent: 28 February 2014 00:03
To: user@hadoop.apache.org
Subject: RM AM_RESYNC signal to AM

Hi,

I killed the node manager on the node where the AM was running, and the AM got the AM_RESYNC command signal from the RM. I have the following questions:
1. In what all scenarios does the RM send the AM_RESYNC signal to the AM?
2. Should the RM not send the AM_SHUTDOWN signal to the AM when the node manager is killed?

Thanks
-Gaurav
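The 10-minute node-expiry window discussed above is a yarn-site.xml setting; a sketch, shown with what I understand to be the default value (verify against your version's yarn-default.xml before relying on it):

```xml
<!-- yarn-site.xml: how long the RM waits without NM heartbeats before
     declaring the node lost and removing its containers/attempt data
     (default 600000 ms = 10 minutes) -->
<property>
  <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
  <value>600000</value>
</property>
```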
JobHistoryEventHandler failed with AvroTypeException.
Hi all,

I am using Hadoop 2.3 for a YARN cluster. While running a job, I encountered the exception below in the MRAppMaster. Why is this error being logged?

2014-02-21 22:10:33,841 INFO [Thread-355] org.apache.hadoop.service.AbstractService: Service JobHistoryEventHandler failed in state STOPPED; cause: org.apache.avro.AvroTypeException: Attempt to process a enum when a string was expected.
org.apache.avro.AvroTypeException: Attempt to process a enum when a string was expected.
        at org.apache.avro.io.parsing.Parser.advance(Parser.java:93)
        at org.apache.avro.io.JsonEncoder.writeEnum(JsonEncoder.java:217)
        at org.apache.avro.specific.SpecificDatumWriter.writeEnum(SpecificDatumWriter.java:54)
        at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:67)
        at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:106)
        at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:66)
        at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:58)
        at org.apache.hadoop.mapreduce.jobhistory.EventWriter.write(EventWriter.java:66)
        at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler$MetaInfo.writeEvent(JobHistoryEventHandler.java:870)
        at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:517)
        at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.serviceStop(JobHistoryEventHandler.java:332)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
        at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
        at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:159)
        at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:132)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStop(MRAppMaster.java:1386)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.shutDownJob(MRAppMaster.java:550)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler$1.run(MRAppMaster.java:602)

Thanks & Regards
Rohith Sharma K S
RE: what happens to a client attempting to get a new app when the resource manager is already down
The default retry time period is 15 minutes. By setting the configuration "yarn.resourcemanager.connect.max-wait.ms" to a smaller value, the retry period can be reduced on the client side.

Thanks & Regards
Rohith Sharma K S

From: Vinod Kumar Vavilapalli [mailto:vino...@hortonworks.com] On Behalf Of Vinod Kumar Vavilapalli
Sent: 05 February 2014 22:43
To: user@hadoop.apache.org; REYANE OUKPEDJO
Subject: Re: what happens to a client attempting to get a new app when the resource manager is already down

Is this on trunk or a released version? I think the default behavior (when RM HA is not enabled) shouldn't have the client loop forever. Let me know and we can see if this needs fixing.

Thanks,
+vinod

On Jan 31, 2014, at 7:52 AM, REYANE OUKPEDJO <r.oukpe...@yahoo.com> wrote:

Hi there,

I am trying to solve a problem. My client runs as a server, and I was trying to make it aware that the resource manager is down, but I could not figure out how. The reason is that the call yarnClient.createApplication() never returns when the resource manager is down; it just stays in a loop, sleeps after 10 iterations, and then continues the same loop. Below you can find the logs. Any idea how to leave this loop? Is there any parameter that controls the number of seconds before giving up?

Thanks

Reyane OUKPEDJO

logs:

14/01/31 10:48:05 INFO ipc.Client: Retrying connect to server: isblade2/9.32.160.125:8032. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/01/31 10:48:06 INFO ipc.Client: Retrying connect to server: isblade2/9.32.160.125:8032. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/01/31 10:48:37 INFO ipc.Client: Retrying connect to server: isblade2/9.32.160.125:8032.
Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/01/31 10:48:38 INFO ipc.Client: Retrying connect to server: isblade2/9.32.160.125:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/01/31 10:48:39 INFO ipc.Client: Retrying connect to server: isblade2/9.32.160.125:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/01/31 10:48:40 INFO ipc.Client: Retrying connect to server: isblade2/9.32.160.125:8032. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/01/31 10:48:41 INFO ipc.Client: Retrying connect to server: isblade2/9.32.160.125:8032. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/01/31 10:48:42 INFO ipc.Client: Retrying connect to server: isblade2/9.32.160.125:8032. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/01/31 10:48:43 INFO ipc.Client: Retrying connect to server: isblade2/9.32.160.125:8032. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/01/31 10:48:44 INFO ipc.Client: Retrying connect to server: isblade2/9.32.160.125:8032. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/01/31 10:48:45 INFO ipc.Client: Retrying connect to server: isblade2/9.32.160.125:8032. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/01/31 10:48:46 INFO ipc.Client: Retrying connect to server: isblade2/9.32.160.125:8032. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/01/31 10:49:17 INFO ipc.Client: Retrying connect to server: isblade2/9.32.160.125:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/01/31 10:49:18 INFO ipc.Client: Retrying connect to server: isblade2/9.32.160.125:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/01/31 10:49:19 INFO ipc.Client: Retrying connect to server: isblade2/9.32.160.125:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/01/31 10:49:20 INFO ipc.Client: Retrying connect to server: isblade2/9.32.160.125:8032. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/01/31 10:49:21 INFO ipc.Client: Retrying connect to server: isblade2/9.32.160.125:8032. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/01/31 10:49:22 INFO ipc.Client: Retrying connect to server: isblade2/9.32.160.125:8032. Already tried 5
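The client-side retry window discussed in this thread is controlled by yarn-site.xml properties; a sketch with illustrative (not default) values, shortening the wait from the 15-minute default:

```xml
<!-- yarn-site.xml: total time the client keeps retrying the RM before
     giving up (default 900000 ms = 15 minutes) -->
<property>
  <name>yarn.resourcemanager.connect.max-wait.ms</name>
  <value>60000</value>
</property>

<!-- interval between connection-retry rounds (default 30000 ms) -->
<property>
  <name>yarn.resourcemanager.connect.retry-interval.ms</name>
  <value>10000</value>
</property>
```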
Reducers are launched after jobClient is exited.
Hi All,

I ran a job with 1 map and 1 reducer (mapreduce.job.reduce.slowstart.completedmaps=1). The map failed (because of an error in the Mapper implementation), but the reducer was still launched by the ApplicationMaster, and was then killed by the ApplicationMaster while stopping the RMCommunicator service.

1. Why are reducers launched after the job is finished? (Is this a bug in MR?)

Our use case is that when the job finishes (succeeded/failed), the client program deletes the job output directory. Here, the job client exits immediately after the job status is set (in the log below, at 2014-01-23 07:34:43,166). But, as the log below shows, the reducers are launched later, and the reducer temporary directory and files (_temporary) are created. These files are left in HDFS, undeleted forever. Kindly suggest your thoughts on how we can handle this situation.

2014-01-23 07:34:43,151 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1389970937094_0047_m_00 Task Transitioned from RUNNING to FAILED
2014-01-23 07:34:43,151 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 1
2014-01-23 07:34:43,151 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Job failed as tasks failed. failedMaps:1 failedReduces:0
2014-01-23 07:34:43,153 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1389970937094_0047Job Transitioned from RUNNING to FAIL_ABORT
2014-01-23 07:34:43,153 INFO [CommitterEvent Processor #0] org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing the event EventType: JOB_ABORT
2014-01-23 07:34:43,166 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1389970937094_0047Job Transitioned from FAIL_ABORT to FAILED
...
...
2014-01-23 07:34:43,707 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:1 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:1 AssignedReds:0 CompletedMaps:1 CompletedReds:0 ContAlloc:4 ContRel:0 HostLocal:1 RackLocal:0
2014-01-23 07:34:43,709 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=12288
2014-01-23 07:34:43,709 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start threshold reached. Scheduling reduces.
2014-01-23 07:34:43,709 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: All maps assigned. Ramping up all remaining reduces:1
...
...
2014-01-23 07:34:45,714 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated containers 1
2014-01-23 07:34:45,714 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to reduce
2014-01-23 07:34:45,714 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1389970937094_0047_01_06 to attempt_1389970937094_0047_r_00_0
...
...
2014-01-23 07:34:45,724 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1389970937094_0047_r_00_0 TaskAttempt Transitioned from UNASSIGNED to ASSIGNED
2014-01-23 07:34:45,725 INFO [ContainerLauncher #8] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_LAUNCH for container container_1389970937094_0047_01_06 taskAttempt attempt_1389970937094_0047_r_00_0
2014-01-23 07:34:45,725 INFO [ContainerLauncher #8] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Launching attempt_1389970937094_0047_r_00_0
2014-01-23 07:34:45,727 INFO [ContainerLauncher #8] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Shuffle port returned by ContainerManager for attempt_1389970937094_0047_r_00_0 : 11234
2014-01-23 07:34:45,728 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: TaskAttempt: [attempt_1389970937094_0047_r_00_0] using containerId: [container_1389970937094_0047_01_06 on NM: [linux85:11232]
2014-01-23 07:34:45,728 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1389970937094_0047_r_00_0 TaskAttempt Transitioned from ASSIGNED to RUNNING
2014-01-23 07:34:45,728 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1389970937094_0047_r_00 Task Transitioned from SCHEDULED to RUNNING
...
.
2014-01-23 07:34:48,178 INFO [Thread-59] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1389970937094_0047_r
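For reference, the slow-start knob mentioned at the top of this thread is an ordinary job-level property; a sketch of setting it as the poster did (a value of 1.0 means reducers should wait for all maps to complete before being scheduled):

```xml
<!-- mapred-site.xml (or set per job): fraction of maps that must
     complete before reducers are scheduled; 1.0 = wait for all maps -->
<property>
  <name>mapreduce.job.reduce.slowstart.completedmaps</name>
  <value>1.0</value>
</property>
```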
RE: unable to compile hadoop source code
You can read the build instructions for Hadoop:
http://svn.apache.org/repos/asf/hadoop/common/trunk/BUILDING.txt

For your problem, protoc is not set in the PATH. After setting it, recheck that the protobuf version is 2.5.

From: nagarjuna kanamarlapudi [mailto:nagarjuna.kanamarlap...@gmail.com]
Sent: 07 January 2014 09:18
To: user@hadoop.apache.org
Subject: unable to compile hadoop source code

Hi,

I checked out the source code from https://svn.apache.org/repos/asf/hadoop/common/trunk/ and tried to compile the code with mvn. I am compiling this on Mac OS X Mavericks. Any help is appreciated.

It failed at the following stage:

[INFO] Apache Hadoop Auth Examples ... SUCCESS [5.017s]
[INFO] Apache Hadoop Common .. FAILURE [1:39.797s]
[INFO] Apache Hadoop NFS . SKIPPED
[INFO] Apache Hadoop Common Project .. SKIPPED
[INFO]
[ERROR] Failed to execute goal org.apache.hadoop:hadoop-maven-plugins:3.0.0-SNAPSHOT:protoc (compile-protoc) on project hadoop-common: org.apache.maven.plugin.MojoExecutionException: 'protoc --version' did not return a version -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.

Thanks,
Nagarjuna K
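A minimal sketch of the fix described above: put the protoc binary on the PATH before running mvn. The install prefix /usr/local/protobuf-2.5.0 is a hypothetical example; adjust it to wherever protobuf 2.5.0 is actually installed.

```shell
# Prepend a hypothetical protobuf 2.5.0 install to PATH before building.
PROTOC_HOME="${PROTOC_HOME:-/usr/local/protobuf-2.5.0}"
export PATH="$PROTOC_HOME/bin:$PATH"

# The hadoop-maven-plugins compile-protoc goal shells out to
# `protoc --version`; it must now resolve to the 2.5 binary for the
# build to get past the failure shown above.
case ":$PATH:" in
  *":$PROTOC_HOME/bin:"*) echo "protoc dir on PATH" ;;
  *)                      echo "protoc dir missing" ;;
esac
```

After this, re-running `mvn install -DskipTests` from the source root should no longer fail at the protoc step, provided `protoc --version` reports 2.5.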