Folks, any idea on the failure? On 5 October 2015 at 11:46, Rajat Jain <[email protected]> wrote:
> YARN ships with these jars: > > > /usr/lib/hadoop/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jackson-xc-1.9.13.jar > > /usr/lib/hadoop/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jackson-jaxrs-1.9.13.jar > > /usr/lib/hadoop/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jackson-core-asl-1.9.13.jar > > /usr/lib/hadoop/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jackson-mapper-asl-1.9.13.jar > /usr/lib/hadoop/share/hadoop/yarn/lib/jackson-annotations-2.2.3.jar > /usr/lib/hadoop/share/hadoop/yarn/lib/jackson-databind-2.2.3.jar > /usr/lib/hadoop/share/hadoop/yarn/lib/jackson-core-2.2.3.jar > > /usr/lib/hadoop/share/hadoop/kms/tomcat/webapps/kms/WEB-INF/lib/jackson-xc-1.9.13.jar > > /usr/lib/hadoop/share/hadoop/kms/tomcat/webapps/kms/WEB-INF/lib/jackson-jaxrs-1.9.13.jar > > /usr/lib/hadoop/share/hadoop/kms/tomcat/webapps/kms/WEB-INF/lib/jackson-core-asl-1.9.13.jar > > /usr/lib/hadoop/share/hadoop/kms/tomcat/webapps/kms/WEB-INF/lib/jackson-mapper-asl-1.9.13.jar > /usr/lib/hadoop/share/hadoop/common/lib/jackson-xc-1.9.13.jar > /usr/lib/hadoop/share/hadoop/common/lib/jackson-jaxrs-1.9.13.jar > /usr/lib/hadoop/share/hadoop/common/lib/jackson-core-asl-1.9.13.jar > /usr/lib/hadoop/share/hadoop/common/lib/jackson-mapper-asl-1.9.13.jar > > > Tez ships with these jars: > > /usr/lib/tez/lib/jackson-xc-1.9.13.jar > /usr/lib/tez/lib/jackson-annotations-2.2.3.jar > /usr/lib/tez/lib/jackson-jaxrs-1.9.13.jar > /usr/lib/tez/lib/jackson-core-asl-1.9.13.jar > /usr/lib/tez/lib/jackson-databind-2.2.3.jar > /usr/lib/tez/lib/jackson-core-2.2.3.jar > /usr/lib/tez/lib/jackson-mapper-asl-1.9.13.jar > > > > > On 5 October 2015 at 11:21, Jonathan Eagles <[email protected]> wrote: > >> What's happening here is that the Tez AM has a runtime dependency on >> jackson libraries due to ATSHistoryLoggingService (due to >> TimelineClientImpl in YARN). It looks like your application is shipping a >> different version of jackson that YARN was expecting. Can you check what >> two versions of jackson library you are running with? >> >> Jon >> >> On Mon, Oct 5, 2015 at 12:30 AM, Rajat Jain <[email protected]> wrote: >> >>> Here is the AM log. My setup doesn't use Ambari. Can you attach the >>> /etc/security/limits.d/yarn.conf file? >>> >>> 2015-10-05 04:43:36,769 INFO [main] Configuration.deprecation: >>> fs.default.name is deprecated. Instead, use fs.defaultFS >>> 2015-10-05 04:43:36,972 WARN [main] util.NativeCodeLoader: Unable to load >>> native-hadoop library for your platform... using builtin-java classes where >>> applicable >>> 2015-10-05 04:43:37,077 INFO [main] app.DAGAppMaster: Using Standard Socket >>> Factory for communication >>> 2015-10-05 04:43:37,269 INFO [main] app.DAGAppMaster: Created DAGAppMaster >>> for application appattempt_1444019975627_0001_000001, versionInfo=[ >>> component=tez-dag, version=0.7.0, >>> revision=e1c833e8be209145149cc6d3e1a01f8fa6e9b85f, >>> SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git, >>> buildTime=2015-09-29T13:26:49Z ] >>> 2015-10-05 04:43:37,292 INFO [main] app.DAGAppMaster: Comparing client >>> version with AM version, clientVersion=0.7.0, AMVersion=0.7.0 >>> 2015-10-05 04:43:37,947 WARN [main] shortcircuit.DomainSocketFactory: The >>> short-circuit local reads feature cannot be used because libhadoop cannot >>> be loaded. >>> 2015-10-05 04:43:38,086 INFO [main] app.DAGAppMaster: Adding session token >>> to jobTokenSecretManager for application >>> 2015-10-05 04:43:38,090 INFO [main] common.AsyncDispatcher: Registering >>> class org.apache.tez.dag.app.rm.container.AMContainerEventType for class >>> org.apache.tez.dag.app.rm.container.AMContainerMap >>> 2015-10-05 04:43:38,092 INFO [main] common.AsyncDispatcher: Registering >>> class org.apache.tez.dag.app.rm.node.AMNodeEventType for class >>> org.apache.tez.dag.app.rm.node.AMNodeTracker >>> 2015-10-05 04:43:38,093 INFO [main] common.AsyncDispatcher: Registering >>> class org.apache.tez.dag.app.dag.event.DAGAppMasterEventType for class >>> org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterEventHandler >>> 2015-10-05 04:43:38,094 INFO [main] common.AsyncDispatcher: Registering >>> class org.apache.tez.dag.app.dag.event.DAGEventType for class >>> org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher >>> 2015-10-05 04:43:38,094 INFO [main] common.AsyncDispatcher: Registering >>> class org.apache.tez.dag.app.dag.event.VertexEventType for class >>> org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher >>> 2015-10-05 04:43:38,095 INFO [main] common.AsyncDispatcher: Registering >>> class org.apache.tez.dag.app.dag.event.TaskEventType for class >>> org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher >>> 2015-10-05 04:43:38,096 INFO [main] common.AsyncDispatcher: Registering >>> class org.apache.tez.dag.app.dag.event.TaskAttemptEventType for class >>> org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher >>> 2015-10-05 04:43:38,097 INFO [main] common.AsyncDispatcher: Registering >>> class org.apache.tez.dag.app.dag.event.SpeculatorEventType for independent >>> dispatch using: class >>> org.apache.tez.dag.app.DAGAppMaster$SpeculatorEventHandler >>> 2015-10-05 04:43:38,097 INFO [main] common.AsyncDispatcher: Registering >>> class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class >>> org.apache.tez.dag.app.DAGAppMaster$SpeculatorEventHandler >>> 2015-10-05 04:43:38,111 INFO [main] common.AsyncDispatcher: Registering >>> class org.apache.tez.dag.app.rm.AMSchedulerEventType for class >>> org.apache.tez.dag.app.rm.TaskSchedulerEventHandler >>> 2015-10-05 04:43:38,113 INFO [main] common.AsyncDispatcher: Registering >>> class org.apache.tez.dag.app.rm.NMCommunicatorEventType for class >>> org.apache.tez.dag.app.launcher.ContainerLauncherImpl >>> 2015-10-05 04:43:38,215 INFO [main] node.AMNodeTracker: >>> blacklistDisablePercent is 33, blacklistingEnabled: true, >>> maxTaskFailuresPerNode: 10 >>> 2015-10-05 04:43:38,217 INFO [main] web.WebUIService: Tez UI History URL: >>> http://ec2-54-158-112-249.compute-1.amazonaws.com:10500/#/tez-app/application_1444019975627_0001 >>> 2015-10-05 >>> <http://ec2-54-158-112-249.compute-1.amazonaws.com:10500/#/tez-app/application_1444019975627_00012015-10-05> >>> 04:43:38,219 INFO [main] launcher.ContainerLauncherImpl: Upper limit on >>> the thread pool size is 500 >>> 2015-10-05 04:43:38,219 INFO [main] history.HistoryEventHandler: >>> Initializing HistoryEventHandler >>> 2015-10-05 04:43:38,224 INFO [main] ats.ATSHistoryLoggingService: >>> Initializing ATSService >>> 2015-10-05 04:43:38,806 INFO [main] impl.TimelineClientImpl: Timeline >>> service address: >>> http://ec2-54-158-112-249.compute-1.amazonaws.com:8188/ws/v1/timeline/ >>> 2015-10-05 >>> <http://ec2-54-158-112-249.compute-1.amazonaws.com:8188/ws/v1/timeline/2015-10-05> >>> 04:43:38,806 INFO [main] ats.ATSHistoryLoggingService: Using >>> org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager to manage >>> Timeline ACLs >>> 2015-10-05 04:43:38,971 INFO [main] impl.TimelineClientImpl: Timeline >>> service address: >>> http://ec2-54-158-112-249.compute-1.amazonaws.com:8188/ws/v1/timeline/ >>> 2015-10-05 >>> <http://ec2-54-158-112-249.compute-1.amazonaws.com:8188/ws/v1/timeline/2015-10-05> >>> 04:43:38,971 INFO [main] recovery.RecoveryService: Initializing >>> RecoveryService >>> 2015-10-05 04:43:38,974 INFO [main] history.HistoryEventHandler: >>> [HISTORY][DAG:N/A][Event:APP_LAUNCHED]: >>> applicationId=application_1444019975627_0001, appSubmitTime=1444020210133, >>> launchTime=1444020217266 >>> 2015-10-05 04:43:38,975 INFO [main] history.HistoryEventHandler: >>> [HISTORY][DAG:N/A][Event:AM_LAUNCHED]: >>> appAttemptId=appattempt_1444019975627_0001_000001, >>> appSubmitTime=1444020210133, launchTime=1444020217266 >>> 2015-10-05 04:43:38,982 INFO >>> [ServiceThread:org.apache.tez.dag.history.HistoryEventHandler] >>> history.HistoryEventHandler: Starting HistoryEventHandler >>> 2015-10-05 04:43:38,982 INFO >>> [ServiceThread:org.apache.tez.dag.history.HistoryEventHandler] >>> ats.ATSHistoryLoggingService: Starting ATSService >>> 2015-10-05 04:43:38,985 INFO >>> [ServiceThread:org.apache.tez.dag.app.launcher.ContainerLauncherImpl] >>> impl.ContainerManagementProtocolProxy: >>> yarn.client.max-cached-nodemanagers-proxies : 0 >>> 2015-10-05 04:43:38,990 INFO [ServiceThread:DAGClientRPCServer] >>> ipc.CallQueueManager: Using callQueue class >>> java.util.concurrent.LinkedBlockingQueue >>> 2015-10-05 04:43:38,990 INFO >>> [ServiceThread:org.apache.tez.dag.app.TaskAttemptListenerImpTezDag] >>> ipc.CallQueueManager: Using callQueue class >>> java.util.concurrent.LinkedBlockingQueue >>> 2015-10-05 04:43:38,997 INFO >>> [ServiceThread:org.apache.tez.dag.history.HistoryEventHandler] >>> recovery.RecoveryService: Starting RecoveryService >>> 2015-10-05 04:43:39,001 INFO [Socket Reader #1 for port 47120] ipc.Server: >>> Starting Socket Reader #1 for port 47120 >>> 2015-10-05 04:43:39,001 INFO [Socket Reader #1 for port 47029] ipc.Server: >>> Starting Socket Reader #1 for port 47029 >>> 2015-10-05 04:43:39,027 INFO [IPC Server Responder] ipc.Server: IPC Server >>> Responder: starting >>> 2015-10-05 04:43:39,033 INFO [IPC Server listener on 47029] ipc.Server: IPC >>> Server listener on 47029: starting >>> 2015-10-05 04:43:39,035 INFO [IPC Server listener on 47120] ipc.Server: IPC >>> Server listener on 47120: starting >>> 2015-10-05 04:43:39,035 INFO [IPC Server Responder] ipc.Server: IPC Server >>> Responder: starting >>> 2015-10-05 04:43:39,035 INFO [ServiceThread:DAGClientRPCServer] >>> client.DAGClientServer: Instantiated DAGClientRPCServer at >>> ip-10-149-127-154.ec2.internal/10.149.127.154:47029 >>> 2015-10-05 04:43:39,070 INFO >>> [ServiceThread:org.apache.tez.dag.app.web.WebUIService] mortbay.log: >>> Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via >>> org.mortbay.log.Slf4jLog >>> 2015-10-05 04:43:39,074 INFO >>> [ServiceThread:org.apache.tez.dag.app.web.WebUIService] >>> http.HttpRequestLog: Http request log for http.requests. is not defined >>> 2015-10-05 04:43:39,083 INFO >>> [ServiceThread:org.apache.tez.dag.app.web.WebUIService] http.HttpServer2: >>> Added global filter 'safety' >>> (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter) >>> 2015-10-05 04:43:39,087 INFO >>> [ServiceThread:org.apache.tez.dag.app.web.WebUIService] http.HttpServer2: >>> Added filter AM_PROXY_FILTER >>> (class=org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter) to >>> context >>> 2015-10-05 04:43:39,087 INFO >>> [ServiceThread:org.apache.tez.dag.app.web.WebUIService] http.HttpServer2: >>> Added filter AM_PROXY_FILTER >>> (class=org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter) to >>> context static >>> 2015-10-05 04:43:39,090 INFO >>> [ServiceThread:org.apache.tez.dag.app.web.WebUIService] http.HttpServer2: >>> adding path spec: /* >>> 2015-10-05 04:43:39,097 INFO >>> [ServiceThread:org.apache.tez.dag.app.web.WebUIService] http.HttpServer2: >>> Jetty bound to port 34881 >>> 2015-10-05 04:43:39,098 INFO >>> [ServiceThread:org.apache.tez.dag.app.web.WebUIService] mortbay.log: >>> jetty-6.1.26 >>> 2015-10-05 04:43:39,127 INFO >>> [ServiceThread:org.apache.tez.dag.app.web.WebUIService] mortbay.log: >>> Extract >>> jar:file:/media/ephemeral1/yarn/local/filecache/20/hadoop-yarn-common-2.6.0.jar!/webapps/ >>> to /tmp/Jetty_0_0_0_0_34881_webapps____.97xzaz/webapp >>> 2015-10-05 04:43:39,288 INFO >>> [ServiceThread:org.apache.tez.dag.app.web.WebUIService] mortbay.log: NO JSP >>> Support for /, did not find org.apache.jasper.servlet.JspServlet >>> 2015-10-05 04:43:39,298 FATAL [HistoryEventHandlingThread] >>> yarn.YarnUncaughtExceptionHandler: Thread >>> Thread[HistoryEventHandlingThread,5,main] threw an Error. Shutting down >>> now... >>> java.lang.NoSuchMethodError: >>> org.codehaus.jackson.map.ObjectMapper.setSerializationInclusion(Lorg/codehaus/jackson/map/annotate/JsonSerialize$Inclusion;)Lorg/codehaus/jackson/map/ObjectMapper; >>> at >>> org.apache.hadoop.yarn.webapp.YarnJacksonJaxbJsonProvider.configObjectMapper(YarnJacksonJaxbJsonProvider.java:59) >>> at >>> org.apache.hadoop.yarn.webapp.YarnJacksonJaxbJsonProvider.locateMapper(YarnJacksonJaxbJsonProvider.java:52) >>> at >>> org.codehaus.jackson.jaxrs.JacksonJsonProvider.writeTo(JacksonJsonProvider.java:501) >>> at >>> com.sun.jersey.api.client.RequestWriter.writeRequestEntity(RequestWriter.java:300) >>> at >>> com.sun.jersey.client.urlconnection.URLConnectionClientHandler._invoke(URLConnectionClientHandler.java:204) >>> at >>> com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:147) >>> at >>> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter$1.run(TimelineClientImpl.java:226) >>> at >>> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:162) >>> at >>> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:237) >>> at com.sun.jersey.api.client.Client.handle(Client.java:648) >>> at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670) >>> at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74) >>> at >>> com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563) >>> at >>> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:472) >>> at >>> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:321) >>> at >>> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:301) >>> at >>> org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService.handleEvents(ATSHistoryLoggingService.java:342) >>> at >>> org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService.access$700(ATSHistoryLoggingService.java:52) >>> at >>> org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService$1.run(ATSHistoryLoggingService.java:186) >>> at java.lang.Thread.run(Thread.java:745) >>> 2015-10-05 04:43:39,303 INFO [HistoryEventHandlingThread] util.ExitUtil: >>> Exiting with status -1 >>> 2015-10-05 04:43:39,307 INFO [Thread-2] app.DAGAppMaster: >>> DAGAppMasterShutdownHook invoked >>> 2015-10-05 04:43:39,307 INFO [Thread-2] app.DAGAppMaster: DAGAppMaster >>> received a signal. Signaling TaskScheduler >>> 2015-10-05 04:43:39,307 INFO [Thread-2] rm.TaskSchedulerEventHandler: >>> TaskScheduler notified that iSignalled was : true >>> 2015-10-05 04:43:39,397 INFO >>> [ServiceThread:org.apache.tez.dag.app.web.WebUIService] mortbay.log: >>> Started [email protected]:34881 >>> 2015-10-05 04:43:39,397 INFO >>> [ServiceThread:org.apache.tez.dag.app.web.WebUIService] webapp.WebApps: Web >>> app / started at 34881 >>> 2015-10-05 04:43:39,608 INFO >>> [ServiceThread:org.apache.tez.dag.app.web.WebUIService] webapp.WebApps: >>> Registered webapp guice modules >>> 2015-10-05 04:43:39,610 INFO >>> [ServiceThread:org.apache.tez.dag.app.web.WebUIService] web.WebUIService: >>> Instantiated WebUIService at http://ip-10-149-127-154.ec2.internal:34881/ui/ >>> 2015-10-05 <http://ip-10-149-127-154.ec2.internal:34881/ui/2015-10-05> >>> 04:43:39,645 INFO >>> [ServiceThread:org.apache.tez.dag.app.rm.TaskSchedulerEventHandler] >>> rm.YarnTaskSchedulerService: TaskScheduler initialized with configuration: >>> maxRMHeartbeatInterval: 1000, containerReuseEnabled: true, reuseRackLocal: >>> true, reuseNonLocal: false, localitySchedulingDelay: 250, >>> preemptionPercentage: 10, numHeartbeatsBetweenPreemptions: 3, >>> idleContainerMinTimeout: 5000, idleContainerMaxTimeout: 10000, >>> sessionMinHeldContainers: 0 >>> 2015-10-05 04:43:39,719 INFO >>> [ServiceThread:org.apache.tez.dag.app.rm.TaskSchedulerEventHandler] >>> client.RMProxy: Connecting to ResourceManager at >>> ec2-54-158-112-249.compute-1.amazonaws.com/10.157.14.248:8030 >>> 2015-10-05 >>> <http://ec2-54-158-112-249.compute-1.amazonaws.com/10.157.14.248:80302015-10-05> >>> 04:43:39,812 INFO [main] history.HistoryEventHandler: >>> [HISTORY][DAG:N/A][Event:AM_STARTED]: >>> appAttemptId=appattempt_1444019975627_0001_000001, startTime=1444020219811 >>> 2015-10-05 04:43:39,812 INFO [main] app.DAGAppMaster: In Session mode. >>> Waiting for DAG over RPC >>> 2015-10-05 04:43:39,813 INFO [Thread-2] history.HistoryEventHandler: >>> Stopping HistoryEventHandler >>> 2015-10-05 04:43:39,814 INFO [Thread-2] recovery.RecoveryService: Stopping >>> RecoveryService >>> 2015-10-05 04:43:39,814 INFO [Thread-2] recovery.RecoveryService: Handle >>> the remaining events in queue, queue size=0 >>> 2015-10-05 04:43:39,814 INFO [RecoveryEventHandlingThread] >>> recovery.RecoveryService: EventQueue take interrupted. Returning >>> 2015-10-05 04:43:39,814 INFO [Thread-2] ats.ATSHistoryLoggingService: >>> Stopping ATSService, eventQueueBacklog=1 >>> 2015-10-05 04:43:39,814 WARN [Thread-2] ats.ATSHistoryLoggingService: >>> ATSService being stopped, eventQueueBacklog=1, maxTimeLeftToFlush=-1, >>> waitForever=true >>> >>> >>> On 5 October 2015 at 10:48, Gopal Vijayaraghavan <[email protected]> >>> wrote: >>> >>>> >>>> > The failures are always intermittent. Any idea why this happens? >>>> >>>> First up you should try 0.7.1, because of TEZ-2663. >>>> >>>> > Caused by: org.apache.tez.dag.api.SessionNotRunning: TezSession has >>>> >already shutdown. Application application_1444019975627_0001 failed 2 >>>> >times due to AM Container for appattempt_1444019975627_0001_000002 >>>> exited >>>> >with exitCode: 255 >>>> >>>> Can you say what is printed in the AppMaster logs? >>>> >>>> I've seen this occasionally happen due to bad setup of cluster >>>> uid-limits. >>>> >>>> The ambari sets that up in a file named >>>> >>>> /etc/security/limits.d/yarn.conf (yarn.conf.j2) >>>> >>>> >>>> Check for that file. Otherwise as shuffle handler spawns threads, the >>>> container launchers will start to intermittently fail (the default is >>>> 1024 >>>> threads per-user, yarn.conf ups this to 65,000 threads). >>>> >>>> Cheers, >>>> Gopal >>>> >>>> >>>> >>> >> >
