[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012316#comment-15012316 ]
Tsuyoshi Ozawa commented on YARN-4348:
--------------------------------------

After taking a deeper look, the reply packet for the sync operation appears to be pending in the blocking queue of ClientCnxn#EventThread. After the thread is interrupted, the reply is processed correctly.

{code}
2015-11-19 07:21:47,955 DEBUG [main-SendThread(127.0.0.1:11221)] zookeeper.ClientCnxn (ClientCnxn.java:readResponse(733)) - Got auth sessionid:0x1511cb079430000
...
2015-11-19 07:21:48,019 DEBUG [SyncThread:0] server.FinalRequestProcessor (FinalRequestProcessor.java:processRequest(88)) - Processing request:: sessionid:0x1511cb079430000 type:sync: cxid:0xb zxid:0xfffffffffffffffe txntype:unknown reqpath:/rmstore/ZKRMStateRoot
...
2015-11-19 07:21:48,013 DEBUG [main-SendThread(127.0.0.1:11221)] zookeeper.ClientCnxn (ClientCnxn.java:readResponse(818)) - Reading reply sessionid:0x1511cb079430000, packet:: clientPath:null serverPath:null finished:false header:: 10,1 replyHeader:: 10,11,0 request:: '/rmstore/ZKRMStateRoot/AMRMTokenSecretManagerRoot,,v{s{31,s{'world,'anyone}}},0 response:: '/rmstore/ZKRMStateRoot/AMRMTokenSecretManagerRoot
2015-11-19 07:21:48,019 DEBUG [SyncThread:0] server.FinalRequestProcessor (FinalRequestProcessor.java:processRequest(88)) - Processing request:: sessionid:0x1511cb079430000 type:sync: cxid:0xb zxid:0xfffffffffffffffe txntype:unknown reqpath:/rmstore/ZKRMStateRoot
2015-11-19 07:21:48,019 DEBUG [SyncThread:0] server.FinalRequestProcessor (FinalRequestProcessor.java:processRequest(160)) - sessionid:0x1511cb079430000 type:sync: cxid:0xb zxid:0xfffffffffffffffe txntype:unknown reqpath:/rmstore/ZKRMStateRoot
...
2015-11-19 07:22:03,027 INFO [main] service.AbstractService (AbstractService.java:noteFailure(272)) - Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore failed in state STARTED; cause: java.io.IOException: failing to sync operation at starting up RM
java.io.IOException: failing to sync operation at starting up RM
...
2015-11-19 07:22:03,029 INFO [main] event.AsyncDispatcher (AsyncDispatcher.java:serviceStop(141)) - AsyncDispatcher is draining to stop, igonring any new events.
2015-11-19 07:22:03,030 INFO [main-EventThread] recovery.ZKRMStateStore (ZKRMStateStore.java:processResult(122)) - ZooKeeper sync operation succeeded. path: /rmstore/ZKRMStateRoot
2015-11-19 07:22:03,030 INFO [org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread] recovery.ZKRMStateStore (ZKRMStateStore.java:run(1131)) - org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread thread interrupted! Exiting!
2015-11-19 07:22:03,030 INFO [main-EventThread] recovery.ZKRMStateStore (ZKRMStateStore.java:processResult(124)) - ZooKeeper sync operation succeeded. path: /rmstore/ZKRMStateRoot
{code}

> ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
> ----------------------------------------------------------------------------------------
>
>                 Key: YARN-4348
>                 URL: https://issues.apache.org/jira/browse/YARN-4348
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.2, 2.6.2
>            Reporter: Tsuyoshi Ozawa
>            Assignee: Tsuyoshi Ozawa
>         Attachments: YARN-4348-branch-2.7.002.patch, YARN-4348.001.patch, YARN-4348.001.patch, log.txt
>
>
> Jian mentioned that the current internal ZK configuration of ZKRMStateStore can cause the following situation:
> 1. syncInternal times out,
> 2. but the sync succeeds later on.
> We should use zkResyncWaitTime as the timeout value.
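
The race described in this issue (the caller gives up waiting, yet the sync callback still fires afterwards) can be illustrated with a minimal, self-contained sketch. This is not the ZKRMStateStore code: the class `SyncTimeoutSketch`, the method `run`, and the `Outcome` type are hypothetical names, and a scheduled executor stands in for the ZooKeeper client's event thread delivering a delayed callback. It only assumes that the caller blocks on a latch for a bounded time.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; none of these names exist in Hadoop or ZooKeeper.
class SyncTimeoutSketch {

    /** Result of one simulated sync attempt. */
    static final class Outcome {
        final boolean completedWithinWait;  // callback arrived before the caller gave up
        final boolean completedEventually;  // callback arrived at all

        Outcome(boolean completedWithinWait, boolean completedEventually) {
            this.completedWithinWait = completedWithinWait;
            this.completedEventually = completedEventually;
        }
    }

    /**
     * Simulates an asynchronous sync whose callback is delivered after
     * callbackDelayMillis while the caller waits at most waitMillis.
     */
    static Outcome run(long callbackDelayMillis, long waitMillis) {
        CountDownLatch done = new CountDownLatch(1);
        // Stand-in for the client's event thread, which delivers the callback late.
        ScheduledExecutorService eventThread =
            Executors.newSingleThreadScheduledExecutor();
        try {
            eventThread.schedule(done::countDown, callbackDelayMillis,
                TimeUnit.MILLISECONDS);

            // The caller's bounded wait; a false result is the "sync timed out" case.
            boolean withinWait = done.await(waitMillis, TimeUnit.MILLISECONDS);

            // Let the already-scheduled callback run, as the real event thread would.
            // By default, delayed tasks still execute after shutdown().
            eventThread.shutdown();
            eventThread.awaitTermination(callbackDelayMillis + 5_000,
                TimeUnit.MILLISECONDS);

            return new Outcome(withinWait, done.getCount() == 0);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            eventThread.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Wait shorter than the callback delay: the caller fails, the sync still succeeds.
        Outcome shortWait = run(300, 50);
        System.out.println("short wait: within=" + shortWait.completedWithinWait
            + " eventually=" + shortWait.completedEventually);

        // Wait longer than the callback delay: the caller sees the success.
        Outcome longWait = run(50, 2_000);
        System.out.println("long wait: within=" + longWait.completedWithinWait
            + " eventually=" + longWait.completedEventually);
    }
}
```

In the log above, the first case corresponds to the IOException at 07:22:03,027 followed by "ZooKeeper sync operation succeeded" at 07:22:03,030. Which wait value is appropriate in ZKRMStateStore is the subject of this issue; the sketch only shows why the choice matters.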