[ 
https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012316#comment-15012316
 ] 

Tsuyoshi Ozawa commented on YARN-4348:
--------------------------------------

After taking a deeper look, the reply packet for the sync operation appears to 
be pending in the blocking queue of ClientCnxn#EventThread. After the thread 
is interrupted, it is processed correctly.

{code}
2015-11-19 07:21:47,955 DEBUG [main-SendThread(127.0.0.1:11221)] 
zookeeper.ClientCnxn (ClientCnxn.java:readResponse(733)) - Got auth 
sessionid:0x1511cb079430000
...
2015-11-19 07:21:48,019 DEBUG [SyncThread:0] server.FinalRequestProcessor 
(FinalRequestProcessor.java:processRequest(88)) - Processing request:: 
sessionid:0x1511cb079430000 type:sync: cxid:0xb zxid:0xfffffffffffffffe 
txntype:unknown reqpath:/rmstore/ZKRMStateRoot
...
2015-11-19 07:21:48,013 DEBUG [main-SendThread(127.0.0.1:11221)] 
zookeeper.ClientCnxn (ClientCnxn.java:readResponse(818)) - Reading reply 
sessionid:0x1511cb079430000, packet:: clientPath:null serverPath:null 
finished:false header:: 10,1  replyHeader:: 10,11,0  request:: 
'/rmstore/ZKRMStateRoot/AMRMTokenSecretManagerRoot,,v{s{31,s{'world,'anyone}}},0
  response:: '/rmstore/ZKRMStateRoot/AMRMTokenSecretManagerRoot 
2015-11-19 07:21:48,019 DEBUG [SyncThread:0] server.FinalRequestProcessor 
(FinalRequestProcessor.java:processRequest(88)) - Processing request:: 
sessionid:0x1511cb079430000 type:sync: cxid:0xb zxid:0xfffffffffffffffe 
txntype:unknown reqpath:/rmstore/ZKRMStateRoot
2015-11-19 07:21:48,019 DEBUG [SyncThread:0] server.FinalRequestProcessor 
(FinalRequestProcessor.java:processRequest(160)) - sessionid:0x1511cb079430000 
type:sync: cxid:0xb zxid:0xfffffffffffffffe txntype:unknown 
reqpath:/rmstore/ZKRMStateRoot
...
2015-11-19 07:22:03,027 INFO  [main] service.AbstractService 
(AbstractService.java:noteFailure(272)) - Service 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore failed in 
state STARTED; cause: java.io.IOException: failing to sync operation at 
starting up RM
java.io.IOException: failing to sync operation at starting up RM
...
2015-11-19 07:22:03,029 INFO  [main] event.AsyncDispatcher 
(AsyncDispatcher.java:serviceStop(141)) - AsyncDispatcher is draining to stop, 
igonring any new events.
2015-11-19 07:22:03,030 INFO  [main-EventThread] recovery.ZKRMStateStore 
(ZKRMStateStore.java:processResult(122)) - ZooKeeper sync operation succeeded. 
path: /rmstore/ZKRMStateRoot
2015-11-19 07:22:03,030 INFO  
[org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread]
 recovery.ZKRMStateStore (ZKRMStateStore.java:run(1131)) - 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread
 thread interrupted! Exiting!
2015-11-19 07:22:03,030 INFO  [main-EventThread] recovery.ZKRMStateStore 
(ZKRMStateStore.java:processResult(124)) - ZooKeeper sync operation succeeded. 
path: /rmstore/ZKRMStateRoot
{code}
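
The ordering in the log above (the wait gives up at 07:22:03,027 and the sync callback is delivered at 07:22:03,030) can be reproduced in miniature without ZooKeeper. The following is only a sketch under stated assumptions: SyncWaitSketch, waitForSync, callbackDelayMs and waitMs are hypothetical names for illustration, and the scheduled executor merely stands in for the client's event thread. It is not the code in ZKRMStateStore or in the attached patches.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SyncWaitSketch {

    /**
     * Simulates a sync whose completion callback is delivered after
     * callbackDelayMs, while the caller waits at most waitMs for it.
     * Returns true only if the callback arrived before the wait expired.
     * All names here are hypothetical and used for illustration only.
     */
    public static boolean waitForSync(long callbackDelayMs, long waitMs)
            throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        // Stand-in for the event thread that delivers the callback late.
        ScheduledExecutorService eventThread =
                Executors.newSingleThreadScheduledExecutor();
        try {
            Runnable callback = done::countDown;
            eventThread.schedule(callback, callbackDelayMs, TimeUnit.MILLISECONDS);
            // The caller's view: success depends on how long it is willing to wait.
            return done.await(waitMs, TimeUnit.MILLISECONDS);
        } finally {
            eventThread.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Wait shorter than the callback delay: the caller gives up first.
        System.out.println("short wait succeeded: " + waitForSync(300, 50));
        // Wait longer than the callback delay: the caller sees the callback.
        System.out.println("long wait succeeded: " + waitForSync(10, 2000));
    }
}
```

The point of the sketch is that the operation itself is the same in both calls; only the length of the caller's wait decides whether it is reported as a failure.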


> ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of 
> zkSessionTimeout
> ----------------------------------------------------------------------------------------
>
>                 Key: YARN-4348
>                 URL: https://issues.apache.org/jira/browse/YARN-4348
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.2, 2.6.2
>            Reporter: Tsuyoshi Ozawa
>            Assignee: Tsuyoshi Ozawa
>         Attachments: YARN-4348-branch-2.7.002.patch, YARN-4348.001.patch, 
> YARN-4348.001.patch, log.txt
>
>
> Jian mentioned that the current internal ZK configuration of ZKRMStateStore 
> can cause the following situation:
> 1. syncInternal times out, 
> 2. but the sync succeeds later on.
> We should use zkResyncWaitTime as the timeout value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
