[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tsuyoshi Ozawa updated YARN-4348:
---------------------------------
Attachment: YARN-4348-branch-2.7.002.patch
The test failure I mentioned is caused by using zkResyncWaitTime as the timeout
value of the sync operation: the default value of zkResyncWaitTime is smaller
than zkSessionTimeout. We should use a timeout value that is larger than
zkSessionTimeout, so I am simply changing it to zkSessionTimeout * 3.
In addition, we should handle a failure of the sync operation at startup time
to prevent the RM from continuing to run in an illegal state, i.e. with an
inconsistent view of ZK.
Attaching a patch that fixes the test failure and the error handling at startup
time (startInternal). [~jianhe], could you take a look?
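
For readers following the thread, here is a minimal, self-contained sketch of the idea described above. It is not the code in the attached patch. It shows waiting for an asynchronous sync with a timeout of zkSessionTimeout * 3 and failing at startup if the sync does not complete. The class SyncTimeoutSketch, the AsyncSync interface, and the method names syncTimeoutMs and syncWithTimeout are hypothetical stand-ins; no real ZooKeeper or ZKRMStateStore API is used.

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class SyncTimeoutSketch {

    /** Hypothetical stand-in for an asynchronous ZooKeeper sync call. */
    public interface AsyncSync {
        void sync(Runnable onComplete);
    }

    /** Timeout discussed in the comment: three times the session timeout. */
    public static long syncTimeoutMs(long zkSessionTimeoutMs) {
        return zkSessionTimeoutMs * 3;
    }

    /** Returns true if the sync callback fired before the timeout elapsed. */
    public static boolean syncWithTimeout(AsyncSync client, long timeoutMs)
            throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        client.sync(done::countDown);
        return done.await(timeoutMs, TimeUnit.MILLISECONDS);
    }

    /** Startup path (assumed shape): refuse to continue if the sync failed. */
    public static void startInternal(AsyncSync client, long zkSessionTimeoutMs)
            throws IOException, InterruptedException {
        long timeoutMs = syncTimeoutMs(zkSessionTimeoutMs);
        if (!syncWithTimeout(client, timeoutMs)) {
            throw new IOException("sync did not complete within " + timeoutMs + " ms");
        }
    }

    public static void main(String[] args) throws Exception {
        long zkSessionTimeoutMs = 100;

        // Completes after 150 ms: later than the session timeout, but within 3x.
        AsyncSync slow = onComplete -> {
            Thread t = new Thread(() -> {
                try {
                    Thread.sleep(150);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                onComplete.run();
            });
            t.setDaemon(true);
            t.start();
        };
        startInternal(slow, zkSessionTimeoutMs);
        System.out.println("sync completed, startup continues");

        // Never completes: startup fails instead of running with a stale view.
        AsyncSync never = onComplete -> { };
        try {
            startInternal(never, zkSessionTimeoutMs);
        } catch (IOException e) {
            System.out.println("startup aborted: " + e.getMessage());
        }
    }
}
```

The point of the sketch is the relationship between the two values: a wait shorter than the session timeout can give up on a sync that would still succeed, while a wait of several session timeouts only expires when the session itself is very likely gone.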
> ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of
> zkSessionTimeout
> ----------------------------------------------------------------------------------------
>
> Key: YARN-4348
> URL: https://issues.apache.org/jira/browse/YARN-4348
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.2, 2.6.2
> Reporter: Tsuyoshi Ozawa
> Assignee: Tsuyoshi Ozawa
> Attachments: YARN-4348-branch-2.7.002.patch, YARN-4348.001.patch,
> YARN-4348.001.patch, log.txt
>
>
> Jian mentioned that the current internal ZK configuration of ZKRMStateStore
> can cause the following situation:
> 1. syncInternal times out,
> 2. but the sync succeeds later on.
> We should use zkResyncWaitTime as the timeout value.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)