[
https://issues.apache.org/jira/browse/PHOENIX-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117156#comment-17117156
]
Sandeep Guggilam commented on PHOENIX-4216:
-------------------------------------------
I have spent some time looking at a few builds that had test failures related to the
"Master not initialized" exception. The Master is not able to complete
initialization because it is waiting for the region server to report to it. The region
server actually did report to the master, but the master rejected the request
because of a clock skew issue:
_org.apache.hadoop.hbase.ClockOutOfSyncException: Server
asf948.gq1.ygridcore.net,36973,1590112065298 has been rejected; Reported time
is too far out of sync with master. Time difference of 1589507264841ms > max
allowed of 30000ms_
There are multiple things to understand here:
# The region server uses EnvironmentEdgeManager.currentTime() to report the
current time, and HMaster uses System.currentTimeMillis() to get the current
time for comparison against the time reported by the RS. Ideally,
EnvironmentEdgeManager should return the same value as System.currentTimeMillis() here
unless some other delegate is in use, which I am not sure is possible during
HRegionServer startup. On another note, should we just use EnvironmentEdgeManager
in HMaster as well for this computation?
# We compute the time difference between the RS and the Master as "abs(a-b) = c"
and check whether c is greater than the configured value. In the log message we only log
"c" (the difference). Should we also log either "a" or "b" to understand which side
(master or region server) is reporting the wrong value?
[~apurtell] Can you please share your thoughts?
> Figure out why tests randomly fail with master not able to initialize in 200
> seconds
> ------------------------------------------------------------------------------------
>
> Key: PHOENIX-4216
> URL: https://issues.apache.org/jira/browse/PHOENIX-4216
> Project: Phoenix
> Issue Type: Bug
> Affects Versions: 5.0.0, 4.15.0, 4.14.3
> Reporter: Samarth Jain
> Priority: Major
> Labels: phoenix-hardening, precommit, quality-improvement
> Fix For: 5.1.0, 4.16.0
>
> Attachments: Precommit-3849.log
>
>
> Sample failure:
> https://builds.apache.org/job/PreCommit-PHOENIX-Build/1450//testReport/
> [~apurtell] - Looking at the thread dump in the above link, do you see why
> master startup failed? I couldn't see any obvious deadlocks