[ https://issues.apache.org/jira/browse/PHOENIX-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117156#comment-17117156 ]
Sandeep Guggilam commented on PHOENIX-4216:
-------------------------------------------

I spent some time looking at a few builds whose test failures were related to the "Master not initialized" exception. The Master cannot complete initialization because it is waiting for the region server to report to it. The region server did report to the master, but the master rejected the request because of a clock skew issue:

_org.apache.hadoop.hbase.ClockOutOfSyncException: Server asf948.gq1.ygridcore.net,36973,1590112065298 has been rejected; Reported time is too far out of sync with master. Time difference of 1589507264841ms > max allowed of 30000ms_

There are multiple things to understand here:
# The region server uses EnvironmentEdgeManager.currentTime() to report the current time, while HMaster uses System.currentTimeMillis() to get the current time it compares against the time reported by the RS. Ideally, EnvironmentEdgeManager should return the same value as System.currentTimeMillis() here, unless some other delegate is in use, and I am not sure that is possible during HRegionServer startup. On another note, should we just use EnvironmentEdgeManager in HMaster for this computation as well?
# We compute the time difference between the RS and the Master as "abs(a-b) = c" and check whether c is greater than the configured value. The log message only contains "c" (the difference). Should we also log "a" or "b" so we can tell which side (master or region server) is reporting the wrong value?

[~apurtell] Can you please provide your thoughts?
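To make the two suggestions above concrete, here is a minimal sketch of a skew check that reads the master time from an injectable clock and puts both timestamps in the message. It is not HBase's actual code: ClockSkewCheck, ClockSkewException, and the LongSupplier clock are hypothetical stand-ins for the master-side check, ClockOutOfSyncException, and EnvironmentEdgeManager. The two timestamps in main are illustrative values chosen so that their difference matches the one in the log; they are not taken from the build.

```java
import java.util.function.LongSupplier;

/** Hypothetical stand-in for the master-side skew check; not HBase's actual code. */
final class ClockSkewCheck {

    /** Hypothetical exception type, standing in for ClockOutOfSyncException. */
    static final class ClockSkewException extends RuntimeException {
        ClockSkewException(String message) {
            super(message);
        }
    }

    // Injectable clock, playing the role EnvironmentEdgeManager plays in HBase,
    // so master and region server can be driven by the same time source in tests.
    private final LongSupplier clock;
    private final long maxSkewMs;

    ClockSkewCheck(LongSupplier clock, long maxSkewMs) {
        this.clock = clock;
        this.maxSkewMs = maxSkewMs;
    }

    /** Rejects the server if abs(masterTime - reportedTime) exceeds maxSkewMs. */
    void check(String serverName, long reportedTimeMs) {
        long masterTimeMs = clock.getAsLong();
        long skewMs = Math.abs(masterTimeMs - reportedTimeMs);
        if (skewMs > maxSkewMs) {
            // Include both timestamps, not only the difference, so the log
            // shows which side reported the implausible value.
            throw new ClockSkewException("Server " + serverName
                + " has been rejected; reported time " + reportedTimeMs
                + "ms is too far out of sync with master time " + masterTimeMs
                + "ms. Time difference of " + skewMs
                + "ms > max allowed of " + maxSkewMs + "ms");
        }
    }

    public static void main(String[] args) {
        // Illustrative values only: their difference equals the one in the log.
        ClockSkewCheck check = new ClockSkewCheck(() -> 1590112065298L, 30000L);
        try {
            check.check("rs-example,36973,1590112065298", 604800457L);
        } catch (ClockSkewException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With both values in the message, a reported time that is nowhere near the current epoch would point at the region server's time source (for example, a delegate injected by a test) rather than at the master.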
> Figure out why tests randomly fail with master not able to initialize in 200
> seconds
> ------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-4216
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4216
>             Project: Phoenix
>          Issue Type: Bug
>   Affects Versions: 5.0.0, 4.15.0, 4.14.3
>            Reporter: Samarth Jain
>            Priority: Major
>              Labels: phoenix-hardening, precommit, quality-improvement
>             Fix For: 5.1.0, 4.16.0
>
>         Attachments: Precommit-3849.log
>
>
> Sample failure:
> https://builds.apache.org/job/PreCommit-PHOENIX-Build/1450//testReport/
> [~apurtell] - Looking at the thread dump in the above link, do you see why
> master startup failed? I couldn't see any obvious deadlocks

--
This message was sent by Atlassian Jira
(v8.3.4#803005)