[ https://issues.apache.org/jira/browse/HBASE-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554472#comment-13554472 ]
Lars Hofhansl commented on HBASE-7551: -------------------------------------- Either way :) Ram did the work of tracking the issue down... > nodeChildrenChange event may happen after the transition to > RS_ZK_REGION_SPLITTING in SplitTransaction causing the SPLIT event to be > missed in the master side. > --------------------------------------------------------------------------------------------------------------------------------------------------------------- > > Key: HBASE-7551 > URL: https://issues.apache.org/jira/browse/HBASE-7551 > Project: HBase > Issue Type: Bug > Components: master > Affects Versions: 0.94.4 > Reporter: ramkrishna.s.vasudevan > Assignee: ramkrishna.s.vasudevan > Priority: Critical > Fix For: 0.96.0, 0.94.5 > > Attachments: 7551-0.94-test.txt, 7551-0.94-v1.txt, 7551-trunk.txt, > 7551-trunk-v2.txt, testSplitTransactionOnCluster-output.txt > > > This came from HBASE-7468. > I got the issue. I am able to reproduce this > See the logs > {code} > 2013-01-14 14:37:21,760 INFO [main] regionserver.SplitTransaction(216): > Starting split of region > testShouldClearRITWhenNodeFoundInSplittingState,,1358154439514.a9e57d09c58b3ef3b949d602232fb2c2. > 2013-01-14 14:37:21,760 DEBUG [main] regionserver.SplitTransaction(871): > regionserver:61665-0x13c384e4e4f0002 Creating ephemeral node for > a9e57d09c58b3ef3b949d602232fb2c2 in SPLITTING state > 2013-01-14 14:37:21,844 DEBUG [main] zookeeper.ZKAssign(757): > regionserver:61665-0x13c384e4e4f0002 Attempting to transition node > a9e57d09c58b3ef3b949d602232fb2c2 from RS_ZK_REGION_SPLITTING to > RS_ZK_REGION_SPLITTING > 2013-01-14 14:37:21,849 DEBUG [Thread-873-EventThread] > zookeeper.ZooKeeperWatcher(277): master:62334-0x13c384e4e4f001b Received > ZooKeeper Event, type=NodeChildrenChanged, state=SyncConnected, > path=/hbase/unassigned > 2013-01-14 14:37:21,853 DEBUG [main] zookeeper.ZKUtil(1565): > regionserver:61665-0x13c384e4e4f0002 Retrieved 140 byte(s) of data from znode > /hbase/unassigned/a9e57d09c58b3ef3b949d602232fb2c2; > data=region=testShouldClearRITWhenNodeFoundInSplittingState,,1358154439514.a9e57d09c58b3ef3b949d602232fb2c2., > origin=Ram.Home,61665,1358154325430, state=RS_ZK_REGION_SPLITTING > 2013-01-14 14:37:21,918 DEBUG [main] zookeeper.ZKAssign(820): > regionserver:61665-0x13c384e4e4f0002 Successfully transitioned node > a9e57d09c58b3ef3b949d602232fb2c2 from RS_ZK_REGION_SPLITTING to > RS_ZK_REGION_SPLITTING > 2013-01-14 14:37:21,919 DEBUG [Thread-873-EventThread] zookeeper.ZKUtil(417): > master:62334-0x13c384e4e4f001b Set watcher on existing znode > /hbase/unassigned/a9e57d09c58b3ef3b949d602232fb2c2 > {code} > Here we can observe that the SPLITTING node was first created. Then we > transit it to SPLITTING to SPLITTING so that AM can have the nodeDataChange > event. But for the nodeDataChange event to happen first nodeChildrenChange > event should happen so that the master can set a watcher on the node. > Now when this hang happens, we can see that after the transition happens only > then the watcher is set by nodeChildrenChange event and so the SPLITTING to > SPLITTING event itself is missed or skipped. > Ideally the nodeChildrenChange event iterates thro the list of new znodes on > the /hbase/assignment nodes. And then creates a watcher on that. One reason > could be there are more than one znode and so the watch setting operation > takes time. The order of execution is different when we try running from > eclipse and when we run mvn tests. > My conclusion is that the testcase actually reveals the problem but the same > can happen in any case where the SPLITTING event can get missed out. May be > some of the SPLIT related bugs that were raised is due to this? Need to > analyse. > Any suggestions welcome. We should ensure that the transition from SPLITTING > to SPLITTING should happen only after the master has set the watch on the > znode and we should be sure of that. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira