[ https://issues.apache.org/jira/browse/HBASE-11217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14006683#comment-14006683 ]
Enis Soztutar commented on HBASE-11217: --------------------------------------- hadoopqa is still not updated with git repo AFAIK. I've run the tests locally. Committed the addendum to master and 0.98. > Race between SplitLogManager task creation + TimeoutMonitor > ----------------------------------------------------------- > > Key: HBASE-11217 > URL: https://issues.apache.org/jira/browse/HBASE-11217 > Project: HBase > Issue Type: Bug > Reporter: Enis Soztutar > Assignee: Enis Soztutar > Priority: Critical > Fix For: 0.99.0, 0.98.3 > > Attachments: hbase-11217_addendum.patch, hbase-11217_v1.patch, > hbase-11217_v2.patch > > > Some time ago, we reported a test failure in HBASE-11036, which resulted in > already-split and merged regions coming back to life, causing split brain for > region boundaries and resulting in data loss. > It turns out that the root cause was not concurrent online schema change + > region split/merge, but meta log splitting failing and the meta updates > getting lost. This in turn causes the region split/merge information and > assignment to be lost causing large scale data loss. > Logs below shows that the split task for meta log is created, but before the > znode is created, the timeout thread kicks in and sees the unassigned task. > Then it does a get on znode which fails with NoNode (because the znode is not > created yet). This causes the task to be marked complete (setDone(path, > SUCCESS)) which means that the logs are lost. Meta is assigned elsewhere (and > opened with the same seqId as previous) confirming data loss in meta. > {code} > 2014-04-16 18:31:26,267 INFO > [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] > handler.MetaServerShutdownHandler: Splitting hbase:meta logs for > hor13n03.gq1.ygridcore.net,60020,1397672668647 > 2014-04-16 18:31:26,274 DEBUG > [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] master.MasterFileSystem: > Renamed region directory: > hdfs://hor13n01.gq1.ygridcore.net:8020/apps/hbase/data/WALs/hor13n03.gq1.ygridcore.net,60020,1397672668647-splitting > 2014-04-16 18:31:26,274 INFO > [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] master.SplitLogManager: dead > splitlog workers [hor13n03.gq1.ygridcore.net,60020,1397672668647] > 2014-04-16 18:31:26,276 DEBUG > [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] master.SplitLogManager: > Scheduling batch of logs to split > 2014-04-16 18:31:26,276 INFO > [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] master.SplitLogManager: > started splitting 1 logs in > [hdfs://hor13n01.gq1.ygridcore.net:8020/apps/hbase/data/WALs/hor13n03.gq1.ygridcore.net,60020,1397672668647-splitting] > 2014-04-16 18:31:26,276 INFO > [hor13n02.gq1.ygridcore.net,60000,1397672191204.splitLogManagerTimeoutMonitor] > master.SplitLogManager: total tasks = 1 unassigned = 1 > tasks={/hbase/splitWAL/WALs%2Fhor13n03.gq1.ygridcore.net%2C60020%2C1397672668647-splitting%2Fhor13n03.gq1.ygridcore.net%252C60020%252C1397672668647.1397672681632.meta=last_update > = -1 last_version = - > 2014-04-16 18:31:26,276 DEBUG > [hor13n02.gq1.ygridcore.net,60000,1397672191204.splitLogManagerTimeoutMonitor] > master.SplitLogManager: resubmitting unassigned task(s) after timeout > 2014-04-16 18:31:26,277 WARN [main-EventThread] > master.SplitLogManager$GetDataAsyncCallback: task znode > /hbase/splitWAL/WALs%2Fhor13n03.gq1.ygridcore.net%2C60020%2C1397672668647-splitting%2Fhor13n03.gq1.ygridcore.net%252C60020%252C1397672668647.1397672681632.meta > vanished. > 2014-04-16 18:31:26,277 INFO [main-EventThread] master.SplitLogManager: Done > splitting > /hbase/splitWAL/WALs%2Fhor13n03.gq1.ygridcore.net%2C60020%2C1397672668647-splitting%2Fhor13n03.gq1.ygridcore.net%252C60020%252C1397672668647.1397672681632.meta > 2014-04-16 18:31:26,282 DEBUG [main-EventThread] master.SplitLogManager: put > up splitlog task at znode > /hbase/splitWAL/WALs%2Fhor13n03.gq1.ygridcore.net%2C60020%2C1397672668647-splitting%2Fhor13n03.gq1.ygridcore.net%252C60020%252C1397672668647.1397672681632.meta > > > 2014-04-16 18:31:26,286 WARN > [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] master.SplitLogManager: > returning success without actually splitting and deleting all the log files > in path > hdfs://hor13n01.gq1.ygridcore.net:8020/apps/hbase/data/WALs/hor13n03.gq1.ygridcore.net,60020,1397672668647-splitting > 2014-04-16 18:31:26,286 INFO > [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] master.SplitLogManager: > finished splitting (more than or equal to) 9 bytes in 1 log files in > [hdfs://hor13n01.gq1.ygridcore.net:8020/apps/hbase/data/WALs/hor13n03.gq1.ygridcore.net,60020,1397672668647-splitting] > in 10ms > 2014-04-16 18:31:26,290 DEBUG [main-EventThread] > master.SplitLogManager$DeleteAsyncCallback: deleted > /hbase/splitWAL/WALs%2Fhor13n03.gq1.ygridcore.net%2C60020%2C1397672668647-splitting%2Fhor13n03.gq1.ygridcore.net%252C60020%252C1397672668647.1397672681632.meta > 2014-04-16 18:31:26,290 WARN [main-EventThread] > master.SplitLogManager$GetDataAsyncCallback: task znode > /hbase/splitWAL/WALs%2Fhor13n03.gq1.ygridcore.net%2C60020%2C1397672668647-splitting%2Fhor13n03.gq1.ygridcore.net%252C60020%252C1397672668647.1397672681632.meta > vanished. > 2014-04-16 18:31:26,290 DEBUG [main-EventThread] master.SplitLogManager: > unacquired orphan task is done > /hbase/splitWAL/WALs%2Fhor13n03.gq1.ygridcore.net%2C60020%2C1397672668647-splitting%2Fhor13n03.gq1.ygridcore.net%252C60020%252C1397672668647.1397672681632.meta > 2014-04-16 18:31:26,291 INFO [main-EventThread] master.SplitLogManager: task > /hbase/splitWAL/RESCAN0000000383 entered state: DONE > hor13n02.gq1.ygridcore.net,60000,1397672191204 > 2014-04-16 18:31:26,291 DEBUG > [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] master.AssignmentManager: > based on AM, current region=hbase:meta,,1.1588230740 is on > server=hor13n03.gq1.ygridcore.net,60020,1397672668647 server being checked: > hor13n03.gq1.ygridcore.net,60020,1397672668647 > 2014-04-16 18:31:26,291 INFO > [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] > handler.MetaServerShutdownHandler: Server > hor13n03.gq1.ygridcore.net,60020,1397672668647 was carrying META. Trying to > assign. > 2014-04-16 18:31:26,291 INFO > [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] master.RegionStates: > Transitioned {1588230740 state=OPEN, ts=1397672681933, > server=hor13n03.gq1.ygridcore.net,60020,1397672668647} to {1588230740 > state=OFFLINE, ts=1397673086291, > server=hor13n03.gq1.ygridcore.net,60020,1397672668647} > 2014-04-16 18:31:26,291 INFO > [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] master.RegionStates: > Offlined 1588230740 from hor13n03.gq1.ygridcore.net,60020,1397672668647 > 2014-04-16 18:31:26,299 INFO [main-EventThread] > master.SplitLogManager$DeleteAsyncCallback: > /hbase/splitWAL/WALs%2Fhor13n03.gq1.ygridcore.net%2C60020%2C1397672668647-splitting%2Fhor13n03.gq1.ygridcore.net%252C60020%252C1397672668647.1397672681632.meta > does not exist. Either was created but deleted behind our back by another > pending delete OR was deleted in earl > 2014-04-16 18:31:26,299 DEBUG [main-EventThread] master.SplitLogManager: > deleted task without in memory state > /hbase/splitWAL/WALs%2Fhor13n03.gq1.ygridcore.net%2C60020%2C1397672668647-splitting%2Fhor13n03.gq1.ygridcore.net%252C60020%252C1397672668647.1397672681632.meta > 2014-04-16 18:31:26,299 DEBUG [main-EventThread] > master.SplitLogManager$DeleteAsyncCallback: deleted > /hbase/splitWAL/RESCAN0000000383 > 2014-04-16 18:31:26,299 DEBUG [main-EventThread] master.SplitLogManager: > deleted task without in memory state /hbase/splitWAL/RESCAN0000000383 > 2014-04-16 18:31:26,301 INFO > [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] catalog.CatalogTracker: > Failed verification of hbase:meta,,1 at > address=hor13n03.gq1.ygridcore.net,60020,1397672668647, > exception=java.net.ConnectException: Connection refused > 2014-04-16 18:31:26,301 INFO > [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] > zookeeper.ZooKeeperNodeTracker: Unsetting hbase:meta region location in > ZooKeeper > 2014-04-16 18:31:26,315 DEBUG > [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] master.AssignmentManager: No > previous transition plan found (or ignoring an existing plan) for > hbase:meta,,1.1588230740; generated random plan=hri=hbase:meta,,1.1588230740, > src=, dest=hor13n04.gq1.ygridcore.net,60020,1397672685370; 3 (online=3, > available=3) available servers, forceNew > 2014-04-16 18:31:26,315 DEBUG > [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] zookeeper.ZKAssign: > master:60000-0x3456a48dd7d0223, > quorum=hor13n04.gq1.ygridcore.net:2181,hor13n03.gq1.ygridcore.net:2181,hor13n20.gq1.ygridcore.net:2181, > baseZNode=/hbase Creating (or updating) unassigned node 1588230740 with > OFFLINE state > 2014-04-16 18:31:26,323 INFO > [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] master.AssignmentManager: > Assigning hbase:meta,,1.1588230740 to > hor13n04.gq1.ygridcore.net,60020,1397672685370 > 2014-04-16 18:31:26,323 INFO > [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] master.RegionStates: > Transitioned {1588230740 state=OFFLINE, ts=1397673086315, > server=hor13n03.gq1.ygridcore.net,60020,1397672668647} to {1588230740 > state=PENDING_OPEN, ts=1397673086323, > server=hor13n04.gq1.ygridcore.net,60020,1397672685370} > 2014-04-16 18:31:28,337 DEBUG > [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] master.DeadServer: Finished > processing hor13n03.gq1.ygridcore.net,60020,1397672668647 > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)