[ https://issues.apache.org/jira/browse/HBASE-10249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jean-Daniel Cryans updated HBASE-10249:
---------------------------------------
    Attachment: HBASE-10249-0.94-v0.patch

Two things I've noticed that I'm fixing in the attached patch for 0.94:

- The multi path doesn't check if the znode that we're moving is ours, so we end up deleting our own queue (!!!).
- Looking at the link for the latest failure, we do check that in the non-multi path, but each check takes a few hundred milliseconds. It seems that those checks all end up counting towards the 10-second limit that we have in order to clear all the queues. I moved the checking of the path before the sleeping in NodeFailoverWorker.run so that we don't waste time on ourselves.

Regardless, this code is racy:

{noformat}
int numberOfOldSource = 1; // default wait once
while (numberOfOldSource > 0) {
  Thread.sleep(SLEEP_TIME);
  numberOfOldSource = manager.getOldSources().size();
}
{noformat}

We basically say "let's wait 10 seconds and see if we can transfer _all_ the queues during that time". If some queues are still being transferred, and the others we did transfer are already done, the finished ones won't count as an oldSource, and so we can miss the ones still in flight. The most extreme case is moving 1 queue with enough znodes that it takes more than 10 seconds to move (I've seen that), in which case the sync tool will stop even though there might be many more queues to transfer.

> Intermittent TestReplicationSyncUpTool failure
> ----------------------------------------------
>
>                 Key: HBASE-10249
>                 URL: https://issues.apache.org/jira/browse/HBASE-10249
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Lars Hofhansl
>            Assignee: Demai Ni
>             Fix For: 0.98.0, 0.96.2, 0.99.0, 0.94.17
>
>         Attachments: HBASE-10249-0.94-v0.patch, HBASE-10249-trunk-v0.patch
>
>
> New issue to keep track of this.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)