[jira] [Updated] (HBASE-10249) Intermittent TestReplicationSyncUpTool failure

Jean-Daniel Cryans (JIRA) Thu, 16 Jan 2014 14:15:52 -0800

     [ 
https://issues.apache.org/jira/browse/HBASE-10249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jean-Daniel Cryans updated HBASE-10249:
---------------------------------------

    Attachment: HBASE-10249-0.94-v0.patch

Two things I've noticed that I'm fixing in the attached patch for 0.94:

- The multi path doesn't check whether the znode we're moving is our own, so we 
end up deleting our own queue (!!!).
- Looking at the link for the latest failure, the non-multi path does make that 
check, but each check takes a few hundred milliseconds. Those delays all seem 
to count toward the 10-second limit we have for clearing all the queues. I 
moved the path check ahead of the sleep in NodeFailoverWorker.run so that we 
don't waste time on ourselves.
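
To illustrate the reordering in the second point, here is a minimal sketch. It 
is not the code from the patch: the Worker class, its fields, and the 
claimedQueues counter are hypothetical stand-ins, and the only thing it shows 
is the ownership check running before the sleep.

```java
// Hypothetical stand-in for NodeFailoverWorker; the field names and the
// claim step are assumptions for illustration, not HBase's actual code.
class FailoverOrderSketch {
    static final class Worker implements Runnable {
        private final String deadRsZnode;
        private final String ownRsZnode;
        private final long sleepMillis;
        boolean slept = false;  // true once the worker has paid the sleep
        int claimedQueues = 0;  // queues this worker went on to claim

        Worker(String deadRsZnode, String ownRsZnode, long sleepMillis) {
            this.deadRsZnode = deadRsZnode;
            this.ownRsZnode = ownRsZnode;
            this.sleepMillis = sleepMillis;
        }

        @Override
        public void run() {
            // Check ownership first: our own znode costs no sleep time
            // and is never treated as a dead server's queue.
            if (deadRsZnode.equals(ownRsZnode)) {
                return;
            }
            try {
                Thread.sleep(sleepMillis);
                slept = true;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            claimedQueues++; // placeholder for the real queue transfer
        }
    }

    public static void main(String[] args) {
        Worker self = new Worker("rs1", "rs1", 10L);
        self.run();
        Worker other = new Worker("rs2", "rs1", 10L);
        other.run();
        System.out.println("self slept: " + self.slept
            + ", other claimed: " + other.claimedQueues);
    }
}
```

With the check placed after the sleep instead, the worker for our own znode 
would spend the full sleep before discovering there is nothing to do.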

Regardless, this code is racy:

{noformat}
    int numberOfOldSource = 1; // default wait once
    while (numberOfOldSource > 0) {
      Thread.sleep(SLEEP_TIME);
      numberOfOldSource = manager.getOldSources().size();
    }
{noformat}

We basically say "let's wait 10 seconds and see if we can transfer _all_ the 
queues during that time". If some queues are still being transferred while the 
ones we already transferred have finished, the in-flight queues don't count as 
an oldSource yet, so the loop can exit and miss them. The most extreme case is 
moving 1 queue with enough znodes that it takes more than 10 seconds to move 
(I've seen that), in which case the sync tool will stop even though there 
might be many more queues to transfer.
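
One possible way to close that gap, sketched below, is to keep waiting while 
either a transfer is still in flight or a transferred source is still 
replicating. This is not what the attached patch does. The pendingTransfers 
supplier is a hypothetical counter that HBase does not expose as such, and the 
sketch assumes a transfer registers its old source before it decrements that 
counter.

```java
import java.util.function.IntSupplier;

// A sketch only: pendingTransfers is a hypothetical counter, and oldSources
// stands in for manager.getOldSources().size().
class SyncUpWaitSketch {
    /**
     * Polls until no transfer is in flight and no old source remains.
     * Returns the number of polls performed, or -1 if interrupted.
     */
    static int waitUntilDrained(IntSupplier pendingTransfers,
                                IntSupplier oldSources,
                                long sleepMillis) {
        int polls = 0;
        while (true) {
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return -1;
            }
            polls++;
            // Read pendingTransfers first: once it is 0, every transferred
            // queue is assumed to be registered as an old source already,
            // so the second read cannot miss one.
            if (pendingTransfers.getAsInt() == 0 && oldSources.getAsInt() == 0) {
                return polls;
            }
        }
    }

    public static void main(String[] args) {
        int[] pendingReads = {0};
        int[] sourceReads = {0};
        // Simulated state: transfers finish after two reads, and the one
        // resulting old source drains after one read.
        int polls = waitUntilDrained(
            () -> pendingReads[0]++ < 2 ? 1 : 0,
            () -> sourceReads[0]++ < 1 ? 1 : 0,
            1L);
        System.out.println("polls: " + polls);
    }
}
```

Unlike the original loop, a single slow queue keeps this loop alive for as 
long as its transfer is pending, at the cost of having no upper bound on the 
wait.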

> Intermittent TestReplicationSyncUpTool failure
> ----------------------------------------------
>
>                 Key: HBASE-10249
>                 URL: https://issues.apache.org/jira/browse/HBASE-10249
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Lars Hofhansl
>            Assignee: Demai Ni
>             Fix For: 0.98.0, 0.96.2, 0.99.0, 0.94.17
>
>         Attachments: HBASE-10249-0.94-v0.patch, HBASE-10249-trunk-v0.patch
>
>
> New issue to keep track of this.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
