[ https://issues.apache.org/jira/browse/ACCUMULO-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13916304#comment-13916304 ]

ASF subversion and git services commented on ACCUMULO-2422:
-----------------------------------------------------------

Commit 7eeff02c7cf883765a33575a19d208be30e1e17c in accumulo's branch refs/heads/1.6.0-SNAPSHOT from [~bhavanki]
[ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=7eeff02 ]

ACCUMULO-2422 Refine renewal of master lock watcher

The first commit for ACCUMULO-2422 succeeds in renewing the watch on another
master's lock node when needed. This commit refines the solution:

- The renewal was happening even after the master had acquired the lock, which
  led to a spurious error message in the log. This commit skips renewing the
  watch in that case.
- If the renewal returns a null status, meaning the other master's lock node
  has disappeared, the master now immediately tries again to acquire the lock.
  This matches watch establishment in other areas.

Trace-level logging was added throughout ZooLock to assist future
troubleshooting.
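
For reference, the watch-renewal behavior described above can be sketched
against the plain ZooKeeper API roughly as follows. This is an illustrative
sketch only, not the actual ZooLock code; the class and member names
(LockWatcherSketch, priorLockPath, lockHeld, tryToAcquireLock) are
hypothetical.

{noformat}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class LockWatcherSketch implements Watcher {
  private final ZooKeeper zk;
  private final String priorLockPath;   // the other master's lock node being watched
  private volatile boolean lockHeld = false;

  LockWatcherSketch(ZooKeeper zk, String priorLockPath) {
    this.zk = zk;
    this.priorLockPath = priorLockPath;
  }

  @Override
  public void process(WatchedEvent event) {
    // ZooKeeper watches are one-shot, so after an event (e.g. a session
    // reconnect) the watch must be renewed or a deletion of the other
    // master's lock node could go unnoticed.
    renewWatch();
  }

  private void renewWatch() {
    if (lockHeld) {
      // Fix 1: once this master holds the lock there is nothing left to
      // watch; re-registering here only produced a spurious log error.
      return;
    }
    try {
      Stat stat = zk.exists(priorLockPath, this); // re-register the watch
      if (stat == null) {
        // Fix 2: the other master's lock node is already gone, so no watch
        // event will ever fire; try to acquire the lock immediately.
        tryToAcquireLock();
      }
    } catch (KeeperException | InterruptedException e) {
      // real code would log and handle retry/session loss here
    }
  }

  private void tryToAcquireLock() {
    // placeholder: attempt to create this master's ephemeral lock node and
    // set lockHeld = true on success
  }
}
{noformat}

The key point is that exists() both re-registers the watch and reports whether
the watched node is still present, which is why a null return can be treated
as a signal to retry lock acquisition right away.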


> Backup master can miss acquiring lock when primary exits
> --------------------------------------------------------
>
>                 Key: ACCUMULO-2422
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2422
>             Project: Accumulo
>          Issue Type: Bug
>          Components: fate, master
>    Affects Versions: 1.5.1
>            Reporter: Bill Havanki
>            Assignee: Bill Havanki
>            Priority: Critical
>              Labels: failover, locking
>             Fix For: 1.6.0, 1.5.2
>
>
> While running randomwalk tests with agitation for the 1.5.1 release, I've 
> seen situations where a backup master that is eligible to grab the master 
> lock continues to wait. When this condition arises and the other master 
> restarts, both wait for the lock without success.
> I cannot reproduce the problem reliably, and I think more investigation is 
> needed to see what circumstances could be causing the problem.
> h3. Diagnosis and Workaround
> This failure condition can occur on startup and on backup/active failover of 
> the Master role. If the following log entry is the final entry in all Master 
> logs, you should restart all Master roles, staggering the restarts by a few seconds.
> {noformat}
> [master.Master] INFO : trying to get master lock
> {noformat}
> If starting a cluster with multiple Master roles, you should stagger Master 
> role starts by a few seconds.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
