[ 
https://issues.apache.org/jira/browse/ACCUMULO-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915804#comment-13915804
 ] 

Bill Havanki commented on ACCUMULO-2422:
----------------------------------------

The hang I observed lasts indefinitely, or at least until the agitator bounces 
both masters.

Stack dumps haven't been terribly helpful so far, although I haven't done a lot 
of testing yet. Looking at the ZK data is more informative. I added a bunch of 
logging to the master covering what it sees in ZK and what it decides to do. So 
far it appears, in at least one case, that the backup master just didn't notice 
the active master's node getting deleted. I have even more logging in there now, 
which I'm checking this morning, to see whether it gets no event at all or gets 
one and fails to process it.
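
To illustrate how a deletion can go unnoticed in general: ZooKeeper watches 
are one-shot triggers, so a waiter that does not re-register its watch after 
an unrelated event will never hear about a later delete. The sketch below is 
a toy in-memory model of that one scenario only. It is not Accumulo's lock 
code, it does not use the ZooKeeper client API, and it is not a diagnosis of 
this issue; ToyZk, Waiter, rearm, and demo are all hypothetical names made up 
for the example.

```python
class ToyZk:
    """Toy stand-in for a store with one-shot watches (not the real ZK API)."""

    def __init__(self):
        self.nodes = set()
        self.watches = {}  # path -> list of one-shot callbacks

    def create(self, path):
        self.nodes.add(path)

    def exists(self, path, watcher=None):
        """Return whether path exists; optionally register a one-shot watch."""
        if watcher is not None:
            self.watches.setdefault(path, []).append(watcher)
        return path in self.nodes

    def _fire(self, path, event):
        # Watches are removed before delivery, so each one fires at most once.
        for callback in self.watches.pop(path, []):
            callback(event)

    def touch(self, path):
        """Simulate any non-delete event on the node."""
        self._fire(path, "changed")

    def delete(self, path):
        self.nodes.discard(path)
        self._fire(path, "deleted")


class Waiter:
    """Hypothetical backup process waiting for the holder's node to vanish."""

    def __init__(self, zk, lock_path, rearm):
        self.zk = zk
        self.lock_path = lock_path
        self.rearm = rearm  # assumption: whether to re-watch after any event
        self.holds_lock = False
        self.events = []

    def start(self):
        self._watch_or_acquire()

    def _watch_or_acquire(self):
        # Check and watch in one call so no delete can slip in between.
        if not self.zk.exists(self.lock_path, self._on_event):
            self.holds_lock = True

    def _on_event(self, event):
        self.events.append(event)
        if event == "deleted" or self.rearm:
            # Re-evaluate state instead of trusting the event alone.
            self._watch_or_acquire()
        # Otherwise the one-shot watch is now gone and nothing replaces it.


def demo():
    zk = ToyZk()
    zk.create("/lock")
    naive = Waiter(zk, "/lock", rearm=False)
    robust = Waiter(zk, "/lock", rearm=True)
    naive.start()
    robust.start()
    zk.touch("/lock")   # consumes both one-shot watches
    zk.delete("/lock")  # only a re-armed watch sees this
    return naive, robust
```

In this model the waiter created with rearm=False never records the delete 
and waits forever, which resembles the "no event at all" case above, while 
the rearm=True waiter re-checks the node after every event and takes over. 
Whether anything like this is happening in the master is exactly what the 
extra logging is meant to show.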

> Backup master can miss acquiring lock when primary exits
> --------------------------------------------------------
>
>                 Key: ACCUMULO-2422
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2422
>             Project: Accumulo
>          Issue Type: Bug
>          Components: fate, master
>    Affects Versions: 1.5.0
>            Reporter: Bill Havanki
>            Assignee: Bill Havanki
>            Priority: Critical
>              Labels: failover, locking
>
> While running randomwalk tests with agitation for the 1.5.1 release, I've 
> seen situations where a backup master that is eligible to grab the master 
> lock continues to wait. When this condition arises and the other master 
> restarts, both wait for the lock without success.
> I cannot reproduce the problem reliably, and I think more investigation is 
> needed to see what circumstances could be causing the problem.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
