[ 
https://issues.apache.org/jira/browse/ACCUMULO-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keith Turner updated ACCUMULO-954:
----------------------------------

    Priority: Critical  (was: Minor)
    
> ZooLock watcher can stop watching
> ---------------------------------
>
>                 Key: ACCUMULO-954
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-954
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.4.2
>            Reporter: Adam Fuchs
>            Assignee: Keith Turner
>            Priority: Critical
>             Fix For: 1.5.0, 1.4.3
>
>
> This bug causes tablet servers to fail to recognize when they lose their 
> locks. I think the worst that can happen is that a tablet server fails to 
> die after it loses its lock, which could bog down clients and create a lot 
> of noise in the cluster. I believe it could also generate useless files that 
> never get garbage collected. !METADATA table write protections and logger 
> write protections should prevent any permanent damage or data loss. We have 
> seen this result in warnings and errors that look like multiple hosting of 
> tablets.
> {code}
> 2013-01-09 19:59:27,742 [tabletserver.TabletServer] INFO : port = 9997
> 2013-01-09 19:59:27,926 [zookeeper.ZooLock] DEBUG: event /accumulo/655f93d8-20fc-451f-a457-458b5717a11e/tservers/172.16.2.25:9997 NodeDeleted SyncConnected
> 2013-01-09 19:59:27,931 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2013-01-09 19:59:32,943 [tabletserver.TabletServer] DEBUG: Obtained tablet server lock /accumulo/655f93d8-20fc-451f-a457-458b5717a11e/tservers/172.16.2.25:9997/zlock-0000000000
> 2013-01-09 19:59:36,703 [tabletserver.TabletServer] DEBUG: Got loadTablet message from user: !SYSTEM
> {code}
> Here's what happened:
> 1. Tablet server fails to get lock, triggering the watcher on the parent node.
> 2. Watcher doesn't get reset, and doesn't take any action.
> 3. Loop in TabletServer:~2659 retries, but uses the same ZooLock object.
> 4. TabletServer loses its lock, but receives a connection loss message before 
> the NodeDeleted message.
> 5. TabletServer continues to try to do work instead of killing itself.
> We could probably patch this for 1.4 by creating a new ZooLock inside the 
> announceExistence loop instead of reusing a single instance. Eventually, both 
> of the Watchers ought to have an else branch that either resets the watch 
> (making them resilient against ZooKeeper connection hiccups) or just kills 
> the server to be safe.
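
The else branch proposed above can be illustrated with a small, self-contained Java sketch. It does not use Accumulo's ZooLock or the ZooKeeper client API. `FakeStore`, `LockWatcher`, `EventType`, and `detectsLockLoss` are hypothetical names invented for this example, and the store simply assumes one-shot watches (a watch is consumed by the first event delivered on its path). Under those assumptions, it shows how a watcher that ignores an unexpected event never sees the later NodeDeleted, while one that resets the watch does.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Illustrative sketch only: every type here is a hypothetical stand-in,
// not Accumulo's ZooLock and not the ZooKeeper client API.
final class WatchResetSketch {

    enum EventType { NODE_DELETED, NODE_CHILDREN_CHANGED }

    /** In-memory store that assumes one-shot watches. */
    static final class FakeStore {
        private final Map<String, Consumer<EventType>> watches = new HashMap<>();

        void watch(String path, Consumer<EventType> watcher) {
            watches.put(path, watcher);
        }

        /** Delivers an event; the watch is consumed even if the watcher ignores it. */
        void fire(String path, EventType type) {
            Consumer<EventType> watcher = watches.remove(path);
            if (watcher != null) {
                watcher.accept(type);
            }
        }
    }

    /** Watcher with an optional else branch that re-registers itself. */
    static final class LockWatcher implements Consumer<EventType> {
        private final FakeStore store;
        private final String path;
        private final boolean resetOnOtherEvents;
        private boolean lockLost = false;

        LockWatcher(FakeStore store, String path, boolean resetOnOtherEvents) {
            this.store = store;
            this.path = path;
            this.resetOnOtherEvents = resetOnOtherEvents;
            store.watch(path, this);
        }

        @Override
        public void accept(EventType type) {
            if (type == EventType.NODE_DELETED) {
                lockLost = true; // a real server would shut itself down here
            } else if (resetOnOtherEvents) {
                store.watch(path, this); // the else branch: reset the watch
            }
            // With neither branch taken, the watch is gone and the watcher goes silent.
        }

        boolean isLockLost() {
            return lockLost;
        }
    }

    /** Fires an unrelated event, then a delete, and reports whether the loss was seen. */
    static boolean detectsLockLoss(boolean resetOnOtherEvents) {
        FakeStore store = new FakeStore();
        String path = "/tservers/host:9997"; // placeholder path for the sketch
        LockWatcher watcher = new LockWatcher(store, path, resetOnOtherEvents);
        store.fire(path, EventType.NODE_CHILDREN_CHANGED);
        store.fire(path, EventType.NODE_DELETED);
        return watcher.isLockLost();
    }

    public static void main(String[] args) {
        System.out.println("without reset: " + detectsLockLoss(false));
        System.out.println("with reset:    " + detectsLockLoss(true));
    }
}
```

The other option named in the description, killing the server on any unexpected event, would replace the re-registration call with a shutdown. The sketch only models the reset variant.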

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
