[ 
https://issues.apache.org/jira/browse/ACCUMULO-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13574206#comment-13574206
 ] 

Hudson commented on ACCUMULO-954:
---------------------------------

Integrated in Accumulo-Trunk-Hadoop-2.0 #73 (See 
[https://builds.apache.org/job/Accumulo-Trunk-Hadoop-2.0/73/])
    ACCUMULO-954 made ZooLock report when it is no longer able to monitor the lock 
node and therefore does not know the status of the lock (Revision 1443790)

     Result = SUCCESS
kturner : 
Files : 
* 
/accumulo/trunk/fate/src/main/java/org/apache/accumulo/fate/zookeeper/ZooLock.java
* 
/accumulo/trunk/server/src/main/java/org/apache/accumulo/server/gc/SimpleGarbageCollector.java
* 
/accumulo/trunk/server/src/main/java/org/apache/accumulo/server/master/Master.java
* 
/accumulo/trunk/server/src/main/java/org/apache/accumulo/server/master/TServerLockWatcher.java
* 
/accumulo/trunk/server/src/main/java/org/apache/accumulo/server/tabletserver/TabletServer.java
* 
/accumulo/trunk/server/src/main/java/org/apache/accumulo/server/zookeeper/ZooLock.java
* 
/accumulo/trunk/test/src/main/java/org/apache/accumulo/test/functional/SplitRecoveryTest.java
* 
/accumulo/trunk/test/src/main/java/org/apache/accumulo/test/functional/ZombieTServer.java
* 
/accumulo/trunk/test/src/test/java/org/apache/accumulo/fate/zookeeper/ZooLockTest.java

                
> ZooLock watcher can stop watching
> ---------------------------------
>
>                 Key: ACCUMULO-954
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-954
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.4.2
>            Reporter: Adam Fuchs
>            Assignee: Keith Turner
>            Priority: Minor
>             Fix For: 1.5.0, 1.4.3
>
>
> Basically, this will result in tablet servers failing to recognize when they 
> lose their locks. I think the worst that can happen with this is a tablet 
> server can fail to die after it loses its lock, which could bog down clients 
> and create a bunch of noise in the cluster. I believe there could also be 
> useless files generated that wouldn't get garbage collected. !METADATA table 
> write protections and logger write protections should prevent any permanent 
> damage or data loss. We have seen this result in warnings and errors that 
> look like multiple hosting of tablets.
> {code}
> 2013-01-09 19:59:27,742 [tabletserver.TabletServer] INFO : port = 9997
> 2013-01-09 19:59:27,926 [zookeeper.ZooLock] DEBUG: event 
> /accumulo/655f93d8-20fc-451f-a457-458b5717a11e/tservers/172.16.2.25:9997 
> NodeDeleted SyncConnected
> 2013-01-09 19:59:27,931 [tabletserver.TabletServer] INFO : Waiting for tablet 
> server lock
> 2013-01-09 19:59:32,943 [tabletserver.TabletServer] DEBUG: Obtained tablet 
> server lock 
> /accumulo/655f93d8-20fc-451f-a457-458b5717a11e/tservers/172.16.2.25:9997/zlock-0000000000
> 2013-01-09 19:59:36,703 [tabletserver.TabletServer] DEBUG: Got loadTablet 
> message from user: !SYSTEM
> {code}
> Here's what happened:
> 1. Tablet server fails to get lock, triggering the watcher on the parent node.
> 2. Watcher doesn't get reset, and doesn't take any action.
> 3. Loop in TabletServer:~2659 retries, but uses the same ZooLock object.
> 4. TabletServer loses its lock, but receives a connection loss message before 
> the NodeDeleted message.
> 5. TabletServer continues to try to do work instead of killing itself.
> We could probably patch this for 1.4 by creating the ZooLock within the 
> announceExistence loop, instead of reusing the same one. Eventually, we ought to 
> have an else branch in both of the Watchers that either resets the watch (to be 
> resilient against zookeeper connection hiccups) or just kills the server to 
> be safe.
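The defensive pattern suggested above can be sketched in isolation. The class below is a hypothetical model, not Accumulo's actual ZooLock API: it shows a watcher that treats any event it cannot positively interpret (such as a connection loss) as "lock status unknown" and reports it, instead of silently dropping the event the way the buggy watcher did.

```java
// Hypothetical sketch of a defensive lock watcher (illustrative names;
// not the real org.apache.accumulo.fate.zookeeper.ZooLock API).
import java.util.concurrent.atomic.AtomicReference;

public class DefensiveLockWatcher {

    enum LockState { HELD, LOST, UNKNOWN }

    // Simplified stand-ins for the ZooKeeper event types involved.
    enum Event { NODE_DELETED, SESSION_EXPIRED, CONNECTION_LOSS, OTHER }

    private final AtomicReference<LockState> state =
            new AtomicReference<>(LockState.HELD);

    void process(Event event) {
        switch (event) {
            case NODE_DELETED:    // our ephemeral lock node was deleted
            case SESSION_EXPIRED: // session gone, so the ephemeral node is too
                state.set(LockState.LOST);
                break;
            case CONNECTION_LOSS: // we can no longer watch the node; the lock
            default:              // may still exist, but we cannot verify it
                state.set(LockState.UNKNOWN);
                break;
        }
        if (state.get() != LockState.HELD) {
            // A real tablet server would halt here: serving tablets without a
            // verifiable lock risks multiple hosting of the same tablet.
            System.out.println("lock state: " + state.get());
        }
    }

    LockState state() { return state.get(); }

    public static void main(String[] args) {
        DefensiveLockWatcher w = new DefensiveLockWatcher();
        w.process(Event.CONNECTION_LOSS);
        System.out.println("final: " + w.state());
    }
}
```

The key design point is the default branch: unrecognized or unwatchable states collapse to UNKNOWN rather than being ignored, so the server errs toward shutting down instead of continuing to serve tablets on a lock it cannot confirm.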

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
