[ https://issues.apache.org/jira/browse/ACCUMULO-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572675#comment-13572675 ]
Hudson commented on ACCUMULO-954:
---------------------------------

Integrated in Accumulo-Trunk-Hadoop-2.0 #70 (See [https://builds.apache.org/job/Accumulo-Trunk-Hadoop-2.0/70/])
    ACCUMULO-954 suppressed warnings from zookeeper during unit test (Revision 1443085)

     Result = UNSTABLE
kturner :
Files :
* /accumulo/trunk/test/src/test/java/org/apache/accumulo/fate/zookeeper/ZooLockTest.java
* /accumulo/trunk/test/src/test/java/org/apache/accumulo/test/MiniAccumuloClusterTest.java

> ZooLock watcher can stop watching
> ---------------------------------
>
>                 Key: ACCUMULO-954
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-954
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.4.2
>            Reporter: Adam Fuchs
>            Assignee: Keith Turner
>            Priority: Minor
>             Fix For: 1.5.0, 1.4.3
>
>
> Basically, this will result in tablet servers failing to recognize when they
> lose their locks. I think the worst that can happen with this is a tablet
> server can fail to die after it loses its lock, which could bog down clients
> and create a bunch of noise in the cluster. I believe there could also be
> useless files generated that wouldn't get garbage collected. !METADATA table
> write protections and logger write protections should prevent any permanent
> damage or data loss. We have seen this result in warnings and errors that
> look like multiple hosting of tablets.
> {code}
> 2013-01-09 19:59:27,742 [tabletserver.TabletServer] INFO : port = 9997
> 2013-01-09 19:59:27,926 [zookeeper.ZooLock] DEBUG: event /accumulo/655f93d8-20fc-451f-a457-458b5717a11e/tservers/172.16.2.25:9997 NodeDeleted SyncConnected
> 2013-01-09 19:59:27,931 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2013-01-09 19:59:32,943 [tabletserver.TabletServer] DEBUG: Obtained tablet server lock /accumulo/655f93d8-20fc-451f-a457-458b5717a11e/tservers/172.16.2.25:9997/zlock-0000000000
> 2013-01-09 19:59:36,703 [tabletserver.TabletServer] DEBUG: Got loadTablet message from user: !SYSTEM
> {code}
> Here's what happened:
> 1. Tablet server fails to get lock, triggering the watcher on the parent node.
> 2. Watcher doesn't get reset, and doesn't take any action.
> 3. Loop in TabletServer:~2659 retries, but uses the same ZooLock object.
> 4. TabletServer loses its lock, but receives a connection loss message before
> the NodeDeleted message.
> 5. TabletServer continues to try to do work instead of killing itself.
> We could probably patch this for 1.4 by creating the ZooLock within the
> announceExistence loop, instead of reusing the one. Eventually, we ought to
> have an else branch in both of the Watchers that either reset the watch
> (resilient against zookeeper connection hiccups) or just kill the server to
> be safe.
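
The "else branch" suggestion in the description can be illustrated with a small, self-contained sketch. This is not the Accumulo patch and it does not use the ZooKeeper client library: `LockWatcherSketch`, `WatchRegistry`, `LockListener`, `LockWatcher`, and `replay` are all hypothetical names invented for illustration. It only assumes that a watch is one-shot, so an event the watcher does not act on leaves the node unwatched unless the watch is set again.

```java
import java.util.ArrayList;
import java.util.List;

// All types here are hypothetical stand-ins, not Accumulo or ZooKeeper classes.
final class LockWatcherSketch {

    enum EventType { NODE_DELETED, NODE_CHILDREN_CHANGED }

    /** Hypothetical hook for re-registering a one-shot watch on a path. */
    interface WatchRegistry {
        void watch(String path, LockWatcher watcher);
    }

    /** Hypothetical callback invoked once the lock node is known to be gone. */
    interface LockListener {
        void lostLock(String path);
    }

    static final class LockWatcher {
        private final String lockPath;
        private final WatchRegistry registry;
        private final LockListener listener;

        LockWatcher(String lockPath, WatchRegistry registry, LockListener listener) {
            this.lockPath = lockPath;
            this.registry = registry;
            this.listener = listener;
        }

        /** Handles one delivered event. */
        void process(EventType type, String path) {
            if (type == EventType.NODE_DELETED && lockPath.equals(path)) {
                // The lock node is gone: tell the server so it can shut down.
                listener.lostLock(path);
            } else {
                // The else branch from the description: an event we do not act
                // on must re-arm the watch, or a later deletion goes unseen.
                registry.watch(lockPath, this);
            }
        }
    }

    /** Replays an unrelated event followed by the deletion of the lock node. */
    static List<String> replay(String lockPath) {
        List<String> log = new ArrayList<>();
        WatchRegistry registry = (path, watcher) -> log.add("rewatch " + path);
        LockListener listener = path -> log.add("lost " + path);
        LockWatcher watcher = new LockWatcher(lockPath, registry, listener);

        watcher.process(EventType.NODE_CHILDREN_CHANGED, lockPath); // not acted on
        watcher.process(EventType.NODE_DELETED, lockPath);          // lock lost
        return log;
    }

    public static void main(String[] args) {
        replay("/tservers/host:9997/zlock-0000000000").forEach(System.out::println);
    }
}
```

The alternative mentioned in the description, killing the server in the else branch, would replace the `registry.watch(...)` call with the same shutdown path used for a lost lock; which of the two is appropriate is a policy choice the sketch does not settle.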