[ https://issues.apache.org/jira/browse/ACCUMULO-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Keith Turner updated ACCUMULO-1277:
-----------------------------------

    Assignee: Keith Turner  (was: Eric Newton)

I had some code in place that delayed deleting empty tserver nodes, but it looks like I just dropped it. Oops. I'll take a look at this. Nice write-up.

> Race condition between master and tserver when acquiring tserver lock
> ---------------------------------------------------------------------
>
>                 Key: ACCUMULO-1277
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1277
>             Project: Accumulo
>          Issue Type: Bug
>          Components: master, tserver
>    Affects Versions: 1.5.0, 1.4.3
>            Reporter: Daniel P Truitt
>            Assignee: Keith Turner
>
> When restarting a stopped tserver, the following happens:
> The tserver (in TabletServer.announceExistence()) creates an entry in
> zookeeper at /accumulo/instance-id/tserver/host:port.
> This in turn triggers master to execute the call chain:
> LiveTServerSet.process(WatchedEvent)
> LiveTServerSet.scanServers()
> LiveTServerSet.checkServer(Set<TServerInstance>, Set<TServerInstance>,
> String, String)
> The checkServer() method checks to see if the ZooLock data has been created
> yet (if tserver loses the race, it has not yet been created) causing master
> to then delete the tserver node.
> When the tserver attempts to create the ZooLock, the parent path no longer
> exists and creating the lock fails. Eventually the tserver will time out
> waiting to create the lock, and fail to start.
> This problem is easier to reproduce in a smallish cluster using a single
> zookeeper node, where there is more latency between the tserver and zookeeper
> than there is between the master and zookeeper.
> This behavior was introduced in the fix for ACCUMULO-1049.
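
To make the interleaving described in the report concrete, here is a minimal sketch of the race and of the "delay deleting empty tserver nodes" idea mentioned in the comment. It is not Accumulo code: FakeZk is a tiny in-memory stand-in for ZooKeeper, FakeMaster.checkServer() only imitates the behavior the report attributes to LiveTServerSet.checkServer(), and the graceScans parameter, the /tservers/host:9997 path, and the zlock-0000000000 child name are all hypothetical, chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TServerLockRaceSketch {

    /** In-memory stand-in for a ZooKeeper tree: just a set of paths (hypothetical). */
    static final class FakeZk {
        private final Set<String> paths = new TreeSet<>();

        FakeZk() {
            paths.add("/tservers");
        }

        /** Creates a node; fails when the parent is missing, as ZooKeeper does. */
        boolean create(String path) {
            String parent = path.substring(0, path.lastIndexOf('/'));
            if (!paths.contains(parent)) {
                return false;
            }
            paths.add(path);
            return true;
        }

        boolean exists(String path) {
            return paths.contains(path);
        }

        boolean hasChildren(String path) {
            String prefix = path + "/";
            for (String p : paths) {
                if (p.startsWith(prefix)) {
                    return true;
                }
            }
            return false;
        }

        void delete(String path) {
            paths.remove(path);
        }
    }

    /** Imitates the master-side check; graceScans is an assumed knob, not a real setting. */
    static final class FakeMaster {
        private final FakeZk zk;
        private final int graceScans;
        private final Map<String, Integer> emptySightings = new HashMap<>();

        FakeMaster(FakeZk zk, int graceScans) {
            this.zk = zk;
            this.graceScans = graceScans;
        }

        /** Deletes a lock-less node only after seeing it empty more than graceScans times. */
        void checkServer(String node) {
            if (!zk.exists(node)) {
                return;
            }
            if (zk.hasChildren(node)) {
                emptySightings.remove(node); // lock is present, nothing to clean up
                return;
            }
            int seen = emptySightings.merge(node, 1, Integer::sum);
            if (seen > graceScans) {
                zk.delete(node);
                emptySightings.remove(node);
            }
        }
    }

    /** Replays the interleaving from the report; returns whether the lock node was created. */
    static boolean tserverStarts(int graceScans) {
        FakeZk zk = new FakeZk();
        FakeMaster master = new FakeMaster(zk, graceScans);
        String node = "/tservers/host:9997";

        zk.create(node);                              // tserver announces itself
        master.checkServer(node);                     // master's watcher wins the race
        return zk.create(node + "/zlock-0000000000"); // tserver now tries to take the lock
    }

    public static void main(String[] args) {
        System.out.println("no grace period: lock created = " + tserverStarts(0));
        System.out.println("one-scan grace:  lock created = " + tserverStarts(1));
    }
}
```

With graceScans set to 0 the master deletes the parent node before the lock child exists, so the final create fails, which matches the reported failure. With graceScans set to 1 the first empty sighting is tolerated and the lock is created. Counting scans is only one way to express a delay; a time-based grace period would serve the same purpose.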