[ https://issues.apache.org/jira/browse/ZOOKEEPER-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102249#comment-15102249 ]
Chris Nauroth commented on ZOOKEEPER-1936: ------------------------------------------ I'm in favor of the approach in patch v2. This would be a deterministic fix. Adding a delay like the first patch might still not work if we got unlucky in the way the OS scheduled the threads. What do others think? [~tedyu], could you please do the following? # Make the same fix for {{snapDir}}, which is right after the code you already changed in {{FileTxnSnapLog}}. Otherwise, we might get past the {{dataDir}} creation only to fail again on {{snapDir}}. # Post 2 patch files: one that applied to trunk and one that applies to branch-3.4. # Generate the patch files with {{git diff --no-prefix}} for compatibility with our pre-commit automation. Thank you! > Server exits when unable to create data directory due to race > -------------------------------------------------------------- > > Key: ZOOKEEPER-1936 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1936 > Project: ZooKeeper > Issue Type: Bug > Components: server > Affects Versions: 3.4.6, 3.5.0 > Reporter: Harald Musum > Assignee: Andrew Purtell > Priority: Minor > Attachments: ZOOKEEPER-1936.patch, ZOOKEEPER-1936.v2.patch > > > We sometime see issues with ZooKeeper server not starting and seeing this > error in the log: > [2014-05-27 09:29:48.248] ERROR : - > .org.apache.zookeeper.server.ZooKeeperServerMain Unexpected exception, > exiting abnormally\nexception=\njava.io.IOException: Unable to create data > directory /home/y/var/zookeeper/version-2\n\tat > org.apache.zookeeper.server.persistence.FileTxnSnapLog.<init>(FileTxnSnapLog.java:85)\n\tat > org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:103)\n\tat > org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:86)\n\tat > org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:52)\n\tat > org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116)\n\tat > org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)\n\t > [...] > Stack trace from JVM gives this: > "PurgeTask" daemon prio=10 tid=0x000000000201d000 nid=0x1727 runnable > [0x00007f55d7dc7000] > java.lang.Thread.State: RUNNABLE > at java.io.UnixFileSystem.createDirectory(Native Method) > at java.io.File.mkdir(File.java:1310) > at java.io.File.mkdirs(File.java:1337) > at > org.apache.zookeeper.server.persistence.FileTxnSnapLog.<init>(FileTxnSnapLog.java:84) > at org.apache.zookeeper.server.PurgeTxnLog.purge(PurgeTxnLog.java:68) > at > org.apache.zookeeper.server.DatadirCleanupManager$PurgeTask.run(DatadirCleanupManager.java:140) > at java.util.TimerThread.mainLoop(Timer.java:555) > at java.util.TimerThread.run(Timer.java:505) > "zookeeper server" prio=10 tid=0x00000000027df800 nid=0x1715 runnable > [0x00007f55d7ed8000] > java.lang.Thread.State: RUNNABLE > at java.io.UnixFileSystem.createDirectory(Native Method) > at java.io.File.mkdir(File.java:1310) > at java.io.File.mkdirs(File.java:1337) > at > org.apache.zookeeper.server.persistence.FileTxnSnapLog.<init>(FileTxnSnapLog.java:84) > at > org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:103) > at > org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:86) > at > org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:52) > at > org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116) > at > org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78) > [...] > So it seems that when autopurge is used (as it is in our case), it might > happen at the same time as starting the server itself. In FileTxnSnapLog() it > will check if the directory exists and create it if not. These two tasks do > this at the same time, and mkdir fails and server exits the JVM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)