[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102249#comment-15102249
 ] 

Chris Nauroth commented on ZOOKEEPER-1936:
------------------------------------------

I'm in favor of the approach in patch v2.  This would be a deterministic fix.  
Adding a delay like the first patch might still not work if we got unlucky in 
the way the OS scheduled the threads.  What do others think?

[~tedyu], could you please do the following?

# Make the same fix for {{snapDir}}, which is right after the code you already 
changed in {{FileTxnSnapLog}}.  Otherwise, we might get past the {{dataDir}} 
creation only to fail again on {{snapDir}}.
# Post 2 patch files: one that applied to trunk and one that applies to 
branch-3.4.
# Generate the patch files with {{git diff --no-prefix}} for compatibility with 
our pre-commit automation.

Thank you!

> Server exits when unable to create data directory due to race 
> --------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1936
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1936
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.6, 3.5.0
>            Reporter: Harald Musum
>            Assignee: Andrew Purtell
>            Priority: Minor
>         Attachments: ZOOKEEPER-1936.patch, ZOOKEEPER-1936.v2.patch
>
>
> We sometime see issues with ZooKeeper server not starting and seeing this 
> error in the log:
> [2014-05-27 09:29:48.248] ERROR   : -               
> .org.apache.zookeeper.server.ZooKeeperServerMain    Unexpected exception,
> exiting abnormally\nexception=\njava.io.IOException: Unable to create data
> directory /home/y/var/zookeeper/version-2\n\tat
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.<init>(FileTxnSnapLog.java:85)\n\tat
> org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:103)\n\tat
> org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:86)\n\tat
> org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:52)\n\tat
> org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116)\n\tat
> org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)\n\t
> [...]
> Stack trace from JVM gives this:
> "PurgeTask" daemon prio=10 tid=0x000000000201d000 nid=0x1727 runnable
> [0x00007f55d7dc7000]
>    java.lang.Thread.State: RUNNABLE
>     at java.io.UnixFileSystem.createDirectory(Native Method)
>     at java.io.File.mkdir(File.java:1310)
>     at java.io.File.mkdirs(File.java:1337)
>     at
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.<init>(FileTxnSnapLog.java:84)
>     at org.apache.zookeeper.server.PurgeTxnLog.purge(PurgeTxnLog.java:68)
>     at
> org.apache.zookeeper.server.DatadirCleanupManager$PurgeTask.run(DatadirCleanupManager.java:140)
>     at java.util.TimerThread.mainLoop(Timer.java:555)
>     at java.util.TimerThread.run(Timer.java:505)
> "zookeeper server" prio=10 tid=0x00000000027df800 nid=0x1715 runnable
> [0x00007f55d7ed8000]
>    java.lang.Thread.State: RUNNABLE
>     at java.io.UnixFileSystem.createDirectory(Native Method)
>     at java.io.File.mkdir(File.java:1310)
>     at java.io.File.mkdirs(File.java:1337)
>     at
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.<init>(FileTxnSnapLog.java:84)
>     at
> org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:103)
>     at
> org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:86)
>     at
> org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:52)
>     at
> org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116)
>     at
> org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
> [...]
> So it seems that when autopurge is used (as it is in our case), it might 
> happen at the same time as starting the server itself. In FileTxnSnapLog() it 
> will check if the directory exists and create it if not. These two tasks do 
> this at the same time, and mkdir fails and server exits the JVM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to