[
https://issues.apache.org/jira/browse/ZOOKEEPER-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13408705#comment-13408705
]
Marshall McMullen commented on ZOOKEEPER-1453:
----------------------------------------------
I disabled the write cache on the drive that holds my ZooKeeper database, and it
still fails in exactly the same way :-<.
Here's the part that really baffles me: I tried removing the on-disk database
entirely (the version-2 directory) and starting ZooKeeper again, thinking
that it would just pull down a fresh copy of the database from one of
its peers. Unfortunately, it still fails to connect. See the output below:
root@SF-42:/sf/data# java -cp
/opt/zookeeper-3.5.0-p7/zookeeper-3.5.0-p7.jar:/opt/zookeeper-3.5.0-p7/lib/log4j-1.2.16.jar:/opt/zookeeper-3.5.0-p7/lib/commons-cli-1.2.jar:/opt/zookeeper-3.5.0-p7/lib/slf4j-log4j12-1.6.2.jar:/opt/zookeeper-3.5.0-p7/lib/netty-3.2.5.Final.jar:/opt/zookeeper-3.5.0-p7/lib/jline-0.9.94.jar:/opt/zookeeper-3.5.0-p7/lib/javacc.jar:/opt/zookeeper-3.5.0-p7/lib/slf4j-api-1.6.2.jar:/opt/zookeeper-3.5.0-p7/conf
-Dzookeeper.root.logger=DEBUG,CONSOLE -Dzookeeper.log.dir=.
-Dzookeeper.tracelog.dir=/sf/data/zookeeper/10.10.5.42/
-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=false
-Djute.maxbuffer=4194304 org.apache.zookeeper.server.quorum.QuorumPeerMain
/sf/data/zookeeper/10.10.5.42/10.10.5.42_2181.cfg
2012-07-07 10:20:23,270 [myid:] - INFO [main:QuorumPeerConfig@99] - Reading
configuration from: /sf/data/zookeeper/10.10.5.42/10.10.5.42_2181.cfg
2012-07-07 10:20:23,279 [myid:2] - INFO [main:DatadirCleanupManager@78] -
autopurge.snapRetainCount set to 5
2012-07-07 10:20:23,279 [myid:2] - INFO [main:DatadirCleanupManager@79] -
autopurge.purgeInterval set to 1
2012-07-07 10:20:23,280 [myid:2] - INFO
[PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task started.
2012-07-07 10:20:23,289 [myid:2] - INFO
[PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task completed.
2012-07-07 10:20:23,290 [myid:2] - INFO [main:QuorumPeerMain@131] - Starting
quorum peer
2012-07-07 10:20:23,300 [myid:2] - INFO [main:NIOServerCnxnFactory@108] -
binding to port /10.10.5.42:2181
2012-07-07 10:20:23,308 [myid:2] - INFO [main:QuorumPeer@1107] - tickTime set
to 2000
2012-07-07 10:20:23,308 [myid:2] - INFO [main:QuorumPeer@1127] -
minSessionTimeout set to -1
2012-07-07 10:20:23,308 [myid:2] - INFO [main:QuorumPeer@1138] -
maxSessionTimeout set to -1
2012-07-07 10:20:23,308 [myid:2] - INFO [main:QuorumPeer@1153] - initLimit set
to 10
2012-07-07 10:20:23,321 [myid:2] - INFO [main:QuorumPeer@620] - currentEpoch
not found! Creating with a reasonable default of 0. This should only happen
when you are upgrading your installation
2012-07-07 10:20:23,322 [myid:2] - INFO [main:QuorumPeer@635] - acceptedEpoch
not found! Creating with a reasonable default of 0. This should only happen
when you are upgrading your installation
2012-07-07 10:20:23,325 [myid:2] - INFO
[QuorumPeerListener:QuorumCnxManager$Listener@530] - My election bind port:
/10.10.5.42:2183
2012-07-07 10:20:23,333 [myid:2] - INFO
[QuorumPeer[myid=2]/10.10.5.42:2181:QuorumPeer@860] - LOOKING
2012-07-07 10:20:23,334 [myid:2] - INFO
[QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@831] - New election. My
id = 2, proposed zxid=0x0
2012-07-07 10:20:23,341 [myid:2] - INFO
[NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxnFactory@227] - Accepted
socket connection from /10.10.5.44:48534
2012-07-07 10:20:23,342 [myid:2] - INFO
[WorkerSender[myid=2]:QuorumCnxManager@191] - Have smaller server identifier,
so dropping the connection: (3, 2)
2012-07-07 10:20:23,342 [myid:2] - INFO
[WorkerReceiver[myid=2]:FastLeaderElection@635] - Notification: 2 (n.leader),
0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x0 (n.peerEPoch),
LOOKING (my state)0 (n.config version)
2012-07-07 10:20:23,345 [myid:2] - WARN
[NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxn@354] - Exception causing
close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2012-07-07 10:20:23,346 [myid:2] - INFO
[NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxn@1002] - Closed socket
connection for client /10.10.5.44:48534 (no session established for client)
2012-07-07 10:20:23,544 [myid:2] - INFO
[QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@865] - Notification time
out: 400
2012-07-07 10:20:23,545 [myid:2] - INFO
[WorkerSender[myid=2]:QuorumCnxManager@191] - Have smaller server identifier,
so dropping the connection: (3, 2)
2012-07-07 10:20:23,545 [myid:2] - INFO
[WorkerReceiver[myid=2]:FastLeaderElection@635] - Notification: 2 (n.leader),
0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x0 (n.peerEPoch),
LOOKING (my state)0 (n.config version)
2012-07-07 10:20:23,680 [myid:2] - INFO
[NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxnFactory@227] - Accepted
socket connection from /10.10.5.44:48535
2012-07-07 10:20:23,680 [myid:2] - WARN
[NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxn@354] - Exception causing
close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2012-07-07 10:20:23,680 [myid:2] - INFO
[NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxn@1002] - Closed socket
connection for client /10.10.5.44:48535 (no session established for client)
2012-07-07 10:20:23,946 [myid:2] - INFO
[QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@865] - Notification time
out: 800
2012-07-07 10:20:23,946 [myid:2] - INFO
[WorkerSender[myid=2]:QuorumCnxManager@191] - Have smaller server identifier,
so dropping the connection: (3, 2)
2012-07-07 10:20:23,947 [myid:2] - INFO
[WorkerReceiver[myid=2]:FastLeaderElection@635] - Notification: 2 (n.leader),
0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x0 (n.peerEPoch),
LOOKING (my state)0 (n.config version)
2012-07-07 10:20:24,014 [myid:2] - INFO
[NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxnFactory@227] - Accepted
socket connection from /10.10.5.44:48536
2012-07-07 10:20:24,014 [myid:2] - WARN
[NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxn@354] - Exception causing
close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2012-07-07 10:20:24,015 [myid:2] - INFO
[NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxn@1002] - Closed socket
connection for client /10.10.5.44:48536 (no session established for client)
2012-07-07 10:20:24,349 [myid:2] - INFO
[NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxnFactory@227] - Accepted
socket connection from /10.10.5.44:48650
2012-07-07 10:20:24,349 [myid:2] - WARN
[NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxn@354] - Exception causing
close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2012-07-07 10:20:24,349 [myid:2] - INFO
[NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxn@1002] - Closed socket
connection for client /10.10.5.44:48650 (no session established for client)
2012-07-07 10:20:24,683 [myid:2] - INFO
[NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxnFactory@227] - Accepted
socket connection from /10.10.5.44:48678
2012-07-07 10:20:24,683 [myid:2] - WARN
[NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxn@354] - Exception causing
close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2012-07-07 10:20:24,683 [myid:2] - INFO
[NIOServerCxn.Factory:/10.10.5.42:2181:NIOServerCnxn@1002] - Closed socket
connection for client /10.10.5.44:48678 (no session established for client)
2012-07-07 10:20:24,747 [myid:2] - INFO
[QuorumPeer[myid=2]/10.10.5.42:2181:FastLeaderElection@865] - Notification time
out: 1600
> corrupted logs may not be correctly identified by FileTxnIterator
> -----------------------------------------------------------------
>
> Key: ZOOKEEPER-1453
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1453
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.3.3
> Reporter: Patrick Hunt
> Priority: Critical
> Attachments: 10.10.5.123-withPath1489.tar.gz, 10.10.5.123.tar.gz,
> 10.10.5.42-withPath1489.tar.gz, 10.10.5.42.tar.gz,
> 10.10.5.44-withPath1489.tar.gz, 10.10.5.44.tar.gz
>
>
> See ZOOKEEPER-1449 for background on this issue. The main problem is that
> during server recovery
> org.apache.zookeeper.server.persistence.FileTxnLog.FileTxnIterator.next()
> does not indicate if the available logs are valid or not. In some cases (say
> a truncated record and a single txnlog in the datadir) we will not detect
> that the file is corrupt, vs reaching the end of the file.
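
To make the distinction in the description above concrete, here is a minimal sketch of a reader that reports whether it stopped at a clean end of file or in the middle of a record. It is an illustration only, not the FileTxnIterator code: the class name RecordScan, the MAX_RECORD cap, and the record layout (a 4-byte big-endian length followed by that many payload bytes) are all assumptions made for the example and do not match ZooKeeper's actual on-disk format.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class RecordScan {
    // Hypothetical sanity cap on a single record's payload size.
    static final int MAX_RECORD = 1 << 20;

    /** Outcome of a scan: number of whole records read, and whether the tail was damaged. */
    public static final class Result {
        public final int records;
        public final boolean corrupt;

        Result(int records, boolean corrupt) {
            this.records = records;
            this.corrupt = corrupt;
        }
    }

    /** Reads length-prefixed records until the stream ends, cleanly or not. */
    public static Result scan(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int records = 0;
        while (true) {
            int first = data.read();
            if (first < 0) {
                // No bytes left exactly at a record boundary: a clean end of file.
                return new Result(records, false);
            }
            try {
                byte[] rest = new byte[3];
                data.readFully(rest); // throws EOFException if the length field is cut short
                int len = (first << 24) | ((rest[0] & 0xff) << 16)
                        | ((rest[1] & 0xff) << 8) | (rest[2] & 0xff);
                if (len < 0 || len > MAX_RECORD) {
                    // An implausible length is treated as corruption, not as end of file.
                    return new Result(records, true);
                }
                byte[] payload = new byte[len];
                data.readFully(payload); // throws EOFException if the payload is cut short
            } catch (EOFException e) {
                // The stream ended inside a record: truncated, not a normal end of file.
                return new Result(records, true);
            }
            records++;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] whole = {0, 0, 0, 2, 'h', 'i', 0, 0, 0, 1, '!'};
        byte[] cut = Arrays.copyOf(whole, 9); // second record's length field is incomplete

        Result a = scan(new ByteArrayInputStream(whole));
        Result b = scan(new ByteArrayInputStream(cut));
        System.out.println(a.records + " " + a.corrupt); // prints "2 false"
        System.out.println(b.records + " " + b.corrupt); // prints "1 true"
    }
}
```

The point of returning both values is that a caller with only a single log file can still tell "read everything" apart from "stopped early", which a bare "no more records" signal cannot express.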