[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293116#comment-13293116
 ] 

Thawan Kooburat commented on ZOOKEEPER-1484:
--------------------------------------------

Just noticed that the logs are from different machines, so the actual root 
cause has not been found yet. However, I think the issue that I pointed out is 
still a legitimate problem. 
                
> Missing znode found in the follower
> -----------------------------------
>
>                 Key: ZOOKEEPER-1484
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1484
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.3
>            Reporter: Thawan Kooburat
>            Assignee: Thawan Kooburat
>            Priority: Critical
>
> We noticed that one of the followers failed to restart due to a missing parent node:
> {noformat}
> 2012-05-29 15:44:41,037 [myid:9] - INFO [main:FileSnap@83] - Reading snapshot 
> /var/facebook/zeus-server/data/global-ropt.0/version-2/snapshot.3d001f19c9
> 2012-05-29 15:44:43,300 [myid:9] - ERROR [main:FileTxnSnapLog@220] - Parent 
> /phpunittest/1862297546 missing for /phpunittest/1862297546/dir1
> 2012-05-29 15:44:43,302 [myid:9] - ERROR [main:QuorumPeer@488] - Unable to 
> load database on disk
> java.io.IOException: Failed to process transaction type: 1 error: 
> KeeperErrorCode = NoNode for /phpunittest/1862297546
> {noformat}
> We believe the root cause is a bug in the follower sync-up logic. Due to a 
> race condition, the follower may miss some proposals. The log below shows 
> that the follower saw a commit message for a proposal it had not seen 
> before:
> {noformat}
> 2012-05-15 15:11:27,449 [myid:13] - WARN 
> [QuorumPeer[myid=13]/0.0.0.0:2182:Learner@378] - Got zxid 0x3c00282dc9 
> expected 0x3c00282dca
> {noformat}
> I can reproduce this by running FollowerResyncConcurrencyTest repeatedly 
> until a failure occurs. I suspect that the root cause is how we handle 
> toBeApplied and outstandingProposals in the leader. 
> 1. An in-flight proposal is removed from outstandingProposals before it is 
> added to toBeApplied. Most of the problems I have seen so far seem to be 
> caused by this gap.
> 2. startForwarding() iterates through outstandingProposals without locking 
> PrepRequestProcessor properly, so it may miss an in-flight proposal. 
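
The gap described in item 1 can be shown with a small, self-contained sketch. This is not the ZooKeeper Leader code and not a proposed patch: the class ProposalHandoffSketch, the betweenSteps hook, and all method names are hypothetical, and the two collections are simplified stand-ins that hold only zxids. The hook simulates a follower sync that happens to run exactly between the two steps of a commit, so the result is deterministic rather than dependent on thread timing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

public class ProposalHandoffSketch {
    // Simplified stand-ins for the leader's two collections (assumption:
    // only zxids are tracked here, not full proposal objects).
    final Map<Long, String> outstandingProposals = new ConcurrentHashMap<>();
    final Queue<Long> toBeApplied = new ConcurrentLinkedQueue<>();

    // Hypothetical hook that runs between the two steps of a commit. It
    // models a follower sync that is scheduled at exactly that moment.
    Runnable betweenSteps = () -> {};

    void propose(long zxid) {
        outstandingProposals.put(zxid, "txn-" + zxid);
    }

    // Ordering described in item 1: remove first, then add.
    void commitRemoveFirst(long zxid) {
        outstandingProposals.remove(zxid);
        betweenSteps.run();            // proposal is in neither collection here
        toBeApplied.add(zxid);
    }

    // Illustrative alternative ordering: add first, then remove.
    void commitAddFirst(long zxid) {
        toBeApplied.add(zxid);
        betweenSteps.run();            // proposal is in both collections here
        outstandingProposals.remove(zxid);
    }

    // What a syncing follower would be sent: the union of both collections.
    List<Long> snapshotForFollower() {
        List<Long> seen = new ArrayList<>(toBeApplied);
        for (Long zxid : outstandingProposals.keySet()) {
            if (!seen.contains(zxid)) {
                seen.add(zxid);
            }
        }
        return seen;
    }

    // Runs one commit with a simulated sync in the middle and returns
    // the zxids that the sync observed.
    static List<Long> syncDuringCommit(boolean addFirst) {
        ProposalHandoffSketch leader = new ProposalHandoffSketch();
        List<Long> sent = new ArrayList<>();
        leader.propose(1L);
        leader.betweenSteps = () -> sent.addAll(leader.snapshotForFollower());
        if (addFirst) {
            leader.commitAddFirst(1L);
        } else {
            leader.commitRemoveFirst(1L);
        }
        return sent;
    }

    public static void main(String[] args) {
        System.out.println("remove-first: follower sees " + syncDuringCommit(false));
        System.out.println("add-first:    follower sees " + syncDuringCommit(true));
    }
}
```

With the remove-first ordering the simulated sync sees no proposals at all, which matches the symptom of a follower receiving a commit for a proposal it never saw. With the add-first ordering the proposal is briefly visible in both collections, so a sync has to tolerate duplicates; whether that trade-off is acceptable in the real leader is an open question.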


        
