I'm trying to debug a really obscure problem we've been seeing integrating multiop into our code base.
After a lot of debugging, I've isolated the error scenario down to when the multi op gets submitted through any non-leader. The symptoms I see when I submit the multi-op to a non-leader is that the multi-op, no matter how small (in this case it only has one op in it to create a single zknode) always times out. When submitting it to non-leaders, I am seeing a couple of key messages in the log files: 2011-07-12 13:20:59,168 - WARN [LearnerHandler-/127.0.0.7:53952:Leader@507][] - *Commiting zxid 0x100000000 from /127.0.0.7:2182 not first!* 2011-07-12 13:20:59,169 - WARN [LearnerHandler-/127.0.0.7:53952:Leader@509][] - *First is 0 * 2011-07-12 13:20:59,169 - INFO [LearnerHandler-/127.0.0.7:53952:Leader@533][] - Have quorum of supporters; starting up and setting last processed zxid: 4294967296 2011-07-12 13:20:59,379 - INFO [NIOServerCxn.Factory:/127.0.0.5:2181 :NIOServerCnxnFactory@197][] - Accepted socket connection from / 127.0.0.5:53553 2011-07-12 13:20:59,381 - WARN [NIOServerCxn.Factory:/127.0.0.5:2181 :ZooKeeperServer@807][] - Connection request from old client / 127.0.0.5:53553; will be dropped if server is in r-o mode 2011-07-12 13:20:59,381 - INFO [NIOServerCxn.Factory:/127.0.0.5:2181 :ZooKeeperServer@853][] - Client attempting to establish new session at / 127.0.0.5:53553 2011-07-12 13:20:59,385 - INFO [SyncThread:3:FileTxnLog@195][] - Creating new log file: log.100000001 2011-07-12 13:20:59,387 - WARN [QuorumPeer[myid=1]/127.0.0.5:2181 :Follower@118][] - Got zxid 0x100000001 expected 0x1 2011-07-12 13:20:59,387 - INFO [SyncThread:1:FileTxnLog@195][] - Creating new log file: log.100000001 2011-07-12 13:20:59,387 - WARN [QuorumPeer[myid=2]/127.0.0.6:2181 :Follower@118][] - Got zxid 0x100000001 expected 0x1 If I instead submit the exact same multi-op to the leader instead of a follower, it succeeds every time. I'm suspecting there is a bug in how a Multi-op is forwarded between a follower and a leader (or vice versa). I'm going to be digging into this tonight, but if anyone has any pointers on where to look, I'd be super appreciative. Thanks, Marshall
