(replies inline) On Wed, 14 Jan 2015, Flavio Junqueira wrote:
> I'm not sure what could be causing the communication between servers to fail. > Are you sure this is a ZK problem? Could it be a problem with your setting? We're not certain of much at this point to be honest. We've executed some network stress tests and see no disruptions in the layer 1-3 network communications. I'm not aware of any Zookeeper server-side setting that would result in the behavior we're seeing but I'm happy to try out some configuration changes if someboyd believes those will either help alleviate the problem or help identify it with more exactness. > On Wednesday, January 14, 2015 4:33 PM, R. Tyler Croy > <[email protected]> wrote: > > > A colleague of mine and I are still fully trying to understand the behavior > we're seeing from one of our Zookeeper clusters running 3.4.6. > > Some log snippets are included further down, but the symptom of what we're > seeing is that nodes in the cluster periodically detach themselves entirely > from the rest of the cluster. The period is not however consistent, in at > least > one known case we've seen multiple nodes detach around the same time causing > leader re-election challenges (putting it kindly!). > > Our investigation has led us to see a large number of transaction errors > (txnType: -1) in our trace logs with "n/a" paths. To give you an idea of how > large of a number we're talking about, here's a summary from ~20min of logs: > > -> # grep txntype zookeeper.log* | perl -pe 's/.*txntype/txntype/' | sort | > uniq -c | sort -nr > 70585 txntype:-1 reqpath:n/a > 4860 txntype:5 reqpath:n/a > 7 txntype:-11 reqpath:n/a > 7 txntype:-10 reqpath:n/a > [..snip..] > > The number of session create/close transaction types looks about right as does > the number of 'set data' transactions; however the "reqpath" being "n/a" does > not look entirely correct. > > > We're only using Zookeeper to hold offsets for normal Kafka high-level > consumers and Storm's KafkaSpout consumers, and I'm relatively confident that > those consumers are behaving correctly. > > > We have also come across a number of ostensibly resolved bugs in JIRA that, to > our untrained eyes, look similar to what we're seeing: > > "Zookeeper 3.3.4 loses ephemeral nodes under stress" > <https://issues.apache.org/jira/browse/ZOOKEEPER-1740> > "the threads number of a zookeeper is increased all the time" > <https://issues.apache.org/jira/browse/ZOOKEEPER-939> (not seeing the > thread growth mentioned here, just log similarities) > > > The node detaching issue is not something we've been able to recreate on our > own, so I don't entirely have a reproduction case. This is ZK 3.4.6 on OpenJDK > 1.7.0u67 running on Ubuntu 12.04, other than that I'm happy to provide > information that might help determine the cause of our problems! > > > Cheers (logs snippets to follow) > > > Error logs from an incident where a node detached itself: > > 2015-01-02 20:10:37,576 [myid:4] - ERROR > [LearnerHandler-/10.96.100.69:54260:LearnerHandler@633] - Unexpected > exception causing shutdown while sock still open > java.net.SocketTimeoutException: Read timed out > at java.net.SocketInputStream.socketRead0(Native Method) > at java.net.SocketInputStream.read(SocketInputStream.java:152) > at java.net.SocketInputStream.read(SocketInputStream.java:122) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:235) > at java.io.BufferedInputStream.read(BufferedInputStream.java:254) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at > org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) > at > org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83) > at > org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103) > at > org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546) > 2015-01-02 20:10:37,577 [myid:4] - ERROR > [LearnerHandler-/10.96.100.90:43220:LearnerHandler@633] - Unexpected > exception causing shutdown while sock still open > java.net.SocketTimeoutException: Read timed out > at java.net.SocketInputStream.socketRead0(Native Method) > at java.net.SocketInputStream.read(SocketInputStream.java:152) > at java.net.SocketInputStream.read(SocketInputStream.java:122) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:235) > at java.io.BufferedInputStream.read(BufferedInputStream.java:254) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at > org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) > at > org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83) > at > org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103) > at > org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546) > 2015-01-02 20:10:37,577 [myid:4] - WARN > [LearnerHandler-/10.96.100.69:54260:LearnerHandler@646] - ******* GOODBYE > /10.96.100.69:54260 ******** > 2015-01-02 20:10:37,576 [myid:4] - ERROR > [LearnerHandler-/10.96.100.136:48175:LearnerHandler@633] - Unexpected > exception causing shutdown while sock still open > java.net.SocketTimeoutException: Read timed out > at java.net.SocketInputStream.socketRead0(Native Method) > at java.net.SocketInputStream.read(SocketInputStream.java:152) > at java.net.SocketInputStream.read(SocketInputStream.java:122) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:235) > at java.io.BufferedInputStream.read(BufferedInputStream.java:254) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at > org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) > at > org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83) > at > org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103) > at > org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546) > 2015-01-02 20:10:37,578 [myid:4] - WARN > [LearnerHandler-/10.96.14.103:47982:LearnerHandler@646] - ******* GOODBYE > /10.96.14.103:47982 ******** > 2015-01-02 20:10:37,577 [myid:4] - ERROR > [LearnerHandler-/10.84.11.131:47798:LearnerHandler@633] - Unexpected > exception causing shutdown while sock still open > java.net.SocketTimeoutException: Read timed out > at java.net.SocketInputStream.socketRead0(Native Method) > at java.net.SocketInputStream.read(SocketInputStream.java:152) > at java.net.SocketInputStream.read(SocketInputStream.java:122) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:235) > at java.io.BufferedInputStream.read(BufferedInputStream.java:254) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at > org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) > at > org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83) > at > org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103) > at > org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546) > 2015-01-02 20:10:37,578 [myid:4] - WARN > [LearnerHandler-/10.84.11.131:47798:LearnerHandler@646] - ******* GOODBYE > /10.84.11.131:47798 ******** > 2015-01-02 20:10:37,578 [myid:4] - WARN > [LearnerHandler-/10.96.100.136:48175:LearnerHandler@646] - ******* GOODBYE > /10.96.100.136:48175 ******** > > > > Some trace logs from "normal operation" that looked suspect to us: > > > 2015-01-12 14:05:12,753 [myid:1] - TRACE > [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooTrace@71] - loadData â?????? > session in archive: 381306787892823513 with timeout: 6000 > 2015-01-12 14:05:12,797 [myid:1] - INFO > [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FileTxnSnapLog@240] - Snapshotting: > 0xb3ea0ee8f4db to /srv/zookeeper/snapshots/version-2/snapshot.b3ea0ee8f4db > 2015-01-12 14:05:12,822 [myid:1] - WARN > [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Learner@374] - Got zxid > 0xb3ea0ee8f4dc expected 0x1 > 2015-01-12 14:05:12,833 [myid:1] - DEBUG > [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161] - Committing > request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df483a > zxid:0xb3ea0ee8f4dc txntype:-1 reqpath:n/a > 2015-01-12 14:05:12,833 [myid:1] - DEBUG > [CommitProcessor:1:FinalRequestProcessor@88] - Processing request:: > sessionid:0x24aac647ab40047 type:error cxid:0x29df483a zxid:0xb3ea0ee8f4dc > txntype:-1 reqpath:n/a > 2015-01-12 14:05:12,838 [myid:1] - TRACE [CommitProcessor:1:ZooTrace@90] > - :Esessionid:0x24aac647ab40047 type:error cxid:0x29df483a > zxid:0xb3ea0ee8f4dc txntype:-1 reqpath:n/a > 2015-01-12 14:05:12,838 [myid:1] - DEBUG [CommitProcessor:1:DataTree@949] > - Ignoring processTxn failure hdr: -1 : error: -110 > 2015-01-12 14:05:12,838 [myid:1] - DEBUG > [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161] - Committing > request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df483d > zxid:0xb3ea0ee8f4dd txntype:-1 reqpath:n/a > 2015-01-12 14:05:12,839 [myid:1] - DEBUG > [CommitProcessor:1:FinalRequestProcessor@88] - Processing request:: > sessionid:0x24aac647ab40047 type:error cxid:0x29df483d zxid:0xb3ea0ee8f4dd > txntype:-1 reqpath:n/a > 2015-01-12 14:05:12,839 [myid:1] - TRACE [CommitProcessor:1:ZooTrace@90] > - :Esessionid:0x24aac647ab40047 type:error cxid:0x29df483d > zxid:0xb3ea0ee8f4dd txntype:-1 reqpath:n/a > 2015-01-12 14:05:12,839 [myid:1] - DEBUG [CommitProcessor:1:DataTree@949] > - Ignoring processTxn failure hdr: -1 : error: -110 > 2015-01-12 14:05:12,839 [myid:1] - DEBUG > [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161] - Committing > request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df4840 > zxid:0xb3ea0ee8f4de txntype:-1 reqpath:n/a > 2015-01-12 14:05:12,839 [myid:1] - DEBUG > [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161] - Committing > request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df4843 > zxid:0xb3ea0ee8f4df txntype:-1 reqpath:n/a > > >
signature.asc
Description: Digital signature
