Re: Large number of errored transactions and periodic node disconnects on 3.4.6

Ben Smith Wed, 14 Jan 2015 21:19:45 -0800

On Wed, Jan 14, 2015 at 1:28 PM, Flavio Junqueira <[email protected]>
wrote:


> Do you have anything running along with the ZK servers on the same
> machine?


No the zookeeper machines are 'dedicated'. The only other stuff running is
infrastructure related (metrics and log collection agents that incur no/low
load).


> Is that a virtualized environment?


Yes, this is in in the 'cloud' (: They are on AWS c3.large instances. The
instance CPU and Memory resources are really low.


> I'm wondering if a concurrent workload is causing servers to lag behind
> and drop the ensemble. Also, how many disk devices are you using? is there
> a dedicated device for the txn log?
>

They are only using one device each, it is not dedicated, but it also isn't
anywhere close to saturation, using iostat to validate svctm:

https://gist.github.com/polynomial/2dc6aefefd7fa31afc73

Here is the config if that helps too:

$ echo 'conf' | nc localhost 2181
clientPort=2181
dataDir=/path/to/zookeeper/snapshots/version-2
dataLogDir=/path/to/zookeeper/transactions/version-2
tickTime=2000
maxClientCnxns=60
minSessionTimeout=4000
maxSessionTimeout=40000
serverId=4
initLimit=10
syncLimit=5
electionAlg=3
electionPort=3888
quorumPort=2888
peerType=0

-ben


>
> -Flavio
>
> > On 14 Jan 2015, at 19:07, R. Tyler Croy <[email protected]> wrote:
> >
> > (replies inline)
> >
> > On Wed, 14 Jan 2015, Flavio Junqueira wrote:
> >
> >> I'm not sure what could be causing the communication between servers to
> fail.
> >> Are you sure this is a ZK problem? Could it be a problem with your
> setting?
> >
> >
> > We're not certain of much at this point to be honest. We've executed some
> > network stress tests and see no disruptions in the layer 1-3 network
> > communications.
> >
> > I'm not aware of any Zookeeper server-side setting that would result in
> the
> > behavior we're seeing but I'm happy to try out some configuration
> changes if
> > someboyd believes those will either help alleviate the problem or help
> identify
> > it with more exactness.
> >
> >
> >
> >
> >> On Wednesday, January 14, 2015 4:33 PM, R. Tyler Croy <
> [email protected]> wrote:
> >>
> >>
> >> A colleague of mine and I are still fully trying to understand the
> behavior
> >> we're seeing from one of our Zookeeper clusters running 3.4.6.
> >>
> >> Some log snippets are included further down, but the symptom of what
> we're
> >> seeing is that nodes in the cluster periodically detach themselves
> entirely
> >> from the rest of the cluster. The period is not however consistent, in
> at least
> >> one known case we've seen multiple nodes detach around the same time
> causing
> >> leader re-election challenges (putting it kindly!).
> >>
> >> Our investigation has led us to see a large number of transaction errors
> >> (txnType: -1) in our trace logs with "n/a" paths. To give you an idea
> of how
> >> large of a number we're talking about, here's a summary from ~20min of
> logs:
> >>
> >> -> # grep txntype zookeeper.log* | perl -pe 's/.*txntype/txntype/' |
> sort | uniq -c | sort -nr
> >> 70585 txntype:-1 reqpath:n/a
> >> 4860 txntype:5 reqpath:n/a
> >> 7 txntype:-11 reqpath:n/a
> >> 7 txntype:-10 reqpath:n/a
> >> [..snip..]
> >>
> >> The number of session create/close transaction types looks about right
> as does
> >> the number of 'set data' transactions; however the "reqpath" being
> "n/a" does
> >> not look entirely correct.
> >>
> >>
> >> We're only using Zookeeper to hold offsets for normal Kafka high-level
> >> consumers and Storm's KafkaSpout consumers, and I'm relatively
> confident that
> >> those consumers are behaving correctly.
> >>
> >>
> >> We have also come across a number of ostensibly resolved bugs in JIRA
> that, to
> >> our untrained eyes, look similar to what we're seeing:
> >>
> >> "Zookeeper 3.3.4 loses ephemeral nodes under stress"
> >>     <https://issues.apache.org/jira/browse/ZOOKEEPER-1740>
> >> "the threads number of a zookeeper is increased all the time"
> >>     <https://issues.apache.org/jira/browse/ZOOKEEPER-939> (not seeing
> the
> >>     thread growth mentioned here, just log similarities)
> >>
> >>
> >> The node detaching issue is not something we've been able to recreate
> on our
> >> own, so I don't entirely have a reproduction case. This is ZK 3.4.6 on
> OpenJDK
> >> 1.7.0u67 running on Ubuntu 12.04, other than that I'm happy to provide
> >> information that might help determine the cause of our problems!
> >>
> >>
> >> Cheers (logs snippets to follow)
> >>
> >>
> >> Error logs from an incident where a node detached itself:
> >>
> >> 2015-01-02 20:10:37,576 [myid:4] - ERROR
> [LearnerHandler-/10.96.100.69:54260:LearnerHandler@633] - Unexpected
> exception causing shutdown while sock still open
> >> java.net.SocketTimeoutException: Read timed out
> >>         at java.net.SocketInputStream.socketRead0(Native Method)
> >>         at java.net.SocketInputStream.read(SocketInputStream.java:152)
> >>         at java.net.SocketInputStream.read(SocketInputStream.java:122)
> >>         at
> java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
> >>         at
> java.io.BufferedInputStream.read(BufferedInputStream.java:254)
> >>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
> >>         at
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> >>         at
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> >>         at
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
> >>         at
> org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546)
> >> 2015-01-02 20:10:37,577 [myid:4] - ERROR
> [LearnerHandler-/10.96.100.90:43220:LearnerHandler@633] - Unexpected
> exception causing shutdown while sock still open
> >> java.net.SocketTimeoutException: Read timed out
> >>         at java.net.SocketInputStream.socketRead0(Native Method)
> >>         at java.net.SocketInputStream.read(SocketInputStream.java:152)
> >>         at java.net.SocketInputStream.read(SocketInputStream.java:122)
> >>         at
> java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
> >>         at
> java.io.BufferedInputStream.read(BufferedInputStream.java:254)
> >>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
> >>         at
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> >>         at
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> >>         at
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
> >>         at
> org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546)
> >> 2015-01-02 20:10:37,577 [myid:4] - WARN
> [LearnerHandler-/10.96.100.69:54260:LearnerHandler@646] - ******* GOODBYE
> /10.96.100.69:54260 ********
> >> 2015-01-02 20:10:37,576 [myid:4] - ERROR
> [LearnerHandler-/10.96.100.136:48175:LearnerHandler@633] - Unexpected
> exception causing shutdown while sock still open
> >> java.net.SocketTimeoutException: Read timed out
> >>         at java.net.SocketInputStream.socketRead0(Native Method)
> >>         at java.net.SocketInputStream.read(SocketInputStream.java:152)
> >>         at java.net.SocketInputStream.read(SocketInputStream.java:122)
> >>         at
> java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
> >>         at
> java.io.BufferedInputStream.read(BufferedInputStream.java:254)
> >>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
> >>         at
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> >>         at
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> >>         at
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
> >>         at
> org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546)
> >> 2015-01-02 20:10:37,578 [myid:4] - WARN
> [LearnerHandler-/10.96.14.103:47982:LearnerHandler@646] - ******* GOODBYE
> /10.96.14.103:47982 ********
> >> 2015-01-02 20:10:37,577 [myid:4] - ERROR
> [LearnerHandler-/10.84.11.131:47798:LearnerHandler@633] - Unexpected
> exception causing shutdown while sock still open
> >> java.net.SocketTimeoutException: Read timed out
> >>         at java.net.SocketInputStream.socketRead0(Native Method)
> >>         at java.net.SocketInputStream.read(SocketInputStream.java:152)
> >>         at java.net.SocketInputStream.read(SocketInputStream.java:122)
> >>         at
> java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
> >>         at
> java.io.BufferedInputStream.read(BufferedInputStream.java:254)
> >>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
> >>         at
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> >>         at
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> >>         at
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
> >>         at
> org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546)
> >> 2015-01-02 20:10:37,578 [myid:4] - WARN
> [LearnerHandler-/10.84.11.131:47798:LearnerHandler@646] - ******* GOODBYE
> /10.84.11.131:47798 ********
> >> 2015-01-02 20:10:37,578 [myid:4] - WARN
> [LearnerHandler-/10.96.100.136:48175:LearnerHandler@646] - *******
> GOODBYE /10.96.100.136:48175 ********
> >>
> >>
> >>
> >> Some trace logs from "normal operation" that looked suspect to us:
> >>
> >>
> >> 2015-01-12 14:05:12,753 [myid:1] - TRACE
> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooTrace@71] - loadData â??????
> session in archive: 381306787892823513 with timeout: 6000
> >> 2015-01-12 14:05:12,797 [myid:1] - INFO
> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FileTxnSnapLog@240] -
> Snapshotting: 0xb3ea0ee8f4db to
> /srv/zookeeper/snapshots/version-2/snapshot.b3ea0ee8f4db
> >> 2015-01-12 14:05:12,822 [myid:1] - WARN
> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Learner@374] - Got zxid
> 0xb3ea0ee8f4dc expected 0x1
> >> 2015-01-12 14:05:12,833 [myid:1] - DEBUG
> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161] -
> Committing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df483a
> zxid:0xb3ea0ee8f4dc txntype:-1 reqpath:n/a
> >> 2015-01-12 14:05:12,833 [myid:1] - DEBUG
> [CommitProcessor:1:FinalRequestProcessor@88] - Processing request::
> sessionid:0x24aac647ab40047 type:error cxid:0x29df483a zxid:0xb3ea0ee8f4dc
> txntype:-1 reqpath:n/a
> >> 2015-01-12 14:05:12,838 [myid:1] - TRACE [CommitProcessor:1:ZooTrace@90]
> - :Esessionid:0x24aac647ab40047 type:error cxid:0x29df483a
> zxid:0xb3ea0ee8f4dc txntype:-1 reqpath:n/a
> >> 2015-01-12 14:05:12,838 [myid:1] - DEBUG [CommitProcessor:1:DataTree@949]
> - Ignoring processTxn failure hdr: -1 : error: -110
> >> 2015-01-12 14:05:12,838 [myid:1] - DEBUG
> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161] -
> Committing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df483d
> zxid:0xb3ea0ee8f4dd txntype:-1 reqpath:n/a
> >> 2015-01-12 14:05:12,839 [myid:1] - DEBUG
> [CommitProcessor:1:FinalRequestProcessor@88] - Processing request::
> sessionid:0x24aac647ab40047 type:error cxid:0x29df483d zxid:0xb3ea0ee8f4dd
> txntype:-1 reqpath:n/a
> >> 2015-01-12 14:05:12,839 [myid:1] - TRACE [CommitProcessor:1:ZooTrace@90]
> - :Esessionid:0x24aac647ab40047 type:error cxid:0x29df483d
> zxid:0xb3ea0ee8f4dd txntype:-1 reqpath:n/a
> >> 2015-01-12 14:05:12,839 [myid:1] - DEBUG [CommitProcessor:1:DataTree@949]
> - Ignoring processTxn failure hdr: -1 : error: -110
> >> 2015-01-12 14:05:12,839 [myid:1] - DEBUG
> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161] -
> Committing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df4840
> zxid:0xb3ea0ee8f4de txntype:-1 reqpath:n/a
> >> 2015-01-12 14:05:12,839 [myid:1] - DEBUG
> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161] -
> Committing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df4843
> zxid:0xb3ea0ee8f4df txntype:-1 reqpath:n/a
> >>
> >>
> >>
>
>

Re: Large number of errored transactions and periodic node disconnects on 3.4.6

Reply via email to