[jira] [Commented] (AMQ-5082) ActiveMQ replicatedLevelDB cluster breaks, all nodes stop listening

Jim Robinson (JIRA) Thu, 03 Sep 2015 09:08:23 -0700

    [ 
https://issues.apache.org/jira/browse/AMQ-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729326#comment-14729326
 ]


Jim Robinson commented on AMQ-5082:
-----------------------------------

If you look at this code:

http://grepcode.com/file/repo1.maven.org/maven2/org.apache.zookeeper/zookeeper/3.3.0/org/apache/zookeeper/server/ZooKeeperServer.java#664

you can see that if you do not set minSessionTimeout or maxSessionTimeout, 
ZooKeeper defaults to

minSessionTimeout == 2 x tickTime

maxSessionTimeout == 20 x tickTime

The tickTime is in milliseconds, so that means your configuration is currently:

tickTime == 2 seconds
minSessionTimeout == 4 seconds
maxSessionTimeout == 40 seconds
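
As a rough sketch of that arithmetic (this is not ZooKeeper's actual code; the 
function names are made up for illustration, and -1 stands for "not set" as in 
the server source linked above), the defaults and the way the server clamps a 
client's requested session timeout into the allowed range look like this:

```python
UNSET = -1  # assumption: -1 marks a timeout that was not configured


def default_session_bounds(tick_time_ms, min_timeout_ms=UNSET, max_timeout_ms=UNSET):
    """Return (min, max) session timeouts in ms, using 2x / 20x tickTime when unset."""
    lo = tick_time_ms * 2 if min_timeout_ms == UNSET else min_timeout_ms
    hi = tick_time_ms * 20 if max_timeout_ms == UNSET else max_timeout_ms
    return lo, hi


def negotiated_timeout(requested_ms, tick_time_ms,
                       min_timeout_ms=UNSET, max_timeout_ms=UNSET):
    """Clamp the client's requested timeout into the server's [min, max] range."""
    lo, hi = default_session_bounds(tick_time_ms, min_timeout_ms, max_timeout_ms)
    return max(lo, min(requested_ms, hi))


if __name__ == "__main__":
    # tickTime of 2000 ms, as in the configuration discussed here
    print(default_session_bounds(2000))
    print(negotiated_timeout(60000, 2000))
```

With a tickTime of 2000 ms, any requested timeout above 40000 ms is cut down to 
40000 ms, and anything below 4000 ms is raised to 4000 ms.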

My configuration is the same. So your zkSessionTimeout is within the range the 
server will accept, but what I've found is that with my setup the server only 
became stable at the higher end of that range. In my case I set it to 
40 seconds, the maximum the ZooKeeper server would allow.

I think what's going on is that the code in ZooKeeperGroup.scala, e.g., at

https://github.com/apache/activemq/blob/master/activemq-leveldb-store/src/main/scala/org/apache/activemq/leveldb/replicated/groups/ZooKeeperGroup.scala#L142

isn't handling all the possible exceptions that might get thrown, such as on a 
timeout, and breaks when the timeout is hit.

I've been too busy with other things to dig into it any further, but I'm pretty 
confident that the ActiveMQ client code that interacts with the ZooKeeper 
server needs to be reevaluated to handle the different ways a connection might 
be lost.
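
To make "different ways a connection might be lost" concrete, here is a small 
sketch of the distinction a client has to draw. It is not the ActiveMQ code and 
it does not use the ZooKeeper client library; the state names mirror 
ZooKeeper's KeeperState values (SyncConnected, Disconnected, Expired), while 
the Action enum and next_action function are hypothetical:

```python
from enum import Enum


class KeeperState(Enum):
    # Names mirror ZooKeeper's Watcher.Event.KeeperState values.
    SYNC_CONNECTED = "SyncConnected"
    DISCONNECTED = "Disconnected"
    EXPIRED = "Expired"


class Action(Enum):
    # Hypothetical reactions, for illustration only.
    RESUME = "resume normal operation"
    WAIT_FOR_RECONNECT = "pause and let the client library reconnect"
    RECREATE_SESSION = "open a new session and rejoin the group"


def next_action(state):
    """Pick a reaction for each connection state instead of treating all as fatal."""
    if state is KeeperState.SYNC_CONNECTED:
        return Action.RESUME
    if state is KeeperState.DISCONNECTED:
        # The session may still be alive on the server until the timeout elapses.
        return Action.WAIT_FOR_RECONNECT
    if state is KeeperState.EXPIRED:
        # The server has discarded the session; ephemeral nodes are gone.
        return Action.RECREATE_SESSION
    raise ValueError(f"unhandled state: {state!r}")
```

The point of the sketch is only that a temporary disconnect and an expired 
session call for different recovery paths, and every state needs a defined one.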


> ActiveMQ replicatedLevelDB cluster breaks, all nodes stop listening
> -------------------------------------------------------------------
>
>                 Key: AMQ-5082
>                 URL: https://issues.apache.org/jira/browse/AMQ-5082
>             Project: ActiveMQ
>          Issue Type: Bug
>          Components: activemq-leveldb-store
>    Affects Versions: 5.9.0, 5.10.0
>            Reporter: Scott Feldstein
>            Assignee: Christian Posta
>            Priority: Critical
>             Fix For: 5.13.0
>
>         Attachments: 03-07.tgz, amq_5082_threads.tar.gz, 
> mq-node1-cluster.failure, mq-node2-cluster.failure, mq-node3-cluster.failure, 
> zookeeper.out-cluster.failure
>
>
> I have a 3 node amq cluster and one zookeeper node using a replicatedLevelDB 
> persistence adapter.
> {code}
>         <persistenceAdapter>
>             <replicatedLevelDB
>               directory="${activemq.data}/leveldb"
>               replicas="3"
>               bind="tcp://0.0.0.0:0"
>               zkAddress="zookeep0:2181"
>               zkPath="/activemq/leveldb-stores"/>
>         </persistenceAdapter>
> {code}
> After about a day or so of sitting idle there are cascading failures and the 
> cluster completely stops listening all together.
> I can reproduce this consistently on 5.9 and the latest 5.10 (commit 
> 2360fb859694bacac1e48092e53a56b388e1d2f0).  I am going to attach logs from 
> the three mq nodes and the zookeeper logs that reflect the time where the 
> cluster starts having issues.
> The cluster stops listening Mar 4, 2014 4:56:50 AM (within 5 seconds).
> The OSs are all centos 5.9 on one esx server, so I doubt networking is an 
> issue.
> If you need more data it should be pretty easy to get whatever is needed 
> since it is consistently reproducible.
> This bug may be related to AMQ-5026, but looks different enough to file a 
> separate issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
