[ https://issues.apache.org/jira/browse/AMQ-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064099#comment-15064099 ]

Gregor Stephen edited comment on AMQ-5082 at 12/18/15 3:36 PM:
---------------------------------------------------------------

We are seeing something very similar to this in our development environment.

We have a 3-node ActiveMQ cluster where each node runs ActiveMQ 5.12.0 and 
ZooKeeper 3.4.6. (Note: we have done some testing with ZooKeeper 3.4.7, but it 
failed to resolve the issue; time constraints have so far prevented us from 
testing ActiveMQ 5.13.)

What we have found is that when we stop the master ZooKeeper process (via the 
"end process tree" command in Task Manager), the remaining two ZooKeeper nodes 
continue to function as normal. Sometimes the ActiveMQ cluster handles this 
correctly, but sometimes it does not.
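
For reference, below is roughly how we check which of the ZooKeeper nodes are 
still up (and which one is the leader) after killing the process. It is only a 
quick sketch that sends ZooKeeper's "stat" four-letter command to each server; 
the host list is specific to our test environment, so treat the addresses as 
placeholders.

{code}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.Socket;

// Quick-and-dirty health check: send the ZooKeeper "stat" four-letter command
// to each ensemble member and print the reply, which includes "Mode: leader"
// or "Mode: follower". Hosts/ports below are placeholders for our environment.
public class ZkStatCheck {
    public static void main(String[] args) {
        String[] servers = {"192.168.0.10:2181", "192.168.0.11:2181", "192.168.0.12:2181"};
        for (String server : servers) {
            String[] hp = server.split(":");
            System.out.println("=== " + server + " ===");
            try (Socket s = new Socket(hp[0], Integer.parseInt(hp[1]));
                 BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()))) {
                s.getOutputStream().write("stat".getBytes());
                s.getOutputStream().flush();
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            } catch (IOException e) {
                System.out.println("unreachable: " + e.getMessage());
            }
        }
    }
}
{code}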

When the cluster fails, we typically see this in the ActiveMQ log:

{code}
2015-12-18 09:08:45,157 | WARN  | Too many cluster members are connected.  Expected at most 3 members but there are 4 connected. | org.apache.activemq.leveldb.replicated.MasterElector | WrapperSimpleAppMain-EventThread
...
...
2015-12-18 09:27:09,722 | WARN  | Session 0x351b43b4a560016 for server null, unexpected error, closing socket connection and attempting reconnect | org.apache.zookeeper.ClientCnxn | WrapperSimpleAppMain-SendThread(192.168.0.10:2181)
java.net.ConnectException: Connection refused: no further information
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)[:1.7.0_79]
        at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)[:1.7.0_79]
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)[zookeeper-3.4.6.jar:3.4.6-1569965]
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)[zookeeper-3.4.6.jar:3.4.6-1569965]
{code}

We were immediately concerned by the fact that (a) ActiveMQ seems to think 
there are four members in the cluster when only three are configured, and (b) 
when the exception is raised, the server appears to be null. We then increased 
ActiveMQ's logging level to DEBUG in order to display the list of members:

{code}
2015-12-18 09:33:04,236 | DEBUG | ZooKeeper group changed: Map(localhost -> ListBuffer(
  (0000000156,{"id":"localhost","container":null,"address":null,"position":-1,"weight":5,"elected":null}),
  (0000000157,{"id":"localhost","container":null,"address":null,"position":-1,"weight":1,"elected":null}),
  (0000000158,{"id":"localhost","container":null,"address":"tcp://192.168.0.11:61619","position":-1,"weight":10,"elected":null}),
  (0000000159,{"id":"localhost","container":null,"address":null,"position":-1,"weight":10,"elected":null})))
  | org.apache.activemq.leveldb.replicated.MasterElector | ActiveMQ BrokerService[localhost] Task-14
{code}
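
To cross-check that member list, we also dump the election entries directly 
with the ZooKeeper Java client. This is only a sketch: the connect string and 
the znode path (we assume the entries live under the configured zkPath, 
/activemq/leveldb-stores) are guesses based on our broker configuration, and 
the exact layout may differ.

{code}
import java.util.List;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Dump the sequential znodes behind the MasterElector group view, including
// the ephemeral session owner of each entry, so we can see which of the four
// members still belongs to a live session. Connect string and path are
// assumptions based on our own configuration.
public class DumpGroupMembers {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper(
                "192.168.0.10:2181,192.168.0.11:2181,192.168.0.12:2181",
                30000, event -> { /* ignore watch events for this one-shot dump */ });
        String groupPath = "/activemq/leveldb-stores";   // assumed zkPath
        List<String> children = zk.getChildren(groupPath, false);
        for (String child : children) {
            Stat stat = new Stat();
            byte[] data = zk.getData(groupPath + "/" + child, false, stat);
            // ephemeralOwner != 0 means the entry is tied to a client session
            System.out.printf("%s owner=0x%x data=%s%n",
                    child, stat.getEphemeralOwner(), new String(data, "UTF-8"));
        }
        zk.close();
    }
}
{code}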

Can anyone suggest why this may be happening, and/or a way to resolve it?



> ActiveMQ replicatedLevelDB cluster breaks, all nodes stop listening
> -------------------------------------------------------------------
>
>                 Key: AMQ-5082
>                 URL: https://issues.apache.org/jira/browse/AMQ-5082
>             Project: ActiveMQ
>          Issue Type: Bug
>          Components: activemq-leveldb-store
>    Affects Versions: 5.9.0, 5.10.0
>            Reporter: Scott Feldstein
>            Assignee: Christian Posta
>            Priority: Critical
>             Fix For: 5.14.0
>
>         Attachments: 03-07.tgz, amq_5082_threads.tar.gz, 
> mq-node1-cluster.failure, mq-node2-cluster.failure, mq-node3-cluster.failure, 
> zookeeper.out-cluster.failure
>
>
> I have a 3 node amq cluster and one zookeeper node using a replicatedLevelDB 
> persistence adapter.
> {code}
>         <persistenceAdapter>
>             <replicatedLevelDB
>               directory="${activemq.data}/leveldb"
>               replicas="3"
>               bind="tcp://0.0.0.0:0"
>               zkAddress="zookeep0:2181"
>               zkPath="/activemq/leveldb-stores"/>
>         </persistenceAdapter>
> {code}
> After about a day or so of sitting idle there are cascading failures and the 
> cluster completely stops listening altogether.
> I can reproduce this consistently on 5.9 and the latest 5.10 (commit 
> 2360fb859694bacac1e48092e53a56b388e1d2f0).  I am going to attach logs from 
> the three mq nodes and the zookeeper logs that reflect the time where the 
> cluster starts having issues.
> The cluster stops listening Mar 4, 2014 4:56:50 AM (within 5 seconds).
> The OSs are all CentOS 5.9 on one ESX server, so I doubt networking is an 
> issue.
> If you need more data it should be pretty easy to get whatever is needed 
> since it is consistently reproducible.
> This bug may be related to AMQ-5026, but looks different enough to file a 
> separate issue.



