Re: [Artemis] Master fails to start up after failback

Martyn Taylor Wed, 21 Oct 2015 09:19:08 -0700

This should now be fixed upstream as part of:https://issues.apache.org/jira/browse/ARTEMIS-273


On 20/10/15 19:29, Mihkel Nõges wrote:

Thanks Martyn!


I will try this tomorrow.

BR!
Mihkel

On 20 October 2015 at 18:59, Martyn Taylor <mtay...@redhat.com> wrote:

Hi Mihkel,

I tried reproducing this locally and ran into an issue straight away when
running the example.  After some investigation it appears that there is an
issue in the ArtemisServerImpl which is preventing the journal files from
being replicated properly when using relative paths in the configuration
for data directories.

I am working on a fix for this at the moment.  In the meantime, you could
try using absolute paths in your server configuration for the follow
elements:

       <paging-directory></paging-directory>
       <bindings-directory></bindings-directory>
       <journal-directory></journal-directory>
<large-messages-directory></large-messages-directory>

Regards
Martyn

On 19/10/15 09:04, Mihkel Nõges wrote:

Basic flow of getting unresponsive failback cluster:
Have machine with Ubuntu 14.04.3

     1. Install libaio1, Java 1.8.0_60, maven 3.3.3, download and extract
     apache-artemis-1.1.0-bin
     <
http://www.eu.apache.org/dist/activemq/activemq-artemis/1.1.0/apache-artemis-1.1.0-bin.tar.gz
in
     /opt
     2. run $ mvn -Prelease install and $ mnv verify in
        /opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback
     SUCCESS
     3. Clean data folders and starts both servers manually:
     $
     cd
/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target
     $ rm -R server0/data/
     $ rm -R server1/data/
     $ server0/bin/artemis-service start
     Starting artemis-service
     artemis-service is now running (7154)
     $ server1/bin/artemis-service start
     Starting artemis-service
     artemis-service is now running (7180)
     4. Kill master server and wait for slave to take over

     $ kill -9 7154

     $ tail -f server1/log/artemis.log
     08:52:54,798 INFO  [org.apache.activemq.artemis.core.server]
AMQ221043:
     Protocol module found: [artemis-stomp-protocol]. Adding protocol
support
     for: STOMP
     08:53:02,145 INFO  [org.apache.activemq.artemis.core.server]
AMQ221109:
     Apache ActiveMQ Artemis Backup Server version 1.1.0 [null] started,
waiting
     live to fail before it gets active
     08:53:03,582 INFO  [org.apache.activemq.artemis.core.server]
AMQ221024:
     Backup server
     ActiveMQServerImpl::serverUUID=64ddff0f-7636-11e5-bfa8-f5004e6195f0 is
     synchronized with live-server.
     08:53:03,777 INFO  [org.apache.activemq.artemis.core.server]
AMQ221031:
     backup announced
     08:55:59,292 INFO  [org.apache.activemq.artemis.core.server]
AMQ221037:
     ActiveMQServerImpl::serverUUID=64ddff0f-7636-11e5-bfa8-f5004e6195f0 to
     become 'live'
     08:55:59,302 WARN  [org.apache.activemq.artemis.core.client]
AMQ212004:
     Failed to connect to server.
     08:55:59,778 INFO  [org.apache.activemq.artemis.core.server]
AMQ221003:
     trying to deploy queue jms.queue.exampleQueue
     08:55:59,829 WARN  [org.apache.activemq.artemis.core.client]
AMQ212034:
     There are more than one servers on the network broadcasting the same
node
     id. You will see this message exactly once (per node) if a node is
     restarted, in which case it can be safely ignored. But if it is logged
     continuously it means you really do have more than one node on the
same
     network active concurrently with the same node id. This could occur
if you
     have a backup node active at the same time as its live node.
     nodeID=64ddff0f-7636-11e5-bfa8-f5004e6195f0
     08:55:59,836 INFO  [org.apache.activemq.artemis.core.server]
AMQ221007:
     Server is now live
     08:55:59,869 INFO  [org.apache.activemq.artemis.core.server]
AMQ221020:
     Started Acceptor at broker3:61617 for protocols
     [CORE,MQTT,AMQP,HORNETQ,STOMP,OPENWIRE]
     5.


     Start master again and observer the logs:
     $ server0/bin/artemis-service start
     Starting artemis-service
     artemis-service is now running (7388)

$ tail -f server0/log/artemis.log
08:57:24,625 INFO  [org.apache.activemq.artemis.core.server] AMQ221012:
Using AIO Journal
08:57:24,694 INFO  [org.apache.activemq.artemis.core.server] AMQ221043:
Protocol module found: [artemis-server]. Adding protocol support for: CORE
08:57:24,702 INFO  [org.apache.activemq.artemis.core.server] AMQ221043:
Protocol module found: [artemis-amqp-protocol]. Adding protocol support
for: AMQP
08:57:24,731 INFO  [org.apache.activemq.artemis.core.server] AMQ221043:
Protocol module found: [artemis-hornetq-protocol]. Adding protocol support
for: HORNETQ
08:57:24,733 INFO  [org.apache.activemq.artemis.core.server] AMQ221043:
Protocol module found: [artemis-mqtt-protocol]. Adding protocol support
for: MQTT
08:57:24,743 INFO  [org.apache.activemq.artemis.core.server] AMQ221043:
Protocol module found: [artemis-openwire-protocol]. Adding protocol
support
for: OPENWIRE
08:57:24,878 INFO  [org.apache.activemq.artemis.core.server] AMQ221043:
Protocol module found: [artemis-stomp-protocol]. Adding protocol support
for: STOMP
08:57:25,082 INFO  [org.apache.activemq.artemis.core.server] AMQ221109:
Apache ActiveMQ Artemis Backup Server version 1.1.0 [null] started,
waiting
live to fail before it gets active
08:57:27,043 INFO  [org.apache.activemq.artemis.core.server] AMQ221024:
Backup server
ActiveMQServerImpl::serverUUID=64ddff0f-7636-11e5-bfa8-f5004e6195f0 is
synchronized with live-server.
08:57:27,948 INFO  [org.apache.activemq.artemis.core.server] AMQ221031:
backup announced
08:57:31,227 WARN  [org.apache.activemq.artemis.core.client] AMQ212037:
Connection failure has been detected: AMQ119015: The connection was
disconnected because of server shutdown [code=DISCONNECTED]
08:57:31,252 WARN  [org.apache.activemq.artemis.core.client] AMQ212037:
Connection failure has been detected: AMQ119015: The connection was
disconnected because of server shutdown [code=DISCONNECTED]
08:57:31,307 WARN  [org.apache.activemq.artemis.core.client] AMQ212037:
Connection failure has been detected: AMQ119015: The connection was
disconnected because of server shutdown [code=DISCONNECTED]
08:57:31,339 INFO  [org.apache.activemq.artemis.core.server] AMQ221037:
ActiveMQServerImpl::serverUUID=64ddff0f-7636-11e5-bfa8-f5004e6195f0 to
become 'live'
08:57:31,360 WARN  [org.apache.activemq.artemis.core.client] AMQ212004:
Failed to connect to server.
08:57:31,413 ERROR [org.apache.activemq.artemis.core.server] AMQ224008:
Failed to store id: java.lang.IllegalStateException: Cannot find add info
1
at

org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:799)
[artemis-journal-1.1.0.jar:1.1.0]
at

org.apache.activemq.artemis.core.journal.impl.JournalBase.appendDeleteRecord(JournalBase.java:183)
[artemis-journal-1.1.0.jar:1.1.0]
at

org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:79)
[artemis-journal-1.1.0.jar:1.1.0]
at

org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.deleteID(JournalStorageManager.java:1194)
[artemis-server-1.1.0.jar:1.1.0]
at

org.apache.activemq.artemis.core.persistence.impl.journal.BatchingIDGenerator.deleteID(BatchingIDGenerator.java:152)
[artemis-server-1.1.0.jar:1.1.0]
at

org.apache.activemq.artemis.core.persistence.impl.journal.BatchingIDGenerator.cleanup(BatchingIDGenerator.java:75)
[artemis-server-1.1.0.jar:1.1.0]
at

org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.loadBindingJournal(JournalStorageManager.java:
1784)
[artemis-server-1.1.0.jar:1.1.0]
at

org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.loadJournals(ActiveMQServerImpl.java:
1625)
[artemis-server-1.1.0.jar:1.1.0]
at

org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.initialisePart2(ActiveMQServerImpl.java:
1535)
[artemis-server-1.1.0.jar:1.1.0]
at

org.apache.activemq.artemis.core.server.impl.SharedNothingBackupActivation.run(SharedNothingBackupActivation.java:249)
[artemis-server-1.1.0.jar:1.1.0]
at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_60]
08:57:31,540 WARN  [org.apache.activemq.artemis.core.server] AMQ222173:
Queue jms.queue.exampleQueue is duplicated during reload. This queue will
be renamed as jms.queue.exampleQueue-0
08:57:31,550 ERROR [org.apache.activemq.artemis.core.server] AMQ224000:
Failure in initialisation: java.lang.IllegalStateException: Cursor 2 had
already been created
at

org.apache.activemq.artemis.core.paging.cursor.impl.PageCursorProviderImpl.createSubscription(PageCursorProviderImpl.java:97)
[artemis-server-1.1.0.jar:1.1.0]
at

org.apache.activemq.artemis.core.server.impl.PostOfficeJournalLoader.initQueues(PostOfficeJournalLoader.java:140)
[artemis-server-1.1.0.jar:1.1.0]
at

org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.loadJournals(ActiveMQServerImpl.java:
1631)
[artemis-server-1.1.0.jar:1.1.0]
at

org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.initialisePart2(ActiveMQServerImpl.java:
1535)
[artemis-server-1.1.0.jar:1.1.0]
at

org.apache.activemq.artemis.core.server.impl.SharedNothingBackupActivation.run(SharedNothingBackupActivation.java:249)
[artemis-server-1.1.0.jar:1.1.0]
at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_60]


On 19 October 2015 at 10:31, Mihkel Nõges <mihkel.no...@transferwise.com>
wrote:

Hi Clebert,

I do not have other code to share with you but the example code in
Artemis
1.1.0 binary deployment package. I'm running
org.apache.activemq.artemis.jms.example.ReplicatedFailbackExample

And only commented out the serverStart and killServer calls which I am
doing manually.

I do not think I do any of the steps too fast as I tail the server log
files in parallel and see everything is finished when I start the fail
back. I have waited many minutes in between.

Only changes in configuration to the test is changing localhost addresses
with broker3 to make the cluster accessible remotely.

BR!
MIhkel

On 18 October 2015 at 17:49, Clebert <clebert.suco...@gmail.com> wrote:

Im not on my computer now but it sounds like you are doing a fail back

immediately after failed over. It takes some time (seconds) to the
server
to activate on the backup.

Later the server will need to copy the data back before it can be
activated in fail back mode.

It sounds the live is not reaching backup for fail back.

I will try looking it at it on Monday. Maybe you could post your example
at your GitHub fork.

-- Clebert Suconic typing on the iPhone.

On Oct 18, 2015, at 08:15, Mihkel Nõges <mihkel.no...@transferwise.com>
wrote:

Hello again!

I would be very grateful If someone could answer my questions. We need

the high availability to work to use the broker in production.

When I run the replicated-failback example in one machine (broker3) it

succeeds.

It fails when I run the same test - exactly the same servers with

slightly modified client remotely.

I run client in debug mode from my IDE with commented out serverStart

and killServer calls.

Deleted data folders and started the servers:
artemis@broker3
:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$

rm -R server0/data/

artemis@broker3
:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$

rm -R server1/data/

artemis@broker3
:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$

server0/bin/artemis-service start

Starting artemis-service

artemis-service is now running (23357)

artemis@broker3
:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$

server1/bin/artemis-service start

Starting artemis-service

artemis-service is now running (23383)

Starting client and stopping on breakpoint at line 103:
//ServerUtil.killServer(server0);
// Step 11. Acknowledging the 2nd half of the sent messages will fail

as failover to the

// backup server has occurred
try {
     message0.acknowledge();  //line 103
killing server0
artemis@broker3
:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$

kill -9 23357

Proceeding to breakpoint at line 121:
//server0 = ServerUtil.startServer(args[0],

ReplicatedFailbackExample.class.getSimpleName() + "0", 0, 10000);

// Step 11. Acknowledging the 2nd half of the sent messages will fail

as failover to the

// backup server has occurred
try {
     message0.acknowledge(); // line 121
Starting server0:
artemis@broker3
:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$

server0/bin/artemis-service start

Starting artemis-service

artemis-service is now running (24240)

Server0 writes ERROR to it's log (see attached server0_artemis.log).
Now when trying to proceed with the client it writes the following in

the log and does not exit, but remains hanging forever:

Oct 18, 2015 2:55:34 PM

org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl
fail

WARN: AMQ212037: Connection failure has been detected: AMQ119015: The

connection was disconnected because of server shutdown
[code=DISCONNECTED]

Got message: This is text message 20 (redelivered?: false)

Got exception while acknowledging message: AMQ119014: Timed out after

waiting 30,000 ms for response when sending packet 43

Got message: This is text message 21 (redelivered?: false)

Got message: This is text message 22 (redelivered?: false)

Got message: This is text message 23 (redelivered?: false)

Got message: This is text message 24 (redelivered?: false)

Got message: This is text message 25 (redelivered?: false)

Got message: This is text message 26 (redelivered?: false)

Got message: This is text message 27 (redelivered?: false)

Got message: This is text message 28 (redelivered?: false)

Got message: This is text message 29 (redelivered?: false)

As a result the slave (server1) remains stopped, not restarted as

expected and the master (server0) process appears to be running but does
not accept any connections.

Exactly the same behavior is observable every time I try this.

BR!
Mihkel

On 13 October 2015 at 20:17, Mihkel Nõges <
mihkel.no...@transferwise.com> wrote:
Hi Clebert,

No test, just doing it on command line with standalone servers. I'm

using 1.1.0 installed with wget, not the snapshot.
I'm wondering what should be the suggested procedure for admins to do
changes to HA cluster of 2 or 3 nodes of Artemis. If one of the nodes

is
master by configuration, do they need to change it's config before
restarting it to become slave to have seamless change process and make
some
instance master by configuration only if all the instances need to be
restarted?

I tried also a cluster with 2 masters and 2 slaves with 2 separate
group-name values, but for some reason the second master I started

became
slave for the first immediately. I expected it to become a clustered
load-balancing parallel master. Our loads are not yet that high to
require
more than one master, so it's just a theoretical question.

BR!

Mihkel

On 13 October 2015 at 20:03, Clebert Suconic <
clebert.suco...@gmail.com> wrote:

The master needs to copy its data from the backup back to live before

it's activated.

Do you have a test replicating this?

Did you try the snapshot build?

On Tue, Oct 13, 2015 at 11:58 AM, Mihkel Nõges
<mihkel.no...@transferwise.com> wrote:

Hi,

I configured replicating HA master-slave of Artemis 1.1.0 instances

on

Ubuntu 14.04.3.

When I kill master the slave takes over as expected and starts

serving as

new master. When I then start the old master, it fails with the

following

errors in the log:

16:35:46,476 ERROR [org.apache.activemq.artemis.core.server]

AMQ224008:

Failed to store id: java.lang.IllegalStateException: Cannot find

add info 1

at

org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:799)

[artemis-journal-1.1.0.jar:1.1.0]

at

org.apache.activemq.artemis.core.journal.impl.JournalBase.appendDeleteRecord(JournalBase.java:183)

[artemis-journal-1.1.0.jar:1.1.0]

at

org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:79)

[artemis-journal-1.1.0.jar:1.1.0]

at

org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.deleteID(JournalStorageManager.java:1194)

[artemis-server-1.1.0.jar:1.1.0]

at

org.apache.activemq.artemis.core.persistence.impl.journal.BatchingIDGenerator.deleteID(BatchingIDGenerator.java:152)

[artemis-server-1.1.0.jar:1.1.0]

at

org.apache.activemq.artemis.core.persistence.impl.journal.BatchingIDGenerator.cleanup(BatchingIDGenerator.java:75)

[artemis-server-1.1.0.jar:1.1.0]

at

org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.loadBindingJournal(JournalStorageManager.java:
1784)

[artemis-server-1.1.0.jar:1.1.0]

at

org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.loadJournals(ActiveMQServerImpl.java:
1625)

[artemis-server-1.1.0.jar:1.1.0]

at

org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.initialisePart2(ActiveMQServerImpl.java:
1535)

[artemis-server-1.1.0.jar:1.1.0]

at

org.apache.activemq.artemis.core.server.impl.SharedNothingBackupActivation.run(SharedNothingBackupActivation.java:249)

[artemis-server-1.1.0.jar:1.1.0]

at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_60]

16:35:46,572 WARN  [org.apache.activemq.artemis.core.server]

AMQ222173:

Queue jms.queue.DLQ is duplicated during reload. This queue will be

renamed

as jms.queue.DLQ-0

16:35:46,572 ERROR [org.apache.activemq.artemis.core.server]

AMQ224000:

Failure in initialisation: java.lang.IllegalStateException: Cursor

2 had

already been created

at

org.apache.activemq.artemis.core.paging.cursor.impl.PageCursorProviderImpl.createSubscription(PageCursorProviderImpl.java:97)

[artemis-server-1.1.0.jar:1.1.0]

at

org.apache.activemq.artemis.core.server.impl.PostOfficeJournalLoader.initQueues(PostOfficeJournalLoader.java:140)

[artemis-server-1.1.0.jar:1.1.0]

at

org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.loadJournals(ActiveMQServerImpl.java:
1631)

[artemis-server-1.1.0.jar:1.1.0]

at

org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.initialisePart2(ActiveMQServerImpl.java:
1535)

[artemis-server-1.1.0.jar:1.1.0]

at

org.apache.activemq.artemis.core.server.impl.SharedNothingBackupActivation.run(SharedNothingBackupActivation.java:249)

[artemis-server-1.1.0.jar:1.1.0]

at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_60]

As a result both master and the slave remain unaccessible and no

further

restarts solve the situation.

Attached also master and slave broker.xml files.

BR!

Mihkel Nõges


--
Clebert Suconic

Re: [Artemis] Master fails to start up after failback

Reply via email to