Hi Justin,

thank you very, very much!

I haven't changed the configuration yet, but your excellent explanation absolutely makes sense to me. I also noticed the increasing number of bridge connections after a graceful shutdown/startup sequence.

Enabling persistence is fine for us.
I'll change the configuration and let you know.

Thank you again very much for your support!

Regards,
Oliver

P.S. Pls let me know how I can "buy you a coffee/beer/whatever" :)


On 4/20/23 04:41, Justin Bertram wrote:
I was able to take the archive you attached and reproduce the issue in just
a few minutes. Thanks for the great reproducer!

During reproduction I noticed something odd in the log. In a two-node
cluster you would expect each node to have a single bridge (i.e. going to the
*other* node of the cluster). However, after killing and restarting node 1
each node actually had more than one bridge. After looking at your
configuration more closely I saw that you had disabled persistence (i.e.
using <persistence-enabled>false</persistence-enabled>). This has a
specific impact on a clustered configuration because when a node starts
with an empty journal it generates a unique node ID and persists it to
disk. This ID is what identifies the node in the cluster so that everybody
in the cluster "knows" who everybody else is. When a node is restarted for
any reason the other nodes in the cluster are able to recognize it as the
same node based on the ID. However, when you disable persistence you
disable the persistent ID so every time a node restarts it is seen as a
"new" node in the cluster. Given that you're using the default
reconnect-attempts of -1 (i.e. infinite) on your cluster-connection that
means every time you restart a node all the other nodes in the cluster will
keep trying to reconnect to this never-to-return node forever. Furthermore,
they'll be trying to reconnect every 500 milliseconds. This reconnection
thrashing appears to be causing the ordering issue because as soon as I
enabled persistence I was unable to reproduce the problem anymore. I also
tried leaving persistence disabled while setting reconnect-attempts = 0,
and that likewise appears to have solved the problem.
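
To make the problematic combination concrete, it boils down to roughly the
following pieces of broker.xml (the connector and cluster names here are just
placeholders, and the retry-interval/reconnect-attempts values are simply the
defaults written out explicitly):

   <persistence-enabled>false</persistence-enabled>

   <cluster-connections>
      <cluster-connection name="my-cluster">
         <connector-ref>node0-connector</connector-ref>
         <!-- defaults: retry every 500 milliseconds, forever -->
         <retry-interval>500</retry-interval>
         <reconnect-attempts>-1</reconnect-attempts>
         <static-connectors>
            <connector-ref>node1-connector</connector-ref>
         </static-connectors>
      </cluster-connection>
   </cluster-connections>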

I don't yet know *why* the reconnection thrashing appears to be causing the
problem, but I believe you can effectively work around the issue by
enabling persistence, by disabling reconnection, or at least by setting
reconnect-attempts to a low value and increasing the retry-interval (e.g.
5 and 10000 respectively).
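
In broker.xml terms that would look something like this (again, the connector
and cluster names are placeholders and the numbers are just the suggestions
above):

   <!-- option 1: keep the node ID across restarts -->
   <persistence-enabled>true</persistence-enabled>

   <!-- option 2 (if persistence has to stay off): stop the reconnect thrashing -->
   <cluster-connections>
      <cluster-connection name="my-cluster">
         <connector-ref>node0-connector</connector-ref>
         <retry-interval>10000</retry-interval>
         <reconnect-attempts>5</reconnect-attempts>
         <static-connectors>
            <connector-ref>node1-connector</connector-ref>
         </static-connectors>
      </cluster-connection>
   </cluster-connections>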

Hope that helps!


Justin

On Thu, Mar 16, 2023 at 7:59 AM Oliver Lins <l...@lins-it.de> wrote:

Hi,

I've attached an archive containing the test apps, logs and a readme file.

If you have any questions pls let me know.

Thank you,
Oliver

On 3/15/23 16:31, Justin Bertram wrote:
I just need a way to reproduce what you're seeing so once you get your
reproducer in order let me know. Thanks!


Justin

On Wed, Mar 15, 2023 at 9:36 AM Oliver Lins <l...@lins-it.de> wrote:

Hi Justin,

thank you for your fast reply.

   > Would it be possible for you to work up a way to reproduce the
behavior you're seeing?
Yes, I can reproduce the behavior. I have simplified producer and
consumer Java code that reproduces it.
The code is not yet the bare minimum necessary to reproduce it, but I can
change that.

   >  If so, is the order-of-creation only essential per producer [...]
Yes, the order is only essential per producer.

Please let me know how I can assist you.

Thank you,
Oliver

On 3/15/23 14:58, Justin Bertram wrote:
Based on your description, attached configuration, and logs I don't see
anything wrong, per se. Would it be possible for you to work up a way to
reproduce the behavior you're seeing?

Do you ever have more than 1 producer? If so, is the order-of-creation only
essential per producer or is it essential across all producers?


Justin

On Wed, Mar 15, 2023 at 8:29 AM Oliver Lins <l...@lins-it.de> wrote:

Hi,

we are using Artemis with the following setup:
- 2 independent broker instances (on 2 hosts)
- a cluster configuration to create a Core bridge between both instances
(no failover, no HA)
- multiple JMS clients produce and consume AMQP messages using topics
- the clients do the failover themselves
- Artemis versions (2.21.0, 2.29.0-SNAPSHOT cloned on March 8)

Everything is working fine. Regardless of which Artemis instance the
producers or consumers are connected to, they receive all messages in the
order of creation.
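
Roughly, the cluster part of each broker.xml looks like the sketch below; the
host names, ports, connector names and the load-balancing value are
placeholders rather than the exact values from the attached configuration:

   <connectors>
      <!-- this broker -->
      <connector name="local-connector">tcp://broker1:61616</connector>
      <!-- the other broker -->
      <connector name="remote-connector">tcp://broker2:61616</connector>
   </connectors>

   <cluster-connections>
      <cluster-connection name="my-cluster">
         <connector-ref>local-connector</connector-ref>
         <message-load-balancing>ON_DEMAND</message-load-balancing>
         <max-hops>1</max-hops>
         <static-connectors>
            <connector-ref>remote-connector</connector-ref>
         </static-connectors>
      </cluster-connection>
   </cluster-connections>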

To simulate a server failure we kill (-9) Artemis instance 1 and restart it
about half a minute later.
- 1 producer connects to the restarted instance 1
- multiple consumers are (still) connected to instance 2
- 1 consumer connects to the restarted instance 1

The producer sends messages with a delay of 1 ms.
Now we see that
- the order of messages received by the consumer connected to instance 1
frequently does not match the order in which the messages were created
- the order of messages received by consumers connected to instance 2
matches the order in which the messages were created

It is essential for us that the messages arrive in the order of
creation.
Do you have any idea what went wrong or what we are doing wrong?

Thanks in advance,
Oliver

Pls note: the attached files are used to reproduce what we saw in production.
          This test configuration uses 1 Docker instance per Artemis broker.
          Both instances are running on the same host using different ports.
--
Dipl.-Ing. FH der technischen Informatik
Tel.: +49 179 2911883
Email: ol...@lins-it.de
Internet:
          http://www.lins-it.de


