Hello,
We hit a strange error on ActiveMQ last week and wanted to check whether anyone
has experienced this before.
Background
A couple of weeks ago we patched the ActiveMQ Prod VMs. After they were
restarted, the wrong configuration was set up, causing a "split brain" problem
between the master and the slave.
To troubleshoot the invalid configuration before going to production, we had 2
test VMs created to verify the update process from the previous (static)
configuration to the new configuration using multicast. The testing worked as
expected and we were ready to update the configuration on production.
On Sept 27th the correct configuration was applied (same as you are currently
using). We ended up having 2 masters and 2 slaves running at the same time
because the test VMs had not been turned off yet. When we realized this, we
turned the test VMs off immediately. There were no errors or warnings in the
ActiveMQ or Activity Manager logs, so we thought there would not be an issue.
A couple of days later (Oct 1st) the test VMs were decommissioned, and ERRORs
started appearing in the ActiveMQ logs because the broker could not find the
test VMs:
Example Error Message
2024-10-01 12:40:19,056 ERROR [org.apache.activemq.artemis.core.client] AMQ214016: Failed to create netty connection
java.net.UnknownHostException: amq11test
    at java.net.InetAddress$CachedAddresses.get(InetAddress.java:797) ~[?:?]
    at java.net.InetAddress.getAllByName0(InetAddress.java:1533) ~[?:?]
    at java.net.InetAddress.getAllByName(InetAddress.java:1386) ~[?:?]
    at java.net.InetAddress.getAllByName(InetAddress.java:1307) ~[?:?]
    at java.net.InetAddress.getByName(InetAddress.java:1257) ~[?:?]
    at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:156) ~[netty-common-4.1.86.Final.jar:4.1.86.Final]
    at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:153) ~[netty-common-4.1.86.Final.jar:4.1.86.Final]
    at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
....
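For anyone wanting to check the same thing on their side, here is a minimal
Python sketch that reports which hostnames no longer resolve from the broker
VM. The function name unresolved_hosts and the injectable resolve parameter
are illustrative assumptions and not part of our setup; only the hostname
amq11test comes from the trace above.

```python
import socket


def unresolved_hosts(hostnames, resolve=socket.getaddrinfo):
    """Return the hostnames from `hostnames` that fail name resolution.

    `resolve` defaults to socket.getaddrinfo; it is a parameter only so the
    check can be exercised without real DNS (an assumption of this sketch).
    """
    missing = []
    for name in hostnames:
        try:
            # Port is irrelevant here; we only care whether the name resolves.
            resolve(name, None)
        except socket.gaierror:
            # Roughly the Python counterpart of java.net.UnknownHostException.
            missing.append(name)
    return missing
```

Calling unresolved_hosts(["amq11test"]) on the master after the test VMs were
decommissioned should return that name if DNS is the only problem.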
On Oct 3rd at 8:15 AM the program scheduling work continued communicating with
ActiveMQ, but no jobs were being pulled from the ActiveMQ queues. The ActiveMQ
logs contained only the error shown above, and there were no errors in the
logs of the program scheduling work.
Solution
* Restarted the master ActiveMQ - this solved the "Failed to create netty
  connection" ERROR
* Added a monitor script (checkAMQLog) to ActiveMQ to get notified if an
  ERROR or warning is logged
* For future ActiveMQ debugging in test VMs - use a different port for
  troubleshooting
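To illustrate the kind of check the monitor performs, here is a minimal Python
sketch. It is not the actual checkAMQLog script. It assumes log lines start
with a timestamp and level in the same layout as the example error message
above, and that warnings are logged with the level WARN; the names LOG_LINE
and find_alerts are hypothetical.

```python
import re

# Assumed layout, taken from the example message above:
#   2024-10-01 12:40:19,056 ERROR [logger.name] message text
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>ERROR|WARN)\s+(?P<rest>.*)$"
)


def find_alerts(lines):
    """Return (timestamp, level, rest) for every ERROR or WARN line.

    Stack-trace lines and other levels do not match the pattern and are
    skipped.
    """
    alerts = []
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match:
            alerts.append((match["ts"], match["level"], match["rest"]))
    return alerts
```

The function accepts any iterable of lines, so an open log file can be passed
directly; how the notification is sent is left out of the sketch.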
We are working on a root cause analysis of this issue; however, we cannot find
a specific error in the Artemis log from the time the jobs stopped being
pulled from the queue. Please let me know whether this behavior is expected,
or whether there are additional commands we can use to troubleshoot if it
happens again.
Thanks for your help!
Erick