Could using mirroring and a lock coordinator be an alternative here?

Clebert Suconic


On Sun, Mar 8, 2026 at 5:41 PM Vilius Šumskas via users <
[email protected]> wrote:

> Yes. I spent a couple of hours over the weekend testing Artemis behaviour
> with soft vs. hard NFS mounts, plus some other tests. These are the results.
>
> The first test I made was changing the NFS IP allowlist by modifying the
> exports file. Since this was done on the NFS server side, the test affected
> both the primary and backup Artemis nodes. When the IP allowlist is changed,
> modern NFS clients react by disconnecting the mount point completely. On the
> primary node such a disconnect correctly resulted in a Critical IO Error and
> an automatic broker shutdown. The backup didn't react to this at all. It
> only detected that the primary's connector was down, but it didn't start
> serving clients, nor did it produce a Critical IO Error. I'm not sure if
> this is correct behaviour. I suspect it didn't produce IO errors since the
> broker was not started in primary mode, but why didn't it TRY to take over?
> You can see the full logs of this test in the attached files
> "primary_nfsleveldeny.txt" and "backup_nfsleveldeny.txt".
>
> The second test was performed with the hard NFS mount option, by denying
> outbound traffic to the NFS server's IP address on the primary node with
> "firewall-cmd --direct --add-rule ipv4 filter OUTPUT 0 -d <nfs_server_ip>/32
> -j REJECT". Full logs are in the attached "primary_nfshard.txt" and
> "backup_nfshard.txt" files.
> * At ~22:13 I logged in to the Artemis Console on the primary.
> * At ~22:14 traffic to the shared storage was denied on the primary.
> As during our incident, this didn't produce any logs on the primary. No
> client login errors, no IO errors, nothing. In the Artemis Console I could
> even send a message to ExpiredQueue (though I could not browse it). I could
> also create a new address, search for the new address in the Console, and
> delete it. How is this even possible? Should I assume some broker operations
> happen completely in memory and are not written to the journal until needed?
> After a while, though, address search in the Console stopped working. I
> assume some kind of Console cache expired. But I could still list consumers
> and producers.
> * At around the same time (22:14) the backup node took over. This is
> probably expected even with the hard NFS option, because the Artemis lock on
> the shared volume had expired?
> * The cluster stayed this way, with the primary producing no errors, until
> ~22:22 when I removed the firewall rule and outbound traffic from the
> primary to the NFS storage could flow again.
> This resulted in a lost lock and Critical IO Errors, and the primary
> finally shut down.
> Maybe it's just me, but I think there should be a way to detect such a
> stalled NFS mount and shut down the broker sooner.
>
> Now to the last test. It was performed with the soft NFS mount option, by
> denying outbound traffic to the NFS server's IP address on the primary node
> using the same firewalld rule. Logs are in the "primary_nfssoft.txt" file.
> * At ~22:36 traffic to the shared storage was denied on the primary.
> * At ~22:37 this resulted in a lock failure; however, the broker didn't
> produce a Critical IO Error and didn't try to shut down automatically until
> 22:40. I'm not sure why. On one hand this matches our timeo=600,retrans=2
> mount options, but shouldn't the broker try to shut down right away with
> "Lost NodeManager lock" (just like in the previous test, after the mount
> point came back)? I could even use the Artemis Console and create or delete
> an address, which is, again, strange for a cluster node which has just lost
> the lock.
> * Anyway, at 22:40 it started shutting down, produced other Critical IO
> errors, and the Critical Analyzer also kicked in. The broker also produced a
> thread dump during the process (attached separately); however, it never
> actually shut down completely. I could see the java process still trying to
> do something.
> * I waited more than 15 minutes, but a full shutdown never occurred.
> * In parallel, the backup took over at ~22:36, so no surprises there.
> However, since the primary was not fully down, the Artemis client didn't
> fail over to the backup.
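>
> For reference, the hard and soft tests above differ only in one mount flag.
> An illustrative fstab line for the soft case, using our
> timeo=600,retrans=2 settings (server name and mount point are placeholders,
> not our real ones):
>
> ```
> # illustrative /etc/fstab entry -- placeholder server and paths
> nfs-server:/artemis  /var/lib/artemis/shared  nfs4  soft,timeo=600,retrans=2  0  0
> ```
>
> The hard-mount test used the same line with "hard" in place of "soft".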
>
> Summarizing all the tests, my main concern is data integrity during NFS
> incidents, be it on one or both nodes. If the NFS mount is not available,
> where does all the message and topology data go? Memory? If so, what happens
> if the NFS mount point doesn't come back in time and the broker is killed?
> Do we lose all addresses or queues created during that time? This incident
> and these tests tell me that the journal is not what I expected it to be.
>
> The tests were performed with the latest Artemis version, 2.52.0, on Rocky
> Linux 9.7 and NetApp NFS cloud storage.
>
> P.S. I didn't include the intr mount option in my tests, as it is now
> deprecated and completely ignored by kernels above version 2.6.25. I will
> prepare a documentation PR about this shortly.
>
> --
>     Vilius
>
> -----Original Message-----
> From: Justin Bertram <[email protected]>
> Sent: Wednesday, March 4, 2026 8:50 PM
> To: [email protected]
> Subject: Re: error indication when cluster shared storage is not available
>
> Any results to share from your testing?
>
>
> Justin
>
> On Tue, Mar 3, 2026 at 9:27 AM Vilius Šumskas via users <
> [email protected]> wrote:
> >
> > Thank you Justin for the explanation.
> >
> > I guess ActiveMQBasicSecurityManager part explains why we didn't see any
> "unknown user" errors in the logs. All users were loaded into memory in
> advance. It's strange though that clients were connecting, could not do so,
> but the broker didn't print anything about connection resets or other
> related information. We are using default logging level btw. It was not the
> same network issue because clients live on the different subnet than
> storage. We could ping Artemis nodes from clients successfully during
> incident, and fast (millisecond) reconnections at Qpid level indicates that
> it was not TCP level issue.
> >
> > I just checked the NFS mount recommendations. We are using
> timeo=600,retrans=2; however, we are indeed using the hard option instead of
> soft. I'm going to try to reproduce the issue with both settings to see how
> it behaves. Could you elaborate a bit on why the documentation says the
> usual NFS warning about the soft option and data corruption can be safely
> ignored?
> >
> > --
> >     Vilius
> >
> > -----Original Message-----
> > From: Justin Bertram <[email protected]>
> > Sent: Tuesday, March 3, 2026 4:43 PM
> > To: [email protected]
> > Subject: Re: error indication when cluster shared storage is not
> > available
> >
> > Since Artemis 2.11.0 [1] the broker will periodically evaluate the
> shared journal file-lock to ensure it hasn't been lost and/or the backup
> hasn't activated. Assuming proper configuration, I would have expected this
> component to shut down the broker in your situation.
> > Since it didn't shut down the broker my hunch is that your NFS mount is
> not configured properly. Can you confirm that you're following the NFS
> mount recommendations [2]? I'm specifically thinking about using soft vs.
> hard.
> >
> > It's worth noting that the ActiveMQBasicSecurityManager accesses the
> journal only when the broker starts. It reads all user/role information
> from the journal and loads it into memory. The only exception is if an
> administrator uses the management API to add, remove, or update a user,
> role, etc. at which point the broker will write to the journal.
> >
> > Also, if there is no activity on the broker, the critical analyzer has
> no chance to detect problems.
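> >
> > For reference, the critical analyzer is controlled by a handful of
> broker.xml settings; a sketch below (the values shown are, to my
> understanding, the defaults -- check the configuration reference to be
> sure):
> >
> > ```xml
> > <!-- critical analyzer settings, sketch with assumed default values -->
> > <critical-analyzer>true</critical-analyzer>
> > <critical-analyzer-timeout>120000</critical-analyzer-timeout>
> > <critical-analyzer-check-period>60000</critical-analyzer-check-period>
> > <critical-analyzer-policy>HALT</critical-analyzer-policy>
> > ```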
> >
> > Based on your description, it sounds like the same network problem that
> caused an issue with NFS might also have prevented clients from connecting
> to the broker.
> >
> >
> > Justin
> >
> > [1] https://issues.apache.org/jira/browse/ARTEMIS-2421
> > [2]
> > https://artemis.apache.org/components/artemis/documentation/latest/ha.
> > html#nfs-mount-recommendations
> >
> > On Mon, Mar 2, 2026 at 4:11 PM Vilius Šumskas via users <
> [email protected]> wrote:
> > >
> > > Hello,
> > >
> > >
> > >
> > > we have a pretty straightforward Artemis HA cluster consisting of 2
> nodes, a primary and a backup. The cluster uses NFS 4.1 shared storage to
> store the journal. In addition, we are using ActiveMQBasicSecurityManager
> for authentication, which means the information about Artemis users is on
> the same shared storage.
> > >
> > >
> > >
> > > A couple of days ago we had an incident with our shared storage
> provider. During the incident the storage was fully unreachable
> network-wise. The interesting part is that during the incident Artemis
> didn’t print any exceptions or errors in the logs. No messages that the
> journal could not be reached, no messages about failure to reach the backup,
> even though the backup was also experiencing the same issue with the
> storage. External AMQP client connections also didn’t result in the usual
> “unknown user” warnings in the logs, even though on the client side the Qpid
> clients constantly printed “cannot connect” errors. It was as if the broker
> instances were unreachable by the clients, while inside the broker all
> processes simply hung, waiting for the storage.
> > >
> > > The critical analyzer also didn’t kick in for some reason. Usually it
> works very well for us when the same NFS storage slows down considerably,
> but not this time.
> > >
> > >
> > >
> > > Only after I completely restarted the primary VM node, and it could not
> mount the NFS storage at all (after waiting 3 minutes for a timeout during
> the restart), did Artemis boot and start producing IOExceptions, “unknown
> user” errors, “connection failed to backup node” errors, and every other
> possible error related to an unreachable journal, as expected.
> > >
> > >
> > >
> > > Is the silence in the logs due to unreachable NFS storage a bug? If so,
> what do the developers need for a reproducible case? As I said, there is
> nothing in the logs at the moment, but I could try to reproduce it in a
> testing environment with any combination of debugging properties if needed.
> > >
> > >
> > >
> > > If it’s not a bug, how should we ensure proper alerting (and possibly
> automatic Artemis shutdown) in case the shared storage is down? Are we
> missing some NFS mount option or critical analyzer setting, maybe?
> Currently we are using the defaults.
> > >
> > >
> > >
> > > Any pointers are much appreciated!
> > >
> > >
> > >
> > > --
> > >
> > >    Best Regards,
> > >
> > >
> > >
> > >     Vilius Šumskas
> > >
> > >     Rivile
> > >
> > >     IT manager
> > >
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >
> >
> >
>
