Could using a mirror and a lock coordinator be an alternative here?

Clebert Suconic
On Sun, Mar 8, 2026 at 5:41 PM Vilius Šumskas via users <[email protected]> wrote:

> Yes. I spent a couple of hours over the weekend testing Artemis behaviour
> with soft vs. hard NFS mounts, along with some other tests. These are the
> results.
>
> The first test I made was changing the NFS IP allowlist by modifying the
> exports file. Since this was done on the NFS server side, the test
> affected both the primary and backup Artemis nodes. When the IP allowlist
> is changed, modern NFS clients react by disconnecting the mount point
> completely. On the primary node such a disconnect correctly resulted in a
> Critical IO Error and an automatic broker shutdown. The backup didn't
> react to this at all. It only detected that the primary's connector was
> down, but didn't start serving clients, nor did it produce a Critical IO
> Error. Not sure if this is correct behaviour. I suspect it didn't produce
> IO errors since the broker was not started in primary mode, but why
> didn't it TRY to take over?
> You can see full logs of this test in the attached files
> "primary_nfsleveldeny.txt" and "backup_nfsleveldeny.txt".
>
> The second test was performed with the hard NFS mount option by denying
> outbound traffic to the NFS server's IP address on the primary node with
> "firewall-cmd --direct --add-rule ipv4 filter OUTPUT 0
> -d <nfs_server_ip>/32 -j REJECT". Full logs are in the attached
> "primary_nfshard.txt" and "backup_nfshard.txt" files.
> * At ~22:13 I logged in to the Artemis Console on the primary.
> * At ~22:14 the traffic to shared storage on the primary was denied.
> As during our incident, this didn't produce any logs on the primary. No
> client login errors, no IO errors, nothing. In the Artemis Console I
> could even send a message to ExpiredQueue (though I could not browse it).
> I could also create a new address, search for the new address in the
> Console, and delete it. How is this even possible? Should I assume some
> broker operations happen completely in memory and are not written to the
> journal until needed?
> Though, after a while, address search in the Console stopped working. I
> assume some kind of Console cache expired. But I could still list
> consumers and producers.
> * At around the same time (22:14) the backup node took over. This is
> probably expected even with the hard NFS option, because the Artemis
> lock on the shared volume had expired?
> * The cluster stayed this way, with the primary producing no errors,
> until ~22:22 when I removed the firewall rule and outbound traffic from
> the primary to the NFS storage could flow again. This resulted in a lost
> lock and Critical IO Errors, and the primary finally shut down.
> Maybe it's just me, but I think there should be a way to detect such a
> stalled NFS mount and shut down the broker sooner.
>
> Now to the last test. It was performed with the soft NFS mount option by
> denying outbound traffic to the NFS server's IP address on the primary
> node using the same firewalld rule. Logs are in the
> "primary_nfssoft.txt" file.
> * At ~22:36 the traffic to shared storage on the primary was denied.
> * At ~22:37 this resulted in a lock failure, however the broker didn't
> produce a Critical IO Error and didn't try to shut down automatically
> until 22:40. Not sure why. On one hand this matches our
> timeo=600,retrans=2 mount options, but shouldn't the broker try to shut
> down right away with "Lost NodeManager lock" (just like in the previous
> test after the mount point came back)? I could even use the Artemis
> Console and create or delete an address, which is, again, strange for a
> cluster node which just lost the lock.
> * Anyway, at 22:40 it starts shutting down, it produces other Critical
> IO errors, and the Critical Analyzer also kicks in. The broker also
> produced a thread dump during the process (attached separately),
> however it never actually shuts down completely. I could see the java
> process trying to do something.
> * I waited more than 15 minutes, but a full shutdown never occurred.
> * In parallel, the backup took over at ~22:36, so no surprises there.
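The roughly three-minute gap between the lock failure (~22:37) and the shutdown attempt (~22:40) does appear consistent with those mount options, if I read nfs(5) correctly: timeo is in tenths of a second, and on a TCP mount a soft mount only reports an error after the original attempt plus retrans retransmissions. A quick sanity check of that arithmetic:

```shell
# nfs(5): timeo is in deciseconds; on TCP there is no backoff, so a soft
# mount errors out after the first attempt plus `retrans` retries.
timeo_ds=600; retrans=2
timeout_s=$((timeo_ds / 10))               # 600 deciseconds -> 60 s per attempt
total_s=$(( timeout_s * (retrans + 1) ))   # first attempt + 2 retransmissions
echo "soft mount errors out after ~${total_s}s"   # -> ~180s, i.e. ~3 minutes
```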
> However, since the primary was not fully down, the Artemis client
> didn't fail over to the backup.
>
> Summarizing all the tests, my main concern is data integrity during NFS
> incidents, be it on one or both nodes. If the NFS mount is not
> available, where does all the message and topology data go? Memory? If
> yes, what happens if the NFS mount point doesn't come back in time and
> the broker is killed? Do we lose all addresses or queues created during
> that time? This incident and these tests tell me that the journal is
> not what I expected it to be.
>
> Tests were performed with the latest Artemis version 2.52.0 on Rocky
> Linux 9.7 and NetApp NFS cloud storage.
>
> P.S. I didn't include the intr mount option in my tests, as it is
> deprecated and completely ignored by kernels above version 2.6.25. I
> will prepare a documentation PR about this shortly.
>
> --
> Vilius
>
> -----Original Message-----
> From: Justin Bertram <[email protected]>
> Sent: Wednesday, March 4, 2026 8:50 PM
> To: [email protected]
> Subject: Re: error indication when cluster shared storage is not
> available
>
> Any results to share from your testing?
>
>
> Justin
>
> On Tue, Mar 3, 2026 at 9:27 AM Vilius Šumskas via users <
> [email protected]> wrote:
>
> > Thank you Justin for the explanation.
> >
> > I guess the ActiveMQBasicSecurityManager part explains why we didn't
> > see any "unknown user" errors in the logs. All users were loaded into
> > memory in advance. It's strange, though, that clients were trying to
> > connect and could not, yet the broker didn't print anything about
> > connection resets or other related information. We are using the
> > default logging level, btw. It was not the same network issue,
> > because the clients live on a different subnet than the storage. We
> > could ping the Artemis nodes from the clients successfully during the
> > incident, and fast (millisecond) reconnections at the Qpid level
> > indicate that it was not a TCP-level issue.
> >
> > I just checked the NFS mount recommendations.
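Since the soft vs. hard distinction matters here, a quick way to verify which mode a mounted NFS filesystem is actually using is to inspect its options in /proc/mounts. A minimal sketch; the sample line below is made up, and in practice you would grep the real line for the journal mount point:

```shell
# Hypothetical /proc/mounts line for the shared journal; server name and
# paths are placeholders. In practice:
#   grep /var/lib/artemis /proc/mounts
line='nfsserver:/artemis /var/lib/artemis nfs4 rw,relatime,vers=4.1,hard,timeo=600,retrans=2 0 0'
opts=$(echo "$line" | awk '{print $4}')

# "hard" is the kernel default, so the absence of "soft" also means hard.
case ",$opts," in
  *,soft,*) mode=soft ;;
  *)        mode=hard ;;
esac
timeo=$(echo "$opts" | tr ',' '\n' | sed -n 's/^timeo=//p')
echo "mode=$mode timeo=$timeo"   # -> mode=hard timeo=600
```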
> > We are using timeo=600,retrans=2, however indeed we are using the
> > hard instead of the soft option. I'm going to try to reproduce the
> > issue with both settings to see how it behaves. Could you elaborate a
> > bit on why the documentation says that the NFS recommendation
> > regarding the soft option and data corruption can be safely ignored?
> >
> > --
> > Vilius
> >
> > -----Original Message-----
> > From: Justin Bertram <[email protected]>
> > Sent: Tuesday, March 3, 2026 4:43 PM
> > To: [email protected]
> > Subject: Re: error indication when cluster shared storage is not
> > available
> >
> > Since Artemis 2.11.0 [1] the broker will periodically evaluate the
> > shared journal file-lock to ensure it hasn't been lost and/or the
> > backup hasn't activated. Assuming proper configuration, I would have
> > expected this component to shut down the broker in your situation.
> > Since it didn't shut down the broker, my hunch is that your NFS mount
> > is not configured properly. Can you confirm that you're following the
> > NFS mount recommendations [2]? I'm specifically thinking about using
> > soft vs. hard.
> >
> > It's worth noting that the ActiveMQBasicSecurityManager accesses the
> > journal only when the broker starts. It reads all user/role
> > information from the journal and loads it into memory. The only
> > exception is if an administrator uses the management API to add,
> > remove, or update a user, role, etc., at which point the broker will
> > write to the journal.
> >
> > Also, if there is no activity on the broker, the critical analyzer
> > has no chance to detect problems.
> >
> > Based on your description, it sounds like the same network problem
> > that caused an issue with NFS might also have prevented clients from
> > connecting to the broker.
> >
> >
> > Justin
> >
> > [1] https://issues.apache.org/jira/browse/ARTEMIS-2421
> > [2]
> > https://artemis.apache.org/components/artemis/documentation/latest/ha.
> > html#nfs-mount-recommendations
> >
> > On Mon, Mar 2, 2026 at 4:11 PM Vilius Šumskas via users <
> > [email protected]> wrote:
> >
> > > Hello,
> > >
> > > we have a pretty straightforward Artemis HA cluster consisting of 2
> > > nodes, a primary and a backup. The cluster uses NFS 4.1 shared
> > > storage to store the journal. In addition, we are using
> > > ActiveMQBasicSecurityManager for authentication, which means the
> > > information about Artemis users is on the same shared storage.
> > >
> > > A couple of days ago we had an incident with our shared storage
> > > provider. During the incident the storage was fully unreachable
> > > network-wise. The interesting part is that during the incident
> > > Artemis didn't print any exceptions or errors in the logs. No
> > > messages that the journal could not be reached, no messages about
> > > failure to reach the backup, even though the backup was also
> > > experiencing the same issue with the storage. External AMQP client
> > > connections also didn't result in the usual "unknown users" warning
> > > in the logs, even though on the client side the Qpid clients
> > > constantly printed "cannot connect" errors. It was as if the broker
> > > instances were unreachable by the clients, while inside the broker
> > > all processes just hung, waiting for the storage.
> > >
> > > The critical analyzer also didn't kick in for some reason. Usually
> > > it works very well for us when the same NFS storage slows down
> > > considerably, but not this time.
> > >
> > > Only after I completely restarted the primary VM node, and it could
> > > not mount the NFS storage at all (after waiting 3 minutes for a
> > > timeout during restart), did Artemis boot and start producing
> > > IOExceptions, "unknown user" errors, "connection failed to backup
> > > node" errors, and every other possible error related to an
> > > unreachable journal, as expected.
> > >
> > > Is the silence in the logs due to unreachable NFS storage a bug? If
> > > so, what do the developers need for a reproducible case?
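Regarding the critical analyzer mentioned above: its behaviour is tunable in broker.xml. A sketch with the relevant settings; the element names are real Artemis configuration, but the values shown are illustrative, not a recommendation for this scenario:

```xml
<!-- broker.xml sketch: critical analyzer settings (values illustrative;
     check the Artemis configuration reference for the actual defaults) -->
<critical-analyzer>true</critical-analyzer>
<critical-analyzer-timeout>120000</critical-analyzer-timeout>           <!-- ms -->
<critical-analyzer-check-period>60000</critical-analyzer-check-period>  <!-- ms -->
<critical-analyzer-policy>HALT</critical-analyzer-policy>  <!-- HALT, SHUTDOWN, or LOG -->
```

Note that, as Justin says later in the thread, the analyzer only measures components that are actually doing work, so an idle broker on a stalled mount gives it nothing to time.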
> > > As I said, there is nothing in the logs at the moment, but I could
> > > try to reproduce it in a testing environment with any combination
> > > of debugging properties if needed.
> > >
> > > If it's not a bug, how should we ensure proper alerting (and
> > > possibly automatic Artemis shutdown) in case the shared storage is
> > > down? Are we missing some NFS mount option or critical analyzer
> > > setting, maybe? Currently we are using the defaults.
> > >
> > > Any pointers are much appreciated!
> > >
> > > --
> > > Best Regards,
> > >
> > > Vilius Šumskas
> > > Rivile
> > > IT manager
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
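For anyone following along, an fstab entry along the lines of the recommendations in [2] might look roughly like this. The server name, export path, and mount point are placeholders, and the option list should be taken from the linked documentation rather than from this sketch:

```
# /etc/fstab sketch -- names are placeholders; verify the option list
# against the Artemis NFS mount recommendations ([2] in the thread).
# intr is omitted as it is ignored by modern kernels (see the P.S. above).
nfsserver:/artemis  /var/lib/artemis  nfs4  soft,sync,noac,lookupcache=none,timeo=600,retrans=2  0 0
```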
