Re: error indication when cluster shared storage is not available

Justin Bertram Tue, 03 Mar 2026 08:12:57 -0800

> Could you elaborate a bit why documentation says that NFS recommendation 
> regarding soft option and data corruption can be safely ignored?


My understanding of that recommendation in the NFS documentation is
that it is based on use cases where application writes to disk are
buffered and the application doesn't ensure the data actually syncs.
This is a very common use case. In such cases the sync can fail, but
since the application didn't wait for it, it won't know about the
failure; consequently, data may be corrupted or lost.

However, the broker doesn't work this way by default. It always
ensures data syncs to disk at the appropriate stages. If an error
occurs during the sync and the mount uses "soft," the broker will be
notified and will in turn notify the client (one way or another).
When the client receives the error, it will know the message wasn't
handled and should resend it to ensure integrity.
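
To illustrate the difference in general terms, here is a minimal Python
sketch of the write-then-sync pattern. This is not broker code: the
function name append_durably and the file layout are made up for
illustration. The point is only that a caller which waits for the sync
sees the I/O error (which a soft mount eventually raises) and can report
failure instead of assuming the data is safe.

```python
import os


def append_durably(path: str, payload: bytes) -> bool:
    """Append payload to path and report whether it reached stable storage.

    Hypothetical helper for illustration only; not part of any broker API.
    """
    try:
        with open(path, "ab") as f:
            f.write(payload)        # lands in the application buffer
            f.flush()               # hands the bytes to the operating system
            os.fsync(f.fileno())    # blocks until the OS reports them as synced
    except OSError:
        # The open, write, or sync failed, so the caller must treat the
        # payload as not stored (for example by reporting an error upstream).
        return False
    return True
```

An application that skips the os.fsync() call returns as soon as the
bytes are buffered and never learns about a later failure, which is the
case the NFS documentation warns about.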

Of course, you can trade data integrity for speed by setting
journal-datasync to false, but that's certainly not recommended for
production data.

We've recommended soft NFS mounts for years, and I can't recall ever
seeing a related data integrity problem. The broker is written and
configured to prevent this. However, anybody who uses a "hard" mount
eventually runs into a problem just like yours: the broker hangs,
waiting indefinitely on storage. Furthermore, if the storage problem
is isolated to the primary, the backup is essentially useless with a
hard mount. You actually want the primary to fail here so the backup
can take over and continue serving clients. Otherwise you just have a
needless outage.
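
As a rough illustration of why a hang is worse than an error, the
Python sketch below probes a directory with a small synced write and
gives up after a timeout. It is a hypothetical helper: the name
storage_responds, the probe file name, and the 5-second default are all
assumptions, not anything the broker ships. On a soft mount a dead
server would eventually surface as an OSError; on a hard mount the write
would simply never return, and only the timeout reveals the problem.

```python
import os
import threading


def storage_responds(directory: str, timeout_s: float = 5.0) -> bool:
    """Return True if a small synced write in directory completes in time.

    Hypothetical helper for illustration only.
    """
    result = {"ok": False}

    def probe() -> None:
        path = os.path.join(directory, ".storage-probe")
        try:
            with open(path, "wb") as f:
                f.write(b"ping")
                f.flush()
                os.fsync(f.fileno())
            os.remove(path)
            result["ok"] = True
        except OSError:
            # An I/O error was reported, which is what a soft mount
            # does once it gives up on the server.
            result["ok"] = False

    # A daemon thread lets the caller move on even if the probe stays
    # blocked in I/O, which is what a hard mount would cause.
    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout_s)
    return (not worker.is_alive()) and result["ok"]
```

Something like this could feed an external alert, but it only observes
the problem from outside; it does not make a process stuck on a hard
mount recover.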

Hope that helps!


Justin

On Tue, Mar 3, 2026 at 9:27 AM Vilius Šumskas via users
<[email protected]> wrote:
>
> Thank you Justin for the explanation.
>
> I guess the ActiveMQBasicSecurityManager part explains why we didn't see any 
> "unknown user" errors in the logs. All users were loaded into memory in 
> advance. It's strange though that clients were trying to connect, could not 
> do so, but the broker didn't print anything about connection resets or other 
> related information. We are using the default logging level btw. It was not 
> the same network issue because clients live on a different subnet than the 
> storage. We could ping Artemis nodes from clients successfully during the 
> incident, and fast (millisecond) reconnections at the Qpid level indicate 
> that it was not a TCP-level issue.
>
> I just checked the NFS mount recommendations. We are using 
> timeo=600,retrans=2; however, we are indeed using the hard option instead of 
> soft. I'm going to try to reproduce the issue with both settings to see how 
> it behaves. Could you elaborate a bit why documentation says that NFS 
> recommendation regarding soft option and data corruption can be safely ignored?
>
> --
>     Vilius
>
> -----Original Message-----
> From: Justin Bertram <[email protected]>
> Sent: Tuesday, March 3, 2026 4:43 PM
> To: [email protected]
> Subject: Re: error indication when cluster shared storage is not available
>
> Since Artemis 2.11.0 [1] the broker will periodically evaluate the shared 
> journal file-lock to ensure it hasn't been lost and/or the backup hasn't 
> activated. Assuming proper configuration, I would have expected this 
> component to shut down the broker in your situation.
> Since it didn't shut down the broker my hunch is that your NFS mount is not 
> configured properly. Can you confirm that you're following the NFS mount 
> recommendations [2]? I'm specifically thinking about using soft vs. hard.
>
> It's worth noting that the ActiveMQBasicSecurityManager accesses the journal 
> only when the broker starts. It reads all user/role information from the 
> journal and loads it into memory. The only exception is if an administrator 
> uses the management API to add, remove, or update a user, role, etc. at which 
> point the broker will write to the journal.
>
> Also, if there is no activity on the broker, the critical analyzer has no 
> chance to detect problems.
>
> Based on your description, it sounds like the same network problem that 
> caused an issue with NFS might also have prevented clients from connecting to 
> the broker.
>
>
> Justin
>
> [1] https://issues.apache.org/jira/browse/ARTEMIS-2421
> [2] https://artemis.apache.org/components/artemis/documentation/latest/ha.html#nfs-mount-recommendations
>
> On Mon, Mar 2, 2026 at 4:11 PM Vilius Šumskas via users 
> <[email protected]> wrote:
> >
> > Hello,
> >
> >
> >
> > we have a pretty straightforward Artemis HA cluster consisting of 2 
> > nodes, a primary and a backup. The cluster uses NFS4.1 shared storage to 
> > store the journal. In addition, we are using ActiveMQBasicSecurityManager 
> > for authentication, which means information about Artemis users is on the 
> > same shared storage.
> >
> >
> >
> > A couple of days ago we had an incident with our shared storage provider. 
> > During the incident the storage was fully unreachable network-wise. The 
> > interesting part is that during the incident Artemis didn’t print any 
> > exceptions or any errors in the logs. No messages that the journal was 
> > unreachable, no messages about failure to reach the backup, even though the 
> > backup was also experiencing the same issue with the storage. External AMQP 
> > client connections also didn’t result in the usual warning in the logs for 
> > “unknown users”, even though on the client side Qpid clients constantly 
> > printed “cannot connect” errors. It was as if the broker instances were 
> > unreachable by the clients while inside the broker all processes had simply 
> > stopped, hanging and waiting for the storage.
> >
> > The critical analyzer also didn’t kick in for some reason. Usually it works 
> > very well for us when the same NFS storage slows down considerably, but 
> > not this time.
> >
> >
> >
> > Only after I completely restarted the primary VM node, and it could not 
> > mount the NFS storage at all (after waiting 3 minutes for the timeout during 
> > restart), did Artemis boot and start producing IOExceptions, “unknown user” 
> > errors, “connection failed to backup node” errors, and every other possible 
> > error related to an unreachable journal, as expected.
> >
> >
> >
> > Is the silence in the logs due to unreachable NFS storage a bug? If so, 
> > what do developers need for a reproducible case? As I said, there is nothing 
> > in the logs at the moment, but I could try to reproduce it in a testing 
> > environment with any combination of debugging properties if needed.
> >
> >
> >
> > If it’s not a bug, how should we ensure proper alerting (and possibly 
> > automatic Artemis shutdown) in case shared storage is down? Are we missing 
> > some NFS mount option or critical analyzer setting, maybe? Currently we are 
> > using defaults.
> >
> >
> >
> > Any pointers are much appreciated!
> >
> >
> >
> > --
> >
> >    Best Regards,
> >
> >
> >
> >     Vilius Šumskas
> >
> >     Rivile
> >
> >     IT manager
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
