On 18-11-2025 at 12:59, Wido den Hollander wrote:


On 17-11-2025 at 17:17, Levin Ng wrote:
Hi Wido,

I’m curious about your settings. I have an M30 and an X10, and both have
the same issue with VM failover interruption until we fall back to NFSv3;
KVM just cannot resume the I/O, no matter whether we extend the guest VM
kernel hung-task wait time or shorten the retries in the NFS client.
Would you mind sharing your setup details?
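(For reference, the knobs referred to here would typically be the guest
kernel's hung-task timeout and the NFS client's timeo/retrans mount
options. A minimal sketch, assuming a Linux guest and a Linux NFS client;
server name, export path and values are placeholders, not from this setup:

# Guest side: how long the kernel waits before reporting a hung task
# (default is 120 seconds)
sysctl -w kernel.hung_task_timeout_secs=300

# NFS client side: retry behaviour comes from the mount options;
# timeo is in tenths of a second, so timeo=600 means 60 s per retry
mount -t nfs -o vers=4.1,hard,timeo=600,retrans=2 nas-vip:/export /mnt/vmstore
)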


I think we are using NFSv3 and not NFSv4 in our case. It has been a while since I looked at this.

It is NFS with TCP and not UDP, that's for sure.


I was able to access a hypervisor; this is how it is mounted:

10.255.253.6:/mnt/pool001/ps-c14-35-3 on /mnt/609935f9-099b-31a6-9bd8-69f0e20876ff type nfs (rw,nosuid,nodev,noexec,relatime,vers=3,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=1800,retrans=2,sec=sys,mountaddr=10.255.253.6,mountvers=3,mountport=1010,mountproto=udp,local_lock=none,addr=10.255.253.6)
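The relevant options here are hard and timeo=1800 (timeo is in tenths of
a second, so 180 seconds). As a sketch only, since CloudStack/libvirt
mounts the pool itself, the equivalent fstab-style line would be roughly:

10.255.253.6:/mnt/pool001/ps-c14-35-3  /mnt/609935f9-099b-31a6-9bd8-69f0e20876ff  nfs  vers=3,proto=tcp,hard,timeo=1800,retrans=2  0 0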


wdh@hv-138-zzz-xx:~$ cat /etc/nfsmount.conf
# Ansible managed

[ NFSMount_Global_Options ]
timeo=1800
wdh@hv-138-zzz-xx:~$

wdh@hv-138-zzz-xx:~$ sudo virsh pool-dumpxml 609935f9-099b-31a6-9bd8-69f0e20876ff
<pool type='netfs'>
  <name>609935f9-099b-31a6-9bd8-69f0e20876ff</name>
  <uuid>609935f9-099b-31a6-9bd8-69f0e20876ff</uuid>
  <capacity unit='bytes'>27487790694400</capacity>
  <allocation unit='bytes'>1942711828480</allocation>
  <available unit='bytes'>25545078865920</available>
  <source>
    <host name='10.255.253.6'/>
    <dir path='/mnt/pool001/ps-c14-35-3'/>
    <format type='auto'/>
  </source>
  <target>
    <path>/mnt/609935f9-099b-31a6-9bd8-69f0e20876ff</path>
    <permissions>
      <mode>0755</mode>
      <owner>0</owner>
      <group>0</group>
    </permissions>
  </target>
</pool>

wdh@hv-138-zzz-xx:~$

Hope this helps!

Wido

Regards,
Levin

On 17 Nov 2025 at 21:44 +0800, Wido den Hollander <[email protected]>, wrote:


On 14-11-2025 at 14:31, Marty Godsey wrote:
Thank you for all the replies. I will answer all the additional
questions I got here:

1. Hard vs soft mount.
   1. It originally was a hard mount and I moved it to a soft mount as
      part of troubleshooting.
   2. I did increase the retries by 1. I will increase the timeout and
      see if it helps.
2. You are correct, Levin, about the failover process.


I am seeing if delegations are enabled on the TrueNAS side and disabling
it if it is.


You should be able to use a TrueNAS appliance with failover. We are
running multiple TrueNAS M50 appliances where failover just works.

Do know that you will be looking at a couple of delays:

- The MAC behind the virtual IP changes, so you need to wait for ARP to
refresh on the router/HV, etc.
- The TCP connections for NFS need to be re-established

The VM will stall for ~30s, but should then resume work.
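A quick way to watch both of these from a hypervisor during a failover
test, as a sketch (the VIP 10.255.253.6 is taken from the mount output
above):

# Watch the ARP entry for the storage VIP flip to the new controller's MAC
ip neigh show 10.255.253.6

# Force it to be re-resolved right away instead of waiting for it to age out
ip neigh flush to 10.255.253.6

# Check that the NFS TCP connections to the VIP have been re-established
ss -tn dst 10.255.253.6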

What virtual disk driver are you using with KVM? virtio-scsi is the one
you want.
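For reference, a virtio-scsi disk in the libvirt domain XML looks roughly
like this; a sketch only, and the image path is a made-up example:

<!-- a virtio-scsi controller plus a qcow2 disk attached on bus='scsi' -->
<controller type='scsi' model='virtio-scsi'/>
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2' cache='none'/>
  <source file='/mnt/609935f9-099b-31a6-9bd8-69f0e20876ff/example-vm.qcow2'/>
  <target dev='sda' bus='scsi'/>
</disk>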

Wido

Marty Godsey

T: 859-328-1100
E: [email protected]
www.rudio.net

Book a meeting with Marty: Calendly <https://calendly.com/rudio-martyg>


From: Levin Ng <[email protected]>
Date: Friday, November 14, 2025 at 4:52 AM
To: [email protected], [email protected]
Subject: Re: QCOW2 on NFS shared storage



IMO, TrueNAS Enterprise on their hardware seems to re-import the pool
and restart the NFS service during a failover. This means that the NFS
client cannot automatically reclaim its state on the new NFS primary,
unlike an ordinary Linux Pacemaker NFS cluster, which can hand over
client locks and reconnections. I believe the same applies to TrueNAS
SCALE.

On 14 Nov 2025 at 17:19 +0800, Jürgen Gotteswinter
<[email protected]>, wrote:
Also, make sure you have disabled NFSv4 delegations. They can cause all
kinds of weird problems. Maybe that's the reason why NFSv3 is working
fine for you:

https://utcc.utoronto.ca/~cks/space/blog/linux/NFSv4DelegationMandatoryLock
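On a plain Linux NFS server, one commonly documented way to stop
delegations being handed out is to disable file leases before nfsd
starts; a sketch, and whether/how TrueNAS exposes the same switch is a
separate question:

# knfsd implements NFSv4 delegations via file leases; disabling leases
# stops the server from granting delegations. Must be set before nfsd starts.
echo 'fs.leases-enable = 0' > /etc/sysctl.d/90-no-nfs-delegations.conf
sysctl -p /etc/sysctl.d/90-no-nfs-delegations.conf
systemctl restart nfs-server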

On 14.11.25 at 09:58, "Jürgen Gotteswinter"
<[email protected]> wrote:


Have you tried mounting your shares with „hard“ instead of „soft“?
Reducing timeo and retrans might also improve your situation.
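For comparison, a hard-mount variant of the fstab line quoted further
down in this thread would look roughly like this; a sketch only, server
and paths are placeholders:

nas-vip:/mnt/tank/vmstore  /mnt/vmstore  nfs  vers=4.1,proto=tcp,hard,timeo=600,retrans=2,rsize=1048576,wsize=1048576,_netdev,noatime  0 0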


From: Marty Godsey <[email protected]>
Reply-To: [email protected]
Date: Friday, 14 November 2025 at 09:45
To: [email protected]
Subject: Re: QCOW2 on NFS shared storage


Hello guys,


I will answer in this email to all the different responses I got.




1. Describe HA setup.

* It’s a basic two-node HA cluster using ZFS. It is a TrueNAS F60,
all-NVMe. The network is using basic MLAG. The Linux hosts have an MLAG
as well. I understand this is not a network failover, but I wanted to
describe the network a little.


2. NFS Version.

* I am using NFS 4.1 now. Here is my connect string: nfs
vers=4.1,rsize=1048576,wsize=1048576,nconnect=16,_netdev,soft,intr,timeo=100,retrans=3,noatime,nodiratime,async
 0 0

* When changing it to NFSv3, I can perform a failover and it does not
lock up the VM. Though the speed is slower, which is expected with v3.


3. Disk timeout issue.

* It does not take 30 seconds. From the time of starting the failover
test to NFS being writable again is about 5-8 seconds. The network side
drops no test pings, sometimes maybe 1, and the mounts can be written to
again at around the 8-10 second mark.





Marty




From: Eric Green <[email protected]>
Date: Friday, November 14, 2025 at 1:19 AM
To: [email protected]
Subject: Re: QCOW2 on NFS shared storage




This isn’t a qcow2 issue. This is a file system timeout issue in the
virtual machine. Your NFS failover event is taking longer than 30
seconds, which is the default timeout for Linux block devices. Your
Linux system then switches the file system to read-only mode, which
sends everything to heck in a handbasket. On Windows VMs it does the
blue screen of death and reboots to an OS Not Found prompt.

You can either increase the block device timeout on Linux or speed up
your failover.
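For a Linux guest with virtio-scsi disks, that timeout lives in sysfs
and can be raised at runtime or pinned with a udev rule; a sketch, where
the device name and the 180-second value are examples:

# One-off, for a single disk (virtio-scsi disks show up as sdX;
# virtio-blk vdX devices do not expose this timeout at all):
echo 180 > /sys/block/sda/device/timeout

# Persistent, e.g. in /etc/udev/rules.d/99-disk-timeout.rules:
ACTION=="add|change", SUBSYSTEM=="block", ATTRS{vendor}=="QEMU*", ATTR{device/timeout}="180"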


VMware ESXi handled this situation by pausing the virtual machine when
it detected NFS delays. I don’t know whether qemu/kvm has that ability:
it pauses the VM when migrating a VM to a different physical server, but
not when there are delays in the underlying NFS. CloudStack can only use
functionality provided by the hypervisor.
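For what it is worth, libvirt does expose a per-disk error_policy
attribute that makes QEMU pause the guest on I/O errors instead of
passing them through; it reacts to errors, not to stalls on a hard
mount, and whether CloudStack sets it is another matter. A sketch, with
a made-up image path:

<disk type='file' device='disk'>
  <!-- error_policy covers write errors, rerror_policy covers read errors -->
  <driver name='qemu' type='qcow2' cache='none' error_policy='stop' rerror_policy='stop'/>
  <source file='/mnt/pool/example-vm.qcow2'/>
  <target dev='sda' bus='scsi'/>
</disk>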


________________________________
From: Marty Godsey <[email protected]>
Sent: Thursday, November 13, 2025 3:25 PM
To: [email protected]
Subject: QCOW2 on NFS shared storage


Hello everyone.


So, I am learning, or reading at least, that the QCOW2 file format,
running on HA shared NFS storage, does not like to fail over.


I have an HA NAS running NFS 4.1 and everything works fine until I
test the failure scenario of a failover on the storage nodes. When I do
this, the entire VM locks up and must be hard reset to recover.


Is this true? Do people not use QCOW2 on HA NFS storage? Are there
timeouts I can set, etc.?


Thank you for all the input.




Marty






