Re: NFS lockup after UDP fragments getting lost (was: 8.1 tstile lockup after nfs send error 51)
>> It's worth reminding that -o tcp is an option.
> Not for NFS through a (stateful) filtering router, no.

True.  But then, not over network hops that drop port 2049, either.
Break the assumptions underlying the 'net and you have to expect
breakage from stuff built atop it.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML	mo...@rodents-montreal.org
/ \ Email!		7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
Re: NFS lockup after UDP fragments getting lost (was: 8.1 tstile lockup after nfs send error 51)
At 19:14 Uhr +0200 31.07.2019, m...@netbsd.org wrote:
>It's worth reminding that -o tcp is an option.

Not for NFS through a (stateful) filtering router, no.  Reboot the
router, and you will have to walk up to every client and reboot it.
With NFS over UDP, the clients will recover.

Cheerio,
hauke
--
"It's never straight up and down"     (DEVO)
Re: NFS lockup after UDP fragments getting lost (was: 8.1 tstile lockup after nfs send error 51)
At 10:45 Uhr +0200 31.07.2019, Edgar Fuß wrote:
>Thanks to riastradh@, this turned out to be caused by an (UDP, hard) NFS
>mount combined with a misconfigured IPFilter that blocked all but the
>first fragment of a fragmented NFS reply (e.g., readdir), combined with a
>NetBSD design error (or so Taylor says) that a vnode lock may be held
>across I/O, in this case, network I/O.

I ran into a similar issue 2004ish, connecting Red Hat Linux clients to a
(NetBSD) nfs (udp) server through a (NetBSD, ipfilter) filtering router.
Darren back then told me Linux sends fragmented packets tail-first, which
ipfilter was not prepared to deal with.

I switched to pf, which was able to deal with the scenario just fine, and
didn't look back.

Cheerio,
hauke
--
"It's never straight up and down"     (DEVO)
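For readers who hit the same thing: pf avoids the out-of-order-fragment
problem by reassembling fragments before filtering, so the rules always see
complete packets.  A minimal pf.conf sketch for a filtering router (the
addresses and the macro names are placeholders, and the scrub syntax is the
older form NetBSD's pf uses; check pf.conf(5) on your release):

```shell
# /etc/pf.conf on the filtering router (illustrative addresses)
clients = "10.0.0.0/24"
server  = "10.0.1.5"

# Reassemble inbound fragments before the rules run, so a tail-first
# fragment train is filtered as one complete packet.
scrub in all fragment reassemble

# Let NFS traffic (portmapper + nfsd) through to the server.
pass in proto udp from $clients to $server port { 111, 2049 }
```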
Re: NFS lockup after UDP fragments getting lost (was: 8.1 tstile lockup after nfs send error 51)
On Wed, Jul 31, 2019 at 07:11:54AM -0700, Jason Thorpe wrote:
> > On Jul 31, 2019, at 1:45 AM, Edgar Fuß wrote:
> >
> > NetBSD design error (or so Taylor says) that a vnode lock may be held
> > across I/O
>
> 100%
>
> NetBSD's VFS locking protocol needs a serious overhaul.  At least one other
> BSD-family VFS (the one in XNU) completely eliminated locking of vnodes at
> the VFS layer (it's all pushed into the file system back-ends, which now have
> more control over their own locking requirements).  It does have some
> additional complexities around reference / busy counting and vnode identity,
> but it works very well in practice.
>
> I don't know what FreeBSD has done in this area.
>
> -- thorpej

IMNT_MPSAFE, which NFS isn't?
Re: NFS lockup after UDP fragments getting lost (was: 8.1 tstile lockup after nfs send error 51)
On Wed, Jul 31, 2019 at 11:42:26AM -0500, Don Lee wrote:
> If you go back a few years, you can find a thread where I reported tstile
> lockups on PPC. I don’t remember the details, but it was back in 6.1 as I
> recall. This is not a new problem, and not limited to NFS. I still have a
> similar problem with my 7.2 system, usually triggered when I do backups
> (dump/restore). The dump operation locks up and cannot be killed. The system
> continues, except any process that trips over the tstile also locks up.
> Eventually, the system grinds to a complete halt. (can’t even log in) If I
> catch it before that point, I can almost reboot, but I have to power cycle to
> kill the tstile process(es), or the reboot also hangs.

It's worth reminding that -o tcp is an option.
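For the record, switching a mount to TCP looks something like this (server
name, export path, and mount point are placeholders; see mount_nfs(8) for
the options your release accepts):

```shell
# One-off mount over TCP instead of the historic UDP default:
mount_nfs -o tcp fileserver:/export/home /mnt/home

# Or persistently, as an /etc/fstab entry:
# fileserver:/export/home  /mnt/home  nfs  rw,tcp  0 0
```

TCP's own retransmission and stream reassembly mean a lost fragment can't
wedge a request, at the cost of the stateful-router caveat raised later in
this thread.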
Re: NFS lockup after UDP fragments getting lost (was: 8.1 tstile lockup after nfs send error 51)
If you go back a few years, you can find a thread where I reported tstile
lockups on PPC. I don’t remember the details, but it was back in 6.1 as I
recall. This is not a new problem, and not limited to NFS. I still have a
similar problem with my 7.2 system, usually triggered when I do backups
(dump/restore). The dump operation locks up and cannot be killed. The system
continues, except any process that trips over the tstile also locks up.
Eventually, the system grinds to a complete halt. (can’t even log in) If I
catch it before that point, I can almost reboot, but I have to power cycle to
kill the tstile process(es), or the reboot also hangs.

-dgl-

> On Jul 31, 2019, at 9:11 AM, Jason Thorpe wrote:
>
>> On Jul 31, 2019, at 1:45 AM, Edgar Fuß wrote:
>>
>> NetBSD design error (or so Taylor says) that a vnode lock may be held
>> across I/O
>
> 100%
>
> NetBSD's VFS locking protocol needs a serious overhaul.  At least one other
> BSD-family VFS (the one in XNU) completely eliminated locking of vnodes at
> the VFS layer (it's all pushed into the file system back-ends, which now have
> more control over their own locking requirements).  It does have some
> additional complexities around reference / busy counting and vnode identity,
> but it works very well in practice.
>
> I don't know what FreeBSD has done in this area.
>
> -- thorpej
Re: NFS lockup after UDP fragments getting lost (was: 8.1 tstile lockup after nfs send error 51)
> On Jul 31, 2019, at 1:45 AM, Edgar Fuß wrote:
>
> NetBSD design error (or so Taylor says) that a vnode lock may be held
> across I/O

100%

NetBSD's VFS locking protocol needs a serious overhaul.  At least one other
BSD-family VFS (the one in XNU) completely eliminated locking of vnodes at
the VFS layer (it's all pushed into the file system back-ends, which now have
more control over their own locking requirements).  It does have some
additional complexities around reference / busy counting and vnode identity,
but it works very well in practice.

I don't know what FreeBSD has done in this area.

-- thorpej
NFS lockup after UDP fragments getting lost (was: 8.1 tstile lockup after nfs send error 51)
Thanks to riastradh@, this turned out to be caused by an (UDP, hard) NFS
mount combined with a misconfigured IPFilter that blocked all but the first
fragment of a fragmented NFS reply (e.g., readdir), combined with a NetBSD
design error (or so Taylor says) that a vnode lock may be held across I/O,
in this case, network I/O.

It should be reproducible with a default NFS mount and a "block in all with
frag-body" IPFilter rule, and then trying to readdir.

Now, in some cases, the machine in question recovered after fixing the
filter rules; in others, it didn't, forcing a reboot.  This strikes me as a
bug, because the same lock-up could just as well have been caused by network
problems instead of IPFilter misconfiguration.  It looks like the operation
whose reply was lost sometimes doesn't get retried.  Do we have some weird
bug where the arrival of the first fragment stops the timeout, but the
blocking of the remaining fragments causes it to wedge?
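For anyone trying to reproduce this, a sketch of the setup (server name,
export, and mount point are placeholders; requires root on both the client
and the filtering host, and I haven't re-verified the exact option spelling
against every release):

```shell
# On the filtering host: pass first fragments, drop the rest.
# /etc/ipf.conf -- "with frag-body" matches non-first fragments,
# which is exactly the misconfiguration described above.
block in all with frag-body

# On the client: a default (UDP, hard) NFS mount.
mount_nfs -o udp fileserver:/export /mnt

# Trigger a reply large enough to fragment, e.g. readdir of a
# directory whose listing exceeds one MTU:
ls /mnt/some-large-directory
```

With the rule in place the first fragment of the readdir reply arrives and
the rest are dropped, which should wedge the client as described.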