On Wed, Jul 6, 2011 at 4:50 AM, Martin Birgmeier <la5lb...@aon.at> wrote:
> Hi Artem,
>
> I have exactly the same problem as you are describing below, also with quite
> a number of amd mounts.
>
> In addition to the scenario you describe, another way this happens here
> is when downloading a file via firefox to a directory currently open in
> dolphin (KDE file manager). This will almost surely trigger the symptoms
> you describe.
>
> I've had 7.4 running on the box before, now with 8.2 this has started to
> happen.
>
> Alas, I don't have a solution.
I may be on to something. Here's what seems to be happening in my case:

* A process that's in the middle of a syscall accessing an amd mountpoint
gets interrupted.
* If the syscall was restartable, the msleep at the beginning of the
get_reply: loop in clnt_dg_call() would return ERESTART.
* ERESTART will result in clnt_dg_call() returning with RPC_CANTRECV.
* clnt_reconnect_call() will then try to reconnect, the msleep will
eventually fail with ERESTART in clnt_dg_call() again, and the whole thing
keeps repeating for a while.

I'm not familiar enough with the RPC code, but looking at clnt_vc.c and
other RPC code, it appears that both EINTR and ERESTART should translate
into an RPC_INTR error. However, in clnt_dg.c that's not the case, and
that's what seems to make accesses to amd mounts hang.

The following patch (against RELENG-8 @ r225118) seems to have fixed the
issue for me. With the patch I no longer see the hangs or ICMP storms in
the test case that could reliably reproduce the issue within minutes. Let
me know if it helps in your case.

--- a/sys/rpc/clnt_dg.c
+++ b/sys/rpc/clnt_dg.c
@@ -636,7 +636,7 @@ get_reply:
 		 */
 		if (error != EWOULDBLOCK) {
 			errp->re_errno = error;
-			if (error == EINTR)
+			if (error == EINTR || error == ERESTART)
 				errp->re_status = stat = RPC_INTR;
 			else
 				errp->re_status = stat = RPC_CANTRECV;

(A rough userland sketch of the restartable-interrupt part of this scenario
is appended at the end of this message, after the quoted text.)

--Artem

>
> We should probably file a PR, but I don't even know where to assign it to.
> Amd does not seem much maintained, it's probably using some old-style
> mounts (it never mounts anything via IPv6, for example).
>
> Regards,
>
> Martin
>
>> Hi,
>>
>> I wonder if someone else ran into this issue before and, maybe, has a
>> solution.
>>
>> I've been running into a problem where access to filesystems mounted
>> with amd wedges processes in an unkillable state and produces an ICMP
>> storm on the loopback interface. I've managed to narrow it down to the
>> NFS reconnect, but that's when I ran out of ideas.
>>
>> Usually the problem happens when I abort a parallel build job in an
>> i386 jail on FreeBSD-8/amd64 (r223055). When the build job is killed,
>> now and then I end up with one process consuming 100% of CPU time on
>> one of the cores. At the same time I get a lot of messages on the
>> console saying "Limiting icmp unreach response from 49837 to 200
>> packets/sec" and the loopback traffic goes way up.
>>
>> As far as I can tell here's what's happening:
>>
>> * My setup uses a lot of filesystems mounted by amd.
>> * amd itself pretends to be an NFS server running on the localhost and
>> serving requests for amd mounts.
>> * Now and then amd seems to change the ports it uses. Beats me why.
>> * The problem seems to happen when some process is about to access an
>> amd mountpoint just as the amd instance disappears from the port it used
>> to listen on. In my case it does correlate with interrupted builds, but
>> I have no clue why.
>> * The NFS client detects the disconnect and tries to reconnect using
>> the same destination port.
>> * That generates an ICMP response, as the port is unreachable, and the
>> reconnect call returns almost immediately.
>> * We try to reconnect again, and again, and again....
>> * The process in this state is unkillable.
>>
>> Here's what the stack of the 'stuck' process looks like in those rare
>> moments when it gets to sleep:
>> 18779 100511 collect2 - mi_switch+0x176
>> turnstile_wait+0x1cb _mtx_lock_sleep+0xe1 sleepq_catch_signals+0x386
>> sleepq_timedwait_sig+0x19 _sleep+0x1b1 clnt_dg_call+0x7e6
>> clnt_reconnect_call+0x12e nfs_request+0x212 nfs_getattr+0x2e4
>> VOP_GETATTR_APV+0x44 nfs_bioread+0x42a VOP_READLINK_APV+0x4a
>> namei+0x4f9 kern_statat_vnhook+0x92 kern_statat+0x15
>> freebsd32_stat+0x2e syscallenter+0x23d
>>
>> * Usually some timeout expires in a few minutes, the process dies, the
>> ICMP storm stops, and the system is usable again.
>> * On occasion the process is stuck forever and I have to reboot the box.
>>
>> I'm not sure who's to blame here.
>>
>> Is the automounter at fault for disappearing from the port it was
>> supposed to listen on?
>> Is NFS guilty of trying blindly to reconnect on the same port and not
>> giving up sooner?
>> Should I flog the operator (AKA myself) for misconfiguring something
>> (what?) in amd or NFS?
>>
>> More importantly -- how do I fix it?
>> Any suggestions on fixing/debugging this issue?
>>
>> --Artem
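
P.S. For reference, here is a minimal userland sketch of just the
restartable-interrupt half of the scenario (it is not the reliable test
case mentioned above, and /net/somehost/export is only a placeholder --
point it at a real amd mountpoint). A handler installed with SA_RESTART
makes an interrupted in-kernel sleep return ERESTART instead of EINTR,
which is exactly the error code clnt_dg_call() used to map to RPC_CANTRECV:

/*
 * Sketch only: keep stat()'ing a path under an amd mountpoint while a
 * periodic SIGALRM, whose handler is installed with SA_RESTART, keeps
 * interrupting the sleeping syscall, so the in-kernel msleep returns
 * ERESTART rather than EINTR.  AMD_PATH is a placeholder.
 */
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>

#include <err.h>
#include <signal.h>
#include <string.h>

#define	AMD_PATH	"/net/somehost/export"	/* placeholder amd path */

static void
alarm_handler(int sig)
{
	/* Empty on purpose; the signal only has to interrupt the sleep. */
	(void)sig;
}

int
main(void)
{
	struct sigaction sa;
	struct itimerval itv;
	struct stat sb;

	/* SA_RESTART: interrupted sleeps come back as ERESTART, not EINTR. */
	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = alarm_handler;
	sa.sa_flags = SA_RESTART;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGALRM, &sa, NULL) == -1)
		err(1, "sigaction");

	/* Deliver SIGALRM every 10 ms while we keep poking the mountpoint. */
	itv.it_interval.tv_sec = 0;
	itv.it_interval.tv_usec = 10000;
	itv.it_value = itv.it_interval;
	if (setitimer(ITIMER_REAL, &itv, NULL) == -1)
		err(1, "setitimer");

	for (;;) {
		if (stat(AMD_PATH, &sb) == -1)
			warn("stat");
	}
	/* NOTREACHED */
}

Whether stat() actually ends up sleeping inside clnt_dg_call() depends on
amd being slow to answer (or having moved off its port), so this only
exercises the interruption side of the problem, not the reconnect loop.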