We used to have a similar problem on NFSv3 with FAI 4.1.1 (roughly once in several hundred installs).

I've appended to this email what I've currently patched to address this
issue (with FAI 4.1.1 as base). I suspected a dangling process or mount,
which is why I added a lot of debugging code: I wanted to determine
whether the NFS client hung at the umount -a or at the reboot itself.

I thought the only significant thing this fixed was removing the -i from
the reboot command, which I reported upstream. However, it could very
well be that my debugging changes actually contributed to the fix, as
the -i flag was removed in FAI 4.2 and you're still experiencing a hang
(assuming it's the same problem).

-Sander

--- a/lib/subroutines
+++ b/lib/subroutines
@@ -527,7 +527,7 @@
     cd /
     sync
-    killall -q sshd udevd
+    killall -qw sshd udevd rsyslogd
     cdromdevice=$(awk '/ name:/ {print $3}' /proc/sys/dev/cdrom/info)
     # Verify whether the installation is from a fai-cd image, and whether it's actually mounted (instead of NFS mounted, for instance)
@@ -565,14 +565,26 @@
         # never reached, because chroot will reboot the machine
         die "Internal error when calling /tmp/rebootCD." >&2
     fi
-    umount $FAI_ROOT/proc
-    umount -arf 2>/dev/null
+
+    # Dump state of fai-client so we can reverse-engineer what goes wrong after a NFS hang
+    pstree -Apl > $FAI_ROOT/var/log/fai/pstree.log
+    lsof -n > $FAI_ROOT/var/log/fai/lsof.log
+
+    for dir in $(mount | grep $FAI_ROOT | awk '{print $3}' | LC_ALL=C sort -r); do
+        umount $dir
+    done
+
+    if mount | grep -q $FAI_ROOT; then
+        echo "dangling mounts:"
+        mount | grep $FAI_ROOT
+        sleep 10
+    fi
     # reboot or halt?
     if [ "$flag_halt" -gt "0" ]; then
-        exec halt -dfip;
+        exec halt -dfip
     else
-        exec reboot -dfi;
+        exec reboot -df
     fi
 }

On Tue, Jun 10, 2014 at 2:41 AM, Peter Keller <psil...@cs.wisc.edu> wrote:
> Hello,
>
> I have a question:
>
> Sometimes, maybe 2% of the time, when FAI 4.2 finishes installing and is
> shutting down to reboot, I get into a state where messages are logged to
> the screen about NFS not responding, and then ok, and then not responding,
> and then ok, and so on. They repeat every 5 minutes or so. The machine
> stays in this state and never actually reboots causing a manual interrupt
> in the automated install. The NFS server, AFAICT, was ok the whole time.
> The faiserver is a wheezy machine and I'm not using nfs 4.
>
> Has anyone ever seen this before?
>
> Thank you.
>
> -pete
>

-- 
Sander Brandenburg
Director of Technology
eSATURNUS
T. +32 16 40 12 82
www.esaturnus.com