Hi everyone - So my two threads really turned out to be one thread, and I'm replying to both here. I apologize for the mess.
First, I really appreciate all the responses and pointers; I'm very grateful for all the help. THANK YOU to everyone who responded on either thread!

To recap, the situation is essentially this:

1. I've long run a number of high-traffic 42.3 Xen PV guests on 42.3 hosts.
2. I upgraded the 42.3 hosts to 15.1, first via online upgrade and then via a fresh load.
3. After that, *one* of my 42.3 guests started hanging at random every 2-7 days. The hangs seem to be related to high network and/or disk traffic.

I discovered by accident that an "xl trigger nmi" from the host will "unhang" the guest and let it resume duty, more or less, without a reboot, but I have no idea why the hang occurs or why it's recoverable that way. (A rough sketch of the sequence I use is below.)

Chasing this down has been painful. It initially looked like sshd was the culprit, but it wasn't. I thought the host/guest kernel mismatch might be the issue, but my other 42.3 guests run on their 15.1 hosts without a problem, and upgrading the guest to 15.1 didn't solve it. Olaf has been pointing me to new kernels, and that helped somewhat: moving to the SLE kernel extended the guest's uptime from a few days to a few weeks (buying me much-needed sleep, thank you!), but I still don't have a solution.

The problem seems to be in this particular guest... somewhere... and it travels with the guest: if I clone the guest and bring the clone up elsewhere, the clone has the same problem. So I've resorted to making a copy of the guest and staging it on a different host just so I can stress-test it.

To stress-test it, I initiate lots of high-traffic requests against the troubled guest from an outside machine. Originally the guest was hanging during a single full outbound rsync; to prevent SSD wear I modified the command and used a hack to simulate the same traffic. If I boot the guest and, from a different connected machine, run things like:

  nohup ssh 192.168.1.11 tar cf - --one-file-system /a | cat > /dev/null &
  nohup ssh 192.168.1.11 cat /dev/zero | cat > /dev/null &

(where 1.11 is the troubled guest and /a is a 4TB filesystem full of data), I can make the troubled guest hang in somewhere between 45 minutes and 12 hours. (A slightly tidier version of this driver is also sketched below.)

Thanks to Olaf, Jan, Tony and Fajar, I've been able to try a number of things, but so far with no luck:

- Upgrading openssh to the latest version did not solve it.
- Upgrading the guest to 15.1 (unifying the host and guest kernels) did not solve it.
- Upgrading the 42.3 guest kernel to a different version helped... but did not solve it.
- Removing some possibly problematic kernel modules did not solve it.
- Removing Docker did not solve it.
- I had optimizations from the Xen best practices page in /etc/sysctl.conf for just this guest; removing those did not solve it.

The only solution I've found is starting the guest over fresh. If I do a fresh load of 15.1 as a guest, mount that same /a filesystem, and run those same tests, the freshly-loaded guest works fine... it's rock solid. I had the tests running against it for over 24 hours and it did just fine. I can literally swap out root filesystem images: booting the troubled guest's root filesystem results in the hangs; booting the fresh load seems completely reliable.

In short, it now seems to me that there's something in this particular guest's root filesystem image... something I can't find... that is causing this. The image started as a 13.1 (thirteen point one) fresh load years ago and has been in-place upgraded ever since.
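For anyone who runs into something similar, here's roughly the recover-and-look-around sequence I use from the host when the guest wedges. The domain name is whatever "xl list" shows for your guest, and the tail plumbing is just illustrative:

  # find the name/ID of the wedged domain
  xl list
  # send an NMI to the guest -- this is what "unhangs" it for me
  xl trigger <domain> nmi
  # then look for clues in the hypervisor log...
  xl dmesg | tail -n 50
  # ...and on the guest console once it's responsive again
  xl console <domain>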
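And here's a slightly more structured version of the stress driver, run from the outside machine. The loop wrapper is a convenience for this write-up rather than exactly what I ran, 192.168.1.11 and /a are as above, and I've dropped the redundant "| cat" from the originals:

  #!/bin/bash
  # hammer the troubled guest with sustained outbound traffic
  GUEST=192.168.1.11
  # endless stream of zeros over ssh (network + cpu load)
  nohup ssh "$GUEST" cat /dev/zero > /dev/null 2>&1 &
  # walk the 4TB filesystem repeatedly as a tar stream (disk + network
  # load), so the pressure doesn't stop when one pass over /a finishes
  while true; do
      ssh "$GUEST" tar cf - --one-file-system /a > /dev/null 2>&1
  done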
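For completeness: if I (or anyone) ever want to hunt for the culprit directly instead of by slow test cycles, diffing the two images seems like the cheapest starting point. A minimal sketch, with hypothetical hostnames "troubled" and "fresh" standing in for the two guests:

  # compare the installed package sets
  ssh troubled 'rpm -qa --qf "%{NAME}\n" | sort -u' > troubled.pkgs
  ssh fresh    'rpm -qa --qf "%{NAME}\n" | sort -u' > fresh.pkgs
  diff -u fresh.pkgs troubled.pkgs
  # verify installed files against the rpm database on the troubled guest
  ssh troubled rpm -Va > troubled.rpmva
  # pull /etc from both guests and diff them
  mkdir -p /tmp/etc-troubled /tmp/etc-fresh
  ssh troubled 'tar cf - -C / etc' | tar xf - -C /tmp/etc-troubled
  ssh fresh    'tar cf - -C / etc' | tar xf - -C /tmp/etc-fresh
  diff -r /tmp/etc-fresh /tmp/etc-troubled | less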
So I'm concluding that something bad has been brought forward that I'm not aware of. At this point it seems I just need to rebuild the guest as a fresh 15.1 load, reinstall only what I currently need, and go from there, so that's what I'm going to do. The usual absence of useful log data when the machine hangs is frustrating, and the 1-12 hours it takes to make a test machine hang makes the test process slow, so I'm feeling like I should just abandon this and replace the guest.

If any of this triggers anything for anyone, please let me know. Otherwise, I'm continuing to stress-test my freshly-loaded guest for a few more days (just to be sure), and then I'll start the reconnect and replacement process. It really would have been nice to find out what on that troubled guest was causing the issue, but it's probably some legacy thing brought forward that's causing instability, and since each test cycle takes so long, the process of elimination could take months or more (the kind of image diff sketched above is probably where I'd start). And of course, as soon as I send this, something new will break, making this all invalid. :-)

Anyway, THANK YOU ALL for your support and help here. I am very grateful!

Glen