Hi everyone -

So my two threads really turn out to be one thread, and I'm replying
to both here.  I apologize for the mess.

First, I really appreciate all the responses and pointers, I'm very
grateful for all the help.  THANK YOU to all those who responded on
either thread!

To recap, essentially what I have is:

1. I've always run a bunch of high-traffic 42.3 Xen PV guests on 42.3 hosts.
2. I upgraded the 42.3 hosts to 15.1, via online upgrade and then via
fresh load.
3. When I did that, *one* of my 42.3 guests started hanging at random
every 2-7 days.

The hangs seemed to be related to high network and/or disk traffic. I
discovered by accident that if I did an "xl trigger nmi" I could
"unhang" the guest and make it resume duty, more or less, without a
reboot, but I have no idea why the hang occurs or why it's recoverable
in that way.
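
For anyone who wants to reproduce the workaround, here's roughly what
the recovery looks like from dom0 - just a crude watchdog sketch,
where "guest1" and 192.168.1.11 are placeholders for my domain name
and guest address:

while sleep 60; do
    # send an NMI only when the guest stops answering pings
    ping -c 3 -W 5 192.168.1.11 > /dev/null || xl trigger guest1 nmi
done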

Chasing this down has been painful.  It initially looked like sshd was
the culprit, but it wasn't.  I thought the kernel mismatch might be
the issue, but other 42.3 guests run on their 15.1 hosts without a
problem, and upgrading the guest to 15.1 didn't solve it.  Olaf has
been pointing me to new kernels, and that helped somewhat - moving to
the SLE kernel extended the guest uptime from a few days to a few
weeks (buying me much-needed sleep, thank you!), but I still don't
have a solution.

The problem seems to be in this particular guest... somewhere, and it
travels with the guest: if I clone the guest and bring it up
elsewhere, that clone also has the problem.  So I've resorted to
making a copy of the guest and staging it on a different host just so
I can stress-test it.

To stress-test it, I basically initiate lots of high-traffic requests
against the troubled guest from an outside source.  Initially, the
guest was hanging during a single full outbound rsync.  To prevent SSD
wear, I replaced that with a hack that simulates the same traffic.
If I boot the guest and then, from a different connected machine, do
stuff like:

nohup ssh 192.168.1.11 tar cf - --one-file-system /a | cat > /dev/null &
nohup ssh 192.168.1.11 cat /dev/zero | cat > /dev/null &

(where 1.11 is the troubled guest, and /a is a 4TB filesystem full of
data), I can make the troubled guest hang anywhere from 45 minutes to
12 hours later.
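
To get a rough handle on how long each run survives, I also keep
something like this running on the outside machine - just a ping loop
that stamps the time when the guest stops answering (1.11 again being
the troubled guest):

while ping -c 1 -W 5 192.168.1.11 > /dev/null; do
    sleep 60
done
echo "guest stopped responding at $(date)"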

Thanks to Olaf, Jan, Tony and Fajar, I've been able to try a number of
things, but so far, I've had no luck:

- Upgrading openssh to the latest version did not solve it.
- Upgrading the guest to 15.1 (unifying the kernels) did not solve it.
- Upgrading the 42.3 guest kernel to a different version helped... but
  did not solve it.
- Removing some possibly problematic kernel modules did not solve it.
- Removing Docker did not solve it.
- Removing the optimizations from the Xen best practices page that I
  had in /etc/sysctl.conf for just this guest did not solve it.

The only solution I've found seems to be starting the guest over
fresh.  If I do a fresh load of 15.1 as a guest, mount that same /a
filesystem, and run those same tests... the freshly-loaded guest works
fine... it's rock solid.  I had those same tests running against the
freshly-loaded guest for over 24 hours and it did just fine.  I can
literally just swap out root filesystem images: booting the troubled
guest's root filesystem results in the hangs, while booting the fresh
load seems completely reliable.
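
For the swap itself, the only thing I change is the disk line in the
guest's xl config - the image paths below are just placeholders for my
layout, with the same /a data disk attached either way:

# troubled root image (hangs under load):
disk = [ 'file:/var/lib/xen/images/troubled-root.img,xvda,w',
         'file:/var/lib/xen/images/a-data.img,xvdb,w' ]

# fresh 15.1 root image (rock solid), same data disk:
disk = [ 'file:/var/lib/xen/images/fresh151-root.img,xvda,w',
         'file:/var/lib/xen/images/a-data.img,xvdb,w' ]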

In short, it seems to me now that there's something in this particular
guest's root filesystem image... something I can't find... that is
causing this.  The image started as a 13.1 (thirteen point one) fresh
load years ago and has been in-place upgraded ever since, so I'm
concluding that something bad has been carried forward that I'm not
aware of.

It seems at this point that I just need to rebuild the guest as a
fresh 15.1 load, reinstall only what I currently need, and go from
there - so that's what I'm going to do.  The usual absence of useful
log data when the machine crashes is frustrating, and the 1-12 hours
it takes to make a test machine crash makes the test process slow, so
I'm feeling like I should just abandon this and replace the guest.

If any of this triggers anything for anyone, please let me know.
Otherwise, I'm continuing to stress-test my freshly-loaded guest for a
few more days (just to be sure), and then I'll start the reconnect and
replacement process.   It really would have been nice to find out what
on that troubled guest was causing the issue, but it's probably some
legacy thing brought forward that's causing instability, and since
each test cycle takes so long, the "process of elimination" could take
months or more.

And of course as soon as I send this, something new will break, making
this all invalid.  :-)

Anyway THANK YOU ALL for your support and help here, I am very grateful!

Glen