Hi everyone,
I'd like to start a discussion about what people with lots of HP servers
and RHEL5 do to investigate crashes and take crashdumps for analysis. I
guess I'm not the only one in this boat, and perhaps there are others out
there with good practices that they'd be willing to share.
We have just about everything Intel/AMD from the G1s to the G7s (DL360 to
DL580 systems), thousands of them, and we're having a hard time getting
reliable crashdumps (the Solaris guys mock us because -they- always get a
crashdump when there's a panic of some kind). We -do- have kdump properly
configured, of course, so at least that is not an issue. The issue is
having RHEL detect a hang and reliably take a dump.
Over the years, we've run into spurious NMIs, cciss bugs and a lot of
other issues. As a result we're running with most of the sysctl panic
settings disabled:
# sysctl -a|grep panic
vm.panic_on_oom = 0
kernel.hung_task_panic = 0
kernel.softlockup_panic = 0
kernel.panic_on_unrecovered_nmi = 0
kernel.unknown_nmi_panic = 0
kernel.panic_on_oops = 1
kernel.panic = 0
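For anyone who wants to compare their own settings against the list above,
here is a minimal sketch that prints only the panic-related sysctls that
are currently set to 0. The function name disabled_panic_knobs is just
something I made up for illustration, and it assumes the "key = value"
format that sysctl -a prints above:

```shell
# List the panic-related sysctls that are disabled (value 0).
# Reads "key = value" lines, as printed by `sysctl -a`, on stdin.
# disabled_panic_knobs is a hypothetical helper name, not a standard tool.
disabled_panic_knobs() {
    # Split each line on " = ", keep keys mentioning "panic" whose value is 0.
    awk -F' *= *' '/panic/ && $2 == "0" { print $1 }'
}

# Usage (as root, so that all keys are readable):
#   sysctl -a 2>/dev/null | disabled_panic_knobs
```

On the settings shown above this should list everything except
kernel.panic_on_oops.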
In RHEL6, Red Hat officially introduced kmod-hpwdt (HP Watchdog timer)
that interacts with an iLO2 or iLO3 remote controller to initiate a crash
when the timer expires. As per the developers' recommendation, and because
I thought it would be a 'nice-to-have' (tm), I've backported that module
to RHEL5 and filed an RFE asking Red Hat to officially backport it to
RHEL5. My rpms are here:
http://vscojot.free.fr/dist/kmod-hpwdt/hpwdt-1.2.0/RHEL5/SRPMS/hpwdt-1.2.0-2.el5_4.src.rpm
http://vscojot.free.fr/dist/kmod-hpwdt/hpwdt-1.2.0/RHEL5/i386
http://vscojot.free.fr/dist/kmod-hpwdt/hpwdt-1.2.0/RHEL5/i386/kmod-hpwdt-1.2.0-2.el5_4.i686.rpm
http://vscojot.free.fr/dist/kmod-hpwdt/hpwdt-1.2.0/RHEL5/i386/kmod-hpwdt-PAE-1.2.0-2.el5_4.i686.rpm
http://vscojot.free.fr/dist/kmod-hpwdt/hpwdt-1.2.0/RHEL5/i386/kmod-hpwdt-xen-1.2.0-2.el5_4.i686.rpm
http://vscojot.free.fr/dist/kmod-hpwdt/hpwdt-1.2.0/RHEL5/x86_64
http://vscojot.free.fr/dist/kmod-hpwdt/hpwdt-1.2.0/RHEL5/x86_64/kmod-hpwdt-1.2.0-2.el5_4.x86_64.rpm
http://vscojot.free.fr/dist/kmod-hpwdt/hpwdt-1.2.0/RHEL5/x86_64/kmod-hpwdt-xen-1.2.0-2.el5_4.x86_64.rpm
I'm just wondering what other people here are doing. Do you trust the NMI
panic stuff or do you use the HP hpwdt-1.1.3 rpms (they require you to
have a compiler on your system)? Do you panic on OOM?
Any recommendations, good or bad?
Best regards,
Vincent
_______________________________________________
rhelv5-list mailing list
https://www.redhat.com/mailman/listinfo/rhelv5-list