On Mon, Aug 30, 2010 at 7:54 AM, Andy Cress <andy.cr...@us.kontron.com> wrote:
Thanks very much, Andy, for taking the time for such a detailed response!
That sure helps!

> Yes, that is a key function that all IPMI BMCs are supposed to provide.
> The BMC is generally not affected by what the OS does, unless there are
> IPMI-aware applications running in the OS, specifically talking to the
> BMC.

Nothing that I am aware of other than ipmitool. None of the
vendor-specific GUIs etc.

> 1) IPMI LAN configuration. Make sure that the IPMI LAN was properly
> configured. It sounds like you may have tested this beforehand. Even
> something like the ARP configuration could cause the port to no longer
> be visible to the router.

Yes, I had tested extensively prior to the failure. This is an HPC cluster
with about 300 identical servers, and the other servers in the group are
still responding perfectly. All are on their own dedicated IP subnet,
although the IPMI physical network is the same as the normal 1 GigE
Ethernet network, i.e. IPMI traffic is piggybacking on the same Ethernet
adapter port.

> 3) Some OS-resident (custom?) IPMI-aware application that may be causing
> trouble/stress/configuration problems with the BMC.

Nothing that I can imagine. I'm using CentOS and fairly standard Linux
tools.

> On a healthy
> system, the 'ps -ef' output on the target should show any ipmi-related
> processes that are running.

I don't see any suspicious processes (on a sister node that hasn't
crashed), but here's the ps -ef output in case anything out of place is
evident to you.

[r...@eu001 ~]# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 Aug29 ?        00:00:01 init [3]
root         2     1  0 Aug29 ?        00:00:00 [migration/0]
root         3     1  0 Aug29 ?        00:00:00 [ksoftirqd/0]
root         4     1  0 Aug29 ?        00:00:00 [watchdog/0]
root         5     1  0 Aug29 ?        00:00:00 [migration/1]
root         6     1  0 Aug29 ?        00:00:00 [ksoftirqd/1]
root         7     1  0 Aug29 ?        00:00:00 [watchdog/1]
root         8     1  0 Aug29 ?        00:00:00 [migration/2]
root         9     1  0 Aug29 ?        00:00:00 [ksoftirqd/2]
root        10     1  0 Aug29 ?        00:00:00 [watchdog/2]
root        11     1  0 Aug29 ?        00:00:00 [migration/3]
root        12     1  0 Aug29 ?        00:00:00 [ksoftirqd/3]
root        13     1  0 Aug29 ?        00:00:00 [watchdog/3]
root        14     1  0 Aug29 ?        00:00:00 [migration/4]
root        15     1  0 Aug29 ?        00:00:00 [ksoftirqd/4]
root        16     1  0 Aug29 ?        00:00:00 [watchdog/4]
root        17     1  0 Aug29 ?        00:00:00 [migration/5]
root        18     1  0 Aug29 ?        00:00:00 [ksoftirqd/5]
root        19     1  0 Aug29 ?        00:00:00 [watchdog/5]
root        20     1  0 Aug29 ?        00:00:00 [migration/6]
root        21     1  0 Aug29 ?        00:00:00 [ksoftirqd/6]
root        22     1  0 Aug29 ?        00:00:00 [watchdog/6]
root        23     1  0 Aug29 ?        00:00:00 [migration/7]
root        24     1  0 Aug29 ?        00:00:00 [ksoftirqd/7]
root        25     1  0 Aug29 ?        00:00:00 [watchdog/7]
root        26     1  0 Aug29 ?        00:00:00 [events/0]
root        27     1  0 Aug29 ?        00:00:00 [events/1]
root        28     1  0 Aug29 ?        00:00:00 [events/2]
root        29     1  0 Aug29 ?        00:00:00 [events/3]
root        30     1  0 Aug29 ?        00:00:00 [events/4]
root        31     1  0 Aug29 ?        00:00:00 [events/5]
root        32     1  0 Aug29 ?        00:00:00 [events/6]
root        33     1  0 Aug29 ?        00:00:00 [events/7]
root        34     1  0 Aug29 ?        00:00:00 [khelper]
root       169     1  0 Aug29 ?        00:00:00 [kthread]
root       181   169  0 Aug29 ?        00:00:00 [kblockd/0]
root       182   169  0 Aug29 ?        00:00:00 [kblockd/1]
root       183   169  0 Aug29 ?        00:00:00 [kblockd/2]
root       184   169  0 Aug29 ?        00:00:00 [kblockd/3]
root       185   169  0 Aug29 ?        00:00:00 [kblockd/4]
root       186   169  0 Aug29 ?        00:00:00 [kblockd/5]
root       187   169  0 Aug29 ?        00:00:00 [kblockd/6]
root       188   169  0 Aug29 ?        00:00:00 [kblockd/7]
root       189   169  0 Aug29 ?        00:00:00 [kacpid]
root       302   169  0 Aug29 ?        00:00:00 [cqueue/0]
root       303   169  0 Aug29 ?        00:00:00 [cqueue/1]
root       304   169  0 Aug29 ?        00:00:00 [cqueue/2]
root       305   169  0 Aug29 ?        00:00:00 [cqueue/3]
root       306   169  0 Aug29 ?        00:00:00 [cqueue/4]
root       307   169  0 Aug29 ?        00:00:00 [cqueue/5]
root       308   169  0 Aug29 ?        00:00:00 [cqueue/6]
root       309   169  0 Aug29 ?        00:00:00 [cqueue/7]
root       312   169  0 Aug29 ?        00:00:00 [khubd]
root       314   169  0 Aug29 ?        00:00:00 [kseriod]
root       437   169  0 Aug29 ?        00:00:00 [pdflush]
root       438   169  0 Aug29 ?        00:00:30 [pdflush]
root       439   169  0 Aug29 ?        00:00:00 [kswapd0]
root       440   169  0 Aug29 ?        00:00:00 [kswapd1]
root       441   169  0 Aug29 ?        00:00:00 [aio/0]
root       442   169  0 Aug29 ?        00:00:00 [aio/1]
root       443   169  0 Aug29 ?        00:00:00 [aio/2]
root       444   169  0 Aug29 ?        00:00:00 [aio/3]
root       445   169  0 Aug29 ?        00:00:00 [aio/4]
root       446   169  0 Aug29 ?        00:00:00 [aio/5]
root       447   169  0 Aug29 ?        00:00:00 [aio/6]
root       448   169  0 Aug29 ?        00:00:00 [aio/7]
root       598   169  0 Aug29 ?        00:00:00 [kpsmoused]
root       703   169  0 Aug29 ?        00:00:00 [mpt_poll_0]
root       704   169  0 Aug29 ?        00:00:00 [scsi_eh_0]
root       732   169  0 Aug29 ?        00:00:00 [kstriped]
root       769   169  0 Aug29 ?        00:00:01 [kjournald]
root       794   169  0 Aug29 ?        00:00:00 [kauditd]
root       827     1  0 Aug29 ?        00:00:00 /sbin/udevd -d
root      1416   169  0 Aug29 ?        00:00:00 [cxgb3]
root      2046   169  0 Aug29 ?        00:00:00 [kmpathd/0]
root      2047   169  0 Aug29 ?        00:00:00 [kmpathd/1]
root      2048   169  0 Aug29 ?        00:00:00 [kmpathd/2]
root      2049   169  0 Aug29 ?        00:00:00 [kmpathd/3]
root      2050   169  0 Aug29 ?        00:00:00 [kmpathd/4]
root      2051   169  0 Aug29 ?        00:00:00 [kmpathd/5]
root      2052   169  0 Aug29 ?        00:00:00 [kmpathd/6]
root      2053   169  0 Aug29 ?        00:00:00 [kmpathd/7]
root      2054   169  0 Aug29 ?        00:00:00 [kmpath_handlerd]
root      2089   169  0 Aug29 ?        00:03:03 [kjournald]
root      2091   169  0 Aug29 ?        00:00:00 [kjournald]
root      2093   169  0 Aug29 ?        00:00:00 [kjournald]
root      2320   169  0 Aug29 ?        00:00:00 [iw_cxgb3]
root      2394   169  0 Aug29 ?        00:00:00 [ib_mcast]
root      2395   169  0 Aug29 ?        00:00:00 [ib_inform]
root      2396   169  0 Aug29 ?        00:00:00 [local_sa]
root      2406   169  0 Aug29 ?        00:00:00 [ib_cm/0]
root      2407   169  0 Aug29 ?        00:00:00 [ib_cm/1]
root      2408   169  0 Aug29 ?        00:00:00 [ib_cm/2]
root      2409   169  0 Aug29 ?        00:00:00 [ib_cm/3]
root      2410   169  0 Aug29 ?        00:00:00 [ib_cm/4]
root      2411   169  0 Aug29 ?        00:00:00 [ib_cm/5]
root      2412   169  0 Aug29 ?        00:00:00 [ib_cm/6]
root      2413   169  0 Aug29 ?        00:00:00 [ib_cm/7]
root      2433   169  0 Aug29 ?        00:00:00 [ipoib]
root      2476   169  0 Aug29 ?        00:00:00 [ib_addr]
root      2486   169  0 Aug29 ?        00:00:00 [iw_cm_wq]
root      2496   169  0 Aug29 ?        00:00:00 [rdma_cm]
root      3111     1  0 Aug29 ?        00:00:00 /sbin/dhclient -1 -q -lf /var/lib/dhclient/dhclient-eth1.leases -pf /var/run/dhclien
root      3383     1  0 Aug29 ?        00:00:00 auditd
root      3385  3383  0 Aug29 ?        00:00:00 /sbin/audispd
root      3415     1  0 Aug29 ?        00:00:00 syslogd -m 0
root      3418     1  0 Aug29 ?        00:00:00 klogd -x
root      3432     1  0 Aug29 ?        00:00:00 irqbalance
rpc       3452     1  0 Aug29 ?        00:00:00 portmap
root      3489   169  0 Aug29 ?        00:00:00 [rpciod/0]
root      3490   169  0 Aug29 ?        00:00:00 [rpciod/1]
root      3491   169  0 Aug29 ?        00:00:00 [rpciod/2]
root      3492   169  0 Aug29 ?        00:00:00 [rpciod/3]
root      3493   169  0 Aug29 ?        00:00:00 [rpciod/4]
root      3494   169  0 Aug29 ?        00:00:00 [rpciod/5]
root      3495   169  0 Aug29 ?        00:00:00 [rpciod/6]
root      3496   169  0 Aug29 ?        00:00:00 [rpciod/7]
root      3509     1  0 Aug29 ?        00:00:00 rpc.statd
root      3541     1  0 Aug29 ?        00:00:00 rpc.idmapd
dbus      3564     1  0 Aug29 ?        00:00:00 dbus-daemon --system
root      3617     1  0 Aug29 ?        00:00:00 [lockd]
root      3642     1  0 Aug29 ?        00:00:00 pcscd
root      3656     1  0 Aug29 ?        00:00:00 /usr/sbin/acpid
68        3669     1  0 Aug29 ?        00:00:00 hald
root      3670  3669  0 Aug29 ?        00:00:00 hald-runner
68        3678  3670  0 Aug29 ?        00:00:00 hald-addon-acpi: listening on acpid socket /var/run/acpid.socket
root      3740     1  0 Aug29 ?        00:00:00 /usr/bin/hidd --server
root      3775     1  0 Aug29 ?        00:00:00 automount
root      3799     1  0 Aug29 ?        00:00:00 /usr/sbin/sshd
ntp       3818     1  0 Aug29 ?        00:00:00 ntpd -u ntp:ntp -p /var/run/ntpd.pid -g
root      3831     1  0 Aug29 ?        00:00:00 crond
root      3872     1  0 Aug29 ?        00:00:06 /opt/torque/sbin/pbs_mom
root      3896     1  0 Aug29 ?        00:00:00 /usr/sbin/atd
condor    3911     1  0 Aug29 ?        00:00:07 /usr/sbin/condor_master -pidfile /condor/var/run/condor/master.pid
condor    3921  3911  0 Aug29 ?        00:00:25 condor_startd -f
root      3960     1  0 Aug29 ?        00:00:00 /usr/sbin/smartd -q never
root      3963     1  0 Aug29 tty1     00:00:00 /sbin/mingetty tty1
root      3965     1  0 Aug29 tty2     00:00:00 /sbin/mingetty tty2
root      3969     1  0 Aug29 tty3     00:00:00 /sbin/mingetty tty3
root      3970     1  0 Aug29 tty4     00:00:00 /sbin/mingetty tty4
root      3971     1  0 Aug29 tty5     00:00:00 /sbin/mingetty tty5
root      3973     1  0 Aug29 tty6     00:00:00 /sbin/mingetty tty6
root      4322  3799  0 Aug29 ?        00:00:00 sshd: cfarbe...@pts/0
512       4323  4322  0 Aug29 pts/0    00:00:00 -bash
512      17952  3872  0 10:00 ?        00:00:00 orted -mca ess env -mca orte_ess_jobid 2802450432 -mca orte_ess_vpid 2 -mca orte_ess
512      17953 17952  0 10:00 ?        00:00:00 /bin/sh /opt/bin/dacapo_nexus_2.7.8.exec /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSv
512      17954 17952  0 10:00 ?        00:00:00 /bin/sh /opt/bin/dacapo_nexus_2.7.8.exec /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSv
512      17955 17952  0 10:00 ?        00:00:00 /bin/sh /opt/bin/dacapo_nexus_2.7.8.exec /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSv
512      17956 17952  0 10:00 ?        00:00:00 /bin/sh /opt/bin/dacapo_nexus_2.7.8.exec /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSv
512      17957 17952  0 10:00 ?        00:00:00 /bin/sh /opt/bin/dacapo_nexus_2.7.8.exec /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSv
512      17958 17952  0 10:00 ?        00:00:00 /bin/sh /opt/bin/dacapo_nexus_2.7.8.exec /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSv
512      17959 17953 99 10:00 ?        00:03:37 /opt/bin/dacapo_2.7.8_nexus.run /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSvB Out-Ir3
512      17960 17954 99 10:00 ?        00:03:37 /opt/bin/dacapo_2.7.8_nexus.run /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSvB Out-Ir3
512      17961 17952  0 10:00 ?        00:00:00 /bin/sh /opt/bin/dacapo_nexus_2.7.8.exec /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSv
512      17962 17952  0 10:00 ?        00:00:00 /bin/sh /opt/bin/dacapo_nexus_2.7.8.exec /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSv
512      17963 17955 99 10:00 ?        00:03:37 /opt/bin/dacapo_2.7.8_nexus.run /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSvB Out-Ir3
512      17964 17956 99 10:00 ?        00:03:37 /opt/bin/dacapo_2.7.8_nexus.run /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSvB Out-Ir3
512      17965 17957 99 10:00 ?        00:03:37 /opt/bin/dacapo_2.7.8_nexus.run /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSvB Out-Ir3
512      17966 17961 99 10:00 ?        00:03:37 /opt/bin/dacapo_2.7.8_nexus.run /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSvB Out-Ir3
512      17967 17958 99 10:00 ?        00:03:36 /opt/bin/dacapo_2.7.8_nexus.run /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSvB Out-Ir3
512      17968 17962 99 10:00 ?        00:03:37 /opt/bin/dacapo_2.7.8_nexus.run /work/cfarberow/ORR/Ir3re/test/3.5/tmpi0uSvB Out-Ir3
root     18046  3831  0 10:03 ?        00:00:00 crond
root     18047 18046  0 10:03 ?        00:00:00 [bash] <defunct>
root     18068 18046  0 10:03 ?        00:00:00 /usr/sbin/sendmail -FCronDaemon -i -odi -oem -oi -t
root     18074  3799  0 10:03 ?        00:00:00 sshd: r...@pts/9
root     18075 18074  0 10:03 pts/9    00:00:00 -bash
root     18105 18075  0 10:03 pts/9    00:00:00 ps -ef

[r...@eu001 ~]# ps -ef | grep ipmi
root     18107 18075  0 10:04 pts/9    00:00:00 grep ipmi

> 4) A bug in the BMC. You didn't mention which vendor's IPMI BMC is
> being used, but from the To list, it might be Dell (?). Get the BMC
> version number and find out if there is an upgrade from the vendor.
> That is more important than what ipmitool does. If the BMC is in a bad
> state, the history from the IPMI SEL may be helpful to the vendor. If
> it is reproducible after an upgrade, the vendor should be able to fix
> it.

Yup! It is indeed Dell: an R410 server. I've posted on the Dell list too.
I'll wait to see if I get any ideas there.

Thanks again!

--
Rahul
_______________________________________________
Ipmitool-devel mailing list
Ipmitool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ipmitool-devel
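
Footnote on the 'ps -ef | grep ipmi' check in the message above: with ~300 nodes it may be easier to script that check than to read the listings by eye. What follows is only a minimal sketch in Python, not something taken from the thread. The function name ipmi_processes and the keyword list are assumptions made for illustration; the keywords would need adjusting for whatever IPMI tooling is actually installed. It takes captured 'ps -ef' text, drops the header and the grep process itself, and returns the lines whose command column mentions an IPMI keyword.

```python
# Hypothetical keyword list (an assumption, not from the thread).
# The substring "ipmi" also matches names such as ipmitool or ipmievd.
KEYWORDS = ("ipmi", "bmc-watchdog")


def ipmi_processes(ps_output, keywords=KEYWORDS):
    """Return the 'ps -ef' lines whose CMD column mentions an IPMI keyword."""
    hits = []
    for line in ps_output.splitlines():
        # ps -ef columns: UID PID PPID C STIME TTY TIME CMD
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # header line or something that is not a process row
        cmd = fields[7].lower()
        if cmd.startswith("grep "):
            continue  # the grep that was searching for "ipmi" is not a hit
        if any(keyword in cmd for keyword in keywords):
            hits.append(line)
    return hits
```

One possible way to use it would be to capture the listing on each node, for example with subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout, and pass that string to ipmi_processes(). An empty list corresponds to the result shown above, where the only match was the grep itself.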