Hello,

we operate a server running wheezy with a SuperMicro BMC¹ and are experiencing random reboots.
¹) http://www.supermicro.com/products/motherboard/Xeon/C600/X9DRW-iF.cfm

I think I have traced them back to the BMC watchdog, which we have enabled:

    ipmitool> bmc watchdog get
    Watchdog Timer Use:     SMS/OS (0x44)
    Watchdog Timer Is:      Started/Running
    Watchdog Timer Actions: Hard Reset (0x01)
    Pre-timeout interval:   0 seconds
    Timer Expiration Flags: 0x00
    Initial Countdown:      900 sec
    Present Countdown:      899 sec

On the system, freeipmi-bmc-watchdog 1.1.5-3 is running. In debug mode, a successful run looks like this:

    Oct 09 05:28:04 =====================================================
    Oct 09 05:28:04 Get Watchdog Timer Request
    Oct 09 05:28:04 =====================================================
    Oct 09 05:28:04 [ 25h] = cmd[ 8b]
    Oct 09 05:28:04 =====================================================
    Oct 09 05:28:04 Get Watchdog Timer Request
    Oct 09 05:28:04 =====================================================
    Oct 09 05:28:04 [ 25h] = cmd[ 8b]
    Oct 09 05:28:04 [ 0h] = comp_code[ 8b]
    Oct 09 05:28:04 [ 4h] = timer_use[ 3b]
    Oct 09 05:28:04 [ 0h] = reserved1[ 3b]
    Oct 09 05:28:04 [ 1h] = timer_state[ 1b]
    Oct 09 05:28:04 [ 0h] = log[ 1b]
    Oct 09 05:28:04 [ 1h] = timeout_action[ 3b]
    Oct 09 05:28:04 [ 0h] = reserved2[ 1b]
    Oct 09 05:28:04 [ 0h] = pre_timeout_interrupt[ 3b]
    Oct 09 05:28:04 [ 0h] = reserved3[ 1b]
    Oct 09 05:28:04 [ 0h] = pre_timeout_interval[ 8b]
    Oct 09 05:28:04 [ 0h] = reserved4[ 1b]
    Oct 09 05:28:04 [ 0h] = timer_use_expiration_flag.bios_frb2[ 1b]
    Oct 09 05:28:04 [ 0h] = timer_use_expiration_flag.bios_post[ 1b]
    Oct 09 05:28:04 [ 0h] = timer_use_expiration_flag.os_load[ 1b]
    Oct 09 05:28:04 [ 0h] = timer_use_expiration_flag.sms_os[ 1b]
    Oct 09 05:28:04 [ 0h] = timer_use_expiration_flag.oem[ 1b]
    Oct 09 05:28:04 [ 0h] = reserved5[ 1b]
    Oct 09 05:28:04 [ 0h] = reserved6[ 1b]
    Oct 09 05:28:04 [ 2328h] = initial_countdown_value[16b]
    Oct 09 05:28:04 [ 20D2h] = present_countdown_value[16b]
    Oct 09 05:28:04 =====================================================
    Oct 09 05:28:04 Reset Watchdog Timer Request
    Oct 09 05:28:04 =====================================================
    Oct 09 05:28:04 [ 22h] = cmd[ 8b]
    Oct 09 05:28:04 =====================================================
    Oct 09 05:28:04 Reset Watchdog Timer Request
    Oct 09 05:28:04 =====================================================
    Oct 09 05:28:04 [ 22h] = cmd[ 8b]
    Oct 09 05:28:04 [ 0h] = comp_code[ 8b]

Every now and then, the following happens instead:

    Oct 09 05:29:04 =====================================================
    Oct 09 05:29:04 Get Watchdog Timer Request
    Oct 09 05:29:04 =====================================================
    Oct 09 05:29:04 [ 25h] = cmd[ 8b]
    Oct 09 05:29:06 =====================================================
    Oct 09 05:29:06 Get Watchdog Timer Request
    Oct 09 05:29:06 =====================================================
    Oct 09 05:29:06 [ 25h] = cmd[ 8b]
    Oct 09 05:29:06 [ 0h] = comp_code[ 8b]
    Oct 09 05:29:06 [ 3h] = timer_use[ 3b]
    Oct 09 05:29:06 [ 7h] = reserved1[ 3b]
    Oct 09 05:29:06 [ 0h] = timer_state[ 1b]
    [Oct 09 05:29:06]: _get_watchdog_timer_cmd: fiid_obj_get: 'present_countdown_value': data not available
    [Oct 09 05:29:06]: timer stopped by another process
    [Oct 09 05:29:06]: stopping bmc-watchdog daemon
    Oct 09 05:29:06 [ 1h] = log[ 1b]

The daemon then stops, and the machine reboots once the timer expires.

We have worked with Supermicro and the vendor, replaced the mainboard, and tried every available firmware and BIOS version, but the problem persists. However, this is the only such case among 533 identical systems the vendor has sold in the last 3 years. I am the only one using Debian, apparently.

Do you have any idea what this could be and, more importantly, how I could address it? I'd like to keep the watchdog functionality, but as it stands I have to turn it off, of course, unless I find a cure.

If asking here yields no result, I will take this to the freeipmi people…

Any input appreciated!

Thanks,

-- 
 .''`.   martin f. krafft <madduck@d.o>      @martinkrafft
: :'  :  proud Debian developer
`. `'`   http://people.debian.org/~madduck
  `-  Debian - when you have better things to do than fixing systems

in africa some of the native tribes have a custom of beating the ground
with clubs and uttering spine chilling cries. anthropologists call
this a form of primitive self-expression. in america they call it golf.
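
For readers comparing the two debug dumps above, the following is a minimal sketch, not part of bmc-watchdog, that repacks the printed bit fields into the raw "timer use" byte and converts the countdown values to seconds. It assumes that FreeIPMI prints fields least significant bit first (which reproduces the 0x44 that ipmitool reports) and that countdown values are in units of 100 ms (which reproduces the 900 sec initial countdown). The function names are hypothetical.

```python
# Field values are copied from the two debug dumps in the message.
# Assumption: fields are listed least significant bit first, so the byte is
#   timer_use (bits 0-2) | reserved1 (bits 3-5) | timer_state (bit 6) | log (bit 7)

# Timer-use codes as defined for the IPMI watchdog timer.
TIMER_USE = {1: "BIOS FRB2", 2: "BIOS/POST", 3: "OS Load", 4: "SMS/OS", 5: "OEM"}


def pack_timer_use_byte(timer_use, reserved1, timer_state, log):
    """Rebuild the raw byte from the four bit fields shown in the dump."""
    return (
        (timer_use & 0x7)
        | ((reserved1 & 0x7) << 3)
        | ((timer_state & 0x1) << 6)
        | ((log & 0x1) << 7)
    )


def countdown_seconds(counts):
    """Convert a countdown value (assumed 100 ms per count) to seconds."""
    return counts / 10


good = pack_timer_use_byte(timer_use=4, reserved1=0, timer_state=1, log=0)
bad = pack_timer_use_byte(timer_use=3, reserved1=7, timer_state=0, log=1)

print(f"good run: 0x{good:02X} use={TIMER_USE[good & 0x7]} running={bool(good & 0x40)}")
print(f"bad run:  0x{bad:02X} use={TIMER_USE[bad & 0x7]} running={bool(bad & 0x40)}")
print(f"initial countdown: {countdown_seconds(0x2328)} s")
print(f"present countdown: {countdown_seconds(0x20D2)} s")
```

Under these assumptions the successful run decodes to 0x44 (SMS/OS, running) and the failing run to 0xBB (OS Load, stopped, with all three reserved bits set). Non-zero reserved bits may point to a malformed or truncated response rather than a timer that another process really stopped, but that is only a reading of the dump, not a confirmed diagnosis.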