Hi all,
We recently found a bug when using the ipmi driver on our machine. I don't know
whether it is caused by the kernel ipmi driver or whether the hardware is
responsible. Any comments are welcome, thanks!
Bad version: Linux kernel 3.0.58 (SLES11 SP2 has the same problem)
Good version: Linux kernel 2.6.32
On our machine, the ipmi driver always prints messages like the following
after a long run:
1167440 Jul 30 17:01:15 BMC_test kernel: [ 5156.759059] KCS: State = 5, 42
1167441 Jul 30 17:01:15 BMC_test kernel: [ 5156.759063] KCS: State = 5, 42
1167442 Jul 30 17:01:15 BMC_test kernel: [ 5156.759066] KCS: State = 5, 0
1167443 Jul 30 17:01:15 BMC_test kernel: [ 5156.759070] KCS: State = 0, 1
1167444 Jul 30 17:01:15 BMC_test kernel: [ 5156.760065] KCS: State = 0,
07.257249] KCS: State = 9, 0
1167445 Jul 30 17:01:15 BMC_test kernel: [ 5157.257252] KCS: State = 9, 0
1167446 Jul 30 17:01:15 BMC_test kernel: [ 5157.257256] KCS: State = 9, 0
1167447 Jul 30 17:01:15 BMC_test kernel: [ 5157.257259] KCS: State = 9, 0
1167448 Jul 30 17:01:15 BMC_test kernel: [ 5157.257263] KCS: State = 9, 0
1167449 Jul 30 17:01:15 BMC_test kernel: [ 5157.257263] KCS: State = 9, 0
1167450 Jul 30 17:01:15 BMC_test kernel: [ 5157.257263] KCS: State = 9, 0
1167451 Jul 30 17:01:15 BMC_test kernel: [ 5157.257263] KCS: State = 9, 0
.........................................................................
We found that once KCS enters state (9, 0), it cannot exit from that loop.
After a while, the BMC reboots the OS because the ipmi driver has not fed its
watchdog for too long.
It seems that the kernel keeps waiting for the OBF bit to become 1, but
GET_STATUS_OBF(status) returns 0.
Because time is 0 here, obf_timeout is never decremented, so check_obf()
always returns 0.
static inline int check_obf(struct si_sm_data *kcs, unsigned char status,
			    long time)
{
	if (!GET_STATUS_OBF(status)) {
		kcs->obf_timeout -= time;
		if (kcs->obf_timeout < 0) {
			start_error_recovery(kcs, "OBF not ready in time");
			return 1;
		}
		return 0;
	}
	kcs->obf_timeout = OBF_RETRY_TIMEOUT;
	return 1;
}
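
To make the time == 0 argument concrete, here is a minimal user-space sketch
of the same countdown logic. It is only a model, not driver code: the names
kcs_model, model_check_obf() and polls_until_timeout() are hypothetical, and
MODEL_OBF_RETRY_TIMEOUT is an illustrative value, not necessarily the
driver's real OBF_RETRY_TIMEOUT.

```c
#include <stdbool.h>

/* Illustrative value only; the real constant is defined in the driver. */
#define MODEL_OBF_RETRY_TIMEOUT 1000000L

/* Hypothetical stand-in for the parts of si_sm_data used by check_obf(). */
struct kcs_model {
	long obf_timeout;	/* remaining time budget */
	bool timed_out;		/* set where the driver would start error recovery */
};

/* Same shape as check_obf(), with the OBF bit passed in as a flag. */
static int model_check_obf(struct kcs_model *kcs, bool obf_set, long time)
{
	if (!obf_set) {
		kcs->obf_timeout -= time;
		if (kcs->obf_timeout < 0) {
			kcs->timed_out = true;
			return 1;
		}
		return 0;
	}
	kcs->obf_timeout = MODEL_OBF_RETRY_TIMEOUT;
	return 1;
}

/*
 * Poll with OBF stuck at 0, passing the same `time` on every call.
 * Returns the number of calls made when the timeout fires, or -1 if
 * it has not fired after max_polls calls.
 */
static long polls_until_timeout(long time, long max_polls)
{
	struct kcs_model kcs = { MODEL_OBF_RETRY_TIMEOUT, false };
	long i;

	for (i = 1; i <= max_polls; i++) {
		if (model_check_obf(&kcs, false, time))
			return i;
	}
	return -1;
}
```

In this model, polls_until_timeout(0, n) returns -1 for any n, because the
budget is never reduced, while any positive time eventually trips the
timeout. That matches the behaviour we suspect when OBF stays 0 and the
caller keeps passing time == 0.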
So kcs_event() always returns SI_SM_CALL_WITH_DELAY:
	case KCS_ERROR3:
		if (state != KCS_IDLE_STATE) {
			start_error_recovery(kcs,
					     "Not in idle state for error3");
			break;
		}
		if (!check_obf(kcs, status, time))
			return SI_SM_CALL_WITH_DELAY;
static enum si_sm_result smi_event_handler(struct smi_info *smi_info,
					   int time)
{
	enum si_sm_result si_sm_result;
 restart:
	/*
	 * There used to be a loop here that waited a little while
	 * (around 25us) before giving up. That turned out to be
	 * pointless, the minimum delays I was seeing were in the 300us
	 * range, which is far too long to wait in an interrupt. So
	 * we just run until the state machine tells us something
	 * happened or it needs a delay.
	 */
	si_sm_result = smi_info->handlers->event(smi_info->si_sm, time);
	time = 0;
	while (si_sm_result == SI_SM_CALL_WITHOUT_DELAY)
------------------------------> It looks like we are always in the loop here
		si_sm_result = smi_info->handlers->event(smi_info->si_sm, 0);
We found that Matthew Garrett committed several patches that modified the
related code in smi_timeout(): commit ea4078ca and commit 3326f4f2.
We tried removing the if check shown below and tested the machine under
stress. After more than 24 hours of testing, the result was OK. Without
removing this if check, the bug is triggered after about 8 hours of testing.
 do_mod_timer:
	if (smi_result != SI_SM_IDLE)
-------------------> After removing the if line above, the test result seems
                     good, or at least better than before.
		mod_timer(&(smi_info->si_timer), timeout);
So is this the root cause of the issue?
Also, I don't know whether the kernel needs to provide a mechanism to prevent
the ipmi driver from entering this endless loop.
Or is this a hardware problem?
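
To illustrate what I mean by such a mechanism, here is a rough user-space
sketch of a bounded retry loop. It is not a proposed patch: the model_result
enum, run_bounded(), MODEL_MAX_SPINS and the two test callbacks are all
hypothetical names, and the spin limit is an arbitrary illustrative number.
It only shows the idea of giving up after a fixed number of back-to-back
calls instead of spinning forever.

```c
/* Hypothetical result codes, loosely modelled on enum si_sm_result. */
enum model_result {
	MODEL_CALL_WITHOUT_DELAY,
	MODEL_CALL_WITH_DELAY,
	MODEL_IDLE,
	MODEL_HOSED		/* reported when the guard gives up */
};

typedef enum model_result (*model_event_fn)(void *ctx, long time);

/* Arbitrary illustrative limit on back-to-back calls. */
#define MODEL_MAX_SPINS 1000

/*
 * Call the state machine once with the elapsed time, then keep calling
 * it with time == 0 while it asks to be called again without delay,
 * but stop after MODEL_MAX_SPINS extra calls.
 */
static enum model_result run_bounded(model_event_fn event, void *ctx, long time)
{
	enum model_result r = event(ctx, time);
	int spins = 0;

	while (r == MODEL_CALL_WITHOUT_DELAY) {
		if (++spins > MODEL_MAX_SPINS)
			return MODEL_HOSED;
		r = event(ctx, 0);
	}
	return r;
}

/* Test callback: a state machine that never makes progress. */
static enum model_result stuck_event(void *ctx, long time)
{
	long *calls = ctx;

	(void)time;
	++*calls;
	return MODEL_CALL_WITHOUT_DELAY;
}

/* Test callback: a state machine that goes idle on its third call. */
static enum model_result settles_event(void *ctx, long time)
{
	long *calls = ctx;

	(void)time;
	return (++*calls < 3) ? MODEL_CALL_WITHOUT_DELAY : MODEL_IDLE;
}
```

With stuck_event() the guard returns MODEL_HOSED instead of looping forever,
and with settles_event() it returns MODEL_IDLE as usual. Whether something
like this belongs in the real driver, and what the caller should do when the
limit is hit, is exactly what I am asking about.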
--
Thanks!
Yijing
_______________________________________________
Openipmi-developer mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/openipmi-developer