Hi,
Any comments?
On 2013/8/13 16:17, Yijing Wang wrote:
> Hi all,
> We recently found a bug when using the ipmi driver on our machine. I don't
> know whether it is caused by the kernel ipmi driver or whether the hardware
> is responsible. Any comments are welcome, thanks!
>
> On our machine, the ipmi driver always prints messages like the following
> after a long run:
>
> Bad version: Linux kernel 3.0.58 (SLES11 SP2 also has the problem)
> Good version: Linux kernel 2.6.32
>
> 1167440 Jul 30 17:01:15 BMC_test kernel: [ 5156.759059] KCS: State = 5, 42
> 1167441 Jul 30 17:01:15 BMC_test kernel: [ 5156.759063] KCS: State = 5, 42
> 1167442 Jul 30 17:01:15 BMC_test kernel: [ 5156.759066] KCS: State = 5, 0
> 1167443 Jul 30 17:01:15 BMC_test kernel: [ 5156.759070] KCS: State = 0, 1
> 1167444 Jul 30 17:01:15 BMC_test kernel: [ 5156.760065] KCS: State = 0,
> 07.257249] KCS: State = 9, 0
> 1167445 Jul 30 17:01:15 BMC_test kernel: [ 5157.257252] KCS: State = 9, 0
> 1167446 Jul 30 17:01:15 BMC_test kernel: [ 5157.257256] KCS: State = 9, 0
> 1167447 Jul 30 17:01:15 BMC_test kernel: [ 5157.257259] KCS: State = 9, 0
> 1167448 Jul 30 17:01:15 BMC_test kernel: [ 5157.257263] KCS: State = 9, 0
> 1167449 Jul 30 17:01:15 BMC_test kernel: [ 5157.257263] KCS: State = 9, 0
> 1167450 Jul 30 17:01:15 BMC_test kernel: [ 5157.257263] KCS: State = 9, 0
> 1167451 Jul 30 17:01:15 BMC_test kernel: [ 5157.257263] KCS: State = 9, 0
> .........................................................................
>
> We found that once KCS enters state (9, 0), it cannot leave that loop.
> After a while, the BMC reboots the OS because ipmi has not fed its
> watchdog for too long.
>
> It seems that the kernel keeps waiting for the OBF bit to become 1, but
> GET_STATUS_OBF(status) returns 0. Because time is 0 here, obf_timeout is
> never decremented, so check_obf() always returns 0.
>
> static inline int check_obf(struct si_sm_data *kcs, unsigned char status,
> 			    long time)
> {
> 	if (!GET_STATUS_OBF(status)) {
> 		kcs->obf_timeout -= time;
> 		if (kcs->obf_timeout < 0) {
> 			start_error_recovery(kcs, "OBF not ready in time");
> 			return 1;
> 		}
> 		return 0;
> 	}
> 	kcs->obf_timeout = OBF_RETRY_TIMEOUT;
> 	return 1;
> }
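
To illustrate the point about time being 0, here is a minimal user-space
sketch. It is not the driver code: kcs_model, check_obf_model() and
calls_until_timeout() are hypothetical names, and the OBF_RETRY_TIMEOUT value
is a placeholder assumed for the example. It only models the countdown logic
quoted above, with OBF never set.

```c
/* Placeholder value assumed for this sketch; the driver defines its own. */
#define OBF_RETRY_TIMEOUT 1000000L

/* Hypothetical stand-in for the two pieces of state that matter here. */
struct kcs_model {
	long obf_timeout;        /* remaining budget, same role as kcs->obf_timeout */
	int  in_error_recovery;  /* set where the driver calls start_error_recovery() */
};

/* Same control flow as the quoted check_obf(), with OBF passed in as a flag. */
static int check_obf_model(struct kcs_model *kcs, int obf_set, long time)
{
	if (!obf_set) {
		kcs->obf_timeout -= time;
		if (kcs->obf_timeout < 0) {
			kcs->in_error_recovery = 1;
			return 1;
		}
		return 0;
	}
	kcs->obf_timeout = OBF_RETRY_TIMEOUT;
	return 1;
}

/*
 * Call the model repeatedly with OBF never set and a fixed elapsed time per
 * call. Returns the number of the call on which the timeout fires, or -1 if
 * it has not fired after max_calls calls.
 */
static long calls_until_timeout(long time_per_call, long max_calls)
{
	struct kcs_model kcs = { OBF_RETRY_TIMEOUT, 0 };
	long i;

	for (i = 1; i <= max_calls; i++) {
		if (check_obf_model(&kcs, 0, time_per_call) &&
		    kcs.in_error_recovery)
			return i;
	}
	return -1;
}
```

In this model, calls_until_timeout(1000, 100000) reaches the timeout, while
calls_until_timeout(0, 100000) returns -1: with time == 0 the budget never
shrinks, so the "OBF not ready in time" recovery path is never taken no
matter how many times the function is called.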
>
> So kcs_event() always returns SI_SM_CALL_WITH_DELAY:
>
> 	case KCS_ERROR3:
> 		if (state != KCS_IDLE_STATE) {
> 			start_error_recovery(kcs,
> 					     "Not in idle state for error3");
> 			break;
> 		}
>
> 		if (!check_obf(kcs, status, time))
> 			return SI_SM_CALL_WITH_DELAY;
>
> static enum si_sm_result smi_event_handler(struct smi_info *smi_info,
> 					   int time)
> {
> 	enum si_sm_result si_sm_result;
>
>  restart:
> 	/*
> 	 * There used to be a loop here that waited a little while
> 	 * (around 25us) before giving up. That turned out to be
> 	 * pointless, the minimum delays I was seeing were in the 300us
> 	 * range, which is far too long to wait in an interrupt. So
> 	 * we just run until the state machine tells us something
> 	 * happened or it needs a delay.
> 	 */
> 	si_sm_result = smi_info->handlers->event(smi_info->si_sm, time);
> 	time = 0;
> 	/* <-- It looks like we are always in the loop below */
> 	while (si_sm_result == SI_SM_CALL_WITHOUT_DELAY)
> 		si_sm_result = smi_info->handlers->event(smi_info->si_sm, 0);
>
>
> We found that Matthew Garrett committed several patches that modified the
> related code in smi_timeout(): commit ea4078ca and commit 3326f4f2.
>
> We tried removing the if check and stress-tested the machine. After more
> than 24 hours of testing, the result was OK. Without removing the if check,
> the bug is triggered after about 8 hours of testing.
>
>  do_mod_timer:
> 	/* <-- After removing the if line below, the test result seems good,
> 	 *     or at least better than before. */
> 	if (smi_result != SI_SM_IDLE)
> 		mod_timer(&(smi_info->si_timer), timeout);
>
> So is this the root cause of the issue?
>
> Also, I don't know whether the kernel needs to provide a mechanism to
> prevent the ipmi driver from entering this endless loop.
> Or is this a hardware problem?
>
--
Thanks!
Yijing