Hi Corey,

We are using servers, that have a BMC connected via KCS.
We are running SuSE SLES11-SP2, kernel 3.0.101-0.7.17, but
the problem I see seems to be relevant for Mainline also.

For our BMC, ipmi_si.ko uses both, ipmi_thread() and
smi_timeout() in parallel to call smi_event_handler().
But only smi_timeout() passes the elapsed time for timeout
handling to smi_event_handler(), but not ipmi_thread().

After smi_timeout detected IDLE, it does not start the
timer again. Instead, a call to sender() will restart the
timer. Unfortunately, if a command to the BMC is not
started via sender(), but generated internally (eg. from
smi_event_handler() because of smi_info->req_events) and
started via smi_info->handlers->start_transaction(),
AFAICS the timer is not started. So, the timeouts in the
KCS driver while waiting for IBF or OBF won't work in this
situation.

The BMCs in our servers seem to have a sporadic problem
that leads to the error recovery being started because of
an timeout while waiting for OBF. But in case the timer is
stopped while the problem occurs, the driver loop waiting
for OBF forever.

I fixed the problem by simply let ipmi_thread() do the
same handling for timeouts as smi_timeout() does:

--- a/drivers/char/ipmi/ipmi_si_intf.c  2014-03-20 12:09:30.000000000 +0100
+++ b/drivers/char/ipmi/ipmi_si_intf.c  2014-03-20 12:08:17.000000000 +0100
@@ -986,6 +986,8 @@ static int ipmi_thread(void *data)
        unsigned long flags;
        enum si_sm_result smi_result;
        struct timespec busy_until;
+       unsigned long jiffies_now;
+       long time_diff;

        ipmi_si_set_not_busy(&busy_until);
        set_user_nice(current, 19);
@@ -993,7 +995,13 @@ static int ipmi_thread(void *data)
                int busy_wait;

                spin_lock_irqsave(&(smi_info->si_lock), flags);
-               smi_result = smi_event_handler(smi_info, 0);
+
+               jiffies_now = jiffies;
+               time_diff = (((long)jiffies_now - 
(long)smi_info->last_timeout_jiffies)
+                            * SI_USEC_PER_JIFFY);
+               smi_info->last_timeout_jiffies = jiffies_now;
+
+               smi_result = smi_event_handler(smi_info, time_diff);
                spin_unlock_irqrestore(&(smi_info->si_lock), flags);
                busy_wait = ipmi_thread_busy_wait(smi_result, smi_info,
                                                  &busy_until);
@@ -1061,12 +1069,12 @@ static void smi_timeout(unsigned long da
        jiffies_now = jiffies;
        time_diff = (((long)jiffies_now - (long)smi_info->last_timeout_jiffies)
                     * SI_USEC_PER_JIFFY);
+       smi_info->last_timeout_jiffies = jiffies_now;
+
        smi_result = smi_event_handler(smi_info, time_diff);

        spin_unlock_irqrestore(&(smi_info->si_lock), flags);

-       smi_info->last_timeout_jiffies = jiffies_now;
-
        if ((smi_info->irq) && (!smi_info->interrupt_disabled)) {
                /* Running with interrupts, only do long timeouts. */
                timeout = jiffies + SI_TIMEOUT_JIFFIES;


AFAICS, this fixes the problem. But I'm not sure, that it is
the right way to fix it. Maybe it would be better to start the timer
and set smi_info->last_timeout_jiffies to the current jiffies each
time before smi_info->handlers->start_transaction() is called for
'internal' commands? I think this could be the better solution for
other BMC devices, that do not use ipmi_thread().

Best regards,
Bodo

P.S.: There might be another minor issue in ipmi_kcs_sm.c
If check_obf() times out, it starts the error handling but does
not reset kcs->obf_timeout. Thus, in state KCS_ERROR2 check_obf()
is called with the timer already expired. So I'd like to
suggest this small change:

--- a/drivers/char/ipmi/ipmi_kcs_sm.c   2014-03-19 10:42:51.000000000 +0100
+++ b/drivers/char/ipmi/ipmi_kcs_sm.c   2014-03-19 10:48:33.000000000 +0100
@@ -251,8 +251,9 @@ static inline int check_obf(struct si_sm
        if (!GET_STATUS_OBF(status)) {
                kcs->obf_timeout -= time;
                if (kcs->obf_timeout < 0) {
-                   start_error_recovery(kcs, "OBF not ready in time");
-                   return 1;
+                       start_error_recovery(kcs, "OBF not ready in time");
+                       kcs->obf_timeout = OBF_RETRY_TIMEOUT;
+                       return 1;
                }
                return 0;
        }


------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Openipmi-developer mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/openipmi-developer

Reply via email to