On 03/21/2014 04:02 PM, Strösser, Bodo wrote:
> Hi Corey,
>
> We are using servers, that have a BMC connected via KCS.
> We are running SuSE SLES11-SP2, kernel 3.0.101-0.7.17, but
> the problem I see seems to be relevant for Mainline also.
>
> For our BMC, ipmi_si.ko uses both, ipmi_thread() and
> smi_timeout() in parallel to call smi_event_handler().
> But only smi_timeout() passes the elapsed time for timeout
> handling to smi_event_handler(), but not ipmi_thread().
>
> After smi_timeout detected IDLE, it does not start the
> timer again. Instead, a call to sender() will restart the
> timer. Unfortunately, if a command to the BMC is not
> started via sender(), but generated internally (eg. from
> smi_event_handler() because of smi_info->req_events) and
> started via smi_info->handlers->start_transaction(),
> AFAICS the timer is not started. So, the timeouts in the
> KCS driver while waiting for IBF or OBF won't work in this
> situation.
>
> The BMCs in our servers seem to have a sporadic problem
> that leads to the error recovery being started because of
> an timeout while waiting for OBF. But in case the timer is
> stopped while the problem occurs, the driver loop waiting
> for OBF forever.

I think I can see what is happening here, but I don't think your fix is
correct.  With your fix the driver would never time out if the interface
wasn't working, because it's starting the timer in the thread, the timer
is what is used to time out operations, and the thread will run
constantly while the driver is operating.

Try the attached patches instead, please.  I've done some minimal testing.

>
> P.S.: There might be another minor issue in ipmi_kcs_sm.c
> If check_obf() times out, it starts the error handling but does
> not reset kcs->obf_timeout. Thus, in state KCS_ERROR2 check_obf()
> is called with the timer already expired. So I'd like to
> suggest this small change:
>
> --- a/drivers/char/ipmi/ipmi_kcs_sm.c   2014-03-19 10:42:51.000000000 +0100
> +++ b/drivers/char/ipmi/ipmi_kcs_sm.c   2014-03-19 10:48:33.000000000 +0100
> @@ -251,8 +251,9 @@ static inline int check_obf(struct si_sm
>       if (!GET_STATUS_OBF(status)) {
>               kcs->obf_timeout -= time;
>               if (kcs->obf_timeout < 0) {
> -                 start_error_recovery(kcs, "OBF not ready in time");
> -                 return 1;
> +                     start_error_recovery(kcs, "OBF not ready in time");
> +                     kcs->obf_timeout = OBF_RETRY_TIMEOUT;
> +                     return 1;
>               }
>               return 0;
>       }
>
Indeed, you are right.  I've attached another patch for that.

-corey
>From 38c53e4a65ec9e9f3dba4680c1f0b48c0a0f8753 Mon Sep 17 00:00:00 2001
From: Corey Minyard <[email protected]>
Date: Mon, 24 Mar 2014 10:42:37 -0500
Subject: [PATCH 1/2] ipmi: Fix a race restarting the timer
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

With recent changes it is possible for the timer handler to detect
an idle interface and not start the timer, but the thread to start
an operation at the same time.  The thread will not start the timer
in that instance, resulting in the timer not running.

Instead, move all timer operations under the lock and start the timer
in the thread if it detect non-idle and the timer is not already
running.  Moving under locks allows the last timeout to be set in
both the thread and the timer.

Also fix a few other timeout bugs: setting the last timeout when the
interrupt has to be disabled and the timer started, and setting
the last timeout in check_start_timer_thread possibly racing with
the timer

Reported-by: Bodo Strösser <[email protected]>
Signed-off-by: Corey Minyard <[email protected]>
---
 drivers/char/ipmi/ipmi_si_intf.c | 41 ++++++++++++++++++++++------------------
 1 file changed, 23 insertions(+), 18 deletions(-)

diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c
index 03f4189..7799ebe 100644
--- a/drivers/char/ipmi/ipmi_si_intf.c
+++ b/drivers/char/ipmi/ipmi_si_intf.c
@@ -435,6 +435,12 @@ static void start_clear_flags(struct smi_info *smi_info)
 	smi_info->si_state = SI_CLEARING_FLAGS;
 }
 
+static void smi_mod_timer(struct smi_info *smi_info, unsigned long new_val)
+{
+	smi_info->last_timeout_jiffies = jiffies;
+	mod_timer(&smi_info->si_timer, jiffies + SI_TIMEOUT_JIFFIES);
+}
+
 /*
  * When we have a situtaion where we run out of memory and cannot
  * allocate messages, we just leave them in the BMC and run the system
@@ -447,8 +453,7 @@ static inline void disable_si_irq(struct smi_info *smi_info)
 		start_disable_irq(smi_info);
 		smi_info->interrupt_disabled = 1;
 		if (!atomic_read(&smi_info->stop_operation))
-			mod_timer(&smi_info->si_timer,
-				  jiffies + SI_TIMEOUT_JIFFIES);
+			smi_mod_timer(smi_info, jiffies + SI_TIMEOUT_JIFFIES);
 	}
 }
 
@@ -908,15 +913,7 @@ static void sender(void                *send_info,
 		list_add_tail(&msg->link, &smi_info->xmit_msgs);
 
 	if (smi_info->si_state == SI_NORMAL && smi_info->curr_msg == NULL) {
-		/*
-		 * last_timeout_jiffies is updated here to avoid
-		 * smi_timeout() handler passing very large time_diff
-		 * value to smi_event_handler() that causes
-		 * the send command to abort.
-		 */
-		smi_info->last_timeout_jiffies = jiffies;
-
-		mod_timer(&smi_info->si_timer, jiffies + SI_TIMEOUT_JIFFIES);
+		smi_mod_timer(smi_info, jiffies + SI_TIMEOUT_JIFFIES);
 
 		if (smi_info->thread)
 			wake_up_process(smi_info->thread);
@@ -1005,6 +1002,18 @@ static int ipmi_thread(void *data)
 
 		spin_lock_irqsave(&(smi_info->si_lock), flags);
 		smi_result = smi_event_handler(smi_info, 0);
+
+		/*
+		 * If the driver is doing something, there is a possible
+		 * race with the timer.  If the timer handler see idle,
+		 * and the thread here sees something else, the timer
+		 * handler won't restart the timer even though it is
+		 * required.  So start it here if necessary.
+		 */
+		if (smi_result != SI_SM_IDLE &&
+					!timer_pending(&smi_info->si_timer))
+			smi_mod_timer(smi_info, jiffies + SI_TIMEOUT_JIFFIES);
+
 		spin_unlock_irqrestore(&(smi_info->si_lock), flags);
 		busy_wait = ipmi_thread_busy_wait(smi_result, smi_info,
 						  &busy_until);
@@ -1074,10 +1083,6 @@ static void smi_timeout(unsigned long data)
 		     * SI_USEC_PER_JIFFY);
 	smi_result = smi_event_handler(smi_info, time_diff);
 
-	spin_unlock_irqrestore(&(smi_info->si_lock), flags);
-
-	smi_info->last_timeout_jiffies = jiffies_now;
-
 	if ((smi_info->irq) && (!smi_info->interrupt_disabled)) {
 		/* Running with interrupts, only do long timeouts. */
 		timeout = jiffies + SI_TIMEOUT_JIFFIES;
@@ -1099,7 +1104,8 @@ static void smi_timeout(unsigned long data)
 
  do_mod_timer:
 	if (smi_result != SI_SM_IDLE)
-		mod_timer(&(smi_info->si_timer), timeout);
+		smi_mod_timer(smi_info, timeout);
+	spin_unlock_irqrestore(&(smi_info->si_lock), flags);
 }
 
 static irqreturn_t si_irq_handler(int irq, void *data)
@@ -1147,8 +1153,7 @@ static int smi_start_processing(void       *send_info,
 
 	/* Set up the timer that drives the interface. */
 	setup_timer(&new_smi->si_timer, smi_timeout, (long)new_smi);
-	new_smi->last_timeout_jiffies = jiffies;
-	mod_timer(&new_smi->si_timer, jiffies + SI_TIMEOUT_JIFFIES);
+	smi_mod_timer(new_smi, jiffies + SI_TIMEOUT_JIFFIES);
 
 	/*
 	 * Check if the user forcefully enabled the daemon.
-- 
1.8.3.1

>From 019987261be720bb959e375b54403498e75e1e84 Mon Sep 17 00:00:00 2001
From: Corey Minyard <[email protected]>
Date: Mon, 24 Mar 2014 10:52:45 -0500
Subject: [PATCH 2/2] ipmi: Reset the KCS timeout when starting error recovery
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The OBF timer in KCS was not reset in one situation when error
recovery was started, resulting in an immediate timeout.

Reported-by: Bodo Strösser <[email protected]>
Signed-off-by: Corey Minyard <[email protected]>
---
 drivers/char/ipmi/ipmi_kcs_sm.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/char/ipmi/ipmi_kcs_sm.c b/drivers/char/ipmi/ipmi_kcs_sm.c
index 6a4bdc1..8c25f59 100644
--- a/drivers/char/ipmi/ipmi_kcs_sm.c
+++ b/drivers/char/ipmi/ipmi_kcs_sm.c
@@ -251,8 +251,9 @@ static inline int check_obf(struct si_sm_data *kcs, unsigned char status,
 	if (!GET_STATUS_OBF(status)) {
 		kcs->obf_timeout -= time;
 		if (kcs->obf_timeout < 0) {
-		    start_error_recovery(kcs, "OBF not ready in time");
-		    return 1;
+			kcs->obf_timeout = OBF_RETRY_TIMEOUT;
+			start_error_recovery(kcs, "OBF not ready in time");
+			return 1;
 		}
 		return 0;
 	}
-- 
1.8.3.1

------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Openipmi-developer mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/openipmi-developer

Reply via email to