Re: [PATCH V2] printk: rebalance printk

2015-08-12 Thread Pan Xinhui
hi, all
I think you have already known how to reproduce it. Let me still share 
the debug patch to hit this printk issue in several seconds.

+static int auto_printk(void *data)
+{
+   int irq = (int)data;
+   char b[64] = {
+   [0 ... 61] = 'F',
+   '\n',
+   0,
+   };
+
+   int bytes_per_second = 115200/8; //minicom console ratelimit
+   int lines_per_second = bytes_per_second/(sizeof(b)+16);//16 is count of 
extra bytes including log_prefix
+   int lines_with_5sec = lines_per_second * 5;
+   int i = lines_with_5sec;
+
+   if (irq)
+   local_irq_disable();
+   unsigned long j = jiffies;
+   unsigned long k = jiffies;
+   while (i--) {
+   printk("[%1d][%d]%s", irq, i, b);
+   k = jiffies;
+   if (k > j + 5 * HZ) {
+   printk("XINHUI: long time %d,%d\n", irq, 
irqs_disabled());
+   }
+   j = k;
+   }
+   if (irq)
+   local_irq_enable();
+   return 0;
+}
+static int xh_th(const char *val, struct kernel_param *kp)
+{
+   kthread_run(auto_printk, 1, "XH_T1");
+   kthread_run(auto_printk, 0, "XH_T0");
+}
+module_param_call(th, NULL, xh_th, NULL, 0664);
+

thanks
xinhui

On 2015年08月12日 17:37, Pan Xinhui wrote:
> From: Pan Xinhui 
> 
> printk can be called in any context, It's very useful to output debug
> info.
> 
> But it might cause very bad issues on some special cases. For example,
> some driver hit errors, and it dumps many messages like reg values, etc.
> 
> Sometimes, printk is called when irqs disabled. This is OKay if there is
> a few messages. But What would happen if many messages outputed by other
> drivers at same time.
> 
> Here is the scenario.
> CPUACPUB
> //driver dumps debug info local_irq_save(flags);
>   printk()
>  -> console_unlock()
> printk()for(;;) {
>  -> log_store() //log_next_seq++
> 
> //many printk() below   if (console_seq == log_next_seq)
>  break;
>   //did not break in this case
> //no printk() now   }
>   local_irq_restore(flags);
> 
> printk runs on CPUA just store the messages in the buf and return.
> printk runs on CPUB(who owns the console_sem lock) would take the duty
> to flush all messages to consoles. It would take a long time to flush
> messages out, IOW, irq would be disabled for a long time. Such case is
> too bad.  We hit many interrupt related panics, for example, cpu did not
> respond to IPI.
> 
> Here is the soultion, if we detect such case above, try to rebalance it.
> Let CPUA take the duty to flush messages to consoles.
> 
> Introduce five states, START, REQUEST, REPLY, ACK, CANCEL.
> Introduce five actions, request, reply, cancel, ack, start.
> Only two kinds of state transition are allowed. They are
> 
> 1)request replyackstart
>  -> START --> REQUEST -> REPLY -> ACK ->
> |   |
>  <--
> 
> 2)request cancelstart
>  -> START --> REQUEST -> CANCEL ->
> | |
>  <
> 
> To implement the five actions, introduce rebalance_try_request(),
> rebalance_try_reply(), rebalance_try_cancel(), rebalance_try_ack(). As
> action start is always after ack and cancel. rebalance_try_cancel/ack
> would do start as well for simplicity.
> 
> State transition must be performed with logbuf_lock held.
> 
> When we have done the state transition, CPUB who owns the console_sem
> will up rebalance_sem as well. CPUA who is going to take duty fo flush
> messages will down_trylock rebalance_sem. The seqence to up/down two
> sems must be like below, otherwise we might has risks that no one owns
> console_sem to do flush job.
> 
> CPUA  CPUB
> down_trylock(_sem)  up(_sem)
>   up(_sem);
> down_trylock(_sem);
> 
> Signed-off-by: Pan Xinhui 
> ---
> change from v1:
>   rewrite the patch.
> ---
>  kernel/printk/printk.c | 100 
> -
>  1 file changed, 99 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index cf8c242..1ffbd32 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -1652,6 +1652,69 @@ static size_t cont_print_text(char *text, size_t size)
>   return textlen;
>  }
>  
> +#define START1
> +#define CANCEL   2
> +#define ACK  4
> +#define REQUEST  8
> +#define REPLY16
> +
> +static struct semaphore rebalance_sem =
> +  

[PATCH V2] printk: rebalance printk

2015-08-12 Thread Pan Xinhui
From: Pan Xinhui 

printk can be called in any context, It's very useful to output debug
info.

But it might cause very bad issues on some special cases. For example,
some driver hit errors, and it dumps many messages like reg values, etc.

Sometimes, printk is called when irqs disabled. This is OKay if there is
a few messages. But What would happen if many messages outputed by other
drivers at same time.

Here is the scenario.
CPUACPUB
//driver dumps debug info   local_irq_save(flags);
printk()
 -> console_unlock()
printk()for(;;) {
 -> log_store() //log_next_seq++

//many printk() below   if (console_seq == log_next_seq)
 break;
  //did not break in this case
//no printk() now   }
local_irq_restore(flags);

printk runs on CPUA just store the messages in the buf and return.
printk runs on CPUB(who owns the console_sem lock) would take the duty
to flush all messages to consoles. It would take a long time to flush
messages out, IOW, irq would be disabled for a long time. Such case is
too bad.  We hit many interrupt related panics, for example, cpu did not
respond to IPI.

Here is the soultion, if we detect such case above, try to rebalance it.
Let CPUA take the duty to flush messages to consoles.

Introduce five states, START, REQUEST, REPLY, ACK, CANCEL.
Introduce five actions, request, reply, cancel, ack, start.
Only two kinds of state transition are allowed. They are

1)request replyackstart
 -> START --> REQUEST -> REPLY -> ACK ->
|   |
 <--

2)request cancelstart
 -> START --> REQUEST -> CANCEL ->
| |
 <

To implement the five actions, introduce rebalance_try_request(),
rebalance_try_reply(), rebalance_try_cancel(), rebalance_try_ack(). As
action start is always after ack and cancel. rebalance_try_cancel/ack
would do start as well for simplicity.

State transition must be performed with logbuf_lock held.

When we have done the state transition, CPUB who owns the console_sem
will up rebalance_sem as well. CPUA who is going to take duty fo flush
messages will down_trylock rebalance_sem. The seqence to up/down two
sems must be like below, otherwise we might has risks that no one owns
console_sem to do flush job.

CPUA  CPUB
down_trylock(_sem)  up(_sem)
  up(_sem);
down_trylock(_sem);

Signed-off-by: Pan Xinhui 
---
change from v1:
rewrite the patch.
---
 kernel/printk/printk.c | 100 -
 1 file changed, 99 insertions(+), 1 deletion(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index cf8c242..1ffbd32 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1652,6 +1652,69 @@ static size_t cont_print_text(char *text, size_t size)
return textlen;
 }
 
+#define START  1
+#define CANCEL 2
+#define ACK4
+#define REQUEST8
+#define REPLY  16
+
+static struct semaphore rebalance_sem =
+__SEMAPHORE_INITIALIZER(rebalance_sem, 0);
+
+static inline bool accept(int *v, int s, int *p)
+{
+   if (*v & s) {
+   *p = *v & s;
+   *v &= ~*p;
+   return true;
+   }
+   return 0;
+}
+
+/* protected by logbuf_lock*/
+static bool rebalance(int v)
+{
+   static int cur = START;
+   int pos = cur;
+   bool ret = 0;
+
+   do {
+   switch (pos) {
+   case START:
+   ret = accept(, REQUEST, );
+   break;
+   case CANCEL:
+   ret = accept(, START, );
+   break;
+   case ACK:
+   ret = accept(, START, );
+   break;
+   case REQUEST:
+   ret = accept(, CANCEL | REPLY, );
+   break;
+   case REPLY:
+   ret = accept(, ACK, );
+   break;
+   default:
+   ret = 0;
+   break;
+   }
+   } while (ret && v);
+
+   if (ret)
+   cur = pos;
+   return ret;
+}
+
+#define rebalance_try_request()\
+   rebalance(REQUEST)
+#define rebalance_try_reply()  \
+   rebalance(REPLY)
+#define rebalance_try_cancel() \
+   rebalance(CANCEL)
+#define rebalance_try_ack()\
+   rebalance(ACK | START)
+
 asmlinkage int vprintk_emit(int facility, int level,
const char *dict, 

[PATCH V2] printk: rebalance printk

2015-08-12 Thread Pan Xinhui
From: Pan Xinhui xinhuix@intel.com

printk can be called in any context, It's very useful to output debug
info.

But it might cause very bad issues on some special cases. For example,
some driver hit errors, and it dumps many messages like reg values, etc.

Sometimes, printk is called when irqs disabled. This is OKay if there is
a few messages. But What would happen if many messages outputed by other
drivers at same time.

Here is the scenario.
CPUACPUB
//driver dumps debug info   local_irq_save(flags);
printk()
 - console_unlock()
printk()for(;;) {
 - log_store() //log_next_seq++

//many printk() below   if (console_seq == log_next_seq)
 break;
  //did not break in this case
//no printk() now   }
local_irq_restore(flags);

printk runs on CPUA just store the messages in the buf and return.
printk runs on CPUB(who owns the console_sem lock) would take the duty
to flush all messages to consoles. It would take a long time to flush
messages out, IOW, irq would be disabled for a long time. Such case is
too bad.  We hit many interrupt related panics, for example, cpu did not
respond to IPI.

Here is the soultion, if we detect such case above, try to rebalance it.
Let CPUA take the duty to flush messages to consoles.

Introduce five states, START, REQUEST, REPLY, ACK, CANCEL.
Introduce five actions, request, reply, cancel, ack, start.
Only two kinds of state transition are allowed. They are

1)request replyackstart
 - START -- REQUEST - REPLY - ACK -
|   |
 --

2)request cancelstart
 - START -- REQUEST - CANCEL -
| |
 

To implement the five actions, introduce rebalance_try_request(),
rebalance_try_reply(), rebalance_try_cancel(), rebalance_try_ack(). As
action start is always after ack and cancel. rebalance_try_cancel/ack
would do start as well for simplicity.

State transition must be performed with logbuf_lock held.

When we have done the state transition, CPUB who owns the console_sem
will up rebalance_sem as well. CPUA who is going to take duty fo flush
messages will down_trylock rebalance_sem. The seqence to up/down two
sems must be like below, otherwise we might has risks that no one owns
console_sem to do flush job.

CPUA  CPUB
down_trylock(rebalance_sem)  up(console_sem)
  up(rebalance_sem);
down_trylock(console_sem);

Signed-off-by: Pan Xinhui xinhuix@intel.com
---
change from v1:
rewrite the patch.
---
 kernel/printk/printk.c | 100 -
 1 file changed, 99 insertions(+), 1 deletion(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index cf8c242..1ffbd32 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1652,6 +1652,69 @@ static size_t cont_print_text(char *text, size_t size)
return textlen;
 }
 
+#define START  1
+#define CANCEL 2
+#define ACK4
+#define REQUEST8
+#define REPLY  16
+
+static struct semaphore rebalance_sem =
+__SEMAPHORE_INITIALIZER(rebalance_sem, 0);
+
+static inline bool accept(int *v, int s, int *p)
+{
+   if (*v  s) {
+   *p = *v  s;
+   *v = ~*p;
+   return true;
+   }
+   return 0;
+}
+
+/* protected by logbuf_lock*/
+static bool rebalance(int v)
+{
+   static int cur = START;
+   int pos = cur;
+   bool ret = 0;
+
+   do {
+   switch (pos) {
+   case START:
+   ret = accept(v, REQUEST, pos);
+   break;
+   case CANCEL:
+   ret = accept(v, START, pos);
+   break;
+   case ACK:
+   ret = accept(v, START, pos);
+   break;
+   case REQUEST:
+   ret = accept(v, CANCEL | REPLY, pos);
+   break;
+   case REPLY:
+   ret = accept(v, ACK, pos);
+   break;
+   default:
+   ret = 0;
+   break;
+   }
+   } while (ret  v);
+
+   if (ret)
+   cur = pos;
+   return ret;
+}
+
+#define rebalance_try_request()\
+   rebalance(REQUEST)
+#define rebalance_try_reply()  \
+   rebalance(REPLY)
+#define rebalance_try_cancel() \
+   rebalance(CANCEL)
+#define rebalance_try_ack()\
+   rebalance(ACK | START)
+
 asmlinkage int 

Re: [PATCH V2] printk: rebalance printk

2015-08-12 Thread Pan Xinhui
hi, all
I think you have already known how to reproduce it. Let me still share 
the debug patch to hit this printk issue in several seconds.

+static int auto_printk(void *data)
+{
+   int irq = (int)data;
+   char b[64] = {
+   [0 ... 61] = 'F',
+   '\n',
+   0,
+   };
+
+   int bytes_per_second = 115200/8; //minicom console ratelimit
+   int lines_per_second = bytes_per_second/(sizeof(b)+16);//16 is count of 
extra bytes including log_prefix
+   int lines_with_5sec = lines_per_second * 5;
+   int i = lines_with_5sec;
+
+   if (irq)
+   local_irq_disable();
+   unsigned long j = jiffies;
+   unsigned long k = jiffies;
+   while (i--) {
+   printk([%1d][%d]%s, irq, i, b);
+   k = jiffies;
+   if (k  j + 5 * HZ) {
+   printk(XINHUI: long time %d,%d\n, irq, 
irqs_disabled());
+   }
+   j = k;
+   }
+   if (irq)
+   local_irq_enable();
+   return 0;
+}
+static int xh_th(const char *val, struct kernel_param *kp)
+{
+   kthread_run(auto_printk, 1, XH_T1);
+   kthread_run(auto_printk, 0, XH_T0);
+}
+module_param_call(th, NULL, xh_th, NULL, 0664);
+

thanks
xinhui

On 2015年08月12日 17:37, Pan Xinhui wrote:
 From: Pan Xinhui xinhuix@intel.com
 
 printk can be called in any context, It's very useful to output debug
 info.
 
 But it might cause very bad issues on some special cases. For example,
 some driver hit errors, and it dumps many messages like reg values, etc.
 
 Sometimes, printk is called when irqs disabled. This is OKay if there is
 a few messages. But What would happen if many messages outputed by other
 drivers at same time.
 
 Here is the scenario.
 CPUACPUB
 //driver dumps debug info local_irq_save(flags);
   printk()
  - console_unlock()
 printk()for(;;) {
  - log_store() //log_next_seq++
 
 //many printk() below   if (console_seq == log_next_seq)
  break;
   //did not break in this case
 //no printk() now   }
   local_irq_restore(flags);
 
 printk runs on CPUA just store the messages in the buf and return.
 printk runs on CPUB(who owns the console_sem lock) would take the duty
 to flush all messages to consoles. It would take a long time to flush
 messages out, IOW, irq would be disabled for a long time. Such case is
 too bad.  We hit many interrupt related panics, for example, cpu did not
 respond to IPI.
 
 Here is the soultion, if we detect such case above, try to rebalance it.
 Let CPUA take the duty to flush messages to consoles.
 
 Introduce five states, START, REQUEST, REPLY, ACK, CANCEL.
 Introduce five actions, request, reply, cancel, ack, start.
 Only two kinds of state transition are allowed. They are
 
 1)request replyackstart
  - START -- REQUEST - REPLY - ACK -
 |   |
  --
 
 2)request cancelstart
  - START -- REQUEST - CANCEL -
 | |
  
 
 To implement the five actions, introduce rebalance_try_request(),
 rebalance_try_reply(), rebalance_try_cancel(), rebalance_try_ack(). As
 action start is always after ack and cancel. rebalance_try_cancel/ack
 would do start as well for simplicity.
 
 State transition must be performed with logbuf_lock held.
 
 When we have done the state transition, CPUB who owns the console_sem
 will up rebalance_sem as well. CPUA who is going to take duty fo flush
 messages will down_trylock rebalance_sem. The seqence to up/down two
 sems must be like below, otherwise we might has risks that no one owns
 console_sem to do flush job.
 
 CPUA  CPUB
 down_trylock(rebalance_sem)  up(console_sem)
   up(rebalance_sem);
 down_trylock(console_sem);
 
 Signed-off-by: Pan Xinhui xinhuix@intel.com
 ---
 change from v1:
   rewrite the patch.
 ---
  kernel/printk/printk.c | 100 
 -
  1 file changed, 99 insertions(+), 1 deletion(-)
 
 diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
 index cf8c242..1ffbd32 100644
 --- a/kernel/printk/printk.c
 +++ b/kernel/printk/printk.c
 @@ -1652,6 +1652,69 @@ static size_t cont_print_text(char *text, size_t size)
   return textlen;
  }
  
 +#define START1
 +#define CANCEL   2
 +#define ACK  4
 +#define REQUEST  8
 +#define REPLY16
 +
 +static struct semaphore rebalance_sem =
 +  __SEMAPHORE_INITIALIZER(rebalance_sem, 0);
 +
 +static inline bool accept(int