[PATCH] Msleep_interruptible() on a dual processor system may wait a long time.

2013-04-29 Thread David VomLehn
Msleep_interruptible() on a dual processor system may wait a long time.

On some reboots, calling msleep_interruptible() from CPU 1 on a dual
processor system will not return for seconds or even minutes. This happens
because ksoftirqd/1 migrates to CPU 0, which is allowed because its
cpus_allowed mask is 0x3. Since ksoftirqd daemons only process the timer queue
for their current CPU, no timer_list entries will be processed on CPU 1 until
the ksoftirqd/1 migrates back to that CPU, which depends on system load and
may take an arbitrary amount of time. The task associated with the
msleep_interruptible() call may thus hang quite a while.

The root cause appears to be to a race condition between select_fallback_rq(),
which selects a runqueue for a task, and set_cpu_active(), which sets the
corresponding bit in cpu_active_mask for a newly active CPU. When ksoftirqd/1
is run for the first time, its cpus_allowed mask is set to 0x2, i.e. it is
restricted to CPU 1. The function select_task_rq() will be called, which calls
select_task_rq_fair(). This will return a 0 for the CPU on which to run the
task. When select_task_rq() finds the task is not allowed to run on CPU 0,
it calls select_fallback_rq() to choose a new CPU. There are two cases:

o   If set_cpu_active() ran for CPU 1 before select_fallback_rq(), the
corresponding bit in cpu_active_mask will be set, allowing ksoftirqd/1
to run on that CPU.
o   If the order of calls was reversed, select_fallback_rq() will call
cpuset_cpus_allowed_fallback(), which will replace the task's
cpus_allowed_mask with cpu_possible_mask, allowing ksoftirqd/1 to
run on any CPU. It will also choose any CPU from the active CPUs. 

In the second case, ksoftirqd/1 will be able to roam freely across the
system's CPUs, neglecting its responsibility to the timer queue.

Signed-off-by: David VomLehn 
---
 include/linux/cpu.h |4 
 include/linux/smp.h |3 +++
 kernel/cpu.c|5 +
 kernel/sched.c  |4 
 kernel/smp.c|   22 --
 5 files changed, 32 insertions(+), 6 deletions(-)

diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index c692acc..9679dfe 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -138,6 +138,7 @@ int cpu_up(unsigned int cpu);
 void notify_cpu_starting(unsigned int cpu);
 extern void cpu_maps_update_begin(void);
 extern void cpu_maps_update_done(void);
+extern int cpu_notify(unsigned long val, void *v);
 
 #else  /* CONFIG_SMP */
 
@@ -160,6 +161,9 @@ static inline void cpu_maps_update_done(void)
 {
 }
 
+static inline int cpu_notify(unsigned long val, void *v)
+{
+}
 #endif /* CONFIG_SMP */
 extern struct sysdev_class cpu_sysdev_class;
 
diff --git a/include/linux/smp.h b/include/linux/smp.h
index 8cc38d3..1065d73 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -37,6 +37,9 @@ int smp_call_function_single(int cpuid, smp_call_func_t func, 
void *info,
 #include 
 #include 
 
+/* Number of CPUs we kicked that we are waiting to become active */
+extern atomic_t active_cpu_pending;
+
 /*
  * main cross-CPU interfaces, handles INIT, TLB flush, STOP, etc.
  * (defined in asm header):
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 563f136..4a11f33 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -150,7 +150,7 @@ static int __cpu_notify(unsigned long val, void *v, int 
nr_to_call,
return notifier_to_errno(ret);
 }
 
-static int cpu_notify(unsigned long val, void *v)
+int cpu_notify(unsigned long val, void *v)
 {
return __cpu_notify(val, v, -1, NULL);
 }
@@ -315,9 +315,6 @@ static int __cpuinit _cpu_up(unsigned int cpu, int 
tasks_frozen)
goto out_notify;
BUG_ON(!cpu_online(cpu));
 
-   /* Now call notifier in preparation. */
-   cpu_notify(CPU_ONLINE | mod, hcpu);
-
 out_notify:
if (ret != 0)
__cpu_notify(CPU_UP_CANCELED | mod, hcpu, nr_calls, NULL);
diff --git a/kernel/sched.c b/kernel/sched.c
index fcc893f..907d166 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2539,6 +2539,10 @@ static int select_fallback_rq(int cpu, struct 
task_struct *p)
int dest_cpu;
const struct cpumask *nodemask = cpumask_of_node(cpu_to_node(cpu));
 
+   /* Loop because preempt we may be disabled or in atomic context */
+   while (atomic_read(&active_cpu_pending) != 0)
+   ;
+
/* Look for allowed, online CPU in same node. */
for_each_cpu_and(dest_cpu, nodemask, cpu_active_mask)
if (cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p)))
diff --git a/kernel/smp.c b/kernel/smp.c
index db197d6..d9635af 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -42,6 +42,17 @@ struct call_single_queue {
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_single_queue, 
call_single_queue);
 
+/*
+ * Declarations to support waiting until the cpu_up() functions are all
+ * called before trying to wake up associated softir

Re: Coding Style: Reverse XMAS tree declarations ? (was Re: [PATCH net-next v6 02/10] dpaa_eth: add support for DPAA Ethernet)

2016-11-04 Thread David VomLehn
On Fri, Nov 04, 2016 at 10:05:15AM -0700, Randy Dunlap wrote:
> On 11/03/16 23:53, Joe Perches wrote:
> > On Thu, 2016-11-03 at 15:58 -0400, David Miller wrote:
> >> From: Madalin Bucur 
> >> Date: Wed, 2 Nov 2016 22:17:26 +0200
> >>
> >>> This introduces the Freescale Data Path Acceleration Architecture
> >>> +static inline size_t bpool_buffer_raw_size(u8 index, u8 cnt)
> >>> +{
> >>> + u8 i;
> >>> + size_t res = DPAA_BP_RAW_SIZE / 2;
> >>
> >> Always order local variable declarations from longest to shortest line,
> >> also know as Reverse Christmas Tree Format.
> > 
> > I think this declaration sorting order is misguided but
> > here's a possible change to checkpatch adding a test for it
> > that does this test just for net/ and drivers/net/
> 
> I agree with the misguided part.
> That's not actually in CodingStyle AFAICT. Where did this come from?
> 
> 
> thanks.
> -- 
> ~Randy

This puzzles me. The CodingStyle gives some pretty reasonable rationales
for coding style over above the "it's easier to read if it all looks the
same". I can see rationales for other approaches (and I am not proposing
any of these):

alphabetic orderEasier to search for declarations
complex to simple   As in, structs and unions, pointers to simple
data (int, char), simple data. It seems like I
can deduce the simple types from usage, but more
complex I need to know things like the
particular structure.
group by usage  Mirror the ontological locality in the code

Do we have a basis for thinking this is easier or more consistent than
any other approach?
-- 
David VL