Re: [PATCH RFC nohz_full 7/7] nohz_full: Force RCU's grace-period kthreads onto timekeeping CPU

2013-07-28 Thread Lai Jiangshan
On 07/27/2013 07:19 AM, Paul E. McKenney wrote:
> From: "Paul E. McKenney" 
> 
> Because RCU's quiescent-state-forcing mechanism is used to drive the
> full-system-idle state machine, and because this mechanism is executed
> by RCU's grace-period kthreads, this commit forces these kthreads to
> run on the timekeeping CPU (tick_do_timer_cpu).  To do otherwise would
> mean that the RCU grace-period kthreads would force the system into
> non-idle state every time they drove the state machine, which would
> be just a bit on the futile side.
> 
> Signed-off-by: Paul E. McKenney 
> Cc: Frederic Weisbecker 
> Cc: Steven Rostedt 
> ---
>  kernel/rcutree.c|  1 +
>  kernel/rcutree.h|  1 +
>  kernel/rcutree_plugin.h | 20 +++-
>  3 files changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index aa6d96e..fe83085 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -1286,6 +1286,7 @@ static int rcu_gp_init(struct rcu_state *rsp)
>   struct rcu_data *rdp;
>   struct rcu_node *rnp = rcu_get_root(rsp);
>  
> + rcu_bind_gp_kthread();
>   raw_spin_lock_irq(&rnp->lock);
>   rsp->gp_flags = 0; /* Clear all flags: New grace period. */

So the gp kthread is bound when handling RCU_GP_FLAG_INIT ...

>  
> diff --git a/kernel/rcutree.h b/kernel/rcutree.h
> index e0de5dc..49dac99 100644
> --- a/kernel/rcutree.h
> +++ b/kernel/rcutree.h
> @@ -560,6 +560,7 @@ static void rcu_sysidle_check_cpu(struct rcu_data *rdp, 
> bool *isidle,
>  static bool is_sysidle_rcu_state(struct rcu_state *rsp);
>  static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
> unsigned long maxj);
> +static void rcu_bind_gp_kthread(void);
>  static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp);
>  
>  #endif /* #ifndef RCU_TREE_NONCORE */
> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> index ff84bed..f65d9c2 100644
> --- a/kernel/rcutree_plugin.h
> +++ b/kernel/rcutree_plugin.h
> @@ -2544,7 +2544,7 @@ static void rcu_sysidle_check_cpu(struct rcu_data *rdp, 
> bool *isidle,
>   if (!*isidle || rdp->rsp != rcu_sysidle_state ||
>   cpu_is_offline(rdp->cpu) || rdp->cpu == tick_do_timer_cpu)
>   return;
> - /* WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu); */
> + WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu);


... but rcu_sysidle_check_cpu() is called when handling RCU_GP_FLAG_FQS.

At that point the gp kthread may not yet be bound to tick_do_timer_cpu,
so this WARN_ON_ONCE() could fire spuriously.

Is there other code that ensures the gp kthread is bound to tick_do_timer_cpu
which I missed?

>  
>   /* Pick up current idle and NMI-nesting counter and check. */
>   cur = atomic_read(&rdtp->dynticks_idle);
> @@ -2570,6 +2570,20 @@ static bool is_sysidle_rcu_state(struct rcu_state *rsp)
>  }
>  
>  /*
> + * Bind the grace-period kthread for the sysidle flavor of RCU to the
> + * timekeeping CPU.
> + */
> +static void rcu_bind_gp_kthread(void)
> +{
> + int cpu = ACCESS_ONCE(tick_do_timer_cpu);
> +
> + if (cpu < 0 || cpu >= nr_cpu_ids)
> + return;
> + if (raw_smp_processor_id() != cpu)
> + set_cpus_allowed_ptr(current, cpumask_of(cpu));
> +}
> +
> +/*
>   * Return a delay in jiffies based on the number of CPUs, rcu_node
>   * leaf fanout, and jiffies tick rate.  The idea is to allow larger
>   * systems more time to transition to full-idle state in order to
> @@ -2767,6 +2781,10 @@ static bool is_sysidle_rcu_state(struct rcu_state *rsp)
>   return false;
>  }
>  
> +static void rcu_bind_gp_kthread(void)
> +{
> +}
> +
>  static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
> unsigned long maxj)
>  {



Re: [PATCH RFC nohz_full 6/7] nohz_full: Add full-system-idle state machine

2013-07-29 Thread Lai Jiangshan
On 07/27/2013 07:19 AM, Paul E. McKenney wrote:
> From: "Paul E. McKenney" 
> 
> This commit adds the state machine that takes the per-CPU idle data
> as input and produces a full-system-idle indication as output.  This
> state machine is driven out of RCU's quiescent-state-forcing
> mechanism, which invokes rcu_sysidle_check_cpu() to collect per-CPU
> idle state and then rcu_sysidle_report() to drive the state machine.
> 
> The full-system-idle state is sampled using rcu_sys_is_idle(), which
> also drives the state machine if RCU is idle (and does so by forcing
> RCU to become non-idle).  This function returns true if all but the
> timekeeping CPU (tick_do_timer_cpu) are idle and have been idle long
> enough to avoid memory contention on the full_sysidle_state state
> variable.  The rcu_sysidle_force_exit() may be called externally
> to reset the state machine back into non-idle state.
> 
> Signed-off-by: Paul E. McKenney 
> Cc: Frederic Weisbecker 
> Cc: Steven Rostedt 
> ---
>  include/linux/rcupdate.h |  18 +++
>  kernel/rcutree.c |  16 ++-
>  kernel/rcutree.h |   5 +
>  kernel/rcutree_plugin.h  | 284 
> ++-
>  4 files changed, 316 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 48f1ef9..1aa8d8c 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -1011,4 +1011,22 @@ static inline bool rcu_is_nocb_cpu(int cpu) { return 
> false; }
>  #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
>  
>  
> +/* Only for use by adaptive-ticks code. */
> +#ifdef CONFIG_NO_HZ_FULL_SYSIDLE
> +extern bool rcu_sys_is_idle(void);
> +extern void rcu_sysidle_force_exit(void);
> +#else /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
> +
> +static inline bool rcu_sys_is_idle(void)
> +{
> + return false;
> +}
> +
> +static inline void rcu_sysidle_force_exit(void)
> +{
> +}
> +
> +#endif /* #else #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
> +
> +
>  #endif /* __LINUX_RCUPDATE_H */
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index 725524e..aa6d96e 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -718,6 +718,7 @@ static int dyntick_save_progress_counter(struct rcu_data 
> *rdp,
>bool *isidle, unsigned long *maxj)
>  {
>   rdp->dynticks_snap = atomic_add_return(0, &rdp->dynticks->dynticks);
> + rcu_sysidle_check_cpu(rdp, isidle, maxj);
>   return (rdp->dynticks_snap & 0x1) == 0;
>  }
>  
> @@ -1356,11 +1357,17 @@ int rcu_gp_fqs(struct rcu_state *rsp, int 
> fqs_state_in)
>   rsp->n_force_qs++;
>   if (fqs_state == RCU_SAVE_DYNTICK) {
>   /* Collect dyntick-idle snapshots. */
> + if (is_sysidle_rcu_state(rsp)) {
> + isidle = 1;

isidle = true;
(the type of isidle is bool)

> + maxj = jiffies - ULONG_MAX / 4;
> + }
>   force_qs_rnp(rsp, dyntick_save_progress_counter,
>&isidle, &maxj);
> + rcu_sysidle_report_gp(rsp, isidle, maxj);
>   fqs_state = RCU_FORCE_QS;
>   } else {
>   /* Handle dyntick-idle and offline CPUs. */
> + isidle = 0;

isidle = false;

>   force_qs_rnp(rsp, rcu_implicit_dynticks_qs, &isidle, &maxj);
>   }
>   /* Clear flag to prevent immediate re-entry. */
> @@ -2087,9 +2094,12 @@ static void force_qs_rnp(struct rcu_state *rsp,
>   cpu = rnp->grplo;
>   bit = 1;
>   for (; cpu <= rnp->grphi; cpu++, bit <<= 1) {
> - if ((rnp->qsmask & bit) != 0 &&
> - f(per_cpu_ptr(rsp->rda, cpu), isidle, maxj))
> - mask |= bit;
> + if ((rnp->qsmask & bit) != 0) {
> + if ((rnp->qsmaskinit & bit) != 0)
> + *isidle = 0;

*isidle = false;

> + if (f(per_cpu_ptr(rsp->rda, cpu), isidle, maxj))
> + mask |= bit;
> + }
>   }
>   if (mask != 0) {
>  
> diff --git a/kernel/rcutree.h b/kernel/rcutree.h
> index 1895043..e0de5dc 100644
> --- a/kernel/rcutree.h
> +++ b/kernel/rcutree.h
> @@ -555,6 +555,11 @@ static void rcu_kick_nohz_cpu(int cpu);
>  static bool init_nocb_callback_list(struct rcu_data *rdp);
>  static void rcu_sysidle_enter(struct rcu_dynticks *rdtp, int irq);
>  static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq);
> +static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
> +   unsigned long *maxj);
> +static bool is_sysidle_rcu_state(struct rcu_state *rsp);
> +static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
> +   unsigned long maxj);
>  static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp);
>  
>  #endif /* #ifndef RCU_TREE_NONCORE 

Re: [PATCH RFC nohz_full 7/7] nohz_full: Force RCU's grace-period kthreads onto timekeeping CPU

2013-07-29 Thread Lai Jiangshan
On 07/30/2013 12:52 AM, Paul E. McKenney wrote:
> On Mon, Jul 29, 2013 at 11:36:05AM +0800, Lai Jiangshan wrote:
>> On 07/27/2013 07:19 AM, Paul E. McKenney wrote:
>>> From: "Paul E. McKenney" 
>>>
>>> Because RCU's quiescent-state-forcing mechanism is used to drive the
>>> full-system-idle state machine, and because this mechanism is executed
>>> by RCU's grace-period kthreads, this commit forces these kthreads to
>>> run on the timekeeping CPU (tick_do_timer_cpu).  To do otherwise would
>>> mean that the RCU grace-period kthreads would force the system into
>>> non-idle state every time they drove the state machine, which would
>>> be just a bit on the futile side.
>>>
>>> Signed-off-by: Paul E. McKenney 
>>> Cc: Frederic Weisbecker 
>>> Cc: Steven Rostedt 
>>> ---
>>>  kernel/rcutree.c|  1 +
>>>  kernel/rcutree.h|  1 +
>>>  kernel/rcutree_plugin.h | 20 +++-
>>>  3 files changed, 21 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
>>> index aa6d96e..fe83085 100644
>>> --- a/kernel/rcutree.c
>>> +++ b/kernel/rcutree.c
>>> @@ -1286,6 +1286,7 @@ static int rcu_gp_init(struct rcu_state *rsp)
>>> struct rcu_data *rdp;
>>> struct rcu_node *rnp = rcu_get_root(rsp);
>>>  
>>> +   rcu_bind_gp_kthread();
>>> raw_spin_lock_irq(&rnp->lock);
>>> rsp->gp_flags = 0; /* Clear all flags: New grace period. */
>>
>> bind the gp thread when RCU_GP_FLAG_INIT ...
>>
>>>  
>>> diff --git a/kernel/rcutree.h b/kernel/rcutree.h
>>> index e0de5dc..49dac99 100644
>>> --- a/kernel/rcutree.h
>>> +++ b/kernel/rcutree.h
>>> @@ -560,6 +560,7 @@ static void rcu_sysidle_check_cpu(struct rcu_data *rdp, 
>>> bool *isidle,
>>>  static bool is_sysidle_rcu_state(struct rcu_state *rsp);
>>>  static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
>>>   unsigned long maxj);
>>> +static void rcu_bind_gp_kthread(void);
>>>  static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp);
>>>  
>>>  #endif /* #ifndef RCU_TREE_NONCORE */
>>> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
>>> index ff84bed..f65d9c2 100644
>>> --- a/kernel/rcutree_plugin.h
>>> +++ b/kernel/rcutree_plugin.h
>>> @@ -2544,7 +2544,7 @@ static void rcu_sysidle_check_cpu(struct rcu_data 
>>> *rdp, bool *isidle,
>>> if (!*isidle || rdp->rsp != rcu_sysidle_state ||
>>> cpu_is_offline(rdp->cpu) || rdp->cpu == tick_do_timer_cpu)
>>> return;
>>> -   /* WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu); */
>>> +   WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu);
>>
>>
>> but call rcu_sysidle_check_cpu() when RCU_GP_FLAG_FQS.
> 
> Yep!  But we don't call rcu_gp_fqs() until the grace period is started,
> by which time the kthread will be bound.  Any setting of RCU_GP_FLAG_FQS
> while there is no grace period in progress is ignored.

tick_do_timer_cpu can change: by the time rcu_gp_fqs() is called,
tick_do_timer_cpu may point to a different CPU.

xxx_thread()
{
        bind itself to tick_do_timer_cpu;
        sleep(); /* tick_do_timer_cpu can change while sleeping */
        use the (now stale) tick_do_timer_cpu;
}
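
For illustration only, one way to narrow this window would be to redo the
binding each time the state machine is driven, e.g. (just a sketch of the
idea, reusing rcu_bind_gp_kthread() from this patch; the placement is my
assumption, not a tested change):

        /* At the top of rcu_gp_fqs(), before collecting dyntick snapshots: */
        rcu_bind_gp_kthread();  /* tick_do_timer_cpu may have moved */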


> 
>> In this time, the thread may not be bound to tick_do_timer_cpu,
>> the WARN_ON_ONCE() may be wrong.
>>
>> Does any other code ensure the gp thread bound on tick_do_timer_cpu
>> which I missed?
> 
> However, on small systems, rcu_sysidle_check_cpu() can be called from
> the timekeeping CPU.  I suppose that this could potentially happen
> before the first grace period starts, and in that case, we could
> potentially see a spurious warning.  I could imagine a number of ways
> to fix this:
> 
> 1.Bind the kthread when it is created.
> 
> 2.Bind the kthread when it first starts running, rather than just
>   after the grace period starts.
> 
> 3.Suppress the warning when there is no grace period in progress.
> 
> 4.Suppress the warning prior to the first grace period starting.
> 
> Seems like #3 is the most straightforward approach.  I just change it to:
> 
>   if (rcu_gp_in_progress(rdp->rsp))
>   WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu);
> 
> This still gets a WARN_ON_ONCE() if someone moves the timekeeping CPU,
> but Frederic tells me that it never moves

Re: [PATCH 1/9] workqueue: mark WQ_NON_REENTRANT deprecated

2013-07-30 Thread Lai Jiangshan
On 07/30/2013 08:40 PM, Tejun Heo wrote:
> dbf2576e37 ("workqueue: make all workqueues non-reentrant") made
> WQ_NON_REENTRANT no-op but the following patches didn't remove the
> flag or update the documentation.  Let's mark the flag deprecated and
> update the documentation accordingly.
> 
> Signed-off-by: Tejun Heo 

Acked-by: Lai Jiangshan 

> ---
>  Documentation/workqueue.txt | 18 ++
>  include/linux/workqueue.h   |  7 ++-
>  2 files changed, 12 insertions(+), 13 deletions(-)
> 
> diff --git a/Documentation/workqueue.txt b/Documentation/workqueue.txt
> index a6ab4b6..67113f6 100644
> --- a/Documentation/workqueue.txt
> +++ b/Documentation/workqueue.txt
> @@ -100,8 +100,8 @@ Subsystems and drivers can create and queue work items 
> through special
>  workqueue API functions as they see fit. They can influence some
>  aspects of the way the work items are executed by setting flags on the
>  workqueue they are putting the work item on. These flags include
> -things like CPU locality, reentrancy, concurrency limits, priority and
> -more.  To get a detailed overview refer to the API description of
> +things like CPU locality, concurrency limits, priority and more.  To
> +get a detailed overview refer to the API description of
>  alloc_workqueue() below.
>  
>  When a work item is queued to a workqueue, the target gcwq and
> @@ -166,16 +166,6 @@ resources, scheduled and executed.
>  
>  @flags:
>  
> -  WQ_NON_REENTRANT
> -
> - By default, a wq guarantees non-reentrance only on the same
> - CPU.  A work item may not be executed concurrently on the same
> - CPU by multiple workers but is allowed to be executed
> - concurrently on multiple CPUs.  This flag makes sure
> - non-reentrance is enforced across all CPUs.  Work items queued
> - to a non-reentrant wq are guaranteed to be executed by at most
> - one worker system-wide at any given time.
> -
>WQ_UNBOUND
>  
>   Work items queued to an unbound wq are served by a special
> @@ -233,6 +223,10 @@ resources, scheduled and executed.
>  
>   This flag is meaningless for unbound wq.
>  
> +Note that the flag WQ_NON_REENTRANT no longer exists as all workqueues
> +are now non-reentrant - any work item is guaranteed to be executed by
> +at most one worker system-wide at any given time.
> +
>  @max_active:
>  
>  @max_active determines the maximum number of execution contexts per
> diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
> index a0ed78a..594521b 100644
> --- a/include/linux/workqueue.h
> +++ b/include/linux/workqueue.h
> @@ -295,7 +295,12 @@ static inline unsigned int work_static(struct 
> work_struct *work) { return 0; }
>   * Documentation/workqueue.txt.
>   */
>  enum {
> - WQ_NON_REENTRANT= 1 << 0, /* guarantee non-reentrance */
> + /*
> +  * All wqs are now non-reentrant making the following flag
> +  * meaningless.  Will be removed.
> +  */
> + WQ_NON_REENTRANT= 1 << 0, /* DEPRECATED */
> +
>   WQ_UNBOUND  = 1 << 1, /* not bound to any cpu */
>   WQ_FREEZABLE= 1 << 2, /* freeze during suspend */
>   WQ_MEM_RECLAIM  = 1 << 3, /* may be used for memory reclaim */



Re: [PATCH tip/core/rcu 8/9] rcu: Simplify _rcu_barrier() processing

2013-08-20 Thread Lai Jiangshan
On 08/20/2013 10:42 AM, Paul E. McKenney wrote:
> From: "Paul E. McKenney" 
> 
> This commit drops an unneeded ACCESS_ONCE() and simplifies an "our work
> is done" check in _rcu_barrier().  This applies feedback from Linus
> (https://lkml.org/lkml/2013/7/26/777) that he gave to similar code
> in an unrelated patch.
> 
> Signed-off-by: Paul E. McKenney 
> Reviewed-by: Josh Triplett 
> ---
>  kernel/rcutree.c | 15 +--
>  1 file changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index c6a064a..612aff1 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -2817,9 +2817,20 @@ static void _rcu_barrier(struct rcu_state *rsp)
>* transition.  The "if" expression below therefore rounds the old
>* value up to the next even number and adds two before comparing.
>*/
> - snap_done = ACCESS_ONCE(rsp->n_barrier_done);
> + snap_done = rsp->n_barrier_done;
>   _rcu_barrier_trace(rsp, "Check", -1, snap_done);
> - if (ULONG_CMP_GE(snap_done, ((snap + 1) & ~0x1) + 2)) {
> +
> + /*
> +  * If the value in snap is odd, we needed to wait for the current
> +  * rcu_barrier() to complete, then wait for the next one, in other
> +  * words, we need the value of snap_done to be three larger than
> +  * the value of snap.  On the other hand, if the value in snap is
> +  * even, we only had to wait for the next rcu_barrier() to complete,
> +  * in other words, we need the value of snap_done to be only two
> +  * greater than the value of snap.  The "(snap + 3) & 0x1" computes

"(snap + 3) & 0x1"
==> "(snap + 3) & ~0x1"

> +  * this for us (thank you, Linus!).
> +  */
> + if (ULONG_CMP_GE(snap_done, (snap + 3) & ~0x1)) {
>   _rcu_barrier_trace(rsp, "EarlyExit", -1, snap_done);
>   smp_mb(); /* caller's subsequent code after above check. */
>   mutex_unlock(&rsp->barrier_mutex);



Re: [PATCH tip/core/rcu 1/9] rcu: Expedite grace periods during suspend/resume

2013-08-20 Thread Lai Jiangshan
On 08/20/2013 10:42 AM, Paul E. McKenney wrote:
> From: Borislav Petkov 
> 
> CONFIG_RCU_FAST_NO_HZ can increase grace-period durations by up to
> a factor of four, which can result in long suspend and resume times.
> Thus, this commit temporarily switches to expedited grace periods when
> suspending the box and return to normal settings when resuming.  Similar
> logic is applied to hibernation.
> 
> Because expedited grace periods are of dubious benefit on very large
> systems, so this commit restricts their automated use during suspend
> and resume to systems of 256 or fewer CPUs.  (Some day a number of
> Linux-kernel facilities, including RCU's expedited grace periods,
> will be more scalable, but I need to see bug reports first.)
> 
> [ paulmck: This also papers over an audio/irq bug, but hopefully that will
>   be fixed soon. ]
> 
> Signed-off-by: Borislav Petkov 
> Signed-off-by: Bjørn Mork 
> Signed-off-by: Paul E. McKenney 
> Reviewed-by: Josh Triplett 
> ---
>  kernel/rcutree.c | 21 +
>  1 file changed, 21 insertions(+)
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index 338f1d1..a7bf517 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -54,6 +54,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "rcutree.h"
>  #include 
> @@ -3032,6 +3033,25 @@ static int rcu_cpu_notify(struct notifier_block *self,
>   return NOTIFY_OK;
>  }
>  
> +static int rcu_pm_notify(struct notifier_block *self,
> +  unsigned long action, void *hcpu)
> +{
> + switch (action) {
> + case PM_HIBERNATION_PREPARE:
> + case PM_SUSPEND_PREPARE:
> + if (nr_cpu_ids <= 256) /* Expediting bad for large systems. */
> + rcu_expedited = 1;
> + break;
> + case PM_POST_HIBERNATION:
> + case PM_POST_SUSPEND:
> + rcu_expedited = 0;

Users can set rcu_expedited via sysfs, but this notifier will overwrite it.
I think we can introduce an rcu_expedited_syfs_saved variable,
so that this line becomes:
-   rcu_expedited = 0;
+   rcu_expedited = rcu_expedited_syfs_saved;


rcu_init() {
...
+   rcu_expedited_syfs_saved = rcu_expedited;
}

static ssize_t rcu_expedited_store(struct kobject *kobj,
   struct kobj_attribute *attr,
   const char *buf, size_t count)
{
if (kstrtoint(buf, 0, &rcu_expedited))
return -EINVAL;

+   rcu_expedited_syfs_saved = rcu_expedited;
return count;
}

> + break;
> + default:
> + break;
> + }
> + return NOTIFY_OK;
> +}
> +
>  /*
>   * Spawn the kthread that handles this RCU flavor's grace periods.
>   */
> @@ -3273,6 +3293,7 @@ void __init rcu_init(void)
>* or the scheduler are operational.
>*/
>   cpu_notifier(rcu_cpu_notify, 0);
> + pm_notifier(rcu_pm_notify, 0);
>   for_each_online_cpu(cpu)
>   rcu_cpu_notify(NULL, CPU_UP_PREPARE, (void *)(long)cpu);
>  }



Re: [PATCH tip/core/rcu 1/5] rcu: Add duplicate-callback tests to rcutorture

2013-08-20 Thread Lai Jiangshan
On 08/20/2013 10:51 AM, Paul E. McKenney wrote:
> From: "Paul E. McKenney" 
> 
> This commit adds a object_debug option to rcutorture to allow the
> debug-object-based checks for duplicate call_rcu() invocations to
> be deterministically tested.
> 
> Signed-off-by: Paul E. McKenney 
> Cc: Mathieu Desnoyers 
> Cc: Sedat Dilek 
> Cc: Davidlohr Bueso 
> Cc: Rik van Riel 
> Cc: Thomas Gleixner 
> Cc: Linus Torvalds 
> Tested-by: Sedat Dilek 
> [ paulmck: Banish mid-function ifdef, more or less per Josh Triplett. ]
> ---
>  kernel/rcutorture.c | 45 +
>  1 file changed, 45 insertions(+)
> 
> diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c
> index 3d936f0f..f5cf2bb 100644
> --- a/kernel/rcutorture.c
> +++ b/kernel/rcutorture.c
> @@ -66,6 +66,7 @@ static int fqs_duration;/* Duration of bursts (us), 0 
> to disable. */
>  static int fqs_holdoff;  /* Hold time within burst (us). */
>  static int fqs_stutter = 3;  /* Wait time between bursts (s). */
>  static int n_barrier_cbs;/* Number of callbacks to test RCU barriers. */
> +static int object_debug; /* Test object-debug double call_rcu()?. */
>  static int onoff_interval;   /* Wait time between CPU hotplugs, 0=disable. */
>  static int onoff_holdoff;/* Seconds after boot before CPU hotplugs. */
>  static int shutdown_secs;/* Shutdown time (s).  <=0 for no shutdown. */
> @@ -100,6 +101,8 @@ module_param(fqs_stutter, int, 0444);
>  MODULE_PARM_DESC(fqs_stutter, "Wait time between fqs bursts (s)");
>  module_param(n_barrier_cbs, int, 0444);
>  MODULE_PARM_DESC(n_barrier_cbs, "# of callbacks/kthreads for barrier 
> testing");
> +module_param(object_debug, int, 0444);
> +MODULE_PARM_DESC(object_debug, "Enable debug-object double call_rcu() 
> testing");
>  module_param(onoff_interval, int, 0444);
>  MODULE_PARM_DESC(onoff_interval, "Time between CPU hotplugs (s), 0=disable");
>  module_param(onoff_holdoff, int, 0444);
> @@ -1934,6 +1937,46 @@ rcu_torture_cleanup(void)
>   rcu_torture_print_module_parms(cur_ops, "End of test: SUCCESS");
>  }
>  
> +#ifdef CONFIG_DEBUG_OBJECTS_RCU_HEAD
> +static void rcu_torture_leak_cb(struct rcu_head *rhp)
> +{
> +}
> +
> +static void rcu_torture_err_cb(struct rcu_head *rhp)
> +{
> + /* This -might- happen due to race conditions, but is unlikely. */
> + pr_alert("rcutorture: duplicated callback was invoked.\n");
> +}
> +#endif /* #ifdef CONFIG_DEBUG_OBJECTS_RCU_HEAD */
> +
> +/*
> + * Verify that double-free causes debug-objects to complain, but only
> + * if CONFIG_DEBUG_OBJECTS_RCU_HEAD=y.  Otherwise, say that the test
> + * cannot be carried out.
> + */
> +static void rcu_test_debug_objects(void)
> +{
> +#ifdef CONFIG_DEBUG_OBJECTS_RCU_HEAD
> + struct rcu_head rh1;
> + struct rcu_head rh2;
> +
> + init_rcu_head_on_stack(&rh1);
> + init_rcu_head_on_stack(&rh2);
> + pr_alert("rcutorture: WARN: Duplicate call_rcu() test starting.\n");
> + local_irq_disable(); /* Make it hard to finish grace period. */

you can use rcu_read_lock() directly.
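
For illustration, the sequence with that change would look roughly like this
(just a sketch of the suggestion, not tested):

        rcu_read_lock();  /* an RCU reader keeps the grace period from completing */
        call_rcu(&rh1, rcu_torture_leak_cb);
        call_rcu(&rh2, rcu_torture_err_cb);
        call_rcu(&rh2, rcu_torture_err_cb); /* duplicate callback. */
        rcu_read_unlock();
        rcu_barrier();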

> + call_rcu(&rh1, rcu_torture_leak_cb); /* start grace period. */
> + call_rcu(&rh2, rcu_torture_err_cb);
> + call_rcu(&rh2, rcu_torture_err_cb); /* duplicate callback. */
> + local_irq_enable();
> + rcu_barrier();
> + pr_alert("rcutorture: WARN: Duplicate call_rcu() test complete.\n");
> + destroy_rcu_head_on_stack(&rh1);
> + destroy_rcu_head_on_stack(&rh2);
> +#else /* #ifdef CONFIG_DEBUG_OBJECTS_RCU_HEAD */
> + pr_alert("rcutorture: !CONFIG_DEBUG_OBJECTS_RCU_HEAD, not testing 
> duplicate call_rcu()\n");
> +#endif /* #else #ifdef CONFIG_DEBUG_OBJECTS_RCU_HEAD */
> +}
> +
>  static int __init
>  rcu_torture_init(void)
>  {
> @@ -2163,6 +2206,8 @@ rcu_torture_init(void)
>   firsterr = retval;
>   goto unwind;
>   }
> + if (object_debug)
> + rcu_test_debug_objects();
>   rcutorture_record_test_transition();
>   mutex_unlock(&fullstop_mutex);
>   return 0;



Re: [PATCH tip/core/rcu 1/5] rcu: Add duplicate-callback tests to rcutorture

2013-08-20 Thread Lai Jiangshan
On 08/21/2013 02:38 AM, Paul E. McKenney wrote:
> On Tue, Aug 20, 2013 at 06:02:39PM +0800, Lai Jiangshan wrote:
>> On 08/20/2013 10:51 AM, Paul E. McKenney wrote:
>>> From: "Paul E. McKenney" 
>>>
>>> This commit adds a object_debug option to rcutorture to allow the
>>> debug-object-based checks for duplicate call_rcu() invocations to
>>> be deterministically tested.
>>>
>>> Signed-off-by: Paul E. McKenney 
>>> Cc: Mathieu Desnoyers 
>>> Cc: Sedat Dilek 
>>> Cc: Davidlohr Bueso 
>>> Cc: Rik van Riel 
>>> Cc: Thomas Gleixner 
>>> Cc: Linus Torvalds 
>>> Tested-by: Sedat Dilek 
>>> [ paulmck: Banish mid-function ifdef, more or less per Josh Triplett. ]
>>> ---
>>>  kernel/rcutorture.c | 45 +
>>>  1 file changed, 45 insertions(+)
>>>
>>> diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c
>>> index 3d936f0f..f5cf2bb 100644
>>> --- a/kernel/rcutorture.c
>>> +++ b/kernel/rcutorture.c
>>> @@ -66,6 +66,7 @@ static int fqs_duration;  /* Duration of bursts (us), 0 
>>> to disable. */
>>>  static int fqs_holdoff;/* Hold time within burst (us). */
>>>  static int fqs_stutter = 3;/* Wait time between bursts (s). */
>>>  static int n_barrier_cbs;  /* Number of callbacks to test RCU barriers. */
>>> +static int object_debug;   /* Test object-debug double call_rcu()?. */
>>>  static int onoff_interval; /* Wait time between CPU hotplugs, 0=disable. */
>>>  static int onoff_holdoff;  /* Seconds after boot before CPU hotplugs. */
>>>  static int shutdown_secs;  /* Shutdown time (s).  <=0 for no shutdown. */
>>> @@ -100,6 +101,8 @@ module_param(fqs_stutter, int, 0444);
>>>  MODULE_PARM_DESC(fqs_stutter, "Wait time between fqs bursts (s)");
>>>  module_param(n_barrier_cbs, int, 0444);
>>>  MODULE_PARM_DESC(n_barrier_cbs, "# of callbacks/kthreads for barrier 
>>> testing");
>>> +module_param(object_debug, int, 0444);
>>> +MODULE_PARM_DESC(object_debug, "Enable debug-object double call_rcu() 
>>> testing");
>>>  module_param(onoff_interval, int, 0444);
>>>  MODULE_PARM_DESC(onoff_interval, "Time between CPU hotplugs (s), 
>>> 0=disable");
>>>  module_param(onoff_holdoff, int, 0444);
>>> @@ -1934,6 +1937,46 @@ rcu_torture_cleanup(void)
>>> rcu_torture_print_module_parms(cur_ops, "End of test: SUCCESS");
>>>  }
>>>  
>>> +#ifdef CONFIG_DEBUG_OBJECTS_RCU_HEAD
>>> +static void rcu_torture_leak_cb(struct rcu_head *rhp)
>>> +{
>>> +}
>>> +
>>> +static void rcu_torture_err_cb(struct rcu_head *rhp)
>>> +{
>>> +   /* This -might- happen due to race conditions, but is unlikely. */
>>> +   pr_alert("rcutorture: duplicated callback was invoked.\n");
>>> +}
>>> +#endif /* #ifdef CONFIG_DEBUG_OBJECTS_RCU_HEAD */
>>> +
>>> +/*
>>> + * Verify that double-free causes debug-objects to complain, but only
>>> + * if CONFIG_DEBUG_OBJECTS_RCU_HEAD=y.  Otherwise, say that the test
>>> + * cannot be carried out.
>>> + */
>>> +static void rcu_test_debug_objects(void)
>>> +{
>>> +#ifdef CONFIG_DEBUG_OBJECTS_RCU_HEAD
>>> +   struct rcu_head rh1;
>>> +   struct rcu_head rh2;
>>> +
>>> +   init_rcu_head_on_stack(&rh1);
>>> +   init_rcu_head_on_stack(&rh2);
>>> +   pr_alert("rcutorture: WARN: Duplicate call_rcu() test starting.\n");
>>> +   local_irq_disable(); /* Make it hard to finish grace period. */
>>
>> you can use rcu_read_lock() directly.
> 
> I could do that as well, but it doesn't do everything that local_irq_disable()
> does.
> 
> Right, which means that my comment is bad.  Fixing both, thank you!
> 
>>> +   call_rcu(&rh1, rcu_torture_leak_cb); /* start grace period. */
> 
> And the one above cannot start a grace period due to irqs being enabled.
> Which is -almost- always OK, but...
> 
>>> +   call_rcu(&rh2, rcu_torture_err_cb);
> 
> And this one should invoke rcu_torture_leak_cb instead of
> rcu_torture_err_cb().  Just results in a confusing error message, but...

I still don't understand why rcu_torture_err_cb() will be called when:

rcu_read_lock();
call_rcu(&rh2, rcu_torture_leak_cb);
call_rcu(&rh2, rcu_torture_err_cb); // rh2 will still be queued here,
// deb

Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site

2013-08-20 Thread Lai Jiangshan
On 08/21/2013 11:17 AM, Paul E. McKenney wrote:
> On Sat, Aug 10, 2013 at 08:07:15AM -0700, Paul E. McKenney wrote:
>> On Sat, Aug 10, 2013 at 11:43:59AM +0800, Lai Jiangshan wrote:
> 
> [ . . . ]
> 
>>> So I have to narrow the range of suspect locks. Two choices:
>>> A) don't call rt_mutex_unlock() from rcu_read_unlock(), only call it
>>>from rcu_preempt_not_context_switch(). we need to rework these
>>>two functions and it will add complexity to RCU, and it also still
>>>adds some probability of deferring.
>>
>> One advantage of bh-disable locks is that enabling bh checks
>> TIF_NEED_RESCHED, so that there is no deferring beyond that
>> needed by bh disable.  The same of course applies to preempt_disable().
>>
>> So one approach is to defer when rcu_read_unlock_special() is entered
>> with either preemption or bh disabled.  Your current set_need_resched()
>> trick would work fine in this case.  Unfortunately, re-enabling interrupts
>> does -not- check TIF_NEED_RESCHED, which is why we have latency problems
>> in that case.  (Hence my earlier question about making self-IPI safe
>> on all arches, which would result in an interrupt as soon as interrupts
>> were re-enabled.)
>>
>> Another possibility is to defer only when preemption or bh are disabled
>> on entry ro rcu_read_unlock_special(), but to retain the current
>> (admittedly ugly) nesting rules for the scheduler locks.
> 
> Would you be willing to do a patch that deferred rt_mutex_unlock() in
> the preempt/bh cases?  This of course does not solve the irq-disable
> case, but it should at least narrow the problem to the scheduler locks.
> 
> Not a big hurry, given the testing required, this is 3.13 or 3.14 material,
> I think.
> 
> If you are busy, no problem, I can do it, just figured you have priority
> if you want it.
> 
>   


I'm writing a special rt_mutex_unlock() for rcu deboost only.
I hope Steven accepts it.

Thanks,
Lai


Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site

2013-08-22 Thread Lai Jiangshan
[PATCH] rcu/rt_mutex: eliminate a kind of deadlock for rcu read site

The current rtmutex lock->wait_lock does not disable softirqs or irqs, so it
can cause an rcu read-site deadlock when an rcu read-side critical section
overlaps with any softirq-context/irq-context lock.

@L is a spinlock taken in softirq or irq context.

CPU1                                    CPU2 (rcu boost)
rcu_read_lock()                         rt_mutex_lock()
                                          raw_spin_lock(lock->wait_lock)
spin_lock_XX(L)
rcu_read_unlock()                       do_softirq()
  rcu_read_unlock_special()
    rt_mutex_unlock()
      raw_spin_lock(lock->wait_lock)      spin_lock_XX(L)  **DEADLOCK**

This patch fixes this kind of deadlock by removing rt_mutex_unlock() from
rcu_read_unlock(); the new rt_mutex_rcu_deboost_unlock() is called instead.
Thus rtmutex's lock->wait_lock is no longer taken from rcu_read_unlock().

This patch does not eliminate all kinds of rcu-read-site deadlock: if @L is a
scheduler lock, it can still deadlock, and we should apply Paul's rule in that
case (avoid overlapping, or use preempt_disable()).

rt_mutex_rcu_deboost_unlock() requires that the @waiter is queued, so we
can't directly call rt_mutex_lock(&mtx) in the rcu_boost thread; we split
rt_mutex_lock(&mtx) into two steps, just like pi-futex.
This results in internal state in the rcu_boost thread and makes the
rcu_boost thread a bit more complicated.
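
For illustration, the shape of the split is roughly the following (sketch
only; the proxy-lock helpers named here are borrowed from the pi-futex path
as an assumption about how the split could be expressed, they are not a
quote of the patch below):

        /* In rcu_boost(), replacing the plain rt_mutex_lock(&mtx): */
        /* Step 1: queue our waiter on &mtx, which priority-boosts t. */
        rt_mutex_start_proxy_lock(&mtx, &rcu_boost_waiter, current, 0);
        t->rcu_boost_waiter = &rcu_boost_waiter;  /* so t can deboost us */

        /* Step 2: block until the boosted task releases &mtx. */
        rt_mutex_finish_proxy_lock(&mtx, NULL, &rcu_boost_waiter, 0);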

Thanks
Lai

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 5cd0f09..8830874 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -102,7 +102,7 @@ extern struct group_info init_groups;
 
 #ifdef CONFIG_RCU_BOOST
 #define INIT_TASK_RCU_BOOST()  \
-   .rcu_boost_mutex = NULL,
+   .rcu_boost_waiter = NULL,
 #else
 #define INIT_TASK_RCU_BOOST()
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e9995eb..1eca99f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1078,7 +1078,7 @@ struct task_struct {
struct rcu_node *rcu_blocked_node;
 #endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
 #ifdef CONFIG_RCU_BOOST
-   struct rt_mutex *rcu_boost_mutex;
+   struct rt_mutex_waiter *rcu_boost_waiter;
 #endif /* #ifdef CONFIG_RCU_BOOST */
 
 #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
@@ -1723,7 +1723,7 @@ static inline void rcu_copy_process(struct task_struct *p)
p->rcu_blocked_node = NULL;
 #endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
 #ifdef CONFIG_RCU_BOOST
-   p->rcu_boost_mutex = NULL;
+   p->rcu_boost_waiter = NULL;
 #endif /* #ifdef CONFIG_RCU_BOOST */
INIT_LIST_HEAD(&p->rcu_node_entry);
 }
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 769e12e..d207ddd 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -33,6 +33,7 @@
 #define RCU_KTHREAD_PRIO 1
 
 #ifdef CONFIG_RCU_BOOST
+#include "rtmutex_common.h"
 #define RCU_BOOST_PRIO CONFIG_RCU_BOOST_PRIO
 #else
 #define RCU_BOOST_PRIO RCU_KTHREAD_PRIO
@@ -340,7 +341,7 @@ void rcu_read_unlock_special(struct task_struct *t)
unsigned long flags;
struct list_head *np;
 #ifdef CONFIG_RCU_BOOST
-   struct rt_mutex *rbmp = NULL;
+   struct rt_mutex_waiter *waiter = NULL;
 #endif /* #ifdef CONFIG_RCU_BOOST */
struct rcu_node *rnp;
int special;
@@ -397,10 +398,10 @@ void rcu_read_unlock_special(struct task_struct *t)
 #ifdef CONFIG_RCU_BOOST
if (&t->rcu_node_entry == rnp->boost_tasks)
rnp->boost_tasks = np;
-   /* Snapshot/clear ->rcu_boost_mutex with rcu_node lock held. */
-   if (t->rcu_boost_mutex) {
-   rbmp = t->rcu_boost_mutex;
-   t->rcu_boost_mutex = NULL;
+   /* Snapshot/clear ->rcu_boost_waiter with rcu_node lock held. */
+   if (t->rcu_boost_waiter) {
+   waiter = t->rcu_boost_waiter;
+   t->rcu_boost_waiter = NULL;
}
 #endif /* #ifdef CONFIG_RCU_BOOST */
 
@@ -426,8 +427,8 @@ void rcu_read_unlock_special(struct task_struct *t)
 
 #ifdef CONFIG_RCU_BOOST
/* Unboost if we were boosted. */
-   if (rbmp)
-   rt_mutex_unlock(rbmp);
+   if (waiter)
+   rt_mutex_rcu_deboost_unlock(t, waiter);
 #endif /* #ifdef CONFIG_RCU_BOOST */
 
/*
@@ -1129,9 +1130,6 @@ void exit_rcu(void)
 #endif /* #else #ifdef CONFIG_TREE_PREEMPT_RCU */
 
 #ifdef CONFIG_RCU_BOOST
-
-#include "rtmutex_common.h"
-
 #ifdef CONFIG_RCU_TRACE
 
 static void rcu_initiate_boost_trace(struct rcu_node *rnp)
@@ -1181,14 +1179,15 @@ static int rcu_boost(struct rcu_node *rnp)
 {
unsigned long flags;
struct rt_mutex mtx;
+   struct rt_mutex_waiter rcu_boost_waiter;
struct task_struct *t;
struct list_head *tb;
+   int ret;
 
if (rnp->exp_tasks == NULL && rnp->boost_tasks == NULL)

[PATCH 8/8] rcu: remove irq work for rsp_wakeup()

2013-08-07 Thread Lai Jiangshan
It is safe to acquire the scheduler lock while holding rnp->lock, since the
rcu read site is now always deadlock-immune (rnp->lock can never be nested
inside a scheduler lock).

This partially reverts commit 016a8d5b.

Signed-off-by: Lai Jiangshan 
---
 kernel/rcutree.c |   17 ++---
 kernel/rcutree.h |1 -
 2 files changed, 2 insertions(+), 16 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index e08abb9..6c91edc 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1524,14 +1524,6 @@ static int __noreturn rcu_gp_kthread(void *arg)
}
 }
 
-static void rsp_wakeup(struct irq_work *work)
-{
-   struct rcu_state *rsp = container_of(work, struct rcu_state, 
wakeup_work);
-
-   /* Wake up rcu_gp_kthread() to start the grace period. */
-   wake_up(&rsp->gp_wq);
-}
-
 /*
  * Start a new RCU grace period if warranted, re-initializing the hierarchy
  * in preparation for detecting the next grace period.  The caller must hold
@@ -1556,12 +1548,8 @@ rcu_start_gp_advanced(struct rcu_state *rsp, struct 
rcu_node *rnp,
}
rsp->gp_flags = RCU_GP_FLAG_INIT;
 
-   /*
-* We can't do wakeups while holding the rnp->lock, as that
-* could cause possible deadlocks with the rq->lock. Deter
-* the wakeup to interrupt context.
-*/
-   irq_work_queue(&rsp->wakeup_work);
+   /* Wake up rcu_gp_kthread() to start the grace period. */
+   wake_up(&rsp->gp_wq);
 }
 
 /*
@@ -3153,7 +3141,6 @@ static void __init rcu_init_one(struct rcu_state *rsp,
 
rsp->rda = rda;
init_waitqueue_head(&rsp->gp_wq);
-   init_irq_work(&rsp->wakeup_work, rsp_wakeup);
rnp = rsp->level[rcu_num_lvls - 1];
for_each_possible_cpu(i) {
while (i > rnp->grphi)
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index a5e9643..5892a43 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -449,7 +449,6 @@ struct rcu_state {
char *name; /* Name of structure. */
char abbr;  /* Abbreviated name. */
struct list_head flavors;   /* List of RCU flavors. */
-   struct irq_work wakeup_work;/* Postponed wakeups */
 };
 
 /* Values for rcu_state structure's gp_flags field. */
-- 
1.7.4.4



[PATCH 3/8] rcu: keep irqs disabled in rcu_read_unlock_special()

2013-08-07 Thread Lai Jiangshan
rcu_read_unlock_special() may enable irqs temporarily before it finishes its
last piece of work. This doesn't introduce anything extremely bad, but it does
add more task-RCU machine states and more complexity, which is bad for review.

Also, if the task is preempted while irqs are enabled,
synchronize_rcu_expedited() will be slowed down, and it can't get help
from rcu_boost.

Signed-off-by: Lai Jiangshan 
---
 kernel/rcutree_plugin.h |   13 -
 1 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 54f7e45..6b23b6f 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -338,7 +338,7 @@ void rcu_read_unlock_special(struct task_struct *t)
int empty;
int empty_exp;
int empty_exp_now;
-   unsigned long flags;
+   unsigned long flags, irq_disabled_flags;
struct list_head *np;
 #ifdef CONFIG_RCU_BOOST
struct rt_mutex *rbmp = NULL;
@@ -351,6 +351,7 @@ void rcu_read_unlock_special(struct task_struct *t)
return;
 
local_irq_save(flags);
+   local_save_flags(irq_disabled_flags);
 
/*
 * If RCU core is waiting for this CPU to exit critical section,
@@ -414,9 +415,12 @@ void rcu_read_unlock_special(struct task_struct *t)
 rnp->grplo,
 rnp->grphi,
 !!rnp->gp_tasks);
-   rcu_report_unblock_qs_rnp(rnp, flags);
+   rcu_report_unblock_qs_rnp(rnp, irq_disabled_flags);
+   /* irqs remain disabled. */
} else {
-   raw_spin_unlock_irqrestore(&rnp->lock, flags);
+   raw_spin_unlock_irqrestore(&rnp->lock,
+  irq_disabled_flags);
+   /* irqs remain disabled. */
}
 
 #ifdef CONFIG_RCU_BOOST
@@ -431,9 +435,8 @@ void rcu_read_unlock_special(struct task_struct *t)
 */
if (!empty_exp && empty_exp_now)
rcu_report_exp_rnp(&rcu_preempt_state, rnp, true);
-   } else {
-   local_irq_restore(flags);
}
+   local_irq_restore(flags);
 }
 
 #ifdef CONFIG_RCU_CPU_STALL_VERBOSE
-- 
1.7.4.4



[PATCH 4/8] rcu: delay task rcu state cleanup in exit_rcu()

2013-08-07 Thread Lai Jiangshan
exit_rcu() tries to clean up the task rcu state when the task is exiting.
It did so by calling __rcu_read_unlock().

Actually, calling rcu_read_unlock_special() is enough. This patch defers it
to the rcu_preempt_note_context_switch() of the next schedule().

This patch prepares for the next patch, which defers rcu_read_unlock_special()
if irqs are disabled when __rcu_read_unlock() is called.
So __rcu_read_unlock() can't do the work here (irqs are disabled here)
once the next patch is applied.

Signed-off-by: Lai Jiangshan 
---
 kernel/rcutree_plugin.h |   11 +++
 1 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 6b23b6f..fc8b36f 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -942,10 +942,13 @@ void exit_rcu(void)
 
if (likely(list_empty(¤t->rcu_node_entry)))
return;
-   t->rcu_read_lock_nesting = 1;
-   barrier();
-   t->rcu_read_unlock_special = RCU_READ_UNLOCK_BLOCKED;
-   __rcu_read_unlock();
+   WARN_ON_ONCE(!(t->rcu_read_unlock_special & RCU_READ_UNLOCK_BLOCKED));
+   /*
+* Task RCU state(rcu_node_entry) of this task will be cleanup by
+* the next rcu_preempt_note_context_switch() of the next schedule()
+* in the do_exit().
+*/
+   t->rcu_read_lock_nesting = INT_MIN;
 }
 
 #else /* #ifdef CONFIG_TREE_PREEMPT_RCU */
-- 
1.7.4.4



[PATCH 7/8] rcu: add # of deferred _special() statistics

2013-08-07 Thread Lai Jiangshan
Signed-off-by: Lai Jiangshan 
---
 kernel/rcutree.h|1 +
 kernel/rcutree_plugin.h |1 +
 kernel/rcutree_trace.c  |1 +
 3 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 4a39d36..a5e9643 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -290,6 +290,7 @@ struct rcu_data {
unsigned long   n_force_qs_snap;
/* did other CPU force QS recently? */
longblimit; /* Upper limit on a processed batch */
+   unsigned long   n_defer_special;/* # of deferred _special() */
 
/* 3) dynticks interface. */
struct rcu_dynticks *dynticks;  /* Shared per-CPU dynticks state. */
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index c9ff9f1..d828eec 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -396,6 +396,7 @@ void rcu_read_unlock_special(struct task_struct *t, bool 
unlock)
 * is still unlikey to be true.
 */
if (unlikely(unlock && irqs_disabled_flags(flags))) {
+   this_cpu_ptr(&rcu_preempt_data)->n_defer_special++;
set_need_resched();
local_irq_restore(flags);
return;
diff --git a/kernel/rcutree_trace.c b/kernel/rcutree_trace.c
index cf6c174..17d8b2c 100644
--- a/kernel/rcutree_trace.c
+++ b/kernel/rcutree_trace.c
@@ -149,6 +149,7 @@ static void print_one_rcu_data(struct seq_file *m, struct 
rcu_data *rdp)
seq_printf(m, " ci=%lu nci=%lu co=%lu ca=%lu\n",
   rdp->n_cbs_invoked, rdp->n_nocbs_invoked,
   rdp->n_cbs_orphaned, rdp->n_cbs_adopted);
+   seq_printf(m, " ds=%lu\n", rdp->n_defer_special);
 }
 
 static int show_rcudata(struct seq_file *m, void *v)
-- 
1.7.4.4



[PATCH 2/8] rcu: remove irq/softirq context check in rcu_read_unlock_special()

2013-08-07 Thread Lai Jiangshan
After commit 10f39bb1, "special & RCU_READ_UNLOCK_BLOCKED" can't be true
in irq or softirq context (because RCU_READ_UNLOCK_BLOCKED can only be set
on preemption).

Signed-off-by: Lai Jiangshan 
---
 kernel/rcutree_plugin.h |6 --
 1 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 8fd947e..54f7e45 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -361,12 +361,6 @@ void rcu_read_unlock_special(struct task_struct *t)
rcu_preempt_qs(smp_processor_id());
}
 
-   /* Hardware IRQ handlers cannot block. */
-   if (in_irq() || in_serving_softirq()) {
-   local_irq_restore(flags);
-   return;
-   }
-
/* Clean up if blocked during RCU read-side critical section. */
if (special & RCU_READ_UNLOCK_BLOCKED) {
t->rcu_read_unlock_special &= ~RCU_READ_UNLOCK_BLOCKED;
-- 
1.7.4.4



[PATCH 0/8] rcu: Ensure rcu read site is deadlock-immunity

2013-08-07 Thread Lai Jiangshan
Although all the articles declare that the rcu read site is deadlock-immune,
this is not true for rcu-preempt: it can deadlock if an rcu read site
overlaps with a scheduler lock.

Commits ec433f0c, 10f39bb1 and 016a8d5b only partially solve it. The rcu read
site is still not deadlock-immune, and the problem described in 016a8d5b
still exists (rcu_read_unlock_special() calls wake_up()).

The problem is fixed in patch 5.

Lai Jiangshan (8):
  rcu: add a warn to rcu_preempt_note_context_switch()
  rcu: rcu_read_unlock_special() can be nested in irq/softirq 10f39bb1
  rcu: keep irqs disabled in rcu_read_unlock_special()
  rcu: delay task rcu state cleanup in exit_rcu()
  rcu: eliminate rcu read site deadlock
  rcu: call rcu_read_unlock_special() in rcu_preempt_check_callbacks()
  rcu: add # of deferred _special() statistics
  rcu: remove irq work for rsp_wakeup()

 include/linux/rcupdate.h |2 +-
 kernel/rcupdate.c|2 +-
 kernel/rcutree.c |   17 +
 kernel/rcutree.h |2 +-
 kernel/rcutree_plugin.h  |   82 ++---
 kernel/rcutree_trace.c   |1 +
 6 files changed, 68 insertions(+), 38 deletions(-)

-- 
1.7.4.4



[PATCH 6/8] rcu: call rcu_read_unlock_special() in rcu_preempt_check_callbacks()

2013-08-07 Thread Lai Jiangshan
if rcu_read_unlock_special() is deferred, we can invoke it earlier
in the schedule-tick.

Signed-off-by: Lai Jiangshan 
---
 kernel/rcutree_plugin.h |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 997b424..c9ff9f1 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -684,8 +684,11 @@ static void rcu_preempt_check_callbacks(int cpu)
 {
struct task_struct *t = current;
 
-   if (t->rcu_read_lock_nesting == 0) {
+   if (t->rcu_read_lock_nesting == 0 ||
+   t->rcu_read_lock_nesting == INT_MIN) {
rcu_preempt_qs(cpu);
+   if (t->rcu_read_unlock_special)
+   rcu_read_unlock_special(t, false);
return;
}
if (t->rcu_read_lock_nesting > 0 &&
-- 
1.7.4.4



[PATCH 5/8] rcu: eliminate deadlock for rcu read site

2013-08-07 Thread Lai Jiangshan
Background)

Although all the articles declare that the rcu read site is deadlock-immune,
this is not true for rcu-preempt: it can deadlock if an rcu read site
overlaps with a scheduler lock.

Commits ec433f0c, 10f39bb1 and 016a8d5b only partially solve it. The rcu read
site is still not deadlock-immune, and the problem described in 016a8d5b
still exists (rcu_read_unlock_special() calls wake_up()).

Aim)

We want to fix the problem for good: we want to keep the rcu read site
deadlock-immune, as the books say.

How)

The problem is solved by the rule "if rcu_read_unlock_special() is called
inside any lock which can be (chained) nested in rcu_read_unlock_special(),
we defer rcu_read_unlock_special()".
These locks include rnp->lock, scheduler locks, perf ctx->lock, locks in
printk()/WARN_ON(), and all locks nested (or chained nested) in them.

The problem is thus reduced to "how to distinguish all these locks (contexts)".
We don't distinguish them individually; we only know that all of them
must be nested inside an irqs-disabled region.

So if rcu_read_unlock_special() is called in irqs-disabled context,
it may be running under one of these suspect locks, and we defer
rcu_read_unlock_special().

The algorithm enlarges the probability of deferring, but the probability
is still very low.

Deferring does add a small overhead, but it offers us:
1) true deadlock immunity for the rcu read site
2) removal of the irq-work overhead (250 times per second on average)
Signed-off-by: Lai Jiangshan 
---
 include/linux/rcupdate.h |2 +-
 kernel/rcupdate.c|2 +-
 kernel/rcutree_plugin.h  |   47 +
 3 files changed, 44 insertions(+), 7 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 4b14bdc..00b4220 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -180,7 +180,7 @@ extern void synchronize_sched(void);
 
 extern void __rcu_read_lock(void);
 extern void __rcu_read_unlock(void);
-extern void rcu_read_unlock_special(struct task_struct *t);
+extern void rcu_read_unlock_special(struct task_struct *t, bool unlock);
 void synchronize_rcu(void);
 
 /*
diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
index cce6ba8..33b89a3 100644
--- a/kernel/rcupdate.c
+++ b/kernel/rcupdate.c
@@ -90,7 +90,7 @@ void __rcu_read_unlock(void)
 #endif /* #ifdef CONFIG_PROVE_RCU_DELAY */
barrier();  /* assign before ->rcu_read_unlock_special load */
if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
-   rcu_read_unlock_special(t);
+   rcu_read_unlock_special(t, true);
barrier();  /* ->rcu_read_unlock_special load before assign */
t->rcu_read_lock_nesting = 0;
}
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index fc8b36f..997b424 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -242,15 +242,16 @@ static void rcu_preempt_note_context_switch(int cpu)
   ? rnp->gpnum
   : rnp->gpnum + 1);
raw_spin_unlock_irqrestore(&rnp->lock, flags);
-   } else if (t->rcu_read_lock_nesting < 0 &&
-  !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN) &&
-  t->rcu_read_unlock_special) {
+   } else if (t->rcu_read_lock_nesting == 0 ||
+  (t->rcu_read_lock_nesting < 0 &&
+  !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN))) {
 
/*
 * Complete exit from RCU read-side critical section on
 * behalf of preempted instance of __rcu_read_unlock().
 */
-   rcu_read_unlock_special(t);
+   if (t->rcu_read_unlock_special)
+   rcu_read_unlock_special(t, false);
}
 
/*
@@ -333,7 +334,7 @@ static struct list_head *rcu_next_node_entry(struct 
task_struct *t,
  * notify RCU core processing or task having blocked during the RCU
  * read-side critical section.
  */
-void rcu_read_unlock_special(struct task_struct *t)
+void rcu_read_unlock_special(struct task_struct *t, bool unlock)
 {
int empty;
int empty_exp;
@@ -364,6 +365,42 @@ void rcu_read_unlock_special(struct task_struct *t)
 
/* Clean up if blocked during RCU read-side critical section. */
if (special & RCU_READ_UNLOCK_BLOCKED) {
+   /*
+* If rcu read lock overlaps with scheduler lock,
+* rcu_read_unlock_special() may lead to deadlock:
+*
+* rcu_read_lock();
+* preempt_schedule[_irq]() (when preemption)
+* scheduler lock; (or some other locks can be (chained) nested
+*  in rcu_r

[PATCH 1/8] rcu: add a warn to rcu_preempt_note_context_switch()

2013-08-07 Thread Lai Jiangshan
It is expected that ->rcu_read_lock_nesting == INT_MIN whenever it is < 0.
Add a warning in case something unexpected happens.

Signed-off-by: Lai Jiangshan 
---
 kernel/rcutree_plugin.h |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 63098a5..8fd947e 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -243,6 +243,7 @@ static void rcu_preempt_note_context_switch(int cpu)
   : rnp->gpnum + 1);
raw_spin_unlock_irqrestore(&rnp->lock, flags);
} else if (t->rcu_read_lock_nesting < 0 &&
+  !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN) &&
   t->rcu_read_unlock_special) {
 
/*
-- 
1.7.4.4



Re: [PATCH 0/8] rcu: Ensure rcu read site is deadlock-immunity

2013-08-07 Thread Lai Jiangshan
On 08/08/2013 03:29 AM, Carsten Emde wrote:
> Hi Paul,
> 
>>> Although all articles declare that rcu read site is deadlock-immunity.
>>> It is not true for rcu-preempt, it will be deadlock if rcu read site
>>> overlaps with scheduler lock.
>>
>> The real rule is that if the scheduler does its outermost rcu_read_unlock()
>> with one of those locks held, it has to have avoided enabling preemption
>> through the entire RCU read-side critical section.
>>
>> That said, avoiding the need for this rule would be a good thing.
>>
>> How did you test this?  The rcutorture tests will not exercise this.
>> (Intentionally so, given that it can deadlock!)
>>
>>> ec433f0c, 10f39bb1 and 016a8d5b just partially solve it. But rcu read site
>>> is still not deadlock-immunity. And the problem described in 016a8d5b
>>> is still existed(rcu_read_unlock_special() calls wake_up).
>>>
>>> The problem is fixed in patch5.
>>
>> This is going to require some serious review and testing.  One requirement
>> is that RCU priority boosting not persist significantly beyond the
>> re-enabling of interrupts associated with the irq-disabled lock.  To do
>> otherwise breaks RCU priority boosting.  At first glance, the added
>> set_need_resched() might handle this, but that is part of the review
>> and testing required.
>>
>> Steven, would you and Carsten be willing to try this and see if it
>> helps with the issues you are seeing in -rt?  (My guess is "no", since
>> a deadlock would block forever rather than waking up after a couple
>> thousand seconds, but worth a try.)
> Your guess was correct, applying this patch doesn't heal the 
> NO_HZ_FULL+PREEMPT_RT_FULL 3.10.4 based system; it still is hanging at -> 
> synchronize_rcu -> wait_rcu_gp.
> 
> -Carsten.
> 

I couldn't find the problem you reported; could you give me a URL?

Thanx,
Lai


Re: [PATCH 0/8] rcu: Ensure rcu read site is deadlock-immunity

2013-08-07 Thread Lai Jiangshan
On 08/08/2013 08:36 AM, Paul E. McKenney wrote:
> On Wed, Aug 07, 2013 at 05:38:27AM -0700, Paul E. McKenney wrote:
>> On Wed, Aug 07, 2013 at 06:24:56PM +0800, Lai Jiangshan wrote:
>>> Although all articles declare that rcu read site is deadlock-immunity.
>>> It is not true for rcu-preempt, it will be deadlock if rcu read site
>>> overlaps with scheduler lock.
>>
>> The real rule is that if the scheduler does its outermost rcu_read_unlock()
>> with one of those locks held, it has to have avoided enabling preemption
>> through the entire RCU read-side critical section.
>>
>> That said, avoiding the need for this rule would be a good thing.
>>
>> How did you test this?  The rcutorture tests will not exercise this.
>> (Intentionally so, given that it can deadlock!)
>>
>>> ec433f0c, 10f39bb1 and 016a8d5b just partially solve it. But rcu read site
>>> is still not deadlock-immunity. And the problem described in 016a8d5b
>>> is still existed(rcu_read_unlock_special() calls wake_up).
>>>
>>> The problem is fixed in patch5.
>>
>> This is going to require some serious review and testing.  One requirement
>> is that RCU priority boosting not persist significantly beyond the
>> re-enabling of interrupts associated with the irq-disabled lock.  To do
>> otherwise breaks RCU priority boosting.  At first glance, the added
>> set_need_resched() might handle this, but that is part of the review
>> and testing required.
>>
>> Steven, would you and Carsten be willing to try this and see if it
>> helps with the issues you are seeing in -rt?  (My guess is "no", since
>> a deadlock would block forever rather than waking up after a couple
>> thousand seconds, but worth a try.)
> 
> No joy from either Steven or Carsten on the -rt hangs.
> 
> I pushed this to -rcu and ran tests.  I hit this in one of the
> configurations:
> 
> [  393.641012] =
> [  393.641012] [ INFO: inconsistent lock state ]
> [  393.641012] 3.11.0-rc1+ #1 Not tainted
> [  393.641012] -
> [  393.641012] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
> [  393.641012] rcu_torture_rea/697 [HC1[1]:SC0[0]:HE0:SE1] takes:
> [  393.641012]  (&lock->wait_lock){?.+...}, at: [] 
> rt_mutex_unlock+0x53/0x100
> [  393.641012] {HARDIRQ-ON-W} state was registered at:
> [  393.641012]   [] __lock_acquire+0x651/0x1d40
> [  393.641012]   [] lock_acquire+0x95/0x210
> [  393.641012]   [] _raw_spin_lock+0x36/0x50
> [  393.641012]   [] rt_mutex_slowlock+0x39/0x170
> [  393.641012]   [] rt_mutex_lock+0x2a/0x30
> [  393.641012]   [] rcu_boost_kthread+0x173/0x800
> [  393.641012]   [] kthread+0xd6/0xe0
> [  393.641012]   [] ret_from_fork+0x7c/0xb0
> [  393.641012] irq event stamp: 96581116
> [  393.641012] hardirqs last  enabled at (96581115): [] 
> restore_args+0x0/0x30
> [  393.641012] hardirqs last disabled at (96581116): [] 
> apic_timer_interrupt+0x6a/0x80
> [  393.641012] softirqs last  enabled at (96576304): [] 
> __do_softirq+0x174/0x470
> [  393.641012] softirqs last disabled at (96576275): [] 
> irq_exit+0x96/0xc0
> [  393.641012] 
> [  393.641012] other info that might help us debug this:
> [  393.641012]  Possible unsafe locking scenario:
> [  393.641012] 
> [  393.641012]CPU0
> [  393.641012]
> [  393.641012]   lock(&lock->wait_lock);
> [  393.641012]   
> [  393.641012] lock(&lock->wait_lock);

Patch2 causes it!
When I was finding all the locks which can be (chained) nested in rcu_read_unlock_special(),
I didn't notice that rtmutex's lock->wait_lock is not nested inside irq-disabled regions.

Two ways to fix it (a minimal sketch of option 1 follows below):
1) change rtmutex's lock->wait_lock, making it always irq-disabled.
2) revert my patch2
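
A rough idea of option 1 (hypothetical, not a posted patch): every acquisition of
lock->wait_lock in rtmutex.c would become irq-safe, e.g. in the slowpaths:

	unsigned long flags;

	raw_spin_lock_irqsave(&lock->wait_lock, flags);
	/* ... existing slowpath work ... */
	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);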

> [  393.641012] 
> [  393.641012]  *** DEADLOCK ***
> [  393.641012] 
> [  393.641012] no locks held by rcu_torture_rea/697.
> [  393.641012] 
> [  393.641012] stack backtrace:
> [  393.641012] CPU: 3 PID: 697 Comm: rcu_torture_rea Not tainted 3.11.0-rc1+ 
> #1
> [  393.641012] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
> [  393.641012]  8586fea0 88001fcc3a78 8187b4cb 
> 8104a261
> [  393.641012]  88001e1a20c0 88001fcc3ad8 818773e4 
> 
> [  393.641012]  8800 8801 81010a0a 
> 0001
> [  393.641012] Call Trace:
> [  393.641012][] dump_stack+0x4f/0x84
> [  393.641012]  [] ? console_unlock+0x291/0x410
> [  393.641012]  [] print_usage_bug+0x1f5/0x206
> [  393.641012]  [] ? save_stack_trace+0x2a/0x50
> [  393.641012]  [] mark_lock+0x283/0x2e0
> 

Re: [PATCH 0/8] rcu: Ensure rcu read site is deadlock-immunity

2013-08-07 Thread Lai Jiangshan
On 08/08/2013 10:12 AM, Steven Rostedt wrote:
> On Thu, 2013-08-08 at 09:47 +0800, Lai Jiangshan wrote:
> 
>>> [  393.641012]CPU0
>>> [  393.641012]
>>> [  393.641012]   lock(&lock->wait_lock);
>>> [  393.641012]   
>>> [  393.641012] lock(&lock->wait_lock);
>>
>> Patch2 causes it!
>> When I found all lock which can (chained) nested in 
>> rcu_read_unlock_special(),
>> I didn't notice rtmutex's lock->wait_lock is not nested in irq-disabled.
>>
>> Two ways to fix it:
>> 1) change rtmutex's lock->wait_lock, make it alwasys irq-disabled.
>> 2) revert my patch2
> 
> Your patch 2 states:
> 
> "After patch 10f39bb1, "special & RCU_READ_UNLOCK_BLOCKED" can't be true
> in irq nor softirq.(due to RCU_READ_UNLOCK_BLOCKED can only be set
> when preemption)"

Patch5 adds "special & RCU_READ_UNLOCK_BLOCKED" back in irq and softirq context.
This new case is handled in patch5, if I did not get anything wrong there.
(What I did not notice in patch5 is that rtmutex's lock->wait_lock is not irqs-disabled.)

> 
> But then below we have:
> 
> 
>>
>>> [  393.641012] 
>>> [  393.641012]  *** DEADLOCK ***
>>> [  393.641012] 
>>> [  393.641012] no locks held by rcu_torture_rea/697.
>>> [  393.641012] 
>>> [  393.641012] stack backtrace:
>>> [  393.641012] CPU: 3 PID: 697 Comm: rcu_torture_rea Not tainted 
>>> 3.11.0-rc1+ #1
>>> [  393.641012] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
>>> [  393.641012]  8586fea0 88001fcc3a78 8187b4cb 
>>> 8104a261
>>> [  393.641012]  88001e1a20c0 88001fcc3ad8 818773e4 
>>> 
>>> [  393.641012]  8800 8801 81010a0a 
>>> 0001
>>> [  393.641012] Call Trace:
>>> [  393.641012][] dump_stack+0x4f/0x84
>>> [  393.641012]  [] ? console_unlock+0x291/0x410
>>> [  393.641012]  [] print_usage_bug+0x1f5/0x206
>>> [  393.641012]  [] ? save_stack_trace+0x2a/0x50
>>> [  393.641012]  [] mark_lock+0x283/0x2e0
>>> [  393.641012]  [] ? 
>>> print_irq_inversion_bug.part.40+0x1f0/0x1f0
>>> [  393.641012]  [] __lock_acquire+0x906/0x1d40
>>> [  393.641012]  [] ? __lock_acquire+0x2eb/0x1d40
>>> [  393.641012]  [] ? __lock_acquire+0x2eb/0x1d40
>>> [  393.641012]  [] lock_acquire+0x95/0x210
>>> [  393.641012]  [] ? rt_mutex_unlock+0x53/0x100
>>> [  393.641012]  [] _raw_spin_lock+0x36/0x50
>>> [  393.641012]  [] ? rt_mutex_unlock+0x53/0x100
>>> [  393.641012]  [] rt_mutex_unlock+0x53/0x100
>>> [  393.641012]  [] rcu_read_unlock_special+0x17a/0x2a0
>>> [  393.641012]  [] rcu_check_callbacks+0x313/0x950
>>> [  393.641012]  [] ? hrtimer_run_queues+0x1d/0x180
>>> [  393.641012]  [] ? trace_hardirqs_off+0xd/0x10
>>> [  393.641012]  [] update_process_times+0x43/0x80
>>> [  393.641012]  [] tick_sched_handle.isra.10+0x31/0x40
>>> [  393.641012]  [] tick_sched_timer+0x47/0x70
>>> [  393.641012]  [] __run_hrtimer+0x7c/0x490
>>> [  393.641012]  [] ? ktime_get_update_offsets+0x4d/0xe0
>>> [  393.641012]  [] ? tick_nohz_handler+0xa0/0xa0
>>> [  393.641012]  [] hrtimer_interrupt+0x107/0x260
> 
> The hrtimer_interrupt is calling a rt_mutex_unlock? How did that happen?
> Did it first call a rt_mutex_lock?
> 
> If patch two was the culprit, I'm thinking the idea behind patch two is
> wrong. The only option is to remove patch number two!

Removing patch number two can solve the problem found by Paul, but it is not
the best option, because then I can't declare that the rcu read site is
deadlock-immune (it will still deadlock if the rcu read site overlaps with
rtmutex's lock->wait_lock if I only remove patch2).
I must do more work, but I think it is still better than changing rtmutex's
lock->wait_lock.

Thanks,
Lai

> 
> Or perhaps I missed something.
> 
> -- Steve
> 
> 
>>> [  393.641012]  [] local_apic_timer_interrupt+0x33/0x60
>>> [  393.641012]  [] smp_apic_timer_interrupt+0x3e/0x60
>>> [  393.641012]  [] apic_timer_interrupt+0x6f/0x80
>>> [  393.641012][] ? 
>>> rcu_scheduler_starting+0x60/0x60
>>> [  393.641012]  [] ? __rcu_read_unlock+0x91/0xa0
>>> [  393.641012]  [] rcu_torture_read_unlock+0x33/0x70
>>> [  393.641012]  [] rcu_torture_reader+0xe4/0x450
>>> [  393.641012]  [] ? rcu_torture_reader+0x450/0x450
>>> [  393.641012]  [] ? rcutorture_trace_dump+0x30/0x30
>>> [  393.641012]  [] kthread+0xd6/0xe0
>>> [  393.641012]  [] ? _raw_spin_unlock_irq+0x2b/0x60
>>> [  393.641012]  [] ? flush_kthread_worker+0x130/0x130
>>> [  393.641012]  [] ret_from_fork+0x7c/0xb0
>>> [  393.641012]  [] ? flush_kthread_worker+0x130/0x130
>>>
>>> I don't see this without your patches.
>>>
>>> .config attached.  The other configurations completed without errors.
>>> Short tests, 30 minutes per configuration.
>>>
>>> Thoughts?
>>>
>>> Thanx, Paul
> 
> 
> 



Re: [PATCH 0/8] rcu: Ensure rcu read site is deadlock-immunity

2013-08-07 Thread Lai Jiangshan
On 08/08/2013 10:12 AM, Steven Rostedt wrote:
> On Thu, 2013-08-08 at 09:47 +0800, Lai Jiangshan wrote:
> 
>>> [  393.641012]CPU0
>>> [  393.641012]
>>> [  393.641012]   lock(&lock->wait_lock);
>>> [  393.641012]   
>>> [  393.641012] lock(&lock->wait_lock);
>>
>> Patch2 causes it!
>> When I found all lock which can (chained) nested in 
>> rcu_read_unlock_special(),
>> I didn't notice rtmutex's lock->wait_lock is not nested in irq-disabled.
>>
>> Two ways to fix it:
>> 1) change rtmutex's lock->wait_lock, make it alwasys irq-disabled.
>> 2) revert my patch2
> 
> Your patch 2 states:
> 
> "After patch 10f39bb1, "special & RCU_READ_UNLOCK_BLOCKED" can't be true
> in irq nor softirq.(due to RCU_READ_UNLOCK_BLOCKED can only be set
> when preemption)"
> 
> But then below we have:
> 
> 
>>
>>> [  393.641012] 
>>> [  393.641012]  *** DEADLOCK ***
>>> [  393.641012] 
>>> [  393.641012] no locks held by rcu_torture_rea/697.
>>> [  393.641012] 
>>> [  393.641012] stack backtrace:
>>> [  393.641012] CPU: 3 PID: 697 Comm: rcu_torture_rea Not tainted 
>>> 3.11.0-rc1+ #1
>>> [  393.641012] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
>>> [  393.641012]  8586fea0 88001fcc3a78 8187b4cb 
>>> 8104a261
>>> [  393.641012]  88001e1a20c0 88001fcc3ad8 818773e4 
>>> 
>>> [  393.641012]  8800 8801 81010a0a 
>>> 0001
>>> [  393.641012] Call Trace:
>>> [  393.641012][] dump_stack+0x4f/0x84
>>> [  393.641012]  [] ? console_unlock+0x291/0x410
>>> [  393.641012]  [] print_usage_bug+0x1f5/0x206
>>> [  393.641012]  [] ? save_stack_trace+0x2a/0x50
>>> [  393.641012]  [] mark_lock+0x283/0x2e0
>>> [  393.641012]  [] ? 
>>> print_irq_inversion_bug.part.40+0x1f0/0x1f0
>>> [  393.641012]  [] __lock_acquire+0x906/0x1d40
>>> [  393.641012]  [] ? __lock_acquire+0x2eb/0x1d40
>>> [  393.641012]  [] ? __lock_acquire+0x2eb/0x1d40
>>> [  393.641012]  [] lock_acquire+0x95/0x210
>>> [  393.641012]  [] ? rt_mutex_unlock+0x53/0x100
>>> [  393.641012]  [] _raw_spin_lock+0x36/0x50
>>> [  393.641012]  [] ? rt_mutex_unlock+0x53/0x100
>>> [  393.641012]  [] rt_mutex_unlock+0x53/0x100
>>> [  393.641012]  [] rcu_read_unlock_special+0x17a/0x2a0
>>> [  393.641012]  [] rcu_check_callbacks+0x313/0x950
>>> [  393.641012]  [] ? hrtimer_run_queues+0x1d/0x180
>>> [  393.641012]  [] ? trace_hardirqs_off+0xd/0x10
>>> [  393.641012]  [] update_process_times+0x43/0x80
>>> [  393.641012]  [] tick_sched_handle.isra.10+0x31/0x40
>>> [  393.641012]  [] tick_sched_timer+0x47/0x70
>>> [  393.641012]  [] __run_hrtimer+0x7c/0x490
>>> [  393.641012]  [] ? ktime_get_update_offsets+0x4d/0xe0
>>> [  393.641012]  [] ? tick_nohz_handler+0xa0/0xa0
>>> [  393.641012]  [] hrtimer_interrupt+0x107/0x260
> 
> The hrtimer_interrupt is calling a rt_mutex_unlock? How did that happen?
> Did it first call a rt_mutex_lock?

Sorry, I forgot to answer this question of yours.
The rt_mutex is proxy-locked on behalf of the reader task by the rcu_boost() kthread:

rt_mutex_init_proxy_locked(&mtx, t);
t->rcu_boost_mutex = &mtx;
raw_spin_unlock_irqrestore(&rnp->lock, flags);
rt_mutex_lock(&mtx);  /* Side effect: boosts task t's priority. */
rt_mutex_unlock(&mtx);  /* Keep lockdep happy. */
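
And rcu_read_unlock_special() later releases it on behalf of the boosted reader,
roughly like this (abridged and paraphrased from rcutree_plugin.h, so the details
may differ from the exact source):

	/* Abridged sketch of the unboost in rcu_read_unlock_special(). */
	if (special & RCU_READ_UNLOCK_BLOCKED) {
		/* ... dequeue t from rnp->blkd_tasks, etc. ... */
#ifdef CONFIG_RCU_BOOST
		/* Snapshot ->rcu_boost_mutex while rnp->lock is still held. */
		rbmp = t->rcu_boost_mutex;
		t->rcu_boost_mutex = NULL;
#endif
		/* ... drop rnp->lock, report the quiescent state if needed ... */

		/* Unboost: this is the rt_mutex_unlock() in the splat above;
		 * it takes lock->wait_lock, which is not an irq-safe lock. */
		if (rbmp)
			rt_mutex_unlock(rbmp);
	}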

> 
> If patch two was the culprit, I'm thinking the idea behind patch two is
> wrong. The only option is to remove patch number two!
> 
> Or perhaps I missed something.
> 
> -- Steve
> 
> 
>>> [  393.641012]  [] local_apic_timer_interrupt+0x33/0x60
>>> [  393.641012]  [] smp_apic_timer_interrupt+0x3e/0x60
>>> [  393.641012]  [] apic_timer_interrupt+0x6f/0x80
>>> [  393.641012][] ? 
>>> rcu_scheduler_starting+0x60/0x60
>>> [  393.641012]  [] ? __rcu_read_unlock+0x91/0xa0
>>> [  393.641012]  [] rcu_torture_read_unlock+0x33/0x70
>>> [  393.641012]  [] rcu_torture_reader+0xe4/0x450
>>> [  393.641012]  [] ? rcu_torture_reader+0x450/0x450
>>> [  393.641012]  [] ? rcutorture_trace_dump+0x30/0x30
>>> [  393.641012]  [] kthread+0xd6/0xe0
>>> [  393.641012]  [] ? _raw_spin_unlock_irq+0x2b/0x60
>>> [  393.641012]  [] ? flush_kthread_worker+0x130/0x130
>>> [  393.641012]  [] ret_from_fork+0x7c/0xb0
>>> [ 

Re: [PATCH 0/8] rcu: Ensure rcu read site is deadlock-immunity

2013-08-07 Thread Lai Jiangshan
On 08/08/2013 10:33 AM, Paul E. McKenney wrote:
> On Thu, Aug 08, 2013 at 10:33:15AM +0800, Lai Jiangshan wrote:
>> On 08/08/2013 10:12 AM, Steven Rostedt wrote:
>>> On Thu, 2013-08-08 at 09:47 +0800, Lai Jiangshan wrote:
>>>
>>>>> [  393.641012]CPU0
>>>>> [  393.641012]
>>>>> [  393.641012]   lock(&lock->wait_lock);
>>>>> [  393.641012]   
>>>>> [  393.641012] lock(&lock->wait_lock);
>>>>
>>>> Patch2 causes it!
>>>> When I found all lock which can (chained) nested in 
>>>> rcu_read_unlock_special(),
>>>> I didn't notice rtmutex's lock->wait_lock is not nested in irq-disabled.
>>>>
>>>> Two ways to fix it:
>>>> 1) change rtmutex's lock->wait_lock, make it alwasys irq-disabled.
>>>> 2) revert my patch2
>>>
>>> Your patch 2 states:
>>>
>>> "After patch 10f39bb1, "special & RCU_READ_UNLOCK_BLOCKED" can't be true
>>> in irq nor softirq.(due to RCU_READ_UNLOCK_BLOCKED can only be set
>>> when preemption)"
>>
>> Patch5 adds "special & RCU_READ_UNLOCK_BLOCKED" back in irq nor softirq.
>> This new thing is handle in patch5 if I did not do wrong things in patch5.
>> (I don't notice rtmutex's lock->wait_lock is not irqs-disabled in patch5)
>>
>>>
>>> But then below we have:
>>>
>>>
>>>>
>>>>> [  393.641012] 
>>>>> [  393.641012]  *** DEADLOCK ***
>>>>> [  393.641012] 
>>>>> [  393.641012] no locks held by rcu_torture_rea/697.
>>>>> [  393.641012] 
>>>>> [  393.641012] stack backtrace:
>>>>> [  393.641012] CPU: 3 PID: 697 Comm: rcu_torture_rea Not tainted 
>>>>> 3.11.0-rc1+ #1
>>>>> [  393.641012] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
>>>>> [  393.641012]  8586fea0 88001fcc3a78 8187b4cb 
>>>>> 8104a261
>>>>> [  393.641012]  88001e1a20c0 88001fcc3ad8 818773e4 
>>>>> 
>>>>> [  393.641012]  8800 8801 81010a0a 
>>>>> 0001
>>>>> [  393.641012] Call Trace:
>>>>> [  393.641012][] dump_stack+0x4f/0x84
>>>>> [  393.641012]  [] ? console_unlock+0x291/0x410
>>>>> [  393.641012]  [] print_usage_bug+0x1f5/0x206
>>>>> [  393.641012]  [] ? save_stack_trace+0x2a/0x50
>>>>> [  393.641012]  [] mark_lock+0x283/0x2e0
>>>>> [  393.641012]  [] ? 
>>>>> print_irq_inversion_bug.part.40+0x1f0/0x1f0
>>>>> [  393.641012]  [] __lock_acquire+0x906/0x1d40
>>>>> [  393.641012]  [] ? __lock_acquire+0x2eb/0x1d40
>>>>> [  393.641012]  [] ? __lock_acquire+0x2eb/0x1d40
>>>>> [  393.641012]  [] lock_acquire+0x95/0x210
>>>>> [  393.641012]  [] ? rt_mutex_unlock+0x53/0x100
>>>>> [  393.641012]  [] _raw_spin_lock+0x36/0x50
>>>>> [  393.641012]  [] ? rt_mutex_unlock+0x53/0x100
>>>>> [  393.641012]  [] rt_mutex_unlock+0x53/0x100
>>>>> [  393.641012]  [] rcu_read_unlock_special+0x17a/0x2a0
>>>>> [  393.641012]  [] rcu_check_callbacks+0x313/0x950
>>>>> [  393.641012]  [] ? hrtimer_run_queues+0x1d/0x180
>>>>> [  393.641012]  [] ? trace_hardirqs_off+0xd/0x10
>>>>> [  393.641012]  [] update_process_times+0x43/0x80
>>>>> [  393.641012]  [] tick_sched_handle.isra.10+0x31/0x40
>>>>> [  393.641012]  [] tick_sched_timer+0x47/0x70
>>>>> [  393.641012]  [] __run_hrtimer+0x7c/0x490
>>>>> [  393.641012]  [] ? ktime_get_update_offsets+0x4d/0xe0
>>>>> [  393.641012]  [] ? tick_nohz_handler+0xa0/0xa0
>>>>> [  393.641012]  [] hrtimer_interrupt+0x107/0x260
>>>
>>> The hrtimer_interrupt is calling a rt_mutex_unlock? How did that happen?
>>> Did it first call a rt_mutex_lock?
>>>
>>> If patch two was the culprit, I'm thinking the idea behind patch two is
>>> wrong. The only option is to remove patch number two!
>>
>> removing patch number two can solve the problem found be Paul, but it is not 
>> the best.
>> because I can't declare that rcu is deadlock-immunity
>> (it will be deadlock if rcu read site overlaps with rtmutex's lock->wait_lock
>> if I only remove patch2)
>> I must do more things, but I think it is still better than changing 
>> rtmutex's lock->wait_lock.
> 
> NP, I will remove your current patches and wait for an updated set.

Hi, Paul

Would you agree to moving the rt_mutex_unlock() into
rcu_preempt_note_context_switch()?
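
(For illustration only -- a hypothetical sketch of what I mean, not a posted
patch; the deboost would move out of rcu_read_unlock_special() and into the
context-switch path:)

	/* Hypothetical: unboost from the context-switch path instead. */
	static void rcu_preempt_note_context_switch(int cpu)
	{
		struct task_struct *t = current;
		struct rt_mutex *rbmp;

		/* ... existing handling of the preempted reader ... */

		rbmp = t->rcu_boost_mutex;
		if (rbmp) {
			t->rcu_boost_mutex = NULL;
			rt_mutex_unlock(rbmp);	/* deboost here instead */
		}
	}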

thanks,
Lai

> 
>   Thanx, Paul
> 
> 



Re: [PATCH 0/8] rcu: Ensure rcu read site is deadlock-immunity

2013-08-07 Thread Lai Jiangshan
On 08/08/2013 12:18 PM, Paul E. McKenney wrote:
> On Thu, Aug 08, 2013 at 11:10:47AM +0800, Lai Jiangshan wrote:
>> On 08/08/2013 10:33 AM, Paul E. McKenney wrote:
>>> On Thu, Aug 08, 2013 at 10:33:15AM +0800, Lai Jiangshan wrote:
>>>> On 08/08/2013 10:12 AM, Steven Rostedt wrote:
>>>>> On Thu, 2013-08-08 at 09:47 +0800, Lai Jiangshan wrote:
>>>>>
>>>>>>> [  393.641012]CPU0
>>>>>>> [  393.641012]
>>>>>>> [  393.641012]   lock(&lock->wait_lock);
>>>>>>> [  393.641012]   
>>>>>>> [  393.641012] lock(&lock->wait_lock);
>>>>>>
>>>>>> Patch2 causes it!
>>>>>> When I found all lock which can (chained) nested in 
>>>>>> rcu_read_unlock_special(),
>>>>>> I didn't notice rtmutex's lock->wait_lock is not nested in irq-disabled.
>>>>>>
>>>>>> Two ways to fix it:
>>>>>> 1) change rtmutex's lock->wait_lock, make it alwasys irq-disabled.
>>>>>> 2) revert my patch2
>>>>>
>>>>> Your patch 2 states:
>>>>>
>>>>> "After patch 10f39bb1, "special & RCU_READ_UNLOCK_BLOCKED" can't be true
>>>>> in irq nor softirq.(due to RCU_READ_UNLOCK_BLOCKED can only be set
>>>>> when preemption)"
>>>>
>>>> Patch5 adds "special & RCU_READ_UNLOCK_BLOCKED" back in irq nor softirq.
>>>> This new thing is handle in patch5 if I did not do wrong things in patch5.
>>>> (I don't notice rtmutex's lock->wait_lock is not irqs-disabled in patch5)
>>>>
>>>>>
>>>>> But then below we have:
>>>>>
>>>>>
>>>>>>
>>>>>>> [  393.641012] 
>>>>>>> [  393.641012]  *** DEADLOCK ***
>>>>>>> [  393.641012] 
>>>>>>> [  393.641012] no locks held by rcu_torture_rea/697.
>>>>>>> [  393.641012] 
>>>>>>> [  393.641012] stack backtrace:
>>>>>>> [  393.641012] CPU: 3 PID: 697 Comm: rcu_torture_rea Not tainted 
>>>>>>> 3.11.0-rc1+ #1
>>>>>>> [  393.641012] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
>>>>>>> [  393.641012]  8586fea0 88001fcc3a78 8187b4cb 
>>>>>>> 8104a261
>>>>>>> [  393.641012]  88001e1a20c0 88001fcc3ad8 818773e4 
>>>>>>> 
>>>>>>> [  393.641012]  8800 8801 81010a0a 
>>>>>>> 0001
>>>>>>> [  393.641012] Call Trace:
>>>>>>> [  393.641012][] dump_stack+0x4f/0x84
>>>>>>> [  393.641012]  [] ? console_unlock+0x291/0x410
>>>>>>> [  393.641012]  [] print_usage_bug+0x1f5/0x206
>>>>>>> [  393.641012]  [] ? save_stack_trace+0x2a/0x50
>>>>>>> [  393.641012]  [] mark_lock+0x283/0x2e0
>>>>>>> [  393.641012]  [] ? 
>>>>>>> print_irq_inversion_bug.part.40+0x1f0/0x1f0
>>>>>>> [  393.641012]  [] __lock_acquire+0x906/0x1d40
>>>>>>> [  393.641012]  [] ? __lock_acquire+0x2eb/0x1d40
>>>>>>> [  393.641012]  [] ? __lock_acquire+0x2eb/0x1d40
>>>>>>> [  393.641012]  [] lock_acquire+0x95/0x210
>>>>>>> [  393.641012]  [] ? rt_mutex_unlock+0x53/0x100
>>>>>>> [  393.641012]  [] _raw_spin_lock+0x36/0x50
>>>>>>> [  393.641012]  [] ? rt_mutex_unlock+0x53/0x100
>>>>>>> [  393.641012]  [] rt_mutex_unlock+0x53/0x100
> 
> The really strange thing here is that I thought that your passing false
> in as the new second parameter to rcu_read_unlock_special() was supposed
> to prevent rt_mutex_unlock() from being called.
> 
> But then why is the call from rcu_preempt_note_context_switch() also
> passing false?  I would have expected that one to pass true.  Probably
> I don't understand your intent with the "unlock" argument.
> 
>>>>>>> [  393.641012]  [] rcu_read_unlock_special+0x17a/0x2a0
>>>>>>> [  393.641012]  [] rcu_check_callbacks+0x313/0x950
>>>>>>> [  393.641012]  [] ? hrtimer_run_queues+0x1d/0x180
>>>>>>> [  393.641012]  [] ? trace_hardirqs_off+0xd/0x10
>>>

Re: workqueue, pci: INFO: possible recursive locking detected

2013-07-18 Thread Lai Jiangshan
On 07/19/2013 04:23 AM, Srivatsa S. Bhat wrote:
> 
> On 07/17/2013 03:37 PM, Lai Jiangshan wrote:
>> On 07/16/2013 10:41 PM, Srivatsa S. Bhat wrote:
>>> Hi,
>>>
>>> I have been seeing this warning every time during boot. I haven't
>>> spent time digging through it though... Please let me know if
>>> any machine-specific info is needed.
>>>
>>> Regards,
>>> Srivatsa S. Bhat
>>>
>>>
>>> 
>>>
>>> =
>>> [ INFO: possible recursive locking detected ]
>>> 3.11.0-rc1-lockdep-fix-a #6 Not tainted
>>> -
>>> kworker/0:1/142 is trying to acquire lock:
>>>  ((&wfc.work)){+.+.+.}, at: [] flush_work+0x0/0xb0
>>>
>>> but task is already holding lock:
>>>  ((&wfc.work)){+.+.+.}, at: [] 
>>> process_one_work+0x169/0x610
>>>
>>> other info that might help us debug this:
>>>  Possible unsafe locking scenario:
>>>
>>>CPU0
>>>
>>>   lock((&wfc.work));
>>>   lock((&wfc.work));
>>
>>
> 
> 
> Hi Lai,
> 
> Thanks for taking a look into this!
> 
>>
>> This is false negative,
> 
> I believe you meant false-positive...
> 
>> the two "wfc"s are different, they are
>> both on stack. flush_work() can't be deadlock in such case:
>>
>> void foo(void *)
>> {
>>  ...
>>  if (xxx)
>>  work_on_cpu(..., foo, ...);
>>  ...
>> }
>>
>> bar()
>> {
>>  work_on_cpu(..., foo, ...);
>> }
>>
>> The complaint is caused by "work_on_cpu() uses a static lock_class_key".
>> we should fix work_on_cpu().
>> (but the caller should also be careful, the foo()/local_pci_probe() is 
>> re-entering)
>>
>> But I can't find an elegant fix.
>>
>> long work_on_cpu(int cpu, long (*fn)(void *), void *arg)
>> {
>>  struct work_for_cpu wfc = { .fn = fn, .arg = arg };
>>
>> +#ifdef CONFIG_LOCKDEP
>> +static struct lock_class_key __key;
>> +INIT_WORK_ONSTACK(&wfc.work, work_for_cpu_fn);
>> +lockdep_init_map(&wfc.work.lockdep_map, &wfc.work, &__key, 0);
>> +#else
>>  INIT_WORK_ONSTACK(&wfc.work, work_for_cpu_fn);
>> +#endif
>>  schedule_work_on(cpu, &wfc.work);
>>  flush_work(&wfc.work);
>>  return wfc.ret;
>> }
>>
> 
> Unfortunately that didn't seem to fix it.. I applied the patch
> shown below, and I got the same old warning.
> 
> ---
> 
>  kernel/workqueue.c |6 ++
>  1 file changed, 6 insertions(+)
> 
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index f02c4a4..07d9a67 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -4754,7 +4754,13 @@ long work_on_cpu(int cpu, long (*fn)(void *), void 
> *arg)
>  {
>   struct work_for_cpu wfc = { .fn = fn, .arg = arg };
>  
> +#ifdef CONFIG_LOCKDEP
> + static struct lock_class_key __key;

Sorry, this "static" should be removed.

Thanks,
Lai


> + INIT_WORK_ONSTACK(&wfc.work, work_for_cpu_fn);
> + lockdep_init_map(&wfc.work.lockdep_map, "&wfc.work", &__key, 0);
> +#else
>   INIT_WORK_ONSTACK(&wfc.work, work_for_cpu_fn);
> +#endif
>   schedule_work_on(cpu, &wfc.work);
>   flush_work(&wfc.work);
>   return wfc.ret;
> 
> 
> 
> Warning:
> 
> 
> wmi: Mapper loaded
> be2net :11:00.0: irq 102 for MSI/MSI-X
> be2net :11:00.0: enabled 1 MSI-x vector(s)
> be2net :11:00.0: created 0 RSS queue(s) and 1 default RX queue
> be2net :11:00.0: created 1 TX queue(s)
> pci :11:04.0: [19a2:0710] type 00 class 0x02
> 
> =
> [ INFO: possible recursive locking detected ]
> 3.11.0-rc1-wq-fix #10 Not tainted
> -
> kworker/0:1/126 is trying to acquire lock:
>  (&wfc.work){+.+.+.}, at: [] flush_work+0x0/0xb0
> 
> but task is already holding lock:
>  (&wfc.work){+.+.+.}, at: [] process_one_work+0x169/0x610
> 
> other info that might help us debug this:
>  Possible unsafe locking scenario:
> 
>CPU0
>
>   lock(&wfc.work);
>   lock(&wfc.work);
> 
>  *** DEADLOCK ***
> 
>  May be due to missing lock nesting notation
> 
> 3 locks held by kworker

Re: workqueue, pci: INFO: possible recursive locking detected

2013-07-22 Thread Lai Jiangshan
On 07/19/2013 04:57 PM, Srivatsa S. Bhat wrote:
> On 07/19/2013 07:17 AM, Lai Jiangshan wrote:
>> On 07/19/2013 04:23 AM, Srivatsa S. Bhat wrote:
>>>
>>> ---
>>>
>>>  kernel/workqueue.c |6 ++
>>>  1 file changed, 6 insertions(+)
>>>
>>>
>>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>>> index f02c4a4..07d9a67 100644
>>> --- a/kernel/workqueue.c
>>> +++ b/kernel/workqueue.c
>>> @@ -4754,7 +4754,13 @@ long work_on_cpu(int cpu, long (*fn)(void *), void 
>>> *arg)
>>>  {
>>> struct work_for_cpu wfc = { .fn = fn, .arg = arg };
>>>  
>>> +#ifdef CONFIG_LOCKDEP
>>> +   static struct lock_class_key __key;
>>
>> Sorry, this "static" should be removed.
>>
> 
> That didn't help either :-( Because it makes lockdep unhappy,
> since the key isn't persistent.
> 
> This is the patch I used:
> 
> ---
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index f02c4a4..7967e3b 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -4754,7 +4754,13 @@ long work_on_cpu(int cpu, long (*fn)(void *), void 
> *arg)
>  {
>   struct work_for_cpu wfc = { .fn = fn, .arg = arg };
> 
> +#ifdef CONFIG_LOCKDEP
> + struct lock_class_key __key;
> + INIT_WORK_ONSTACK(&wfc.work, work_for_cpu_fn);
> + lockdep_init_map(&wfc.work.lockdep_map, "&wfc.work", &__key, 0);
> +#else
>   INIT_WORK_ONSTACK(&wfc.work, work_for_cpu_fn);
> +#endif
>   schedule_work_on(cpu, &wfc.work);
>   flush_work(&wfc.work);
>   return wfc.ret;
> 
> 
> And here are the new warnings:
> 
> 
> Block layer SCSI generic (bsg) driver version 0.4 loaded (major 252)
> io scheduler noop registered
> io scheduler deadline registered
> io scheduler cfq registered (default)
> BUG: key 881039557b98 not in .data!
> [ cut here ]
> WARNING: CPU: 8 PID: 1 at kernel/lockdep.c:2987 lockdep_init_map+0x168/0x170()

Sorry again.

>From 0096b9dac2282ec03d59a3f665b92977381a18ad Mon Sep 17 00:00:00 2001
From: Lai Jiangshan 
Date: Mon, 22 Jul 2013 19:08:51 +0800
Subject: [PATCH] workqueue: allow the function called by work_on_cpu() to
 itself call work_on_cpu()

If @fn calls work_on_cpu() again, lockdep will complain:

> [ INFO: possible recursive locking detected ]
> 3.11.0-rc1-lockdep-fix-a #6 Not tainted
> -
> kworker/0:1/142 is trying to acquire lock:
>  ((&wfc.work)){+.+.+.}, at: [] flush_work+0x0/0xb0
>
> but task is already holding lock:
>  ((&wfc.work)){+.+.+.}, at: [] process_one_work+0x169/0x610
>
> other info that might help us debug this:
>  Possible unsafe locking scenario:
>
>CPU0
>
>   lock((&wfc.work));
>   lock((&wfc.work));
>
>  *** DEADLOCK ***

This is a false-positive lockdep report. In this situation,
the two "wfc"s of the two work_on_cpu() invocations are different;
they are both on stack, so flush_work() cannot deadlock.

To fix this, we need to avoid the lockdep checking in this case.
But we don't want to change flush_work(), so we use a
completion instead of flush_work() in work_on_cpu().

Reported-by: Srivatsa S. Bhat 
Signed-off-by: Lai Jiangshan 
---
 kernel/workqueue.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f02c4a4..b021a45 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4731,6 +4731,7 @@ struct work_for_cpu {
long (*fn)(void *);
void *arg;
long ret;
+   struct completion done;
 };
 
 static void work_for_cpu_fn(struct work_struct *work)
@@ -4738,6 +4739,7 @@ static void work_for_cpu_fn(struct work_struct *work)
struct work_for_cpu *wfc = container_of(work, struct work_for_cpu, 
work);
 
wfc->ret = wfc->fn(wfc->arg);
+   complete(&wfc->done);
 }
 
 /**
@@ -4755,8 +4757,9 @@ long work_on_cpu(int cpu, long (*fn)(void *), void *arg)
struct work_for_cpu wfc = { .fn = fn, .arg = arg };
 
INIT_WORK_ONSTACK(&wfc.work, work_for_cpu_fn);
+   init_completion(&wfc.done);
schedule_work_on(cpu, &wfc.work);
-   flush_work(&wfc.work);
+   wait_for_completion(&wfc.done);
return wfc.ret;
 }
 EXPORT_SYMBOL_GPL(work_on_cpu);
-- 
1.7.4.4


Re: workqueue, pci: INFO: possible recursive locking detected

2013-07-22 Thread Lai Jiangshan
On 07/23/2013 05:32 AM, Tejun Heo wrote:
> On Mon, Jul 22, 2013 at 07:52:34PM +0800, Lai Jiangshan wrote:
>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> index f02c4a4..b021a45 100644
>> --- a/kernel/workqueue.c
>> +++ b/kernel/workqueue.c
>> @@ -4731,6 +4731,7 @@ struct work_for_cpu {
>>  long (*fn)(void *);
>>  void *arg;
>>  long ret;
>> +struct completion done;
>>  };
>>  
>>  static void work_for_cpu_fn(struct work_struct *work)
>> @@ -4738,6 +4739,7 @@ static void work_for_cpu_fn(struct work_struct *work)
>>  struct work_for_cpu *wfc = container_of(work, struct work_for_cpu, 
>> work);
>>  
>>  wfc->ret = wfc->fn(wfc->arg);
>> +complete(&wfc->done);
>>  }
>>  
>>  /**
>> @@ -4755,8 +4757,9 @@ long work_on_cpu(int cpu, long (*fn)(void *), void 
>> *arg)
>>  struct work_for_cpu wfc = { .fn = fn, .arg = arg };
>>  
>>  INIT_WORK_ONSTACK(&wfc.work, work_for_cpu_fn);
>> +init_completion(&wfc.done);
>>  schedule_work_on(cpu, &wfc.work);
>> -flush_work(&wfc.work);
>> +wait_for_completion(&wfc.done);
> 
> Hmmm... it's kinda nasty.  Given how infrequently work_on_cpu() users
> nest, I think it'd be cleaner to have work_on_cpu_nested() which takes
> @subclass.  It requires extra work on the caller's part but I think
> that actually is useful as nested work_on_cpu()s are pretty weird
> things.
> 

The problem is that the users may not know that their work_on_cpu() calls nest,
especially when the work_on_cpu()s are in different subsystems, the call depth
is deep enough, and the nested work_on_cpu() depends on some conditions.

I would prefer to change the caller rather than introduce work_on_cpu_nested(),
and I can accept changing only the caller instead of work_on_cpu() itself, since
there is only one nested-calls case found so far.

But I am wondering: since nested work_on_cpu() doesn't have any real problem,
why shouldn't workqueue.c offer a more friendly API/behavior?

Thanks,
Lai


Re: workqueue, pci: INFO: possible recursive locking detected

2013-07-24 Thread Lai Jiangshan
On 07/23/2013 10:38 PM, Tejun Heo wrote:
> Hey, Lai.
> 
> On Tue, Jul 23, 2013 at 09:23:14AM +0800, Lai Jiangshan wrote:
>> The problem is that the userS may not know their work_on_cpu() nested,
>> especially when work_on_cpu()s are on different subsystems and the call depth
>> is deep enough but the nested work_on_cpu() depends on some conditions.
> 
> Yeah, that's a possibility.  Not sure how much it'd actually matter
> tho given that this is the only instance we have and we've had the
> lockdep annotation for years.
> 
>> I prefer to change the user instead of introducing work_on_cpu_nested(), and
>> I accept to change the user only instead of change work_on_cpu() since there 
>> is only
>> one nested-calls case found.
>>
>> But I'm thinking, since nested work_on_cpu() don't have any problem,
>> Why workqueue.c don't offer a more friendly API/behavior?
> 
> If we wanna solve it from workqueue side, let's please do it by
> introduing an internal flush_work() variant which skips the lockdep
> annotation.  I'd really like to avoid using completion here.  It's
> nasty as it depends solely on the fact that completion doesn't have
> lockdep annotation yet.  Let's do it explicitly.
> 
> Thanks.
> 

>From 269bf1a2f47f04e0daf429c2cdf4052b4e8fb309 Mon Sep 17 00:00:00 2001
From: Lai Jiangshan 
Date: Wed, 24 Jul 2013 18:21:50 +0800
Subject: [PATCH] workqueue: allow the function called by work_on_cpu() to
 itself call work_on_cpu()

If @fn calls work_on_cpu() again, lockdep will complain:

> [ INFO: possible recursive locking detected ]
> 3.11.0-rc1-lockdep-fix-a #6 Not tainted
> -
> kworker/0:1/142 is trying to acquire lock:
>  ((&wfc.work)){+.+.+.}, at: [] flush_work+0x0/0xb0
>
> but task is already holding lock:
>  ((&wfc.work)){+.+.+.}, at: [] process_one_work+0x169/0x610
>
> other info that might help us debug this:
>  Possible unsafe locking scenario:
>
>CPU0
>
>   lock((&wfc.work));
>   lock((&wfc.work));
>
>  *** DEADLOCK ***

This is a false-positive lockdep report. In this situation,
the two "wfc"s of the two work_on_cpu() invocations are different;
they are both on stack, so flush_work() cannot deadlock.

To fix this, we need to avoid the lockdep checking in this case,
thus we introduce an internal __flush_work() which skips the lockdep annotation.

Signed-off-by: Lai Jiangshan 
---
 kernel/workqueue.c |   29 +++--
 1 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f02c4a4..53df707 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2817,6 +2817,19 @@ already_gone:
return false;
 }
 
+static bool __flush_work(struct work_struct *work)
+{
+   struct wq_barrier barr;
+
+   if (start_flush_work(work, &barr)) {
+   wait_for_completion(&barr.done);
+   destroy_work_on_stack(&barr.work);
+   return true;
+   } else {
+   return false;
+   }
+}
+
 /**
  * flush_work - wait for a work to finish executing the last queueing instance
  * @work: the work to flush
@@ -2830,18 +2843,10 @@ already_gone:
  */
 bool flush_work(struct work_struct *work)
 {
-   struct wq_barrier barr;
-
lock_map_acquire(&work->lockdep_map);
lock_map_release(&work->lockdep_map);
 
-   if (start_flush_work(work, &barr)) {
-   wait_for_completion(&barr.done);
-   destroy_work_on_stack(&barr.work);
-   return true;
-   } else {
-   return false;
-   }
+   return __flush_work(work);
 }
 EXPORT_SYMBOL_GPL(flush_work);
 
@@ -4756,7 +4761,11 @@ long work_on_cpu(int cpu, long (*fn)(void *), void *arg)
 
INIT_WORK_ONSTACK(&wfc.work, work_for_cpu_fn);
schedule_work_on(cpu, &wfc.work);
-   flush_work(&wfc.work);
+   /*
+* flushing the work can't lead to deadlock, using __flush_work()
+* to avoid the lockdep complaint for nested work_on_cpu()s.
+*/
+   __flush_work(&wfc.work);
return wfc.ret;
 }
 EXPORT_SYMBOL_GPL(work_on_cpu);
-- 
1.7.4.4



[PATCH] tracing: remove the found node directly before break the search-loop

2013-07-24 Thread Lai Jiangshan
Just a clean-up, but it gives us better readability.

Signed-off-by: Lai Jiangshan 
---
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 7d85429..a44f501 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -1827,13 +1827,13 @@ static void trace_module_remove_events(struct module 
*mod)
 
/* Now free the file_operations */
list_for_each_entry(file_ops, &ftrace_module_file_list, list) {
-   if (file_ops->mod == mod)
+   if (file_ops->mod == mod) {
+   list_del(&file_ops->list);
+   kfree(file_ops);
break;
+   }
}
-   if (&file_ops->list != &ftrace_module_file_list) {
-   list_del(&file_ops->list);
-   kfree(file_ops);
-   }
+
up_write(&trace_event_sem);
 
/*


[PATCH] workqueue: clear workers of a pool after the CPU is offline

2013-07-25 Thread Lai Jiangshan
The unbound pools and their workers can be destroyed/cleared
when their refcnt becomes zero. But the cpu pools can't be destroyed,
because they are always referenced; their refcnt is always > 0.

We don't want to destroy the cpu pools, but we do want to destroy
the workers of a pool when the pool is fully idle after its cpu
has gone offline. This was the default behavior in the old days, until
we removed the trustee_thread().

We need to find a new way to restore this behavior,
so we add offline_pool() and a POOL_OFFLINE flag to do so.

1) Before we try to clear the workers, we set POOL_OFFLINE on the pool.
   The pool will then no longer serve works; any work that is queued
   on that pool will be rejected, except chained works.

2) When all the pending works are finished and all workers are idle, the worker
   thread will schedule offline_pool() to clear the workers.

Signed-off-by: Lai Jiangshan 
---
 kernel/workqueue.c |   89 +--
 1 files changed, 85 insertions(+), 4 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f02c4a4..2617895 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -63,13 +63,18 @@ enum {
 * %WORKER_UNBOUND set and concurrency management disabled, and may
 * be executing on any CPU.  The pool behaves as an unbound one.
 *
-* Note that DISASSOCIATED should be flipped only while holding
-* manager_mutex to avoid changing binding state while
+* OFFLINE is a further state of DISASSOCIATED when the cpu had
+* finished offline and all the workers will exit after they
+* finish the last works of the pool.
+*
+* Note that DISASSOCIATED and OFFLINE should be flipped only while
+* holding manager_mutex to avoid changing binding state while
 * create_worker() is in progress.
 */
POOL_MANAGE_WORKERS = 1 << 0,   /* need to manage workers */
-   POOL_DISASSOCIATED  = 1 << 2,   /* cpu can't serve workers */
+   POOL_DISASSOCIATED  = 1 << 2,   /* pool dissociates its cpu */
POOL_FREEZING   = 1 << 3,   /* freeze in progress */
+   POOL_OFFLINE= 1 << 4,   /* pool can't serve work */
 
/* worker flags */
WORKER_STARTED  = 1 << 0,   /* started */
@@ -164,6 +169,7 @@ struct worker_pool {
struct mutexmanager_arb;/* manager arbitration */
struct mutexmanager_mutex;  /* manager exclusion */
struct idr  worker_idr; /* MG: worker IDs and iteration 
*/
+   struct work_struct  offline_work;   /* offline the pool */
 
struct workqueue_attrs  *attrs; /* I: worker attributes */
struct hlist_node   hash_node;  /* PL: unbound_pool_hash node */
@@ -1372,6 +1378,12 @@ retry:
  wq->name, cpu);
}
 
+   if (unlikely(pwq->pool->flags & POOL_OFFLINE) &&
+   WARN_ON_ONCE(!is_chained_work(wq))) {
+   spin_unlock(&pwq->pool->lock);
+   return;
+   }
+
/* pwq determined, queue */
trace_workqueue_queue_work(req_cpu, pwq, work);
 
@@ -1784,7 +1796,7 @@ static void start_worker(struct worker *worker)
 }
 
 /**
- * create_and_start_worker - create and start a worker for a pool
+ * create_and_start_worker - create and start the initial worker for a pool
  * @pool: the target pool
  *
  * Grab the managership of @pool and create and start a new worker for it.
@@ -1798,6 +1810,7 @@ static int create_and_start_worker(struct worker_pool 
*pool)
worker = create_worker(pool);
if (worker) {
spin_lock_irq(&pool->lock);
+   pool->flags &= ~POOL_OFFLINE;
start_worker(worker);
spin_unlock_irq(&pool->lock);
}
@@ -2091,6 +2104,54 @@ static bool manage_workers(struct worker *worker)
 }
 
 /**
+ * offline_pool - try to offline a pool
+ * @work: embedded offline work item of the target pool
+ *
+ * Try to offline a pool by destroying all its workers.
+ *
+ * offline_pool() only destroys workers which are idle on the idle_list.
+ * If any worker leaves idle by some reasons, it can not be destroyed,
+ * but this work item will be rescheduled by the worker's worker_thread()
+ * again in this case. So offline_pool() may be called multi times
+ * to finish offline pool in this rare case.
+ *
+ * offline_pool() is always scheduled by system_unbound_wq even the pool
+ * is high priority pool:
+ *  1) The pool of system_unbound_wq is always online.
+ *  2) The latency of offline_pool() doesn't matter.
+ */
+static void offline_pool(struct work_struct *work)
+{
+   struct worker_pool *pool;
+   struct worker *worker;
+
+   pool = container_of(work, struct worker_pool, offline_work);
+
+

Re: [PATCH] workqueue: clear workers of a pool after the CPU is offline

2013-07-25 Thread Lai Jiangshan
On 07/25/2013 11:31 PM, Tejun Heo wrote:
> Hello, Lai.
> 
> On Thu, Jul 25, 2013 at 06:52:02PM +0800, Lai Jiangshan wrote:
>> The unbound pools and their workers can be destroyed/cleared
>> when their refcnt become zero. But the cpu pool can't be destroyed
>> due to they are always referenced, their refcnt are always > 0.
>>
>> We don't want to destroy the cpu pools, but we want to destroy
>> the workers of the pool when the pool is full idle after the cpu
>> is offline. This is the default behavior in old days until
>> we removed the trustee_thread().
>>
>> We need to find a new way to restore this behavior,
>> We add offline_pool() and POOL_OFFLINE flag to do so.
> 
> Hmmm... if I'm not confused, now the cpu pools just behave like a
> normal unbound pool when the cpu goes down,

The cpu pools are always referenced; they don't behave like unbound pools.

> which means that the idle
> cpu workers will exit once idle timeout is reached, right? 

No, there is currently no code that forces the cpu workers to quit.
You can just offline a cpu and see what happens to its workers.

> I really
> don't think it'd be worthwhile to add extra logic to accelerate the
> process.
> 
> Note that there actually are benefits to doing it asynchronously as
> CPUs go up and down very frequently on mobile platforms and destroying
> idle workers as soon as possible would just mean that we'd be doing a
> lot of work which isn't necessary.  I mean, we even grew an explicit
> mechanism to park kthreads to avoid repeatedly creating and destroying
> per-cpu kthreads as cpus go up and down.  I don't see any point in
> adding code to go the other direction.
> 
> Thanks.
> 



Re: [PATCH] workqueue: clear workers of a pool after the CPU is offline

2013-07-25 Thread Lai Jiangshan
On 07/26/2013 11:07 AM, Tejun Heo wrote:
> Hello,
> 
> On Fri, Jul 26, 2013 at 10:13:25AM +0800, Lai Jiangshan wrote:
>>> Hmmm... if I'm not confused, now the cpu pools just behave like a
>>> normal unbound pool when the cpu goes down,
>>
>> cpu pools are always referenced, they don't behave like unbound pool.
> 
> Yeah sure, they don't get destroyed but pool management functions the
> same.
> 
>>> which means that the idle
>>> cpu workers will exit once idle timeout is reached, right? 
>>
>> No, no code to force the cpu workers quit currently.
>> you can just offline a cpu to see what happened to the workers.
> 
> Hmmm?  The idle timer thing doesn't work?  Why?
> 

A worker can't kill itself, and the manager always tries to leave at least
two idle workers (see the sketch below).

So the workers of an offline cpu's pool can never be destroyed completely.

(In the old days we also had the idle timer, but the last workers were killed by
trustee_thread().)
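
For reference, the idle-worker culling condition, paraphrased from
kernel/workqueue.c of this era (details may differ slightly from the exact
source): idle workers are only reaped while more than two of them are idle,
so the last two workers of a cpu pool are never destroyed by the idle timer.

	static bool too_many_workers(struct worker_pool *pool)
	{
		bool managing = mutex_is_locked(&pool->manager_arb);
		int nr_idle = pool->nr_idle + managing;	/* manager counts as idle */
		int nr_busy = pool->nr_workers - nr_idle;

		return nr_idle > 2 &&
		       (nr_idle - 2) * MAX_IDLE_WORKERS_RATIO >= nr_busy;
	}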


Re: [PATCH] workqueue: clear workers of a pool after the CPU is offline

2013-07-26 Thread Lai Jiangshan
On Fri, Jul 26, 2013 at 6:22 PM, Tejun Heo  wrote:
> On Fri, Jul 26, 2013 at 11:47:04AM +0800, Lai Jiangshan wrote:
>> any worker can't kill itself.
>> managers always tries to leave 2 workers.
>>
>> so the workers of the offline cpu pool can't be totally destroyed.
>
> But we *do* want to keep them around as CPUs taken offline are likely
> to come online at some point and destroying all of them saves only
> ~16k of memory while adding more work while CPUs are on/offlined which

4 threads (normal and high-priority wq),
so ~32k.
It is still small.

> can be very frequent on mobile devices.  The change was *intentional*.

But sometimes the cpu stays offline for a long time,
and maybe the administrator wants to reclaim the resources.

Add a boot option or sysfs switch?
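
(Hypothetically -- not a posted patch -- even a module_param in workqueue.c
would give both at once: a workqueue.* boot option and a /sys/module knob.
The name below is made up:)

	/* Made-up knob: let the admin opt in to reaping the idle workers
	 * of offline cpu pools. */
	static bool reap_offline_cpu_workers __read_mostly;
	module_param(reap_offline_cpu_workers, bool, 0644);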

>
> Thanks.
>
> --
> tejun


Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site

2013-08-09 Thread Lai Jiangshan
On 08/09/2013 04:40 AM, Paul E. McKenney wrote:
> On Wed, Aug 07, 2013 at 06:25:01PM +0800, Lai Jiangshan wrote:
>> Background)
>>
>> Although all articles declare that rcu read site is deadlock-immunity.
>> It is not true for rcu-preempt, it will be deadlock if rcu read site
>> overlaps with scheduler lock.
>>
>> ec433f0c, 10f39bb1 and 016a8d5b just partially solve it. But rcu read site
>> is still not deadlock-immunity. And the problem described in 016a8d5b
>> is still existed(rcu_read_unlock_special() calls wake_up).
>>
>> Aim)
>>
>> We want to fix the problem forever, we want to keep rcu read site
>> is deadlock-immunity as books say.
>>
>> How)
>>
>> The problem is solved by "if rcu_read_unlock_special() is called inside
>> any lock which can be (chained) nested in rcu_read_unlock_special(),
>> we defer rcu_read_unlock_special()".
>> This kind locks include rnp->lock, scheduler locks, perf ctx->lock, locks
>> in printk()/WARN_ON() and all locks nested in these locks or chained nested
>> in these locks.
>>
>> The problem is reduced to "how to distinguish all these locks(context)",
>> We don't distinguish all these locks, we know that all these locks
>> should be nested in local_irqs_disable().
>>
>> we just consider if rcu_read_unlock_special() is called in irqs-disabled
>> context, it may be called in these suspect locks, we should defer
>> rcu_read_unlock_special().
>>
>> The algorithm enlarges the probability of deferring, but the probability
>> is still very very low.
>>
>> Deferring does add a small overhead, but it offers us:
>>  1) really deadlock-immunity for rcu read site
>>  2) remove the overhead of the irq-work(250 times per second in avg.)
> 
> One problem here -- it may take quite some time for a set_need_resched()
> to take effect.  This is especially a problem for RCU priority boosting,
> but can also needlessly delay preemptible-RCU grace periods because
> local_irq_restore() and friends don't check the TIF_NEED_RESCHED bit.


The final effect of deboosting (rt_mutex_unlock()) is also accomplished
via set_need_resched()/set_tsk_need_resched(), so
set_need_resched() is enough for the RCU priority boosting issue here.

Since rcu_read_unlock_special() is deferred, it does take quite some time for
the QS report to take effect.


> 
> OK, alternatives...
> 
> o Keep the current rule saying that if the scheduler is going
>   to exit an RCU read-side critical section while holding
>   one of its spinlocks, preemption has to have been disabled

Note that rtmutex's lock->wait_lock is neither irqs-disabled nor bh-disabled.

This kind of spinlock includes the scheduler locks, rtmutex's lock->wait_lock,
and all locks that can be acquired in irq/softirq context.

So this rule would not only apply to the scheduler locks; it would also have to
be applied to almost all spinlocks in the kernel.

It is hard for me to accept that the rcu read site is not deadlock-immune.

Thanks,
Lai

>   throughout the full duration of that critical section.
>   Well, we can certainly do this, but it would be nice to get
>   rid of this rule.
> 
> o Use per-CPU variables, possibly injecting delay.  This has ugly
>   disadvantages as noted above.
> 
> o irq_work_queue() can wait a jiffy (or on some architectures,
>   quite a bit longer) before actually doing anything.
> 
> o raise_softirq() is more immediate and is an easy change, but
>   adds a softirq vector -- which people are really trying to
>   get rid of.  Also, wakeup_softirqd() calls things that acquire
>   the scheduler locks, which is exactly what we were trying to
>   avoid doing.
> 
> o invoke_rcu_core() can invoke raise_softirq() as above.
> 
> o IPI to self.  From what I can see, not all architectures
>   support this.  Easy to fake if you have at least two CPUs,
>   but not so good from an OS jitter viewpoint...
> 
> o Add a check to local_irq_disable() and friends.  I would guess
>   that this suggestion would not make architecture maintainers
>   happy.
> 
> Other thoughts?
> 
>   Thanx, Paul
> 
>> Signed-off-by: Lai Jiangshan 
>> ---
>>  include/linux/rcupdate.h |2 +-
>>  kernel/rcupdate.c|2 +-
>>  kernel/rcutree_plugin.h  |   47 
>> +
>>  3 files changed, 44 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
>> index 4b14bdc..00b4220 100644
>> --- a/include/linux/rcupdate.h
>> +++ b/in

Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site

2013-08-09 Thread Lai Jiangshan
Hi, Steven

I had been assuming that rtmutex's lock->wait_lock is a scheduler lock,
but it is not; it is just a process-context spinlock.
I hope you will change it to an irq-context (irqs-disabled) spinlock.

1) It makes the rcu read site more deadlock-prone. Example, where x is a
softirq-context spinlock:

CPU1                                      CPU2 (rcu boost)
----                                      ----------------
rcu_read_lock()                           rt_mutex_lock()
                                            raw_spin_lock(lock->wait_lock)
spin_lock_bh(x)
rcu_read_unlock()                         do_softirq()
  rcu_read_unlock_special()
    rt_mutex_unlock()
      raw_spin_lock(lock->wait_lock)        spin_lock_bh(x)   <== DEADLOCK

This example can happen with any one of these code bases:
	without my patchset
	with my patchset
	with my patchset but with patch2 reverted

2) Why it becomes more deadlock-prone: it extends the range of suspect locks.
#DEFINE suspect locks: any lock that can be (chained) nested in
rcu_read_unlock_special().

So the suspect locks are: rnp->lock, the scheduler locks, rtmutex's lock->wait_lock,
the locks in printk()/WARN_ON(), and the locks which can be chained/indirectly nested
in the above locks.

If rtmutex's lock->wait_lock is an irq-context spinlock, all suspect locks are
irq-context spinlocks.

If rtmutex's lock->wait_lock is a process-context spinlock, the suspect locks
are extended to: all irq-context spinlocks, all softirq-context spinlocks,
and all process-context spinlocks that can be nested inside rtmutex's lock->wait_lock.

We can see from the definition that if rcu_read_unlock_special() is called under
any suspect lock, it may deadlock as in the example above. rtmutex's lock->wait_lock
extends the range of suspect locks, which makes deadlock more likely.

3) How my algorithm works, and why a smaller range of suspect locks helps us.
Since rcu_read_unlock_special() can't be called from a suspect-lock context,
we should defer rcu_read_unlock_special() when in such a context.
It is hard to find out whether the current context is a suspect-lock context or not,
but we can approximate it based on irq/softirq/process context
(a C sketch of this check follows after item 4 below).

If all suspect locks are irq-context spinlocks:
	if (irqs_disabled())	/* we may be in a suspect-lock context */
		defer rcu_read_unlock_special().

If the suspect locks span irq/softirq/process context:
	if (irqs_disabled() || in_atomic())	/* we may be in a suspect-lock context */
		defer rcu_read_unlock_special().
In this case the deferring happens much more often; I can't accept that.
So I have to narrow the range of suspect locks. Two choices:
A) don't call rt_mutex_unlock() from rcu_read_unlock(), only call it
   from rcu_preempt_note_context_switch(). We would need to rework these
   two functions, which adds complexity to RCU, and it still
   adds some probability of deferring.
B) change rtmutex's lock->wait_lock to be irqs-disabled.

4) From rtmutex's point of view, I think it would be better if ->wait_lock were
irqs-disabled:
   A) like the trylock of mutex/rw_sem, we may want to call rt_mutex_trylock()
      from irq context in the future.
   B) the critical section of ->wait_lock is short,
      so making it irqs-disabled doesn't hurt responsiveness/latency.
   C) almost all of the ->wait_lock critical section already runs irqs-disabled
      (due to task->pi_lock), so I think converting the whole critical section
      of ->wait_lock to irqs-disabled is OK.
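
For concreteness, here is the sketch promised in 3) above. It is an illustration
only, NOT the code from patch5, and it assumes the "all suspect locks are
irqs-disabled spinlocks" case:

	/* Illustration only, not patch5's actual code: at the top of
	 * rcu_read_unlock_special(), something along these lines. */
	if (irqs_disabled()) {
		/*
		 * We may be running under a suspect lock (rnp->lock, a
		 * scheduler lock, rtmutex's ->wait_lock, ...): don't do the
		 * real work here; set_need_resched() lets a later, safe
		 * context finish it.
		 */
		set_need_resched();
		return;
	}
	/* ... normal rcu_read_unlock_special() processing ... */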

So I hope you change rtmutex's lock->wait_lock.

Any feedback from anyone is welcome.

Thanks,
Lai

On 08/09/2013 04:40 AM, Paul E. McKenney wrote:
> On Wed, Aug 07, 2013 at 06:25:01PM +0800, Lai Jiangshan wrote:
>> Background)
>>
>> Although all articles declare that rcu read site is deadlock-immunity.
>> It is not true for rcu-preempt, it will be deadlock if rcu read site
>> overlaps with scheduler lock.
>>
>> ec433f0c, 10f39bb1 and 016a8d5b just partially solve it. But rcu read site
>> is still not deadlock-immunity. And the problem described in 016a8d5b
>> is still existed(rcu_read_unlock_special() calls wake_up).
>>
>> Aim)
>>
>> We want to fix the problem forever, we want to keep rcu read site
>> is deadlock-immunity as books say.
>>
>> How)
>>
>> The problem is solved by "if rcu_read_unlock_special() is called inside
>> any lock which can be (chained) nested in rcu_read_unlock_special(),
>> we defer rcu_read_unlock_special()".
>> This kind locks include rnp->lock, scheduler locks, perf ctx->lock, locks
>> in printk()/WARN_ON() and all locks nested in these locks or chained nested
>> in these locks.
>>
>> The problem is reduced to "how to distinguish all these locks(context)",
>> We don't distinguish all these locks, we know that all these locks
>> should be nested in 

[PATCH] lglock: add read-preference local-global rwlock

2013-03-01 Thread Lai Jiangshan
>From c63f2be9a4cf7106a521dda169a0e14f8e4f7e3b Mon Sep 17 00:00:00 2001
From: Lai Jiangshan 
Date: Mon, 25 Feb 2013 23:14:27 +0800
Subject: [PATCH] lglock: add read-preference local-global rwlock

The current lglock is not read-preference, so it can't be used in some cases
where a read-preference rwlock can be; for example, get_cpu_online_atomic().

Although we could use an rwlock for these cases which need read-preference,
it leads to unnecessary cache-line bouncing even when there are no
writers present, which can slow down the system needlessly. It gets
worse when we have a lot of CPUs; it does not scale.

So we turn to lglock. lglock is a read-write lock based on percpu locks,
but it is not read-preference due to its underlying percpu locks.

But what if we convert the percpu locks of lglock to use percpu rwlocks:

     CPU 0                                 CPU 1
     -----                                 -----
1.   spin_lock(&random_lock);              read_lock(my_rwlock of CPU 1);
2.   read_lock(my_rwlock of CPU 0);        spin_lock(&random_lock);

Writer:
     CPU 2:
     -----
     for_each_online_cpu(cpu)
         write_lock(my_rwlock of 'cpu');


Consider what happens if the writer begins his operation in between steps 1
and 2 at the reader side. It becomes evident that we end up in a (previously
non-existent) deadlock due to a circular locking dependency between the 3
entities, like this:


       (holds                    Waiting for
     random_lock)  CPU 0 ------------------>  CPU 2  (holds my_rwlock of CPU 0
                     ^                          |           for write)
                     |                          |
             Waiting |                          | Waiting
                for  |                          |  for
                     |                          V
                     +---------- CPU 1 <--------+

                        (holds my_rwlock of
                         CPU 1 for read)


So obviously this "straightforward" way of implementing percpu rwlocks is
deadlock-prone, and we can't implement a read-preference local-global rwlock
like this.


The implementation in this patch reuses the current lglock as the frontend to
achieve local read locking, reuses a global fallback rwlock as the backend to
achieve read-preference, and uses a percpu reader counter to indicate 1) the depth
of the nested reader locks and 2) whether the outermost lock is the percpu lock
or the fallback rwlock.

The algorithm is simple. On the read side:
If it is a nested reader, just increase the counter.
If it is the outermost reader,
1) try to lock this cpu's lock of the frontend lglock
   (reader count += 1 on success);
2) if the above step fails, read_lock(&fallback_rwlock)
   (reader count += FALLBACK_BASE + 1).

Write side:
Do the lg_global_lock() of the frontend lglock,
and then write_lock(&fallback_rwlock).

Proof:
1) Reader-writer exclusion:
When both steps of the write side have finished, there is no reader, and vice versa.

2) Read-preference:
Before the write side finishes acquiring its locks, the read side at least
wins at read_lock(&fallback_rwlock), since rwlock is read-preference.
Read-preference also implies nestability.

3) The read-side functions are irq-safe (reentrance-safe):
If a read-side function is interrupted at any point and the read side is
reentered, the reentered read side will not be misled by the first read site.
If the reader counter is > 0, it means the frontend (this cpu's lock of the
lglock) or the backend (fallback rwlock) is currently held, so it is safe to
act as a nested reader.
If the reader counter is 0, the reentered reader considers itself the
outermost read site, and it always succeeds once the write side releases the
lock (even if the interrupted read site has already taken the cpu lock of the
lglock or the fallback_rwlock).
Also, the reentered read side only calls arch_spin_trylock(), read_lock()
and __this_cpu_op(); arch_spin_trylock() and read_lock() are already
reentrance-safe.
Although __this_cpu_op() is not reentrance-safe, the value of the counter
will be restored after the interrupting read side finishes, so the read-side
functions are still reentrance-safe.


Performance:
We only focus on the performance of the read site. The read site's fast path
is just preempt_disable() + __this_cpu_read/inc() + arch_spin_trylock();
it has only one heavy memory operation, so it is expected to be fast.

We test three locks.
1) traditional rwlock WITHOUT remote competition nor cache-bouncing.(opt-rwlock)
2) this lock(lgrwlock)
3) V6 percpu-rwlock by "Srivatsa S. Bhat". (percpu-rwlock)
   (https://lkml.org/lkml/2013/2/18/186)

                nested=1 (no nesting)   nested=2    nested=4
opt-rwlock            517181            1009200     2010027
lgrwlock              452897             700026     1201415
percpu-rwlock        1192955            1451343     1951757

The value is the time (in nanoseconds) of one round of the operations:
{
read-lock
[nested read-lock]...
[nes

Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks

2013-03-01 Thread Lai Jiangshan
On 28/02/13 05:19, Srivatsa S. Bhat wrote:
> On 02/27/2013 06:03 AM, Lai Jiangshan wrote:
>> On Wed, Feb 27, 2013 at 3:30 AM, Srivatsa S. Bhat
>>  wrote:
>>> On 02/26/2013 09:55 PM, Lai Jiangshan wrote:
>>>> On Tue, Feb 26, 2013 at 10:22 PM, Srivatsa S. Bhat
>>>>  wrote:
>>>>>
>>>>> Hi Lai,
>>>>>
>>>>> I'm really not convinced that piggy-backing on lglocks would help
>>>>> us in any way. But still, let me try to address some of the points
>>>>> you raised...
>>>>>
>>>>> On 02/26/2013 06:29 PM, Lai Jiangshan wrote:
>>>>>> On Tue, Feb 26, 2013 at 5:02 PM, Srivatsa S. Bhat
>>>>>>  wrote:
>>>>>>> On 02/26/2013 05:47 AM, Lai Jiangshan wrote:
>>>>>>>> On Tue, Feb 26, 2013 at 3:26 AM, Srivatsa S. Bhat
>>>>>>>>  wrote:
>>>>>>>>> Hi Lai,
>>>>>>>>>
>>>>>>>>> On 02/25/2013 09:23 PM, Lai Jiangshan wrote:
>>>>>>>>>> Hi, Srivatsa,
>>>>>>>>>>
>>>>>>>>>> The target of the whole patchset is nice for me.
>>>>>>>>>
>>>>>>>>> Cool! Thanks :-)
>>>>>>>>>
>>>>>>> [...]
>>>>>>>
>>>>>>> Unfortunately, I see quite a few issues with the code above. IIUC, the
>>>>>>> writer and the reader both increment the same counters. So how will the
>>>>>>> unlock() code in the reader path know when to unlock which of the locks?
>>>>>>
>>>>>> The same as your code, the reader(which nested in write C.S.) just dec
>>>>>> the counters.
>>>>>
>>>>> And that works fine in my case because the writer and the reader update
>>>>> _two_ _different_ counters.
>>>>
>>>> I can't find any magic in your code, they are the same counter.
>>>>
>>>> /*
>>>>  * It is desirable to allow the writer to acquire the percpu-rwlock
>>>>  * for read (if necessary), without deadlocking or getting 
>>>> complaints
>>>>  * from lockdep. To achieve that, just increment the reader_refcnt 
>>>> of
>>>>  * this CPU - that way, any attempt by the writer to acquire the
>>>>  * percpu-rwlock for read, will get treated as a case of nested 
>>>> percpu
>>>>  * reader, which is safe, from a locking perspective.
>>>>  */
>>>> this_cpu_inc(pcpu_rwlock->rw_state->reader_refcnt);
>>>>
>>>
>>> Whoa! Hold on, were you really referring to _this_ increment when you said
>>> that, in your patch you would increment the refcnt at the writer? Then I 
>>> guess
>>> there is a major disconnect in our conversations. (I had assumed that you 
>>> were
>>> referring to the update of writer_signal, and were just trying to have a 
>>> single
>>> refcnt instead of reader_refcnt and writer_signal).
>>
>> https://github.com/laijs/linux/commit/53e5053d5b724bea7c538b11743d0f420d98f38d
>>
>> Sorry the name "fallback_reader_refcnt" misled you.
>>
> [...]
> 
>>>> All I was considered is "nested reader is seldom", so I always
>>>> fallback to rwlock when nested.
>>>> If you like, I can add 6 lines of code, the overhead is
>>>> 1 spin_try_lock()(fast path)  + N  __this_cpu_inc()
>>>>
>>>
>>> I'm assuming that calculation is no longer valid, considering that
>>> we just discussed how the per-cpu refcnt that you were using is quite
>>> unnecessary and can be removed.
>>>
>>> IIUC, the overhead with your code, as per above discussion would be:
>>> 1 spin_try_lock() [non-nested] + N read_lock(global_rwlock).
>>
>> https://github.com/laijs/linux/commit/46334544bb7961550b7065e015da76f6dab21f16
>>
>> Again, I'm so sorry the name "fallback_reader_refcnt" misled you.
>>
> 
> At this juncture I really have to admit that I don't understand your
> intentions at all. What are you really trying to prove? Without giving
> a single good reason why my code is inferior, why are you even bringing
> up the discussion about a complete rewrite of the synchronization code?
> http://article.gmane.org/gmane.linux.kerne

[PATCH V2] lglock: add read-preference local-global rwlock

2013-03-02 Thread Lai Jiangshan
>From 345a7a75c314ff567be48983e0892bc69c4452e7 Mon Sep 17 00:00:00 2001
From: Lai Jiangshan 
Date: Sat, 2 Mar 2013 20:33:14 +0800
Subject: [PATCH] lglock: add read-preference local-global rwlock

The current lglock is not read-preference, so it can't be used in some cases
where a read-preference rwlock is needed, for example get_cpu_online_atomic().

Although we could use an rwlock for these cases which need read-preference,
it leads to unnecessary cache-line bouncing even when there are no
writers present, which can slow down the system needlessly. It gets
worse as the number of CPUs grows; it does not scale.

So we turn to lglock. lglock is a read-write lock based on percpu locks,
but it is not read-preference due to its underlying percpu locks.

But what if we convert the percpu locks of lglock to use percpu rwlocks:

 CPU 0CPU 1
 --   --
1.spin_lock(&random_lock); read_lock(my_rwlock of CPU 1);
2.read_lock(my_rwlock of CPU 0);   spin_lock(&random_lock);

Writer:
 CPU 2:
 --
  for_each_online_cpu(cpu)
write_lock(my_rwlock of 'cpu');


Consider what happens if the writer begins his operation in between steps 1
and 2 at the reader side. It becomes evident that we end up in a (previously
non-existent) deadlock due to a circular locking dependency between the 3
entities, like this:


(holds  Waiting for
 random_lock) CPU 0 -> CPU 2  (holds my_rwlock of CPU 0
   for write)
   ^   |
   |   |
Waiting|   | Waiting
  for  |   |  for
   |   V
-- CPU 1 <--

(holds my_rwlock of
 CPU 1 for read)


So obviously this "straight-forward" way of implementing percpu rwlocks is
deadlock-prone. So we can't implement read-preference local-global rwlock
like this.


The implementation in this patch reuses the current lglock as the frontend to
achieve local read locking, reuses the global fallback rwlock as the backend to
achieve read-preference, and uses a percpu reader counter to indicate 1) the
depth of the nested reader locks and 2) whether the outmost lock is the percpu
lock or the fallback rwlock.

The algorithm is simple. On the read site:
If it is a nested reader, just increase the counter.
If it is the outmost reader,
1) try to lock this cpu's lock of the frontend lglock.
   (reader count += 1 on success)
2) if the above step fails, read_lock(&fallback_rwlock).
   (reader count += FALLBACK_BASE + 1)

Write site:
Do the lg_global_lock() of the frontend lglock.
And then write_lock(&fallback_rwlock).
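
For illustration, the read-site fast path then looks roughly like the code
below (the same function is quoted in the review replies later in this
thread; lockdep annotation kept as in the patch, and the fallback case stores
FALLBACK_BASE as the whole counter value):

#define FALLBACK_BASE	(1UL << 30)

void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
{
	struct lglock *lg = &lgrw->lglock;

	preempt_disable();
	if (likely(!__this_cpu_read(*lgrw->reader_refcnt))) {
		rwlock_acquire_read(&lg->lock_dep_map, 0, 0, _RET_IP_);
		/* outmost reader: try this cpu's lock of the frontend lglock */
		if (unlikely(!arch_spin_trylock(this_cpu_ptr(lg->lock)))) {
			/* the percpu lock is taken: fall back to the rwlock */
			read_lock(&lgrw->fallback_rwlock);
			__this_cpu_write(*lgrw->reader_refcnt, FALLBACK_BASE);
			return;
		}
	}

	/* nested reader, or outmost reader holding the percpu lock */
	__this_cpu_inc(*lgrw->reader_refcnt);
}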


Proof:
1) reader-writer exclusion:
The write site must acquire all percpu locks and the fallback_rwlock.
The outmost read site must acquire one of these locks.

2) read-preference:
Before the write site finishes acquiring its locks, the read site at least
wins at read_lock(&fallback_rwlock), because rwlock is read-preference.

3) read-site functions are irqsafe (reentrance-safe)
   (read-site functions are not protected by disabled irqs, but they are irqsafe)
If a read-site function is interrupted at any point and the read site is
reentered, the reentered read site will not be misled by the first read site.
If the reader counter > 0, it means the frontend (this cpu's lock of the
lglock) or the backend (fallback rwlock) is currently held, so it is safe to
act as a nested reader.
If the reader counter == 0, the reentered reader considers itself the
outmost read site, and it always succeeds after the write side releases the
lock (even if the interrupted read site has already taken the cpu lock of the
lglock or the fallback_rwlock).
A reentered read site only calls arch_spin_trylock(), read_lock()
and __this_cpu_op(); arch_spin_trylock() and read_lock() are already
reentrance-safe.
Although __this_cpu_op() is not reentrance-safe, the value of the counter
is restored after the interrupt finishes, so the read-site functions
are still reentrance-safe.
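
Correspondingly, a sketch of the read-unlock path (following the V2 code
quoted in the replies below; the counter value tells whether the outmost
lock was the percpu lock or the fallback rwlock):

void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
{
	switch (__this_cpu_read(*lgrw->reader_refcnt)) {
	case 1:
		/* outmost reader that took the percpu lock */
		__this_cpu_write(*lgrw->reader_refcnt, 0);
		lg_local_unlock(&lgrw->lglock);	/* releases lockdep + preempt */
		return;
	case FALLBACK_BASE:
		/* outmost reader that took the fallback rwlock */
		__this_cpu_write(*lgrw->reader_refcnt, 0);
		read_unlock(&lgrw->fallback_rwlock);
		rwlock_release(&lgrw->lglock.lock_dep_map, 1, _RET_IP_);
		break;
	default:
		/* nested reader: only decrease the depth */
		__this_cpu_dec(*lgrw->reader_refcnt);
		break;
	}

	preempt_enable();
}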


Performance:
We only focus on the performance of the read site. The read site's fast path
is just preempt_disable() + __this_cpu_read/inc() + arch_spin_trylock();
it has only one heavy memory operation, so it is expected to be fast.

We test three locks.
1) traditional rwlock WITHOUT remote competition nor cache-bouncing.(opt-rwlock)
2) this lock(lgrwlock)
3) V6 percpu-rwlock by "Srivatsa S. Bhat". (percpu-rwlock)
   (https://lkml.org/lkml/2013/2/18/186)

                nested=1 (no nesting)   nested=2    nested=4
opt-rwlock            517181            1009200     2010027
lgrwlock              452897             700026     1201415
percpu-rwlock        1192955            1451343     1951757

The value is the time(nano-seco

Re: [PATCH] lglock: add read-preference local-global rwlock

2013-03-02 Thread Lai Jiangshan
On 02/03/13 02:28, Oleg Nesterov wrote:
> Lai, I didn't read this discussion except the code posted by Michel.
> I'll try to read this patch carefully later, but I'd like to ask
> a couple of questions.
> 
> This version looks more complex than Michel's, why? Just curious, I
> am trying to understand what I missed. See
> http://marc.info/?l=linux-kernel&m=136196350213593

Michel changed my old draft version a little; his version is good enough for me.
My new version tries to add slightly better nesting support with only a
single extra __this_cpu_op() in _read_[un]lock().

> 
> And I can't understand FALLBACK_BASE...
> 
> OK, suppose that CPU_0 does _write_unlock() and releases ->fallback_rwlock.
> 
> CPU_1 does _read_lock(), and ...
> 
>> +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
>> +{
>> +struct lglock *lg = &lgrw->lglock;
>> +
>> +preempt_disable();
>> +rwlock_acquire_read(&lg->lock_dep_map, 0, 0, _RET_IP_);
>> +if (likely(!__this_cpu_read(*lgrw->reader_refcnt))) {
>> +if (!arch_spin_trylock(this_cpu_ptr(lg->lock))) {
> 
> _trylock() fails,
> 
>> +read_lock(&lgrw->fallback_rwlock);
>> +__this_cpu_add(*lgrw->reader_refcnt, FALLBACK_BASE);
> 
> so we take ->fallback_rwlock and ->reader_refcnt == FALLBACK_BASE.
> 
> CPU_0 does lg_global_unlock(lgrw->lglock) and finishes _write_unlock().
> 
> Interrupt handler on CPU_1 does _read_lock() notices ->reader_refcnt != 0
> and simply does this_cpu_inc(), so reader_refcnt == FALLBACK_BASE + 1.
> 
> Then irq does _read_unlock(), and
> 
>> +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
>> +{
>> +switch (__this_cpu_dec_return(*lgrw->reader_refcnt)) {
>> +case 0:
>> +lg_local_unlock(&lgrw->lglock);
>> +return;
>> +case FALLBACK_BASE:
>> +__this_cpu_sub(*lgrw->reader_refcnt, FALLBACK_BASE);
>> +read_unlock(&lgrw->fallback_rwlock);
> 
> hits this case?
> 
> Doesn't look right, but most probably I missed something.

You are right, I just realized that I had split up code which should be atomic.

I hope this patch (V2) can get more reviews.

My first and many locking knowledge is learned from Paul.
Paul, would you also review it?

Thanks,
Lai

> 
> Oleg.
> 


[PATCH] workqueue: fix possible bug which may silence the pool

2013-03-02 Thread Lai Jiangshan
After the introduction of multiple pools per cpu, part of the comments
in wq_unbind_fn() became wrong.

They say that the "current worker would trigger unbound chain execution".
That is wrong: the current worker only belongs to one of the multiple pools.

If wq_unbind_fn() unbinds the normal_pri pool (not the pool of the current
worker), the current worker is not an available worker to trigger unbound
chain execution of the normal_pri pool, and if all the workers of
the normal_pri pool go to sleep after they were set %WORKER_UNBOUND but before
they finish their current work, unbound chain execution is never triggered
at all. The pool is stopped!

We could change wq_unbind_fn() to unbind only one pool and launch multiple
wq_unbind_fn()s, one for each pool, to solve the problem.
But this change would add much latency to the hotplug path unnecessarily.

So we choose to wake up a worker directly to trigger unbound chain execution.

The current worker may sleep on &second_pool->assoc_mutex, so we also move
the wakeup code into the loop to avoid the second_pool silencing the first_pool.

Signed-off-by: Lai Jiangshan 
---
 kernel/workqueue.c |   45 ++---
 1 files changed, 26 insertions(+), 19 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 81f2457..03159c2 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3446,28 +3446,35 @@ static void wq_unbind_fn(struct work_struct *work)
 
spin_unlock_irq(&pool->lock);
mutex_unlock(&pool->assoc_mutex);
-   }
 
-   /*
-* Call schedule() so that we cross rq->lock and thus can guarantee
-* sched callbacks see the %WORKER_UNBOUND flag.  This is necessary
-* as scheduler callbacks may be invoked from other cpus.
-*/
-   schedule();
+   /*
+* Call schedule() so that we cross rq->lock and thus can
+* guarantee sched callbacks see the %WORKER_UNBOUND flag.
+* This is necessary as scheduler callbacks may be invoked
+* from other cpus.
+*/
+   schedule();
 
-   /*
-* Sched callbacks are disabled now.  Zap nr_running.  After this,
-* nr_running stays zero and need_more_worker() and keep_working()
-* are always true as long as the worklist is not empty.  Pools on
-* @cpu now behave as unbound (in terms of concurrency management)
-* pools which are served by workers tied to the CPU.
-*
-* On return from this function, the current worker would trigger
-* unbound chain execution of pending work items if other workers
-* didn't already.
-*/
-   for_each_std_worker_pool(pool, cpu)
+   /*
+* Sched callbacks are disabled now.  Zap nr_running.
+* After this, nr_running stays zero and need_more_worker()
+* and keep_working() are always true as long as the worklist
+* is not empty.  This pool now behave as unbound (in terms of
+* concurrency management) pool which are served by workers
+* tied to the pool.
+*/
atomic_set(&pool->nr_running, 0);
+
+   /* The current busy workers of this pool may goto sleep without
+* wake up any other worker after they were set %WORKER_UNBOUND
+* flag. Here we wake up another possible worker to start
+* the unbound chain execution of pending work items in this
+* case.
+*/
+   spin_lock_irq(&pool->lock);
+   wake_up_worker(pool);
+   spin_unlock_irq(&pool->lock);
+   }
 }
 
 /*
-- 
1.7.7.6



Re: [PATCH] workqueue: fix possible bug which may silence the pool

2013-03-04 Thread Lai Jiangshan
On 03/05/2013 03:20 AM, Tejun Heo wrote:
> Hello, Lai.
> 
> On Sat, Mar 02, 2013 at 11:55:29PM +0800, Lai Jiangshan wrote:
>> After we introduce multiple pools for cpu pools, a part of the comments
>> in wq_unbind_fn() becomes wrong.
>>
>> It said that "current worker would trigger unbound chain execution".
>> It is wrong. current worker only belongs to one of the multiple pools.
>>
>> If wq_unbind_fn() does unbind the normal_pri pool(not the pool of the current
>> worker), the current worker is not the available worker to trigger unbound
>> chain execution of the normal_pri pool, and if all the workers of
>> the normal_pri goto sleep after they were set %WORKER_UNBOUND but before
>> they finish their current work, unbound chain execution is not triggered
>> totally. The pool is stopped!
>>
>> We can change wq_unbind_fn() only does unbind one pool and we launch multiple
>> wq_unbind_fn()s, one for each pool to solve the problem.
>> But this change will add much latency to hotplug path unnecessarily.
>>
>> So we choice to wake up a worker directly to trigger unbound chain execution.
>>
>> current worker may sleep on &second_pool->assoc_mutex, so we also move
>> the wakeup code into the loop to avoid second_pool silences the first_pool.
>>
>> Signed-off-by: Lai Jiangshan 
> 
> Nice catch.
> 
>> @@ -3446,28 +3446,35 @@ static void wq_unbind_fn(struct work_struct *work)
>>  
>>  spin_unlock_irq(&pool->lock);
>>  mutex_unlock(&pool->assoc_mutex);
>> -}
>>  
>> -/*
>> - * Call schedule() so that we cross rq->lock and thus can guarantee
>> - * sched callbacks see the %WORKER_UNBOUND flag.  This is necessary
>> - * as scheduler callbacks may be invoked from other cpus.
>> - */
>> -schedule();
>> +/*
>> + * Call schedule() so that we cross rq->lock and thus can
>> + * guarantee sched callbacks see the %WORKER_UNBOUND flag.
>> + * This is necessary as scheduler callbacks may be invoked
>> + * from other cpus.
>> + */
>> +schedule();
>>  
>> -/*
>> - * Sched callbacks are disabled now.  Zap nr_running.  After this,
>> - * nr_running stays zero and need_more_worker() and keep_working()
>> - * are always true as long as the worklist is not empty.  Pools on
>> - * @cpu now behave as unbound (in terms of concurrency management)
>> - * pools which are served by workers tied to the CPU.
>> - *
>> - * On return from this function, the current worker would trigger
>> - * unbound chain execution of pending work items if other workers
>> - * didn't already.
>> - */
>> -for_each_std_worker_pool(pool, cpu)
>> +/*
>> + * Sched callbacks are disabled now.  Zap nr_running.
>> + * After this, nr_running stays zero and need_more_worker()
>> + * and keep_working() are always true as long as the worklist
>> + * is not empty.  This pool now behave as unbound (in terms of
>> + * concurrency management) pool which are served by workers
>> + * tied to the pool.
>> + */
>>  atomic_set(&pool->nr_running, 0);
>> +
>> +/* The current busy workers of this pool may goto sleep without
>> + * wake up any other worker after they were set %WORKER_UNBOUND
>> + * flag. Here we wake up another possible worker to start
>> + * the unbound chain execution of pending work items in this
>> + * case.
>> + */
>> +spin_lock_irq(&pool->lock);
>> +wake_up_worker(pool);
>> +spin_unlock_irq(&pool->lock);
>> +}
> 
> But can we please just addd wake_up_worker() in the
> for_each_std_worker_pool() loop?  

wake_up_worker() needs to be put in the same loop which sets %WORKER_UNBOUND.


mutex_lock(&pool->assoc_mutex);
// set %WORKER_UNBOUND for the normal_pri pool
mutex_unlock(&pool->assoc_mutex);

// no wakeup for the normal_pri pool,
// but all workers of the normal_pri pool go to sleep

// try to set %WORKER_UNBOUND for the high_pri pool
mutex_lock(&pool->assoc_mutex);
// waits forever here: the high_pri pool's manage_workers() is waiting
// on a memory allocation forever (waiting for the normal_pri pool to
// free memory, but the normal_pri pool is silenced)
mutex_unlock(&pool->assoc_mutex);


> We want to mark the patch for
> -stable and keep it short and to the point.  This patch is a couple
> times larger than necessary.
> 
> Thanks.
> 



Re: [PATCH 1/2] lockdep: introduce lock_acquire_exclusive/shared helper macros

2013-03-05 Thread Lai Jiangshan


On 05/03/13 10:17, Michel Lespinasse wrote:
> In lockdep.h, the spinlock/mutex/rwsem/rwlock/lock_map acquire macros
> have different definitions based on the value of CONFIG_PROVE_LOCKING.
> We have separate ifdefs for each of these definitions, which seems
> redundant.
> 
> Introduce lock_acquire_{exclusive,shared,shared_recursive} helpers
> which will have different definitions based on CONFIG_PROVE_LOCKING.
> Then all other helper macros can be defined based on the above ones,
> which reduces the amount of ifdefined code.
> 
> Signed-off-by: Michel Lespinasse 
> 
> ---
>  include/linux/lockdep.h | 92 
> +
>  1 file changed, 23 insertions(+), 69 deletions(-)
> 
> diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
> index f1e877b79ed8..cfc2f119779a 100644
> --- a/include/linux/lockdep.h
> +++ b/include/linux/lockdep.h
> @@ -365,7 +365,7 @@ extern void lockdep_trace_alloc(gfp_t mask);
>  
>  #define lockdep_recursing(tsk)   ((tsk)->lockdep_recursion)
>  
> -#else /* !LOCKDEP */
> +#else /* !CONFIG_LOCKDEP */
>  
>  static inline void lockdep_off(void)
>  {
> @@ -479,82 +479,36 @@ static inline void print_irqtrace_events(struct 
> task_struct *curr)
>   * on the per lock-class debug mode:
>   */
>  
> -#ifdef CONFIG_DEBUG_LOCK_ALLOC
> -# ifdef CONFIG_PROVE_LOCKING
> -#  define spin_acquire(l, s, t, i)   lock_acquire(l, s, t, 0, 2, 
> NULL, i)
> -#  define spin_acquire_nest(l, s, t, n, i)   lock_acquire(l, s, t, 0, 2, n, 
> i)
> -# else
> -#  define spin_acquire(l, s, t, i)   lock_acquire(l, s, t, 0, 1, 
> NULL, i)
> -#  define spin_acquire_nest(l, s, t, n, i)   lock_acquire(l, s, t, 0, 1, 
> NULL, i)
> -# endif
> -# define spin_release(l, n, i)   lock_release(l, n, i)
> +#ifdef CONFIG_PROVE_LOCKING
> + #define lock_acquire_exclusive(l, s, t, n, i)   lock_acquire(l, 
> s, t, 0, 2, n, i)
> + #define lock_acquire_shared(l, s, t, n, i)  lock_acquire(l, s, t, 
> 1, 2, n, i)
> + #define lock_acquire_shared_recursive(l, s, t, n, i)lock_acquire(l, 
> s, t, 2, 2, n, i)

Hi, Michel

I don't like the name lock_acquire_shared_recursive().
(I mean the name is wrong.)

In the lockdep design, lock_acquire(l, s, t, 2, 2, n, i) is used for
read-preference locks (rwlock) and all types of RCU, not for "recursive"
ones; read-preference implies "recursive".

But the name lock_acquire_shared_recursive() doesn't tell us it is
read-preference.

For example, if we had a lock which is write-preference but allows recursive
read_lock, it would still deadlock in this way; "recursive" does not help:

cpu0: spin_lock(a); recursiveable_read_lock(b)
cpu1: recursiveable_read_lock(b);   spin_lock(a);
cpu2:   write_lock(b);


I also noticed the lockdep annotation problem of lglock, and patch2 is good,
so for patch2: Reviewed-by: Lai Jiangshan  


Thanks,
Lai

>  #else
> -# define spin_acquire(l, s, t, i)do { } while (0)
> -# define spin_release(l, n, i)   do { } while (0)
> + #define lock_acquire_exclusive(l, s, t, n, i)   lock_acquire(l, 
> s, t, 0, 1, n, i)
> + #define lock_acquire_shared(l, s, t, n, i)  lock_acquire(l, s, t, 
> 1, 1, n, i)
> + #define lock_acquire_shared_recursive(l, s, t, n, i)lock_acquire(l, 
> s, t, 2, 1, n, i)
>  #endif
>  
> -#ifdef CONFIG_DEBUG_LOCK_ALLOC
> -# ifdef CONFIG_PROVE_LOCKING
> -#  define rwlock_acquire(l, s, t, i) lock_acquire(l, s, t, 0, 2, 
> NULL, i)
> -#  define rwlock_acquire_read(l, s, t, i)lock_acquire(l, s, t, 2, 2, 
> NULL, i)
> -# else
> -#  define rwlock_acquire(l, s, t, i) lock_acquire(l, s, t, 0, 1, 
> NULL, i)
> -#  define rwlock_acquire_read(l, s, t, i)lock_acquire(l, s, t, 2, 1, 
> NULL, i)
> -# endif
> -# define rwlock_release(l, n, i) lock_release(l, n, i)
> -#else
> -# define rwlock_acquire(l, s, t, i)  do { } while (0)
> -# define rwlock_acquire_read(l, s, t, i) do { } while (0)
> -# define rwlock_release(l, n, i) do { } while (0)
> -#endif
> +#define spin_acquire(l, s, t, i) lock_acquire_exclusive(l, s, t, 
> NULL, i)
> +#define spin_acquire_nest(l, s, t, n, i) lock_acquire_exclusive(l, s, t, 
> n, i)
> +#define spin_release(l, n, i)lock_release(l, n, i)
>  
> -#ifdef CONFIG_DEBUG_LOCK_ALLOC
> -# ifdef CONFIG_PROVE_LOCKING
> -#  define mutex_acquire(l, s, t, i)  lock_acquire(l, s, t, 0, 2, 
> NULL, i)
> -#  define mutex_acquire_nest(l, s, t, n, i)  lock_acquire(l, s, t, 0, 2, n, 
> i)
> -# else

Re: [PATCH V2] lglock: add read-preference local-global rwlock

2013-03-05 Thread Lai Jiangshan
On 03/03/13 01:20, Oleg Nesterov wrote:
> On 03/02, Lai Jiangshan wrote:
>>
>> +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
>> +{
>> +switch (__this_cpu_read(*lgrw->reader_refcnt)) {
>> +case 1:
>> +__this_cpu_write(*lgrw->reader_refcnt, 0);
>> +lg_local_unlock(&lgrw->lglock);
>> +return;
>> +case FALLBACK_BASE:
>> +__this_cpu_write(*lgrw->reader_refcnt, 0);
>> +read_unlock(&lgrw->fallback_rwlock);
>> +rwlock_release(&lg->lock_dep_map, 1, _RET_IP_);
> 
> I guess "case 1:" should do rwlock_release() too.

It is already done in "lg_local_unlock(&lgrw->lglock);" before it returns.
(I like to reuse old code.)
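
(For reference, lg_local_unlock() in kernel/lglock.c of that era does roughly
the following, which covers both the rwlock_release() and preempt_enable()
for the "case 1:" path:)

void lg_local_unlock(struct lglock *lg)
{
	lock_release(&lg->lock_dep_map, 1, _RET_IP_);
	arch_spin_unlock(this_cpu_ptr(lg->lock));
	preempt_enable();
}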

> 
> Otherwise, at first glance looks correct...
> 
> However, I still think that FALLBACK_BASE only adds the unnecessary
> complications. But even if I am right this is subjective of course, please
> feel free to ignore.

OK, I will kill FALLBACK_BASE in a later patch.

> 
> And btw, I am not sure about lg->lock_dep_map, perhaps we should use
> fallback_rwlock->dep_map ?

Using either one is OK.

> 
> We need rwlock_acquire_read() even in the fast-path, and this acquire_read
> should be paired with rwlock_acquire() in _write_lock(), but it does
> spin_acquire(lg->lock_dep_map). Yes, currently this is the same (afaics)
> but perhaps fallback_rwlock->dep_map would be more clean.
> 

I can't tell which one is better. I will try to use fallback_rwlock->dep_map later.

> Oleg.
> 


Re: [PATCH V2] lglock: add read-preference local-global rwlock

2013-03-05 Thread Lai Jiangshan
On 03/03/13 01:11, Srivatsa S. Bhat wrote:
> On 03/02/2013 06:44 PM, Lai Jiangshan wrote:
>> From 345a7a75c314ff567be48983e0892bc69c4452e7 Mon Sep 17 00:00:00 2001
>> From: Lai Jiangshan 
>> Date: Sat, 2 Mar 2013 20:33:14 +0800
>> Subject: [PATCH] lglock: add read-preference local-global rwlock
>>
>> Current lglock is not read-preference, so it can't be used on some cases
>> which read-preference rwlock can do. Example, get_cpu_online_atomic().
>>
> [...]
>> diff --git a/kernel/lglock.c b/kernel/lglock.c
>> index 6535a66..52e9b2c 100644
>> --- a/kernel/lglock.c
>> +++ b/kernel/lglock.c
>> @@ -87,3 +87,71 @@ void lg_global_unlock(struct lglock *lg)
>>  preempt_enable();
>>  }
>>  EXPORT_SYMBOL(lg_global_unlock);
>> +
>> +#define FALLBACK_BASE   (1UL << 30)
>> +
>> +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
>> +{
>> +struct lglock *lg = &lgrw->lglock;
>> +
>> +preempt_disable();
>> +if (likely(!__this_cpu_read(*lgrw->reader_refcnt))) {
>> +rwlock_acquire_read(&lg->lock_dep_map, 0, 0, _RET_IP_);
>> +if (unlikely(!arch_spin_trylock(this_cpu_ptr(lg->lock {
>> +read_lock(&lgrw->fallback_rwlock);
>> +__this_cpu_write(*lgrw->reader_refcnt, FALLBACK_BASE);
>> +return;
>> +}
>> +}
>> +
>> +__this_cpu_inc(*lgrw->reader_refcnt);
>> +}
>> +EXPORT_SYMBOL(lg_rwlock_local_read_lock);
>> +
>> +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
>> +{
>> +switch (__this_cpu_read(*lgrw->reader_refcnt)) {
>> +case 1:
>> +__this_cpu_write(*lgrw->reader_refcnt, 0);
>> +lg_local_unlock(&lgrw->lglock);
>> +return;
> 
> This should be a break, instead of a return, right?
> Otherwise, there will be a preempt imbalance...


"lockdep" and "preempt" are handled in lg_local_unlock(&lgrw->lglock);

Thanks,
Lai

> 
>> +case FALLBACK_BASE:
>> +__this_cpu_write(*lgrw->reader_refcnt, 0);
>> +read_unlock(&lgrw->fallback_rwlock);
>> +rwlock_release(&lg->lock_dep_map, 1, _RET_IP_);
>> +break;
>> +default:
>> +__this_cpu_dec(*lgrw->reader_refcnt);
>> +break;
>> +}
>> +
>> +preempt_enable();
>> +}
> 
> 
> Regards,
> Srivatsa S. Bhat
> 


Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks

2013-03-05 Thread Lai Jiangshan
On 02/03/13 03:47, Srivatsa S. Bhat wrote:
> On 03/01/2013 11:20 PM, Lai Jiangshan wrote:
>> On 28/02/13 05:19, Srivatsa S. Bhat wrote:
>>> On 02/27/2013 06:03 AM, Lai Jiangshan wrote:
>>>> On Wed, Feb 27, 2013 at 3:30 AM, Srivatsa S. Bhat
>>>>  wrote:
>>>>> On 02/26/2013 09:55 PM, Lai Jiangshan wrote:
>>>>>> On Tue, Feb 26, 2013 at 10:22 PM, Srivatsa S. Bhat
>>>>>>  wrote:
>>>>>>>
>>>>>>> Hi Lai,
>>>>>>>
>>>>>>> I'm really not convinced that piggy-backing on lglocks would help
>>>>>>> us in any way. But still, let me try to address some of the points
>>>>>>> you raised...
>>>>>>>
>>>>>>> On 02/26/2013 06:29 PM, Lai Jiangshan wrote:
>>>>>>>> On Tue, Feb 26, 2013 at 5:02 PM, Srivatsa S. Bhat
>>>>>>>>  wrote:
>>>>>>>>> On 02/26/2013 05:47 AM, Lai Jiangshan wrote:
>>>>>>>>>> On Tue, Feb 26, 2013 at 3:26 AM, Srivatsa S. Bhat
>>>>>>>>>>  wrote:
>>>>>>>>>>> Hi Lai,
>>>>>>>>>>>
>>>>>>>>>>> On 02/25/2013 09:23 PM, Lai Jiangshan wrote:
>>>>>>>>>>>> Hi, Srivatsa,
>>>>>>>>>>>>
>>>>>>>>>>>> The target of the whole patchset is nice for me.
>>>>>>>>>>>
>>>>>>>>>>> Cool! Thanks :-)
>>>>>>>>>>>
>>>>>>>>> [...]
>>>>>>>>>
>>>>>>>>> Unfortunately, I see quite a few issues with the code above. IIUC, the
>>>>>>>>> writer and the reader both increment the same counters. So how will 
>>>>>>>>> the
>>>>>>>>> unlock() code in the reader path know when to unlock which of the 
>>>>>>>>> locks?
>>>>>>>>
>>>>>>>> The same as your code, the reader(which nested in write C.S.) just dec
>>>>>>>> the counters.
>>>>>>>
>>>>>>> And that works fine in my case because the writer and the reader update
>>>>>>> _two_ _different_ counters.
>>>>>>
>>>>>> I can't find any magic in your code, they are the same counter.
>>>>>>
>>>>>> /*
>>>>>>  * It is desirable to allow the writer to acquire the 
>>>>>> percpu-rwlock
>>>>>>  * for read (if necessary), without deadlocking or getting 
>>>>>> complaints
>>>>>>  * from lockdep. To achieve that, just increment the 
>>>>>> reader_refcnt of
>>>>>>  * this CPU - that way, any attempt by the writer to acquire the
>>>>>>  * percpu-rwlock for read, will get treated as a case of nested 
>>>>>> percpu
>>>>>>  * reader, which is safe, from a locking perspective.
>>>>>>  */
>>>>>> this_cpu_inc(pcpu_rwlock->rw_state->reader_refcnt);
>>>>>>
>>>>>
>>>>> Whoa! Hold on, were you really referring to _this_ increment when you said
>>>>> that, in your patch you would increment the refcnt at the writer? Then I 
>>>>> guess
>>>>> there is a major disconnect in our conversations. (I had assumed that you 
>>>>> were
>>>>> referring to the update of writer_signal, and were just trying to have a 
>>>>> single
>>>>> refcnt instead of reader_refcnt and writer_signal).
>>>>
>>>> https://github.com/laijs/linux/commit/53e5053d5b724bea7c538b11743d0f420d98f38d
>>>>
>>>> Sorry the name "fallback_reader_refcnt" misled you.
>>>>
>>> [...]
>>>
>>>>>> All I was considered is "nested reader is seldom", so I always
>>>>>> fallback to rwlock when nested.
>>>>>> If you like, I can add 6 lines of code, the overhead is
>>>>>> 1 spin_try_lock()(fast path)  + N  __this_cpu_inc()
>>>>>>
>>>>>
>>>>> I'm assuming that calculation is no longer valid, considering that
>>>>> we just discu

Re: [PATCH] lglock: add read-preference local-global rwlock

2013-03-05 Thread Lai Jiangshan
On 03/03/13 01:06, Oleg Nesterov wrote:
> On 03/02, Michel Lespinasse wrote:
>>
>> My version would be slower if it needs to take the
>> slow path in a reentrant way, but I'm not sure it matters either :)
> 
> I'd say, this doesn't matter at all, simply because this can only happen
> if we race with the active writer.
> 

It can also happen when interrupted. (still very rarely)

arch_spin_trylock()
    ---> interrupted,
         __this_cpu_read() returns 0.
         arch_spin_trylock() fails
         slowpath; any nested reader will take the slowpath too.
         ...
         ..._read_unlock()
    <--- interrupt returns
__this_cpu_inc()



I saw that get_online_cpu_atomic() is called very frequently.
The above situation happens rarely on a single CPU, but how often does it
happen in the whole system if we have 4096 CPUs?
(I worry too much. I tend to remove FALLBACK_BASE now; we should
add it only after we have proved we need it, and that is not proved yet.)

Thanks,
Lai




Re: [PATCH 0/9] sched: Shrink include/linux/sched.h

2013-03-05 Thread Lai Jiangshan
On Tue, Mar 5, 2013 at 4:05 PM, Li Zefan  wrote:
> While working of a cgroup patch which also touched include/linux/sched.h,
> I found some function/macro/structure declarations can be moved to
> kernel/sched/sched.h, and some can even be total removed, so here's
> the patchset.
>
> The result is a reduction of ~200 LOC from include/linux/sched.h.

It looks good to me.
Acked-by: Lai Jiangshan 

>
> 0001-sched-Remove-some-dummpy-functions.patch
> 0002-sched-Remove-test_sd_parent.patch
> 0003-sched-Move-SCHED_LOAD_SHIFT-macros-to-kernel-sched-s.patch
> 0004-sched-Move-struct-sched_group-to-kernel-sched-sched..patch
> 0005-sched-Move-wake-flags-to-kernel-sched-sched.h.patch
> 0006-sched-Move-struct-sched_class-to-kernel-sched-sched..patch
> 0007-sched-Make-default_scale_freq_power-static.patch
> 0008-sched-Move-group-scheduling-functions-out-of-include.patch
> 0009-sched-Remove-double-declaration-of-root_task_group.patch
>
> --
>  include/linux/sched.h | 194 
> +-
>  kernel/sched/core.c   |  14 ++--
>  kernel/sched/fair.c   |   6 +-
>  kernel/sched/sched.h  | 159 +++--
>  4 files changed, 168 insertions(+), 205 deletions(-)
>


Re: af_unix udev startup regression

2013-04-07 Thread Lai Jiangshan
On 04/05/2013 02:03 AM, Linus Torvalds wrote:
> [ Fixed odd legacy subject line that has nothing to do with the actual bug ]
> 
> Hmm. Can you double-check and verify that reverting that commit makes
> things work again for you?

Reverting 14134f6584212d585b310ce95428014b653dfaf6 works.

14134f6584212d585b310ce95428014b653dfaf6 is already reverted upstream.
(And sorry for the late reply.)

> 
> Also, what's your distribution and setup? 

Fedora 16

Thanks,
Lai

> I'd like this to get
> verified, just to see that it's not some timing-dependent thing or a
> bisection mistake, but if so, then the LSB test-cases obviously have
> to be fixed, and the commit that causes the problem needs to be
> reverted. Test-cases count for nothing compared to actual users.
> 
>         Linus
> 
> On Thu, Apr 4, 2013 at 9:17 AM, Lai Jiangshan  wrote:
>> Hi, ALL
>>
>> I also encountered the same problem.
>>
>> git bisect:
>>
>> 14134f6584212d585b310ce95428014b653dfaf6 is the first bad commit
>> commit 14134f6584212d585b310ce95428014b653dfaf6
>> Author: dingtianhong 
>> Date:   Mon Mar 25 17:02:04 2013 +
>>
>> af_unix: dont send SCM_CREDENTIAL when dest socket is NULL
>>
>> SCM_SCREDENTIALS should apply to write() syscalls only either source or
>> destination
>> socket asserted SOCK_PASSCRED. The original implememtation in
>> maybe_add_creds is wrong,
>> and breaks several LSB testcases ( i.e.
>> /tset/LSB.os/netowkr/recvfrom/T.recvfrom).
>>
>> Origionally-authored-by: Karel Srot 
>> Signed-off-by: Ding Tianhong 
>> Acked-by: Eric Dumazet 
>> Signed-off-by: David S. Miller 
>>
>> :04 04 ef0356cc0fc168a39c0f94cff0ba27c46c4d0048
>> ae34e59f235c379f04d6145f0103cccd5b3a307a M net
>>
>> ===
>> Like Brian Gerst, no obvious bug, but the system can't boot, "service udev
>> start" fails when boot
>> (also DEBUG_PAGEALLOC=n, I did not try to test with it=y)
>>
>> [   11.022976] systemd[1]: udev-control.socket failed to listen on sockets:
>> Address already in use
>> [   11.023293] systemd[1]: Unit udev-control.socket entered failed state.
>> [   11.182478] systemd-readahead-replay[399]: Bumped block_nr parameter of
>> 8:16 to 16384. This is a temporary hack and should be removed one day.
>> [   14.473283] udevd[410]: bind failed: Address already in use
>> [   14.478630] udevd[410]: error binding udev control socket
>> [   15.201158] systemd[1]: udev.service: main process exited, code=exited,
>> status=1
>> [   16.900792] udevd[427]: error binding udev control socket
>> [   18.356484] EXT4-fs (sdb7): re-mounted. Opts: (null)
>> [   19.738401] systemd[1]: udev.service holdoff time over, scheduling
>> restart.
>> [   19.742494] systemd[1]: Job pending for unit, delaying automatic restart.
>> [   19.747764] systemd[1]: Unit udev.service entered failed state.
>> [   19.752303] systemd[1]: udev-control.socket failed to listen on sockets:
>> Address already in use
>> [   19.770723] udevd[459]: bind failed: Address already in use
>> [   19.771027] udevd[459]: error binding udev control socket
>> [   19.771175] udevd[459]: error binding udev control socket
>> [   19.813256] systemd[1]: udev.service: main process exited, code=exited,
>> status=1
>> [   19.914450] systemd[1]: udev.service holdoff time over, scheduling
>> restart.
>> [   19.918374] systemd[1]: Job pending for unit, delaying automatic restart.
>> [   19.923392] systemd[1]: Unit udev.service entered failed state.
>> [   19.923808] systemd[1]: udev-control.socket failed to listen on sockets:
>> Address already in use
>> [   19.943792] udevd[465]: bind failed: Address already in use
>> [   19.944056] udevd[465]: error binding udev control socket
>> [   19.944210] udevd[465]: error binding udev control socket
>> [   19.946071] systemd[1]: udev.service: main process exited, code=exited,
>> status=1
>> [   20.047524] systemd[1]: udev.service holdoff time over, scheduling
>> restart.
>> [   20.051939] systemd[1]: Job pending for unit, delaying automatic restart.
>> [   20.057539] systemd[1]: Unit udev.service entered failed state.
>> [   20.058069] systemd[1]: udev-control.socket failed to listen on sockets:
>> Address already in use
>> [   20.081141] udevd[467]: bind failed: Address already in use
>> [   20.087120] udevd[467]: error binding udev control socket
>> [   20.092040] udevd[467]: error binding udev control socket
>> [   20.096519] systemd[1]: udev.service: main process exited, code=exited,
>>

Re: [PATCH 2/3] kernel/SRCU: provide a static initializer

2013-04-08 Thread Lai Jiangshan
On 04/08/2013 06:03 PM, Sebastian Andrzej Siewior wrote:
> On 04/05/2013 09:21 AM, Lai Jiangshan wrote:
>> Hi, Sebastian
> 
> Hi Lai,
> 
>> I don't want to expose __SRCU_STRUCT_INIT(),
>> due to it has strong coupling with the percpu array.
>>
>> I hope other structure which uses SRCU should use init_srcu_struct().
> 
> I need a static initialization for this kind. Patch #3 shows one
> example I have another one pending for crypto.

If the percpu array could be defined in __SRCU_STRUCT_INIT(),
I'd be happy to expose it, but currently it can't be.

Why can't crypto use boot-time initialization?

> Do you have any idea how I could get it done without this? Do you want
> to move/merge header files?

If crypto has to use static initialization, I will find some way,
or use your patch.

Thanks,
Lai

> 
>>
>> Thanks,
>> Lai
> 
> Sebastian
> 



Re: [PATCH 2/3] kernel/SRCU: provide a static initializer

2013-04-11 Thread Lai Jiangshan
On 04/12/2013 01:04 AM, Sebastian Andrzej Siewior wrote:
> * Lai Jiangshan | 2013-04-09 09:09:56 [+0800]:
> 
>> If the percpu array can be defined in __SRCU_STRUCT_INIT(),
>> I'm happy to expose it. but it is not currently.
> 
> I have no idea how to achieve this.
> 
>> Why crypto can't use boot time initialization?
> 
> It would require something like this:
> --- linux-stable.orig/crypto/Kconfig
> +++ linux-stable/crypto/Kconfig
> @@ -13,7 +13,7 @@ source "crypto/async_tx/Kconfig"
>  # Cryptographic API Configuration
>  #
>  menuconfig CRYPTO
> -   tristate "Cryptographic API"
> +   bool "Cryptographic API"
> help
>   This option provides the core Cryptographic API.

Why convert to "bool"?
srcu_init_notifier_head() can be called in module-load-time.
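
For example, roughly (with an illustrative chain name):

static struct srcu_notifier_head example_chain;

static int __init example_init(void)
{
	srcu_init_notifier_head(&example_chain);
	return 0;
}
module_init(example_init);

static void __exit example_exit(void)
{
	srcu_cleanup_notifier_head(&example_chain);
}
module_exit(example_exit);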

> 
> --- linux-stable.orig/crypto/api.c
> +++ linux-stable/crypto/api.c
> @@ -34,6 +34,13 @@ EXPORT_SYMBOL_GPL(crypto_alg_sem);
>  struct srcu_notifier_head crypto_chain;
>  EXPORT_SYMBOL_GPL(crypto_chain);
> 
> +static int __init crypto_api_init(void)
> +{
> +   srcu_init_notifier_head(&crypto_chain);
> +   return 0;
> +}
> +core_initcall(crypto_api_init);
> +
>  static inline struct crypto_alg *crypto_alg_get(struct crypto_alg *alg)
>  {
> atomic_inc(&alg->cra_refcnt);

And again, why can't crypto use boot-time or module-load-time initialization?

> 
> and there is no need for this.
> 
>>> Do you have any idea how I could get it done without this? Do you want
>>> to move/merge header files?
>>
>> if crypto has to use static initialization, I will find out some way
>> or use your patch.
> 
> The crypto would like this:
> 
> Subject: crypto: Convert crypto notifier chain to SRCU
> From: Peter Zijlstra 
> Date: Fri, 05 Oct 2012 09:03:24 +0100
> 
> The crypto notifier deadlocks on RT. Though this can be a real deadlock
> on mainline as well due to fifo fair rwsems.
> 
> The involved parties here are:
> 
> [   82.172678] swapper/0   S 0001 0 1  0 
> 0x
> [   82.172682]  88042f18fcf0 0046 88042f18fc80 
> 81491238
> [   82.172685]  00011cc0 00011cc0 88042f18c040 
> 88042f18ffd8
> [   82.172688]  00011cc0 00011cc0 88042f18ffd8 
> 00011cc0
> [   82.172689] Call Trace:
> [   82.172697]  [] ? _raw_spin_unlock_irqrestore+0x6c/0x7a
> [   82.172701]  [] schedule+0x64/0x66
> [   82.172704]  [] schedule_timeout+0x27/0xd0
> [   82.172708]  [] ? unpin_current_cpu+0x1a/0x6c
> [   82.172713]  [] ? migrate_enable+0x12f/0x141
> [   82.172716]  [] wait_for_common+0xbb/0x11f
> [   82.172719]  [] ? try_to_wake_up+0x182/0x182
> [   82.172722]  [] 
> wait_for_completion_interruptible+0x1d/0x2e
> [   82.172726]  [] crypto_wait_for_test+0x49/0x6b
> [   82.172728]  [] crypto_register_alg+0x53/0x5a
> [   82.172730]  [] crypto_register_algs+0x33/0x72
> [   82.172734]  [] ? aes_init+0x12/0x12
> [   82.172737]  [] aesni_init+0x64/0x66
> [   82.172741]  [] do_one_initcall+0x7f/0x13b
> [   82.172744]  [] kernel_init+0x199/0x22c
> [   82.172747]  [] ? loglevel+0x31/0x31
> [   82.172752]  [] kernel_thread_helper+0x4/0x10
> [   82.172755]  [] ? retint_restore_args+0x13/0x13
> [   82.172759]  [] ? start_kernel+0x3ca/0x3ca
> [   82.172761]  [] ? gs_change+0x13/0x13
> 
> [   82.174186] cryptomgr_test  S 0001 041  2 
> 0x
> [   82.174189]  88042c971980 0046 81d74830 
> 0292
> [   82.174192]  00011cc0 00011cc0 88042c96eb80 
> 88042c971fd8
> [   82.174195]  00011cc0 00011cc0 88042c971fd8 
> 00011cc0
> [   82.174195] Call Trace:
> [   82.174198]  [] schedule+0x64/0x66
> [   82.174201]  [] schedule_timeout+0x27/0xd0
> [   82.174204]  [] ? unpin_current_cpu+0x1a/0x6c
> [   82.174206]  [] ? migrate_enable+0x12f/0x141
> [   82.174209]  [] wait_for_common+0xbb/0x11f
> [   82.174212]  [] ? try_to_wake_up+0x182/0x182
> [   82.174215]  [] 
> wait_for_completion_interruptible+0x1d/0x2e
> [   82.174218]  [] cryptomgr_notify+0x280/0x385
> [   82.174221]  [] notifier_call_chain+0x6b/0x98
> [   82.174224]  [] ? rt_down_read+0x10/0x12
> [   82.174227]  [] __blocking_notifier_call_chain+0x70/0x8d
> [   82.174230]  [] blocking_notifier_call_chain+0x14/0x16
> [   82.174234]  [] crypto_probing_notify+0x24/0x50
> [   82.174236]  [] crypto_alg_mod_lookup+0x3e/0x74
> [   82.174238]  [] crypto_alloc_base+0x36/0x8f
> [   82.174241]  [] cryptd_alloc_ablkcipher+0x6e/0xb5
> [   82.174243]  [] ? kzalloc.clone.5+0xe/0x10

[PATCH 0/8] workqueue: advance concurrency management

2013-04-14 Thread Lai Jiangshan
I found that the early increase of nr_running in wq_worker_waking_up() is useless
in many cases. It tries to avoid waking up idle workers for pending work items,
but delaying the increase of nr_running does not result in more wakeups of idle workers.

So we delay the increase, remove wq_worker_waking_up() and ...

enjoy a simpler concurrency management.

Lai Jiangshan (8):
  workqueue: remove @cpu from wq_worker_sleeping()
  workqueue: use create_and_start_worker() in manage_workers()
  workqueue: remove cpu_intensive from process_one_work()
  workqueue: quit cm mode when sleeping
  workqueue: remove disabled wq_worker_waking_up()
  workqueue: make nr_running non-atomic
  workqueue: move worker->flags up
  workqueue: rename ->nr_running to ->nr_cm_workers

 kernel/sched/core.c |6 +-
 kernel/workqueue.c  |  234 +++---
 kernel/workqueue_internal.h |9 +-
 3 files changed, 89 insertions(+), 160 deletions(-)

-- 
1.7.7.6



[PATCH 1/8] workqueue: remove @cpu from wq_worker_sleeping()

2013-04-14 Thread Lai Jiangshan
The WARN_ON_ONCE(cpu != raw_smp_processor_id()) in
wq_worker_sleeping() is useless: the caller ensures
cpu == raw_smp_processor_id().

We should use WARN_ON_ONCE(pool->cpu != raw_smp_processor_id())
to do the expected test.

As a result, @cpu is removed from wq_worker_sleeping().

Signed-off-by: Lai Jiangshan 
---
 kernel/sched/core.c |2 +-
 kernel/workqueue.c  |7 +++
 kernel/workqueue_internal.h |2 +-
 3 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 23606ee..ffc06ad 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2907,7 +2907,7 @@ need_resched:
if (prev->flags & PF_WQ_WORKER) {
struct task_struct *to_wakeup;
 
-   to_wakeup = wq_worker_sleeping(prev, cpu);
+   to_wakeup = wq_worker_sleeping(prev);
if (to_wakeup)
try_to_wake_up_local(to_wakeup);
}
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c273376..b3095ad 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -807,7 +807,6 @@ void wq_worker_waking_up(struct task_struct *task, int cpu)
 /**
  * wq_worker_sleeping - a worker is going to sleep
  * @task: task going to sleep
- * @cpu: CPU in question, must be the current CPU number
  *
  * This function is called during schedule() when a busy worker is
  * going to sleep.  Worker on the same cpu can be woken up by
@@ -817,9 +816,9 @@ void wq_worker_waking_up(struct task_struct *task, int cpu)
  * spin_lock_irq(rq->lock)
  *
  * RETURNS:
- * Worker task on @cpu to wake up, %NULL if none.
+ * Worker task on the same pool to wake up, %NULL if none.
  */
-struct task_struct *wq_worker_sleeping(struct task_struct *task, int cpu)
+struct task_struct *wq_worker_sleeping(struct task_struct *task)
 {
struct worker *worker = kthread_data(task), *to_wakeup = NULL;
struct worker_pool *pool;
@@ -835,7 +834,7 @@ struct task_struct *wq_worker_sleeping(struct task_struct 
*task, int cpu)
pool = worker->pool;
 
/* this can only happen on the local cpu */
-   if (WARN_ON_ONCE(cpu != raw_smp_processor_id()))
+   if (WARN_ON_ONCE(pool->cpu != raw_smp_processor_id()))
return NULL;
 
/*
diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
index 84ab6e1..aec8df4 100644
--- a/kernel/workqueue_internal.h
+++ b/kernel/workqueue_internal.h
@@ -57,6 +57,6 @@ static inline struct worker *current_wq_worker(void)
  * sched.c and workqueue.c.
  */
 void wq_worker_waking_up(struct task_struct *task, int cpu);
-struct task_struct *wq_worker_sleeping(struct task_struct *task, int cpu);
+struct task_struct *wq_worker_sleeping(struct task_struct *task);
 
 #endif /* _KERNEL_WORKQUEUE_INTERNAL_H */
-- 
1.7.7.6



[PATCH 2/8] workqueue: use create_and_start_worker() in manage_workers()

2013-04-14 Thread Lai Jiangshan
After we have allocated a worker, we are free to access the worker without any
protection before it is visible/published.

In the old code, the worker was published by start_worker() and was visible only
after start_worker(); but in the current code, it becomes visible to
for_each_pool_worker() right after
"idr_replace(&pool->worker_idr, worker, worker->id);"

This means the step of publishing the worker is not atomic, which is very fragile
(although I did not find any bug from it in the current code). It should be fixed.

It can be fixed by moving "idr_replace(&pool->worker_idr, worker, worker->id);"
into start_worker() or by folding start_worker() into create_worker().

I chose the second one. It makes the code much simpler.

Signed-off-by: Lai Jiangshan 
---
 kernel/workqueue.c |   62 +++
 1 files changed, 18 insertions(+), 44 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index b3095ad..d1e10c5 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -64,7 +64,7 @@ enum {
 *
 * Note that DISASSOCIATED should be flipped only while holding
 * manager_mutex to avoid changing binding state while
-* create_worker() is in progress.
+* create_and_start_worker_locked() is in progress.
 */
POOL_MANAGE_WORKERS = 1 << 0,   /* need to manage workers */
POOL_DISASSOCIATED  = 1 << 2,   /* cpu can't serve workers */
@@ -1542,7 +1542,10 @@ static void worker_enter_idle(struct worker *worker)
 (worker->hentry.next || worker->hentry.pprev)))
return;
 
-   /* can't use worker_set_flags(), also called from start_worker() */
+   /*
+* can't use worker_set_flags(), also called from
+* create_and_start_worker_locked().
+*/
worker->flags |= WORKER_IDLE;
pool->nr_idle++;
worker->last_active = jiffies;
@@ -1663,12 +1666,10 @@ static struct worker *alloc_worker(void)
 }
 
 /**
- * create_worker - create a new workqueue worker
+ * create_and_start_worker_locked - create and start a worker for a pool
  * @pool: pool the new worker will belong to
  *
- * Create a new worker which is bound to @pool.  The returned worker
- * can be started by calling start_worker() or destroyed using
- * destroy_worker().
+ * Create a new worker which is bound to @pool and start it.
  *
  * CONTEXT:
  * Might sleep.  Does GFP_KERNEL allocations.
@@ -1676,7 +1677,7 @@ static struct worker *alloc_worker(void)
  * RETURNS:
  * Pointer to the newly created worker.
  */
-static struct worker *create_worker(struct worker_pool *pool)
+static struct worker *create_and_start_worker_locked(struct worker_pool *pool)
 {
struct worker *worker = NULL;
int id = -1;
@@ -1734,9 +1735,15 @@ static struct worker *create_worker(struct worker_pool 
*pool)
if (pool->flags & POOL_DISASSOCIATED)
worker->flags |= WORKER_UNBOUND;
 
-   /* successful, commit the pointer to idr */
spin_lock_irq(&pool->lock);
+   /* successful, commit the pointer to idr */
idr_replace(&pool->worker_idr, worker, worker->id);
+
+   /* start worker */
+   worker->flags |= WORKER_STARTED;
+   worker->pool->nr_workers++;
+   worker_enter_idle(worker);
+   wake_up_process(worker->task);
spin_unlock_irq(&pool->lock);
 
return worker;
@@ -1752,23 +1759,6 @@ fail:
 }
 
 /**
- * start_worker - start a newly created worker
- * @worker: worker to start
- *
- * Make the pool aware of @worker and start it.
- *
- * CONTEXT:
- * spin_lock_irq(pool->lock).
- */
-static void start_worker(struct worker *worker)
-{
-   worker->flags |= WORKER_STARTED;
-   worker->pool->nr_workers++;
-   worker_enter_idle(worker);
-   wake_up_process(worker->task);
-}
-
-/**
  * create_and_start_worker - create and start a worker for a pool
  * @pool: the target pool
  *
@@ -1779,14 +1769,7 @@ static int create_and_start_worker(struct worker_pool 
*pool)
struct worker *worker;
 
mutex_lock(&pool->manager_mutex);
-
-   worker = create_worker(pool);
-   if (worker) {
-   spin_lock_irq(&pool->lock);
-   start_worker(worker);
-   spin_unlock_irq(&pool->lock);
-   }
-
+   worker = create_and_start_worker_locked(pool);
mutex_unlock(&pool->manager_mutex);
 
return worker ? 0 : -ENOMEM;
@@ -1934,17 +1917,8 @@ restart:
mod_timer(&pool->mayday_timer, jiffies + MAYDAY_INITIAL_TIMEOUT);
 
while (true) {
-   struct worker *worker;
-
-   worker = create_worker(pool);
-   if (worker) {
-   del_timer_sync(&pool->mayday_timer);
-   spin_lock_irq(&pool->lock);
- 

[PATCH 3/8] workqueue: remove cpu_intensive from process_one_work()

2013-04-14 Thread Lai Jiangshan
In process_one_work(), we can use "worker->flags & WORKER_CPU_INTENSIVE"
instead of "cpu_intensive", and because worker->flags is a hot field
(accessed when processing each work item), this change will not cause
any performance degradation.

It prepares for also clearing WORKER_QUIT_CM in the same place.

Signed-off-by: Lai Jiangshan 
---
 kernel/workqueue.c |7 +++
 1 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index d1e10c5..a4bc589 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2068,7 +2068,6 @@ __acquires(&pool->lock)
 {
struct pool_workqueue *pwq = get_work_pwq(work);
struct worker_pool *pool = worker->pool;
-   bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;
int work_color;
struct worker *collision;
 #ifdef CONFIG_LOCKDEP
@@ -2118,7 +2117,7 @@ __acquires(&pool->lock)
 * CPU intensive works don't participate in concurrency
 * management.  They're the scheduler's responsibility.
 */
-   if (unlikely(cpu_intensive))
+   if (unlikely(pwq->wq->flags & WQ_CPU_INTENSIVE))
worker_set_flags(worker, WORKER_CPU_INTENSIVE, true);
 
/*
@@ -2161,8 +2160,8 @@ __acquires(&pool->lock)
 
spin_lock_irq(&pool->lock);
 
-   /* clear cpu intensive status */
-   if (unlikely(cpu_intensive))
+   /* clear cpu intensive status if it is set */
+   if (unlikely(worker->flags & WORKER_CPU_INTENSIVE))
worker_clr_flags(worker, WORKER_CPU_INTENSIVE);
 
/* we're done with it, release */
-- 
1.7.7.6



[PATCH 4/8] workqueue: quit cm mode when sleeping

2013-04-14 Thread Lai Jiangshan
When a worker is woken up from sleeping, it makes very little sense to
still consider it RUNNING (in the view of concurrency management):
o   if the worker goes to sleep again, it is not RUNNING again.
o   if the work runs for a long time without sleeping, the worker should
    be considered CPU_INTENSIVE.
o   if the work runs for a short time without sleeping, we can still
    treat the worker as not RUNNING for this harmless short period
    and fix it up before the next work item.
o   In almost all cases, this increment does not bring nr_running up
    from 0: there are other RUNNING workers, and in most cases they will
    not go to sleep before this worker finishes its work item, so
    incrementing this early gains little.

So we do not need to consider this worker RUNNING so early, and we can
delay incrementing nr_running a little: we increment it after the work
item has finished.

This is done by adding a new worker flag: WORKER_QUIT_CM.  It is used
to disable the nr_running increment in wq_worker_waking_up() and to
increment nr_running after the work item has finished.

This change may cause us to wake up (or create) more workers in rare
cases, but that is not incorrect.

It makes the concurrency management much simpler.
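
As an aside for reviewers, the intended lifecycle can be sketched as a tiny
user-space model (toy code only; the toy_* names are invented here, and the
real worker_clr_flags() also checks the other NOT_RUNNING bits before
re-incrementing, which the sketch glosses over):

#include <stdio.h>

enum {
    WORKER_PREP          = 1 << 3,
    WORKER_QUIT_CM       = 1 << 4,
    WORKER_CPU_INTENSIVE = 1 << 6,
    WORKER_NOT_RUNNING   = WORKER_PREP | WORKER_QUIT_CM | WORKER_CPU_INTENSIVE,
};

struct toy_worker { unsigned int flags; };
struct toy_pool   { int nr_running; };

/* wq_worker_sleeping(): leave concurrency management for the rest of the item */
static void toy_sleeping(struct toy_worker *w, struct toy_pool *p)
{
    w->flags |= WORKER_QUIT_CM;
    p->nr_running--;                /* may pick another worker to wake */
}

/* wq_worker_waking_up(): now a no-op, QUIT_CM keeps NOT_RUNNING non-zero */
static void toy_waking_up(struct toy_worker *w, struct toy_pool *p)
{
    if (!(w->flags & WORKER_NOT_RUNNING))
        p->nr_running++;
}

/* end of process_one_work(): rejoin concurrency management */
static void toy_work_done(struct toy_worker *w, struct toy_pool *p)
{
    if (w->flags & WORKER_QUIT_CM) {
        w->flags &= ~WORKER_QUIT_CM;
        p->nr_running++;
    }
}

int main(void)
{
    struct toy_worker w = { .flags = 0 };
    struct toy_pool p = { .nr_running = 1 };

    toy_sleeping(&w, &p);   /* 1 -> 0: the pool may start another worker   */
    toy_waking_up(&w, &p);  /* stays 0: the wakeup no longer re-enters CM  */
    toy_work_done(&w, &p);  /* 0 -> 1: counted as concurrency-managed again */
    printf("nr_running = %d\n", p.nr_running);
    return 0;
}

The only point of the sketch is that the wakeup path no longer touches
nr_running; the accounting is settled once per work item instead.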

Signed-off-by: Lai Jiangshan 
---
 kernel/workqueue.c  |   20 ++--
 kernel/workqueue_internal.h |2 +-
 2 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index a4bc589..668e9b7 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -75,11 +75,13 @@ enum {
WORKER_DIE  = 1 << 1,   /* die die die */
WORKER_IDLE = 1 << 2,   /* is idle */
WORKER_PREP = 1 << 3,   /* preparing to run works */
+   WORKER_QUIT_CM  = 1 << 4,   /* quit concurrency managed */
WORKER_CPU_INTENSIVE= 1 << 6,   /* cpu intensive */
WORKER_UNBOUND  = 1 << 7,   /* worker is unbound */
WORKER_REBOUND  = 1 << 8,   /* worker was rebound */
 
-   WORKER_NOT_RUNNING  = WORKER_PREP | WORKER_CPU_INTENSIVE |
+   WORKER_NOT_RUNNING  = WORKER_PREP | WORKER_QUIT_CM |
+ WORKER_CPU_INTENSIVE |
  WORKER_UNBOUND | WORKER_REBOUND,
 
NR_STD_WORKER_POOLS = 2,/* # standard pools per cpu */
@@ -122,6 +124,10 @@ enum {
  *cpu or grabbing pool->lock is enough for read access.  If
  *POOL_DISASSOCIATED is set, it's identical to L.
  *
+ * LI: If POOL_DISASSOCIATED is NOT set, read/modification access should be
+ * done with local IRQ-disabled and only from local cpu.
+ * If POOL_DISASSOCIATED is set, it's identical to L.
+ *
  * MG: pool->manager_mutex and pool->lock protected.  Writes require both
  * locks.  Reads can happen under either lock.
  *
@@ -843,11 +849,13 @@ struct task_struct *wq_worker_sleeping(struct task_struct 
*task)
 * Please read comment there.
 *
 * NOT_RUNNING is clear.  This means that we're bound to and
-* running on the local cpu w/ rq lock held and preemption
+* running on the local cpu w/ rq lock held and preemption/irq
 * disabled, which in turn means that none else could be
 * manipulating idle_list, so dereferencing idle_list without pool
-* lock is safe.
+* lock is safe. And which in turn also means that we can
+* manipulating worker->flags.
 */
+   worker->flags |= WORKER_QUIT_CM;
if (atomic_dec_and_test(&pool->nr_running) &&
!list_empty(&pool->worklist))
to_wakeup = first_worker(pool);
@@ -2160,9 +2168,9 @@ __acquires(&pool->lock)
 
spin_lock_irq(&pool->lock);
 
-   /* clear cpu intensive status if it is set */
-   if (unlikely(worker->flags & WORKER_CPU_INTENSIVE))
-   worker_clr_flags(worker, WORKER_CPU_INTENSIVE);
+   /* clear cpu intensive status or WORKER_QUIT_CM if they are set */
+   if (unlikely(worker->flags & (WORKER_CPU_INTENSIVE | WORKER_QUIT_CM)))
+   worker_clr_flags(worker, WORKER_CPU_INTENSIVE | WORKER_QUIT_CM);
 
/* we're done with it, release */
hash_del(&worker->hentry);
diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
index aec8df4..1713ae7 100644
--- a/kernel/workqueue_internal.h
+++ b/kernel/workqueue_internal.h
@@ -35,7 +35,7 @@ struct worker {
/* L: for rescuers */
/* 64 bytes boundary on 64bit, 32 on 32bit */
unsigned long   last_active;/* L: last active timestamp */
-   unsigned intflags;  /* X: flags */
+   unsigned intflags;  /* LI: flag

[PATCH 5/8] workqueue: remove disabled wq_worker_waking_up()

2013-04-14 Thread Lai Jiangshan
When a worker is sleeping, its flags include WORKER_QUIT_CM, which means
worker->flags & WORKER_NOT_RUNNING is always non-zero and therefore
wq_worker_waking_up() is effectively disabled.

So we remove wq_worker_waking_up().  (The access to worker->flags in
wq_worker_waking_up() was not protected by "LI"; after this change,
worker->flags is always protected by "LI".)

The patch also makes these changes after the removal:
1) because wq_worker_waking_up() is removed, we no longer need the
   schedule() before zapping nr_running in wq_unbind_fn(), nor the
   release/reacquire of pool->lock.
2) the sanity check in worker_enter_idle() is changed to also cover
   unbound/disassociated pools (because of the above change, nr_running
   is now expected to always be reliable in worker_enter_idle()).

Signed-off-by: Lai Jiangshan 
---
 kernel/sched/core.c |4 ---
 kernel/workqueue.c  |   58 +++---
 kernel/workqueue_internal.h |3 +-
 3 files changed, 11 insertions(+), 54 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ffc06ad..18f95884 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1276,10 +1276,6 @@ static void ttwu_activate(struct rq *rq, struct 
task_struct *p, int en_flags)
 {
activate_task(rq, p, en_flags);
p->on_rq = 1;
-
-   /* if a worker is waking up, notify workqueue */
-   if (p->flags & PF_WQ_WORKER)
-   wq_worker_waking_up(p, cpu_of(rq));
 }
 
 /*
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 668e9b7..9f1ebdf 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -790,27 +790,6 @@ static void wake_up_worker(struct worker_pool *pool)
 }
 
 /**
- * wq_worker_waking_up - a worker is waking up
- * @task: task waking up
- * @cpu: CPU @task is waking up to
- *
- * This function is called during try_to_wake_up() when a worker is
- * being awoken.
- *
- * CONTEXT:
- * spin_lock_irq(rq->lock)
- */
-void wq_worker_waking_up(struct task_struct *task, int cpu)
-{
-   struct worker *worker = kthread_data(task);
-
-   if (!(worker->flags & WORKER_NOT_RUNNING)) {
-   WARN_ON_ONCE(worker->pool->cpu != cpu);
-   atomic_inc(&worker->pool->nr_running);
-   }
-}
-
-/**
  * wq_worker_sleeping - a worker is going to sleep
  * @task: task going to sleep
  *
@@ -1564,14 +1543,8 @@ static void worker_enter_idle(struct worker *worker)
if (too_many_workers(pool) && !timer_pending(&pool->idle_timer))
mod_timer(&pool->idle_timer, jiffies + IDLE_WORKER_TIMEOUT);
 
-   /*
-* Sanity check nr_running.  Because wq_unbind_fn() releases
-* pool->lock between setting %WORKER_UNBOUND and zapping
-* nr_running, the warning may trigger spuriously.  Check iff
-* unbind is not in progress.
-*/
-   WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
-pool->nr_workers == pool->nr_idle &&
+   /* Sanity check nr_running. */
+   WARN_ON_ONCE(pool->nr_workers == pool->nr_idle &&
 atomic_read(&pool->nr_running));
 }
 
@@ -4385,24 +4358,12 @@ static void wq_unbind_fn(struct work_struct *work)
 
pool->flags |= POOL_DISASSOCIATED;
 
-   spin_unlock_irq(&pool->lock);
-   mutex_unlock(&pool->manager_mutex);
-
/*
-* Call schedule() so that we cross rq->lock and thus can
-* guarantee sched callbacks see the %WORKER_UNBOUND flag.
-* This is necessary as scheduler callbacks may be invoked
-* from other cpus.
-*/
-   schedule();
-
-   /*
-* Sched callbacks are disabled now.  Zap nr_running.
-* After this, nr_running stays zero and need_more_worker()
-* and keep_working() are always true as long as the
-* worklist is not empty.  This pool now behaves as an
-* unbound (in terms of concurrency management) pool which
-* are served by workers tied to the pool.
+* Zap nr_running. After this, nr_running stays zero
+* and need_more_worker() and keep_working() are always true
+* as long as the worklist is not empty.  This pool now
+* behaves as an unbound (in terms of concurrency management)
+* pool which are served by workers tied to the pool.
 */
atomic_set(&pool->nr_running, 0);
 
@@ -4411,9 +4372,9 @@ static void wq_unbind_fn(struct work_struct *work)
 * worker blocking could lead to lengthy stalls.  Kick off
 * unbound chain execution of currently pending work items.
 */
-   spin_lock_irq(&pool->lock

[PATCH 6/8] workqueue: make nr_running non-atomic

2013-04-14 Thread Lai Jiangshan
Now nr_running is accessed only with local IRQs disabled and only from the
local CPU if the pool is associated (except for the read access in
insert_work()).

So we convert it to a non-atomic int to avoid the overhead of atomic
operations.  It is protected by "LI".
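
To illustrate the ordering this relies on, here is a stand-alone user-space
C11 sketch (not kernel code; the names are invented).  The sleeping side
decrements the counter and then checks the list, the queueing side appends
to the list and then checks the counter, each with a full fence in between,
so at least one side always notices the other and a queued work item cannot
be missed by both:

#include <stdatomic.h>
#include <stdbool.h>

static atomic_int  nr_running;        /* stands in for pool->nr_running */
static atomic_bool worklist_nonempty; /* stands in for !list_empty(&pool->worklist) */

/* models wq_worker_sleeping(): --nr_running; smp_mb(); check the worklist */
static bool sleeper_must_wake_peer(void)
{
    atomic_fetch_sub_explicit(&nr_running, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* the new smp_mb() */
    return atomic_load_explicit(&worklist_nonempty, memory_order_relaxed);
}

/* models insert_work(): queue the work; smp_mb(); check nr_running */
static bool queuer_must_wake_worker(void)
{
    atomic_store_explicit(&worklist_nonempty, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* pairs with the fence above */
    return atomic_load_explicit(&nr_running, memory_order_relaxed) == 0;
}

int main(void)
{
    /* single-threaded walk-through; the interesting case is two racing threads */
    atomic_store(&nr_running, 1);
    (void)queuer_must_wake_worker(); /* reads 1: no wakeup needed yet */
    (void)sleeper_must_wake_peer();  /* sees the queued item: wake a peer */
    return 0;
}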

Signed-off-by: Lai Jiangshan 
---
 kernel/workqueue.c |   49 +
 1 files changed, 21 insertions(+), 28 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9f1ebdf..25e2e5a 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -150,6 +150,7 @@ struct worker_pool {
int node;   /* I: the associated node ID */
int id; /* I: pool ID */
unsigned intflags;  /* X: flags */
+   int nr_running; /* LI: count for running */
 
struct list_headworklist;   /* L: list of pending works */
int nr_workers; /* L: total number of workers */
@@ -175,13 +176,6 @@ struct worker_pool {
int refcnt; /* PL: refcnt for unbound pools 
*/
 
/*
-* The current concurrency level.  As it's likely to be accessed
-* from other CPUs during try_to_wake_up(), put it in a separate
-* cacheline.
-*/
-   atomic_tnr_running cacheline_aligned_in_smp;
-
-   /*
 * Destruction of pool is sched-RCU protected to allow dereferences
 * from get_work_pool().
 */
@@ -700,7 +694,7 @@ static bool work_is_canceling(struct work_struct *work)
 
 static bool __need_more_worker(struct worker_pool *pool)
 {
-   return !atomic_read(&pool->nr_running);
+   return !pool->nr_running;
 }
 
 /*
@@ -725,8 +719,7 @@ static bool may_start_working(struct worker_pool *pool)
 /* Do I need to keep working?  Called from currently running workers. */
 static bool keep_working(struct worker_pool *pool)
 {
-   return !list_empty(&pool->worklist) &&
-   atomic_read(&pool->nr_running) <= 1;
+   return !list_empty(&pool->worklist) && pool->nr_running <= 1;
 }
 
 /* Do we need a new worker?  Called from manager. */
@@ -823,21 +816,24 @@ struct task_struct *wq_worker_sleeping(struct task_struct 
*task)
return NULL;
 
/*
-* The counterpart of the following dec_and_test, implied mb,
-* worklist not empty test sequence is in insert_work().
-* Please read comment there.
-*
 * NOT_RUNNING is clear.  This means that we're bound to and
 * running on the local cpu w/ rq lock held and preemption/irq
 * disabled, which in turn means that none else could be
 * manipulating idle_list, so dereferencing idle_list without pool
 * lock is safe. And which in turn also means that we can
-* manipulating worker->flags.
+* manipulating worker->flags and pool->nr_running.
 */
worker->flags |= WORKER_QUIT_CM;
-   if (atomic_dec_and_test(&pool->nr_running) &&
-   !list_empty(&pool->worklist))
-   to_wakeup = first_worker(pool);
+   if (--pool->nr_running == 0) {
+   /*
+* This smp_mb() forces a mb between decreasing nr_running
+* and reading worklist. It paires with the smp_mb() in
+* insert_work(). Please read comment there.
+*/
+   smp_mb();
+   if (!list_empty(&pool->worklist))
+   to_wakeup = first_worker(pool);
+   }
return to_wakeup ? to_wakeup->task : NULL;
 }
 
@@ -868,12 +864,10 @@ static inline void worker_set_flags(struct worker 
*worker, unsigned int flags,
 */
if ((flags & WORKER_NOT_RUNNING) &&
!(worker->flags & WORKER_NOT_RUNNING)) {
-   if (wakeup) {
-   if (atomic_dec_and_test(&pool->nr_running) &&
-   !list_empty(&pool->worklist))
-   wake_up_worker(pool);
-   } else
-   atomic_dec(&pool->nr_running);
+   pool->nr_running--;
+   if (wakeup && !pool->nr_running &&
+   !list_empty(&pool->worklist))
+   wake_up_worker(pool);
}
 
worker->flags |= flags;
@@ -905,7 +899,7 @@ static inline void worker_clr_flags(struct worker *worker, 
unsigned int flags)
 */
if ((flags & WORKER_NOT_RUNNING) && (oflags & WORKER_NOT_RUNNING))
if (!(worker->flags & WORKER_NOT_RUNNING))
-   atomic_inc(&pool->nr_running);
+   pool->nr_running++;
 }
 
 /**
@@ -1544,8 

[PATCH 7/8] workqueue: move worker->flags up

2013-04-14 Thread Lai Jiangshan
worker->flags is a hot field (accessed while processing each work item).
Move it up into the first 64 bytes (32 bytes on 32-bit), which hold the
hot fields.

Also move the colder field worker->task down to ensure worker->pool
stays within the first 64 bytes.
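
For illustration only, this kind of layout intent can also be written down
as a build-time check.  The struct below is a simplified stand-in (assuming
an LP64 target), not the real struct worker, and the patch itself adds no
such assertion:

#include <stddef.h>

struct layout_example {
    unsigned int  flags;          /* hot: tested for every work item */
    void         *entry[2];       /* stand-in for the idle/busy links */
    void         *current_work;
    void         *current_func;
    void         *current_pwq;
    void         *pool;           /* hot: wanted in the first 64 bytes */
    void         *task;           /* colder fields from here on */
    unsigned long last_active;
    int           id;
};

/* both hot fields must land in the first cacheline */
_Static_assert(offsetof(struct layout_example, flags) < 64, "flags too far");
_Static_assert(offsetof(struct layout_example, pool)  < 64, "pool too far");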

Signed-off-by: Lai Jiangshan 
---
 kernel/workqueue_internal.h |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
index e9fd05f..63cfac7 100644
--- a/kernel/workqueue_internal.h
+++ b/kernel/workqueue_internal.h
@@ -20,6 +20,7 @@ struct worker_pool;
  * Only to be used in workqueue and async.
  */
 struct worker {
+   unsigned intflags;  /* LI: flags */
/* on idle list while idle, on busy hash table while busy */
union {
struct list_headentry;  /* L: while idle */
@@ -30,12 +31,11 @@ struct worker {
work_func_t current_func;   /* L: current_work's fn */
struct pool_workqueue   *current_pwq; /* L: current_work's pwq */
struct list_headscheduled;  /* L: scheduled works */
-   struct task_struct  *task;  /* I: worker task */
struct worker_pool  *pool;  /* I: the associated pool */
/* L: for rescuers */
/* 64 bytes boundary on 64bit, 32 on 32bit */
+   struct task_struct  *task;  /* I: worker task */
unsigned long   last_active;/* L: last active timestamp */
-   unsigned intflags;  /* LI: flags */
int id; /* I: worker id */
 
/* used only by rescuers to point to the target workqueue */
-- 
1.7.7.6



[PATCH 8/8] workqueue: rename ->nr_running to ->nr_cm_workers

2013-04-14 Thread Lai Jiangshan
nr_running is not a good name: reviewers may think it counts non-sleeping
busy workers.  nr_running is actually a counter of concurrency-managed
workers, so renaming it to nr_cm_workers is better.

s/nr_running/nr_cm_workers/
s/NOT_RUNNING/NOT_CM/
manually tuned a little (indentation and the comment for nr_cm_workers)

Signed-off-by: Lai Jiangshan 
---
 kernel/workqueue.c |   69 +--
 1 files changed, 34 insertions(+), 35 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 25e2e5a..25e028c 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -80,7 +80,7 @@ enum {
WORKER_UNBOUND  = 1 << 7,   /* worker is unbound */
WORKER_REBOUND  = 1 << 8,   /* worker was rebound */
 
-   WORKER_NOT_RUNNING  = WORKER_PREP | WORKER_QUIT_CM |
+   WORKER_NOT_CM   = WORKER_PREP | WORKER_QUIT_CM |
  WORKER_CPU_INTENSIVE |
  WORKER_UNBOUND | WORKER_REBOUND,
 
@@ -150,7 +150,7 @@ struct worker_pool {
int node;   /* I: the associated node ID */
int id; /* I: pool ID */
unsigned intflags;  /* X: flags */
-   int nr_running; /* LI: count for running */
+   int nr_cm_workers;  /* LI: count for cm workers */
 
struct list_headworklist;   /* L: list of pending works */
int nr_workers; /* L: total number of workers */
@@ -694,14 +694,14 @@ static bool work_is_canceling(struct work_struct *work)
 
 static bool __need_more_worker(struct worker_pool *pool)
 {
-   return !pool->nr_running;
+   return !pool->nr_cm_workers;
 }
 
 /*
  * Need to wake up a worker?  Called from anything but currently
  * running workers.
  *
- * Note that, because unbound workers never contribute to nr_running, this
+ * Note that, because unbound workers never contribute to nr_cm_workers, this
  * function will always return %true for unbound pools as long as the
  * worklist isn't empty.
  */
@@ -719,7 +719,7 @@ static bool may_start_working(struct worker_pool *pool)
 /* Do I need to keep working?  Called from currently running workers. */
 static bool keep_working(struct worker_pool *pool)
 {
-   return !list_empty(&pool->worklist) && pool->nr_running <= 1;
+   return !list_empty(&pool->worklist) && pool->nr_cm_workers <= 1;
 }
 
 /* Do we need a new worker?  Called from manager. */
@@ -804,9 +804,9 @@ struct task_struct *wq_worker_sleeping(struct task_struct 
*task)
/*
 * Rescuers, which may not have all the fields set up like normal
 * workers, also reach here, let's not access anything before
-* checking NOT_RUNNING.
+* checking NOT_CM.
 */
-   if (worker->flags & WORKER_NOT_RUNNING)
+   if (worker->flags & WORKER_NOT_CM)
return NULL;
 
pool = worker->pool;
@@ -816,17 +816,17 @@ struct task_struct *wq_worker_sleeping(struct task_struct 
*task)
return NULL;
 
/*
-* NOT_RUNNING is clear.  This means that we're bound to and
+* NOT_CM is clear.  This means that we're bound to and
 * running on the local cpu w/ rq lock held and preemption/irq
 * disabled, which in turn means that none else could be
 * manipulating idle_list, so dereferencing idle_list without pool
 * lock is safe. And which in turn also means that we can
-* manipulating worker->flags and pool->nr_running.
+* manipulating worker->flags and pool->nr_cm_workers.
 */
worker->flags |= WORKER_QUIT_CM;
-   if (--pool->nr_running == 0) {
+   if (--pool->nr_cm_workers == 0) {
/*
-* This smp_mb() forces a mb between decreasing nr_running
+* This smp_mb() forces a mb between decreasing nr_cm_workers
 * and reading worklist. It paires with the smp_mb() in
 * insert_work(). Please read comment there.
 */
@@ -838,13 +838,13 @@ struct task_struct *wq_worker_sleeping(struct task_struct 
*task)
 }
 
 /**
- * worker_set_flags - set worker flags and adjust nr_running accordingly
+ * worker_set_flags - set worker flags and adjust nr_cm_workers accordingly
  * @worker: self
  * @flags: flags to set
  * @wakeup: wakeup an idle worker if necessary
  *
- * Set @flags in @worker->flags and adjust nr_running accordingly.  If
- * nr_running becomes zero and @wakeup is %true, an idle worker is
+ * Set @flags in @worker->flags and adjust nr_cm_workers accordingly.  If
+ * nr_cm_workers becomes zero and @wakeup is %true, an idle worker is
  * woken up.
  *
  * CONTEXT:
@@ -858,14 +858,13 @@ 

[PATCH 2/2 tj/for-3.10] workqueue: modify wq->freezing only when freezable

2013-03-30 Thread Lai Jiangshan
Simplify pwq_adjust_max_active().
Make freeze_workqueues_begin() and thaw_workqueues() quickly skip
non-freezable workqueues.

Signed-off-by: Lai Jiangshan 
---
 kernel/workqueue.c |   16 +---
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 6b7e5a4..0a38852 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3566,18 +3566,15 @@ static void pwq_unbound_release_workfn(struct 
work_struct *work)
 static void pwq_adjust_max_active(struct pool_workqueue *pwq)
 {
struct workqueue_struct *wq = pwq->wq;
-   bool freezable = wq->flags & WQ_FREEZABLE;
 
/* for @wq->saved_max_active and @wq->freezing */
lockdep_assert_held(&wq->mutex);
-
-   /* fast exit for non-freezable wqs */
-   if (!freezable && pwq->max_active == wq->saved_max_active)
-   return;
+   if (WARN_ON_ONCE(!(wq->flags & WQ_FREEZABLE) && wq->freezing))
+   wq->freezing = false;
 
spin_lock_irq(&pwq->pool->lock);
 
-   if (!freezable || !wq->freezing) {
+   if (!wq->freezing) {
pwq->max_active = wq->saved_max_active;
 
while (!list_empty(&pwq->delayed_works) &&
@@ -3792,7 +3789,8 @@ struct workqueue_struct *__alloc_workqueue_key(const char 
*fmt,
mutex_lock(&wq_pool_mutex);
 
mutex_lock(&wq->mutex);
-   wq->freezing = workqueue_freezing;
+   if (wq->flags & WQ_FREEZABLE)
+   wq->freezing = workqueue_freezing;
for_each_pwq(pwq, wq)
pwq_adjust_max_active(pwq);
mutex_unlock(&wq->mutex);
@@ -4289,6 +4287,8 @@ void freeze_workqueues_begin(void)
workqueue_freezing = true;
 
list_for_each_entry(wq, &workqueues, list) {
+   if (!(wq->flags & WQ_FREEZABLE))
+   continue;
mutex_lock(&wq->mutex);
WARN_ON_ONCE(wq->freezing);
wq->freezing = true;
@@ -4367,6 +4367,8 @@ void thaw_workqueues(void)
 
/* restore max_active and repopulate worklist */
list_for_each_entry(wq, &workqueues, list) {
+   if (!(wq->flags & WQ_FREEZABLE))
+   continue;
mutex_lock(&wq->mutex);
wq->freezing = false;
for_each_pwq(pwq, wq)
-- 
1.7.7.6



[PATCH 1/2 tj/for-3.10] workqueue: add wq->freezing and remove POOL_FREEZING

2013-03-30 Thread Lai Jiangshan
Freezing is not related to pools, but POOL_FREEZING ties them together and
makes freeze_workqueues_begin() and thaw_workqueues() complicated.

Since freezing is a workqueue-instance attribute, introduce wq->freezing
instead and remove POOL_FREEZING.

Signed-off-by: Lai Jiangshan 
---
 kernel/workqueue.c |   33 +++--
 1 files changed, 7 insertions(+), 26 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 04a8b98..6b7e5a4 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -66,7 +66,6 @@ enum {
 */
POOL_MANAGE_WORKERS = 1 << 0,   /* need to manage workers */
POOL_DISASSOCIATED  = 1 << 2,   /* cpu can't serve workers */
-   POOL_FREEZING   = 1 << 3,   /* freeze in progress */
 
/* worker flags */
WORKER_STARTED  = 1 << 0,   /* started */
@@ -241,6 +240,7 @@ struct workqueue_struct {
 
int nr_drainers;/* WQ: drain in progress */
int saved_max_active; /* WQ: saved pwq max_active */
+   boolfreezing;   /* WQ: the wq is freezing */
 
 #ifdef CONFIG_SYSFS
struct wq_device*wq_dev;/* I: for sysfs interface */
@@ -3493,9 +3493,6 @@ static struct worker_pool *get_unbound_pool(const struct 
workqueue_attrs *attrs)
if (!pool || init_worker_pool(pool) < 0)
goto fail;
 
-   if (workqueue_freezing)
-   pool->flags |= POOL_FREEZING;
-
lockdep_set_subclass(&pool->lock, 1);   /* see put_pwq() */
copy_workqueue_attrs(pool->attrs, attrs);
 
@@ -3571,7 +3568,7 @@ static void pwq_adjust_max_active(struct pool_workqueue 
*pwq)
struct workqueue_struct *wq = pwq->wq;
bool freezable = wq->flags & WQ_FREEZABLE;
 
-   /* for @wq->saved_max_active */
+   /* for @wq->saved_max_active and @wq->freezing */
lockdep_assert_held(&wq->mutex);
 
/* fast exit for non-freezable wqs */
@@ -3580,7 +3577,7 @@ static void pwq_adjust_max_active(struct pool_workqueue 
*pwq)
 
spin_lock_irq(&pwq->pool->lock);
 
-   if (!freezable || !(pwq->pool->flags & POOL_FREEZING)) {
+   if (!freezable || !wq->freezing) {
pwq->max_active = wq->saved_max_active;
 
while (!list_empty(&pwq->delayed_works) &&
@@ -3795,6 +3792,7 @@ struct workqueue_struct *__alloc_workqueue_key(const char 
*fmt,
mutex_lock(&wq_pool_mutex);
 
mutex_lock(&wq->mutex);
+   wq->freezing = workqueue_freezing;
for_each_pwq(pwq, wq)
pwq_adjust_max_active(pwq);
mutex_unlock(&wq->mutex);
@@ -4282,26 +4280,18 @@ EXPORT_SYMBOL_GPL(work_on_cpu);
  */
 void freeze_workqueues_begin(void)
 {
-   struct worker_pool *pool;
struct workqueue_struct *wq;
struct pool_workqueue *pwq;
-   int pi;
 
mutex_lock(&wq_pool_mutex);
 
WARN_ON_ONCE(workqueue_freezing);
workqueue_freezing = true;
 
-   /* set FREEZING */
-   for_each_pool(pool, pi) {
-   spin_lock_irq(&pool->lock);
-   WARN_ON_ONCE(pool->flags & POOL_FREEZING);
-   pool->flags |= POOL_FREEZING;
-   spin_unlock_irq(&pool->lock);
-   }
-
list_for_each_entry(wq, &workqueues, list) {
mutex_lock(&wq->mutex);
+   WARN_ON_ONCE(wq->freezing);
+   wq->freezing = true;
for_each_pwq(pwq, wq)
pwq_adjust_max_active(pwq);
mutex_unlock(&wq->mutex);
@@ -4369,25 +4359,16 @@ void thaw_workqueues(void)
 {
struct workqueue_struct *wq;
struct pool_workqueue *pwq;
-   struct worker_pool *pool;
-   int pi;
 
mutex_lock(&wq_pool_mutex);
 
if (!workqueue_freezing)
goto out_unlock;
 
-   /* clear FREEZING */
-   for_each_pool(pool, pi) {
-   spin_lock_irq(&pool->lock);
-   WARN_ON_ONCE(!(pool->flags & POOL_FREEZING));
-   pool->flags &= ~POOL_FREEZING;
-   spin_unlock_irq(&pool->lock);
-   }
-
/* restore max_active and repopulate worklist */
list_for_each_entry(wq, &workqueues, list) {
mutex_lock(&wq->mutex);
+   wq->freezing = false;
for_each_pwq(pwq, wq)
pwq_adjust_max_active(pwq);
mutex_unlock(&wq->mutex);
-- 
1.7.7.6



Re: [PATCH v4 13/14] workqueue: implement NUMA affinity for unbound workqueues

2013-03-30 Thread Lai Jiangshan
On 31/03/13 00:32, Tejun Heo wrote:
> Hello, Lai.
> 
> 
> On Sat, Mar 30, 2013 at 9:13 AM, Lai Jiangshan wrote:
> 
> 
> +   /* all pwqs have been created successfully, let's install'em 
> */
> mutex_lock(&wq->mutex);
> 
> copy_workqueue_attrs(wq->unbound_attrs, new_attrs);
> +
> +   /* save the previous pwq and install the new one */
> for_each_node(node)
> -   last_pwq = numa_pwq_tbl_install(wq, node, pwq);
> +   pwq_tbl[node] = numa_pwq_tbl_install(wq, node, 
> pwq_tbl[node]);
> +
> +   /* @dfl_pwq might not have been used, ensure it's linked */
> +   link_pwq(dfl_pwq);
> +   swap(wq->dfl_pwq, dfl_pwq);
> 
> mutex_unlock(&wq->mutex);
> 
> -   put_pwq_unlocked(last_pwq);
> +   /* put the old pwqs */
> +   for_each_node(node)
> +   put_pwq_unlocked(pwq_tbl[node]);
> +   put_pwq_unlocked(dfl_pwq);
> +
> +   put_online_cpus();
> return 0;
> 
> 
> 
> Forgot to free new_attrs in previous patch
> (workqueue: fix unbound workqueue attrs hashing / comparison).
> 
> Forgot to free tmp_attrs, pwq_tbl in this patch.
> 
> 
> Right, will fix. 
> 
> +retry:
> +   mutex_lock(&wq->mutex);
> +
> +   copy_workqueue_attrs(target_attrs, wq->unbound_attrs);
> +   pwq = unbound_pwq_by_node(wq, node);
> +
> +   /*
> +* Let's determine what needs to be done.  If the target 
> cpumask is
> +* different from wq's, we need to compare it to @pwq's and 
> create
> +* a new one if they don't match.  If the target cpumask 
> equals
> +* wq's, the default pwq should be used.  If @pwq is already 
> the
> +* default one, nothing to do; otherwise, install the default 
> one.
> +*/
> +   if (wq_calc_node_cpumask(wq->unbound_attrs, node, cpu_off, 
> cpumask)) {
> +   if (cpumask_equal(cpumask, pwq->pool->attrs->cpumask))
> +   goto out_unlock;
> +   } else if (pwq != wq->dfl_pwq) {
> +   goto use_dfl_pwq;
> +   } else {
> +   goto out_unlock;
> +   }
> +
> +   /*
> +* Have we already created a new pwq?  As we could have raced 
> with
> +* apply_workqueue_attrs(), verify that its attrs match the 
> desired
> +* one before installing.
> +*/
> 
> 
> I don't see any race since there is get/put_online_cpu() in 
> apply_workqueue_attrs().
> 
> 
> I don't know. I kinda want wq exclusion to be self-contained, but yeah the 
> hotplug exclusion here is *almost* explicit so maybe it would be better to 
> depend on it. Will think about it.
> 
> +   mutex_unlock(&wq->mutex);
> +   put_pwq_unlocked(old_pwq);
> +   free_unbound_pwq(new_pwq);
> +}
> 
> 
> OK, your solution is what I suggested: swapping dfl_pwq <-> node pwq.
> But when the last cpu of the node(of the wq) is trying to offline.
> you need to handle the work items of node pwq(old_pwq in the code).
> 
> you may handle the works which are still queued by migrating, OR by
> flushing the works.
> and you may handle busy works by temporary changing the cpumask of
> the workers, OR by flushing the busy works.
> 
> 
> I don't think that's necessary.

Please document it.

> It's not like we have hard guarantee on attr changes anyway.
> Self-requeueing work items can get stuck with old attributes for quite a 
> while,

It is OK as long as it is documented.

> and even per-cpu work items get migrated to other CPUs on CPU DOWN.

It is expected.

But for an unbound wq across CPU hotplug:
w/o NUMA affinity, works always run on the CPUs in the wq's cpumask if
there is an online CPU in that cpumask;
w/ NUMA affinity, NOT always, even then.

> Workqueue's affinity guarantee is very specific - the work item owner is
> responsible for flushing the work item during CPU DOWN if it wants
> to guarantee affinity over full execution. 

Could you add the comments and add Reviewed-by: Lai Jiangshan 

for the patchset?

Thanks,
Lai



Re: [PATCH 1/2 tj/for-3.10] workqueue: add wq->freezing and remove POOL_FREEZING

2013-04-01 Thread Lai Jiangshan
On 04/02/2013 02:44 AM, Tejun Heo wrote:
> On Sun, Mar 31, 2013 at 12:29:14AM +0800, Lai Jiangshan wrote:
>> freezing is nothing related to pools, but POOL_FREEZING adds a connection,
>> and causes freeze_workqueues_begin() and thaw_workqueues() complicated.
>>
>> Since freezing is workqueue instance attribute, so we introduce wq->freezing
>> instead and remove POOL_FREEZING.
>>
>> Signed-off-by: Lai Jiangshan 
>> ---
>>  kernel/workqueue.c |   33 +++--
>>  1 files changed, 7 insertions(+), 26 deletions(-)
>>
>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> index 04a8b98..6b7e5a4 100644
>> --- a/kernel/workqueue.c
>> +++ b/kernel/workqueue.c
>> @@ -66,7 +66,6 @@ enum {
>>   */
>>  POOL_MANAGE_WORKERS = 1 << 0,   /* need to manage workers */
>>  POOL_DISASSOCIATED  = 1 << 2,   /* cpu can't serve workers */
>> -POOL_FREEZING   = 1 << 3,   /* freeze in progress */
>>  
>>  /* worker flags */
>>  WORKER_STARTED  = 1 << 0,   /* started */
>> @@ -241,6 +240,7 @@ struct workqueue_struct {
>>  
>>  int nr_drainers;/* WQ: drain in progress */
>>  int saved_max_active; /* WQ: saved pwq max_active */
>> +boolfreezing;   /* WQ: the wq is freezing */
> 
> Why not use another internal flag?  There already are __WQ_DRAINING
> and __WQ_ORDERED.  Can't we just add __WQ_FREEZING?
> 
> Thanks.
> 



->flags is hot and almost read-only (except for __WQ_DRAINING).
The __WQ_DRAINING bit is accessed on every queue_work(), so it belongs in
the hot ->flags.

__WQ_ORDERED is read-only.

->freezing is cold and not read-only, so I don't think we need to add
__WQ_FREEZING to ->flags.

Thanks,
Lai



Re: [PATCH 17/31] workqueue: implement attribute-based unbound worker_pool management

2013-03-10 Thread Lai Jiangshan
On 02/03/13 11:24, Tejun Heo wrote:
> This patch makes unbound worker_pools reference counted and
> dynamically created and destroyed as workqueues needing them come and
> go.  All unbound worker_pools are hashed on unbound_pool_hash which is
> keyed by the content of worker_pool->attrs.
> 
> When an unbound workqueue is allocated, get_unbound_pool() is called
> with the attributes of the workqueue.  If there already is a matching
> worker_pool, the reference count is bumped and the pool is returned.
> If not, a new worker_pool with matching attributes is created and
> returned.
> 
> When an unbound workqueue is destroyed, put_unbound_pool() is called
> which decrements the reference count of the associated worker_pool.
> If the refcnt reaches zero, the worker_pool is destroyed in sched-RCU
> safe way.
> 
> Note that the standard unbound worker_pools - normal and highpri ones
> with no specific cpumask affinity - are no longer created explicitly
> during init_workqueues().  init_workqueues() only initializes
> workqueue_attrs to be used for standard unbound pools -
> unbound_std_wq_attrs[].  The pools are spawned on demand as workqueues
> are created.
> 
> Signed-off-by: Tejun Heo 
> ---
>  kernel/workqueue.c | 230 
> ++---
>  1 file changed, 218 insertions(+), 12 deletions(-)
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 7eba824..fb91b67 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -41,6 +41,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -80,6 +81,7 @@ enum {
>  
>   NR_STD_WORKER_POOLS = 2,/* # standard pools per cpu */
>  
> + UNBOUND_POOL_HASH_ORDER = 6,/* hashed by pool->attrs */
>   BUSY_WORKER_HASH_ORDER  = 6,/* 64 pointers */
>  
>   MAX_IDLE_WORKERS_RATIO  = 4,/* 1/4 of busy can be idle */
> @@ -149,6 +151,8 @@ struct worker_pool {
>   struct ida  worker_ida; /* L: for worker IDs */
>  
>   struct workqueue_attrs  *attrs; /* I: worker attributes */
> + struct hlist_node   hash_node;  /* R: unbound_pool_hash node */
> + atomic_trefcnt; /* refcnt for unbound pools */
>  
>   /*
>* The current concurrency level.  As it's likely to be accessed
> @@ -156,6 +160,12 @@ struct worker_pool {
>* cacheline.
>*/
>   atomic_tnr_running cacheline_aligned_in_smp;
> +
> + /*
> +  * Destruction of pool is sched-RCU protected to allow dereferences
> +  * from get_work_pool().
> +  */
> + struct rcu_head rcu;
>  } cacheline_aligned_in_smp;
>  
>  /*
> @@ -218,6 +228,11 @@ struct workqueue_struct {
>  
>  static struct kmem_cache *pwq_cache;
>  
> +/* hash of all unbound pools keyed by pool->attrs */
> +static DEFINE_HASHTABLE(unbound_pool_hash, UNBOUND_POOL_HASH_ORDER);
> +
> +static struct workqueue_attrs *unbound_std_wq_attrs[NR_STD_WORKER_POOLS];
> +
>  struct workqueue_struct *system_wq __read_mostly;
>  EXPORT_SYMBOL_GPL(system_wq);
>  struct workqueue_struct *system_highpri_wq __read_mostly;
> @@ -1740,7 +1755,7 @@ static struct worker *create_worker(struct worker_pool 
> *pool)
>   worker->pool = pool;
>   worker->id = id;
>  
> - if (pool->cpu != WORK_CPU_UNBOUND)
> + if (pool->cpu >= 0)
>   worker->task = kthread_create_on_node(worker_thread,
>   worker, cpu_to_node(pool->cpu),
>   "kworker/%d:%d%s", pool->cpu, id, pri);
> @@ -3159,6 +3174,54 @@ fail:
>   return NULL;
>  }
>  
> +static void copy_workqueue_attrs(struct workqueue_attrs *to,
> +  const struct workqueue_attrs *from)
> +{
> + to->nice = from->nice;
> + cpumask_copy(to->cpumask, from->cpumask);
> +}
> +
> +/*
> + * Hacky implementation of jhash of bitmaps which only considers the
> + * specified number of bits.  We probably want a proper implementation in
> + * include/linux/jhash.h.
> + */
> +static u32 jhash_bitmap(const unsigned long *bitmap, int bits, u32 hash)
> +{
> + int nr_longs = bits / BITS_PER_LONG;
> + int nr_leftover = bits % BITS_PER_LONG;
> + unsigned long leftover = 0;
> +
> + if (nr_longs)
> + hash = jhash(bitmap, nr_longs * sizeof(long), hash);
> + if (nr_leftover) {
> + bitmap_copy(&leftover, bitmap + nr_longs, nr_leftover);
> + hash = jhash(&leftover, sizeof(long), hash);
> + }
> + return hash;
> +}
> +
> +/* hash value of the content of @attr */
> +static u32 wqattrs_hash(const struct workqueue_attrs *attrs)
> +{
> + u32 hash = 0;
> +
> + hash = jhash_1word(attrs->nice, hash);
> + hash = jhash_bitmap(cpumask_bits(attrs->cpumask), nr_cpu_ids, hash);
> + return hash;
> +}
> +
> +/* content equality test */
> +static bool wqattrs_equal(const struct workqueue_at

Re: [PATCH 07/31] workqueue: restructure pool / pool_workqueue iterations in freeze/thaw functions

2013-03-10 Thread Lai Jiangshan
On 02/03/13 11:23, Tejun Heo wrote:
> The three freeze/thaw related functions - freeze_workqueues_begin(),
> freeze_workqueues_busy() and thaw_workqueues() - need to iterate
> through all pool_workqueues of all freezable workqueues.  They did it
> by first iterating pools and then visiting all pwqs (pool_workqueues)
> of all workqueues and process it if its pwq->pool matches the current
> pool.  This is rather backwards and done this way partly because
> workqueue didn't have fitting iteration helpers and partly to avoid
> the number of lock operations on pool->lock.
> 
> Workqueue now has fitting iterators and the locking operation overhead
> isn't anything to worry about - those locks are unlikely to be
> contended and the same CPU visiting the same set of locks multiple
> times isn't expensive.
> 
> Restructure the three functions such that the flow better matches the
> logical steps and pwq iteration is done using for_each_pwq() inside
> workqueue iteration.
> 
> * freeze_workqueues_begin(): Setting of FREEZING is moved into a
>   separate for_each_pool() iteration.  pwq iteration for clearing
>   max_active is updated as described above.
> 
> * freeze_workqueues_busy(): pwq iteration updated as described above.
> 
> * thaw_workqueues(): The single for_each_wq_cpu() iteration is broken
>   into three discrete steps - clearing FREEZING, restoring max_active,
>   and kicking workers.  The first and last steps use for_each_pool()
>   and the second step uses pwq iteration described above.
> 
> This makes the code easier to understand and removes the use of
> for_each_wq_cpu() for walking pwqs, which can't support multiple
> unbound pwqs which will be needed to implement unbound workqueues with
> custom attributes.
> 
> This patch doesn't introduce any visible behavior changes.
> 
> Signed-off-by: Tejun Heo 
> ---
>  kernel/workqueue.c | 87 
> --
>  1 file changed, 45 insertions(+), 42 deletions(-)
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 869dbcc..9f195aa 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -3598,6 +3598,8 @@ EXPORT_SYMBOL_GPL(work_on_cpu);
>  void freeze_workqueues_begin(void)
>  {
>   struct worker_pool *pool;
> + struct workqueue_struct *wq;
> + struct pool_workqueue *pwq;
>   int id;
>  
>   spin_lock_irq(&workqueue_lock);
> @@ -3605,23 +3607,24 @@ void freeze_workqueues_begin(void)
>   WARN_ON_ONCE(workqueue_freezing);
>   workqueue_freezing = true;
>  
> + /* set FREEZING */
>   for_each_pool(pool, id) {
> - struct workqueue_struct *wq;
> -
>   spin_lock(&pool->lock);
> -
>   WARN_ON_ONCE(pool->flags & POOL_FREEZING);
>   pool->flags |= POOL_FREEZING;
> + spin_unlock(&pool->lock);
> + }
>  
> - list_for_each_entry(wq, &workqueues, list) {
> - struct pool_workqueue *pwq = get_pwq(pool->cpu, wq);
> + /* suppress further executions by setting max_active to zero */
> + list_for_each_entry(wq, &workqueues, list) {
> + if (!(wq->flags & WQ_FREEZABLE))
> + continue;
>  
> - if (pwq && pwq->pool == pool &&
> - (wq->flags & WQ_FREEZABLE))
> - pwq->max_active = 0;
> + for_each_pwq(pwq, wq) {
> + spin_lock(&pwq->pool->lock);
> + pwq->max_active = 0;
> + spin_unlock(&pwq->pool->lock);
>   }
> -
> - spin_unlock(&pool->lock);
>   }
>  
>   spin_unlock_irq(&workqueue_lock);
> @@ -3642,25 +3645,22 @@ void freeze_workqueues_begin(void)
>   */
>  bool freeze_workqueues_busy(void)
>  {
> - unsigned int cpu;
>   bool busy = false;
> + struct workqueue_struct *wq;
> + struct pool_workqueue *pwq;
>  
>   spin_lock_irq(&workqueue_lock);
>  
>   WARN_ON_ONCE(!workqueue_freezing);
>  
> - for_each_wq_cpu(cpu) {
> - struct workqueue_struct *wq;
> + list_for_each_entry(wq, &workqueues, list) {
> + if (!(wq->flags & WQ_FREEZABLE))
> + continue;
>   /*
>* nr_active is monotonically decreasing.  It's safe
>* to peek without lock.
>*/
> - list_for_each_entry(wq, &workqueues, list) {
> - struct pool_workqueue *pwq = get_pwq(cpu, wq);
> -
> - if (!pwq || !(wq->flags & WQ_FREEZABLE))
> - continue;
> -
> + for_each_pwq(pwq, wq) {
>   WARN_ON_ONCE(pwq->nr_active < 0);
>   if (pwq->nr_active) {
>   busy = true;
> @@ -3684,40 +3684,43 @@ out_unlock:
>   */
>  void thaw_workqueues(void)
>  {
> - unsigned int cpu;
> + struct workqueue_struct *wq;
> + struct pool_workqueue *pwq;
> + struc

Re: [PATCH 12/31] workqueue: update synchronization rules on workqueue->pwqs

2013-03-10 Thread Lai Jiangshan
On 02/03/13 11:24, Tejun Heo wrote:
> Make workqueue->pwqs protected by workqueue_lock for writes and
> sched-RCU protected for reads.  Lockdep assertions are added to
> for_each_pwq() and first_pwq() and all their users are converted to
> either hold workqueue_lock or disable preemption/irq.
> 
> alloc_and_link_pwqs() is updated to use list_add_tail_rcu() for
> consistency which isn't strictly necessary as the workqueue isn't
> visible.  destroy_workqueue() isn't updated to sched-RCU release pwqs.
> This is okay as the workqueue should have on users left by that point.
> 
> The locking is superflous at this point.  This is to help
> implementation of unbound pools/pwqs with custom attributes.
> 
> This patch doesn't introduce any behavior changes.
> 
> Signed-off-by: Tejun Heo 
> ---
>  kernel/workqueue.c | 85 
> +++---
>  1 file changed, 68 insertions(+), 17 deletions(-)
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 02f51b8..ff51c59 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -42,6 +42,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "workqueue_internal.h"
>  
> @@ -118,6 +119,8 @@ enum {
>   * F: wq->flush_mutex protected.
>   *
>   * W: workqueue_lock protected.
> + *
> + * R: workqueue_lock protected for writes.  Sched-RCU protected for reads.
>   */
>  
>  /* struct worker is defined in workqueue_internal.h */
> @@ -169,7 +172,7 @@ struct pool_workqueue {
>   int nr_active;  /* L: nr of active works */
>   int max_active; /* L: max active works */
>   struct list_headdelayed_works;  /* L: delayed works */
> - struct list_headpwqs_node;  /* I: node on wq->pwqs */
> + struct list_headpwqs_node;  /* R: node on wq->pwqs */
>   struct list_headmayday_node;/* W: node on wq->maydays */
>  } __aligned(1 << WORK_STRUCT_FLAG_BITS);
>  
> @@ -189,7 +192,7 @@ struct wq_flusher {
>  struct workqueue_struct {
>   unsigned intflags;  /* W: WQ_* flags */
>   struct pool_workqueue __percpu *cpu_pwqs; /* I: per-cpu pwq's */
> - struct list_headpwqs;   /* I: all pwqs of this wq */
> + struct list_headpwqs;   /* R: all pwqs of this wq */
>   struct list_headlist;   /* W: list of all workqueues */
>  
>   struct mutexflush_mutex;/* protects wq flushing */
> @@ -227,6 +230,11 @@ EXPORT_SYMBOL_GPL(system_freezable_wq);
>  #define CREATE_TRACE_POINTS
>  #include 
>  
> +#define assert_rcu_or_wq_lock()  
> \
> + rcu_lockdep_assert(rcu_read_lock_sched_held() ||\
> +lockdep_is_held(&workqueue_lock),\
> +"sched RCU or workqueue lock should be held")
> +
>  #define for_each_std_worker_pool(pool, cpu)  \
>   for ((pool) = &std_worker_pools(cpu)[0];\
>(pool) < &std_worker_pools(cpu)[NR_STD_WORKER_POOLS]; (pool)++)
> @@ -282,9 +290,16 @@ static inline int __next_wq_cpu(int cpu, const struct 
> cpumask *mask,
>   * for_each_pwq - iterate through all pool_workqueues of the specified 
> workqueue
>   * @pwq: iteration cursor
>   * @wq: the target workqueue
> + *
> + * This must be called either with workqueue_lock held or sched RCU read
> + * locked.  If the pwq needs to be used beyond the locking in effect, the
> + * caller is responsible for guaranteeing that the pwq stays online.
> + *
> + * The if clause exists only for the lockdep assertion and can be ignored.
>   */
>  #define for_each_pwq(pwq, wq)
> \
> - list_for_each_entry((pwq), &(wq)->pwqs, pwqs_node)
> + list_for_each_entry_rcu((pwq), &(wq)->pwqs, pwqs_node)  \
> + if (({ assert_rcu_or_wq_lock(); true; }))

Be aware of this:

    if (somecondition)
        for_each_pwq(pwq, wq)
            one_statement;
    else
        x;

for_each_pwq() will eat the else.

To avoid this, you can use:

#define for_each_pwq(pwq, wq)   \
list_for_each_entry_rcu((pwq), &(wq)->pwqs, pwqs_node)  \
if (({ assert_rcu_or_wq_lock(); false; })) { }  \
else


The same applies to for_each_pool() in a later patch.
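
For completeness, a stand-alone demonstration (plain C; BAD_FOR/GOOD_FOR are
invented names for the example):

#include <stdio.h>
#include <stdbool.h>

static bool check(void) { return true; }  /* stands in for the lockdep assert */

/* ends in a bare "if": a following "else" is captured by the macro */
#define BAD_FOR(i, n)  for ((i) = 0; (i) < (n); (i)++) if (check())
/* ends in "if (...) { } else": a following "else" still binds to the caller's "if" */
#define GOOD_FOR(i, n) for ((i) = 0; (i) < (n); (i)++) if (!check()) { } else

int main(void)
{
    int i;
    int cond = 0;

    if (cond)
        BAD_FOR(i, 3)
            printf("bad body %d\n", i);
    else
        printf("never printed: the else was eaten by the macro\n");

    if (cond)
        GOOD_FOR(i, 3)
            printf("good body %d\n", i);
    else
        printf("good: else reached as intended\n");

    return 0;
}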


>  
>  #ifdef CONFIG_DEBUG_OBJECTS_WORK
>  
> @@ -463,9 +478,19 @@ static struct worker_pool *get_std_worker_pool(int cpu, 
> bool highpri)
>   return &pools[highpri];
>  }
>  
> +/**
> + * first_pwq - return the first pool_workqueue of the specified workqueue
> + * @wq: the target workqueue
> + *
> + * This must be called either with workqueue_lock held or sched RCU read
> + * locked.  If the pwq needs to be used beyond the locking in effect, the
> + * caller is responsible for guaranteeing that the pwq stays on

Re: [PATCH 14/31] workqueue: replace POOL_MANAGING_WORKERS flag with worker_pool->manager_mutex

2013-03-10 Thread Lai Jiangshan
On 02/03/13 11:24, Tejun Heo wrote:
> POOL_MANAGING_WORKERS is used to synchronize the manager role.
> Synchronizing among workers doesn't need blocking and that's why it's
> implemented as a flag.
> 
> It got converted to a mutex a while back to add blocking wait from CPU
> hotplug path - 6037315269 ("workqueue: use mutex for global_cwq
> manager exclusion").  Later it turned out that synchronization among
> workers and cpu hotplug need to be done separately.  Eventually,
> POOL_MANAGING_WORKERS is restored and workqueue->manager_mutex got
> morphed into workqueue->assoc_mutex - 552a37e936 ("workqueue: restore
> POOL_MANAGING_WORKERS") and b2eb83d123 ("workqueue: rename
> manager_mutex to assoc_mutex").
> 
> Now, we're gonna need to be able to lock out managers from
> destroy_workqueue() to support multiple unbound pools with custom
> attributes making it again necessary to be able to block on the
> manager role.  This patch replaces POOL_MANAGING_WORKERS with
> worker_pool->manager_mutex.
> 
> This patch doesn't introduce any behavior changes.
> 
> Signed-off-by: Tejun Heo 
> ---
>  kernel/workqueue.c | 13 ++---
>  1 file changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 2645218..68b3443 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -64,7 +64,6 @@ enum {
>* create_worker() is in progress.
>*/
>   POOL_MANAGE_WORKERS = 1 << 0,   /* need to manage workers */
> - POOL_MANAGING_WORKERS   = 1 << 1,   /* managing workers */
>   POOL_DISASSOCIATED  = 1 << 2,   /* cpu can't serve workers */
>   POOL_FREEZING   = 1 << 3,   /* freeze in progress */
>  
> @@ -145,6 +144,7 @@ struct worker_pool {
>   DECLARE_HASHTABLE(busy_hash, BUSY_WORKER_HASH_ORDER);
>   /* L: hash of busy workers */
>  
> + struct mutexmanager_mutex;  /* the holder is the manager */
>   struct mutexassoc_mutex;/* protect POOL_DISASSOCIATED */
>   struct ida  worker_ida; /* L: for worker IDs */
>  
> @@ -702,7 +702,7 @@ static bool need_to_manage_workers(struct worker_pool 
> *pool)
>  /* Do we have too many workers and should some go away? */
>  static bool too_many_workers(struct worker_pool *pool)
>  {
> - bool managing = pool->flags & POOL_MANAGING_WORKERS;
> + bool managing = mutex_is_locked(&pool->manager_mutex);
>   int nr_idle = pool->nr_idle + managing; /* manager is considered idle */
>   int nr_busy = pool->nr_workers - nr_idle;
>  
> @@ -2027,15 +2027,13 @@ static bool manage_workers(struct worker *worker)
>   struct worker_pool *pool = worker->pool;
>   bool ret = false;
>  
> - if (pool->flags & POOL_MANAGING_WORKERS)
> + if (!mutex_trylock(&pool->manager_mutex))
>   return ret;
>  
> - pool->flags |= POOL_MANAGING_WORKERS;


If mutex_trylock(&pool->manager_mutex) fails, it does not necessarily mean
the pool is managing workers (although with the current code it does),
so I recommend keeping POOL_MANAGING_WORKERS.

I suggest that you reuse assoc_mutex for your purpose (in the later patches)
and rename assoc_mutex back to manager_mutex.


> -
>   /*
>* To simplify both worker management and CPU hotplug, hold off
>* management while hotplug is in progress.  CPU hotplug path can't
> -  * grab %POOL_MANAGING_WORKERS to achieve this because that can
> +  * grab @pool->manager_mutex to achieve this because that can
>* lead to idle worker depletion (all become busy thinking someone
>* else is managing) which in turn can result in deadlock under
>* extreme circumstances.  Use @pool->assoc_mutex to synchronize
> @@ -2075,8 +2073,8 @@ static bool manage_workers(struct worker *worker)
>   ret |= maybe_destroy_workers(pool);
>   ret |= maybe_create_worker(pool);
>  
> - pool->flags &= ~POOL_MANAGING_WORKERS;
>   mutex_unlock(&pool->assoc_mutex);
> + mutex_unlock(&pool->manager_mutex);
>   return ret;
>  }
>  
> @@ -3805,6 +3803,7 @@ static int __init init_workqueues(void)
>   setup_timer(&pool->mayday_timer, pool_mayday_timeout,
>   (unsigned long)pool);
>  
> + mutex_init(&pool->manager_mutex);
>   mutex_init(&pool->assoc_mutex);
>   ida_init(&pool->worker_ida);
>  



Re: [PATCHSET wq/for-3.10-tmp] workqueue: implement workqueue with custom worker attributes

2013-03-10 Thread Lai Jiangshan
On 02/03/13 11:23, Tejun Heo wrote:

Hi, Tejun,

I agree with almost the whole design (except for some of the locking),
and I found only a few small problems in this round of review.

> 
> This patchset contains the following 31 patches.
> 
>  0001-workqueue-make-sanity-checks-less-punshing-using-WAR.patch

>  0002-workqueue-make-workqueue_lock-irq-safe.patch

workqueue_lock protects too many things.  We can introduce separate locks
for different purposes later.

>  0003-workqueue-introduce-kmem_cache-for-pool_workqueues.patch
>  0004-workqueue-add-workqueue_struct-pwqs-list.patch
>  0005-workqueue-replace-for_each_pwq_cpu-with-for_each_pwq.patch
>  0006-workqueue-introduce-for_each_pool.patch
>  0007-workqueue-restructure-pool-pool_workqueue-iterations.patch
>  0008-workqueue-add-wokrqueue_struct-maydays-list-to-repla.patch
>  0009-workqueue-consistently-use-int-for-cpu-variables.patch
>  0010-workqueue-remove-workqueue_struct-pool_wq.single.patch
>  0011-workqueue-replace-get_pwq-with-explicit-per_cpu_ptr-.patch
>  0012-workqueue-update-synchronization-rules-on-workqueue-.patch
>  0013-workqueue-update-synchronization-rules-on-worker_poo.patch

>  0014-workqueue-replace-POOL_MANAGING_WORKERS-flag-with-wo.patch
>  0015-workqueue-separate-out-init_worker_pool-from-init_wo.patch
>  0016-workqueue-introduce-workqueue_attrs.patch
>  0017-workqueue-implement-attribute-based-unbound-worker_p.patch
>  0018-workqueue-remove-unbound_std_worker_pools-and-relate.patch
>  0019-workqueue-drop-std-from-cpu_std_worker_pools-and-for.patch
>  0020-workqueue-add-pool-ID-to-the-names-of-unbound-kworke.patch
>  0021-workqueue-drop-WQ_RESCUER-and-test-workqueue-rescuer.patch
>  0022-workqueue-restructure-__alloc_workqueue_key.patch


>  0023-workqueue-implement-get-put_pwq.patch

I suspect this patch and patch 25 may have a deep issue vs. RCU.

>  0024-workqueue-prepare-flush_workqueue-for-dynamic-creati.patch
>  0025-workqueue-perform-non-reentrancy-test-when-queueing-.patch
>  0026-workqueue-implement-apply_workqueue_attrs.patch
>  0027-workqueue-make-it-clear-that-WQ_DRAINING-is-an-inter.patch
>  0028-workqueue-reject-increasing-max_active-for-ordered-w.patch
>  0029-cpumask-implement-cpumask_parse.patch
>  0030-driver-base-implement-subsys_virtual_register.patch
>  0031-workqueue-implement-sysfs-interface-for-workqueues.patch
> 


for 1~13,15~22,26~28, please add Reviewed-by: Lai Jiangshan 



> 0001-0003 are misc preps.
> 
> 0004-0008 update various iterators such that they don't operate on cpu
> number.
> 
> 0009-0011 are another set of misc preps / cleanups.
> 
> 0012-0014 update synchronization rules to prepare for dynamic
> management of pwqs and pools.
> 
> 0015-0022 introduce workqueue_attrs and prepare for dynamic management
> of pwqs and pools.
> 
> 0023-0026 implement dynamic application of workqueue_attrs which
> involes creating and destroying unbound pwqs and pools dynamically.
> 
> 0027-0028 prepare workqueue for sysfs exports.
> 
> 0029-0030 make cpumask and driver core changes for workqueue sysfs
> exports.
> 
> 0031 implements sysfs exports for workqueues.
> 
> This patchset is on top of
> 
> [1] wq/for-3.10-tmp 7bceeff75e ("workqueue: better define synchronization 
> rule around rescuer->pool updates")
> 
> which is scheduled to be rebased on top of v3.9-rc1 once it comes out.
> The changes are also available in the following git branch.
> 
>  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git review-attrs
> 
> diffstat follows.
> 
>  drivers/base/base.h |2 
>  drivers/base/bus.c  |   73 +
>  drivers/base/core.c |2 
>  include/linux/cpumask.h |   15 
>  include/linux/device.h  |2 
>  include/linux/workqueue.h   |   34 
>  kernel/workqueue.c  | 1716 
> +++-
>  kernel/workqueue_internal.h |5 
>  8 files changed, 1322 insertions(+), 527 deletions(-)
> 
> Thanks.
> 
> --
> tejun
> 
> [1] git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git for-3.10-tmp


Re: [PATCHSET wq/for-3.10-tmp] workqueue: implement workqueue with custom worker attributes

2013-03-11 Thread Lai Jiangshan
On Mon, Mar 11, 2013 at 11:24 PM, Tejun Heo  wrote:
> On Sun, Mar 10, 2013 at 05:01:13AM -0700, Tejun Heo wrote:
>> Hey, Lai.
>>
>> On Sun, Mar 10, 2013 at 06:34:33PM +0800, Lai Jiangshan wrote:
>> > > This patchset contains the following 31 patches.
>> > >
>> > >  0001-workqueue-make-sanity-checks-less-punshing-using-WAR.patch
>> >
>> > >  0002-workqueue-make-workqueue_lock-irq-safe.patch
>> >
>> > workqueue_lock protects too many things. We can introduce different locks
>> > for different purpose later.
>>
>> I don't know.  My general attitude toward locking is the simpler the
>> better.  None of the paths protected by workqueue_lock are hot.
>> There's no actual benefit in making them finer grained.
>
> Heh, I need to make workqueues and pools protected by a mutex rather
> than spinlock, so I'm breaking out the locking after all.  This is
> gonna be a separate series of patches and it seems like there are
> gonna be three locks - wq_mutex (pool and workqueues), pwq_lock
> (spinlock protecting pwqs), wq_mayday_lock (lock for the mayday list).

Glad to hear this.
At the very least wq_mayday_lock is needed: spin_lock_irq(workqueue_lock)
with a long loop in its critical section hurts RT people.

Thanks,
Lai

>
> Thanks.
>
> --
> tejun


Re: [PATCHSET wq/for-3.10-tmp] workqueue: implement workqueue with custom worker attributes

2013-03-11 Thread Lai Jiangshan
On Sun, Mar 10, 2013 at 8:01 PM, Tejun Heo  wrote:
> Hey, Lai.
>
> On Sun, Mar 10, 2013 at 06:34:33PM +0800, Lai Jiangshan wrote:
>> > This patchset contains the following 31 patches.
>> >
>> >  0001-workqueue-make-sanity-checks-less-punshing-using-WAR.patch
>>
>> >  0002-workqueue-make-workqueue_lock-irq-safe.patch
>>
>> workqueue_lock protects too many things. We can introduce different locks
>> for different purpose later.
>
> I don't know.  My general attitude toward locking is the simpler the
> better.  None of the paths protected by workqueue_lock are hot.
> There's no actual benefit in making them finer grained.
>
>> >  0023-workqueue-implement-get-put_pwq.patch
>>
>> I guess this patch and patch25 may have very deep issue VS RCU.
>
> Hmmm... scary.  I suppose you're gonna elaborate on the review of the
> actual patch?
>
>> >  0024-workqueue-prepare-flush_workqueue-for-dynamic-creati.patch
>> >  0025-workqueue-perform-non-reentrancy-test-when-queueing-.patch
>> >  0026-workqueue-implement-apply_workqueue_attrs.patch
>> >  0027-workqueue-make-it-clear-that-WQ_DRAINING-is-an-inter.patch
>> >  0028-workqueue-reject-increasing-max_active-for-ordered-w.patch
>> >  0029-cpumask-implement-cpumask_parse.patch
>> >  0030-driver-base-implement-subsys_virtual_register.patch
>> >  0031-workqueue-implement-sysfs-interface-for-workqueues.patch
>>
>>
>> for 1~13,15~22,26~28, please add Reviewed-by: Lai Jiangshan 
>> 

OK.  Also add my Reviewed-by to 23~25.

>
> Done.
>

I didn't see the updated branch in your tree.

Thanks,
Lai

> Thanks.
>
> --
> tejun


Re: [PATCH wq/for-3.9-fixes] workqueue: fix possible pool stall bug in wq_unbind_fn()

2013-03-11 Thread Lai Jiangshan
Hi, Tejun,

Did you forget to send a pull request?
Adding Linus to CC.


Thanks,
Lai


On 09/03/13 07:15, Tejun Heo wrote:
> From: Lai Jiangshan 
> 
> Since multiple pools per cpu have been introduced, wq_unbind_fn() has
> a subtle bug which may theoretically stall work item processing.  The
> problem is two-fold.
> 
> * wq_unbind_fn() depends on the worker executing wq_unbind_fn() itself
>   to start unbound chain execution, which works fine when there was
>   only single pool.  With multiple pools, only the pool which is
>   running wq_unbind_fn() - the highpri one - is guaranteed to have
>   such kick-off.  The other pool could stall when its busy workers
>   block.
> 
> * The current code is setting WORKER_UNBIND / POOL_DISASSOCIATED of
>   the two pools in succession without initiating work execution
>   inbetween.  Because setting the flags requires grabbing assoc_mutex
>   which is held while new workers are created, this could lead to
>   stalls if a pool's manager is waiting for the previous pool's work
>   items to release memory.  This is almost purely theoretical tho.
> 
> Update wq_unbind_fn() such that it sets WORKER_UNBIND /
> POOL_DISASSOCIATED, goes over schedule() and explicitly kicks off
> execution for a pool and then moves on to the next one.
> 
> tj: Updated comments and description.
> 
> Signed-off-by: Lai Jiangshan 
> Signed-off-by: Tejun Heo 
> Cc: sta...@vger.kernel.org
> ---
> As you seemingly has disappeared, I just fixed up this patch and
> applied it to wq/for-3.9-fixes.
> 
> Thanks.
> 
>  kernel/workqueue.c |   44 +---
>  1 file changed, 25 insertions(+), 19 deletions(-)
> 
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -3446,28 +3446,34 @@ static void wq_unbind_fn(struct work_str
>  
>   spin_unlock_irq(&pool->lock);
>   mutex_unlock(&pool->assoc_mutex);
> - }
>  
> - /*
> -  * Call schedule() so that we cross rq->lock and thus can guarantee
> -  * sched callbacks see the %WORKER_UNBOUND flag.  This is necessary
> -  * as scheduler callbacks may be invoked from other cpus.
> -  */
> - schedule();
> + /*
> +  * Call schedule() so that we cross rq->lock and thus can
> +  * guarantee sched callbacks see the %WORKER_UNBOUND flag.
> +  * This is necessary as scheduler callbacks may be invoked
> +  * from other cpus.
> +  */
> + schedule();
>  
> - /*
> -  * Sched callbacks are disabled now.  Zap nr_running.  After this,
> -  * nr_running stays zero and need_more_worker() and keep_working()
> -  * are always true as long as the worklist is not empty.  Pools on
> -  * @cpu now behave as unbound (in terms of concurrency management)
> -  * pools which are served by workers tied to the CPU.
> -  *
> -  * On return from this function, the current worker would trigger
> -  * unbound chain execution of pending work items if other workers
> -  * didn't already.
> -  */
> - for_each_std_worker_pool(pool, cpu)
> + /*
> +  * Sched callbacks are disabled now.  Zap nr_running.
> +  * After this, nr_running stays zero and need_more_worker()
> +  * and keep_working() are always true as long as the
> +  * worklist is not empty.  This pool now behaves as an
> +  * unbound (in terms of concurrency management) pool which
> +  * are served by workers tied to the pool.
> +  */
>   atomic_set(&pool->nr_running, 0);
> +
> + /*
> +  * With concurrency management just turned off, a busy
> +  * worker blocking could lead to lengthy stalls.  Kick off
> +  * unbound chain execution of currently pending work items.
> +  */
> + spin_lock_irq(&pool->lock);
> + wake_up_worker(pool);
> + spin_unlock_irq(&pool->lock);
> + }
>  }
>  
>  /*
> 



[PATCH 0/3] async: simple cleanups

2013-03-11 Thread Lai Jiangshan
I found some things that need to be cleaned up while looking at what has
changed in async.c.

Lai Jiangshan (3):
  async: simplify lowest_in_progress()
  async: remove unused @node from struct async_domain
  async: rename and redefine async_func_ptr

Cc: Tejun Heo 
Cc: Arjan van de Ven 
---
 arch/sh/drivers/pci/pcie-sh7786.c |2 +-
 include/linux/async.h |   19 ++---
 kernel/async.c|   40 
 3 files changed, 26 insertions(+), 35 deletions(-)

-- 
1.7.4.4



[PATCH 2/3] async: remove unused @node from struct async_domain

2013-03-11 Thread Lai Jiangshan
The @node in struct async_domain is unused after the introduction of
async_global_pending, so remove it.

Signed-off-by: Lai Jiangshan 
Cc: Tejun Heo 
Cc: Arjan van de Ven 
---
 include/linux/async.h |   13 -
 1 files changed, 4 insertions(+), 9 deletions(-)

diff --git a/include/linux/async.h b/include/linux/async.h
index a2e3f18..8e53494 100644
--- a/include/linux/async.h
+++ b/include/linux/async.h
@@ -18,7 +18,6 @@
 typedef u64 async_cookie_t;
 typedef void (async_func_ptr) (void *data, async_cookie_t cookie);
 struct async_domain {
-   struct list_head node;
struct list_head pending;
unsigned registered:1;
 };
@@ -26,19 +25,15 @@ struct async_domain {
 /*
  * domain participates in global async_synchronize_full
  */
-#define ASYNC_DOMAIN(_name) \
-   struct async_domain _name = { .node = LIST_HEAD_INIT(_name.node), \
- .pending = LIST_HEAD_INIT(_name.pending), 
\
- .registered = 1 }
+#define ASYNC_DOMAIN(_name) struct async_domain _name =
\
+   { .pending = LIST_HEAD_INIT(_name.pending), .registered = 1 }
 
 /*
  * domain is free to go out of scope as soon as all pending work is
  * complete, this domain does not participate in async_synchronize_full
  */
-#define ASYNC_DOMAIN_EXCLUSIVE(_name) \
-   struct async_domain _name = { .node = LIST_HEAD_INIT(_name.node), \
- .pending = LIST_HEAD_INIT(_name.pending), 
\
- .registered = 0 }
+#define ASYNC_DOMAIN_EXCLUSIVE(_name) struct async_domain _name =  \
+   { .pending = LIST_HEAD_INIT(_name.pending), .registered = 0 }
 
 extern async_cookie_t async_schedule(async_func_ptr *ptr, void *data);
 extern async_cookie_t async_schedule_domain(async_func_ptr *ptr, void *data,
-- 
1.7.4.4



[PATCH 1/3] async: simplify lowest_in_progress()

2013-03-11 Thread Lai Jiangshan
The code in lowest_in_progress() is duplicated across two branches;
simplify it.

Signed-off-by: Lai Jiangshan 
Cc: Tejun Heo 
Cc: Arjan van de Ven 
---
 kernel/async.c |   20 
 1 files changed, 8 insertions(+), 12 deletions(-)

diff --git a/kernel/async.c b/kernel/async.c
index 8ddee2c..ef66b2f 100644
--- a/kernel/async.c
+++ b/kernel/async.c
@@ -84,24 +84,20 @@ static atomic_t entry_count;
 
 static async_cookie_t lowest_in_progress(struct async_domain *domain)
 {
-   struct async_entry *first = NULL;
+   struct list_head *pending;
async_cookie_t ret = ASYNC_COOKIE_MAX;
unsigned long flags;
 
spin_lock_irqsave(&async_lock, flags);
 
-   if (domain) {
-   if (!list_empty(&domain->pending))
-   first = list_first_entry(&domain->pending,
-   struct async_entry, domain_list);
-   } else {
-   if (!list_empty(&async_global_pending))
-   first = list_first_entry(&async_global_pending,
-   struct async_entry, global_list);
-   }
+   if (domain)
+   pending = &domain->pending;
+   else
+   pending = &async_global_pending;
 
-   if (first)
-   ret = first->cookie;
+   if (!list_empty(pending))
+   ret = list_first_entry(pending, struct async_entry,
+   domain_list)->cookie;
 
spin_unlock_irqrestore(&async_lock, flags);
return ret;
-- 
1.7.4.4



[PATCH 3/3] async: rename and redefine async_func_ptr

2013-03-11 Thread Lai Jiangshan
A function pointer type is typically defined as

typedef ret_type (*func)(args...)

but async_func_ptr is not.  Redefine it that way.

Also rename async_func_ptr to async_func_t, since the _func_t suffix is more
conventional.
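
For illustration only (a standalone sketch; fn_type, fn_ptr_t and hello()
are made-up names, not from the patch), the difference between the two
typedef styles is:

	#include <stdio.h>

	typedef void (fn_type)(int);	/* names a function type          */
	typedef void (*fn_ptr_t)(int);	/* names a pointer-to-function    */

	static void hello(int x) { printf("%d\n", x); }

	int main(void)
	{
		fn_type *a = hello;	/* the extra '*' is needed here   */
		fn_ptr_t b = hello;	/* reads like any other *_t usage */
		a(1);
		b(2);
		return 0;
	}

The pointer form is what callers expect from a *_t typedef, which is why
the rename also flips the definition.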

Signed-off-by: Lai Jiangshan 
Cc: Tejun Heo 
Cc: Arjan van de Ven 
---
 arch/sh/drivers/pci/pcie-sh7786.c |2 +-
 include/linux/async.h |6 +++---
 kernel/async.c|   20 ++--
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/arch/sh/drivers/pci/pcie-sh7786.c 
b/arch/sh/drivers/pci/pcie-sh7786.c
index c2c85f6..a162a7f 100644
--- a/arch/sh/drivers/pci/pcie-sh7786.c
+++ b/arch/sh/drivers/pci/pcie-sh7786.c
@@ -35,7 +35,7 @@ static unsigned int nr_ports;
 
 static struct sh7786_pcie_hwops {
int (*core_init)(void);
-   async_func_ptr *port_init_hw;
+   async_func_t port_init_hw;
 } *sh7786_pcie_hwops;
 
 static struct resource sh7786_pci0_resources[] = {
diff --git a/include/linux/async.h b/include/linux/async.h
index 8e53494..0f1226d 100644
--- a/include/linux/async.h
+++ b/include/linux/async.h
@@ -16,7 +16,7 @@
 #include 
 
 typedef u64 async_cookie_t;
-typedef void (async_func_ptr) (void *data, async_cookie_t cookie);
+typedef void (*async_func_t) (void *data, async_cookie_t cookie);
 struct async_domain {
struct list_head pending;
unsigned registered:1;
@@ -35,8 +35,8 @@ struct async_domain {
 #define ASYNC_DOMAIN_EXCLUSIVE(_name) struct async_domain _name =  \
{ .pending = LIST_HEAD_INIT(_name.pending), .registered = 0 }
 
-extern async_cookie_t async_schedule(async_func_ptr *ptr, void *data);
-extern async_cookie_t async_schedule_domain(async_func_ptr *ptr, void *data,
+extern async_cookie_t async_schedule(async_func_t func, void *data);
+extern async_cookie_t async_schedule_domain(async_func_t func, void *data,
struct async_domain *domain);
 void async_unregister_domain(struct async_domain *domain);
 extern void async_synchronize_full(void);
diff --git a/kernel/async.c b/kernel/async.c
index ef66b2f..61873c3 100644
--- a/kernel/async.c
+++ b/kernel/async.c
@@ -73,7 +73,7 @@ struct async_entry {
struct list_headglobal_list;
struct work_struct  work;
async_cookie_t  cookie;
-   async_func_ptr  *func;
+   async_func_tfunc;
void*data;
struct async_domain *domain;
 };
@@ -145,7 +145,7 @@ static void async_run_entry_fn(struct work_struct *work)
wake_up(&async_done);
 }
 
-static async_cookie_t __async_schedule(async_func_ptr *ptr, void *data, struct 
async_domain *domain)
+static async_cookie_t __async_schedule(async_func_t func, void *data, struct 
async_domain *domain)
 {
struct async_entry *entry;
unsigned long flags;
@@ -165,13 +165,13 @@ static async_cookie_t __async_schedule(async_func_ptr 
*ptr, void *data, struct a
spin_unlock_irqrestore(&async_lock, flags);
 
/* low on memory.. run synchronously */
-   ptr(data, newcookie);
+   func(data, newcookie);
return newcookie;
}
INIT_LIST_HEAD(&entry->domain_list);
INIT_LIST_HEAD(&entry->global_list);
INIT_WORK(&entry->work, async_run_entry_fn);
-   entry->func = ptr;
+   entry->func = func;
entry->data = data;
entry->domain = domain;
 
@@ -198,21 +198,21 @@ static async_cookie_t __async_schedule(async_func_ptr 
*ptr, void *data, struct a
 
 /**
  * async_schedule - schedule a function for asynchronous execution
- * @ptr: function to execute asynchronously
+ * @func: function to execute asynchronously
  * @data: data pointer to pass to the function
  *
  * Returns an async_cookie_t that may be used for checkpointing later.
  * Note: This function may be called from atomic or non-atomic contexts.
  */
-async_cookie_t async_schedule(async_func_ptr *ptr, void *data)
+async_cookie_t async_schedule(async_func_t func, void *data)
 {
-   return __async_schedule(ptr, data, &async_dfl_domain);
+   return __async_schedule(func, data, &async_dfl_domain);
 }
 EXPORT_SYMBOL_GPL(async_schedule);
 
 /**
  * async_schedule_domain - schedule a function for asynchronous execution 
within a certain domain
- * @ptr: function to execute asynchronously
+ * @func: function to execute asynchronously
  * @data: data pointer to pass to the function
  * @domain: the domain
  *
@@ -222,10 +222,10 @@ EXPORT_SYMBOL_GPL(async_schedule);
  * synchronization domain is specified via @domain.  Note: This function
  * may be called from atomic or non-atomic contexts.
  */
-async_cookie_t async_schedule_domain(async_func_ptr *ptr, void *data,
+async_cookie_t async_schedule_domain(async_func_t func, void *data,
 struct async_domain *domain)
 {
-   return __async_schedule(ptr, data, domain);
+

[RFC PATCH 00/23 V2] memory,numa: introduce MOVABLE-dedicated node and online_movable for hotplug

2012-08-01 Thread Lai Jiangshan
A) Introduction:

This patchset adds a MOVABLE-dedicated node and online_movable for
memory management.

They are useful for anti-fragmentation (hugepages, big-order allocations, ...)
and for hot-removal of memory (virtualization, power saving, and moving memory
between systems to make better use of it).

B) changed from V1:

The original V1 patchset of MOVABLE-dedicated node is here:
http://comments.gmane.org/gmane.linux.kernel.mm/78122

The new V2 adds N_MEMORY and the notion of a "MOVABLE-dedicated node",
and fixes some related problems.

The original V1 patchset of "add online_movable" is here:
https://lkml.org/lkml/2012/7/4/145

The new V2 discards the MIGRATE_HOTREMOVE approach and uses a more
straightforward implementation (only 1 patch).

C) User Interface:

When users (administrators of big systems) need to configure some node/memory
as MOVABLE:
1. Use kernelcore_max_addr=XX at boot time.
2. Use the online_movable hotplug action at runtime.
We may introduce a more convenient interface later, such as a
movable_node=NODE_LIST boot option.

D) Patches

Patch 1         introduce N_MEMORY
Patch 2-13      use N_MEMORY instead of N_HIGH_MEMORY.
                The patches are separated by subsystem;
                *these conversions were (and must be) checked carefully*.
                Patch 13 also changes the node_states initialization.
Patch 14,15,17  fix problems in the current code (all related to hotplug).
Patch 18        add a config option to allow a MOVABLE-dedicated node.
Patch 19-22     add kernelcore_max_addr.
Patch 23        add online_movable.


Lai Jiangshan (19):
  node_states: introduce N_MEMORY
  cpuset: use N_MEMORY instead N_HIGH_MEMORY
  procfs: use N_MEMORY instead N_HIGH_MEMORY
  oom: use N_MEMORY instead N_HIGH_MEMORY
  mm,migrate: use N_MEMORY instead N_HIGH_MEMORY
  mempolicy: use N_MEMORY instead N_HIGH_MEMORY
  memcontrol: use N_MEMORY instead N_HIGH_MEMORY
  hugetlb: use N_MEMORY instead N_HIGH_MEMORY
  vmstat: use N_MEMORY instead N_HIGH_MEMORY
  kthread: use N_MEMORY instead N_HIGH_MEMORY
  init: use N_MEMORY instead N_HIGH_MEMORY
  vmscan: use N_MEMORY instead N_HIGH_MEMORY
  page_alloc: use N_MEMORY instead N_HIGH_MEMORY and change the node_states 
initialization
  slub, hotplug: ignore unrelated node's hot-adding and hot-removing
  memory_hotplug: fix missing nodemask management
  numa: add CONFIG_MOVABLE_NODE for movable-dedicated node
  page_alloc.c: don't subtract unrelated memmap from zone's present pages
  page_alloc: add kernelcore_max_addr
  mm, memory-hotplug: add online_movable

Yasuaki Ishimatsu (4):
  x86: get pg_data_t's memory from other node
  x86: use memblock_set_current_limit() to set memblock.current_limit
  memblock: limit memory address from memblock
  memblock: compare current_limit with end variable at
memblock_find_in_range_node()

 Documentation/cgroups/cpusets.txt   |2 +-
 Documentation/kernel-parameters.txt |9 +++
 Documentation/memory-hotplug.txt|   16 -
 arch/x86/kernel/setup.c |4 +-
 arch/x86/mm/init_64.c   |4 +-
 arch/x86/mm/numa.c  |8 ++-
 drivers/base/memory.c   |   19 +++--
 drivers/base/node.c |8 ++-
 fs/proc/kcore.c |2 +-
 fs/proc/task_mmu.c  |4 +-
 include/linux/cpuset.h  |2 +-
 include/linux/memblock.h|1 +
 include/linux/memory_hotplug.h  |   13 +++-
 include/linux/nodemask.h|5 ++
 init/main.c |2 +-
 kernel/cpuset.c |   32 
 kernel/kthread.c|2 +-
 mm/Kconfig  |8 ++
 mm/hugetlb.c|   24 +++---
 mm/memblock.c   |   10 ++-
 mm/memcontrol.c |   18 +++---
 mm/memory_hotplug.c |  137 ---
 mm/mempolicy.c  |   12 ++--
 mm/migrate.c|2 +-
 mm/oom_kill.c   |2 +-
 mm/page_alloc.c |   96 +++--
 mm/page_cgroup.c|2 +-
 mm/slub.c   |6 ++
 mm/vmscan.c |4 +-
 mm/vmstat.c |4 +-
 30 files changed, 335 insertions(+), 123 deletions(-)



[RFC PATCH 03/23 V2] procfs: use N_MEMORY instead N_HIGH_MEMORY

2012-08-01 Thread Lai Jiangshan
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory; we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan 
---
 fs/proc/kcore.c|2 +-
 fs/proc/task_mmu.c |4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 86c67ee..e96d4f1 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -249,7 +249,7 @@ static int kcore_update_ram(void)
/* Not inializedupdate now */
/* find out "max pfn" */
end_pfn = 0;
-   for_each_node_state(nid, N_HIGH_MEMORY) {
+   for_each_node_state(nid, N_MEMORY) {
unsigned long node_end;
node_end  = NODE_DATA(nid)->node_start_pfn +
NODE_DATA(nid)->node_spanned_pages;
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 4540b8f..ed3d381 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1080,7 +1080,7 @@ static struct page *can_gather_numa_stats(pte_t pte, 
struct vm_area_struct *vma,
return NULL;
 
nid = page_to_nid(page);
-   if (!node_isset(nid, node_states[N_HIGH_MEMORY]))
+   if (!node_isset(nid, node_states[N_MEMORY]))
return NULL;
 
return page;
@@ -1232,7 +1232,7 @@ static int show_numa_map(struct seq_file *m, void *v, int 
is_pid)
if (md->writeback)
seq_printf(m, " writeback=%lu", md->writeback);
 
-   for_each_node_state(n, N_HIGH_MEMORY)
+   for_each_node_state(n, N_MEMORY)
if (md->node[n])
seq_printf(m, " N%d=%lu", n, md->node[n]);
 out:
-- 
1.7.1



[RFC PATCH 06/23 V2] mempolicy: use N_MEMORY instead N_HIGH_MEMORY

2012-08-01 Thread Lai Jiangshan
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory; we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan 
---
 mm/mempolicy.c |   12 ++--
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 1d771e4..ad0381d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -212,9 +212,9 @@ static int mpol_set_nodemask(struct mempolicy *pol,
/* if mode is MPOL_DEFAULT, pol is NULL. This is right. */
if (pol == NULL)
return 0;
-   /* Check N_HIGH_MEMORY */
+   /* Check N_MEMORY */
nodes_and(nsc->mask1,
- cpuset_current_mems_allowed, node_states[N_HIGH_MEMORY]);
+ cpuset_current_mems_allowed, node_states[N_MEMORY]);
 
VM_BUG_ON(!nodes);
if (pol->mode == MPOL_PREFERRED && nodes_empty(*nodes))
@@ -1363,7 +1363,7 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, 
maxnode,
goto out_put;
}
 
-   if (!nodes_subset(*new, node_states[N_HIGH_MEMORY])) {
+   if (!nodes_subset(*new, node_states[N_MEMORY])) {
err = -EINVAL;
goto out_put;
}
@@ -2314,7 +2314,7 @@ void __init numa_policy_init(void)
 * fall back to the largest node if they're all smaller.
 */
nodes_clear(interleave_nodes);
-   for_each_node_state(nid, N_HIGH_MEMORY) {
+   for_each_node_state(nid, N_MEMORY) {
unsigned long total_pages = node_present_pages(nid);
 
/* Preserve the largest node */
@@ -2395,7 +2395,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, 
int no_context)
*nodelist++ = '\0';
if (nodelist_parse(nodelist, nodes))
goto out;
-   if (!nodes_subset(nodes, node_states[N_HIGH_MEMORY]))
+   if (!nodes_subset(nodes, node_states[N_MEMORY]))
goto out;
} else
nodes_clear(nodes);
@@ -2429,7 +2429,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, 
int no_context)
 * Default to online nodes with memory if no nodelist
 */
if (!nodelist)
-   nodes = node_states[N_HIGH_MEMORY];
+   nodes = node_states[N_MEMORY];
break;
case MPOL_LOCAL:
/*
-- 
1.7.1



[RFC PATCH 09/23 V2] vmstat: use N_MEMORY instead N_HIGH_MEMORY

2012-08-01 Thread Lai Jiangshan
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory; we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan 
---
 mm/vmstat.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1d9..aa3da12 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -917,7 +917,7 @@ static int pagetypeinfo_show(struct seq_file *m, void *arg)
pg_data_t *pgdat = (pg_data_t *)arg;
 
/* check memoryless node */
-   if (!node_state(pgdat->node_id, N_HIGH_MEMORY))
+   if (!node_state(pgdat->node_id, N_MEMORY))
return 0;
 
seq_printf(m, "Page block order: %d\n", pageblock_order);
@@ -1279,7 +1279,7 @@ static int unusable_show(struct seq_file *m, void *arg)
pg_data_t *pgdat = (pg_data_t *)arg;
 
/* check memoryless node */
-   if (!node_state(pgdat->node_id, N_HIGH_MEMORY))
+   if (!node_state(pgdat->node_id, N_MEMORY))
return 0;
 
walk_zones_in_node(m, pgdat, unusable_show_print);
-- 
1.7.1



[RFC PATCH 14/23 V2] slub, hotplug: ignore unrelated node's hot-adding and hot-removing

2012-08-01 Thread Lai Jiangshan
SLUB only focuses on the nodes which have normal memory, so ignore the
other nodes' hot-adding and hot-removing.

Signed-off-by: Lai Jiangshan 
---
 mm/slub.c |6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 8c691fa..4c5bdc0 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3577,6 +3577,9 @@ static void slab_mem_offline_callback(void *arg)
if (offline_node < 0)
return;
 
+   if (page_zonenum(pfn_to_page(marg->start_pfn)) > ZONE_NORMAL)
+   return;
+
down_read(&slub_lock);
list_for_each_entry(s, &slab_caches, list) {
n = get_node(s, offline_node);
@@ -3611,6 +3614,9 @@ static int slab_mem_going_online_callback(void *arg)
if (nid < 0)
return 0;
 
+   if (page_zonenum(pfn_to_page(marg->start_pfn)) > ZONE_NORMAL)
+   return 0;
+
/*
 * We are bringing a node online. No memory is available yet. We must
 * allocate a kmem_cache_node structure in order to bring the node
-- 
1.7.1



[RFC PATCH 16/23 V2] numa: add CONFIG_MOVABLE_NODE for movable-dedicated node

2012-08-01 Thread Lai Jiangshan
Everything is now prepared, so we can actually introduce N_MEMORY.
Add CONFIG_MOVABLE_NODE so that we can use it for a MOVABLE-dedicated node.

Signed-off-by: Lai Jiangshan 
---
 drivers/base/node.c  |6 ++
 include/linux/nodemask.h |4 
 mm/Kconfig   |8 
 mm/page_alloc.c  |3 +++
 4 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 31f4805..4bf5629 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -621,6 +621,9 @@ static struct node_attr node_state_attr[] = {
 #ifdef CONFIG_HIGHMEM
_NODE_ATTR(has_high_memory, N_HIGH_MEMORY),
 #endif
+#ifdef CONFIG_MOVABLE_NODE
+   _NODE_ATTR(has_memory, N_MEMORY),
+#endif
 };
 
 static struct attribute *node_state_attrs[] = {
@@ -631,6 +634,9 @@ static struct attribute *node_state_attrs[] = {
 #ifdef CONFIG_HIGHMEM
&node_state_attr[4].attr.attr,
 #endif
+#ifdef CONFIG_MOVABLE_NODE
+   &node_state_attr[4].attr.attr,
+#endif
NULL
 };
 
diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index c6ebdc9..4e2cbfa 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -380,7 +380,11 @@ enum node_states {
 #else
N_HIGH_MEMORY = N_NORMAL_MEMORY,
 #endif
+#ifdef CONFIG_MOVABLE_NODE
+   N_MEMORY,   /* The node has memory(regular, high, movable) 
*/
+#else
N_MEMORY = N_HIGH_MEMORY,
+#endif
N_CPU,  /* The node has one or more cpus */
NR_NODE_STATES
 };
diff --git a/mm/Kconfig b/mm/Kconfig
index 82fed4e..4371c65 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -140,6 +140,14 @@ config ARCH_DISCARD_MEMBLOCK
 config NO_BOOTMEM
boolean
 
+config MOVABLE_NODE
+   boolean "Enable to assign a node has only movable memory"
+   depends on HAVE_MEMBLOCK
+   depends on NO_BOOTMEM
+   depends on X86_64
+   depends on NUMA
+   default y
+
 # eventually, we can have this option just 'select SPARSEMEM'
 config MEMORY_HOTPLUG
bool "Allow for memory hot-add"
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0571f2a..737faf7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -91,6 +91,9 @@ nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
 #ifdef CONFIG_HIGHMEM
[N_HIGH_MEMORY] = { { [0] = 1UL } },
 #endif
+#ifdef CONFIG_MOVABLE_NODE
+   [N_MEMORY] = { { [0] = 1UL } },
+#endif
[N_CPU] = { { [0] = 1UL } },
 #endif /* NUMA */
 };
-- 
1.7.1



[RFC PATCH 21/23 V2] memblock: limit memory address from memblock

2012-08-01 Thread Lai Jiangshan
From: Yasuaki Ishimatsu 

Setting kernelcore_max_pfn means that all memory above the given boot
parameter is allocated as ZONE_MOVABLE, so memory allocated by memblock
should also be limited by the parameter.

The patch limits the addresses that memblock hands out accordingly.

Signed-off-by: Yasuaki Ishimatsu 
Signed-off-by: Lai Jiangshan 
---
 include/linux/memblock.h |1 +
 mm/memblock.c|5 -
 mm/page_alloc.c  |6 +-
 3 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 19dc455..f2977ae 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -42,6 +42,7 @@ struct memblock {
 
 extern struct memblock memblock;
 extern int memblock_debug;
+extern phys_addr_t memblock_limit;
 
 #define memblock_dbg(fmt, ...) \
if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
diff --git a/mm/memblock.c b/mm/memblock.c
index 5cc6731..663b805 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -931,7 +931,10 @@ int __init_memblock 
memblock_is_region_reserved(phys_addr_t base, phys_addr_t si
 
 void __init_memblock memblock_set_current_limit(phys_addr_t limit)
 {
-   memblock.current_limit = limit;
+   if (!memblock_limit || (memblock_limit > limit))
+   memblock.current_limit = limit;
+   else
+   memblock.current_limit = memblock_limit;
 }
 
 static void __init_memblock memblock_dump(struct memblock_type *type, char 
*name)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 65ac5c9..c4d3aa0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -209,6 +209,8 @@ static unsigned long __initdata required_kernelcore;
 static unsigned long __initdata required_movablecore;
 static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
 
+phys_addr_t memblock_limit;
+
 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;
 EXPORT_SYMBOL(movable_zone);
@@ -4876,7 +4878,9 @@ static int __init cmdline_parse_core(char *p, unsigned 
long *core)
  */
 static int __init cmdline_parse_kernelcore_max_addr(char *p)
 {
-   return cmdline_parse_core(p, &required_kernelcore_max_pfn);
+   cmdline_parse_core(p, &required_kernelcore_max_pfn);
+   memblock_limit = required_kernelcore_max_pfn << PAGE_SHIFT;
+   return 0;
 }
 early_param("kernelcore_max_addr", cmdline_parse_kernelcore_max_addr);
 #endif
-- 
1.7.1



[RFC PATCH 20/23 V2] x86: use memblock_set_current_limit() to set memblock.current_limit

2012-08-01 Thread Lai Jiangshan
From: Yasuaki Ishimatsu 

memblock.current_limit is set directly even though
memblock_set_current_limit() is provided for that purpose.  Fix it.

Signed-off-by: Yasuaki Ishimatsu 
Signed-off-by: Lai Jiangshan 
---
 arch/x86/kernel/setup.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f4b9b80..bb9d9f8 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -889,7 +889,7 @@ void __init setup_arch(char **cmdline_p)
 
cleanup_highmap();
 
-   memblock.current_limit = get_max_mapped();
+   memblock_set_current_limit(get_max_mapped());
memblock_x86_fill();
 
/*
@@ -925,7 +925,7 @@ void __init setup_arch(char **cmdline_p)
max_low_pfn = max_pfn;
}
 #endif
-   memblock.current_limit = get_max_mapped();
+   memblock_set_current_limit(get_max_mapped());
dma_contiguous_reserve(0);
 
/*
-- 
1.7.1



[RFC PATCH 23/23 V2] mm, memory-hotplug: add online_movable

2012-08-01 Thread Lai Jiangshan
When a memory block/section is onlined by "online_movable", the kernel
will hold no direct references to its pages, so we can remove that memory
whenever needed.

This makes dynamic memory hot-add/remove easier, makes better use of
memory, and helps THP.

Current constraint: only a memory block which is adjacent to ZONE_MOVABLE
can be onlined from ZONE_NORMAL to ZONE_MOVABLE.

For the opposite onlining behavior, we also introduce "online_kernel" to
change a memory block from ZONE_MOVABLE to ZONE_NORMAL when onlining.

Signed-off-by: Lai Jiangshan 
---
 Documentation/memory-hotplug.txt |   14 -
 drivers/base/memory.c|   19 --
 include/linux/memory_hotplug.h   |   13 -
 mm/memory_hotplug.c  |  114 +++--
 4 files changed, 144 insertions(+), 16 deletions(-)

diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
index 89f21b2..7b1269c 100644
--- a/Documentation/memory-hotplug.txt
+++ b/Documentation/memory-hotplug.txt
@@ -161,7 +161,8 @@ a recent addition and not present on older kernels.
in the memory block.
 'state'   : read-write
 at read:  contains online/offline state of memory.
-at write: user can specify "online", "offline" command
+at write: user can specify "online_kernel",
+"online_movable", "online", "offline" command
 which will be performed on al sections in the block.
 'phys_device' : read-only: designed to show the name of physical memory
 device.  This is not well implemented now.
@@ -255,6 +256,17 @@ For onlining, you have to write "online" to the section's 
state file as:
 
 % echo online > /sys/devices/system/memory/memoryXXX/state
 
+This onlining will not change the ZONE type of the target memory section,
+If the memory section is in ZONE_NORMAL, you can change it to ZONE_MOVABLE:
+
+% echo online_movable > /sys/devices/system/memory/memoryXXX/state
+(NOTE: current limit: this memory section must be adjacent to ZONE_MOVABLE)
+
+And if the memory section is in ZONE_MOVABLE, you can change it to ZONE_NORMAL:
+
+% echo online_kernel > /sys/devices/system/memory/memoryXXX/state
+(NOTE: current limit: this memory section must be adjacent to ZONE_NORMAL)
+
 After this, section memoryXXX's state will be 'online' and the amount of
 available memory will be increased.
 
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 7dda4f7..1ad2f48 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -246,7 +246,7 @@ static bool pages_correctly_reserved(unsigned long 
start_pfn,
  * OK to have direct references to sparsemem variables in here.
  */
 static int
-memory_block_action(unsigned long phys_index, unsigned long action)
+memory_block_action(unsigned long phys_index, unsigned long action, int 
online_type)
 {
unsigned long start_pfn, start_paddr;
unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
@@ -262,7 +262,7 @@ memory_block_action(unsigned long phys_index, unsigned long 
action)
if (!pages_correctly_reserved(start_pfn, nr_pages))
return -EBUSY;
 
-   ret = online_pages(start_pfn, nr_pages);
+   ret = online_pages(start_pfn, nr_pages, online_type);
break;
case MEM_OFFLINE:
start_paddr = page_to_pfn(first_page) << PAGE_SHIFT;
@@ -279,7 +279,8 @@ memory_block_action(unsigned long phys_index, unsigned long 
action)
 }
 
 static int memory_block_change_state(struct memory_block *mem,
-   unsigned long to_state, unsigned long from_state_req)
+   unsigned long to_state, unsigned long from_state_req,
+   int online_type)
 {
int ret = 0;
 
@@ -293,7 +294,7 @@ static int memory_block_change_state(struct memory_block 
*mem,
if (to_state == MEM_OFFLINE)
mem->state = MEM_GOING_OFFLINE;
 
-   ret = memory_block_action(mem->start_section_nr, to_state);
+   ret = memory_block_action(mem->start_section_nr, to_state, online_type);
 
if (ret) {
mem->state = from_state_req;
@@ -325,10 +326,14 @@ store_mem_state(struct device *dev,
 
mem = container_of(dev, struct memory_block, dev);
 
-   if (!strncmp(buf, "online", min((int)count, 6)))
-   ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE);
+   if (!strncmp(buf, "online_kernel", min((int)count, 13)))
+   ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE, 
ONLINE_KERNEL);
+   else if (!strncmp(buf, "online_mova

[RFC PATCH 22/23 V2] memblock: compare current_limit with end variable at memblock_find_in_range_node()

2012-08-01 Thread Lai Jiangshan
From: Yasuaki Ishimatsu 

memblock_find_in_range_node() does not compare memblock.current_limit
with the end variable.  Thus even if memblock.current_limit is smaller
than end, the function may return a memory address that is above
memblock.current_limit.

The patch adds the check to memblock_find_in_range_node().

Signed-off-by: Yasuaki Ishimatsu 
Signed-off-by: Lai Jiangshan 
---
 mm/memblock.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index 663b805..ce7fcb6 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -99,11 +99,12 @@ phys_addr_t __init_memblock 
memblock_find_in_range_node(phys_addr_t start,
phys_addr_t align, int nid)
 {
phys_addr_t this_start, this_end, cand;
+   phys_addr_t current_limit = memblock.current_limit;
u64 i;
 
/* pump up @end */
-   if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
-   end = memblock.current_limit;
+   if ((end == MEMBLOCK_ALLOC_ACCESSIBLE) || (end > current_limit))
+   end = current_limit;
 
/* avoid allocating the first page */
start = max_t(phys_addr_t, start, PAGE_SIZE);
-- 
1.7.1



[RFC PATCH 02/23 V2] cpuset: use N_MEMORY instead N_HIGH_MEMORY

2012-08-01 Thread Lai Jiangshan
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory; we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan 
---
 Documentation/cgroups/cpusets.txt |2 +-
 include/linux/cpuset.h|2 +-
 kernel/cpuset.c   |   32 
 3 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/Documentation/cgroups/cpusets.txt 
b/Documentation/cgroups/cpusets.txt
index cefd3d8..12e01d4 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -218,7 +218,7 @@ and name space for cpusets, with a minimum of additional 
kernel code.
 The cpus and mems files in the root (top_cpuset) cpuset are
 read-only.  The cpus file automatically tracks the value of
 cpu_online_mask using a CPU hotplug notifier, and the mems file
-automatically tracks the value of node_states[N_HIGH_MEMORY]--i.e.,
+automatically tracks the value of node_states[N_MEMORY]--i.e.,
 nodes with memory--using the cpuset_track_online_nodes() hook.
 
 
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 838320f..8c8a60d 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -144,7 +144,7 @@ static inline nodemask_t cpuset_mems_allowed(struct 
task_struct *p)
return node_possible_map;
 }
 
-#define cpuset_current_mems_allowed (node_states[N_HIGH_MEMORY])
+#define cpuset_current_mems_allowed (node_states[N_MEMORY])
 static inline void cpuset_init_current_mems_allowed(void) {}
 
 static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index f33c715..2b133db 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -302,10 +302,10 @@ static void guarantee_online_cpus(const struct cpuset *cs,
  * are online, with memory.  If none are online with memory, walk
  * up the cpuset hierarchy until we find one that does have some
  * online mems.  If we get all the way to the top and still haven't
- * found any online mems, return node_states[N_HIGH_MEMORY].
+ * found any online mems, return node_states[N_MEMORY].
  *
  * One way or another, we guarantee to return some non-empty subset
- * of node_states[N_HIGH_MEMORY].
+ * of node_states[N_MEMORY].
  *
  * Call with callback_mutex held.
  */
@@ -313,14 +313,14 @@ static void guarantee_online_cpus(const struct cpuset *cs,
 static void guarantee_online_mems(const struct cpuset *cs, nodemask_t *pmask)
 {
while (cs && !nodes_intersects(cs->mems_allowed,
-   node_states[N_HIGH_MEMORY]))
+   node_states[N_MEMORY]))
cs = cs->parent;
if (cs)
nodes_and(*pmask, cs->mems_allowed,
-   node_states[N_HIGH_MEMORY]);
+   node_states[N_MEMORY]);
else
-   *pmask = node_states[N_HIGH_MEMORY];
-   BUG_ON(!nodes_intersects(*pmask, node_states[N_HIGH_MEMORY]));
+   *pmask = node_states[N_MEMORY];
+   BUG_ON(!nodes_intersects(*pmask, node_states[N_MEMORY]));
 }
 
 /*
@@ -1100,7 +1100,7 @@ static int update_nodemask(struct cpuset *cs, struct 
cpuset *trialcs,
return -ENOMEM;
 
/*
-* top_cpuset.mems_allowed tracks node_stats[N_HIGH_MEMORY];
+* top_cpuset.mems_allowed tracks node_stats[N_MEMORY];
 * it's read-only
 */
if (cs == &top_cpuset) {
@@ -1122,7 +1122,7 @@ static int update_nodemask(struct cpuset *cs, struct 
cpuset *trialcs,
goto done;
 
if (!nodes_subset(trialcs->mems_allowed,
-   node_states[N_HIGH_MEMORY])) {
+   node_states[N_MEMORY])) {
retval =  -EINVAL;
goto done;
}
@@ -2034,7 +2034,7 @@ static struct cpuset *cpuset_next(struct list_head *queue)
  * before dropping down to the next.  It always processes a node before
  * any of its children.
  *
- * In the case of memory hot-unplug, it will remove nodes from N_HIGH_MEMORY
+ * In the case of memory hot-unplug, it will remove nodes from N_MEMORY
  * if all present pages from a node are offlined.
  */
 static void
@@ -2073,7 +2073,7 @@ scan_cpusets_upon_hotplug(struct cpuset *root, enum 
hotplug_event event)
 
/* Continue past cpusets with all mems online */
if (nodes_subset(cp->mems_allowed,
-   node_states[N_HIGH_MEMORY]))
+   node_states[N_MEMORY]))
continue;
 
oldmems = cp->mems_allowed;
@@ -2081,7 +2081,7 @@ scan_cpusets_upon_hotplug(struct cpuset *root, enum 
hotplug_event event)

[RFC PATCH 12/23 V2] vmscan: use N_MEMORY instead N_HIGH_MEMORY

2012-08-01 Thread Lai Jiangshan
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory; we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan 
---
 mm/vmscan.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 66e4310..1888026 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2921,7 +2921,7 @@ static int __devinit cpu_callback(struct notifier_block 
*nfb,
int nid;
 
if (action == CPU_ONLINE || action == CPU_ONLINE_FROZEN) {
-   for_each_node_state(nid, N_HIGH_MEMORY) {
+   for_each_node_state(nid, N_MEMORY) {
pg_data_t *pgdat = NODE_DATA(nid);
const struct cpumask *mask;
 
@@ -2976,7 +2976,7 @@ static int __init kswapd_init(void)
int nid;
 
swap_setup();
-   for_each_node_state(nid, N_HIGH_MEMORY)
+   for_each_node_state(nid, N_MEMORY)
kswapd_run(nid);
hotcpu_notifier(cpu_callback, 0);
return 0;
-- 
1.7.1



[RFC PATCH 11/23 V2] init: use N_MEMORY instead N_HIGH_MEMORY

2012-08-01 Thread Lai Jiangshan
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory; we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan 
---
 init/main.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/init/main.c b/init/main.c
index 4121d1f..c9317aa 100644
--- a/init/main.c
+++ b/init/main.c
@@ -846,7 +846,7 @@ static int __init kernel_init(void * unused)
/*
 * init can allocate pages on any node
 */
-   set_mems_allowed(node_states[N_HIGH_MEMORY]);
+   set_mems_allowed(node_states[N_MEMORY]);
/*
 * init can run on any cpu.
 */
-- 
1.7.1



[RFC PATCH 15/23 V2] memory_hotplug: fix missing nodemask management

2012-08-01 Thread Lai Jiangshan
Currently memory_hotplug only manages node_states[N_HIGH_MEMORY]; it
forgot to manage node_states[N_NORMAL_MEMORY].  Fix it.

Signed-off-by: Lai Jiangshan 
---
 Documentation/memory-hotplug.txt |2 +-
 mm/memory_hotplug.c  |   23 +--
 2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
index 6d0c251..89f21b2 100644
--- a/Documentation/memory-hotplug.txt
+++ b/Documentation/memory-hotplug.txt
@@ -382,7 +382,7 @@ struct memory_notify {
 
 start_pfn is start_pfn of online/offline memory.
 nr_pages is # of pages of online/offline memory.
-status_change_nid is set node id when N_HIGH_MEMORY of nodemask is (will be)
+status_change_nid is set node id when N_MEMORY of nodemask is (will be)
 set/clear. It means a new(memoryless) node gets new memory by online and a
 node loses all memory. If this is -1, then nodemask status is not changed.
 If status_changed_nid >= 0, callback should create/discard structures for the
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 427bb29..c44b39e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -522,8 +522,18 @@ int __ref online_pages(unsigned long pfn, unsigned long 
nr_pages)
init_per_zone_wmark_min();
 
if (onlined_pages) {
+   enum zone_type zoneid = zone_idx(zone);
+
kswapd_run(zone_to_nid(zone));
-   node_set_state(zone_to_nid(zone), N_HIGH_MEMORY);
+
+   node_set_state(nid, N_MEMORY);
+   if (zoneid <= ZONE_NORMAL && N_NORMAL_MEMORY != N_MEMORY)
+   node_set_state(nid, N_NORMAL_MEMORY);
+#ifdef CONFIG_HIGHMEM
+   if (zoneid <= ZONE_HIGHMEM && N_HIGH_MEMORY != N_MEMORY)
+   node_set_state(nid, N_HIGH_MEMORY);
+#endif
+
}
 
vm_total_pages = nr_free_pagecache_pages();
@@ -966,7 +976,16 @@ repeat:
init_per_zone_wmark_min();
 
if (!node_present_pages(node)) {
-   node_clear_state(node, N_HIGH_MEMORY);
+   enum zone_type zoneid = zone_idx(zone);
+
+   node_clear_state(node, N_MEMORY);
+   if (zoneid <= ZONE_NORMAL && N_NORMAL_MEMORY != N_MEMORY)
+   node_clear_state(node, N_NORMAL_MEMORY);
+#ifdef CONFIG_HIGHMEM
+   if (zoneid <= ZONE_HIGHMEM && N_HIGH_MEMORY != N_MEMORY)
+   node_clear_state(node, N_HIGH_MEMORY);
+#endif
+
kswapd_stop(node);
}
 
-- 
1.7.1



[RFC PATCH 17/23 V2] page_alloc.c: don't subtract unrelated memmap from zone's present pages

2012-08-01 Thread Lai Jiangshan
A)==
Currently, the memory page map (the struct page array) is not part of struct
zone.  It is defined in several ways:

FLATMEM: a global memmap, which can be allocated from any zone <= ZONE_NORMAL.
CONFIG_DISCONTIGMEM: a node-specific memmap, which can be allocated from any
                     zone <= ZONE_NORMAL within that node.
CONFIG_SPARSEMEM: a memory-section-specific memmap, which can be allocated from
                  any zone; with CONFIG_SPARSEMEM_VMEMMAP it is not even
                  physically contiguous.

So the memmap is not directly tied to the zone, and its memory can be allocated
outside the zone, so it is wrong to subtract the memmap's size from the zone's
present pages.

B)==
When the system has large holes, the present-pages value after subtraction can
become very small or negative, which makes memory management work badly in that
zone or even makes the zone unusable, although the real present-pages size is
actually large.

C)==
The subtraction is also a problem for memory hot-removal: zone->present_pages
may underflow and become a huge (unsigned long) value.

D)==
The memmap is a large, long-lived, unreclaimable allocation, so it is reasonable
to account for it when computing watermarks.  A proper new approach is needed to
do that, and it should also handle other long-lived unreclaimable memory.

The current blind subtraction of the memmap size is wrong; remove it.
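
A minimal standalone sketch of the B)/C) failure mode, with hypothetical
numbers (assuming 4K pages and a 64-byte struct page; nothing here is
measured on a real machine):

	#include <stdio.h>

	int main(void)
	{
		unsigned long spanned = 1UL << 20;	/* 1M pfns = 4GB span  */
		unsigned long present = 4096;		/* only 16MB present   */
		unsigned long memmap  = spanned * 64 / 4096;	/* 16384 pages */

		/* present < memmap: the unsigned subtraction wraps around */
		printf("present - memmap = %lu\n", present - memmap);
		return 0;
	}

With a zone this holey, the memmap is far bigger than the present pages, so
the adjustment is either skipped or ends up with a meaningless (possibly
wrapped) present_pages, which is the kind of breakage B) and C) describe.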

Signed-off-by: Lai Jiangshan 
---
 mm/page_alloc.c |   20 +---
 1 files changed, 1 insertions(+), 19 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 737faf7..03ad63d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4360,30 +4360,12 @@ static void __paginginit free_area_init_core(struct 
pglist_data *pgdat,
 
for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
-   unsigned long size, realsize, memmap_pages;
+   unsigned long size, realsize;
 
size = zone_spanned_pages_in_node(nid, j, zones_size);
realsize = size - zone_absent_pages_in_node(nid, j,
zholes_size);
 
-   /*
-* Adjust realsize so that it accounts for how much memory
-* is used by this zone for memmap. This affects the watermark
-* and per-cpu initialisations
-*/
-   memmap_pages =
-   PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
-   if (realsize >= memmap_pages) {
-   realsize -= memmap_pages;
-   if (memmap_pages)
-   printk(KERN_DEBUG
-  "  %s zone: %lu pages used for memmap\n",
-  zone_names[j], memmap_pages);
-   } else
-   printk(KERN_WARNING
-   "  %s zone: %lu pages exceeds realsize %lu\n",
-   zone_names[j], memmap_pages, realsize);
-
/* Account for reserved pages */
if (j == 0 && realsize > dma_reserve) {
realsize -= dma_reserve;
-- 
1.7.1



[RFC PATCH 19/23 V2] x86: get pg_data_t's memory from other node

2012-08-01 Thread Lai Jiangshan
From: Yasuaki Ishimatsu 

If the system can create a movable node, on which all memory is allocated
as ZONE_MOVABLE, setup_node_data() cannot allocate the node's pg_data_t
from that node itself.
So when memblock_alloc_nid() fails, setup_node_data() falls back to
memblock_alloc().

Signed-off-by: Yasuaki Ishimatsu 
Signed-off-by: Lai Jiangshan 
---
 arch/x86/mm/numa.c |8 ++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 2d125be..a86e315 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -223,9 +223,13 @@ static void __init setup_node_data(int nid, u64 start, u64 
end)
remapped = true;
} else {
nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
-   if (!nd_pa) {
-   pr_err("Cannot find %zu bytes in node %d\n",
+   if (!nd_pa)
+   printk(KERN_WARNING "Cannot find %zu bytes in node 
%d\n",
   nd_size, nid);
+   nd_pa = memblock_alloc(nd_size, SMP_CACHE_BYTES);
+   if (!nd_pa) {
+   pr_err("Cannot find %zu bytes in other node\n",
+  nd_size);
return;
}
nd = __va(nd_pa);
-- 
1.7.1



[RFC PATCH 18/23 V2] page_alloc: add kernelcore_max_addr

2012-08-01 Thread Lai Jiangshan
The current ZONE_MOVABLE (kernelcore=) boot-option policy doesn't meet our
requirement.  We need something like a kernelcore_max_addr=XX boot option
to limit the upper address of the kernel core.

Memory above that address will be migratable (movable) and easier to
offline (always ready to be offlined when the system doesn't require so
much memory).

This makes dynamic memory hot-add/remove easier, makes better use of
memory, and helps THP.

kernelcore_max_addr=, kernelcore= and movablecore= can all be safely
specified at the same time (or any two of them).

Signed-off-by: Lai Jiangshan 
---
 Documentation/kernel-parameters.txt |9 +
 mm/page_alloc.c |   29 -
 2 files changed, 37 insertions(+), 1 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 12783fa..48dff61 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1216,6 +1216,15 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
use the HighMem zone if it exists, and the Normal
zone if it does not.
 
+   kernelcore_max_addr=nn[KMG] [KNL,X86,IA-64,PPC] This parameter
+   is the same effect as kernelcore parameter, except it
+   specifies the up physical address of memory range
+   usable by the kernel for non-movable allocations.
+   If both kernelcore and kernelcore_max_addr are
+   specified, this requested's priority is higher than
+   kernelcore's.
+   See the kernelcore parameter.
+
kgdbdbgp=   [KGDB,HW] kgdb over EHCI usb debug port.
Format: [,poll interval]
The controller # is the number of the ehci usb debug
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 03ad63d..65ac5c9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -204,6 +204,7 @@ static unsigned long __meminitdata dma_reserve;
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 static unsigned long __meminitdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
 static unsigned long __meminitdata 
arch_zone_highest_possible_pfn[MAX_NR_ZONES];
+static unsigned long __initdata required_kernelcore_max_pfn;
 static unsigned long __initdata required_kernelcore;
 static unsigned long __initdata required_movablecore;
 static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
@@ -4600,6 +4601,7 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 {
int i, nid;
unsigned long usable_startpfn;
+   unsigned long kernelcore_max_pfn;
unsigned long kernelcore_node, kernelcore_remaining;
/* save the state before borrow the nodemask */
nodemask_t saved_node_state = node_states[N_MEMORY];
@@ -4628,6 +4630,9 @@ static void __init find_zone_movable_pfns_for_nodes(void)
required_kernelcore = max(required_kernelcore, corepages);
}
 
+   if (required_kernelcore_max_pfn && !required_kernelcore)
+   required_kernelcore = totalpages;
+
/* If kernelcore was not specified, there is no ZONE_MOVABLE */
if (!required_kernelcore)
goto out;
@@ -4636,6 +4641,12 @@ static void __init find_zone_movable_pfns_for_nodes(void)
find_usable_zone_for_movable();
usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
 
+   if (required_kernelcore_max_pfn)
+   kernelcore_max_pfn = required_kernelcore_max_pfn;
+   else
+   kernelcore_max_pfn = ULONG_MAX >> PAGE_SHIFT;
+   kernelcore_max_pfn = max(kernelcore_max_pfn, usable_startpfn);
+
 restart:
/* Spread kernelcore memory as evenly as possible throughout nodes */
kernelcore_node = required_kernelcore / usable_nodes;
@@ -4662,8 +4673,12 @@ restart:
unsigned long size_pages;
 
start_pfn = max(start_pfn, zone_movable_pfn[nid]);
-   if (start_pfn >= end_pfn)
+   end_pfn = min(kernelcore_max_pfn, end_pfn);
+   if (start_pfn >= end_pfn) {
+   if (!zone_movable_pfn[nid])
+   zone_movable_pfn[nid] = start_pfn;
continue;
+   }
 
/* Account for what is only usable for kernelcore */
if (start_pfn < usable_startpfn) {
@@ -4854,6 +4869,18 @@ static int __init cmdline_parse_core(char *p, unsigned 
long *core)
return 0;
 }
 
+#ifdef CONFIG_MOVABLE_NODE
+/*
+ * kernelcore_max_addr=addr sets the up physical address of memory range
+ * for use for allocations that cannot be reclaimed or migrated.
+ */
+static int __ini

[RFC PATCH 13/23 V2] page_alloc: use N_MEMORY instead N_HIGH_MEMORY change the node_states initialization

2012-08-01 Thread Lai Jiangshan
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory; we should
use N_MEMORY instead.

Since we have introduced N_MEMORY, also update the initialization of
node_states.

Signed-off-by: Lai Jiangshan 
---
 arch/x86/mm/init_64.c |4 +++-
 mm/page_alloc.c   |   40 ++--
 2 files changed, 25 insertions(+), 19 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 2b6b4a3..005f00c 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -625,7 +625,9 @@ void __init paging_init(void)
 *   numa support is not compiled in, and later node_set_state
 *   will not set it back.
 */
-   node_clear_state(0, N_NORMAL_MEMORY);
+   node_clear_state(0, N_MEMORY);
+   if (N_MEMORY != N_NORMAL_MEMORY)
+   node_clear_state(0, N_NORMAL_MEMORY);
 
zone_sizes_init();
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4a4f921..0571f2a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1646,7 +1646,7 @@ bool zone_watermark_ok_safe(struct zone *z, int order, 
unsigned long mark,
  *
  * If the zonelist cache is present in the passed in zonelist, then
  * returns a pointer to the allowed node mask (either the current
- * tasks mems_allowed, or node_states[N_HIGH_MEMORY].)
+ * tasks mems_allowed, or node_states[N_MEMORY].)
  *
  * If the zonelist cache is not available for this zonelist, does
  * nothing and returns NULL.
@@ -1675,7 +1675,7 @@ static nodemask_t *zlc_setup(struct zonelist *zonelist, 
int alloc_flags)
 
allowednodes = !in_interrupt() && (alloc_flags & ALLOC_CPUSET) ?
&cpuset_current_mems_allowed :
-   &node_states[N_HIGH_MEMORY];
+   &node_states[N_MEMORY];
return allowednodes;
 }
 
@@ -3070,7 +3070,7 @@ static int find_next_best_node(int node, nodemask_t 
*used_node_mask)
return node;
}
 
-   for_each_node_state(n, N_HIGH_MEMORY) {
+   for_each_node_state(n, N_MEMORY) {
 
/* Don't want a node to appear more than once */
if (node_isset(n, *used_node_mask))
@@ -3212,7 +3212,7 @@ static int default_zonelist_order(void)
 * local memory, NODE_ORDER may be suitable.
  */
average_size = total_size /
-   (nodes_weight(node_states[N_HIGH_MEMORY]) + 1);
+   (nodes_weight(node_states[N_MEMORY]) + 1);
for_each_online_node(nid) {
low_kmem_size = 0;
total_size = 0;
@@ -4587,7 +4587,7 @@ unsigned long __init 
find_min_pfn_with_active_regions(void)
 /*
  * early_calculate_totalpages()
  * Sum pages in active regions for movable zone.
- * Populate N_HIGH_MEMORY for calculating usable_nodes.
+ * Populate N_MEMORY for calculating usable_nodes.
  */
 static unsigned long __init early_calculate_totalpages(void)
 {
@@ -4600,7 +4600,7 @@ static unsigned long __init 
early_calculate_totalpages(void)
 
totalpages += pages;
if (pages)
-   node_set_state(nid, N_HIGH_MEMORY);
+   node_set_state(nid, N_MEMORY);
}
return totalpages;
 }
@@ -4617,9 +4617,9 @@ static void __init find_zone_movable_pfns_for_nodes(void)
unsigned long usable_startpfn;
unsigned long kernelcore_node, kernelcore_remaining;
/* save the state before borrow the nodemask */
-   nodemask_t saved_node_state = node_states[N_HIGH_MEMORY];
+   nodemask_t saved_node_state = node_states[N_MEMORY];
unsigned long totalpages = early_calculate_totalpages();
-   int usable_nodes = nodes_weight(node_states[N_HIGH_MEMORY]);
+   int usable_nodes = nodes_weight(node_states[N_MEMORY]);
 
/*
 * If movablecore was specified, calculate what size of
@@ -4654,7 +4654,7 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 restart:
/* Spread kernelcore memory as evenly as possible throughout nodes */
kernelcore_node = required_kernelcore / usable_nodes;
-   for_each_node_state(nid, N_HIGH_MEMORY) {
+   for_each_node_state(nid, N_MEMORY) {
unsigned long start_pfn, end_pfn;
 
/*
@@ -4746,23 +4746,27 @@ restart:
 
 out:
/* restore the node_state */
-   node_states[N_HIGH_MEMORY] = saved_node_state;
+   node_states[N_MEMORY] = saved_node_state;
 }
 
-/* Any regular memory on that node ? */
-static void check_for_regular_memory(pg_data_t *pgdat)
+/* Any regular or high memory on that node ? */
+static void check_for_memory(pg_data_t *pgdat, int nid)
 {
-#ifdef CONFIG_HIGHMEM
enum zone_type zone_type;
 
-   for (zone_type = 0; zone_type <= ZONE_NORMAL; zon

[RFC PATCH 10/23 V2] kthread: use N_MEMORY instead N_HIGH_MEMORY

2012-08-01 Thread Lai Jiangshan
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory; we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan 
---
 kernel/kthread.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 3d3de63..4139962 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -280,7 +280,7 @@ int kthreadd(void *unused)
set_task_comm(tsk, "kthreadd");
ignore_signals(tsk);
set_cpus_allowed_ptr(tsk, cpu_all_mask);
-   set_mems_allowed(node_states[N_HIGH_MEMORY]);
+   set_mems_allowed(node_states[N_MEMORY]);
 
current->flags |= PF_NOFREEZE;
 
-- 
1.7.1



[RFC PATCH 08/23 V2] hugetlb: use N_MEMORY instead N_HIGH_MEMORY

2012-08-01 Thread Lai Jiangshan
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory; we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan 
---
 drivers/base/node.c |2 +-
 mm/hugetlb.c|   24 
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index af1a177..31f4805 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -227,7 +227,7 @@ static node_registration_func_t __hugetlb_unregister_node;
 static inline bool hugetlb_register_node(struct node *node)
 {
if (__hugetlb_register_node &&
-   node_state(node->dev.id, N_HIGH_MEMORY)) {
+   node_state(node->dev.id, N_MEMORY)) {
__hugetlb_register_node(node);
return true;
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e198831..661db47 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1046,7 +1046,7 @@ static void return_unused_surplus_pages(struct hstate *h,
 * on-line nodes with memory and will handle the hstate accounting.
 */
while (nr_pages--) {
-   if (!free_pool_huge_page(h, &node_states[N_HIGH_MEMORY], 1))
+   if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
break;
}
 }
@@ -1150,14 +1150,14 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 int __weak alloc_bootmem_huge_page(struct hstate *h)
 {
struct huge_bootmem_page *m;
-   int nr_nodes = nodes_weight(node_states[N_HIGH_MEMORY]);
+   int nr_nodes = nodes_weight(node_states[N_MEMORY]);
 
while (nr_nodes) {
void *addr;
 
addr = __alloc_bootmem_node_nopanic(
NODE_DATA(hstate_next_node_to_alloc(h,
-   &node_states[N_HIGH_MEMORY])),
+   &node_states[N_MEMORY])),
huge_page_size(h), huge_page_size(h), 0);
 
if (addr) {
@@ -1229,7 +1229,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
if (!alloc_bootmem_huge_page(h))
break;
} else if (!alloc_fresh_huge_page(h,
-&node_states[N_HIGH_MEMORY]))
+&node_states[N_MEMORY]))
break;
}
h->max_huge_pages = i;
@@ -1497,7 +1497,7 @@ static ssize_t nr_hugepages_store_common(bool obey_mempolicy,
if (!(obey_mempolicy &&
init_nodemask_of_mempolicy(nodes_allowed))) {
NODEMASK_FREE(nodes_allowed);
-   nodes_allowed = &node_states[N_HIGH_MEMORY];
+   nodes_allowed = &node_states[N_MEMORY];
}
} else if (nodes_allowed) {
/*
@@ -1507,11 +1507,11 @@ static ssize_t nr_hugepages_store_common(bool obey_mempolicy,
count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
init_nodemask_of_node(nodes_allowed, nid);
} else
-   nodes_allowed = &node_states[N_HIGH_MEMORY];
+   nodes_allowed = &node_states[N_MEMORY];
 
h->max_huge_pages = set_max_huge_pages(h, count, nodes_allowed);
 
-   if (nodes_allowed != &node_states[N_HIGH_MEMORY])
+   if (nodes_allowed != &node_states[N_MEMORY])
NODEMASK_FREE(nodes_allowed);
 
return len;
@@ -1812,7 +1812,7 @@ static void hugetlb_register_all_nodes(void)
 {
int nid;
 
-   for_each_node_state(nid, N_HIGH_MEMORY) {
+   for_each_node_state(nid, N_MEMORY) {
struct node *node = &node_devices[nid];
if (node->dev.id == nid)
hugetlb_register_node(node);
@@ -1906,8 +1906,8 @@ void __init hugetlb_add_hstate(unsigned order)
h->free_huge_pages = 0;
for (i = 0; i < MAX_NUMNODES; ++i)
INIT_LIST_HEAD(&h->hugepage_freelists[i]);
-   h->next_nid_to_alloc = first_node(node_states[N_HIGH_MEMORY]);
-   h->next_nid_to_free = first_node(node_states[N_HIGH_MEMORY]);
+   h->next_nid_to_alloc = first_node(node_states[N_MEMORY]);
+   h->next_nid_to_free = first_node(node_states[N_MEMORY]);
snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
huge_page_size(h)/1024);
 
@@ -1995,11 +1995,11 @@ static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
if (!(obey_mempolicy &&
   init_nodemask_of_mempolicy(nodes_allowed))) {
NODEMASK_FREE(no

[RFC PATCH 07/23 V2] memcontrol: use N_MEMORY instead N_HIGH_MEMORY

2012-08-01 Thread Lai Jiangshan
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory; we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan 
---
 mm/memcontrol.c  |   18 +-
 mm/page_cgroup.c |2 +-
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f72b5e5..4402c2e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -797,7 +797,7 @@ static unsigned long mem_cgroup_nr_lru_pages(struct mem_cgroup *memcg,
int nid;
u64 total = 0;
 
-   for_each_node_state(nid, N_HIGH_MEMORY)
+   for_each_node_state(nid, N_MEMORY)
total += mem_cgroup_node_nr_lru_pages(memcg, nid, lru_mask);
return total;
 }
@@ -1549,9 +1549,9 @@ static void mem_cgroup_may_update_nodemask(struct mem_cgroup *memcg)
return;
 
/* make a nodemask where this memcg uses memory from */
-   memcg->scan_nodes = node_states[N_HIGH_MEMORY];
+   memcg->scan_nodes = node_states[N_MEMORY];
 
-   for_each_node_mask(nid, node_states[N_HIGH_MEMORY]) {
+   for_each_node_mask(nid, node_states[N_MEMORY]) {
 
if (!test_mem_cgroup_node_reclaimable(memcg, nid, false))
node_clear(nid, memcg->scan_nodes);
@@ -1622,7 +1622,7 @@ static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap)
/*
 * Check rest of nodes.
 */
-   for_each_node_state(nid, N_HIGH_MEMORY) {
+   for_each_node_state(nid, N_MEMORY) {
if (node_isset(nid, memcg->scan_nodes))
continue;
if (test_mem_cgroup_node_reclaimable(memcg, nid, noswap))
@@ -3700,7 +3700,7 @@ move_account:
drain_all_stock_sync(memcg);
ret = 0;
mem_cgroup_start_move(memcg);
-   for_each_node_state(node, N_HIGH_MEMORY) {
+   for_each_node_state(node, N_MEMORY) {
for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
enum lru_list lru;
for_each_lru(lru) {
@@ -4025,7 +4025,7 @@ static int mem_control_numa_stat_show(struct cgroup *cont, struct cftype *cft,
 
total_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL);
seq_printf(m, "total=%lu", total_nr);
-   for_each_node_state(nid, N_HIGH_MEMORY) {
+   for_each_node_state(nid, N_MEMORY) {
node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL);
seq_printf(m, " N%d=%lu", nid, node_nr);
}
@@ -4033,7 +4033,7 @@ static int mem_control_numa_stat_show(struct cgroup *cont, struct cftype *cft,
 
file_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_FILE);
seq_printf(m, "file=%lu", file_nr);
-   for_each_node_state(nid, N_HIGH_MEMORY) {
+   for_each_node_state(nid, N_MEMORY) {
node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid,
LRU_ALL_FILE);
seq_printf(m, " N%d=%lu", nid, node_nr);
@@ -4042,7 +4042,7 @@ static int mem_control_numa_stat_show(struct cgroup *cont, struct cftype *cft,
 
anon_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_ANON);
seq_printf(m, "anon=%lu", anon_nr);
-   for_each_node_state(nid, N_HIGH_MEMORY) {
+   for_each_node_state(nid, N_MEMORY) {
node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid,
LRU_ALL_ANON);
seq_printf(m, " N%d=%lu", nid, node_nr);
@@ -4051,7 +4051,7 @@ static int mem_control_numa_stat_show(struct cgroup *cont, struct cftype *cft,
 
unevictable_nr = mem_cgroup_nr_lru_pages(memcg, BIT(LRU_UNEVICTABLE));
seq_printf(m, "unevictable=%lu", unevictable_nr);
-   for_each_node_state(nid, N_HIGH_MEMORY) {
+   for_each_node_state(nid, N_MEMORY) {
node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid,
BIT(LRU_UNEVICTABLE));
seq_printf(m, " N%d=%lu", nid, node_nr);
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index eb750f8..e775239 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -271,7 +271,7 @@ void __init page_cgroup_init(void)
if (mem_cgroup_disabled())
return;
 
-   for_each_node_state(nid, N_HIGH_MEMORY) {
+   for_each_node_state(nid, N_MEMORY) {
unsigned long start_pfn, end_pfn;
 
start_pfn = node_start_pfn(nid);
-- 
1.7.1



[RFC PATCH 05/23 V2] mm,migrate: use N_MEMORY instead N_HIGH_MEMORY

2012-08-01 Thread Lai Jiangshan
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory; we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan 
---
 mm/migrate.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index be26d5c..dbe4f86 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1226,7 +1226,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
if (node < 0 || node >= MAX_NUMNODES)
goto out_pm;
 
-   if (!node_state(node, N_HIGH_MEMORY))
+   if (!node_state(node, N_MEMORY))
goto out_pm;
 
err = -EACCES;
-- 
1.7.1



[RFC PATCH 04/23 V2] oom: use N_MEMORY instead N_HIGH_MEMORY

2012-08-01 Thread Lai Jiangshan
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory; we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan 
---
 mm/oom_kill.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ac300c9..1e58f12 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -257,7 +257,7 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
 * the page allocator means a mempolicy is in effect.  Cpuset policy
 * is enforced in get_page_from_freelist().
 */
-   if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask)) {
+   if (nodemask && !nodes_subset(node_states[N_MEMORY], *nodemask)) {
*totalpages = total_swap_pages;
for_each_node_mask(nid, *nodemask)
*totalpages += node_spanned_pages(nid);
-- 
1.7.1



[RFC PATCH 01/23 V2] node_states: introduce N_MEMORY

2012-08-01 Thread Lai Jiangshan
We have N_NORMAL_MEMORY, which stands for the nodes that have normal memory
(zone_type <= ZONE_NORMAL).

And we have N_HIGH_MEMORY, which stands for the nodes that have normal or high
memory.

But we don't have any word to stand for the nodes that have *any* memory.

And we have N_CPU, but no N_MEMORY.

The current code reuses N_HIGH_MEMORY for this purpose because, at present,
any node which has memory must have high or normal memory.

A)  But this reuse is bad for *readability*, because the name
N_HIGH_MEMORY only suggests high or normal memory:

A.example 1)
mem_cgroup_nr_lru_pages():
for_each_node_state(nid, N_HIGH_MEMORY)

The reader will be confused (why does this function only count nodes with
high or normal memory? does it count ZONE_MOVABLE's lru pages?) until
someone tells them that N_HIGH_MEMORY is reused to stand for nodes that
have any memory.

A.cont) If we introduce N_MEMORY, we can reduce this confusion
AND make the code clearer:

A.example 2) mm/page_cgroup.c uses N_HIGH_MEMORY twice:

The first use is in page_cgroup_init(void):
for_each_node_state(nid, N_HIGH_MEMORY) {

It means that if the node has any memory, we will allocate a page_cgroup
map for the node. We should use N_MEMORY here instead to gain clarity
(a sketch of the two different questions follows at the end of this
description).

The second use is in alloc_page_cgroup():
if (node_state(nid, N_HIGH_MEMORY))
addr = vzalloc_node(size, nid);

It means that the node has high or normal memory that the kernel can
allocate from. We should keep N_HIGH_MEMORY here, and it will be better
once the "any memory" semantic of N_HIGH_MEMORY is removed.

B)  This reuse becomes outdated once we introduce MOVABLE-dedicated nodes.
A MOVABLE-dedicated node should appear in neither
node_states[N_HIGH_MEMORY] nor node_states[N_NORMAL_MEMORY],
because a MOVABLE-dedicated node has no high or normal memory.

On x86_64, N_HIGH_MEMORY == N_NORMAL_MEMORY, so if a MOVABLE-dedicated
node is in node_states[N_HIGH_MEMORY], it is also in
node_states[N_NORMAL_MEMORY], which breaks SLUB.

SLUB uses
for_each_node_state(nid, N_NORMAL_MEMORY)
and would create a kmem_cache_node for the MOVABLE-dedicated node,
causing problems.

In short, we need N_MEMORY. This patch introduces it merely as an alias of
N_HIGH_MEMORY; the improper usages of N_HIGH_MEMORY are fixed in later patches.
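
For illustration only (not part of this patch): a minimal sketch of the two
different questions described in A.example 2 above, assuming the usual
nodemask.h helpers. walk_memory_nodes() is a hypothetical function, not
existing kernel code; it only shows which node state answers which question:

    /* Hypothetical sketch; needs <linux/nodemask.h> and <linux/vmalloc.h>.
     * N_MEMORY answers "does this node have any memory at all?" (including
     * MOVABLE-only nodes), while N_HIGH_MEMORY answers "does this node have
     * normal/high memory the kernel itself can allocate from?". */
    static void __init walk_memory_nodes(unsigned long size)
    {
            int nid;

            for_each_node_state(nid, N_MEMORY) {
                    void *addr = NULL;

                    if (node_state(nid, N_HIGH_MEMORY))
                            addr = vzalloc_node(size, nid);
                    if (!addr)              /* e.g. a MOVABLE-dedicated node */
                            addr = vzalloc(size);
                    /* ... set up the per-node data at addr ... */
                    vfree(addr);
            }
    }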

Signed-off-by: Lai Jiangshan 
---
 include/linux/nodemask.h |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 7afc363..c6ebdc9 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -380,6 +380,7 @@ enum node_states {
 #else
N_HIGH_MEMORY = N_NORMAL_MEMORY,
 #endif
+   N_MEMORY = N_HIGH_MEMORY,
N_CPU,  /* The node has one or more cpus */
NR_NODE_STATES
 };
-- 
1.7.1


