Re: [RFC 2/9] workqueue: unconfine alloc/apply/free_workqueue_attrs()

2019-07-29 Thread Tejun Heo
On Thu, Jul 25, 2019 at 05:24:58PM -0400, Daniel Jordan wrote:
> padata will use these interfaces in a later patch, so unconfine them.
> 
> Signed-off-by: Daniel Jordan 

Acked-by: Tejun Heo 

Thanks.

-- 
tejun


Re: [RFC 3/9] workqueue: require CPU hotplug read exclusion for apply_workqueue_attrs

2019-07-29 Thread Tejun Heo
On Thu, Jul 25, 2019 at 05:24:59PM -0400, Daniel Jordan wrote:
> Change the calling convention for apply_workqueue_attrs to require CPU
> hotplug read exclusion.
> 
> Avoids lockdep complaints about nested calls to get_online_cpus in a
> future patch where padata calls apply_workqueue_attrs when changing
> other CPU-hotplug-sensitive data structures with the CPU read lock
> already held.
> 
> Signed-off-by: Daniel Jordan 

Acked-by: Tejun Heo 

Please feel free to route with the rest of the patchset.

Thanks.

-- 
tejun


Re: workqueue list corruption

2017-06-05 Thread Tejun Heo
Hello,

On Sun, Jun 04, 2017 at 12:30:03PM -0700, Cong Wang wrote:
> On Tue, Apr 18, 2017 at 8:08 PM, Samuel Holland  wrote:
> > Representative backtraces follow (the warnings come in sets). I have
> > kernel .configs and extended netconsole output from several occurrences
> > available upon request.
> >
> > WARNING: CPU: 1 PID: 0 at lib/list_debug.c:33 __list_add+0x89/0xb0
> > list_add corruption. prev->next should be next (99f135016a90), but
> > was d34affc03b10. (prev=d34affc03b10).

So, while trying to move a work item from the delayed list to the
pending list, the pending list's last item's next pointer is no longer
pointing to the head and looks re-initialized.  Could be a premature
free and reuse.

If this is reproducible, it'd help a lot to update move_linked_works()
to check for list validity directly and print out the work function of
the corrupt work item.  There's no guarantee that the re-user is the
one which did the premature free, but the fact that we're likely seeing
INIT_LIST_HEAD() instead of random corruption is encouraging, so
there's some chance that doing that would point us to the culprit, or
at least pretty close to it.
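
Something like the following is what I have in mind (untested sketch;
the exact placement inside move_linked_works() and the printk format
are up for grabs):

	/* before splicing onto @head, verify that the tail still points
	 * back at the head; if not, print the work fn of the stale entry */
	if (unlikely(head->prev->next != head)) {
		struct work_struct *stale =
			container_of(head->prev, struct work_struct, entry);

		pr_err("workqueue: pending list corrupted, tail work fn: %pS\n",
		       stale->func);
	}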

Thanks.

-- 
tejun


Re: [PATCH] crypto: qat: Remove deprecated create_workqueue

2016-06-11 Thread Tejun Heo
On Wed, Jun 08, 2016 at 02:47:47AM +0530, Bhaktipriya Shridhar wrote:
> alloc_workqueue replaces deprecated create_workqueue().
> 
> The workqueue device_reset_wq has work item &reset_data->reset_work per
> adf_reset_dev_data. The workqueue pf2vf_resp_wq, used for PF2VF
> responses, has work item &pf2vf_resp->pf2vf_resp_work per pf2vf_resp.
> The workqueue adf_vf_stop_wq is used to call adf_dev_stop()
> asynchronously.
> 
> Dedicated workqueues have been used in all cases since the work items
> on the workqueues are involved in crypto operations which can be used in
> the I/O path, which is depended upon during memory reclaim. Hence,
> WQ_MEM_RECLAIM has been set to guarantee forward progress under memory
> pressure.
> Since there are only a fixed number of work items, an explicit concurrency
> limit is unnecessary.
> 
> Signed-off-by: Bhaktipriya Shridhar 

Acked-by: Tejun Heo 

Thanks.
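
For reference, the conversion pattern being described is roughly the
following (the workqueue name string is just a placeholder, not
necessarily what the qat driver uses):

	/* before: create_workqueue() is deprecated */
	device_reset_wq = create_workqueue("qat_device_reset_wq");

	/* after: no explicit concurrency limit (max_active = 0, i.e. the
	 * default), but WQ_MEM_RECLAIM so forward progress is guaranteed
	 * under memory pressure */
	device_reset_wq = alloc_workqueue("qat_device_reset_wq",
					  WQ_MEM_RECLAIM, 0);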

-- 
tejun


Re: [PATCH v4 6/7] sched: add function nr_running_cpu to expose number of tasks running on cpu

2014-07-15 Thread Tejun Heo
On Tue, Jul 15, 2014 at 03:36:27PM +0200, Peter Zijlstra wrote:
> So, just to expand on this, we're already getting 'bug' reports because
> worker threads are not cgroup aware. If work gets generated inside some
> cgroup, the worker doesn't care and runs the worker thread wherever
> (typically the root cgroup).
> 
> This means that the 'work' escapes the cgroup confines and creates
> resource inversion etc. The same is of course true for nice and RT
> priorities.
> 
> TJ, are you aware of this and/or given it any thought?

Yeap, I'm aware of the issue but haven't read any actual bug reports
yet.  Can you point me to the reports?

Given that worker pool management is dynamic, spawning separate pools
for individual cgroups on-demand should be doable.  Haven't been able
to decide how much we should be willing to pay in terms of complexity
yet.

Thanks.

-- 
tejun


Re: [PATCH 39/51] DMA-API: others: use dma_set_coherent_mask()

2013-09-20 Thread Tejun Heo
Hey,

On Fri, Sep 20, 2013 at 03:00:18PM +0100, Russell King - ARM Linux wrote:
> Another would be if subsystem maintainers are happy that I carry them,
> I can add the acks, and then later on towards the end of the cycle,
> provide a branch subsystem maintainers could pull.
> 
> Or... if you can think of something easier...

I'm happy with the latter method and it's likely that you'll end up
carrying at least some of the patches through your tree anyway.
Please feel free to add my acks to all libata related patches and
carry them through your tree.

Thanks and have fun routing.

-- 
tejun


Re: [PATCH 39/51] DMA-API: others: use dma_set_coherent_mask()

2013-09-20 Thread Tejun Heo
On Fri, Sep 20, 2013 at 07:16:52AM -0500, Tejun Heo wrote:
> On Fri, Sep 20, 2013 at 12:11:38AM +0100, Russell King wrote:
> > The correct way for a driver to specify the coherent DMA mask is
> > not to directly access the field in the struct device, but to use
> > dma_set_coherent_mask().  Only arch and bus code should access this
> > member directly.
> > 
> > Convert all direct write accesses to using the correct API.
> > 
> > Signed-off-by: Russell King 
> 
> Acked-by: Tejun Heo 
> 
> The patch is pretty widely spread.  I don't mind how it gets routed
> but what's the plan?

Hmm... maybe the pata_ixp4xx_cf.c part should be moved to the one
which updates pata_octeon_cf.c?

Thanks.

-- 
tejun


Re: [PATCH 39/51] DMA-API: others: use dma_set_coherent_mask()

2013-09-20 Thread Tejun Heo
On Fri, Sep 20, 2013 at 12:11:38AM +0100, Russell King wrote:
> The correct way for a driver to specify the coherent DMA mask is
> not to directly access the field in the struct device, but to use
> dma_set_coherent_mask().  Only arch and bus code should access this
> member directly.
> 
> Convert all direct write accesses to using the correct API.
> 
> Signed-off-by: Russell King 

Acked-by: Tejun Heo 

The patch is pretty widely spread.  I don't mind how it gets routed
but what's the plan?

Thanks.

-- 
tejun


Re: [PATCH v3 2/4] lib/scatterlist: introduce sg_pcopy_from_buffer() and sg_pcopy_to_buffer()

2013-06-24 Thread Tejun Heo
On Sun, Jun 23, 2013 at 09:37:39PM +0900, Akinobu Mita wrote:
> The only difference between sg_pcopy_{from,to}_buffer() and
> sg_copy_{from,to}_buffer() is an additional argument that specifies
> the number of bytes to skip in the SG list before copying.
> 
> Signed-off-by: Akinobu Mita 
> Cc: Tejun Heo 
> Cc: Imre Deak 
> Cc: Herbert Xu 
> Cc: "David S. Miller" 
> Cc: linux-crypto@vger.kernel.org
> Cc: "James E.J. Bottomley" 
> Cc: Douglas Gilbert 
> Cc: linux-s...@vger.kernel.org

Acked-by: Tejun Heo 

Thanks.
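
For reference, usage of the new skip argument looks roughly like the
following (buffer, length and offset names made up):

	/* copy @len bytes out of the SG list, skipping the first @skip bytes */
	copied = sg_pcopy_to_buffer(sgl, nents, buf, len, skip);

	/* the existing non-skipping variant always starts at offset 0 */
	copied = sg_copy_to_buffer(sgl, nents, buf, len);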

-- 
tejun


Re: [PATCH v2 2/4] lib/scatterlist: introduce sg_pcopy_from_buffer() and sg_pcopy_to_buffer()

2013-06-18 Thread Tejun Heo
On Tue, Jun 18, 2013 at 10:31:32PM +0900, Akinobu Mita wrote:
>  /**
> + * sg_miter_seek - reposition mapping iterator
> + * @miter: sg mapping iter to be seeked
> + * @offset: number of bytes to plus the current location
> + *
> + * Description:
> + *   Sets the offset of @miter to its current location plus @offset bytes.
> + *
> + * Notes:
> + *   This function must be called just after sg_miter_start() or 
> sg_miter_stop()

It's not gonna be hard to make this function handle any state,
right?  Wouldn't it be better to do that?  It's a pretty generic
feature after all.  Also, maybe a better name is sg_miter_skip()?
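
Something along these lines is what I mean - stop the iterator first so
that the function can be called in any state (untested sketch):

	static bool sg_miter_skip(struct sg_mapping_iter *miter, off_t offset)
	{
		sg_miter_stop(miter);	/* flush any currently mapped page */

		while (offset) {
			off_t consumed;

			if (!sg_miter_get_next_page(miter))
				return false;

			consumed = min_t(off_t, offset, miter->__remaining);
			miter->__offset += consumed;
			miter->__remaining -= consumed;
			offset -= consumed;
		}

		return true;
	}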

Thanks.

-- 
tejun


Re: [PATCH v2 1/4] lib/scatterlist: factor out sg_miter_get_next_page() from sg_miter_next()

2013-06-18 Thread Tejun Heo
On Tue, Jun 18, 2013 at 10:31:31PM +0900, Akinobu Mita wrote:
> This function is used to proceed page iterator to the next page if
> necessary, and will be used to implement the variants of
> sg_copy_{from,to}_buffer() later.
> 
> Signed-off-by: Akinobu Mita 
> Cc: Tejun Heo 
> Cc: Imre Deak 
> Cc: Herbert Xu 
> Cc: "David S. Miller" 
> Cc: linux-crypto@vger.kernel.org
> Cc: "James E.J. Bottomley" 
> Cc: Douglas Gilbert 
> Cc: linux-s...@vger.kernel.org

Acked-by: Tejun Heo 

Thanks.

-- 
tejun


Re: [PATCH 1/3] lib/scatterlist: introduce sg_pcopy_from_buffer() and sg_pcopy_to_buffer()

2013-06-06 Thread Tejun Heo
Hello,

On Thu, Jun 06, 2013 at 09:52:56PM +0900, Akinobu Mita wrote:
> +static bool sg_miter_get_next_page(struct sg_mapping_iter *miter)
> +{
> + if (!miter->__remaining) {
> + struct scatterlist *sg;
> + unsigned long pgoffset;
> +
> + if (!__sg_page_iter_next(&miter->piter))
> + return false;
> +
> + sg = miter->piter.sg;
> + pgoffset = miter->piter.sg_pgoffset;
> +
> + miter->__offset = pgoffset ? 0 : sg->offset;
> + miter->__remaining = sg->offset + sg->length -
> + (pgoffset << PAGE_SHIFT) - miter->__offset;
> + miter->__remaining = min_t(unsigned long, miter->__remaining,
> +PAGE_SIZE - miter->__offset);
> + }
> +
> + return true;
> +}

It'd be better if separating out this function were a separate patch.
Mixing factoring something out with adding new stuff makes the patch a
bit harder to read.

> +static bool sg_miter_seek(struct sg_mapping_iter *miter, off_t offset)
> +{
> + WARN_ON(miter->addr);
> +
> + while (offset) {
> + off_t consumed;
> +
> + if (!sg_miter_get_next_page(miter))
> + return false;
> +
> + consumed = min_t(off_t, offset, miter->__remaining);
> + miter->__offset += consumed;
> + miter->__remaining -= consumed;
> + offset -= consumed;
> + }
> +
> + return true;
> +}

While I think the above should work at the beginning, what if @miter
is in the middle of iteration and __remaining isn't zero?

Looks good to me otherwise.

Thanks.

-- 
tejun


Re: Subject: [PATCHSET v2 wq/for-3.10] workqueue: NUMA affinity for unbound workqueues

2013-04-01 Thread Tejun Heo
Applied to wq/for-3.10 with the updated patches and Lai's
Reviewed-by's added.

Thanks.

-- 
tejun


[PATCH v5 13/14] workqueue: implement NUMA affinity for unbound workqueues

2013-04-01 Thread Tejun Heo
From 4c16bd327c74d6678858706211a0c6e4e53eb3e6 Mon Sep 17 00:00:00 2001
From: Tejun Heo 
Date: Mon, 1 Apr 2013 11:23:36 -0700

Currently, an unbound workqueue has a single current, or first, pwq
(pool_workqueue) to which all new work items are queued.  This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.

This patch implements NUMA affinity for unbound workqueues.  Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has online CPUs in
@attrs->cpumask.  Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering whole @attrs->cpumask.

As CPUs come up and go down, the pool association is changed
accordingly.  Changing pool association may involve allocating new
pools which may fail.  To avoid failing CPU_DOWN, each workqueue
always keeps a default pwq which covers the whole attrs->cpumask and is
used as a fallback if pool creation fails during a CPU hotplug
operation.

This ensures that all work items issued on a NUMA node are executed on
the same node as long as the workqueue allows execution on the CPUs of
the node.

As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this changes the behavior of max_active.  The limit is now per NUMA
node instead of global.  While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and primarily used as a safety
mechanism rather than for active concurrency control.  Concurrency is
usually limited by workqueue users through the number of concurrently
active work items, so this change shouldn't matter much.

v2: Fixed pwq freeing in apply_workqueue_attrs() error path.  Spotted
by Lai.

v3: The previous version incorrectly made a workqueue spanning
multiple nodes spread work items over all online CPUs when some of
its nodes don't have any desired cpus.  Reimplemented so that NUMA
affinity is properly updated as CPUs go up and down.  This problem
was spotted by Lai Jiangshan.

v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
however, wq may be freed at any time after dfl_pwq is put making
the clearing use-after-free.  Clear wq->dfl_pwq before putting it.

v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and
@pwq_tbl after success.  Fixed.

Retry loop in wq_update_unbound_numa_attrs() isn't necessary as
application of new attrs is excluded via CPU hotplug.  Removed.

Documentation on CPU affinity guarantee on CPU_DOWN added.

All changes are suggested by Lai Jiangshan.

Signed-off-by: Tejun Heo 
Reviewed-by: Lai Jiangshan 
---
 kernel/workqueue.c | 278 +
 1 file changed, 259 insertions(+), 19 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index d9a4aeb..57cd77d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -45,6 +45,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "workqueue_internal.h"
 
@@ -245,6 +246,7 @@ struct workqueue_struct {
int saved_max_active; /* WQ: saved pwq max_active */
 
struct workqueue_attrs  *unbound_attrs; /* WQ: only for unbound wqs */
+   struct pool_workqueue   *dfl_pwq;   /* WQ: only for unbound wqs */
 
 #ifdef CONFIG_SYSFS
struct wq_device*wq_dev;/* I: for sysfs interface */
@@ -268,6 +270,9 @@ static cpumask_var_t *wq_numa_possible_cpumask;
 
 static bool wq_numa_enabled;   /* unbound NUMA affinity enabled */
 
+/* buf for wq_update_unbound_numa_attrs(), protected by CPU hotplug exclusion 
*/
+static struct workqueue_attrs *wq_update_unbound_numa_attrs_buf;
+
 static DEFINE_MUTEX(wq_pool_mutex);/* protects pools and workqueues list */
 static DEFINE_SPINLOCK(wq_mayday_lock);/* protects wq->maydays list */
 
@@ -3710,6 +3715,61 @@ static struct pool_workqueue *alloc_unbound_pwq(struct 
workqueue_struct *wq,
return pwq;
 }
 
+/* undo alloc_unbound_pwq(), used only in the error path */
+static void free_unbound_pwq(struct pool_workqueue *pwq)
+{
+   lockdep_assert_held(&wq_pool_mutex);
+
+   if (pwq) {
+   put_unbound_pool(pwq->pool);
+   kfree(pwq);
+   }
+}
+
+/**
+ * wq_calc_node_mask - calculate a wq_attrs' cpumask for the specified node
+ * @attrs: the wq_attrs of interest
+ * @node: the target NUMA node
+ * @cpu_going_down: if >= 0, the CPU to consider as offline
+ * @cpumask: outarg, the resulting cpumask
+ *
+ * Calculate the cpumask a workqueue with @attrs should use on @node.  If
+ * @cpu_going_down is >= 0, that cpu is considered offline during
+ * calculation.  The result is stored in @cpumask.  This function returns
+ * %true if the resulting @cpumask is different from @at

[PATCH 0.5/14] workqueue: fix memory leak in apply_workqueue_attrs()

2013-04-01 Thread Tejun Heo
From 4862125b0256a946d2749a1d5003b0604bc3cb4d Mon Sep 17 00:00:00 2001
From: Tejun Heo 
Date: Mon, 1 Apr 2013 11:23:31 -0700

apply_workqueue_attrs() wasn't freeing temp attrs variable @new_attrs
in its success path.  Fix it.

Signed-off-by: Tejun Heo 
Reported-by: Lai Jiangshan 
---
This causes some minor conflicts down the series but nothing
noteworthy.  I'm not posting the whole refreshed series.  If you want
it, holler.

Thanks.

 kernel/workqueue.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index abe1f0d..89480fc 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3636,6 +3636,7 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
struct workqueue_attrs *new_attrs;
struct pool_workqueue *pwq = NULL, *last_pwq;
struct worker_pool *pool;
+   int ret;
 
/* only unbound workqueues can change attributes */
if (WARN_ON(!(wq->flags & WQ_UNBOUND)))
@@ -3668,12 +3669,16 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
spin_unlock_irq(&last_pwq->pool->lock);
}
 
-   return 0;
+   ret = 0;
+   /* fall through */
+out_free:
+   free_workqueue_attrs(new_attrs);
+   return ret;
 
 enomem:
kmem_cache_free(pwq_cache, pwq);
-   free_workqueue_attrs(new_attrs);
-   return -ENOMEM;
+   ret = -ENOMEM;
+   goto out_free;
 }
 
 static int alloc_and_link_pwqs(struct workqueue_struct *wq)
-- 
1.8.1.4



Re: [PATCH v4 13/14] workqueue: implement NUMA affinity for unbound workqueues

2013-03-31 Thread Tejun Heo
Hello, Lai.

On Sun, Mar 31, 2013 at 01:23:46AM +0800, Lai Jiangshan wrote:
> But for unbound wq when cpu hotplug:
> w/o NUMA affinity, works are always in the cpus if there is an online
> cpu in wq's cpumask
> w/ NUMA affinity, ... NOT always, even ...
> 

Yeah, this is rather unfortunate, but the cpumask for unbound workqueues
is a completely new thing anyway and I think providing a guarantee
similar to the per-cpu one should be enough.  Things are much simpler
that way, and requiring users which depend on hard affinity to take care
of flushing is reasonable enough and in line with how workqueue has
traditionally been working.
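
E.g. a user which depends on the hard guarantee can keep it itself with
something like the following (sketch; callback and work item names are
hypothetical):

	static int my_cpu_callback(struct notifier_block *nb,
				   unsigned long action, void *hcpu)
	{
		/* flush before the CPU goes down so the whole execution
		 * happens while the old cpumask is still in effect */
		if ((action & ~CPU_TASKS_FROZEN) == CPU_DOWN_PREPARE)
			flush_work(&my_work);
		return NOTIFY_OK;
	}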

> > Workqueue's affinity guarantee is very specific - the work item owner is
> > responsible for flushing the work item during CPU DOWN if it wants
> > to guarantee affinity over full execution. 
> 
> Could you add the comments and add Reviewed-by: Lai Jiangshan 
> 
> for the patchset?

Sure thing.

Thanks.

-- 
tejun


[PATCH v4 13/14] workqueue: implement NUMA affinity for unbound workqueues

2013-03-29 Thread Tejun Heo
From e4bc30ced68420e89d264a26e10d450765a747ed Mon Sep 17 00:00:00 2001
From: Tejun Heo 
Date: Fri, 29 Mar 2013 15:42:27 -0700

Currently, an unbound workqueue has a single current, or first, pwq
(pool_workqueue) to which all new work items are queued.  This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.

This patch implements NUMA affinity for unbound workqueues.  Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has online CPUs in
@attrs->cpumask.  Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering whole @attrs->cpumask.

As CPUs come up and go down, the pool association is changed
accordingly.  Changing pool association may involve allocating new
pools which may fail.  To avoid failing CPU_DOWN, each workqueue
always keeps a default pwq which covers the whole attrs->cpumask and is
used as a fallback if pool creation fails during a CPU hotplug
operation.

This ensures that all work items issued on a NUMA node are executed on
the same node as long as the workqueue allows execution on the CPUs of
the node.

As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this changes the behavior of max_active.  The limit is now per NUMA
node instead of global.  While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and primarily used as a safety
mechanism rather than for active concurrency control.  Concurrency is
usually limited by workqueue users through the number of concurrently
active work items, so this change shouldn't matter much.

v2: Fixed pwq freeing in apply_workqueue_attrs() error path.  Spotted
by Lai.

v3: The previous version incorrectly made a workqueue spanning
multiple nodes spread work items over all online CPUs when some of
its nodes don't have any desired cpus.  Reimplemented so that NUMA
affinity is properly updated as CPUs go up and down.  This problem
was spotted by Lai Jiangshan.

v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
however, wq may be freed at any time after dfl_pwq is put making
the clearing use-after-free.  Clear wq->dfl_pwq before putting it.

Signed-off-by: Tejun Heo 
Cc: Lai Jiangshan 
---
There was a silly bug in destroy_workqueue().  wq/review-numa branch
updated accordingly.

Thanks.

 kernel/workqueue.c | 281 +
 1 file changed, 262 insertions(+), 19 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index e656931..3b710cd 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -45,6 +45,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "workqueue_internal.h"
 
@@ -245,6 +246,7 @@ struct workqueue_struct {
int saved_max_active; /* WQ: saved pwq max_active */
 
struct workqueue_attrs  *unbound_attrs; /* WQ: only for unbound wqs */
+   struct pool_workqueue   *dfl_pwq;   /* WQ: only for unbound wqs */
 
 #ifdef CONFIG_SYSFS
struct wq_device*wq_dev;/* I: for sysfs interface */
@@ -268,6 +270,9 @@ static cpumask_var_t *wq_numa_possible_cpumask;
 
 static bool wq_numa_enabled;   /* unbound NUMA affinity enabled */
 
+/* buf for wq_update_unbound_numa_attrs(), protected by CPU hotplug exclusion 
*/
+static struct workqueue_attrs *wq_update_unbound_numa_attrs_buf;
+
 static DEFINE_MUTEX(wq_pool_mutex);/* protects pools and workqueues list */
 static DEFINE_SPINLOCK(wq_mayday_lock);/* protects wq->maydays list */
 
@@ -3710,6 +3715,61 @@ static struct pool_workqueue *alloc_unbound_pwq(struct 
workqueue_struct *wq,
return pwq;
 }
 
+/* undo alloc_unbound_pwq(), used only in the error path */
+static void free_unbound_pwq(struct pool_workqueue *pwq)
+{
+   lockdep_assert_held(&wq_pool_mutex);
+
+   if (pwq) {
+   put_unbound_pool(pwq->pool);
+   kfree(pwq);
+   }
+}
+
+/**
+ * wq_calc_node_mask - calculate a wq_attrs' cpumask for the specified node
+ * @attrs: the wq_attrs of interest
+ * @node: the target NUMA node
+ * @cpu_going_down: if >= 0, the CPU to consider as offline
+ * @cpumask: outarg, the resulting cpumask
+ *
+ * Calculate the cpumask a workqueue with @attrs should use on @node.  If
+ * @cpu_going_down is >= 0, that cpu is considered offline during
+ * calculation.  The result is stored in @cpumask.  This function returns
+ * %true if the resulting @cpumask is different from @attrs->cpumask,
+ * %false if equal.
+ *
+ * If NUMA affinity is not enabled, @attrs->cpumask is always used.  If
+ * enabled and @node has online CPUs requested by @attrs, the returned
+ * cpumask is the intersection of the possible CPUs of @node and
+ * @attrs-&

Subject: [PATCHSET v2 wq/for-3.10] workqueue: NUMA affinity for unbound workqueues

2013-03-27 Thread Tejun Heo
Hello,

Changes from the last take[L] are

* Lai pointed out that the previous implementation was broken in that
  if a workqueue spans over multiple nodes and some of the nodes don't
  have any desired online CPUs, work items queued on those nodes would
  be spread across all CPUs violating the configured cpumask.

  The patchset is updated such that apply_workqueue_attrs() now only
  assigns NUMA-affine pwqs to nodes with desired online CPUs and
  default pwq to all other nodes.  To track CPU online state,
  wq_update_unbound_numa_attrs() is added.  The function is called for
  each workqueue during hot[un]plug and updates pwq association
  accordingly.

* Rolled in updated patches from the previous posting.

* More helper routines are factored out such that
  apply_workqueue_attrs() is easier to follow and can share code paths
  with wq_update_unbound_numa_attrs().

* Various cleanups and fixes.

Patchset description from the original posting follows.

There are two types of workqueues - per-cpu and unbound.  The former
is bound to each CPU and the latter isn't bound to any by default.
While the recently added attrs support allows unbound workqueues to be
confined to a subset of CPUs, it still is quite cumbersome for
applications where CPU affinity is too constricted but NUMA locality
still matters.

This patchset tries to solve that issue by automatically making
unbound workqueues affine to NUMA nodes by default.  A work item
queued to an unbound workqueue is executed on one of the CPUs allowed
by the workqueue in the same node.  If there's none allowed, it may be
executed on any cpu allowed by the workqueue.  It doesn't require any
changes on the user side.  Every interface of workqueues functions the
same as before.

This would be most helpful to subsystems which use some form of async
execution to process a significant amount of data - e.g. crypto and
btrfs; however, I wanted to find out whether it would make any dent in
much less favorable use cases.  The following is the total run time in
seconds of building an allmodconfig kernel w/ -j20 on a dual socket
opteron machine with the writeback thread pool converted to an unbound
workqueue and thus made NUMA-affine.  The file system is ext4 on top
of a WD SSD.

        before conversion   after conversion
        1396.126            1394.763
        1397.621            1394.965
        1399.636            1394.738
        1397.463            1398.162
        1395.543            1393.670

AVG     1397.278            1395.260    DIFF  2.018
STDEV      1.585               1.700

And, yes, it actually made things go faster by about 1.2 sigma, which
isn't completely conclusive but is a pretty good indication that it's
actually faster.  Note that this is a workload which is dominated by
CPU time and, while there's writeback going on continuously, it really
isn't touching much data and isn't a dominating factor, so the gain is
understandably small, 0.14%, but hey it's still a gain and it should
be much more interesting for crypto and btrfs which would actually
access the data or workloads which are more sensitive to NUMA
affinity.

The implementation is fairly simple.  After the recent attrs support
changes, a lot of the differences in pwq (pool_workqueue) handling
between unbound and per-cpu workqueues are gone.  An unbound workqueue
still has one "current" pwq that it uses for queueing any new work
items but can handle multiple pwqs perfectly well while they're
draining, so this patchset adds a pwq dispatch table to unbound
workqueues which is indexed by NUMA node and points to the matching
pwq.  Unbound workqueues now simply have multiple "current" pwqs keyed
by NUMA node.
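
In code terms, the dispatch on the queueing path ends up looking like
this (excerpt from patch 0008 below):

	/* pwq which will be used unless @work is executing elsewhere */
	if (!(wq->flags & WQ_UNBOUND))
		pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
	else
		pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));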

NUMA affinity can be turned off system-wide by workqueue.disable_numa
kernel param or per-workqueue using "numa" sysfs file.

This patchset contains the following fourteen patches.

 0001-workqueue-move-pwq_pool_locking-outside-of-get-put_u.patch
 0002-workqueue-add-wq_numa_tbl_len-and-wq_numa_possible_c.patch
 0003-workqueue-drop-H-from-kworker-names-of-unbound-worke.patch
 0004-workqueue-determine-NUMA-node-of-workers-accourding-.patch
 0005-workqueue-add-workqueue-unbound_attrs.patch
 0006-workqueue-make-workqueue-name-fixed-len.patch
 0007-workqueue-move-hot-fields-of-workqueue_struct-to-the.patch
 0008-workqueue-map-an-unbound-workqueues-to-multiple-per-.patch
 0009-workqueue-break-init_and_link_pwq-into-two-functions.patch
 0010-workqueue-use-NUMA-aware-allocation-for-pool_workque.patch
 0011-workqueue-introduce-numa_pwq_tbl_install.patch
 0012-workqueue-introduce-put_pwq_unlocked.patch
 0013-workqueue-implement-NUMA-affinity-for-unbound-workqu.patch
 0014-workqueue-update-sysfs-interface-to-reflect-NUMA-awa.patch

0001-0009 are prep patches.

0010-0013 implement NUMA affinity.

0014 adds control knobs and updates sysfs interface.

This patchset is on top of

  wq/for-3.10 b59276054 ("workqueue: remove 

[PATCH 02/14] workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]

2013-03-27 Thread Tejun Heo
Unbound workqueues are going to be NUMA-affine.  Add wq_numa_tbl_len
and wq_numa_possible_cpumask[] in preparation.  The former is the
highest NUMA node ID + 1 and the latter is masks of possible CPUs for
each NUMA node.

This patch only introduces these.  Future patches will make use of
them.

v2: NUMA initialization moved into wq_numa_init().  Also, the possible
cpumask array is not created if there aren't multiple nodes on the
system.  wq_numa_enabled bool added.

Signed-off-by: Tejun Heo 
---
 kernel/workqueue.c | 46 ++
 1 file changed, 46 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 26771f4e..54b5048 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "workqueue_internal.h"
 
@@ -253,6 +254,12 @@ struct workqueue_struct {
 
 static struct kmem_cache *pwq_cache;
 
+static int wq_numa_tbl_len;/* highest possible NUMA node id + 1 */
+static cpumask_var_t *wq_numa_possible_cpumask;
+   /* possible CPUs of each node */
+
+static bool wq_numa_enabled;   /* unbound NUMA affinity enabled */
+
 static DEFINE_MUTEX(wq_pool_mutex);/* protects pools and workqueues list */
 static DEFINE_SPINLOCK(wq_mayday_lock);/* protects wq->maydays list */
 
@@ -4402,6 +4409,43 @@ out_unlock:
 }
 #endif /* CONFIG_FREEZER */
 
+static void __init wq_numa_init(void)
+{
+   cpumask_var_t *tbl;
+   int node, cpu;
+
+   /* determine NUMA pwq table len - highest node id + 1 */
+   for_each_node(node)
+   wq_numa_tbl_len = max(wq_numa_tbl_len, node + 1);
+
+   if (num_possible_nodes() <= 1)
+   return;
+
+   /*
+* We want masks of possible CPUs of each node which isn't readily
+* available.  Build one from cpu_to_node() which should have been
+* fully initialized by now.
+*/
+   tbl = kzalloc(wq_numa_tbl_len * sizeof(tbl[0]), GFP_KERNEL);
+   BUG_ON(!tbl);
+
+   for_each_node(node)
+   BUG_ON(!alloc_cpumask_var_node(&tbl[node], GFP_KERNEL, node));
+
+   for_each_possible_cpu(cpu) {
+   node = cpu_to_node(cpu);
+   if (WARN_ON(node == NUMA_NO_NODE)) {
+   pr_warn("workqueue: NUMA node mapping not available for 
cpu%d, disabling NUMA support\n", cpu);
+   /* happens iff arch is bonkers, let's just proceed */
+   return;
+   }
+   cpumask_set_cpu(cpu, tbl[node]);
+   }
+
+   wq_numa_possible_cpumask = tbl;
+   wq_numa_enabled = true;
+}
+
 static int __init init_workqueues(void)
 {
int std_nice[NR_STD_WORKER_POOLS] = { 0, HIGHPRI_NICE_LEVEL };
@@ -4418,6 +4462,8 @@ static int __init init_workqueues(void)
cpu_notifier(workqueue_cpu_up_callback, CPU_PRI_WORKQUEUE_UP);
hotcpu_notifier(workqueue_cpu_down_callback, CPU_PRI_WORKQUEUE_DOWN);
 
+   wq_numa_init();
+
/* initialize CPU pools */
for_each_possible_cpu(cpu) {
struct worker_pool *pool;
-- 
1.8.1.4



[PATCH 13/14] workqueue: implement NUMA affinity for unbound workqueues

2013-03-27 Thread Tejun Heo
Currently, an unbound workqueue has a single current, or first, pwq
(pool_workqueue) to which all new work items are queued.  This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.

This patch implements NUMA affinity for unbound workqueues.  Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has online CPUs in
@attrs->cpumask.  Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering whole @attrs->cpumask.

As CPUs come up and go down, the pool association is changed
accordingly.  Changing pool association may involve allocating new
pools which may fail.  To avoid failing CPU_DOWN, each workqueue
always keeps a default pwq which covers the whole attrs->cpumask and is
used as a fallback if pool creation fails during a CPU hotplug
operation.

This ensures that all work items issued on a NUMA node are executed on
the same node as long as the workqueue allows execution on the CPUs of
the node.

As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this changes the behavior of max_active.  The limit is now per NUMA
node instead of global.  While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and primarily used as a safety
mechanism rather than for active concurrency control.  Concurrency is
usually limited by workqueue users through the number of concurrently
active work items, so this change shouldn't matter much.

v2: Fixed pwq freeing in apply_workqueue_attrs() error path.  Spotted
by Lai.

v3: The previous version incorrectly made a workqueue spanning
multiple nodes spread work items over all online CPUs when some of
its nodes don't have any desired cpus.  Reimplemented so that NUMA
affinity is properly updated as CPUs go up and down.  This problem
was spotted by Lai Jiangshan.

Signed-off-by: Tejun Heo 
Cc: Lai Jiangshan 
---
 kernel/workqueue.c | 278 +
 1 file changed, 258 insertions(+), 20 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index e656931..637debe 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -45,6 +45,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "workqueue_internal.h"
 
@@ -245,6 +246,7 @@ struct workqueue_struct {
int saved_max_active; /* WQ: saved pwq max_active */
 
struct workqueue_attrs  *unbound_attrs; /* WQ: only for unbound wqs */
+   struct pool_workqueue   *dfl_pwq;   /* WQ: only for unbound wqs */
 
 #ifdef CONFIG_SYSFS
struct wq_device*wq_dev;/* I: for sysfs interface */
@@ -268,6 +270,9 @@ static cpumask_var_t *wq_numa_possible_cpumask;
 
 static bool wq_numa_enabled;   /* unbound NUMA affinity enabled */
 
+/* buf for wq_update_unbound_numa_attrs(), protected by CPU hotplug exclusion 
*/
+static struct workqueue_attrs *wq_update_unbound_numa_attrs_buf;
+
 static DEFINE_MUTEX(wq_pool_mutex);/* protects pools and workqueues list */
 static DEFINE_SPINLOCK(wq_mayday_lock);/* protects wq->maydays list */
 
@@ -3710,6 +3715,61 @@ static struct pool_workqueue *alloc_unbound_pwq(struct 
workqueue_struct *wq,
return pwq;
 }
 
+/* undo alloc_unbound_pwq(), used only in the error path */
+static void free_unbound_pwq(struct pool_workqueue *pwq)
+{
+   lockdep_assert_held(&wq_pool_mutex);
+
+   if (pwq) {
+   put_unbound_pool(pwq->pool);
+   kfree(pwq);
+   }
+}
+
+/**
+ * wq_calc_node_mask - calculate a wq_attrs' cpumask for the specified node
+ * @attrs: the wq_attrs of interest
+ * @node: the target NUMA node
+ * @cpu_going_down: if >= 0, the CPU to consider as offline
+ * @cpumask: outarg, the resulting cpumask
+ *
+ * Calculate the cpumask a workqueue with @attrs should use on @node.  If
+ * @cpu_going_down is >= 0, that cpu is considered offline during
+ * calculation.  The result is stored in @cpumask.  This function returns
+ * %true if the resulting @cpumask is different from @attrs->cpumask,
+ * %false if equal.
+ *
+ * If NUMA affinity is not enabled, @attrs->cpumask is always used.  If
+ * enabled and @node has online CPUs requested by @attrs, the returned
+ * cpumask is the intersection of the possible CPUs of @node and
+ * @attrs->cpumask.
+ *
+ * The caller is responsible for ensuring that the cpumask of @node stays
+ * stable.
+ */
+static bool wq_calc_node_cpumask(const struct workqueue_attrs *attrs, int node,
+int cpu_going_down, cpumask_t *cpumask)
+{
+   if (!wq_numa_enabled)
+   goto use_dfl;
+
+   /* does @node have any online CPUs @attrs wants? */
+   cpumask_and(cpumask, cpumask_of_node(node), attrs->cpum

[PATCH 12/14] workqueue: introduce put_pwq_unlocked()

2013-03-27 Thread Tejun Heo
Factor out lock pool, put_pwq(), unlock sequence into
put_pwq_unlocked().  The two existing places are converted and there
will be more with NUMA affinity support.

This is to prepare for NUMA affinity support for unbound workqueues
and doesn't introduce any functional difference.

Signed-off-by: Tejun Heo 
---
 kernel/workqueue.c | 36 +++-
 1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 527dc418..e656931 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1057,6 +1057,25 @@ static void put_pwq(struct pool_workqueue *pwq)
schedule_work(&pwq->unbound_release_work);
 }
 
+/**
+ * put_pwq_unlocked - put_pwq() with surrounding pool lock/unlock
+ * @pwq: pool_workqueue to put (can be %NULL)
+ *
+ * put_pwq() with locking.  This function also allows %NULL @pwq.
+ */
+static void put_pwq_unlocked(struct pool_workqueue *pwq)
+{
+   if (pwq) {
+   /*
+* As both pwqs and pools are sched-RCU protected, the
+* following lock operations are safe.
+*/
+   spin_lock_irq(&pwq->pool->lock);
+   put_pwq(pwq);
+   spin_unlock_irq(&pwq->pool->lock);
+   }
+}
+
 static void pwq_activate_delayed_work(struct work_struct *work)
 {
struct pool_workqueue *pwq = get_work_pwq(work);
@@ -3759,12 +3778,7 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
 
mutex_unlock(&wq->mutex);
 
-   if (last_pwq) {
-   spin_lock_irq(&last_pwq->pool->lock);
-   put_pwq(last_pwq);
-   spin_unlock_irq(&last_pwq->pool->lock);
-   }
-
+   put_pwq_unlocked(last_pwq);
return 0;
 
 enomem:
@@ -3975,16 +3989,12 @@ void destroy_workqueue(struct workqueue_struct *wq)
} else {
/*
 * We're the sole accessor of @wq at this point.  Directly
-* access the first pwq and put the base ref.  As both pwqs
-* and pools are sched-RCU protected, the lock operations
-* are safe.  @wq will be freed when the last pwq is
-* released.
+* access the first pwq and put the base ref.  @wq will be
+* freed when the last pwq is released.
 */
pwq = list_first_entry(&wq->pwqs, struct pool_workqueue,
   pwqs_node);
-   spin_lock_irq(&pwq->pool->lock);
-   put_pwq(pwq);
-   spin_unlock_irq(&pwq->pool->lock);
+   put_pwq_unlocked(pwq);
}
 }
 EXPORT_SYMBOL_GPL(destroy_workqueue);
-- 
1.8.1.4



[PATCH 10/14] workqueue: use NUMA-aware allocation for pool_workqueues

2013-03-27 Thread Tejun Heo
Use kmem_cache_alloc_node() with @pool->node instead of
kmem_cache_zalloc() when allocating a pool_workqueue so that it's
allocated on the same node as the associated worker_pool.  As there's
no kmem_cache_zalloc_node(), move zeroing to init_pwq().

This was suggested by Lai Jiangshan.

Signed-off-by: Tejun Heo 
Cc: Lai Jiangshan 
---
 kernel/workqueue.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 58c7663..a4420be 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3626,12 +3626,14 @@ static void pwq_adjust_max_active(struct pool_workqueue 
*pwq)
spin_unlock_irq(&pwq->pool->lock);
 }
 
-/* initialize newly zalloced @pwq which is associated with @wq and @pool */
+/* initialize newly alloced @pwq which is associated with @wq and @pool */
 static void init_pwq(struct pool_workqueue *pwq, struct workqueue_struct *wq,
 struct worker_pool *pool)
 {
BUG_ON((unsigned long)pwq & WORK_STRUCT_FLAG_MASK);
 
+   memset(pwq, 0, sizeof(*pwq));
+
pwq->pool = pool;
pwq->wq = wq;
pwq->flush_color = -1;
@@ -3677,7 +3679,7 @@ static struct pool_workqueue *alloc_unbound_pwq(struct 
workqueue_struct *wq,
if (!pool)
return NULL;
 
-   pwq = kmem_cache_zalloc(pwq_cache, GFP_KERNEL);
+   pwq = kmem_cache_alloc_node(pwq_cache, GFP_KERNEL, pool->node);
if (!pwq) {
put_unbound_pool(pool);
return NULL;
-- 
1.8.1.4



[PATCH 11/14] workqueue: introduce numa_pwq_tbl_install()

2013-03-27 Thread Tejun Heo
Factor out pool_workqueue linking and installation into numa_pwq_tbl[]
from apply_workqueue_attrs() into numa_pwq_tbl_install().  link_pwq()
is made safe to call multiple times.  numa_pwq_tbl_install() links the
pwq, installs it into numa_pwq_tbl[] at the specified node and returns
the old entry.

@last_pwq is removed from link_pwq() as the return value of the new
function can be used instead.

This is to prepare for NUMA affinity support for unbound workqueues.

Signed-off-by: Tejun Heo 
---
 kernel/workqueue.c | 35 ++-
 1 file changed, 26 insertions(+), 9 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index a4420be..527dc418 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3639,24 +3639,26 @@ static void init_pwq(struct pool_workqueue *pwq, struct 
workqueue_struct *wq,
pwq->flush_color = -1;
pwq->refcnt = 1;
INIT_LIST_HEAD(&pwq->delayed_works);
+   INIT_LIST_HEAD(&pwq->pwqs_node);
INIT_LIST_HEAD(&pwq->mayday_node);
INIT_WORK(&pwq->unbound_release_work, pwq_unbound_release_workfn);
 }
 
 /* sync @pwq with the current state of its associated wq and link it */
-static void link_pwq(struct pool_workqueue *pwq,
-struct pool_workqueue **p_last_pwq)
+static void link_pwq(struct pool_workqueue *pwq)
 {
struct workqueue_struct *wq = pwq->wq;
 
lockdep_assert_held(&wq->mutex);
 
+   /* may be called multiple times, ignore if already linked */
+   if (!list_empty(&pwq->pwqs_node))
+   return;
+
/*
 * Set the matching work_color.  This is synchronized with
 * wq->mutex to avoid confusing flush_workqueue().
 */
-   if (p_last_pwq)
-   *p_last_pwq = first_pwq(wq);
pwq->work_color = wq->work_color;
 
/* sync max_active to the current setting */
@@ -3689,6 +3691,23 @@ static struct pool_workqueue *alloc_unbound_pwq(struct 
workqueue_struct *wq,
return pwq;
 }
 
+/* install @pwq into @wq's numa_pwq_tbl[] for @node and return the old pwq */
+static struct pool_workqueue *numa_pwq_tbl_install(struct workqueue_struct *wq,
+  int node,
+  struct pool_workqueue *pwq)
+{
+   struct pool_workqueue *old_pwq;
+
+   lockdep_assert_held(&wq->mutex);
+
+   /* link_pwq() can handle duplicate calls */
+   link_pwq(pwq);
+
+   old_pwq = rcu_access_pointer(wq->numa_pwq_tbl[node]);
+   rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
+   return old_pwq;
+}
+
 /**
  * apply_workqueue_attrs - apply new workqueue_attrs to an unbound workqueue
  * @wq: the target workqueue
@@ -3707,7 +3726,7 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
  const struct workqueue_attrs *attrs)
 {
struct workqueue_attrs *new_attrs;
-   struct pool_workqueue *pwq, *last_pwq;
+   struct pool_workqueue *pwq, *last_pwq = NULL;
int node;
 
/* only unbound workqueues can change attributes */
@@ -3734,11 +3753,9 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
 
mutex_lock(&wq->mutex);
 
-   link_pwq(pwq, &last_pwq);
-
copy_workqueue_attrs(wq->unbound_attrs, new_attrs);
for_each_node(node)
-   rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
+   last_pwq = numa_pwq_tbl_install(wq, node, pwq);
 
mutex_unlock(&wq->mutex);
 
@@ -3774,7 +3791,7 @@ static int alloc_and_link_pwqs(struct workqueue_struct 
*wq)
init_pwq(pwq, wq, &cpu_pools[highpri]);
 
mutex_lock(&wq->mutex);
-   link_pwq(pwq, NULL);
+   link_pwq(pwq);
mutex_unlock(&wq->mutex);
}
return 0;
-- 
1.8.1.4



[PATCH 09/14] workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq()

2013-03-27 Thread Tejun Heo
Break init_and_link_pwq() into init_pwq() and link_pwq() and move
unbound-workqueue specific handling into apply_workqueue_attrs().
Also, factor out unbound pool and pool_workqueue allocation into
alloc_unbound_pwq().

This reorganization is to prepare for NUMA affinity and doesn't
introduce any functional changes.

Signed-off-by: Tejun Heo 
---
 kernel/workqueue.c | 77 ++
 1 file changed, 49 insertions(+), 28 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 5b53705..58c7663 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3626,13 +3626,10 @@ static void pwq_adjust_max_active(struct pool_workqueue 
*pwq)
spin_unlock_irq(&pwq->pool->lock);
 }
 
-static void init_and_link_pwq(struct pool_workqueue *pwq,
- struct workqueue_struct *wq,
- struct worker_pool *pool,
- struct pool_workqueue **p_last_pwq)
+/* initialize newly zalloced @pwq which is associated with @wq and @pool */
+static void init_pwq(struct pool_workqueue *pwq, struct workqueue_struct *wq,
+struct worker_pool *pool)
 {
-   int node;
-
BUG_ON((unsigned long)pwq & WORK_STRUCT_FLAG_MASK);
 
pwq->pool = pool;
@@ -3642,8 +3639,15 @@ static void init_and_link_pwq(struct pool_workqueue *pwq,
INIT_LIST_HEAD(&pwq->delayed_works);
INIT_LIST_HEAD(&pwq->mayday_node);
INIT_WORK(&pwq->unbound_release_work, pwq_unbound_release_workfn);
+}
 
-   mutex_lock(&wq->mutex);
+/* sync @pwq with the current state of its associated wq and link it */
+static void link_pwq(struct pool_workqueue *pwq,
+struct pool_workqueue **p_last_pwq)
+{
+   struct workqueue_struct *wq = pwq->wq;
+
+   lockdep_assert_held(&wq->mutex);
 
/*
 * Set the matching work_color.  This is synchronized with
@@ -3658,14 +3662,29 @@ static void init_and_link_pwq(struct pool_workqueue 
*pwq,
 
/* link in @pwq */
list_add_rcu(&pwq->pwqs_node, &wq->pwqs);
+}
 
-   if (wq->flags & WQ_UNBOUND) {
-   copy_workqueue_attrs(wq->unbound_attrs, pool->attrs);
-   for_each_node(node)
-   rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
+/* obtain a pool matching @attr and create a pwq associating the pool and @wq 
*/
+static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
+   const struct workqueue_attrs *attrs)
+{
+   struct worker_pool *pool;
+   struct pool_workqueue *pwq;
+
+   lockdep_assert_held(&wq_pool_mutex);
+
+   pool = get_unbound_pool(attrs);
+   if (!pool)
+   return NULL;
+
+   pwq = kmem_cache_zalloc(pwq_cache, GFP_KERNEL);
+   if (!pwq) {
+   put_unbound_pool(pool);
+   return NULL;
}
 
-   mutex_unlock(&wq->mutex);
+   init_pwq(pwq, wq, pool);
+   return pwq;
 }
 
 /**
@@ -3686,8 +3705,8 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
  const struct workqueue_attrs *attrs)
 {
struct workqueue_attrs *new_attrs;
-   struct pool_workqueue *pwq = NULL, *last_pwq;
-   struct worker_pool *pool;
+   struct pool_workqueue *pwq, *last_pwq;
+   int node;
 
/* only unbound workqueues can change attributes */
if (WARN_ON(!(wq->flags & WQ_UNBOUND)))
@@ -3706,22 +3725,21 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
cpumask_and(new_attrs->cpumask, new_attrs->cpumask, cpu_possible_mask);
 
mutex_lock(&wq_pool_mutex);
-
-   pwq = kmem_cache_zalloc(pwq_cache, GFP_KERNEL);
-   if (!pwq) {
-   mutex_unlock(&wq_pool_mutex);
+   pwq = alloc_unbound_pwq(wq, new_attrs);
+   mutex_unlock(&wq_pool_mutex);
+   if (!pwq)
goto enomem;
-   }
 
-   pool = get_unbound_pool(new_attrs);
-   if (!pool) {
-   mutex_unlock(&wq_pool_mutex);
-   goto enomem;
-   }
+   mutex_lock(&wq->mutex);
 
-   mutex_unlock(&wq_pool_mutex);
+   link_pwq(pwq, &last_pwq);
+
+   copy_workqueue_attrs(wq->unbound_attrs, new_attrs);
+   for_each_node(node)
+   rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
+
+   mutex_unlock(&wq->mutex);
 
-   init_and_link_pwq(pwq, wq, pool, &last_pwq);
if (last_pwq) {
spin_lock_irq(&last_pwq->pool->lock);
put_pwq(last_pwq);
@@ -3731,7 +3749,6 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
return 0;
 
 enomem:
-   kmem_cache_free(pwq_cache, pwq);
free_workqueue_attrs(new_attrs);
return -ENOMEM;
 }
@@ -3752,7 +3769,11 @@ static

[PATCH 07/14] workqueue: move hot fields of workqueue_struct to the end

2013-03-27 Thread Tejun Heo
Move wq->flags and ->cpu_pwqs to the end of workqueue_struct and align
them to the cacheline.  These two fields are used in the work item
issue path and thus hot.  The scheduled NUMA affinity support will add a
dispatch table at the end of workqueue_struct, and relocating these two
fields will allow us to hit only a single cacheline on hot paths.

Note that wq->pwqs isn't moved although it currently is being used in
the work item issue path for unbound workqueues.  The dispatch table
mentioned above will replace its use in the issue path, so it will
become cold once NUMA support is implemented.

Signed-off-by: Tejun Heo 
---
 kernel/workqueue.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 23c9099..9297ea3 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -227,8 +227,6 @@ struct wq_device;
  * the appropriate worker_pool through its pool_workqueues.
  */
 struct workqueue_struct {
-   unsigned intflags;  /* WQ: WQ_* flags */
-   struct pool_workqueue __percpu *cpu_pwqs; /* I: per-cpu pwq's */
struct list_headpwqs;   /* WR: all pwqs of this wq */
struct list_headlist;   /* PL: list of all workqueues */
 
@@ -255,6 +253,10 @@ struct workqueue_struct {
struct lockdep_map  lockdep_map;
 #endif
charname[WQ_NAME_LEN]; /* I: workqueue name */
+
+   /* hot fields used during command issue, aligned to cacheline */
+   unsigned intflags cacheline_aligned; /* WQ: WQ_* flags 
*/
+   struct pool_workqueue __percpu *cpu_pwqs; /* I: per-cpu pwqs */
 };
 
 static struct kmem_cache *pwq_cache;
-- 
1.8.1.4



[PATCH 08/14] workqueue: map an unbound workqueues to multiple per-node pool_workqueues

2013-03-27 Thread Tejun Heo
Currently, an unbound workqueue has only one "current" pool_workqueue
associated with it.  It may have multiple pool_workqueues but only the
first pool_workqueue serves new work items.  For NUMA affinity, we
want to change this so that there are multiple current pool_workqueues
serving different NUMA nodes.

Introduce workqueue->numa_pwq_tbl[] which is indexed by NUMA node and
points to the pool_workqueue to use for each possible node.  This
replaces first_pwq() in __queue_work() and workqueue_congested().

numa_pwq_tbl[] is currently initialized to point to the same
pool_workqueue as first_pwq() so this patch doesn't make any behavior
changes.

v2: Use rcu_dereference_raw() in unbound_pwq_by_node() as the function
may be called only with wq->mutex held.

Signed-off-by: Tejun Heo 
---
 kernel/workqueue.c | 48 +---
 1 file changed, 37 insertions(+), 11 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9297ea3..5b53705 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -257,6 +257,7 @@ struct workqueue_struct {
/* hot fields used during command issue, aligned to cacheline */
unsigned intflags cacheline_aligned; /* WQ: WQ_* flags 
*/
struct pool_workqueue __percpu *cpu_pwqs; /* I: per-cpu pwqs */
+   struct pool_workqueue __rcu *numa_pwq_tbl[]; /* FR: unbound pwqs 
indexed by node */
 };
 
 static struct kmem_cache *pwq_cache;
@@ -525,6 +526,22 @@ static struct pool_workqueue *first_pwq(struct 
workqueue_struct *wq)
  pwqs_node);
 }
 
+/**
+ * unbound_pwq_by_node - return the unbound pool_workqueue for the given node
+ * @wq: the target workqueue
+ * @node: the node ID
+ *
+ * This must be called either with pwq_lock held or sched RCU read locked.
+ * If the pwq needs to be used beyond the locking in effect, the caller is
+ * responsible for guaranteeing that the pwq stays online.
+ */
+static struct pool_workqueue *unbound_pwq_by_node(struct workqueue_struct *wq,
+ int node)
+{
+   assert_rcu_or_wq_mutex(wq);
+   return rcu_dereference_raw(wq->numa_pwq_tbl[node]);
+}
+
 static unsigned int work_color_to_flags(int color)
 {
return color << WORK_STRUCT_COLOR_SHIFT;
@@ -1278,14 +1295,14 @@ static void __queue_work(int cpu, struct 
workqueue_struct *wq,
WARN_ON_ONCE(!is_chained_work(wq)))
return;
 retry:
+   if (req_cpu == WORK_CPU_UNBOUND)
+   cpu = raw_smp_processor_id();
+
/* pwq which will be used unless @work is executing elsewhere */
-   if (!(wq->flags & WQ_UNBOUND)) {
-   if (cpu == WORK_CPU_UNBOUND)
-   cpu = raw_smp_processor_id();
+   if (!(wq->flags & WQ_UNBOUND))
pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
-   } else {
-   pwq = first_pwq(wq);
-   }
+   else
+   pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
 
/*
 * If @work was previously on a different pool, it might still be
@@ -1315,8 +1332,8 @@ retry:
 * pwq is determined and locked.  For unbound pools, we could have
 * raced with pwq release and it could already be dead.  If its
 * refcnt is zero, repeat pwq selection.  Note that pwqs never die
-* without another pwq replacing it as the first pwq or while a
-* work item is executing on it, so the retying is guaranteed to
+* without another pwq replacing it in the numa_pwq_tbl or while
+* work items are executing on it, so the retrying is guaranteed to
 * make forward-progress.
 */
if (unlikely(!pwq->refcnt)) {
@@ -3614,6 +3631,8 @@ static void init_and_link_pwq(struct pool_workqueue *pwq,
  struct worker_pool *pool,
  struct pool_workqueue **p_last_pwq)
 {
+   int node;
+
BUG_ON((unsigned long)pwq & WORK_STRUCT_FLAG_MASK);
 
pwq->pool = pool;
@@ -3640,8 +3659,11 @@ static void init_and_link_pwq(struct pool_workqueue *pwq,
/* link in @pwq */
list_add_rcu(&pwq->pwqs_node, &wq->pwqs);
 
-   if (wq->flags & WQ_UNBOUND)
+   if (wq->flags & WQ_UNBOUND) {
copy_workqueue_attrs(wq->unbound_attrs, pool->attrs);
+   for_each_node(node)
+   rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
+   }
 
mutex_unlock(&wq->mutex);
 }
@@ -3756,12 +3778,16 @@ struct workqueue_struct *__alloc_workqueue_key(const 
char *fmt,
   struct lock_class_key *key,
   const char *lock_name, ...)
 {
+   size_t tbl_size = 0;
va_list args;
struct workqueue_struct *wq;
struct pool_workqueue *pwq;
 
 

[PATCH 06/14] workqueue: make workqueue->name[] fixed len

2013-03-27 Thread Tejun Heo
Currently workqueue->name[] is of flexible length.  We want to use the
flexible field for something more useful and there isn't much benefit
in allowing arbitrary name length anyway.  Make it a fixed length,
capped at 24 bytes.

Signed-off-by: Tejun Heo 
---
 kernel/workqueue.c | 19 ---
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 65f200c..23c9099 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -101,6 +101,8 @@ enum {
 */
RESCUER_NICE_LEVEL  = -20,
HIGHPRI_NICE_LEVEL  = -20,
+
+   WQ_NAME_LEN = 24,
 };
 
 /*
@@ -252,7 +254,7 @@ struct workqueue_struct {
 #ifdef CONFIG_LOCKDEP
struct lockdep_map  lockdep_map;
 #endif
-   charname[]; /* I: workqueue name */
+   charname[WQ_NAME_LEN]; /* I: workqueue name */
 };
 
 static struct kmem_cache *pwq_cache;
@@ -3752,17 +3754,12 @@ struct workqueue_struct *__alloc_workqueue_key(const 
char *fmt,
   struct lock_class_key *key,
   const char *lock_name, ...)
 {
-   va_list args, args1;
+   va_list args;
struct workqueue_struct *wq;
struct pool_workqueue *pwq;
-   size_t namelen;
-
-   /* determine namelen, allocate wq and format name */
-   va_start(args, lock_name);
-   va_copy(args1, args);
-   namelen = vsnprintf(NULL, 0, fmt, args) + 1;
 
-   wq = kzalloc(sizeof(*wq) + namelen, GFP_KERNEL);
+   /* allocate wq and format name */
+   wq = kzalloc(sizeof(*wq), GFP_KERNEL);
if (!wq)
return NULL;
 
@@ -3772,9 +3769,9 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
goto err_free_wq;
}
 
-   vsnprintf(wq->name, namelen, fmt, args1);
+   va_start(args, lock_name);
+   vsnprintf(wq->name, sizeof(wq->name), fmt, args);
va_end(args);
-   va_end(args1);
 
max_active = max_active ?: WQ_DFL_ACTIVE;
max_active = wq_clamp_max_active(max_active, flags, wq->name);
-- 
1.8.1.4



[PATCH 05/14] workqueue: add workqueue->unbound_attrs

2013-03-27 Thread Tejun Heo
Currently, when exposing attrs of an unbound workqueue via sysfs, the
workqueue_attrs of first_pwq() is used as that should equal the
current state of the workqueue.

The planned NUMA affinity support will make unbound workqueues make
use of multiple pool_workqueues for different NUMA nodes and the above
assumption will no longer hold.  Introduce workqueue->unbound_attrs
which records the current attrs in effect and use it for sysfs instead
of first_pwq()->attrs.

Signed-off-by: Tejun Heo 
---
 kernel/workqueue.c | 36 
 1 file changed, 24 insertions(+), 12 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index fab2630..65f200c 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -244,6 +244,8 @@ struct workqueue_struct {
int nr_drainers;/* WQ: drain in progress */
int saved_max_active; /* WQ: saved pwq max_active */
 
+   struct workqueue_attrs  *unbound_attrs; /* WQ: only for unbound wqs */
+
 #ifdef CONFIG_SYSFS
struct wq_device*wq_dev;/* I: for sysfs interface */
 #endif
@@ -3088,10 +3090,9 @@ static ssize_t wq_nice_show(struct device *dev, struct device_attribute *attr,
struct workqueue_struct *wq = dev_to_wq(dev);
int written;
 
-   rcu_read_lock_sched();
-   written = scnprintf(buf, PAGE_SIZE, "%d\n",
-   first_pwq(wq)->pool->attrs->nice);
-   rcu_read_unlock_sched();
+   mutex_lock(&wq->mutex);
+   written = scnprintf(buf, PAGE_SIZE, "%d\n", wq->unbound_attrs->nice);
+   mutex_unlock(&wq->mutex);
 
return written;
 }
@@ -3105,9 +3106,9 @@ static struct workqueue_attrs *wq_sysfs_prep_attrs(struct workqueue_struct *wq)
if (!attrs)
return NULL;
 
-   rcu_read_lock_sched();
-   copy_workqueue_attrs(attrs, first_pwq(wq)->pool->attrs);
-   rcu_read_unlock_sched();
+   mutex_lock(&wq->mutex);
+   copy_workqueue_attrs(attrs, wq->unbound_attrs);
+   mutex_unlock(&wq->mutex);
return attrs;
 }
 
@@ -3138,10 +3139,9 @@ static ssize_t wq_cpumask_show(struct device *dev,
struct workqueue_struct *wq = dev_to_wq(dev);
int written;
 
-   rcu_read_lock_sched();
-   written = cpumask_scnprintf(buf, PAGE_SIZE,
-   first_pwq(wq)->pool->attrs->cpumask);
-   rcu_read_unlock_sched();
+   mutex_lock(&wq->mutex);
+   written = cpumask_scnprintf(buf, PAGE_SIZE, wq->unbound_attrs->cpumask);
+   mutex_unlock(&wq->mutex);
 
written += scnprintf(buf + written, PAGE_SIZE - written, "\n");
return written;
@@ -3558,8 +3558,10 @@ static void pwq_unbound_release_workfn(struct work_struct *work)
 * If we're the last pwq going away, @wq is already dead and no one
 * is gonna access it anymore.  Free it.
 */
-   if (is_last)
+   if (is_last) {
+   free_workqueue_attrs(wq->unbound_attrs);
kfree(wq);
+   }
 }
 
 /**
@@ -3634,6 +3636,9 @@ static void init_and_link_pwq(struct pool_workqueue *pwq,
/* link in @pwq */
list_add_rcu(&pwq->pwqs_node, &wq->pwqs);
 
+   if (wq->flags & WQ_UNBOUND)
+   copy_workqueue_attrs(wq->unbound_attrs, pool->attrs);
+
mutex_unlock(&wq->mutex);
 }
 
@@ -3761,6 +3766,12 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
if (!wq)
return NULL;
 
+   if (flags & WQ_UNBOUND) {
+   wq->unbound_attrs = alloc_workqueue_attrs(GFP_KERNEL);
+   if (!wq->unbound_attrs)
+   goto err_free_wq;
+   }
+
vsnprintf(wq->name, namelen, fmt, args1);
va_end(args);
va_end(args1);
@@ -3830,6 +3841,7 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
return wq;
 
 err_free_wq:
+   free_workqueue_attrs(wq->unbound_attrs);
kfree(wq);
return NULL;
 err_destroy:
-- 
1.8.1.4



[PATCH 04/14] workqueue: determine NUMA node of workers according to the allowed cpumask

2013-03-27 Thread Tejun Heo
When worker tasks are created using kthread_create_on_node(),
currently only per-cpu ones have the matching NUMA node specified.
All unbound workers are always created with NUMA_NO_NODE.

Now that an unbound worker pool may have an arbitrary cpumask
associated with it, this isn't optimal.  Add pool->node which is
determined by the pool's cpumask.  If the pool's cpumask is contained
inside a NUMA node proper, the pool is associated with that node, and
all workers of the pool are created on that node.

This currently only makes a difference for unbound worker pools with
a cpumask contained inside a single NUMA node, but this will serve as
foundation for making all unbound pools NUMA-affine.
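
For example, with node0 = {cpu0, cpu1} and node1 = {cpu2, cpu3}, a pool
whose cpumask is {cpu2} or {cpu2, cpu3} gets pool->node set to node1 and
its workers are created there, while a pool spanning {cpu1, cpu2} keeps
NUMA_NO_NODE.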

Signed-off-by: Tejun Heo 
---
 kernel/workqueue.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 57ced49..fab2630 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -138,6 +138,7 @@ enum {
 struct worker_pool {
spinlock_t  lock;   /* the pool lock */
int cpu;/* I: the associated cpu */
+   int node;   /* I: the associated node ID */
int id; /* I: pool ID */
unsigned intflags;  /* X: flags */
 
@@ -1645,7 +1646,6 @@ static struct worker *alloc_worker(void)
 static struct worker *create_worker(struct worker_pool *pool)
 {
struct worker *worker = NULL;
-   int node = pool->cpu >= 0 ? cpu_to_node(pool->cpu) : NUMA_NO_NODE;
int id = -1;
char id_buf[16];
 
@@ -1678,7 +1678,7 @@ static struct worker *create_worker(struct worker_pool *pool)
else
snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);
 
-   worker->task = kthread_create_on_node(worker_thread, worker, node,
+   worker->task = kthread_create_on_node(worker_thread, worker, pool->node,
  "kworker/%s", id_buf);
if (IS_ERR(worker->task))
goto fail;
@@ -3360,6 +3360,7 @@ static int init_worker_pool(struct worker_pool *pool)
spin_lock_init(&pool->lock);
pool->id = -1;
pool->cpu = -1;
+   pool->node = NUMA_NO_NODE;
pool->flags |= POOL_DISASSOCIATED;
INIT_LIST_HEAD(&pool->worklist);
INIT_LIST_HEAD(&pool->idle_list);
@@ -3465,6 +3466,7 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
 {
u32 hash = wqattrs_hash(attrs);
struct worker_pool *pool;
+   int node;
 
lockdep_assert_held(&wq_pool_mutex);
 
@@ -3487,6 +3489,17 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
lockdep_set_subclass(&pool->lock, 1);   /* see put_pwq() */
copy_workqueue_attrs(pool->attrs, attrs);
 
+   /* if cpumask is contained inside a NUMA node, we belong to that node */
+   if (wq_numa_enabled) {
+   for_each_node(node) {
+   if (cpumask_subset(pool->attrs->cpumask,
+  wq_numa_possible_cpumask[node])) {
+   pool->node = node;
+   break;
+   }
+   }
+   }
+
if (worker_pool_assign_id(pool) < 0)
goto fail;
 
@@ -4475,6 +4488,7 @@ static int __init init_workqueues(void)
pool->cpu = cpu;
cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
pool->attrs->nice = std_nice[i++];
+   pool->node = cpu_to_node(cpu);
 
/* alloc pool ID */
mutex_lock(&wq_pool_mutex);
-- 
1.8.1.4



[PATCH 03/14] workqueue: drop 'H' from kworker names of unbound worker pools

2013-03-27 Thread Tejun Heo
Currently, all workqueue workers which have a negative nice value have
'H' postfixed to their names.  This is necessary for per-cpu workers
as they use the CPU number instead of pool->id to identify the pool
and the 'H' postfix is the only thing distinguishing normal and
highpri workers.

As workers for unbound pools use pool->id, the 'H' postfix is purely
informational.  TASK_COMM_LEN is 16 and after the static part and
delimiters, there are only five characters left for the pool and
worker IDs.  We're expecting to have more unbound pools with the
scheduled NUMA awareness support.  Let's drop the non-essential 'H'
postfix from unbound kworker name.
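
For example:

        kworker/3:0H  - per-cpu pool of CPU 3, highpri (keeps the 'H')
        kworker/u9:2  - unbound pool with pool->id 9, worker id 2 (no 'H')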

While at it, restructure kthread_create*() invocation to help future
NUMA related changes.

Signed-off-by: Tejun Heo 
---
 kernel/workqueue.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 54b5048..57ced49 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1644,9 +1644,10 @@ static struct worker *alloc_worker(void)
  */
 static struct worker *create_worker(struct worker_pool *pool)
 {
-   const char *pri = pool->attrs->nice < 0  ? "H" : "";
struct worker *worker = NULL;
+   int node = pool->cpu >= 0 ? cpu_to_node(pool->cpu) : NUMA_NO_NODE;
int id = -1;
+   char id_buf[16];
 
lockdep_assert_held(&pool->manager_mutex);
 
@@ -1672,13 +1673,13 @@ static struct worker *create_worker(struct worker_pool *pool)
worker->id = id;
 
if (pool->cpu >= 0)
-   worker->task = kthread_create_on_node(worker_thread,
-   worker, cpu_to_node(pool->cpu),
-   "kworker/%d:%d%s", pool->cpu, id, pri);
+   snprintf(id_buf, sizeof(id_buf), "%d:%d%s", pool->cpu, id,
+pool->attrs->nice < 0  ? "H" : "");
else
-   worker->task = kthread_create(worker_thread, worker,
- "kworker/u%d:%d%s",
- pool->id, id, pri);
+   snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);
+
+   worker->task = kthread_create_on_node(worker_thread, worker, node,
+ "kworker/%s", id_buf);
if (IS_ERR(worker->task))
goto fail;
 
-- 
1.8.1.4



[PATCH 01/14] workqueue: move pwq_pool_locking outside of get/put_unbound_pool()

2013-03-27 Thread Tejun Heo
The scheduled NUMA affinity support for unbound workqueues would need
to walk the workqueues list and perform pool-related operations on each workqueue.

Move wq_pool_mutex locking out of get/put_unbound_pool() to their
callers so that pool operations can be performed while walking the
workqueues list, which is also protected by wq_pool_mutex.
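
IOW, callers now wrap the calls themselves, e.g. (sketch):

        mutex_lock(&wq_pool_mutex);
        pool = get_unbound_pool(new_attrs);
        mutex_unlock(&wq_pool_mutex);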

Signed-off-by: Tejun Heo 
---
 kernel/workqueue.c | 36 ++--
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index abe1f0d..26771f4e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3395,31 +3395,28 @@ static void rcu_free_pool(struct rcu_head *rcu)
  * safe manner.  get_unbound_pool() calls this function on its failure path
  * and this function should be able to release pools which went through,
  * successfully or not, init_worker_pool().
+ *
+ * Should be called with wq_pool_mutex held.
  */
 static void put_unbound_pool(struct worker_pool *pool)
 {
struct worker *worker;
 
-   mutex_lock(&wq_pool_mutex);
-   if (--pool->refcnt) {
-   mutex_unlock(&wq_pool_mutex);
+   lockdep_assert_held(&wq_pool_mutex);
+
+   if (--pool->refcnt)
return;
-   }
 
/* sanity checks */
if (WARN_ON(!(pool->flags & POOL_DISASSOCIATED)) ||
-   WARN_ON(!list_empty(&pool->worklist))) {
-   mutex_unlock(&wq_pool_mutex);
+   WARN_ON(!list_empty(&pool->worklist)))
return;
-   }
 
/* release id and unhash */
if (pool->id >= 0)
idr_remove(&worker_pool_idr, pool->id);
hash_del(&pool->hash_node);
 
-   mutex_unlock(&wq_pool_mutex);
-
/*
 * Become the manager and destroy all workers.  Grabbing
 * manager_arb prevents @pool's workers from blocking on
@@ -3453,13 +3450,15 @@ static void put_unbound_pool(struct worker_pool *pool)
  * reference count and return it.  If there already is a matching
  * worker_pool, it will be used; otherwise, this function attempts to
  * create a new one.  On failure, returns NULL.
+ *
+ * Should be called with wq_pool_mutex held.
  */
 static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
 {
u32 hash = wqattrs_hash(attrs);
struct worker_pool *pool;
 
-   mutex_lock(&wq_pool_mutex);
+   lockdep_assert_held(&wq_pool_mutex);
 
/* do we already have a matching pool? */
hash_for_each_possible(unbound_pool_hash, pool, hash_node, hash) {
@@ -3490,10 +3489,8 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
/* install */
hash_add(unbound_pool_hash, &pool->hash_node, hash);
 out_unlock:
-   mutex_unlock(&wq_pool_mutex);
return pool;
 fail:
-   mutex_unlock(&wq_pool_mutex);
if (pool)
put_unbound_pool(pool);
return NULL;
@@ -3530,7 +3527,10 @@ static void pwq_unbound_release_workfn(struct work_struct *work)
is_last = list_empty(&wq->pwqs);
mutex_unlock(&wq->mutex);
 
+   mutex_lock(&wq_pool_mutex);
put_unbound_pool(pool);
+   mutex_unlock(&wq_pool_mutex);
+
call_rcu_sched(&pwq->rcu, rcu_free_pwq);
 
/*
@@ -3653,13 +3653,21 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
copy_workqueue_attrs(new_attrs, attrs);
cpumask_and(new_attrs->cpumask, new_attrs->cpumask, cpu_possible_mask);
 
+   mutex_lock(&wq_pool_mutex);
+
pwq = kmem_cache_zalloc(pwq_cache, GFP_KERNEL);
-   if (!pwq)
+   if (!pwq) {
+   mutex_unlock(&wq_pool_mutex);
goto enomem;
+   }
 
pool = get_unbound_pool(new_attrs);
-   if (!pool)
+   if (!pool) {
+   mutex_unlock(&wq_pool_mutex);
goto enomem;
+   }
+
+   mutex_unlock(&wq_pool_mutex);
 
init_and_link_pwq(pwq, wq, pool, &last_pwq);
if (last_pwq) {
-- 
1.8.1.4



[PATCH 14/14] workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity

2013-03-27 Thread Tejun Heo
Unbound workqueues are now NUMA aware.  Let's add some control knobs
and update sysfs interface accordingly.

* Add kernel param workqueue.disable_numa which disables NUMA affinity
  globally.

* Replace sysfs file "pool_id" with "pool_ids" which contain
  node:pool_id pairs.  This change is userland-visible but "pool_id"
  hasn't seen a release yet, so this is okay.

* Add a new sysfs file "numa" which can toggle NUMA affinity on
  individual workqueues.  This is implemented as attrs->no_numa which
  is special in that it isn't part of a pool's attributes.  It only
  affects how apply_workqueue_attrs() picks which pools to use.

After "pool_ids" change, first_pwq() doesn't have any user left.
Removed.
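
The body of the new "pool_ids" file boils down to the following loop
(sketch; locking and declarations trimmed), printing one node:pool_id
pair per node, e.g. "0:4 1:7":

        for_each_node(node) {
                written += scnprintf(buf + written, PAGE_SIZE - written,
                                     "%s%d:%d", delim, node,
                                     unbound_pwq_by_node(wq, node)->pool->id);
                delim = " ";
        }
        written += scnprintf(buf + written, PAGE_SIZE - written, "\n");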

Signed-off-by: Tejun Heo 
---
 Documentation/kernel-parameters.txt |  9 
 include/linux/workqueue.h   |  5 +++
 kernel/workqueue.c  | 82 ++---
 3 files changed, 73 insertions(+), 23 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 4609e81..c75ea0b 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3222,6 +3222,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
or other driver-specific files in the
Documentation/watchdog/ directory.
 
+   workqueue.disable_numa
+   By default, all work items queued to unbound
+   workqueues are affine to the NUMA nodes they're
+   issued on, which results in better behavior in
+   general.  If NUMA affinity needs to be disabled for
+   whatever reason, this option can be used.  Note
+   that this also can be controlled per-workqueue for
+   workqueues visible under /sys/bus/workqueue/.
+
x2apic_phys [X86-64,APIC] Use x2apic physical mode instead of
default x2apic cluster mode on platforms
supporting x2apic.
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 835d12b..7179756 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -119,10 +119,15 @@ struct delayed_work {
 /*
  * A struct for workqueue attributes.  This can be used to change
  * attributes of an unbound workqueue.
+ *
+ * Unlike other fields, ->no_numa isn't a property of a worker_pool.  It
+ * only modifies how apply_workqueue_attrs() select pools and thus doesn't
+ * participate in pool hash calculations or equality comparisons.
  */
 struct workqueue_attrs {
int nice;   /* nice level */
cpumask_var_t   cpumask;/* allowed CPUs */
+   boolno_numa;/* disable NUMA affinity */
 };
 
 static inline struct delayed_work *to_delayed_work(struct work_struct *work)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 637debe..0b6a3b0 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -268,6 +268,9 @@ static int wq_numa_tbl_len; /* highest possible NUMA node id + 1 */
 static cpumask_var_t *wq_numa_possible_cpumask;
/* possible CPUs of each node */
 
+static bool wq_disable_numa;
+module_param_named(disable_numa, wq_disable_numa, bool, 0444);
+
 static bool wq_numa_enabled;   /* unbound NUMA affinity enabled */
 
 /* buf for wq_update_unbound_numa_attrs(), protected by CPU hotplug exclusion */
@@ -517,21 +520,6 @@ static int worker_pool_assign_id(struct worker_pool *pool)
 }
 
 /**
- * first_pwq - return the first pool_workqueue of the specified workqueue
- * @wq: the target workqueue
- *
- * This must be called either with wq->mutex held or sched RCU read locked.
- * If the pwq needs to be used beyond the locking in effect, the caller is
- * responsible for guaranteeing that the pwq stays online.
- */
-static struct pool_workqueue *first_pwq(struct workqueue_struct *wq)
-{
-   assert_rcu_or_wq_mutex(wq);
-   return list_first_or_null_rcu(&wq->pwqs, struct pool_workqueue,
- pwqs_node);
-}
-
-/**
  * unbound_pwq_by_node - return the unbound pool_workqueue for the given node
  * @wq: the target workqueue
  * @node: the node ID
@@ -3114,16 +3102,21 @@ static struct device_attribute wq_sysfs_attrs[] = {
__ATTR_NULL,
 };
 
-static ssize_t wq_pool_id_show(struct device *dev,
-  struct device_attribute *attr, char *buf)
+static ssize_t wq_pool_ids_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
 {
struct workqueue_struct *wq = dev_to_wq(dev);
-   struct worker_pool *pool;
-   int written;
+   const char *delim = "";
+   int

Re: [PATCHSET wq/for-3.10] workqueue: NUMA affinity for unbound workqueues

2013-03-25 Thread Tejun Heo
On Mon, Mar 25, 2013 at 12:15:00PM -0700, Tejun Heo wrote:
> On Sun, Mar 24, 2013 at 11:55:06AM -0700, Tejun Heo wrote:
> > > the whole patchset is more complicated than my brain.
> > 
> > It isn't that complex, is it?  I mean, the difficult part - using
> > multiple pwqs on unbound wq - already happened, and even that wasn't
> > too complex as it in most part synchronized the behaviors between
> > per-cpu and unbound workqueues.  All that NUMA support is doing is
> > mapping different pwqs to different issuing CPUs.
> 
> Oh, BTW, please feel free to rebase your patchset on top of the
> current wq/for-3.10.  I'll take care of the conflicts with the numa
> one, if there are any.

Sorry, never mind.  I'm cherry picking wq->mutex patches.  Will apply
the rebased versions in a couple hours.  Let's base everything else on
top of those.

Thanks.

-- 
tejun


Re: [PATCHSET wq/for-3.10] workqueue: NUMA affinity for unbound workqueues

2013-03-25 Thread Tejun Heo
On Sun, Mar 24, 2013 at 11:55:06AM -0700, Tejun Heo wrote:
> > the whole patchset is more complicated than my brain.
> 
> It isn't that complex, is it?  I mean, the difficult part - using
> multiple pwqs on unbound wq - already happened, and even that wasn't
> too complex as it in most part synchronized the behaviors between
> per-cpu and unbound workqueues.  All that NUMA support is doing is
> mapping different pwqs to different issuing CPUs.

Oh, BTW, please feel free to rebase your patchset on top of the
current wq/for-3.10.  I'll take care of the conflicts with the numa
one, if there are any.

Thanks.

-- 
tejun


Re: [PATCHSET wq/for-3.10] workqueue: NUMA affinity for unbound workqueues

2013-03-24 Thread Tejun Heo
Hey, Lai.

On Mon, Mar 25, 2013 at 12:04:19AM +0800, Lai Jiangshan wrote:
> example:
> node0(cpu0,cpu1),node1(cpu2,cpu3),
> wq's cpumask: 1,3
> the pwq of this wq on the node1's cpumask: 3
> current online cpu: 0-2.
> so the cpumask of worker tasks of the pwq on node1 is actually cpu_all_mask.
> so the work scheduled from cpu2 can be executed on cpu0 or cpu2.
> we expect it is executed on cpu1 only.

Ah, right.  I was hoping to avoid doing pwq swaps on CPU hot[un]plugs
by doing everything on the possible mask.  Looks like we'll need some
massaging after all.

> It can be fixed by swapping pwqs(node's pwq <-> default pwq)
> when cpuhotplug. But could you reschedule this patchset to wq/for-3.11?

We're still only at -rc4, meaning we still have plenty of time to
resolve whatever issues come up, so I think I'm still gonna
target 3.10.

> the whole patchset is more complicated than my brain.

It isn't that complex, is it?  I mean, the difficult part - using
multiple pwqs on unbound wq - already happened, and even that wasn't
too complex as it in most part synchronized the behaviors between
per-cpu and unbound workqueues.  All that NUMA support is doing is
mapping different pwqs to different issuing CPUs.
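
IOW, with numa_pwq_tbl[] in place the issue path ends up doing roughly
the following (simplified from the earlier patch):

        if (req_cpu == WORK_CPU_UNBOUND)
                cpu = raw_smp_processor_id();

        if (!(wq->flags & WQ_UNBOUND))
                pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
        else
                pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));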

Thanks.

-- 
tejun


Re: [PATCHSET wq/for-3.10] workqueue: NUMA affinity for unbound workqueues

2013-03-20 Thread Tejun Heo
On Tue, Mar 19, 2013 at 05:00:19PM -0700, Tejun Heo wrote:
> and also available in the following git branch.
> 
>  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git review-numa

Branch rebased on top of the current wq/for-3.10 with updated patches.
The new comit ID is 9555fbc12d786a9eae7cf7701a6718a0c029173e.

Thanks.

-- 
tejun


[PATCH v2 UPDATED 09/10] workqueue: implement NUMA affinity for unbound workqueues

2013-03-20 Thread Tejun Heo
Currently, an unbound workqueue has a single current, or first, pwq
(pool_workqueue) to which all new work items are queued.  This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.

This patch implements NUMA affinity for unbound workqueues.  Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has possible CPUs in
@attrs->cpumask.  Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering whole @attrs->cpumask.
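
The gist of the per-node selection in apply_workqueue_attrs() is the
following loop (sketch; dfl_pwq allocation, error handling and the
actual installation into numa_pwq_tbl[] are omitted):

        for_each_node(node) {
                cpumask_and(tmp_attrs->cpumask, attrs->cpumask,
                            wq_numa_possible_cpumask[node]);
                if (cpumask_empty(tmp_attrs->cpumask))
                        pwq_tbl[node] = dfl_pwq;  /* whole @attrs->cpumask */
                else
                        pwq_tbl[node] = alloc_unbound_pwq(wq, tmp_attrs);
        }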

This ensures that all work items issued on a NUMA node are executed on
the same node as long as the workqueue allows execution on the CPUs of
the node.

As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this changes the behavior of max_active.  The limit is now per NUMA
node instead of global.  While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and primarily used as a safety
mechanism rather than for active concurrency control.  Concurrency is
usually limited from workqueue users by the number of concurrently
active work items and this change shouldn't matter much.

v2: Fixed pwq freeing in apply_workqueue_attrs() error path.  Spotted
by Lai.

Signed-off-by: Tejun Heo 
Cc: Lai Jiangshan 
---
Please forget about the previous posting.  It was freeing dfl_pwq
multiple times.  This one, hopefully, is correct.

Thanks.

 kernel/workqueue.c |  119 +++--
 1 file changed, 97 insertions(+), 22 deletions(-)

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3666,13 +3666,13 @@ static void init_pwq(struct pool_workque
pwq->flush_color = -1;
pwq->refcnt = 1;
INIT_LIST_HEAD(&pwq->delayed_works);
+   INIT_LIST_HEAD(&pwq->pwqs_node);
INIT_LIST_HEAD(&pwq->mayday_node);
INIT_WORK(&pwq->unbound_release_work, pwq_unbound_release_workfn);
 }
 
 /* sync @pwq with the current state of its associated wq and link it */
-static void link_pwq(struct pool_workqueue *pwq,
-struct pool_workqueue **p_last_pwq)
+static void link_pwq(struct pool_workqueue *pwq)
 {
struct workqueue_struct *wq = pwq->wq;
 
@@ -3683,8 +3683,6 @@ static void link_pwq(struct pool_workque
 * Set the matching work_color.  This is synchronized with
 * flush_mutex to avoid confusing flush_workqueue().
 */
-   if (p_last_pwq)
-   *p_last_pwq = first_pwq(wq);
pwq->work_color = wq->work_color;
 
/* sync max_active to the current setting */
@@ -3715,16 +3713,26 @@ static struct pool_workqueue *alloc_unbo
return pwq;
 }
 
+/* undo alloc_unbound_pwq(), used only in the error path */
+static void free_unbound_pwq(struct pool_workqueue *pwq)
+{
+   if (pwq) {
+   put_unbound_pool(pwq->pool);
+   kfree(pwq);
+   }
+}
+
 /**
  * apply_workqueue_attrs - apply new workqueue_attrs to an unbound workqueue
  * @wq: the target workqueue
  * @attrs: the workqueue_attrs to apply, allocated with alloc_workqueue_attrs()
  *
- * Apply @attrs to an unbound workqueue @wq.  If @attrs doesn't match the
- * current attributes, a new pwq is created and made the first pwq which
- * will serve all new work items.  Older pwqs are released as in-flight
- * work items finish.  Note that a work item which repeatedly requeues
- * itself back-to-back will stay on its current pwq.
+ * Apply @attrs to an unbound workqueue @wq.  Unless disabled, on NUMA
+ * machines, this function maps a separate pwq to each NUMA node with
+ * possibles CPUs in @attrs->cpumask so that work items are affine to the
+ * NUMA node it was issued on.  Older pwqs are released as in-flight work
+ * items finish.  Note that a work item which repeatedly requeues itself
+ * back-to-back will stay on its current pwq.
  *
  * Performs GFP_KERNEL allocations.  Returns 0 on success and -errno on
  * failure.
@@ -3732,7 +3740,8 @@ static struct pool_workqueue *alloc_unbo
 int apply_workqueue_attrs(struct workqueue_struct *wq,
  const struct workqueue_attrs *attrs)
 {
-   struct pool_workqueue *pwq, *last_pwq;
+   struct pool_workqueue **pwq_tbl = NULL, *dfl_pwq = NULL;
+   struct workqueue_attrs *tmp_attrs = NULL;
int node;
 
/* only unbound workqueues can change attributes */
@@ -3743,29 +3752,95 @@ int apply_workqueue_attrs(struct workque
if (WARN_ON((wq->flags & __WQ_ORDERED) && !list_empty(&wq->pwqs)))
return -EINVAL;
 
-   pwq = alloc_unbound_pwq(wq, attrs);
-   if (!pwq)
-   return -ENOMEM;
+   pwq_tbl = kzalloc(wq_numa_tbl_len * sizeof(pwq_tbl[0]), GFP_KERNEL);
+   tmp_attrs = a

[PATCH 11/10] workqueue: use NUMA-aware allocation for pool_workqueues workqueues

2013-03-20 Thread Tejun Heo
Use kmem_cache_alloc_node() with @pool->node instead of
kmem_cache_zalloc() when allocating a pool_workqueue so that it's
allocated on the same node as the associated worker_pool.  As there's
no no kmem_cache_zalloc_node(), move zeroing to init_pwq().

This was suggested by Lai Jiangshan.

Signed-off-by: Tejun Heo 
Cc: Lai Jiangshan 
---
 kernel/workqueue.c |6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3683,12 +3683,14 @@ static void pwq_adjust_max_active(struct
spin_unlock(&pwq->pool->lock);
 }
 
-/* initialize newly zalloced @pwq which is associated with @wq and @pool */
+/* initialize newly alloced @pwq which is associated with @wq and @pool */
 static void init_pwq(struct pool_workqueue *pwq, struct workqueue_struct *wq,
 struct worker_pool *pool)
 {
BUG_ON((unsigned long)pwq & WORK_STRUCT_FLAG_MASK);
 
+   memset(pwq, 0, sizeof(*pwq));
+
pwq->pool = pool;
pwq->wq = wq;
pwq->flush_color = -1;
@@ -3731,7 +3733,7 @@ static struct pool_workqueue *alloc_unbo
if (!pool)
return NULL;
 
-   pwq = kmem_cache_zalloc(pwq_cache, GFP_KERNEL);
+   pwq = kmem_cache_alloc_node(pwq_cache, GFP_KERNEL, pool->node);
if (!pwq) {
put_unbound_pool(pool);
return NULL;


[PATCH v2 09/10] workqueue: implement NUMA affinity for unbound workqueues

2013-03-20 Thread Tejun Heo
Currently, an unbound workqueue has a single current, or first, pwq
(pool_workqueue) to which all new work items are queued.  This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.

This patch implements NUMA affinity for unbound workqueues.  Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has possible CPUs in
@attrs->cpumask.  Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering whole @attrs->cpumask.

This ensures that all work items issued on a NUMA node are executed on
the same node as long as the workqueue allows execution on the CPUs of
the node.

As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this changes the behavior of max_active.  The limit is now per NUMA
node instead of global.  While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and primarily used as a safety
mechanism rather than for active concurrency control.  Concurrency is
usually limited from workqueue users by the number of concurrently
active work items and this change shouldn't matter much.

v2: Fixed pwq freeing in apply_workqueue_attrs() error path.  Spotted
by Lai.

Signed-off-by: Tejun Heo 
Cc: Lai Jiangshan 
---
 kernel/workqueue.c |  120 +++--
 1 file changed, 98 insertions(+), 22 deletions(-)

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3658,13 +3658,13 @@ static void init_pwq(struct pool_workque
pwq->flush_color = -1;
pwq->refcnt = 1;
INIT_LIST_HEAD(&pwq->delayed_works);
+   INIT_LIST_HEAD(&pwq->pwqs_node);
INIT_LIST_HEAD(&pwq->mayday_node);
INIT_WORK(&pwq->unbound_release_work, pwq_unbound_release_workfn);
 }
 
 /* sync @pwq with the current state of its associated wq and link it */
-static void link_pwq(struct pool_workqueue *pwq,
-struct pool_workqueue **p_last_pwq)
+static void link_pwq(struct pool_workqueue *pwq)
 {
struct workqueue_struct *wq = pwq->wq;
 
@@ -3675,8 +3675,6 @@ static void link_pwq(struct pool_workque
 * Set the matching work_color.  This is synchronized with
 * flush_mutex to avoid confusing flush_workqueue().
 */
-   if (p_last_pwq)
-   *p_last_pwq = first_pwq(wq);
pwq->work_color = wq->work_color;
 
/* sync max_active to the current setting */
@@ -3707,16 +3705,26 @@ static struct pool_workqueue *alloc_unbo
return pwq;
 }
 
+/* undo alloc_unbound_pwq(), used only in the error path */
+static void free_unbound_pwq(struct pool_workqueue *pwq)
+{
+   if (pwq) {
+   put_unbound_pool(pwq->pool);
+   kfree(pwq);
+   }
+}
+
 /**
  * apply_workqueue_attrs - apply new workqueue_attrs to an unbound workqueue
  * @wq: the target workqueue
  * @attrs: the workqueue_attrs to apply, allocated with alloc_workqueue_attrs()
  *
- * Apply @attrs to an unbound workqueue @wq.  If @attrs doesn't match the
- * current attributes, a new pwq is created and made the first pwq which
- * will serve all new work items.  Older pwqs are released as in-flight
- * work items finish.  Note that a work item which repeatedly requeues
- * itself back-to-back will stay on its current pwq.
+ * Apply @attrs to an unbound workqueue @wq.  Unless disabled, on NUMA
+ * machines, this function maps a separate pwq to each NUMA node with
+ * possibles CPUs in @attrs->cpumask so that work items are affine to the
+ * NUMA node it was issued on.  Older pwqs are released as in-flight work
+ * items finish.  Note that a work item which repeatedly requeues itself
+ * back-to-back will stay on its current pwq.
  *
  * Performs GFP_KERNEL allocations.  Returns 0 on success and -errno on
  * failure.
@@ -3724,7 +3732,8 @@ static struct pool_workqueue *alloc_unbo
 int apply_workqueue_attrs(struct workqueue_struct *wq,
  const struct workqueue_attrs *attrs)
 {
-   struct pool_workqueue *pwq, *last_pwq;
+   struct pool_workqueue **pwq_tbl = NULL, *dfl_pwq = NULL;
+   struct workqueue_attrs *tmp_attrs = NULL;
int node;
 
/* only unbound workqueues can change attributes */
@@ -3735,29 +3744,96 @@ int apply_workqueue_attrs(struct workque
if (WARN_ON((wq->flags & __WQ_ORDERED) && !list_empty(&wq->pwqs)))
return -EINVAL;
 
-   pwq = alloc_unbound_pwq(wq, attrs);
-   if (!pwq)
-   return -ENOMEM;
+   pwq_tbl = kzalloc(wq_numa_tbl_len * sizeof(pwq_tbl[0]), GFP_KERNEL);
+   tmp_attrs = alloc_workqueue_attrs(GFP_KERNEL);
+   if (!pwq_tbl || !tmp_attrs)
+   goto enomem;
+
+   copy_workqueue_attrs(tmp_attrs,

Re: [PATCH 08/10] workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq()

2013-03-20 Thread Tejun Heo
On Wed, Mar 20, 2013 at 11:52:03PM +0800, Lai Jiangshan wrote:
> > +static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
> > +   const struct workqueue_attrs *attrs)
> > +{
> > +   struct worker_pool *pool;
> > +   struct pool_workqueue *pwq;
> > +
> > +   pool = get_unbound_pool(attrs);
> > +   if (!pool)
> > +   return NULL;
> > +
> > +   pwq = kmem_cache_zalloc(pwq_cache, GFP_KERNEL);
> 
> this allocation is not numa-awared, you may use pool->node here.

Nice catch.  Will do.

Thanks.

-- 
tejun


Re: [PATCH 01/10] workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]

2013-03-20 Thread Tejun Heo
On Wed, Mar 20, 2013 at 11:43:30PM +0800, Lai Jiangshan wrote:
> > +   for_each_node(node)
> > +   BUG_ON(!alloc_cpumask_var_node(&wq_numa_possible_cpumask[node],
> > +  GFP_KERNEL, node));
> > +   for_each_possible_cpu(cpu) {
> > +   node = cpu_to_node(cpu);
> > +   if (WARN_ON(node == NUMA_NO_NODE)) {
> > +   pr_err("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
> 
> since you used WARN_ON(), not BUG_ON(), I think you need to free
> allocated memory here.

I don't know.  Is it necessary?  It shouldn't happen sans bugs in arch
code and we're vomiting warning message with full stack trace.  The
system will function but something is seriously screwed.  I don't
think it matters whether we free the puny number of bytes there or
not and I don't wanna nest deeper there. :P

-- 
tejun


Re: [PATCH 09/10] workqueue: implement NUMA affinity for unbound workqueues

2013-03-20 Thread Tejun Heo
On Wed, Mar 20, 2013 at 8:26 AM, Lai Jiangshan  wrote:
>> for_each_node(node)
>> if (pwq_tbl[node] != dfl_pwq)
>> kfree(pwq_tbl[node]);
>> kfree(dfl_pwq);
>
> I also missed.
> we still need to put_unbound_pool() before free(pwq)

Yeap, we do.  Wonder whether we can just use put_pwq() there. I'll see if I can.
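
I.e. each successful get_unbound_pool() in that path needs to be paired
with a put before the pwq itself is freed -- something like:

        static void free_unbound_pwq(struct pool_workqueue *pwq)
        {
                if (pwq) {
                        put_unbound_pool(pwq->pool);  /* drop the pool ref */
                        kfree(pwq);
                }
        }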

-- 
tejun


Re: [PATCH 09/10] workqueue: implement NUMA affinity for unbound workqueues

2013-03-20 Thread Tejun Heo
On Wed, Mar 20, 2013 at 11:03:53PM +0800, Lai Jiangshan wrote:
> > +enomem:
> > +   free_workqueue_attrs(tmp_attrs);
> > +   if (pwq_tbl) {
> > +   for_each_node(node)
> > +   kfree(pwq_tbl[node]);
> 
> It will free the dfl_pwq multi times.

Oops, you're right.  We need to do


for_each_node(node)
if (pwq_tbl[node] != dfl_pwq)
kfree(pwq_tbl[node]);
kfree(dfl_pwq);

Thanks!

-- 
tejun


Re: [PATCH 01/10] workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]

2013-03-20 Thread Tejun Heo
On Wed, Mar 20, 2013 at 11:08:29PM +0900, JoonSoo Kim wrote:
> 2013/3/20 Tejun Heo :
> > Unbound workqueues are going to be NUMA-affine.  Add wq_numa_tbl_len
> > and wq_numa_possible_cpumask[] in preparation.  The former is the
> > highest NUMA node ID + 1 and the latter is masks of possibles CPUs for
> > each NUMA node.
> 
> It is better to move this code to topology.c or cpumask.c,
> then it can be generally used.

Yeah, it just isn't clear where it should go.  Most NUMA
initialization happens during early boot, is arch-specific, and
different archs expect NUMA information to be valid at different
points.  We can do it from one of the early initcalls by which all NUMA
information should be valid but, I don't know, having different NUMA
information coming up at different times seems error-prone.  There
would be no apparent indication that certain parts are available while
others are not.

We could solve this by unifying how NUMA information is represented
and initialized.  e.g. if all NUMA archs used
CONFIG_USE_PERCPU_NUMA_NODE_ID, we can simply modify
set_cpu_numa_node() to build all data structures as NUMA nodes are
discovered.  Unfortunately, this isn't true yet and it's gonna be a
bit of work to get it into a consistent state as it spans multiple
architectures (not too many tho, NUMA fortunately is rare), so if
somebody can clean it up, I'll be happy to move these to topology.
Right now, I think it's best to just carry it in workqueue.c.
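
For context, the whole thing on the workqueue side is basically a short
init-time loop, roughly:

        for_each_node(node)
                BUG_ON(!alloc_cpumask_var_node(&wq_numa_possible_cpumask[node],
                                               GFP_KERNEL, node));
        for_each_possible_cpu(cpu) {
                node = cpu_to_node(cpu);
                if (WARN_ON(node == NUMA_NO_NODE))
                        break;  /* mapping unreliable, NUMA affinity stays off */
                cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]);
        }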

Thanks.

-- 
tejun


[PATCH 05/10] workqueue: make workqueue->name[] fixed len

2013-03-19 Thread Tejun Heo
Currently workqueue->name[] is of flexible length.  We want to use the
flexible field for something more useful and there isn't much benefit
in allowing arbitrary name length anyway.  Make it fixed length, capped
at 24 bytes.

Signed-off-by: Tejun Heo 
---
 kernel/workqueue.c | 19 ---
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 57e6139..151ce49 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -101,6 +101,8 @@ enum {
 */
RESCUER_NICE_LEVEL  = -20,
HIGHPRI_NICE_LEVEL  = -20,
+
+   WQ_NAME_LEN = 24,
 };
 
 /*
@@ -255,7 +257,7 @@ struct workqueue_struct {
 #ifdef CONFIG_LOCKDEP
struct lockdep_map  lockdep_map;
 #endif
-   charname[]; /* I: workqueue name */
+   charname[WQ_NAME_LEN]; /* I: workqueue name */
 };
 
 static struct kmem_cache *pwq_cache;
@@ -3755,17 +3757,12 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
   struct lock_class_key *key,
   const char *lock_name, ...)
 {
-   va_list args, args1;
+   va_list args;
struct workqueue_struct *wq;
struct pool_workqueue *pwq;
-   size_t namelen;
-
-   /* determine namelen, allocate wq and format name */
-   va_start(args, lock_name);
-   va_copy(args1, args);
-   namelen = vsnprintf(NULL, 0, fmt, args) + 1;
 
-   wq = kzalloc(sizeof(*wq) + namelen, GFP_KERNEL);
+   /* allocate wq and format name */
+   wq = kzalloc(sizeof(*wq), GFP_KERNEL);
if (!wq)
return NULL;
 
@@ -3775,9 +3772,9 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
goto err_free_wq;
}
 
-   vsnprintf(wq->name, namelen, fmt, args1);
+   va_start(args, lock_name);
+   vsnprintf(wq->name, sizeof(wq->name), fmt, args);
va_end(args);
-   va_end(args1);
 
max_active = max_active ?: WQ_DFL_ACTIVE;
max_active = wq_clamp_max_active(max_active, flags, wq->name);
-- 
1.8.1.4



[PATCH 09/10] workqueue: implement NUMA affinity for unbound workqueues

2013-03-19 Thread Tejun Heo
Currently, an unbound workqueue has a single current, or first, pwq
(pool_workqueue) to which all new work items are queued.  This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.

This patch implements NUMA affinity for unbound workqueues.  Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has possible CPUs in
@attrs->cpumask.  Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering whole @attrs->cpumask.

This ensures that all work items issued on a NUMA node are executed on
the same node as long as the workqueue allows execution on the CPUs of
the node.

As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this changes the behavior of max_active.  The limit is now per NUMA
node instead of global.  While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and primarily used as a safety
mechanism rather than for active concurrency control.  Concurrency is
usually limited from workqueue users by the number of concurrently
active work items and this change shouldn't matter much.

Signed-off-by: Tejun Heo 
---
 kernel/workqueue.c | 108 ++---
 1 file changed, 86 insertions(+), 22 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index bbbfc92..0c36327 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3658,13 +3658,13 @@ static void init_pwq(struct pool_workqueue *pwq, struct workqueue_struct *wq,
pwq->flush_color = -1;
pwq->refcnt = 1;
INIT_LIST_HEAD(&pwq->delayed_works);
+   INIT_LIST_HEAD(&pwq->pwqs_node);
INIT_LIST_HEAD(&pwq->mayday_node);
INIT_WORK(&pwq->unbound_release_work, pwq_unbound_release_workfn);
 }
 
 /* sync @pwq with the current state of its associated wq and link it */
-static void link_pwq(struct pool_workqueue *pwq,
-struct pool_workqueue **p_last_pwq)
+static void link_pwq(struct pool_workqueue *pwq)
 {
struct workqueue_struct *wq = pwq->wq;
 
@@ -3675,8 +3675,6 @@ static void link_pwq(struct pool_workqueue *pwq,
 * Set the matching work_color.  This is synchronized with
 * flush_mutex to avoid confusing flush_workqueue().
 */
-   if (p_last_pwq)
-   *p_last_pwq = first_pwq(wq);
pwq->work_color = wq->work_color;
 
/* sync max_active to the current setting */
@@ -3712,11 +3710,12 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
  * @wq: the target workqueue
  * @attrs: the workqueue_attrs to apply, allocated with alloc_workqueue_attrs()
  *
- * Apply @attrs to an unbound workqueue @wq.  If @attrs doesn't match the
- * current attributes, a new pwq is created and made the first pwq which
- * will serve all new work items.  Older pwqs are released as in-flight
- * work items finish.  Note that a work item which repeatedly requeues
- * itself back-to-back will stay on its current pwq.
+ * Apply @attrs to an unbound workqueue @wq.  Unless disabled, on NUMA
+ * machines, this function maps a separate pwq to each NUMA node with
+ * possibles CPUs in @attrs->cpumask so that work items are affine to the
+ * NUMA node it was issued on.  Older pwqs are released as in-flight work
+ * items finish.  Note that a work item which repeatedly requeues itself
+ * back-to-back will stay on its current pwq.
  *
  * Performs GFP_KERNEL allocations.  Returns 0 on success and -errno on
  * failure.
@@ -3724,7 +3723,8 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
 int apply_workqueue_attrs(struct workqueue_struct *wq,
  const struct workqueue_attrs *attrs)
 {
-   struct pool_workqueue *pwq, *last_pwq;
+   struct pool_workqueue **pwq_tbl = NULL, *dfl_pwq = NULL;
+   struct workqueue_attrs *tmp_attrs = NULL;
int node;
 
/* only unbound workqueues can change attributes */
@@ -3735,29 +3735,93 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
if (WARN_ON((wq->flags & __WQ_ORDERED) && !list_empty(&wq->pwqs)))
return -EINVAL;
 
-   pwq = alloc_unbound_pwq(wq, attrs);
-   if (!pwq)
-   return -ENOMEM;
+   pwq_tbl = kzalloc(wq_numa_tbl_len * sizeof(pwq_tbl[0]), GFP_KERNEL);
+   tmp_attrs = alloc_workqueue_attrs(GFP_KERNEL);
+   if (!pwq_tbl || !tmp_attrs)
+   goto enomem;
+
+   copy_workqueue_attrs(tmp_attrs, attrs);
+
+   /*
+* We want NUMA affinity.  For each node with intersecting possible
+* CPUs with the requested cpumask, create a separate pwq covering
+* the instersection.  Nodes witho

[PATCH 06/10] workqueue: move hot fields of workqueue_struct to the end

2013-03-19 Thread Tejun Heo
Move wq->flags and ->cpu_pwqs to the end of workqueue_struct and align
them to the cacheline.  These two fields are used in the work item
issue path and thus hot.  The scheduled NUMA affinity support will add
a dispatch table at the end of workqueue_struct, and relocating these two
fields will allow us to hit only a single cacheline on hot paths.

Note that wq->pwqs isn't moved although it currently is being used in
the work item issue path for unbound workqueues.  The dispatch table
mentioned above will replace its use in the issue path, so it will
become cold once NUMA support is implemented.

Signed-off-by: Tejun Heo 
---
 kernel/workqueue.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 151ce49..25dab9d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -230,8 +230,6 @@ struct wq_device;
  * the appropriate worker_pool through its pool_workqueues.
  */
 struct workqueue_struct {
-   unsigned intflags;  /* WQ: WQ_* flags */
-   struct pool_workqueue __percpu *cpu_pwqs; /* I: per-cpu pwq's */
struct list_headpwqs;   /* FR: all pwqs of this wq */
struct list_headlist;   /* WQ: list of all workqueues */
 
@@ -258,6 +256,10 @@ struct workqueue_struct {
struct lockdep_map  lockdep_map;
 #endif
charname[WQ_NAME_LEN]; /* I: workqueue name */
+
+   /* hot fields used during command issue, aligned to cacheline */
+   unsigned intflags cacheline_aligned; /* WQ: WQ_* flags */
+   struct pool_workqueue __percpu *cpu_pwqs; /* I: per-cpu pwqs */
 };
 
 static struct kmem_cache *pwq_cache;
-- 
1.8.1.4



[PATCH 04/10] workqueue: add workqueue->unbound_attrs

2013-03-19 Thread Tejun Heo
Currently, when exposing attrs of an unbound workqueue via sysfs, the
workqueue_attrs of first_pwq() is used as that should equal the
current state of the workqueue.

The planned NUMA affinity support will make unbound workqueues make
use of multiple pool_workqueues for different NUMA nodes and the above
assumption will no longer hold.  Introduce workqueue->unbound_attrs
which records the current attrs in effect and use it for sysfs instead
of first_pwq()->attrs.

Signed-off-by: Tejun Heo 
---
 kernel/workqueue.c | 36 
 1 file changed, 24 insertions(+), 12 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 2768ed2..57e6139 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -247,6 +247,8 @@ struct workqueue_struct {
int nr_drainers;/* WQ: drain in progress */
int saved_max_active; /* PW: saved pwq max_active */
 
+   struct workqueue_attrs  *unbound_attrs; /* WQ: only for unbound wqs */
+
 #ifdef CONFIG_SYSFS
struct wq_device*wq_dev;/* I: for sysfs interface */
 #endif
@@ -3099,10 +3101,9 @@ static ssize_t wq_nice_show(struct device *dev, struct device_attribute *attr,
struct workqueue_struct *wq = dev_to_wq(dev);
int written;
 
-   rcu_read_lock_sched();
-   written = scnprintf(buf, PAGE_SIZE, "%d\n",
-   first_pwq(wq)->pool->attrs->nice);
-   rcu_read_unlock_sched();
+   mutex_lock(&wq_mutex);
+   written = scnprintf(buf, PAGE_SIZE, "%d\n", wq->unbound_attrs->nice);
+   mutex_unlock(&wq_mutex);
 
return written;
 }
@@ -3116,9 +3117,9 @@ static struct workqueue_attrs *wq_sysfs_prep_attrs(struct workqueue_struct *wq)
if (!attrs)
return NULL;
 
-   rcu_read_lock_sched();
-   copy_workqueue_attrs(attrs, first_pwq(wq)->pool->attrs);
-   rcu_read_unlock_sched();
+   mutex_lock(&wq_mutex);
+   copy_workqueue_attrs(attrs, wq->unbound_attrs);
+   mutex_unlock(&wq_mutex);
return attrs;
 }
 
@@ -3149,10 +3150,9 @@ static ssize_t wq_cpumask_show(struct device *dev,
struct workqueue_struct *wq = dev_to_wq(dev);
int written;
 
-   rcu_read_lock_sched();
-   written = cpumask_scnprintf(buf, PAGE_SIZE,
-   first_pwq(wq)->pool->attrs->cpumask);
-   rcu_read_unlock_sched();
+   mutex_lock(&wq_mutex);
+   written = cpumask_scnprintf(buf, PAGE_SIZE, wq->unbound_attrs->cpumask);
+   mutex_unlock(&wq_mutex);
 
written += scnprintf(buf + written, PAGE_SIZE - written, "\n");
return written;
@@ -3585,8 +3585,10 @@ static void pwq_unbound_release_workfn(struct work_struct *work)
 * If we're the last pwq going away, @wq is already dead and no one
 * is gonna access it anymore.  Free it.
 */
-   if (list_empty(&wq->pwqs))
+   if (list_empty(&wq->pwqs)) {
+   free_workqueue_attrs(wq->unbound_attrs);
kfree(wq);
+   }
 }
 
 /**
@@ -3656,6 +3658,9 @@ static void init_and_link_pwq(struct pool_workqueue *pwq,
/* link in @pwq */
list_add_rcu(&pwq->pwqs_node, &wq->pwqs);
 
+   if (wq->flags & WQ_UNBOUND)
+   copy_workqueue_attrs(wq->unbound_attrs, pool->attrs);
+
spin_unlock_irq(&pwq_lock);
mutex_unlock(&wq->flush_mutex);
 }
@@ -3764,6 +3769,12 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
if (!wq)
return NULL;
 
+   if (flags & WQ_UNBOUND) {
+   wq->unbound_attrs = alloc_workqueue_attrs(GFP_KERNEL);
+   if (!wq->unbound_attrs)
+   goto err_free_wq;
+   }
+
vsnprintf(wq->name, namelen, fmt, args1);
va_end(args);
va_end(args1);
@@ -3832,6 +3843,7 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
return wq;
 
 err_free_wq:
+   free_workqueue_attrs(wq->unbound_attrs);
kfree(wq);
return NULL;
 err_destroy:
-- 
1.8.1.4



[PATCH 03/10] workqueue: determine NUMA node of workers according to the allowed cpumask

2013-03-19 Thread Tejun Heo
When worker tasks are created using kthread_create_on_node(),
currently only per-cpu ones have the matching NUMA node specified.
All unbound workers are always created with NUMA_NO_NODE.

Now that an unbound worker pool may have an arbitrary cpumask
associated with it, this isn't optimal.  Add pool->node which is
determined by the pool's cpumask.  If the pool's cpumask is contained
inside a NUMA node proper, the pool is associated with that node, and
all workers of the pool are created on that node.

This currently only makes a difference for unbound worker pools with
a cpumask contained inside a single NUMA node, but this will serve as
foundation for making all unbound pools NUMA-affine.

Signed-off-by: Tejun Heo 
---
 kernel/workqueue.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c1a931c..2768ed2 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -141,6 +141,7 @@ enum {
 struct worker_pool {
spinlock_t  lock;   /* the pool lock */
int cpu;/* I: the associated cpu */
+   int node;   /* I: the associated node ID */
int id; /* I: pool ID */
unsigned intflags;  /* X: flags */
 
@@ -1649,7 +1650,6 @@ static struct worker *alloc_worker(void)
 static struct worker *create_worker(struct worker_pool *pool)
 {
struct worker *worker = NULL;
-   int node = pool->cpu >= 0 ? cpu_to_node(pool->cpu) : NUMA_NO_NODE;
int id = -1;
char id_buf[16];
 
@@ -1682,7 +1682,7 @@ static struct worker *create_worker(struct worker_pool *pool)
else
snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);
 
-   worker->task = kthread_create_on_node(worker_thread, worker, node,
+   worker->task = kthread_create_on_node(worker_thread, worker, pool->node,
  "kworker/%s", id_buf);
if (IS_ERR(worker->task))
goto fail;
@@ -3390,6 +3390,7 @@ static int init_worker_pool(struct worker_pool *pool)
spin_lock_init(&pool->lock);
pool->id = -1;
pool->cpu = -1;
+   pool->node = NUMA_NO_NODE;
pool->flags |= POOL_DISASSOCIATED;
INIT_LIST_HEAD(&pool->worklist);
INIT_LIST_HEAD(&pool->idle_list);
@@ -3496,6 +3497,7 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
 {
u32 hash = wqattrs_hash(attrs);
struct worker_pool *pool;
+   int node;
 
mutex_lock(&wq_mutex);
 
@@ -3515,6 +3517,17 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
lockdep_set_subclass(&pool->lock, 1);   /* see put_pwq() */
copy_workqueue_attrs(pool->attrs, attrs);
 
+   /* if cpumask is contained inside a NUMA node, we belong to that node */
+   if (wq_numa_possible_cpumask) {
+   for_each_node(node) {
+   if (cpumask_subset(pool->attrs->cpumask,
+  wq_numa_possible_cpumask[node])) {
+   pool->node = node;
+   break;
+   }
+   }
+   }
+
if (worker_pool_assign_id(pool) < 0)
goto fail;
 
@@ -4473,6 +4486,7 @@ static int __init init_workqueues(void)
pool->cpu = cpu;
cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
pool->attrs->nice = std_nice[i++];
+   pool->node = cpu_to_node(cpu);
 
/* alloc pool ID */
mutex_lock(&wq_mutex);
-- 
1.8.1.4



[PATCH 02/10] workqueue: drop 'H' from kworker names of unbound worker pools

2013-03-19 Thread Tejun Heo
Currently, all workqueue workers which have a negative nice value have
'H' postfixed to their names.  This is necessary for per-cpu workers
as they use the CPU number instead of pool->id to identify the pool
and the 'H' postfix is the only thing distinguishing normal and
highpri workers.

As workers for unbound pools use pool->id, the 'H' postfix is purely
informational.  TASK_COMM_LEN is 16 and after the static part and
delimiters, there are only five characters left for the pool and
worker IDs.  We're expecting to have more unbound pools with the
scheduled NUMA awareness support.  Let's drop the non-essential 'H'
postfix from unbound kworker name.

While at it, restructure kthread_create*() invocation to help future
NUMA related changes.

Signed-off-by: Tejun Heo 
---
 kernel/workqueue.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9b096e3..c1a931c 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1648,9 +1648,10 @@ static struct worker *alloc_worker(void)
  */
 static struct worker *create_worker(struct worker_pool *pool)
 {
-   const char *pri = pool->attrs->nice < 0  ? "H" : "";
struct worker *worker = NULL;
+   int node = pool->cpu >= 0 ? cpu_to_node(pool->cpu) : NUMA_NO_NODE;
int id = -1;
+   char id_buf[16];
 
lockdep_assert_held(&pool->manager_mutex);
 
@@ -1676,13 +1677,13 @@ static struct worker *create_worker(struct worker_pool *pool)
worker->id = id;
 
if (pool->cpu >= 0)
-   worker->task = kthread_create_on_node(worker_thread,
-   worker, cpu_to_node(pool->cpu),
-   "kworker/%d:%d%s", pool->cpu, id, pri);
+   snprintf(id_buf, sizeof(id_buf), "%d:%d%s", pool->cpu, id,
+pool->attrs->nice < 0  ? "H" : "");
else
-   worker->task = kthread_create(worker_thread, worker,
- "kworker/u%d:%d%s",
- pool->id, id, pri);
+   snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);
+
+   worker->task = kthread_create_on_node(worker_thread, worker, node,
+ "kworker/%s", id_buf);
if (IS_ERR(worker->task))
goto fail;
 
-- 
1.8.1.4



[PATCH 10/10] workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity

2013-03-19 Thread Tejun Heo
Unbound workqueues are now NUMA aware.  Let's add some control knobs
and update sysfs interface accordingly.

* Add kernel param workqueue.disable_numa which disables NUMA affinity
  globally.

* Replace sysfs file "pool_id" with "pool_ids" which contain
  node:pool_id pairs.  This change is userland-visible but "pool_id"
  hasn't seen a release yet, so this is okay.

* Add a new sysfs file "numa" which can toggle NUMA affinity on
  individual workqueues.  This is implemented as attrs->no_numa which
  is special in that it isn't part of a pool's attributes.  It only
  affects how apply_workqueue_attrs() picks which pools to use.

After "pool_ids" change, first_pwq() doesn't have any user left.
Removed.
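
Concretely, "isn't part of a pool's attributes" means pool lookup
ignores the flag; an equality check over attrs would look roughly like
this (a sketch with a made-up helper name, not the actual code in the
patch):

    /* sketch: ->no_numa is ignored so that workqueues differing only in
     * the NUMA toggle can still share the same backing pools */
    static bool attrs_equal_sketch(const struct workqueue_attrs *a,
                                   const struct workqueue_attrs *b)
    {
            /* a->no_numa / b->no_numa deliberately not compared */
            return a->nice == b->nice &&
                   cpumask_equal(a->cpumask, b->cpumask);
    }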

Signed-off-by: Tejun Heo 
---
 Documentation/kernel-parameters.txt |   9 +++
 include/linux/workqueue.h   |   5 ++
 kernel/workqueue.c  | 125 +---
 3 files changed, 102 insertions(+), 37 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 4609e81..c75ea0b 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3222,6 +3222,15 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
or other driver-specific files in the
Documentation/watchdog/ directory.
 
+   workqueue.disable_numa
+   By default, all work items queued to unbound
+   workqueues are affine to the NUMA nodes they're
+   issued on, which results in better behavior in
+   general.  If NUMA affinity needs to be disabled for
+   whatever reason, this option can be used.  Note
+   that this also can be controlled per-workqueue for
+   workqueues visible under /sys/bus/workqueue/.
+
x2apic_phys [X86-64,APIC] Use x2apic physical mode instead of
default x2apic cluster mode on platforms
supporting x2apic.
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 835d12b..7179756 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -119,10 +119,15 @@ struct delayed_work {
 /*
  * A struct for workqueue attributes.  This can be used to change
  * attributes of an unbound workqueue.
+ *
+ * Unlike other fields, ->no_numa isn't a property of a worker_pool.  It
+ * only modifies how apply_workqueue_attrs() select pools and thus doesn't
+ * participate in pool hash calculations or equality comparisons.
  */
 struct workqueue_attrs {
int nice;   /* nice level */
cpumask_var_t   cpumask;/* allowed CPUs */
+   boolno_numa;/* disable NUMA affinity */
 };
 
 static inline struct delayed_work *to_delayed_work(struct work_struct *work)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 0c36327..b48373a 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -45,6 +45,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "workqueue_internal.h"
 
@@ -302,6 +303,9 @@ EXPORT_SYMBOL_GPL(system_unbound_wq);
 struct workqueue_struct *system_freezable_wq __read_mostly;
 EXPORT_SYMBOL_GPL(system_freezable_wq);
 
+static bool wq_disable_numa;
+module_param_named(disable_numa, wq_disable_numa, bool, 0444);
+
 static int worker_thread(void *__worker);
 static void copy_workqueue_attrs(struct workqueue_attrs *to,
 const struct workqueue_attrs *from);
@@ -516,21 +520,6 @@ static int worker_pool_assign_id(struct worker_pool *pool)
 }
 
 /**
- * first_pwq - return the first pool_workqueue of the specified workqueue
- * @wq: the target workqueue
- *
- * This must be called either with pwq_lock held or sched RCU read locked.
- * If the pwq needs to be used beyond the locking in effect, the caller is
- * responsible for guaranteeing that the pwq stays online.
- */
-static struct pool_workqueue *first_pwq(struct workqueue_struct *wq)
-{
-   assert_rcu_or_pwq_lock();
-   return list_first_or_null_rcu(&wq->pwqs, struct pool_workqueue,
- pwqs_node);
-}
-
-/**
  * unbound_pwq_by_node - return the unbound pool_workqueue for the given node
  * @wq: the target workqueue
  * @node: the node ID
@@ -3101,16 +3090,21 @@ static struct device_attribute wq_sysfs_attrs[] = {
__ATTR_NULL,
 };
 
-static ssize_t wq_pool_id_show(struct device *dev,
-  struct device_attribute *attr, char *buf)
+static ssize_t wq_pool_ids_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
 {
struct workqueue_struct *wq = dev_to_wq(dev);
-   struct worker_pool *pool;
-   int writ

[PATCH 07/10] workqueue: map an unbound workqueues to multiple per-node pool_workqueues

2013-03-19 Thread Tejun Heo
Currently, an unbound workqueue has only one "current" pool_workqueue
associated with it.  It may have multiple pool_workqueues but only the
first pool_workqueue serves new work items.  For NUMA affinity, we
want to change this so that there are multiple current pool_workqueues
serving different NUMA nodes.

Introduce workqueue->numa_pwq_tbl[] which is indexed by NUMA node and
points to the pool_workqueue to use for each possible node.  This
replaces first_pwq() in __queue_work() and workqueue_congested().

numa_pwq_tbl[] is currently initialized to point to the same
pool_workqueue as first_pwq() so this patch doesn't make any behavior
changes.

Signed-off-by: Tejun Heo 
---
 kernel/workqueue.c | 48 +---
 1 file changed, 37 insertions(+), 11 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 25dab9d..3f820a5 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -260,6 +260,7 @@ struct workqueue_struct {
/* hot fields used during command issue, aligned to cacheline */
unsigned intflags cacheline_aligned; /* WQ: WQ_* flags 
*/
struct pool_workqueue __percpu *cpu_pwqs; /* I: per-cpu pwqs */
+   struct pool_workqueue __rcu *numa_pwq_tbl[]; /* FR: unbound pwqs 
indexed by node */
 };
 
 static struct kmem_cache *pwq_cache;
@@ -529,6 +530,22 @@ static struct pool_workqueue *first_pwq(struct 
workqueue_struct *wq)
  pwqs_node);
 }
 
+/**
+ * unbound_pwq_by_node - return the unbound pool_workqueue for the given node
+ * @wq: the target workqueue
+ * @node: the node ID
+ *
+ * This must be called either with pwq_lock held or sched RCU read locked.
+ * If the pwq needs to be used beyond the locking in effect, the caller is
+ * responsible for guaranteeing that the pwq stays online.
+ */
+static struct pool_workqueue *unbound_pwq_by_node(struct workqueue_struct *wq,
+ int node)
+{
+   assert_rcu_or_pwq_lock();
+   return rcu_dereference_sched(wq->numa_pwq_tbl[node]);
+}
+
 static unsigned int work_color_to_flags(int color)
 {
return color << WORK_STRUCT_COLOR_SHIFT;
@@ -1282,14 +1299,14 @@ static void __queue_work(int cpu, struct 
workqueue_struct *wq,
WARN_ON_ONCE(!is_chained_work(wq)))
return;
 retry:
+   if (req_cpu == WORK_CPU_UNBOUND)
+   cpu = raw_smp_processor_id();
+
/* pwq which will be used unless @work is executing elsewhere */
-   if (!(wq->flags & WQ_UNBOUND)) {
-   if (cpu == WORK_CPU_UNBOUND)
-   cpu = raw_smp_processor_id();
+   if (!(wq->flags & WQ_UNBOUND))
pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
-   } else {
-   pwq = first_pwq(wq);
-   }
+   else
+   pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
 
/*
 * If @work was previously on a different pool, it might still be
@@ -1319,8 +1336,8 @@ retry:
 * pwq is determined and locked.  For unbound pools, we could have
 * raced with pwq release and it could already be dead.  If its
 * refcnt is zero, repeat pwq selection.  Note that pwqs never die
-* without another pwq replacing it as the first pwq or while a
-* work item is executing on it, so the retying is guaranteed to
+* without another pwq replacing it in the numa_pwq_tbl or while
+* work items are executing on it, so the retrying is guaranteed to
 * make forward-progress.
 */
if (unlikely(!pwq->refcnt)) {
@@ -3635,6 +3652,8 @@ static void init_and_link_pwq(struct pool_workqueue *pwq,
  struct worker_pool *pool,
  struct pool_workqueue **p_last_pwq)
 {
+   int node;
+
BUG_ON((unsigned long)pwq & WORK_STRUCT_FLAG_MASK);
 
pwq->pool = pool;
@@ -3662,8 +3681,11 @@ static void init_and_link_pwq(struct pool_workqueue *pwq,
/* link in @pwq */
list_add_rcu(&pwq->pwqs_node, &wq->pwqs);
 
-   if (wq->flags & WQ_UNBOUND)
+   if (wq->flags & WQ_UNBOUND) {
copy_workqueue_attrs(wq->unbound_attrs, pool->attrs);
+   for_each_node(node)
+   rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
+   }
 
spin_unlock_irq(&pwq_lock);
mutex_unlock(&wq->flush_mutex);
@@ -3759,12 +3781,16 @@ struct workqueue_struct *__alloc_workqueue_key(const 
char *fmt,
   struct lock_class_key *key,
   const char *lock_name, ...)
 {
+   size_t tbl_size = 0;
va_list args;
struct workqueue_struct *wq;
struct pool_workqueue *pwq;
 
/* allocate wq and format name */
-   wq = kzalloc(sizeof(*wq), GFP_KERNEL)

[PATCH 08/10] workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq()

2013-03-19 Thread Tejun Heo
Break init_and_link_pwq() into init_pwq() and link_pwq() and move
unbound-workqueue specific handling into apply_workqueue_attrs().
Also, factor out unbound pool and pool_workqueue allocation into
alloc_unbound_pwq().

This reorganization is to prepare for NUMA affinity and doesn't
introduce any functional changes.

Signed-off-by: Tejun Heo 
---
 kernel/workqueue.c | 75 +-
 1 file changed, 52 insertions(+), 23 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 3f820a5..bbbfc92 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3647,13 +3647,10 @@ static void pwq_adjust_max_active(struct pool_workqueue 
*pwq)
spin_unlock(&pwq->pool->lock);
 }
 
-static void init_and_link_pwq(struct pool_workqueue *pwq,
- struct workqueue_struct *wq,
- struct worker_pool *pool,
- struct pool_workqueue **p_last_pwq)
+/* initialize newly zalloced @pwq which is associated with @wq and @pool */
+static void init_pwq(struct pool_workqueue *pwq, struct workqueue_struct *wq,
+struct worker_pool *pool)
 {
-   int node;
-
BUG_ON((unsigned long)pwq & WORK_STRUCT_FLAG_MASK);
 
pwq->pool = pool;
@@ -3663,9 +3660,16 @@ static void init_and_link_pwq(struct pool_workqueue *pwq,
INIT_LIST_HEAD(&pwq->delayed_works);
INIT_LIST_HEAD(&pwq->mayday_node);
INIT_WORK(&pwq->unbound_release_work, pwq_unbound_release_workfn);
+}
 
-   mutex_lock(&wq->flush_mutex);
-   spin_lock_irq(&pwq_lock);
+/* sync @pwq with the current state of its associated wq and link it */
+static void link_pwq(struct pool_workqueue *pwq,
+struct pool_workqueue **p_last_pwq)
+{
+   struct workqueue_struct *wq = pwq->wq;
+
+   lockdep_assert_held(&wq->flush_mutex);
+   lockdep_assert_held(&pwq_lock);
 
/*
 * Set the matching work_color.  This is synchronized with
@@ -3680,15 +3684,27 @@ static void init_and_link_pwq(struct pool_workqueue 
*pwq,
 
/* link in @pwq */
list_add_rcu(&pwq->pwqs_node, &wq->pwqs);
+}
 
-   if (wq->flags & WQ_UNBOUND) {
-   copy_workqueue_attrs(wq->unbound_attrs, pool->attrs);
-   for_each_node(node)
-   rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
+/* obtain a pool matching @attr and create a pwq associating the pool and @wq 
*/
+static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
+   const struct workqueue_attrs *attrs)
+{
+   struct worker_pool *pool;
+   struct pool_workqueue *pwq;
+
+   pool = get_unbound_pool(attrs);
+   if (!pool)
+   return NULL;
+
+   pwq = kmem_cache_zalloc(pwq_cache, GFP_KERNEL);
+   if (!pwq) {
+   put_unbound_pool(pool);
+   return NULL;
}
 
-   spin_unlock_irq(&pwq_lock);
-   mutex_unlock(&wq->flush_mutex);
+   init_pwq(pwq, wq, pool);
+   return pwq;
 }
 
 /**
@@ -3709,7 +3725,7 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
  const struct workqueue_attrs *attrs)
 {
struct pool_workqueue *pwq, *last_pwq;
-   struct worker_pool *pool;
+   int node;
 
/* only unbound workqueues can change attributes */
if (WARN_ON(!(wq->flags & WQ_UNBOUND)))
@@ -3719,17 +3735,22 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
if (WARN_ON((wq->flags & __WQ_ORDERED) && !list_empty(&wq->pwqs)))
return -EINVAL;
 
-   pwq = kmem_cache_zalloc(pwq_cache, GFP_KERNEL);
+   pwq = alloc_unbound_pwq(wq, attrs);
if (!pwq)
return -ENOMEM;
 
-   pool = get_unbound_pool(attrs);
-   if (!pool) {
-   kmem_cache_free(pwq_cache, pwq);
-   return -ENOMEM;
-   }
+   mutex_lock(&wq->flush_mutex);
+   spin_lock_irq(&pwq_lock);
+
+   link_pwq(pwq, &last_pwq);
+
+   copy_workqueue_attrs(wq->unbound_attrs, pwq->pool->attrs);
+   for_each_node(node)
+   rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
+
+   spin_unlock_irq(&pwq_lock);
+   mutex_unlock(&wq->flush_mutex);
 
-   init_and_link_pwq(pwq, wq, pool, &last_pwq);
if (last_pwq) {
spin_lock_irq(&last_pwq->pool->lock);
put_pwq(last_pwq);
@@ -3755,7 +3776,15 @@ static int alloc_and_link_pwqs(struct workqueue_struct 
*wq)
struct worker_pool *cpu_pools =
per_cpu(cpu_worker_pools, cpu);
 
-   init_and_link_pwq(pwq, wq, &cpu_pools[highpri], NULL);
+   

[PATCH 01/10] workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]

2013-03-19 Thread Tejun Heo
Unbound workqueues are going to be NUMA-affine.  Add wq_numa_tbl_len
and wq_numa_possible_cpumask[] in preparation.  The former is the
highest NUMA node ID + 1 and the latter is masks of possible CPUs for
each NUMA node.

This patch only introduces these.  Future patches will make use of
them.

Signed-off-by: Tejun Heo 
---
 kernel/workqueue.c | 35 ++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 775c2f4..9b096e3 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "workqueue_internal.h"
 
@@ -256,6 +257,11 @@ struct workqueue_struct {
 
 static struct kmem_cache *pwq_cache;
 
+static int wq_numa_tbl_len;/* highest possible NUMA node id + 1 */
+static cpumask_var_t *wq_numa_possible_cpumask;
+   /* possible CPUs of each node, may
+  be NULL if init failed */
+
 static DEFINE_MUTEX(wq_mutex); /* protects workqueues and pools */
 static DEFINE_SPINLOCK(pwq_lock);  /* protects pool_workqueues */
 static DEFINE_SPINLOCK(wq_mayday_lock);/* protects wq->maydays list */
@@ -4416,7 +4422,7 @@ out_unlock:
 static int __init init_workqueues(void)
 {
int std_nice[NR_STD_WORKER_POOLS] = { 0, HIGHPRI_NICE_LEVEL };
-   int i, cpu;
+   int i, node, cpu;
 
/* make sure we have enough bits for OFFQ pool ID */
BUILD_BUG_ON((1LU << (BITS_PER_LONG - WORK_OFFQ_POOL_SHIFT)) <
@@ -4429,6 +4435,33 @@ static int __init init_workqueues(void)
cpu_notifier(workqueue_cpu_up_callback, CPU_PRI_WORKQUEUE_UP);
hotcpu_notifier(workqueue_cpu_down_callback, CPU_PRI_WORKQUEUE_DOWN);
 
+   /* determine NUMA pwq table len - highest node id + 1 */
+   for_each_node(node)
+   wq_numa_tbl_len = max(wq_numa_tbl_len, node + 1);
+
+   /*
+* We want masks of possible CPUs of each node which isn't readily
+* available.  Build one from cpu_to_node() which should have been
+* fully initialized by now.
+*/
+   wq_numa_possible_cpumask = kzalloc(wq_numa_tbl_len *
+  sizeof(wq_numa_possible_cpumask[0]),
+  GFP_KERNEL);
+   BUG_ON(!wq_numa_possible_cpumask);
+
+   for_each_node(node)
+   BUG_ON(!alloc_cpumask_var_node(&wq_numa_possible_cpumask[node],
+  GFP_KERNEL, node));
+   for_each_possible_cpu(cpu) {
+   node = cpu_to_node(cpu);
+   if (WARN_ON(node == NUMA_NO_NODE)) {
+   pr_err("workqueue: NUMA node mapping not available for 
cpu%d, disabling NUMA support\n", cpu);
+   wq_numa_possible_cpumask = NULL;
+   break;
+   }
+   cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]);
+   }
+
/* initialize CPU pools */
for_each_possible_cpu(cpu) {
struct worker_pool *pool;
-- 
1.8.1.4



[PATCHSET wq/for-3.10] workqueue: NUMA affinity for unbound workqueues

2013-03-19 Thread Tejun Heo
Hello,

There are two types of workqueues - per-cpu and unbound.  The former
is bound to each CPU and the latter isn't bound to any by default.
While the recently added attrs support allows unbound workqueues to be
confined to a subset of CPUs, it still is quite cumbersome for
applications where CPU affinity is too constricted but NUMA locality
still matters.

This patchset tries to solve that issue by automatically making
unbound workqueues affine to NUMA nodes by default.  A work item
queued to an unbound workqueue is executed on one of the CPUs allowed
by the workqueue in the same node.  If there's none allowed, it may be
executed on any cpu allowed by the workqueue.  It doesn't require any
changes on the user side.  Every interface of workqueues functions the
same as before.
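
The selection rule can be sketched as follows (illustrative only; the
helper name is made up and error handling is omitted):

    /* CPUs a queued work item may run on: the workqueue's cpumask
     * narrowed to the issuing node, falling back to the full cpumask
     * when the node has no allowed CPU */
    static void node_affine_cpus(const struct workqueue_attrs *attrs,
                                 int node, struct cpumask *out)
    {
            cpumask_and(out, attrs->cpumask, cpumask_of_node(node));
            if (cpumask_empty(out))
                    cpumask_copy(out, attrs->cpumask);
    }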

This would be most helpful to subsystems which use some form of async
execution to process a significant amount of data - e.g. crypto and
btrfs; however, I wanted to find out whether it would make any dent in
much less favorable use cases.  The following is the total run time in
seconds of building an allmodconfig kernel w/ -j20 on a dual socket
opteron machine with writeback thread pool converted to unbound
workqueue and thus made NUMA-affine.  The file system is ext4 on top
of a WD SSD.

        before conversion   after conversion
        1396.126            1394.763
        1397.621            1394.965
        1399.636            1394.738
        1397.463            1398.162
        1395.543            1393.670

AVG     1397.278            1395.260    DIFF    2.018
STDEV      1.585               1.700

And, yes, it actually made things go faster by about 1.2 sigma, which
isn't completely conclusive but is a pretty good indication that it's
actually faster.  Note that this is a workload which is dominated by
CPU time and while there's writeback going on continuously, it really
isn't touching too much data and isn't a dominating factor, so the gain is
understandably small, 0.14%, but hey it's still a gain and it should
be much more interesting for crypto and btrfs which would actually
access the data or workloads which are more sensitive to NUMA
affinity.

The implementation is fairly simple.  After the recent attrs support
changes, a lot of the differences in pwq (pool_workqueue) handling
between unbound and per-cpu workqueues are gone.  An unbound workqueue
still has one "current" pwq that it uses for queueing any new work
items but can handle multiple pwqs perfectly well while they're
draining, so this patchset adds a pwq dispatch table to unbound
workqueues which is indexed by NUMA node and points to the matching
pwq.  Unbound workqueues now simply have multiple "current" pwqs keyed
by NUMA node.
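
At queueing time this boils down to something like the following
(condensed from the __queue_work() change in patch 0007 of the series;
locking and rechecks omitted):

    /* per-cpu workqueues keep using their per-cpu pwq; unbound ones
     * index the new per-node dispatch table */
    if (!(wq->flags & WQ_UNBOUND))
            pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
    else
            pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));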

NUMA affinity can be turned off system-wide by workqueue.disable_numa
kernel param or per-workqueue using "numa" sysfs file.

This patchset contains the following ten patches.

 0001-workqueue-add-wq_numa_tbl_len-and-wq_numa_possible_c.patch
 0002-workqueue-drop-H-from-kworker-names-of-unbound-worke.patch
 0003-workqueue-determine-NUMA-node-of-workers-accourding-.patch
 0004-workqueue-add-workqueue-unbound_attrs.patch
 0005-workqueue-make-workqueue-name-fixed-len.patch
 0006-workqueue-move-hot-fields-of-workqueue_struct-to-the.patch
 0007-workqueue-map-an-unbound-workqueues-to-multiple-per-.patch
 0008-workqueue-break-init_and_link_pwq-into-two-functions.patch
 0009-workqueue-implement-NUMA-affinity-for-unbound-workqu.patch
 0010-workqueue-update-sysfs-interface-to-reflect-NUMA-awa.patch

0001 adds basic NUMA topology knowledge to workqueue.

0002-0006 are prep patches.

0007-0009 implement NUMA affinity.

0010 adds control knobs and updates sysfs interface.

This patchset is on top of

 wq/for-3.10 a0265a7f51 ("workqueue: define workqueue_freezing static variable 
iff CONFIG_FREEZER")

and also available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git review-numa

diffstat follows.

 Documentation/kernel-parameters.txt |9
 include/linux/workqueue.h   |5
 kernel/workqueue.c  |  393 
 3 files changed, 325 insertions(+), 82 deletions(-)

Thanks.

--
tejun



[PATCH UPDATED v3 6/6] workqueue: reimplement WQ_HIGHPRI using a separate worker_pool

2012-07-13 Thread Tejun Heo
>From a465fcee388d62d22e390b57c81ca8411f25a1da Mon Sep 17 00:00:00 2001
From: Tejun Heo 
Date: Fri, 13 Jul 2012 22:16:45 -0700

WQ_HIGHPRI was implemented by queueing highpri work items at the head
of the global worklist.  Other than queueing at the head, they weren't
handled differently; unfortunately, this could lead to execution
latency of a few seconds on heavily loaded systems.

Now that workqueue code has been updated to deal with multiple
worker_pools per global_cwq, this patch reimplements WQ_HIGHPRI using
a separate worker_pool.  NR_WORKER_POOLS is bumped to two and
gcwq->pools[0] is used for normal pri work items and ->pools[1] for
highpri.  Highpri workers get -20 nice level and have an 'H' suffix in
their names.  Note that this change increases the number of kworkers
per cpu.

POOL_HIGHPRI_PENDING, pool_determine_ins_pos() and highpri chain
wakeup code in process_one_work() are no longer used and removed.

This allows proper prioritization of highpri work items and removes
high execution latency of highpri work items.
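
In effect, WQ_HIGHPRI now simply selects which of the two pools a work
item is queued on, roughly (a sketch, not the exact helper used in the
patch):

    /* sketch: map a workqueue's priority to the gcwq pool index */
    static struct worker_pool *pool_for_wq(struct global_cwq *gcwq,
                                           struct workqueue_struct *wq)
    {
            return &gcwq->pools[(wq->flags & WQ_HIGHPRI) ? 1 : 0];
    }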

v2: nr_running indexing bug in get_pool_nr_running() fixed.

v3: Refreshed for the get_pool_nr_running() update in the previous
    patch.

Signed-off-by: Tejun Heo 
Reported-by: Josh Hunt 
LKML-Reference: 

Cc: Tony Luck 
Cc: Fengguang Wu 
---
 Documentation/workqueue.txt |  103 ---
 kernel/workqueue.c  |  100 +++--
 2 files changed, 65 insertions(+), 138 deletions(-)

diff --git a/Documentation/workqueue.txt b/Documentation/workqueue.txt
index a0b577d..a6ab4b6 100644
--- a/Documentation/workqueue.txt
+++ b/Documentation/workqueue.txt
@@ -89,25 +89,28 @@ called thread-pools.
 
 The cmwq design differentiates between the user-facing workqueues that
 subsystems and drivers queue work items on and the backend mechanism
-which manages thread-pool and processes the queued work items.
+which manages thread-pools and processes the queued work items.
 
 The backend is called gcwq.  There is one gcwq for each possible CPU
-and one gcwq to serve work items queued on unbound workqueues.
+and one gcwq to serve work items queued on unbound workqueues.  Each
+gcwq has two thread-pools - one for normal work items and the other
+for high priority ones.
 
 Subsystems and drivers can create and queue work items through special
 workqueue API functions as they see fit. They can influence some
 aspects of the way the work items are executed by setting flags on the
 workqueue they are putting the work item on. These flags include
-things like CPU locality, reentrancy, concurrency limits and more. To
-get a detailed overview refer to the API description of
+things like CPU locality, reentrancy, concurrency limits, priority and
+more.  To get a detailed overview refer to the API description of
 alloc_workqueue() below.
 
-When a work item is queued to a workqueue, the target gcwq is
-determined according to the queue parameters and workqueue attributes
-and appended on the shared worklist of the gcwq.  For example, unless
-specifically overridden, a work item of a bound workqueue will be
-queued on the worklist of exactly that gcwq that is associated to the
-CPU the issuer is running on.
+When a work item is queued to a workqueue, the target gcwq and
+thread-pool is determined according to the queue parameters and
+workqueue attributes and appended on the shared worklist of the
+thread-pool.  For example, unless specifically overridden, a work item
+of a bound workqueue will be queued on the worklist of either normal
+or highpri thread-pool of the gcwq that is associated to the CPU the
+issuer is running on.
 
 For any worker pool implementation, managing the concurrency level
 (how many execution contexts are active) is an important issue.  cmwq
@@ -115,26 +118,26 @@ tries to keep the concurrency at a minimal but sufficient 
level.
 Minimal to save resources and sufficient in that the system is used at
 its full capacity.
 
-Each gcwq bound to an actual CPU implements concurrency management by
-hooking into the scheduler.  The gcwq is notified whenever an active
-worker wakes up or sleeps and keeps track of the number of the
-currently runnable workers.  Generally, work items are not expected to
-hog a CPU and consume many cycles.  That means maintaining just enough
-concurrency to prevent work processing from stalling should be
-optimal.  As long as there are one or more runnable workers on the
-CPU, the gcwq doesn't start execution of a new work, but, when the
-last running worker goes to sleep, it immediately schedules a new
-worker so that the CPU doesn't sit idle while there are pending work
-items.  This allows using a minimal number of workers without losing
-execution bandwidth.
+Each thread-pool bound to an actual CPU implements concurrency
+management by hooking into the scheduler.  The thread-pool is notified
+whenever an active worker wakes up or sleeps and keeps track of the

[PATCH UPDATED 5/6] workqueue: introduce NR_WORKER_POOLS and for_each_worker_pool()

2012-07-13 Thread Tejun Heo
>From 4ce62e9e30cacc26885cab133ad1de358dd79f21 Mon Sep 17 00:00:00 2001
From: Tejun Heo 
Date: Fri, 13 Jul 2012 22:16:44 -0700

Introduce NR_WORKER_POOLS and for_each_worker_pool() and convert code
paths which need to manipulate all pools in a gcwq to use them.
NR_WORKER_POOLS is currently one and for_each_worker_pool() iterates
over only @gcwq->pool.

Note that nr_running is per-pool property and converted to an array
with NR_WORKER_POOLS elements and renamed to pool_nr_running.  Note
that get_pool_nr_running() currently assumes 0 index.  The next patch
will make use of non-zero index.

The changes in this patch are mechanical and don't cause any
functional difference.  This is to prepare for multiple pools per
gcwq.

v2: nr_running indexing bug in get_pool_nr_running() fixed.

v3: Pointer to array is stupid.  Don't use it in get_pool_nr_running()
as suggested by Linus.

Signed-off-by: Tejun Heo 
Cc: Tony Luck 
Cc: Fengguang Wu 
Cc: Linus Torvalds 
---
So, the same 0 index silliness but this shouldn't be as fugly.

Thanks.

 kernel/workqueue.c |  223 +++
 1 files changed, 153 insertions(+), 70 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 7a98bae..b0daaea 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -74,6 +74,8 @@ enum {
TRUSTEE_RELEASE = 3,/* release workers */
TRUSTEE_DONE= 4,/* trustee is done */
 
+   NR_WORKER_POOLS = 1,/* # worker pools per gcwq */
+
BUSY_WORKER_HASH_ORDER  = 6,/* 64 pointers */
BUSY_WORKER_HASH_SIZE   = 1 << BUSY_WORKER_HASH_ORDER,
BUSY_WORKER_HASH_MASK   = BUSY_WORKER_HASH_SIZE - 1,
@@ -274,6 +276,9 @@ EXPORT_SYMBOL_GPL(system_nrt_freezable_wq);
 #define CREATE_TRACE_POINTS
 #include 
 
+#define for_each_worker_pool(pool, gcwq)   \
+   for ((pool) = &(gcwq)->pool; (pool); (pool) = NULL)
+
 #define for_each_busy_worker(worker, i, pos, gcwq) \
for (i = 0; i < BUSY_WORKER_HASH_SIZE; i++) \
hlist_for_each_entry(worker, pos, &gcwq->busy_hash[i], hentry)
@@ -454,7 +459,7 @@ static bool workqueue_freezing; /* W: have wqs 
started freezing? */
  * try_to_wake_up().  Put it in a separate cacheline.
  */
 static DEFINE_PER_CPU(struct global_cwq, global_cwq);
-static DEFINE_PER_CPU_SHARED_ALIGNED(atomic_t, gcwq_nr_running);
+static DEFINE_PER_CPU_SHARED_ALIGNED(atomic_t, 
pool_nr_running[NR_WORKER_POOLS]);
 
 /*
  * Global cpu workqueue and nr_running counter for unbound gcwq.  The
@@ -462,7 +467,9 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(atomic_t, 
gcwq_nr_running);
  * workers have WORKER_UNBOUND set.
  */
 static struct global_cwq unbound_global_cwq;
-static atomic_t unbound_gcwq_nr_running = ATOMIC_INIT(0);  /* always 0 */
+static atomic_t unbound_pool_nr_running[NR_WORKER_POOLS] = {
+   [0 ... NR_WORKER_POOLS - 1] = ATOMIC_INIT(0),   /* always 0 */
+};
 
 static int worker_thread(void *__worker);
 
@@ -477,11 +484,12 @@ static struct global_cwq *get_gcwq(unsigned int cpu)
 static atomic_t *get_pool_nr_running(struct worker_pool *pool)
 {
int cpu = pool->gcwq->cpu;
+   int idx = 0;
 
if (cpu != WORK_CPU_UNBOUND)
-   return &per_cpu(gcwq_nr_running, cpu);
+   return &per_cpu(pool_nr_running, cpu)[idx];
else
-   return &unbound_gcwq_nr_running;
+   return &unbound_pool_nr_running[idx];
 }
 
 static struct cpu_workqueue_struct *get_cwq(unsigned int cpu,
@@ -3345,9 +3353,30 @@ EXPORT_SYMBOL_GPL(work_busy);
__ret1 < 0 ? -1 : 0;\
 })
 
+static bool gcwq_is_managing_workers(struct global_cwq *gcwq)
+{
+   struct worker_pool *pool;
+
+   for_each_worker_pool(pool, gcwq)
+   if (pool->flags & POOL_MANAGING_WORKERS)
+   return true;
+   return false;
+}
+
+static bool gcwq_has_idle_workers(struct global_cwq *gcwq)
+{
+   struct worker_pool *pool;
+
+   for_each_worker_pool(pool, gcwq)
+   if (!list_empty(&pool->idle_list))
+   return true;
+   return false;
+}
+
 static int __cpuinit trustee_thread(void *__gcwq)
 {
struct global_cwq *gcwq = __gcwq;
+   struct worker_pool *pool;
struct worker *worker;
struct work_struct *work;
struct hlist_node *pos;
@@ -3363,13 +3392,15 @@ static int __cpuinit trustee_thread(void *__gcwq)
 * cancelled.
 */
BUG_ON(gcwq->cpu != smp_processor_id());
-   rc = trustee_wait_event(!(gcwq->pool.flags & POOL_MANAGING_WORKERS));
+   rc = trustee_wait_event(!gcwq_is_managing_workers(gcwq));
BUG_ON(rc < 0);
 
-   gcwq->pool.flags |= POOL_MANAGING_WORKE

Re: [PATCH 5/6] workqueue: introduce NR_WORKER_POOLS and for_each_worker_pool()

2012-07-13 Thread Tejun Heo
Hey, Linus.

On Fri, Jul 13, 2012 at 10:00:10PM -0700, Linus Torvalds wrote:
> On Fri, Jul 13, 2012 at 9:44 PM, Tejun Heo  wrote:
> >
> > nr_running is atomic_t (*nr_running)[2].  Ignoring the pointer to
> > array part, it's just returning the address of N'th element of the
> > array.  ARRAY + N == &ARRAY[N].
> 
> None of this matters one whit.
> 
> You did "&(x)[0]".
> 
> That's insane. It's crazy. It doesn't even matter what "x" is in
> between, it's crazy regardless.

Eh, from my previous reply.

| Ah okay, you're looking at the fifth patch in isolation.  Up to this
| point, the index is always 0.  I'm putting it in as a placeholder for
| the next patch which makes use of non-zero index.  This patch is
| supposed to prepare everything for multiple pools and thus non-zero
| index.

The patch is about converting stuff to handle size-1 array without
introducing any actual behavior change so that the next patch can bump
the array size and just change the index.

Thanks.

-- 
tejun


Re: [PATCH 5/6] workqueue: introduce NR_WORKER_POOLS and for_each_worker_pool()

2012-07-13 Thread Tejun Heo
Hello, Linus.

On Fri, Jul 13, 2012 at 09:27:03PM -0700, Linus Torvalds wrote:
> Seeing code like this
> 
> +   return &(*nr_running)[0];
> 
> just makes me go "WTF?"

I was going WTF too.  This was the smallest fix and I wanted to make
it minimal because there's another stack of patches on top of it.
Planning to just fold nr_running into worker_pool afterwards which
will remove the whole function.

> Why are you taking the address of something you just dereferenced (the
> "& [0]" part).

nr_running is atomic_t (*nr_running)[2].  Ignoring the pointer to
array part, it's just returning the address of N'th element of the
array.  ARRAY + N == &ARRAY[N].

> And you actually do that *twice*, except the inner one is more
> complicated. When you assign nr_runing, you take the address of it, so
> the "*nr_running" is actually just the same kind of odd thing (except
> in reverse - you take dereference something you just took the
> address-of).
> 
> Seriously, this to me is a sign of *deeply* confused code. And the
> fact that your first version of that code was buggy *EXACTLY* due to
> this confusion should have made you take a step back.

Type-wise, I don't think it's confused.  Ah okay, you're looking at
the fifth patch in isolation.  Up to this point, the index is always 0.
I'm putting it in as a placeholder for the next patch which makes use
of non-zero index.  This patch is supposed to prepare everything for
multiple pools and thus non-zero index.

> As far as I can tell, what you actually want that function to do is:
> 
>   static atomic_t *get_pool_nr_running(struct worker_pool *pool)
>   {
> int cpu = pool->gcwq->cpu;
> 
> if (cpu != WORK_CPU_UNBOUND)
> return per_cpu(pool_nr_running, cpu);
> 
> return unbound_pool_nr_running;
>   }

More like the folloiwng in the end.

static atomic_t *get_pool_nr_running(struct worker_pool *pool)
{
int cpu = pool->gcwq->cpu;
int is_highpri = pool_is_highpri(pool);

if (cpu != WORK_CPU_UNBOUND)
return &per_cpu(pool_nr_running, cpu)[is_highpri];

return &unbound_pool_nr_running[is_highpri];
}

> I didn't test the code, btw. I just looked at the patch and went WTF.

Eh... yeah, with or without [2], this is WTF.  I'll just refresh it
with the above version.

Thanks.

-- 
tejun


[PATCH UPDATED 6/6] workqueue: reimplement WQ_HIGHPRI using a separate worker_pool

2012-07-13 Thread Tejun Heo
>From 12f804d130d966f2a094e8037e9f163215d13f23 Mon Sep 17 00:00:00 2001
From: Tejun Heo 
Date: Fri, 13 Jul 2012 20:50:50 -0700

WQ_HIGHPRI was implemented by queueing highpri work items at the head
of the global worklist.  Other than queueing at the head, they weren't
handled differently; unfortunately, this could lead to execution
latency of a few seconds on heavily loaded systems.

Now that workqueue code has been updated to deal with multiple
worker_pools per global_cwq, this patch reimplements WQ_HIGHPRI using
a separate worker_pool.  NR_WORKER_POOLS is bumped to two and
gcwq->pools[0] is used for normal pri work items and ->pools[1] for
highpri.  Highpri workers get -20 nice level and have an 'H' suffix in
their names.  Note that this change increases the number of kworkers
per cpu.

POOL_HIGHPRI_PENDING, pool_determine_ins_pos() and highpri chain
wakeup code in process_one_work() are no longer used and removed.

This allows proper prioritization of highpri work items and removes
high execution latency of highpri work items.

v2: nr_running indexing bug in get_pool_nr_running() fixed.

Signed-off-by: Tejun Heo 
Reported-by: Josh Hunt 
LKML-Reference: 

Cc: Tony Luck 
Cc: Fengguang Wu 
---
git branch updated accordingly.  Thanks.

 Documentation/workqueue.txt |  103 ---
 kernel/workqueue.c  |  100 +++--
 2 files changed, 65 insertions(+), 138 deletions(-)

diff --git a/Documentation/workqueue.txt b/Documentation/workqueue.txt
index a0b577d..a6ab4b6 100644
--- a/Documentation/workqueue.txt
+++ b/Documentation/workqueue.txt
@@ -89,25 +89,28 @@ called thread-pools.
 
 The cmwq design differentiates between the user-facing workqueues that
 subsystems and drivers queue work items on and the backend mechanism
-which manages thread-pool and processes the queued work items.
+which manages thread-pools and processes the queued work items.
 
 The backend is called gcwq.  There is one gcwq for each possible CPU
-and one gcwq to serve work items queued on unbound workqueues.
+and one gcwq to serve work items queued on unbound workqueues.  Each
+gcwq has two thread-pools - one for normal work items and the other
+for high priority ones.
 
 Subsystems and drivers can create and queue work items through special
 workqueue API functions as they see fit. They can influence some
 aspects of the way the work items are executed by setting flags on the
 workqueue they are putting the work item on. These flags include
-things like CPU locality, reentrancy, concurrency limits and more. To
-get a detailed overview refer to the API description of
+things like CPU locality, reentrancy, concurrency limits, priority and
+more.  To get a detailed overview refer to the API description of
 alloc_workqueue() below.
 
-When a work item is queued to a workqueue, the target gcwq is
-determined according to the queue parameters and workqueue attributes
-and appended on the shared worklist of the gcwq.  For example, unless
-specifically overridden, a work item of a bound workqueue will be
-queued on the worklist of exactly that gcwq that is associated to the
-CPU the issuer is running on.
+When a work item is queued to a workqueue, the target gcwq and
+thread-pool is determined according to the queue parameters and
+workqueue attributes and appended on the shared worklist of the
+thread-pool.  For example, unless specifically overridden, a work item
+of a bound workqueue will be queued on the worklist of either normal
+or highpri thread-pool of the gcwq that is associated to the CPU the
+issuer is running on.
 
 For any worker pool implementation, managing the concurrency level
 (how many execution contexts are active) is an important issue.  cmwq
@@ -115,26 +118,26 @@ tries to keep the concurrency at a minimal but sufficient 
level.
 Minimal to save resources and sufficient in that the system is used at
 its full capacity.
 
-Each gcwq bound to an actual CPU implements concurrency management by
-hooking into the scheduler.  The gcwq is notified whenever an active
-worker wakes up or sleeps and keeps track of the number of the
-currently runnable workers.  Generally, work items are not expected to
-hog a CPU and consume many cycles.  That means maintaining just enough
-concurrency to prevent work processing from stalling should be
-optimal.  As long as there are one or more runnable workers on the
-CPU, the gcwq doesn't start execution of a new work, but, when the
-last running worker goes to sleep, it immediately schedules a new
-worker so that the CPU doesn't sit idle while there are pending work
-items.  This allows using a minimal number of workers without losing
-execution bandwidth.
+Each thread-pool bound to an actual CPU implements concurrency
+management by hooking into the scheduler.  The thread-pool is notified
+whenever an active worker wakes up or sleeps and keeps track of the
+number of the currently runnable w

Re: [PATCH 5/6] workqueue: introduce NR_WORKER_POOLS and for_each_worker_pool()

2012-07-13 Thread Tejun Heo
>From 8a0597bf9939d50039d4a6f446db51cf920daaad Mon Sep 17 00:00:00 2001
From: Tejun Heo 
Date: Fri, 13 Jul 2012 20:50:50 -0700

Introduce NR_WORKER_POOLS and for_each_worker_pool() and convert code
paths which need to manipulate all pools in a gcwq to use them.
NR_WORKER_POOLS is currently one and for_each_worker_pool() iterates
over only @gcwq->pool.

Note that nr_running is per-pool property and converted to an array
with NR_WORKER_POOLS elements and renamed to pool_nr_running.

The changes in this patch are mechanical and don't cause any
functional difference.  This is to prepare for multiple pools per
gcwq.

v2: nr_running indexing bug in get_pool_nr_running() fixed.

Signed-off-by: Tejun Heo 
Cc: Tony Luck 
Cc: Fengguang Wu 
---
git branch updated accordingly.  Thanks!

 kernel/workqueue.c |  225 
 1 files changed, 155 insertions(+), 70 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 7a98bae..82eee34 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -74,6 +74,8 @@ enum {
TRUSTEE_RELEASE = 3,/* release workers */
TRUSTEE_DONE= 4,/* trustee is done */
 
+   NR_WORKER_POOLS = 1,/* # worker pools per gcwq */
+
BUSY_WORKER_HASH_ORDER  = 6,/* 64 pointers */
BUSY_WORKER_HASH_SIZE   = 1 << BUSY_WORKER_HASH_ORDER,
BUSY_WORKER_HASH_MASK   = BUSY_WORKER_HASH_SIZE - 1,
@@ -274,6 +276,9 @@ EXPORT_SYMBOL_GPL(system_nrt_freezable_wq);
 #define CREATE_TRACE_POINTS
 #include 
 
+#define for_each_worker_pool(pool, gcwq)   \
+   for ((pool) = &(gcwq)->pool; (pool); (pool) = NULL)
+
 #define for_each_busy_worker(worker, i, pos, gcwq) \
for (i = 0; i < BUSY_WORKER_HASH_SIZE; i++) \
hlist_for_each_entry(worker, pos, &gcwq->busy_hash[i], hentry)
@@ -454,7 +459,7 @@ static bool workqueue_freezing; /* W: have wqs 
started freezing? */
  * try_to_wake_up().  Put it in a separate cacheline.
  */
 static DEFINE_PER_CPU(struct global_cwq, global_cwq);
-static DEFINE_PER_CPU_SHARED_ALIGNED(atomic_t, gcwq_nr_running);
+static DEFINE_PER_CPU_SHARED_ALIGNED(atomic_t, 
pool_nr_running[NR_WORKER_POOLS]);
 
 /*
  * Global cpu workqueue and nr_running counter for unbound gcwq.  The
@@ -462,7 +467,9 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(atomic_t, 
gcwq_nr_running);
  * workers have WORKER_UNBOUND set.
  */
 static struct global_cwq unbound_global_cwq;
-static atomic_t unbound_gcwq_nr_running = ATOMIC_INIT(0);  /* always 0 */
+static atomic_t unbound_pool_nr_running[NR_WORKER_POOLS] = {
+   [0 ... NR_WORKER_POOLS - 1] = ATOMIC_INIT(0),   /* always 0 */
+};
 
 static int worker_thread(void *__worker);
 
@@ -477,11 +484,14 @@ static struct global_cwq *get_gcwq(unsigned int cpu)
 static atomic_t *get_pool_nr_running(struct worker_pool *pool)
 {
int cpu = pool->gcwq->cpu;
+   atomic_t (*nr_running)[NR_WORKER_POOLS];
 
if (cpu != WORK_CPU_UNBOUND)
-   return &per_cpu(gcwq_nr_running, cpu);
+   nr_running = &per_cpu(pool_nr_running, cpu);
else
-   return &unbound_gcwq_nr_running;
+   nr_running = &unbound_pool_nr_running;
+
+   return &(*nr_running)[0];
 }
 
 static struct cpu_workqueue_struct *get_cwq(unsigned int cpu,
@@ -3345,9 +3355,30 @@ EXPORT_SYMBOL_GPL(work_busy);
__ret1 < 0 ? -1 : 0;\
 })
 
+static bool gcwq_is_managing_workers(struct global_cwq *gcwq)
+{
+   struct worker_pool *pool;
+
+   for_each_worker_pool(pool, gcwq)
+   if (pool->flags & POOL_MANAGING_WORKERS)
+   return true;
+   return false;
+}
+
+static bool gcwq_has_idle_workers(struct global_cwq *gcwq)
+{
+   struct worker_pool *pool;
+
+   for_each_worker_pool(pool, gcwq)
+   if (!list_empty(&pool->idle_list))
+   return true;
+   return false;
+}
+
 static int __cpuinit trustee_thread(void *__gcwq)
 {
struct global_cwq *gcwq = __gcwq;
+   struct worker_pool *pool;
struct worker *worker;
struct work_struct *work;
struct hlist_node *pos;
@@ -3363,13 +3394,15 @@ static int __cpuinit trustee_thread(void *__gcwq)
 * cancelled.
 */
BUG_ON(gcwq->cpu != smp_processor_id());
-   rc = trustee_wait_event(!(gcwq->pool.flags & POOL_MANAGING_WORKERS));
+   rc = trustee_wait_event(!gcwq_is_managing_workers(gcwq));
BUG_ON(rc < 0);
 
-   gcwq->pool.flags |= POOL_MANAGING_WORKERS;
+   for_each_worker_pool(pool, gcwq) {
+   pool->flags |= POOL_MANAGING_WORKERS;
 
-   list_for_each_entry(worker, &gcwq->pool.idle_list, entry)
-  

Re: [PATCH 6/6] workqueue: reimplement WQ_HIGHPRI using a separate worker_pool

2012-07-13 Thread Tejun Heo
Hello,

On Fri, Jul 13, 2012 at 10:08:00AM +0800, Fengguang Wu wrote:
> [0.165669] Performance Events: unsupported Netburst CPU model 6 no PMU 
> driver, software events only.
> [0.167001] XXX cpu=0 gcwq=88000dc0cfc0 base=88000dc11e80
> [0.167989] XXX cpu=0 nr_running=0 @ 88000dc11e80
> [0.168988] XXX cpu=0 nr_running=0 @ 88000dc11e88
> [0.169988] XXX cpu=1 gcwq=88000dd0cfc0 base=88000dd11e80
> [0.170988] XXX cpu=1 nr_running=0 @ 88000dd11e80
> [0.171987] XXX cpu=1 nr_running=0 @ 88000dd11e88
> [0.172988] XXX cpu=8 nr_running=0 @ 81d7c430
> [0.173987] XXX cpu=8 nr_running=12 @ 81d7c438

Heh, I found it.  get_pool_nr_running() stores the nr_running array to
use in a local pointer-to-array and then returns a pointer to the
specific element from there depending on the priority.

atomic_t (*nr_running)[NR_WORKER_POOLS];

/* set @nr_running to the array to use */
return nr_running[worker_pool_pri(pool)];

The [] operator in the return statement is indexing whole arrays
instead of the array elements, so if the index is 1, the above
statement offsets nr_running by sizeof(atomic_t [NR_WORKER_POOLS])
instead of sizeof(atomic_t).  This should have been
&(*nr_running)[worker_pool_pri(pool)] instead.

So, highpri ends up dereferencing out-of-bounds and depending on
variable layout, it may see a garbage value from the beginning (what you
were seeing) or get interfered with afterwards (what Tony was seeing).
This also explains why I didn't see it and Tony can no longer
reproduce it after debug patch.
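
The underlying C pitfall is independent of the workqueue code and can be
shown with a tiny standalone program (illustration only):

    #include <stdio.h>

    int main(void)
    {
            int arr[2] = { 10, 20 };
            int (*p)[2] = &arr;     /* pointer to the whole array */

            /* &(*p)[1] steps by sizeof(int): the second element */
            printf("%p\n", (void *)&(*p)[1]);
            /* p[1] steps by sizeof(int [2]): one whole array past arr */
            printf("%p\n", (void *)p[1]);
            return 0;
    }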

Will post updated patches.

Thank you.

-- 
tejun


Re: [PATCH 6/6] workqueue: reimplement WQ_HIGHPRI using a separate worker_pool

2012-07-12 Thread Tejun Heo
Hello, Tony.

On Thu, Jul 12, 2012 at 04:24:47PM -0700, Tony Luck wrote:
> On Thu, Jul 12, 2012 at 3:32 PM, Tejun Heo  wrote:
> > Can you please try the following debug patch instead?  Yours is
> > different from Fengguang's.
> 
> New dmesg from mext-20120712 + this new patch (instead of previous one)
> 
> [Note - I see some XXX traces, but no WARN_ON stack dump this time]

The debug patch didn't do anything for the bug itself.  I suppose it's
timing dependent and doesn't always happen (it never reproduces here
for some reason).  Can you please repeat several times and see whether
the warning can be triggered?

Thank you very much!

-- 
tejun


Re: [PATCH 6/6] workqueue: reimplement WQ_HIGHPRI using a separate worker_pool

2012-07-12 Thread Tejun Heo
Hello, Tony.

On Thu, Jul 12, 2012 at 03:16:30PM -0700, Tony Luck wrote:
> On Thu, Jul 12, 2012 at 2:45 PM, Tejun Heo  wrote:
> > I was wrong and am now dazed and confused.  That's from
> > init_workqueues() where only cpu0 is running.  How the hell did
> > nr_running manage to become non-zero at that point?  Can you please
> > apply the following patch and report the boot log?  Thank you.
> 
> Patch applied on top of next-20120712 (which still has the same problem).

Can you please try the following debug patch instead?  Yours is
different from Fengguang's.

Thanks a lot!
---
 kernel/workqueue.c |   40 
 1 file changed, 36 insertions(+), 4 deletions(-)

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -699,8 +699,10 @@ void wq_worker_waking_up(struct task_str
 {
struct worker *worker = kthread_data(task);
 
-   if (!(worker->flags & WORKER_NOT_RUNNING))
+   if (!(worker->flags & WORKER_NOT_RUNNING)) {
+   WARN_ON_ONCE(cpu != worker->pool->gcwq->cpu);
atomic_inc(get_pool_nr_running(worker->pool));
+   }
 }
 
 /**
@@ -730,6 +732,7 @@ struct task_struct *wq_worker_sleeping(s
 
/* this can only happen on the local cpu */
BUG_ON(cpu != raw_smp_processor_id());
+   WARN_ON_ONCE(cpu != worker->pool->gcwq->cpu);
 
/*
 * The counterpart of the following dec_and_test, implied mb,
@@ -1212,9 +1215,30 @@ static void worker_enter_idle(struct wor
 * between setting %WORKER_ROGUE and zapping nr_running, the
 * warning may trigger spuriously.  Check iff trustee is idle.
 */
-   WARN_ON_ONCE(gcwq->trustee_state == TRUSTEE_DONE &&
-pool->nr_workers == pool->nr_idle &&
-atomic_read(get_pool_nr_running(pool)));
+   if (WARN_ON_ONCE(gcwq->trustee_state == TRUSTEE_DONE &&
+pool->nr_workers == pool->nr_idle &&
+atomic_read(get_pool_nr_running(pool {
+   static bool once = false;
+   int cpu;
+
+   if (once)
+   return;
+   once = true;
+
+   printk("XXX nr_running mismatch on gcwq[%d] pool[%ld]\n",
+  gcwq->cpu, pool - gcwq->pools);
+
+   for_each_gcwq_cpu(cpu) {
+   gcwq = get_gcwq(cpu);
+
+   printk("XXX gcwq[%d] flags=0x%x\n", gcwq->cpu, 
gcwq->flags);
+   for_each_worker_pool(pool, gcwq)
+   printk("XXX gcwq[%d] pool[%ld] nr_workers=%d 
nr_idle=%d nr_running=%d\n",
+  gcwq->cpu, pool - gcwq->pools,
+  pool->nr_workers, pool->nr_idle,
+  atomic_read(get_pool_nr_running(pool)));
+   }
+   }
 }
 
 /**
@@ -3855,6 +3879,10 @@ static int __init init_workqueues(void)
for (i = 0; i < BUSY_WORKER_HASH_SIZE; i++)
INIT_HLIST_HEAD(&gcwq->busy_hash[i]);
 
+   if (cpu != WORK_CPU_UNBOUND)
+   printk("XXX cpu=%d gcwq=%p base=%p\n", cpu, gcwq,
+  per_cpu_ptr(&pool_nr_running, cpu));
+
for_each_worker_pool(pool, gcwq) {
pool->gcwq = gcwq;
INIT_LIST_HEAD(&pool->worklist);
@@ -3868,6 +3896,10 @@ static int __init init_workqueues(void)
(unsigned long)pool);
 
ida_init(&pool->worker_ida);
+
+   printk("XXX cpu=%d nr_running=%d @ %p\n", gcwq->cpu,
+  atomic_read(get_pool_nr_running(pool)),
+  get_pool_nr_running(pool));
}
 
gcwq->trustee_state = TRUSTEE_DONE;


[PATCH UPDATED 2/6] workqueue: factor out worker_pool from global_cwq

2012-07-12 Thread Tejun Heo
>From bd7bdd43dcb81bb08240b9401b36a104f77dc135 Mon Sep 17 00:00:00 2001
From: Tejun Heo 
Date: Thu, 12 Jul 2012 14:46:37 -0700

Move worklist and all worker management fields from global_cwq into
the new struct worker_pool.  worker_pool points back to the containing
gcwq.  worker and cpu_workqueue_struct are updated to point to
worker_pool instead of gcwq too.

This change is mechanical and doesn't introduce any functional
difference other than rearranging of fields and an added level of
indirection in some places.  This is to prepare for multiple pools per
gcwq.

v2: Comment typo fixes as suggested by Namhyung.

Signed-off-by: Tejun Heo 
Cc: Namhyung Kim 
---
Minor update.  git branches updated accordingly.  Thanks.

 include/trace/events/workqueue.h |2 +-
 kernel/workqueue.c   |  216 -
 2 files changed, 118 insertions(+), 100 deletions(-)

diff --git a/include/trace/events/workqueue.h b/include/trace/events/workqueue.h
index 4018f50..f28d1b6 100644
--- a/include/trace/events/workqueue.h
+++ b/include/trace/events/workqueue.h
@@ -54,7 +54,7 @@ TRACE_EVENT(workqueue_queue_work,
__entry->function   = work->func;
__entry->workqueue  = cwq->wq;
__entry->req_cpu= req_cpu;
-   __entry->cpu= cwq->gcwq->cpu;
+   __entry->cpu= cwq->pool->gcwq->cpu;
),
 
TP_printk("work struct=%p function=%pf workqueue=%p req_cpu=%u cpu=%u",
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 27637c2..61f1544 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -115,6 +115,7 @@ enum {
  */
 
 struct global_cwq;
+struct worker_pool;
 
 /*
  * The poor guys doing the actual heavy lifting.  All on-duty workers
@@ -131,7 +132,7 @@ struct worker {
struct cpu_workqueue_struct *current_cwq; /* L: current_work's cwq */
struct list_headscheduled;  /* L: scheduled works */
struct task_struct  *task;  /* I: worker task */
-   struct global_cwq   *gcwq;  /* I: the associated gcwq */
+   struct worker_pool  *pool;  /* I: the associated pool */
/* 64 bytes boundary on 64bit, 32 on 32bit */
unsigned long   last_active;/* L: last active timestamp */
unsigned intflags;  /* X: flags */
@@ -139,6 +140,21 @@ struct worker {
struct work_struct  rebind_work;/* L: rebind worker to cpu */
 };
 
+struct worker_pool {
+   struct global_cwq   *gcwq;  /* I: the owning gcwq */
+
+   struct list_headworklist;   /* L: list of pending works */
+   int nr_workers; /* L: total number of workers */
+   int nr_idle;/* L: currently idle ones */
+
+   struct list_headidle_list;  /* X: list of idle workers */
+   struct timer_list   idle_timer; /* L: worker idle timeout */
+   struct timer_list   mayday_timer;   /* L: SOS timer for workers */
+
+   struct ida  worker_ida; /* L: for worker IDs */
+   struct worker   *first_idle;/* L: first idle worker */
+};
+
 /*
  * Global per-cpu workqueue.  There's one and only one for each cpu
  * and all works are queued and processed here regardless of their
@@ -146,27 +162,18 @@ struct worker {
  */
 struct global_cwq {
spinlock_t  lock;   /* the gcwq lock */
-   struct list_headworklist;   /* L: list of pending works */
unsigned intcpu;/* I: the associated cpu */
unsigned intflags;  /* L: GCWQ_* flags */
 
-   int nr_workers; /* L: total number of workers */
-   int nr_idle;/* L: currently idle ones */
-
-   /* workers are chained either in the idle_list or busy_hash */
-   struct list_headidle_list;  /* X: list of idle workers */
+   /* workers are chained either in busy_hash or pool idle_list */
struct hlist_head   busy_hash[BUSY_WORKER_HASH_SIZE];
/* L: hash of busy workers */
 
-   struct timer_list   idle_timer; /* L: worker idle timeout */
-   struct timer_list   mayday_timer;   /* L: SOS timer for dworkers */
-
-   struct ida  worker_ida; /* L: for worker IDs */
+   struct worker_pool  pool;   /* the worker pools */
 
struct task_struct  *trustee;   /* L: for gcwq shutdown */
unsigned inttrustee_state;  /* L: trustee state */
wait_queue_head_t   trustee_wait;   /* trustee wait */
-   struct worker   *first_idle;/* L: first idle worker */
 } cacheline_aligned_in_smp;
 
 /*
@@ -175,7 +182,7 @@

Re: [PATCH 6/6] workqueue: reimplement WQ_HIGHPRI using a separate worker_pool

2012-07-12 Thread Tejun Heo
Hello, again.

On Thu, Jul 12, 2012 at 10:05:19AM -0700, Tejun Heo wrote:
> On Thu, Jul 12, 2012 at 09:06:48PM +0800, Fengguang Wu wrote:
> > [0.207977] WARNING: at /c/kernel-tests/mm/kernel/workqueue.c:1217 
> > worker_enter_idle+0x2b8/0x32b()
> > [0.207977] Modules linked in:
> > [0.207977] Pid: 1, comm: swapper/0 Not tainted 3.5.0-rc6-08414-g9645fff 
> > #15
> > [0.207977] Call Trace:
> > [0.207977]  [] ? worker_enter_idle+0x2b8/0x32b
> > [0.207977]  [] warn_slowpath_common+0xae/0xdb
> > [0.207977]  [] warn_slowpath_null+0x28/0x31
> > [0.207977]  [] worker_enter_idle+0x2b8/0x32b
> > [0.207977]  [] start_worker+0x26/0x42
> > [0.207977]  [] init_workqueues+0x2d2/0x59a
> > [0.207977]  [] ? usermodehelper_init+0x8a/0x8a
> > [0.207977]  [] do_one_initcall+0xce/0x272
> > [0.207977]  [] kernel_init+0x12e/0x3c1
> > [0.207977]  [] kernel_thread_helper+0x4/0x10
> > [0.207977]  [] ? retint_restore_args+0x13/0x13
> > [0.207977]  [] ? start_kernel+0x737/0x737
> > [0.207977]  [] ? gs_change+0x13/0x13
> 
> Yeah, I forgot to flip the WARN_ON_ONCE() condition so that it checks
> nr_running before looking at pool->nr_running.  The warning is
> spurious.  Will post fix soon.

I was wrong and am now dazed and confused.  That's from
init_workqueues() where only cpu0 is running.  How the hell did
nr_running manage to become non-zero at that point?  Can you please
apply the following patch and report the boot log?  Thank you.

---
 kernel/workqueue.c |   13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -699,8 +699,10 @@ void wq_worker_waking_up(struct task_str
 {
struct worker *worker = kthread_data(task);
 
-   if (!(worker->flags & WORKER_NOT_RUNNING))
+   if (!(worker->flags & WORKER_NOT_RUNNING)) {
+   WARN_ON_ONCE(cpu != worker->pool->gcwq->cpu);
atomic_inc(get_pool_nr_running(worker->pool));
+   }
 }
 
 /**
@@ -730,6 +732,7 @@ struct task_struct *wq_worker_sleeping(s
 
/* this can only happen on the local cpu */
BUG_ON(cpu != raw_smp_processor_id());
+   WARN_ON_ONCE(cpu != worker->pool->gcwq->cpu);
 
/*
 * The counterpart of the following dec_and_test, implied mb,
@@ -3855,6 +3858,10 @@ static int __init init_workqueues(void)
for (i = 0; i < BUSY_WORKER_HASH_SIZE; i++)
INIT_HLIST_HEAD(&gcwq->busy_hash[i]);
 
+   if (cpu != WORK_CPU_UNBOUND)
+   printk("XXX cpu=%d gcwq=%p base=%p\n", cpu, gcwq,
+  per_cpu_ptr(&pool_nr_running, cpu));
+
for_each_worker_pool(pool, gcwq) {
pool->gcwq = gcwq;
INIT_LIST_HEAD(&pool->worklist);
@@ -3868,6 +3875,10 @@ static int __init init_workqueues(void)
(unsigned long)pool);
 
ida_init(&pool->worker_ida);
+
+   printk("XXX cpu=%d nr_running=%d @ %p\n", gcwq->cpu,
+  atomic_read(get_pool_nr_running(pool)),
+  get_pool_nr_running(pool));
}
 
gcwq->trustee_state = TRUSTEE_DONE;


Re: [PATCH 2/6] workqueue: factor out worker_pool from global_cwq

2012-07-12 Thread Tejun Heo
Hello, Namhyung.

Sorry about the delay.

On Tue, Jul 10, 2012 at 01:48:44PM +0900, Namhyung Kim wrote:
> > +   struct list_headidle_list;  /* X: list of idle workers */
> > +   struct timer_list   idle_timer; /* L: worker idle timeout */
> > +   struct timer_list   mayday_timer;   /* L: SOS timer for dworkers */
> 
> What is 'dworkers'?

My stupid finger pressing 'd' when I never meant to. :)

> > -   /* workers are chained either in the idle_list or busy_hash */
> > -   struct list_headidle_list;  /* X: list of idle workers */
> > +   /* workers are chained either in busy_head or pool idle_list */
> 
> s/busy_head/busy_hash/ ?

Will fix.

Thanks.

-- 
tejun


Re: [PATCH 3/6] workqueue: use @pool instead of @gcwq or @cpu where applicable

2012-07-12 Thread Tejun Heo
Hello, Tony.

On Tue, Jul 10, 2012 at 04:30:36PM -0700, Tony Luck wrote:
> On Mon, Jul 9, 2012 at 11:41 AM, Tejun Heo  wrote:
> > @@ -1234,7 +1235,7 @@ static void worker_enter_idle(struct worker *worker)
> >  */
> > WARN_ON_ONCE(gcwq->trustee_state == TRUSTEE_DONE &&
> >  pool->nr_workers == pool->nr_idle &&
> > -atomic_read(get_gcwq_nr_running(gcwq->cpu)));
> > +atomic_read(get_pool_nr_running(pool)));
> >  }
> 
> Just had this WARN_ON_ONCE trigger on ia64 booting next-20120710. I
> haven't bisected ... just noticed  that two patches in this series tinker
> with lines in this check. next-20120706 didn't generate the WARN.

Sorry about the delay.  The warning is spurious.  As now there are
multiple pools, nr_running check should be done before
pool->nr_workers check.  Will post fix soon.

Thank you.

-- 
tejun


Re: [PATCH 6/6] workqueue: reimplement WQ_HIGHPRI using a separate worker_pool

2012-07-12 Thread Tejun Heo
Hello, Fengguang.

On Thu, Jul 12, 2012 at 09:06:48PM +0800, Fengguang Wu wrote:
> [0.207977] WARNING: at /c/kernel-tests/mm/kernel/workqueue.c:1217 
> worker_enter_idle+0x2b8/0x32b()
> [0.207977] Modules linked in:
> [0.207977] Pid: 1, comm: swapper/0 Not tainted 3.5.0-rc6-08414-g9645fff 
> #15
> [0.207977] Call Trace:
> [0.207977]  [] ? worker_enter_idle+0x2b8/0x32b
> [0.207977]  [] warn_slowpath_common+0xae/0xdb
> [0.207977]  [] warn_slowpath_null+0x28/0x31
> [0.207977]  [] worker_enter_idle+0x2b8/0x32b
> [0.207977]  [] start_worker+0x26/0x42
> [0.207977]  [] init_workqueues+0x2d2/0x59a
> [0.207977]  [] ? usermodehelper_init+0x8a/0x8a
> [0.207977]  [] do_one_initcall+0xce/0x272
> [0.207977]  [] kernel_init+0x12e/0x3c1
> [0.207977]  [] kernel_thread_helper+0x4/0x10
> [0.207977]  [] ? retint_restore_args+0x13/0x13
> [0.207977]  [] ? start_kernel+0x737/0x737
> [0.207977]  [] ? gs_change+0x13/0x13

Yeah, I forgot to flip the WARN_ON_ONCE() condition so that it checks
nr_running before looking at pool->nr_running.  The warning is
spurious.  Will post fix soon.

Thanks.

-- 
tejun


[PATCH 4/6] workqueue: separate out worker_pool flags

2012-07-09 Thread Tejun Heo
GCWQ_MANAGE_WORKERS, GCWQ_MANAGING_WORKERS and GCWQ_HIGHPRI_PENDING
are per-pool properties.  Add worker_pool->flags and make the above
three flags per-pool flags.

The changes in this patch are mechanical and don't cause any
functional difference.  This is to prepare for multiple pools per
gcwq.

Signed-off-by: Tejun Heo 
---
 kernel/workqueue.c |   47 +--
 1 files changed, 25 insertions(+), 22 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9f82c25..e700dcc 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -46,11 +46,13 @@
 
 enum {
/* global_cwq flags */
-   GCWQ_MANAGE_WORKERS = 1 << 0,   /* need to manage workers */
-   GCWQ_MANAGING_WORKERS   = 1 << 1,   /* managing workers */
-   GCWQ_DISASSOCIATED  = 1 << 2,   /* cpu can't serve workers */
-   GCWQ_FREEZING   = 1 << 3,   /* freeze in progress */
-   GCWQ_HIGHPRI_PENDING= 1 << 4,   /* highpri works on queue */
+   GCWQ_DISASSOCIATED  = 1 << 0,   /* cpu can't serve workers */
+   GCWQ_FREEZING   = 1 << 1,   /* freeze in progress */
+
+   /* pool flags */
+   POOL_MANAGE_WORKERS = 1 << 0,   /* need to manage workers */
+   POOL_MANAGING_WORKERS   = 1 << 1,   /* managing workers */
+   POOL_HIGHPRI_PENDING= 1 << 2,   /* highpri works on queue */
 
/* worker flags */
WORKER_STARTED  = 1 << 0,   /* started */
@@ -142,6 +144,7 @@ struct worker {
 
 struct worker_pool {
struct global_cwq   *gcwq;  /* I: the owning gcwq */
+   unsigned intflags;  /* X: flags */
 
struct list_headworklist;   /* L: list of pending works */
int nr_workers; /* L: total number of workers */
@@ -583,7 +586,7 @@ static struct global_cwq *get_work_gcwq(struct work_struct 
*work)
 static bool __need_more_worker(struct worker_pool *pool)
 {
return !atomic_read(get_pool_nr_running(pool)) ||
-   pool->gcwq->flags & GCWQ_HIGHPRI_PENDING;
+   (pool->flags & POOL_HIGHPRI_PENDING);
 }
 
 /*
@@ -612,7 +615,7 @@ static bool keep_working(struct worker_pool *pool)
 
return !list_empty(&pool->worklist) &&
(atomic_read(nr_running) <= 1 ||
-pool->gcwq->flags & GCWQ_HIGHPRI_PENDING);
+(pool->flags & POOL_HIGHPRI_PENDING));
 }
 
 /* Do we need a new worker?  Called from manager. */
@@ -625,13 +628,13 @@ static bool need_to_create_worker(struct worker_pool 
*pool)
 static bool need_to_manage_workers(struct worker_pool *pool)
 {
return need_to_create_worker(pool) ||
-   pool->gcwq->flags & GCWQ_MANAGE_WORKERS;
+   (pool->flags & POOL_MANAGE_WORKERS);
 }
 
 /* Do we have too many workers and should some go away? */
 static bool too_many_workers(struct worker_pool *pool)
 {
-   bool managing = pool->gcwq->flags & GCWQ_MANAGING_WORKERS;
+   bool managing = pool->flags & POOL_MANAGING_WORKERS;
int nr_idle = pool->nr_idle + managing; /* manager is considered idle */
int nr_busy = pool->nr_workers - nr_idle;
 
@@ -889,7 +892,7 @@ static struct worker *find_worker_executing_work(struct 
global_cwq *gcwq,
  * position for the work.  If @cwq is for HIGHPRI wq, the work is
  * queued at the head of the queue but in FIFO order with respect to
  * other HIGHPRI works; otherwise, at the end of the queue.  This
- * function also sets GCWQ_HIGHPRI_PENDING flag to hint @pool that
+ * function also sets POOL_HIGHPRI_PENDING flag to hint @pool that
  * there are HIGHPRI works pending.
  *
  * CONTEXT:
@@ -913,7 +916,7 @@ static inline struct list_head 
*pool_determine_ins_pos(struct worker_pool *pool,
break;
}
 
-   pool->gcwq->flags |= GCWQ_HIGHPRI_PENDING;
+   pool->flags |= POOL_HIGHPRI_PENDING;
return &twork->entry;
 }
 
@@ -1500,7 +1503,7 @@ static void idle_worker_timeout(unsigned long __pool)
mod_timer(&pool->idle_timer, expires);
else {
/* it's been idle for too long, wake up manager */
-   gcwq->flags |= GCWQ_MANAGE_WORKERS;
+   pool->flags |= POOL_MANAGE_WORKERS;
wake_up_worker(pool);
}
}
@@ -1680,11 +1683,11 @@ static bool manage_workers(struct worker *worker)
struct global_cwq *gcwq = pool->gcwq;
bool ret = false;
 
-   if (gcwq->flags & GCWQ_MANAGING_WORKERS)
+   if (pool->flags & POOL_MANAGING_WORKERS)
return ret;
 
-   gcwq->flags &= ~GCWQ_MANAGE_WORKERS;
-

[PATCH 3/6] workqueue: use @pool instead of @gcwq or @cpu where applicable

2012-07-09 Thread Tejun Heo
Modify all functions which deal with per-pool properties to pass
around @pool instead of @gcwq or @cpu.

The changes in this patch are mechanical and don't cause any
functional difference.  This is to prepare for multiple pools per
gcwq.

Signed-off-by: Tejun Heo 
---
 kernel/workqueue.c |  218 ++-
 1 files changed, 111 insertions(+), 107 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index bc43a0c..9f82c25 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -471,8 +471,10 @@ static struct global_cwq *get_gcwq(unsigned int cpu)
return &unbound_global_cwq;
 }
 
-static atomic_t *get_gcwq_nr_running(unsigned int cpu)
+static atomic_t *get_pool_nr_running(struct worker_pool *pool)
 {
+   int cpu = pool->gcwq->cpu;
+
if (cpu != WORK_CPU_UNBOUND)
return &per_cpu(gcwq_nr_running, cpu);
else
@@ -578,10 +580,10 @@ static struct global_cwq *get_work_gcwq(struct 
work_struct *work)
  * assume that they're being called with gcwq->lock held.
  */
 
-static bool __need_more_worker(struct global_cwq *gcwq)
+static bool __need_more_worker(struct worker_pool *pool)
 {
-   return !atomic_read(get_gcwq_nr_running(gcwq->cpu)) ||
-   gcwq->flags & GCWQ_HIGHPRI_PENDING;
+   return !atomic_read(get_pool_nr_running(pool)) ||
+   pool->gcwq->flags & GCWQ_HIGHPRI_PENDING;
 }
 
 /*
@@ -592,45 +594,46 @@ static bool __need_more_worker(struct global_cwq *gcwq)
  * function will always return %true for unbound gcwq as long as the
  * worklist isn't empty.
  */
-static bool need_more_worker(struct global_cwq *gcwq)
+static bool need_more_worker(struct worker_pool *pool)
 {
-   return !list_empty(&gcwq->pool.worklist) && __need_more_worker(gcwq);
+   return !list_empty(&pool->worklist) && __need_more_worker(pool);
 }
 
 /* Can I start working?  Called from busy but !running workers. */
-static bool may_start_working(struct global_cwq *gcwq)
+static bool may_start_working(struct worker_pool *pool)
 {
-   return gcwq->pool.nr_idle;
+   return pool->nr_idle;
 }
 
 /* Do I need to keep working?  Called from currently running workers. */
-static bool keep_working(struct global_cwq *gcwq)
+static bool keep_working(struct worker_pool *pool)
 {
-   atomic_t *nr_running = get_gcwq_nr_running(gcwq->cpu);
+   atomic_t *nr_running = get_pool_nr_running(pool);
 
-   return !list_empty(&gcwq->pool.worklist) &&
+   return !list_empty(&pool->worklist) &&
(atomic_read(nr_running) <= 1 ||
-gcwq->flags & GCWQ_HIGHPRI_PENDING);
+pool->gcwq->flags & GCWQ_HIGHPRI_PENDING);
 }
 
 /* Do we need a new worker?  Called from manager. */
-static bool need_to_create_worker(struct global_cwq *gcwq)
+static bool need_to_create_worker(struct worker_pool *pool)
 {
-   return need_more_worker(gcwq) && !may_start_working(gcwq);
+   return need_more_worker(pool) && !may_start_working(pool);
 }
 
 /* Do I need to be the manager? */
-static bool need_to_manage_workers(struct global_cwq *gcwq)
+static bool need_to_manage_workers(struct worker_pool *pool)
 {
-   return need_to_create_worker(gcwq) || gcwq->flags & GCWQ_MANAGE_WORKERS;
+   return need_to_create_worker(pool) ||
+   pool->gcwq->flags & GCWQ_MANAGE_WORKERS;
 }
 
 /* Do we have too many workers and should some go away? */
-static bool too_many_workers(struct global_cwq *gcwq)
+static bool too_many_workers(struct worker_pool *pool)
 {
-   bool managing = gcwq->flags & GCWQ_MANAGING_WORKERS;
-   int nr_idle = gcwq->pool.nr_idle + managing; /* manager is considered 
idle */
-   int nr_busy = gcwq->pool.nr_workers - nr_idle;
+   bool managing = pool->gcwq->flags & GCWQ_MANAGING_WORKERS;
+   int nr_idle = pool->nr_idle + managing; /* manager is considered idle */
+   int nr_busy = pool->nr_workers - nr_idle;
 
return nr_idle > 2 && (nr_idle - 2) * MAX_IDLE_WORKERS_RATIO >= nr_busy;
 }
@@ -640,26 +643,26 @@ static bool too_many_workers(struct global_cwq *gcwq)
  */
 
 /* Return the first worker.  Safe with preemption disabled */
-static struct worker *first_worker(struct global_cwq *gcwq)
+static struct worker *first_worker(struct worker_pool *pool)
 {
-   if (unlikely(list_empty(&gcwq->pool.idle_list)))
+   if (unlikely(list_empty(&pool->idle_list)))
return NULL;
 
-   return list_first_entry(&gcwq->pool.idle_list, struct worker, entry);
+   return list_first_entry(&pool->idle_list, struct worker, entry);
 }
 
 /**
  * wake_up_worker - wake up an idle worker
- * @gcwq: gcwq to wake worker for
+ * @pool: worker pool to wake worker from
  *
-

[PATCH 1/6] workqueue: don't use WQ_HIGHPRI for unbound workqueues

2012-07-09 Thread Tejun Heo
Unbound wqs aren't concurrency-managed and try to execute work items
as soon as possible.  This is currently achieved by implicitly setting
%WQ_HIGHPRI on all unbound workqueues; however, the WQ_HIGHPRI
implementation is about to be restructured and this usage won't be
valid anymore.

Add an explicit chain-wakeup path for unbound workqueues in
process_one_work() instead of piggybacking on %WQ_HIGHPRI.

Signed-off-by: Tejun Heo 
---
 kernel/workqueue.c |   18 +++---
 1 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9a3128d..27637c2 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -580,6 +580,10 @@ static bool __need_more_worker(struct global_cwq *gcwq)
 /*
  * Need to wake up a worker?  Called from anything but currently
  * running workers.
+ *
+ * Note that, because unbound workers never contribute to nr_running, this
+ * function will always return %true for unbound gcwq as long as the
+ * worklist isn't empty.
  */
 static bool need_more_worker(struct global_cwq *gcwq)
 {
@@ -1867,6 +1871,13 @@ __acquires(&gcwq->lock)
if (unlikely(cpu_intensive))
worker_set_flags(worker, WORKER_CPU_INTENSIVE, true);
 
+   /*
+* Unbound gcwq isn't concurrency managed and work items should be
+* executed ASAP.  Wake up another worker if necessary.
+*/
+   if ((worker->flags & WORKER_UNBOUND) && need_more_worker(gcwq))
+   wake_up_worker(gcwq);
+
spin_unlock_irq(&gcwq->lock);
 
work_clear_pending(work);
@@ -2984,13 +2995,6 @@ struct workqueue_struct *__alloc_workqueue_key(const 
char *fmt,
if (flags & WQ_MEM_RECLAIM)
flags |= WQ_RESCUER;
 
-   /*
-* Unbound workqueues aren't concurrency managed and should be
-* dispatched to workers immediately.
-*/
-   if (flags & WQ_UNBOUND)
-   flags |= WQ_HIGHPRI;
-
max_active = max_active ?: WQ_DFL_ACTIVE;
max_active = wq_clamp_max_active(max_active, flags, wq->name);
 
-- 
1.7.7.3



[PATCHSET] workqueue: reimplement high priority using a separate worker pool

2012-07-09 Thread Tejun Heo
Currently, WQ_HIGHPRI workqueues share the same worker pool as the
normal priority ones.  The only difference is that work items from
highpri wq are queued at the head instead of the tail of the worklist.  In
pathological cases, this simplistic highpri implementation doesn't
seem to be sufficient.

For example, block layer request_queue delayed processing uses high
priority delayed_work to restart request processing after a short
delay.  Unfortunately, it doesn't seem to take too much to push the
latency between the delay timer expiring and the work item execution
to few second range leading to unintended long idling of the
underlying device.  There seem to be real-world cases where this
latency shows up[1].

A simplistic test case is measuring queue-to-execution latencies with
a lot of threads saturating CPU cycles.  Measuring over a 300sec period
with 3000 0-nice threads performing 1ms sleeps continuously and a
highpri work item being repeatedly queued at a 1 jiffy interval on a
single CPU machine, the top latency was 1624ms and the average of top
20 was 1268ms with stdev 927ms.
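
A sketch of the kind of probe used for this, in case anyone wants to
reproduce it; the actual harness isn't included in this posting, so the
names (lat_wq, lat_dwork, lat_work_fn) are made up for illustration.

#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/ktime.h>

static struct workqueue_struct *lat_wq;
static struct delayed_work lat_dwork;
static ktime_t lat_queued_at;

static void lat_work_fn(struct work_struct *work)
{
	/* time since the work was (re)queued; includes the 1 jiffy timer delay */
	s64 lat_us = ktime_us_delta(ktime_get(), lat_queued_at);

	pr_info("highpri latency: %lld us\n", lat_us);

	/* requeue with a 1 jiffy delay, as described above */
	lat_queued_at = ktime_get();
	queue_delayed_work(lat_wq, &lat_dwork, 1);
}

static int __init lat_init(void)
{
	lat_wq = alloc_workqueue("lat_test", WQ_HIGHPRI, 1);
	if (!lat_wq)
		return -ENOMEM;

	INIT_DELAYED_WORK(&lat_dwork, lat_work_fn);
	lat_queued_at = ktime_get();
	queue_delayed_work(lat_wq, &lat_dwork, 1);
	return 0;
}
module_init(lat_init);
MODULE_LICENSE("GPL");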

This patchset reimplements high priority workqueues so that it uses a
separate worklist and worker pool.  Now each global_cwq contains two
worker_pools - one for normal priority work items and the other for
high priority.  Each has its own worklist and worker pool and the
highpri worker pool is populated with worker threads w/ -20 nice
value.
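
In rough outline, the per-CPU backend then looks like this (an abridged
pseudo-declaration; the authoritative definitions are in patches
0002-0006):

struct global_cwq {
	spinlock_t		lock;		/* the gcwq lock */
	unsigned int		cpu;
	unsigned int		flags;

	/* busy workers are hashed here regardless of pool */
	struct hlist_head	busy_hash[BUSY_WORKER_HASH_SIZE];

	/* one pool per priority level: [0] normal, [1] highpri */
	struct worker_pool	pools[NR_WORKER_POOLS];

	/* trustee fields omitted */
};

Each worker_pool keeps its own worklist, idle_list, timers and worker
ida, and nr_running is likewise per-pool (the pool_nr_running[] percpu
array from patch 0005), so the two priority levels no longer share a
worklist.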

This reimplementation brings down the top latency to 16ms with a top-20
average of 3.8ms w/ stdev 5.6ms.  The original block layer bug hasn't
been verified to be fixed yet (Josh?).

The addition of separate worker pools doesn't add much to the
complexity but does add more threads per cpu.  The highpri worker pool is
expected to remain small, but the effect is noticeable, especially in
idle states.

I'm cc'ing all WQ_HIGHPRI users - block, bio-integrity, crypto, gfs2,
xfs and bluetooth.  Now you guys get proper high priority scheduling
for highpri work items; however, with more power comes more
responsibility.

Especially, the ones with both WQ_HIGHPRI and WQ_CPU_INTENSIVE -
bio-integrity and crypto - may end up dominating CPU usage.  I think
it should be mostly okay for bio-integrity considering it sits right
in the block request completion path.  I don't know enough about
tegra-aes tho.  aes_workqueue_handler() seems to mostly interact with
the hardware crypto.  Is it actually cpu cycle intensive?

This patchset contains the following six patches.

 0001-workqueue-don-t-use-WQ_HIGHPRI-for-unbound-workqueue.patch
 0002-workqueue-factor-out-worker_pool-from-global_cwq.patch
 0003-workqueue-use-pool-instead-of-gcwq-or-cpu-where-appl.patch
 0004-workqueue-separate-out-worker_pool-flags.patch
 0005-workqueue-introduce-NR_WORKER_POOLS-and-for_each_wor.patch
 0006-workqueue-reimplement-WQ_HIGHPRI-using-a-separate-wo.patch

0001 makes unbound wqs stop using WQ_HIGHPRI, as its meaning is about to
change and won't suit the purpose unbound wqs use it for.

0002-0005 gradually pull out worker_pool from global_cwq and update
code paths to be able to deal with multiple worker_pools per
global_cwq.

0006 replaces the head-queueing WQ_HIGHPRI implementation with the one
with separate worker_pool using the multiple worker_pool mechanism
previously implemented.

The patchset is available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git review-wq-highpri

diffstat follows.

 Documentation/workqueue.txt  |  103 ++
 include/trace/events/workqueue.h |2 
 kernel/workqueue.c   |  624 +--
 3 files changed, 385 insertions(+), 344 deletions(-)

Thanks.

--
tejun

[1] https://lkml.org/lkml/2012/3/6/475


[PATCH 5/6] workqueue: introduce NR_WORKER_POOLS and for_each_worker_pool()

2012-07-09 Thread Tejun Heo
Introduce NR_WORKER_POOLS and for_each_worker_pool() and convert code
paths which need to manipulate all pools in a gcwq to use them.
NR_WORKER_POOLS is currently one and for_each_worker_pool() iterates
over only @gcwq->pool.

Note that nr_running is a per-pool property and is converted to an array
with NR_WORKER_POOLS elements and renamed to pool_nr_running.

The changes in this patch are mechanical and don't cause any
functional difference.  This is to prepare for multiple pools per
gcwq.

Signed-off-by: Tejun Heo 
---
 kernel/workqueue.c |  225 
 1 files changed, 155 insertions(+), 70 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index e700dcc..9cbf3bc 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -74,6 +74,8 @@ enum {
TRUSTEE_RELEASE = 3,/* release workers */
TRUSTEE_DONE= 4,/* trustee is done */
 
+   NR_WORKER_POOLS = 1,/* # worker pools per gcwq */
+
BUSY_WORKER_HASH_ORDER  = 6,/* 64 pointers */
BUSY_WORKER_HASH_SIZE   = 1 << BUSY_WORKER_HASH_ORDER,
BUSY_WORKER_HASH_MASK   = BUSY_WORKER_HASH_SIZE - 1,
@@ -274,6 +276,9 @@ EXPORT_SYMBOL_GPL(system_nrt_freezable_wq);
 #define CREATE_TRACE_POINTS
 #include 
 
+#define for_each_worker_pool(pool, gcwq)   \
+   for ((pool) = &(gcwq)->pool; (pool); (pool) = NULL)
+
 #define for_each_busy_worker(worker, i, pos, gcwq) \
for (i = 0; i < BUSY_WORKER_HASH_SIZE; i++) \
hlist_for_each_entry(worker, pos, &gcwq->busy_hash[i], hentry)
@@ -454,7 +459,7 @@ static bool workqueue_freezing; /* W: have wqs 
started freezing? */
  * try_to_wake_up().  Put it in a separate cacheline.
  */
 static DEFINE_PER_CPU(struct global_cwq, global_cwq);
-static DEFINE_PER_CPU_SHARED_ALIGNED(atomic_t, gcwq_nr_running);
+static DEFINE_PER_CPU_SHARED_ALIGNED(atomic_t, 
pool_nr_running[NR_WORKER_POOLS]);
 
 /*
  * Global cpu workqueue and nr_running counter for unbound gcwq.  The
@@ -462,7 +467,9 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(atomic_t, 
gcwq_nr_running);
  * workers have WORKER_UNBOUND set.
  */
 static struct global_cwq unbound_global_cwq;
-static atomic_t unbound_gcwq_nr_running = ATOMIC_INIT(0);  /* always 0 */
+static atomic_t unbound_pool_nr_running[NR_WORKER_POOLS] = {
+   [0 ... NR_WORKER_POOLS - 1] = ATOMIC_INIT(0),   /* always 0 */
+};
 
 static int worker_thread(void *__worker);
 
@@ -477,11 +484,14 @@ static struct global_cwq *get_gcwq(unsigned int cpu)
 static atomic_t *get_pool_nr_running(struct worker_pool *pool)
 {
int cpu = pool->gcwq->cpu;
+   atomic_t (*nr_running)[NR_WORKER_POOLS];
 
if (cpu != WORK_CPU_UNBOUND)
-   return &per_cpu(gcwq_nr_running, cpu);
+   nr_running = &per_cpu(pool_nr_running, cpu);
else
-   return &unbound_gcwq_nr_running;
+   nr_running = &unbound_pool_nr_running;
+
+   return nr_running[0];
 }
 
 static struct cpu_workqueue_struct *get_cwq(unsigned int cpu,
@@ -3345,9 +3355,30 @@ EXPORT_SYMBOL_GPL(work_busy);
__ret1 < 0 ? -1 : 0;\
 })
 
+static bool gcwq_is_managing_workers(struct global_cwq *gcwq)
+{
+   struct worker_pool *pool;
+
+   for_each_worker_pool(pool, gcwq)
+   if (pool->flags & POOL_MANAGING_WORKERS)
+   return true;
+   return false;
+}
+
+static bool gcwq_has_idle_workers(struct global_cwq *gcwq)
+{
+   struct worker_pool *pool;
+
+   for_each_worker_pool(pool, gcwq)
+   if (!list_empty(&pool->idle_list))
+   return true;
+   return false;
+}
+
 static int __cpuinit trustee_thread(void *__gcwq)
 {
struct global_cwq *gcwq = __gcwq;
+   struct worker_pool *pool;
struct worker *worker;
struct work_struct *work;
struct hlist_node *pos;
@@ -3363,13 +3394,15 @@ static int __cpuinit trustee_thread(void *__gcwq)
 * cancelled.
 */
BUG_ON(gcwq->cpu != smp_processor_id());
-   rc = trustee_wait_event(!(gcwq->pool.flags & POOL_MANAGING_WORKERS));
+   rc = trustee_wait_event(!gcwq_is_managing_workers(gcwq));
BUG_ON(rc < 0);
 
-   gcwq->pool.flags |= POOL_MANAGING_WORKERS;
+   for_each_worker_pool(pool, gcwq) {
+   pool->flags |= POOL_MANAGING_WORKERS;
 
-   list_for_each_entry(worker, &gcwq->pool.idle_list, entry)
-   worker->flags |= WORKER_ROGUE;
+   list_for_each_entry(worker, &pool->idle_list, entry)
+   worker->flags |= WORKER_ROGUE;
+   }
 
for_each_busy_worker(worker, i, pos, gcwq)
worker->flags |= WORKER_ROGUE

[PATCH 6/6] workqueue: reimplement WQ_HIGHPRI using a separate worker_pool

2012-07-09 Thread Tejun Heo
WQ_HIGHPRI was implemented by queueing highpri work items at the head
of the global worklist.  Other than queueing at the head, they weren't
handled differently; unfortunately, this could lead to execution
latency of a few seconds on heavily loaded systems.

Now that workqueue code has been updated to deal with multiple
worker_pools per global_cwq, this patch reimplements WQ_HIGHPRI using
a separate worker_pool.  NR_WORKER_POOLS is bumped to two and
gcwq->pools[0] is used for normal pri work items and ->pools[1] for
highpri.  Highpri workers get a -20 nice level and have an 'H' suffix in
their names.  Note that this change increases the number of kworkers
per cpu.
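
To illustrate what the array layout buys, a pool's priority can be
recovered directly from its position within the gcwq; something along
these lines (the helper name is illustrative, the actual patch may
spell it differently):

static int worker_pool_pri(struct worker_pool *pool)
{
	return pool - pool->gcwq->pools;	/* 0 = normal, 1 = highpri */
}

Presumably this is also how per-priority state such as the
pool_nr_running[] array gets indexed without storing an explicit
priority field in each pool.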

POOL_HIGHPRI_PENDING, pool_determine_ins_pos() and highpri chain
wakeup code in process_one_work() are no longer used and removed.

This allows proper prioritization of highpri work items and removes
high execution latency of highpri work items.

Signed-off-by: Tejun Heo 
Reported-by: Josh Hunt 
LKML-Reference: 

---
 Documentation/workqueue.txt |  103 ---
 kernel/workqueue.c  |  100 +++--
 2 files changed, 65 insertions(+), 138 deletions(-)

diff --git a/Documentation/workqueue.txt b/Documentation/workqueue.txt
index a0b577d..a6ab4b6 100644
--- a/Documentation/workqueue.txt
+++ b/Documentation/workqueue.txt
@@ -89,25 +89,28 @@ called thread-pools.
 
 The cmwq design differentiates between the user-facing workqueues that
 subsystems and drivers queue work items on and the backend mechanism
-which manages thread-pool and processes the queued work items.
+which manages thread-pools and processes the queued work items.
 
 The backend is called gcwq.  There is one gcwq for each possible CPU
-and one gcwq to serve work items queued on unbound workqueues.
+and one gcwq to serve work items queued on unbound workqueues.  Each
+gcwq has two thread-pools - one for normal work items and the other
+for high priority ones.
 
 Subsystems and drivers can create and queue work items through special
 workqueue API functions as they see fit. They can influence some
 aspects of the way the work items are executed by setting flags on the
 workqueue they are putting the work item on. These flags include
-things like CPU locality, reentrancy, concurrency limits and more. To
-get a detailed overview refer to the API description of
+things like CPU locality, reentrancy, concurrency limits, priority and
+more.  To get a detailed overview refer to the API description of
 alloc_workqueue() below.
 
-When a work item is queued to a workqueue, the target gcwq is
-determined according to the queue parameters and workqueue attributes
-and appended on the shared worklist of the gcwq.  For example, unless
-specifically overridden, a work item of a bound workqueue will be
-queued on the worklist of exactly that gcwq that is associated to the
-CPU the issuer is running on.
+When a work item is queued to a workqueue, the target gcwq and
+thread-pool is determined according to the queue parameters and
+workqueue attributes and appended on the shared worklist of the
+thread-pool.  For example, unless specifically overridden, a work item
+of a bound workqueue will be queued on the worklist of either normal
+or highpri thread-pool of the gcwq that is associated to the CPU the
+issuer is running on.
 
 For any worker pool implementation, managing the concurrency level
 (how many execution contexts are active) is an important issue.  cmwq
@@ -115,26 +118,26 @@ tries to keep the concurrency at a minimal but sufficient 
level.
 Minimal to save resources and sufficient in that the system is used at
 its full capacity.
 
-Each gcwq bound to an actual CPU implements concurrency management by
-hooking into the scheduler.  The gcwq is notified whenever an active
-worker wakes up or sleeps and keeps track of the number of the
-currently runnable workers.  Generally, work items are not expected to
-hog a CPU and consume many cycles.  That means maintaining just enough
-concurrency to prevent work processing from stalling should be
-optimal.  As long as there are one or more runnable workers on the
-CPU, the gcwq doesn't start execution of a new work, but, when the
-last running worker goes to sleep, it immediately schedules a new
-worker so that the CPU doesn't sit idle while there are pending work
-items.  This allows using a minimal number of workers without losing
-execution bandwidth.
+Each thread-pool bound to an actual CPU implements concurrency
+management by hooking into the scheduler.  The thread-pool is notified
+whenever an active worker wakes up or sleeps and keeps track of the
+number of the currently runnable workers.  Generally, work items are
+not expected to hog a CPU and consume many cycles.  That means
+maintaining just enough concurrency to prevent work processing from
+stalling should be optimal.  As long as there are one or more runnable
+workers on the CPU

[PATCH 2/6] workqueue: factor out worker_pool from global_cwq

2012-07-09 Thread Tejun Heo
Move worklist and all worker management fields from global_cwq into
the new struct worker_pool.  worker_pool points back to the containing
gcwq.  worker and cpu_workqueue_struct are updated to point to
worker_pool instead of gcwq too.

This change is mechanical and doesn't introduce any functional
difference other than the rearrangement of fields and an added level of
indirection in some places.  This is to prepare for multiple pools per
gcwq.

Signed-off-by: Tejun Heo 
---
 include/trace/events/workqueue.h |2 +-
 kernel/workqueue.c   |  216 -
 2 files changed, 118 insertions(+), 100 deletions(-)

diff --git a/include/trace/events/workqueue.h b/include/trace/events/workqueue.h
index 4018f50..f28d1b6 100644
--- a/include/trace/events/workqueue.h
+++ b/include/trace/events/workqueue.h
@@ -54,7 +54,7 @@ TRACE_EVENT(workqueue_queue_work,
__entry->function   = work->func;
__entry->workqueue  = cwq->wq;
__entry->req_cpu= req_cpu;
-   __entry->cpu= cwq->gcwq->cpu;
+   __entry->cpu= cwq->pool->gcwq->cpu;
),
 
TP_printk("work struct=%p function=%pf workqueue=%p req_cpu=%u cpu=%u",
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 27637c2..bc43a0c 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -115,6 +115,7 @@ enum {
  */
 
 struct global_cwq;
+struct worker_pool;
 
 /*
  * The poor guys doing the actual heavy lifting.  All on-duty workers
@@ -131,7 +132,7 @@ struct worker {
struct cpu_workqueue_struct *current_cwq; /* L: current_work's cwq */
struct list_headscheduled;  /* L: scheduled works */
struct task_struct  *task;  /* I: worker task */
-   struct global_cwq   *gcwq;  /* I: the associated gcwq */
+   struct worker_pool  *pool;  /* I: the associated pool */
/* 64 bytes boundary on 64bit, 32 on 32bit */
unsigned long   last_active;/* L: last active timestamp */
unsigned intflags;  /* X: flags */
@@ -139,6 +140,21 @@ struct worker {
struct work_struct  rebind_work;/* L: rebind worker to cpu */
 };
 
+struct worker_pool {
+   struct global_cwq   *gcwq;  /* I: the owning gcwq */
+
+   struct list_headworklist;   /* L: list of pending works */
+   int nr_workers; /* L: total number of workers */
+   int nr_idle;/* L: currently idle ones */
+
+   struct list_headidle_list;  /* X: list of idle workers */
+   struct timer_list   idle_timer; /* L: worker idle timeout */
+   struct timer_list   mayday_timer;   /* L: SOS timer for dworkers */
+
+   struct ida  worker_ida; /* L: for worker IDs */
+   struct worker   *first_idle;/* L: first idle worker */
+};
+
 /*
  * Global per-cpu workqueue.  There's one and only one for each cpu
  * and all works are queued and processed here regardless of their
@@ -146,27 +162,18 @@ struct worker {
  */
 struct global_cwq {
spinlock_t  lock;   /* the gcwq lock */
-   struct list_headworklist;   /* L: list of pending works */
unsigned intcpu;/* I: the associated cpu */
unsigned intflags;  /* L: GCWQ_* flags */
 
-   int nr_workers; /* L: total number of workers */
-   int nr_idle;/* L: currently idle ones */
-
-   /* workers are chained either in the idle_list or busy_hash */
-   struct list_headidle_list;  /* X: list of idle workers */
+   /* workers are chained either in busy_head or pool idle_list */
struct hlist_head   busy_hash[BUSY_WORKER_HASH_SIZE];
/* L: hash of busy workers */
 
-   struct timer_list   idle_timer; /* L: worker idle timeout */
-   struct timer_list   mayday_timer;   /* L: SOS timer for dworkers */
-
-   struct ida  worker_ida; /* L: for worker IDs */
+   struct worker_pool  pool;   /* the worker pools */
 
struct task_struct  *trustee;   /* L: for gcwq shutdown */
unsigned inttrustee_state;  /* L: trustee state */
wait_queue_head_t   trustee_wait;   /* trustee wait */
-   struct worker   *first_idle;/* L: first idle worker */
 } cacheline_aligned_in_smp;
 
 /*
@@ -175,7 +182,7 @@ struct global_cwq {
  * aligned at two's power of the number of flag bits.
  */
 struct cpu_workqueue_struct {
-   struct global_cwq   *gcwq;  /* I: the associated gcwq */
+   struct worker_pool  *pool;  /* I: the assoc

[PATCH 05/32] crypto: mark crypto workqueues CPU_INTENSIVE

2011-01-03 Thread Tejun Heo
kcrypto_wq and pcrypt->wq's are used to run ciphers and may consume
a considerable amount of CPU cycles.  Mark both as CPU_INTENSIVE so that
they don't block other work items.

As the workqueues are primarily used to burn CPU cycles, concurrency
levels shouldn't matter much and are left at 1.  A higher value may be
beneficial and needs investigation.
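
If the investigation ever shows a higher level to help, the change is
confined to the last argument of alloc_workqueue(); e.g. something like
the following (num_online_cpus() is only an illustrative guess here,
not a recommendation):

	kcrypto_wq = alloc_workqueue("crypto",
				     WQ_MEM_RECLAIM | WQ_CPU_INTENSIVE,
				     num_online_cpus());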

Signed-off-by: Tejun Heo 
Cc: Herbert Xu 
Cc: "David S. Miller" 
Cc: linux-crypto@vger.kernel.org
---
Only compile tested.  Please feel free to take it into the subsystem
tree or simply ack - I'll route it through the wq tree.

Thanks.

 crypto/crypto_wq.c |3 ++-
 crypto/pcrypt.c|3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/crypto/crypto_wq.c b/crypto/crypto_wq.c
index fdcf624..b980ee1 100644
--- a/crypto/crypto_wq.c
+++ b/crypto/crypto_wq.c
@@ -20,7 +20,8 @@ EXPORT_SYMBOL_GPL(kcrypto_wq);
 
 static int __init crypto_wq_init(void)
 {
-   kcrypto_wq = create_workqueue("crypto");
+   kcrypto_wq = alloc_workqueue("crypto",
+WQ_MEM_RECLAIM | WQ_CPU_INTENSIVE, 1);
if (unlikely(!kcrypto_wq))
return -ENOMEM;
return 0;
diff --git a/crypto/pcrypt.c b/crypto/pcrypt.c
index 75586f1..29a89da 100644
--- a/crypto/pcrypt.c
+++ b/crypto/pcrypt.c
@@ -455,7 +455,8 @@ static int pcrypt_init_padata(struct padata_pcrypt *pcrypt,
 
get_online_cpus();
 
-   pcrypt->wq = create_workqueue(name);
+   pcrypt->wq = alloc_workqueue(name,
+WQ_MEM_RECLAIM | WQ_CPU_INTENSIVE, 1);
if (!pcrypt->wq)
goto err;
 
-- 
1.7.1



[PATCH v2.6.36-rc7] crypto: use cancel_delayed_work_sync() in hifn_795x

2010-10-15 Thread Tejun Heo
Make hifn_795x::hifn_remove() call cancel_delayed_work_sync() instead
of calling cancel_delayed_work() followed by flush_scheduled_work().

This is to prepare for the deprecation and removal of
flush_scheduled_work().

Signed-off-by: Tejun Heo 
---
 drivers/crypto/hifn_795x.c |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

Index: work/drivers/crypto/hifn_795x.c
===
--- work.orig/drivers/crypto/hifn_795x.c
+++ work/drivers/crypto/hifn_795x.c
@@ -2700,8 +2700,7 @@ static void __devexit hifn_remove(struct
dev = pci_get_drvdata(pdev);

if (dev) {
-   cancel_delayed_work(&dev->work);
-   flush_scheduled_work();
+   cancel_delayed_work_sync(&dev->work);

hifn_unregister_rng(dev);
hifn_unregister_alg(dev);