[V5 PATCH 23/26] x86: use memblock_set_current_limit() to set memblock.current_limit

2012-10-29 Thread Lai Jiangshan
From: Yasuaki Ishimatsu 

memblock.current_limit is set directly even though memblock_set_current_limit()
is provided for this purpose. Fix it by using the accessor.
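
For reference, the accessor in question is essentially a one-line setter; a
hedged sketch of what it looks like in mm/memblock.c of that era (treat the
exact annotations as an approximation):

void __init_memblock memblock_set_current_limit(phys_addr_t limit)
{
	/* single point that sets the limit used by the memblock allocators */
	memblock.current_limit = limit;
}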

Signed-off-by: Yasuaki Ishimatsu 
Signed-off-by: Lai Jiangshan 
---
 arch/x86/kernel/setup.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index ca45696..ab3017a 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -890,7 +890,7 @@ void __init setup_arch(char **cmdline_p)
 
cleanup_highmap();
 
-   memblock.current_limit = get_max_mapped();
+   memblock_set_current_limit(get_max_mapped());
memblock_x86_fill();
 
/*
@@ -940,7 +940,7 @@ void __init setup_arch(char **cmdline_p)
max_low_pfn = max_pfn;
}
 #endif
-   memblock.current_limit = get_max_mapped();
+   memblock_set_current_limit(get_max_mapped());
dma_contiguous_reserve(0);
 
/*
-- 
1.7.4.4



[V5 PATCH 14/26] kthread: use N_MEMORY instead N_HIGH_MEMORY

2012-10-29 Thread Lai Jiangshan
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.
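
For background, a hedged sketch of the distinction this series relies on
(node_state() is the real nodemask helper; do_something_on() is a made-up
placeholder):

	/*
	 * A movable-only node is in node_states[N_MEMORY] but not in
	 * node_states[N_HIGH_MEMORY], so code that wants "every node
	 * with memory" must test N_MEMORY, as this series does:
	 */
	if (node_state(nid, N_MEMORY))		/* any memory, incl. movable-only */
		do_something_on(nid);		/* hypothetical helper */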

Signed-off-by: Lai Jiangshan 
---
 kernel/kthread.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 29fb60c..691dc2e 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -428,7 +428,7 @@ int kthreadd(void *unused)
set_task_comm(tsk, "kthreadd");
ignore_signals(tsk);
set_cpus_allowed_ptr(tsk, cpu_all_mask);
-   set_mems_allowed(node_states[N_HIGH_MEMORY]);
+   set_mems_allowed(node_states[N_MEMORY]);
 
current->flags |= PF_NOFREEZE;
 
-- 
1.7.4.4



[V5 PATCH 25/26] memblock: compare current_limit with end variable at memblock_find_in_range_node()

2012-10-29 Thread Lai Jiangshan
From: Yasuaki Ishimatsu 

memblock_find_in_range_node() does not compare memblock.current_limit
with the end argument. Thus even if memblock.current_limit is smaller than
end, the function may return a memory address that is above
memblock.current_limit.

This patch adds the check to memblock_find_in_range_node().
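
For illustration, a hedged sketch of the clamping the patch introduces (same
variable names as the diff below; MEMBLOCK_ALLOC_ACCESSIBLE is the existing
"use current_limit" sentinel):

	phys_addr_t limit = memblock.current_limit;

	/* never search above the accessible limit, even for an explicit @end */
	if (end == MEMBLOCK_ALLOC_ACCESSIBLE || end > limit)
		end = limit;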

Signed-off-by: Yasuaki Ishimatsu 
Signed-off-by: Lai Jiangshan 
---
 mm/memblock.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index ee2e307..50ab53c 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -100,11 +100,12 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
phys_addr_t align, int nid)
 {
phys_addr_t this_start, this_end, cand;
+   phys_addr_t current_limit = memblock.current_limit;
u64 i;
 
/* pump up @end */
-   if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
-   end = memblock.current_limit;
+   if ((end == MEMBLOCK_ALLOC_ACCESSIBLE) || (end > current_limit))
+   end = current_limit;
 
/* avoid allocating the first page */
start = max_t(phys_addr_t, start, PAGE_SIZE);
-- 
1.7.4.4



[V5 PATCH 22/26] x86: get pg_data_t's memory from other node

2012-10-29 Thread Lai Jiangshan
From: Yasuaki Ishimatsu 

If the system can create a movable node, in which all of the node's memory
is allocated as ZONE_MOVABLE, setup_node_data() cannot allocate the node's
pg_data_t from that node itself.
So when memblock_alloc_nid() fails, setup_node_data() retries with
memblock_alloc().
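
For reference, a simplified sketch of the intended fallback, based on the
changelog above rather than on the exact hunk below (error paths trimmed):

	/* try node-local memory first, then fall back to any node */
	nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
	if (!nd_pa) {
		printk(KERN_WARNING "Cannot find %zu bytes in node %d\n",
		       nd_size, nid);
		nd_pa = memblock_alloc(nd_size, SMP_CACHE_BYTES);
		if (!nd_pa) {
			pr_err("Cannot find %zu bytes in other node\n",
			       nd_size);
			return;
		}
	}
	nd = __va(nd_pa);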

Signed-off-by: Yasuaki Ishimatsu 
Signed-off-by: Lai Jiangshan 
---
 arch/x86/mm/numa.c |8 ++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 2d125be..a86e315 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -223,9 +223,13 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
remapped = true;
} else {
nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
-   if (!nd_pa) {
-   pr_err("Cannot find %zu bytes in node %d\n",
+   if (!nd_pa)
+   printk(KERN_WARNING "Cannot find %zu bytes in node %d\n",
   nd_size, nid);
+   nd_pa = memblock_alloc(nd_size, SMP_CACHE_BYTES);
+   if (!nd_pa) {
+   pr_err("Cannot find %zu bytes in other node\n",
+  nd_size);
return;
}
nd = __va(nd_pa);
-- 
1.7.4.4



[V5 PATCH 10/26] mm,migrate: use N_MEMORY instead N_HIGH_MEMORY

2012-10-29 Thread Lai Jiangshan
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan 
Acked-by: Christoph Lameter 
---
 mm/migrate.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 77ed2d7..d595e58 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1201,7 +1201,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
if (node < 0 || node >= MAX_NUMNODES)
goto out_pm;
 
-   if (!node_state(node, N_HIGH_MEMORY))
+   if (!node_state(node, N_MEMORY))
goto out_pm;
 
err = -EACCES;
-- 
1.7.4.4



[V5 PATCH 26/26] mempolicy: fix is_valid_nodemask()

2012-10-29 Thread Lai Jiangshan
is_valid_nodemask() was introduced by commit 19770b32, but it does not match
its comment, because it does not check zones above policy_zone.

Also, commit b377fd tells us that if the highest zone is ZONE_MOVABLE, we
should also apply memory policies to it, so ZONE_MOVABLE should be a valid
zone for policies. is_valid_nodemask() needs to be changed to match this.

Fix: check all zones, even those whose zone id is greater than policy_zone.
Use nodes_intersects() instead of open-coding the check.
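
For illustration, a hedged example of what the new check accepts (hypothetical
layout: node 0 has normal memory, node 1 is a movable-only node;
nodemask_of_node() and nodes_intersects() are the real nodemask helpers):

	nodemask_t mask = nodemask_of_node(1);

	/* node 1 has (movable) memory, so a policy over {1} is now valid */
	BUG_ON(!nodes_intersects(mask, node_states[N_MEMORY]));

	/*
	 * but it has no normal/high memory, which is what apply_policy_zone()
	 * below uses to decide that only ZONE_MOVABLE allocations get the
	 * nodemask applied
	 */
	BUG_ON(nodes_intersects(mask, node_states[N_HIGH_MEMORY]));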

Signed-off-by: Lai Jiangshan 
Reported-by: Wen Congyang 
---
 mm/mempolicy.c |   36 ++--
 1 files changed, 22 insertions(+), 14 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d4a084c..ed7c249 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -140,19 +140,7 @@ static const struct mempolicy_operations {
 /* Check that the nodemask contains at least one populated zone */
 static int is_valid_nodemask(const nodemask_t *nodemask)
 {
-   int nd, k;
-
-   for_each_node_mask(nd, *nodemask) {
-   struct zone *z;
-
-   for (k = 0; k <= policy_zone; k++) {
-   z = &NODE_DATA(nd)->node_zones[k];
-   if (z->present_pages > 0)
-   return 1;
-   }
-   }
-
-   return 0;
+   return nodes_intersects(*nodemask, node_states[N_MEMORY]);
 }
 
 static inline int mpol_store_user_nodemask(const struct mempolicy *pol)
@@ -1572,6 +1560,26 @@ struct mempolicy *get_vma_policy(struct task_struct *task,
return pol;
 }
 
+static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
+{
+   enum zone_type dynamic_policy_zone = policy_zone;
+
+   BUG_ON(dynamic_policy_zone == ZONE_MOVABLE);
+
+   /*
+* If policy->v.nodes has movable memory only,
+* we apply the policy only when gfp_zone(gfp) is ZONE_MOVABLE.
+*
+* policy->v.nodes is intersected with node_states[N_MEMORY].
+* So if the following test fails, it implies that
+* policy->v.nodes has movable memory only.
+*/
+   if (!nodes_intersects(policy->v.nodes, node_states[N_HIGH_MEMORY]))
+   dynamic_policy_zone = ZONE_MOVABLE;
+
+   return zone >= dynamic_policy_zone;
+}
+
 /*
  * Return a nodemask representing a mempolicy for filtering nodes for
  * page allocation
@@ -1580,7 +1588,7 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
 {
/* Lower zones don't get a nodemask applied for MPOL_BIND */
if (unlikely(policy->mode == MPOL_BIND) &&
-   gfp_zone(gfp) >= policy_zone &&
+   apply_policy_zone(policy, gfp_zone(gfp)) &&
cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
return &policy->v.nodes;
 
-- 
1.7.4.4



[V5 PATCH 15/26] init: use N_MEMORY instead N_HIGH_MEMORY

2012-10-29 Thread Lai Jiangshan
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan 
---
 init/main.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/init/main.c b/init/main.c
index 9cf77ab..9595968 100644
--- a/init/main.c
+++ b/init/main.c
@@ -855,7 +855,7 @@ static void __init kernel_init_freeable(void)
/*
 * init can allocate pages on any node
 */
-   set_mems_allowed(node_states[N_HIGH_MEMORY]);
+   set_mems_allowed(node_states[N_MEMORY]);
/*
 * init can run on any cpu.
 */
-- 
1.7.4.4



[V5 PATCH 06/26] cpuset: use N_MEMORY instead N_HIGH_MEMORY

2012-10-29 Thread Lai Jiangshan
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.
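
For context, a hedged sketch of the guarantee the changed helper keeps
providing after this patch (condensed from the hunks below; locking elided):

	nodemask_t pmask;

	guarantee_online_mems(cs, &pmask);
	/*
	 * pmask is now some non-empty subset of node_states[N_MEMORY],
	 * found by walking up the cpuset hierarchy if needed.
	 */
	BUG_ON(!nodes_intersects(pmask, node_states[N_MEMORY]));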

Signed-off-by: Lai Jiangshan 
Acked-by: Hillf Danton 
---
 Documentation/cgroups/cpusets.txt |2 +-
 include/linux/cpuset.h|2 +-
 kernel/cpuset.c   |   32 
 3 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
index cefd3d8..12e01d4 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -218,7 +218,7 @@ and name space for cpusets, with a minimum of additional kernel code.
 The cpus and mems files in the root (top_cpuset) cpuset are
 read-only.  The cpus file automatically tracks the value of
 cpu_online_mask using a CPU hotplug notifier, and the mems file
-automatically tracks the value of node_states[N_HIGH_MEMORY]--i.e.,
+automatically tracks the value of node_states[N_MEMORY]--i.e.,
 nodes with memory--using the cpuset_track_online_nodes() hook.
 
 
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 838320f..8c8a60d29 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -144,7 +144,7 @@ static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
return node_possible_map;
 }
 
-#define cpuset_current_mems_allowed (node_states[N_HIGH_MEMORY])
+#define cpuset_current_mems_allowed (node_states[N_MEMORY])
 static inline void cpuset_init_current_mems_allowed(void) {}
 
 static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index f33c715..2b133db 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -302,10 +302,10 @@ static void guarantee_online_cpus(const struct cpuset *cs,
  * are online, with memory.  If none are online with memory, walk
  * up the cpuset hierarchy until we find one that does have some
  * online mems.  If we get all the way to the top and still haven't
- * found any online mems, return node_states[N_HIGH_MEMORY].
+ * found any online mems, return node_states[N_MEMORY].
  *
  * One way or another, we guarantee to return some non-empty subset
- * of node_states[N_HIGH_MEMORY].
+ * of node_states[N_MEMORY].
  *
  * Call with callback_mutex held.
  */
@@ -313,14 +313,14 @@ static void guarantee_online_cpus(const struct cpuset *cs,
 static void guarantee_online_mems(const struct cpuset *cs, nodemask_t *pmask)
 {
while (cs && !nodes_intersects(cs->mems_allowed,
-   node_states[N_HIGH_MEMORY]))
+   node_states[N_MEMORY]))
cs = cs->parent;
if (cs)
nodes_and(*pmask, cs->mems_allowed,
-   node_states[N_HIGH_MEMORY]);
+   node_states[N_MEMORY]);
else
-   *pmask = node_states[N_HIGH_MEMORY];
-   BUG_ON(!nodes_intersects(*pmask, node_states[N_HIGH_MEMORY]));
+   *pmask = node_states[N_MEMORY];
+   BUG_ON(!nodes_intersects(*pmask, node_states[N_MEMORY]));
 }
 
 /*
@@ -1100,7 +1100,7 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
return -ENOMEM;
 
/*
-* top_cpuset.mems_allowed tracks node_stats[N_HIGH_MEMORY];
+* top_cpuset.mems_allowed tracks node_stats[N_MEMORY];
 * it's read-only
 */
if (cs == &top_cpuset) {
@@ -1122,7 +1122,7 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
goto done;
 
if (!nodes_subset(trialcs->mems_allowed,
-   node_states[N_HIGH_MEMORY])) {
+   node_states[N_MEMORY])) {
retval =  -EINVAL;
goto done;
}
@@ -2034,7 +2034,7 @@ static struct cpuset *cpuset_next(struct list_head *queue)
  * before dropping down to the next.  It always processes a node before
  * any of its children.
  *
- * In the case of memory hot-unplug, it will remove nodes from N_HIGH_MEMORY
+ * In the case of memory hot-unplug, it will remove nodes from N_MEMORY
  * if all present pages from a node are offlined.
  */
 static void
@@ -2073,7 +2073,7 @@ scan_cpusets_upon_hotplug(struct cpuset *root, enum hotplug_event event)
 
/* Continue past cpusets with all mems online */
if (nodes_subset(cp->mems_allowed,
-   node_states[N_HIGH_MEMORY]))
+   node_states[N_MEMORY]))
continue;
 
oldmems = cp->mems_allowed;
@@ -2081,7 +2081,7 @@ scan_cpusets_upon_hotplug(struct

[V5 PATCH 02/26] memory_hotplug: handle empty zone when online_movable/online_kernel

2012-10-29 Thread Lai Jiangshan
Make online_movable/online_kernel able to empty a zone,
or to move memory into an empty zone.
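
For orientation, a condensed sketch of the pattern the patch applies to both
move_pfn_range_left() and move_pfn_range_right() ("dst" is a made-up name for
the zone that receives the range; locking and failure paths omitted):

	/* the receiving zone may be empty: give it a wait table etc. first */
	if (!dst->wait_table) {
		ret = init_currently_empty_zone(dst, start_pfn,
						end_pfn - start_pfn,
						MEMMAP_HOTPLUG);
		if (ret)
			return ret;
	}

	/* an empty zone has no meaningful zone_start_pfn; use the moved range */
	dst_start_pfn = dst->spanned_pages ? dst->zone_start_pfn : start_pfn;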

Signed-off-by: Lai Jiangshan 
---
 mm/memory_hotplug.c |   51 +--
 1 files changed, 45 insertions(+), 6 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 6d3bec4..bdcdaf6 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -227,8 +227,17 @@ static void resize_zone(struct zone *zone, unsigned long start_pfn,
 
zone_span_writelock(zone);
 
-   zone->zone_start_pfn = start_pfn;
-   zone->spanned_pages = end_pfn - start_pfn;
+   if (end_pfn - start_pfn) {
+   zone->zone_start_pfn = start_pfn;
+   zone->spanned_pages = end_pfn - start_pfn;
+   } else {
+   /*
+* make it consistent with free_area_init_core():
+* if spanned_pages == 0, then keep zone_start_pfn == 0
+*/
+   zone->zone_start_pfn = 0;
+   zone->spanned_pages = 0;
+   }
 
zone_span_writeunlock(zone);
 }
@@ -244,10 +253,19 @@ static void fix_zone_id(struct zone *zone, unsigned long start_pfn,
set_page_links(pfn_to_page(pfn), zid, nid, pfn);
 }
 
-static int move_pfn_range_left(struct zone *z1, struct zone *z2,
+static int __meminit move_pfn_range_left(struct zone *z1, struct zone *z2,
unsigned long start_pfn, unsigned long end_pfn)
 {
+   int ret;
unsigned long flags;
+   unsigned long z1_start_pfn;
+
+   if (!z1->wait_table) {
+   ret = init_currently_empty_zone(z1, start_pfn,
+   end_pfn - start_pfn, MEMMAP_HOTPLUG);
+   if (ret)
+   return ret;
+   }
 
pgdat_resize_lock(z1->zone_pgdat, &flags);
 
@@ -261,7 +279,13 @@ static int move_pfn_range_left(struct zone *z1, struct zone *z2,
if (end_pfn <= z2->zone_start_pfn)
goto out_fail;
 
-   resize_zone(z1, z1->zone_start_pfn, end_pfn);
+   /* use start_pfn for z1's start_pfn if z1 is empty */
+   if (z1->spanned_pages)
+   z1_start_pfn = z1->zone_start_pfn;
+   else
+   z1_start_pfn = start_pfn;
+
+   resize_zone(z1, z1_start_pfn, end_pfn);
resize_zone(z2, end_pfn, z2->zone_start_pfn + z2->spanned_pages);
 
pgdat_resize_unlock(z1->zone_pgdat, &flags);
@@ -274,10 +298,19 @@ out_fail:
return -1;
 }
 
-static int move_pfn_range_right(struct zone *z1, struct zone *z2,
+static int __meminit move_pfn_range_right(struct zone *z1, struct zone *z2,
unsigned long start_pfn, unsigned long end_pfn)
 {
+   int ret;
unsigned long flags;
+   unsigned long z2_end_pfn;
+
+   if (!z2->wait_table) {
+   ret = init_currently_empty_zone(z2, start_pfn,
+   end_pfn - start_pfn, MEMMAP_HOTPLUG);
+   if (ret)
+   return ret;
+   }
 
pgdat_resize_lock(z1->zone_pgdat, &flags);
 
@@ -291,8 +324,14 @@ static int move_pfn_range_right(struct zone *z1, struct zone *z2,
if (start_pfn >= z1->zone_start_pfn + z1->spanned_pages)
goto out_fail;
 
+   /* use end_pfn for z2's end_pfn if z2 is empty */
+   if (z2->spanned_pages)
+   z2_end_pfn = z2->zone_start_pfn + z2->spanned_pages;
+   else
+   z2_end_pfn = end_pfn;
+
resize_zone(z1, z1->zone_start_pfn, start_pfn);
-   resize_zone(z2, start_pfn, z2->zone_start_pfn + z2->spanned_pages);
+   resize_zone(z2, start_pfn, z2_end_pfn);
 
pgdat_resize_unlock(z1->zone_pgdat, &flags);
 
-- 
1.7.4.4



[V5 PATCH 16/26] vmscan: use N_MEMORY instead N_HIGH_MEMORY

2012-10-29 Thread Lai Jiangshan
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan 
Acked-by: Hillf Danton 
---
 mm/vmscan.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2624edc..98a2e11 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3135,7 +3135,7 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
int nid;
 
if (action == CPU_ONLINE || action == CPU_ONLINE_FROZEN) {
-   for_each_node_state(nid, N_HIGH_MEMORY) {
+   for_each_node_state(nid, N_MEMORY) {
pg_data_t *pgdat = NODE_DATA(nid);
const struct cpumask *mask;
 
@@ -3191,7 +3191,7 @@ static int __init kswapd_init(void)
int nid;
 
swap_setup();
-   for_each_node_state(nid, N_HIGH_MEMORY)
+   for_each_node_state(nid, N_MEMORY)
kswapd_run(nid);
hotcpu_notifier(cpu_callback, 0);
return 0;
-- 
1.7.4.4



[V5 PATCH 11/26] mempolicy: use N_MEMORY instead N_HIGH_MEMORY

2012-10-29 Thread Lai Jiangshan
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan 
---
 mm/mempolicy.c |   12 ++--
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d04a8a5..d4a084c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -212,9 +212,9 @@ static int mpol_set_nodemask(struct mempolicy *pol,
/* if mode is MPOL_DEFAULT, pol is NULL. This is right. */
if (pol == NULL)
return 0;
-   /* Check N_HIGH_MEMORY */
+   /* Check N_MEMORY */
nodes_and(nsc->mask1,
- cpuset_current_mems_allowed, node_states[N_HIGH_MEMORY]);
+ cpuset_current_mems_allowed, node_states[N_MEMORY]);
 
VM_BUG_ON(!nodes);
if (pol->mode == MPOL_PREFERRED && nodes_empty(*nodes))
@@ -1388,7 +1388,7 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
goto out_put;
}
 
-   if (!nodes_subset(*new, node_states[N_HIGH_MEMORY])) {
+   if (!nodes_subset(*new, node_states[N_MEMORY])) {
err = -EINVAL;
goto out_put;
}
@@ -2361,7 +2361,7 @@ void __init numa_policy_init(void)
 * fall back to the largest node if they're all smaller.
 */
nodes_clear(interleave_nodes);
-   for_each_node_state(nid, N_HIGH_MEMORY) {
+   for_each_node_state(nid, N_MEMORY) {
unsigned long total_pages = node_present_pages(nid);
 
/* Preserve the largest node */
@@ -2442,7 +2442,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context)
*nodelist++ = '\0';
if (nodelist_parse(nodelist, nodes))
goto out;
-   if (!nodes_subset(nodes, node_states[N_HIGH_MEMORY]))
+   if (!nodes_subset(nodes, node_states[N_MEMORY]))
goto out;
} else
nodes_clear(nodes);
@@ -2476,7 +2476,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context)
 * Default to online nodes with memory if no nodelist
 */
if (!nodelist)
-   nodes = node_states[N_HIGH_MEMORY];
+   nodes = node_states[N_MEMORY];
break;
case MPOL_LOCAL:
/*
-- 
1.7.4.4



[V5 PATCH 21/26] page_alloc: add kernelcore_max_addr

2012-10-29 Thread Lai Jiangshan
The current ZONE_MOVABLE (kernelcore=) boot-option policy doesn't meet our
requirement. We need something like a kernelcore_max_addr=XX boot option to
limit the upper address of kernelcore memory.

Memory above that address will be migratable (movable), so it is easier to
take offline (it is always ready to be offlined when the system doesn't
require so much memory).

This makes dynamic memory hot-add/remove easier, makes better use of memory,
and helps THP.

kernelcore_max_addr=, kernelcore= and movablecore= can all be safely
specified at the same time (or any two of them).
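
A hypothetical usage example (the numbers are made up): on a machine with 16G
of RAM, booting with

	kernelcore_max_addr=4G

would keep non-movable kernel allocations below 4G and leave the memory above
4G in ZONE_MOVABLE, so it can later be offlined or hot-removed.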

Signed-off-by: Lai Jiangshan 
---
 Documentation/kernel-parameters.txt |9 +
 mm/page_alloc.c |   29 -
 2 files changed, 37 insertions(+), 1 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 9776f06..2b72ffb 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1223,6 +1223,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
use the HighMem zone if it exists, and the Normal
zone if it does not.
 
+   kernelcore_max_addr=nn[KMG] [KNL,X86,IA-64,PPC] This parameter
+   has the same effect as the kernelcore parameter, except it
+   specifies the upper physical address of the memory range
+   usable by the kernel for non-movable allocations.
+   If both kernelcore and kernelcore_max_addr are
+   specified, kernelcore_max_addr takes priority over
+   kernelcore.
+   See the kernelcore parameter.
+
kgdbdbgp=   [KGDB,HW] kgdb over EHCI usb debug port.
Format: [,poll interval]
The controller # is the number of the ehci usb debug
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a42337f..11df8b5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -203,6 +203,7 @@ static unsigned long __meminitdata dma_reserve;
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 static unsigned long __meminitdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
 static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
+static unsigned long __initdata required_kernelcore_max_pfn;
 static unsigned long __initdata required_kernelcore;
 static unsigned long __initdata required_movablecore;
 static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
@@ -4700,6 +4701,7 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 {
int i, nid;
unsigned long usable_startpfn;
+   unsigned long kernelcore_max_pfn;
unsigned long kernelcore_node, kernelcore_remaining;
/* save the state before borrow the nodemask */
nodemask_t saved_node_state = node_states[N_MEMORY];
@@ -4728,6 +4730,9 @@ static void __init find_zone_movable_pfns_for_nodes(void)
required_kernelcore = max(required_kernelcore, corepages);
}
 
+   if (required_kernelcore_max_pfn && !required_kernelcore)
+   required_kernelcore = totalpages;
+
/* If kernelcore was not specified, there is no ZONE_MOVABLE */
if (!required_kernelcore)
goto out;
@@ -4736,6 +4741,12 @@ static void __init find_zone_movable_pfns_for_nodes(void)
find_usable_zone_for_movable();
usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
 
+   if (required_kernelcore_max_pfn)
+   kernelcore_max_pfn = required_kernelcore_max_pfn;
+   else
+   kernelcore_max_pfn = ULONG_MAX >> PAGE_SHIFT;
+   kernelcore_max_pfn = max(kernelcore_max_pfn, usable_startpfn);
+
 restart:
/* Spread kernelcore memory as evenly as possible throughout nodes */
kernelcore_node = required_kernelcore / usable_nodes;
@@ -4762,8 +4773,12 @@ restart:
unsigned long size_pages;
 
start_pfn = max(start_pfn, zone_movable_pfn[nid]);
-   if (start_pfn >= end_pfn)
+   end_pfn = min(kernelcore_max_pfn, end_pfn);
+   if (start_pfn >= end_pfn) {
+   if (!zone_movable_pfn[nid])
+   zone_movable_pfn[nid] = start_pfn;
continue;
+   }
 
/* Account for what is only usable for kernelcore */
if (start_pfn < usable_startpfn) {
@@ -4954,6 +4969,18 @@ static int __init cmdline_parse_core(char *p, unsigned long *core)
return 0;
 }
 
+#ifdef CONFIG_MOVABLE_NODE
+/*
+ * kernelcore_max_addr=addr sets the upper physical address of the memory
+ * range usable for allocations that cannot be reclaimed or migrated.
+ */
+static int __ini

Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site

2013-08-25 Thread Lai Jiangshan
On 08/26/2013 01:43 AM, Paul E. McKenney wrote:
> On Sun, Aug 25, 2013 at 11:19:37PM +0800, Lai Jiangshan wrote:
>> Hi, Steven
>>
>> Any comments about this patch?
> 
> For whatever it is worth, it ran without incident for two hours worth
> of rcutorture on my P5 test (boosting but no CPU hotplug).
> 
> Lai, do you have a specific test for this patch?  

Also rcutorture.
(A special module is added to ensure all paths of my code are covered.)

> Your deadlock
> scenario looks plausible, but is apparently not occurring in the
> mainline kernel.

Yes, you can leave this possible bug alone until the real problem happens,
or just disallow overlapping.
I can write some debug code for it which would allow us to find
the problems earlier.

I guess this is a useful usage pattern of RCU:

again:
	rcu_read_lock();
	obj = rcu_dereference(ptr);
	spin_lock_XX(obj->lock);
	if (obj is invalid) {
		spin_unlock_XX(obj->lock);
		rcu_read_unlock();
		goto again;
	}
	rcu_read_unlock();
	# use obj
	spin_unlock_XX(obj->lock);

If we encourage this pattern, we should fix all the related problems.
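
For concreteness, a hypothetical instantiation of the pattern with a
softirq-context lock (all names below are made up; this is the shape of code
that would hit the deadlock described in the patch without the fix):

struct foo {
	spinlock_t lock;
	bool dead;
};

static struct foo __rcu *foo_ptr;

static struct foo *get_live_foo(void)
{
	struct foo *f;

again:
	rcu_read_lock();
	f = rcu_dereference(foo_ptr);	/* sketch: assume foo_ptr != NULL */
	spin_lock_bh(&f->lock);		/* softirq-context lock, like @L */
	if (f->dead) {
		spin_unlock_bh(&f->lock);
		rcu_read_unlock();
		goto again;
	}
	rcu_read_unlock();		/* f is now pinned by f->lock */
	return f;			/* caller drops f->lock when done */
}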

Thanks,
Lai

> 
>   Thanx, Paul
> 
>> Thanks,
>> Lai
>>
>>
>> On Fri, Aug 23, 2013 at 2:26 PM, Lai Jiangshan  wrote:
>>
>>> [PATCH] rcu/rt_mutex: eliminate a kind of deadlock for rcu read site
>>>
>>> Current rtmutex's lock->wait_lock doesn't disables softirq nor irq, it will
>>> cause rcu read site deadlock when rcu overlaps with any
>>> softirq-context/irq-context lock.
>>>
>>> @L is a spinlock of softirq or irq context.
>>>
>>> CPU1                                  cpu2(rcu boost)
>>> rcu_read_lock()                       rt_mutex_lock()
>>>                                         raw_spin_lock(lock->wait_lock)
>>> spin_lock_XX(L)                       <interrupt and doing softirq or irq>
>>> rcu_read_unlock()                       do_softirq()
>>>   rcu_read_unlock_special()
>>>     rt_mutex_unlock()
>>>       raw_spin_lock(lock->wait_lock)      spin_lock_XX(L)  **DEADLOCK**
>>>
>>> This patch fixes this kind of deadlock by removing rt_mutex_unlock() from
>>> rcu_read_unlock(); a new rt_mutex_rcu_deboost_unlock() is called instead.
>>> Thus rtmutex's lock->wait_lock will not be taken from rcu_read_unlock().
>>>
>>> This patch does not eliminate all kinds of rcu-read-site deadlock:
>>> if @L is a scheduler lock, it will still deadlock, and we should apply Paul's
>>> rule in that case (avoid overlapping, or use preempt_disable()).
>>>
>>> rt_mutex_rcu_deboost_unlock() requires that the @waiter is queued, so we
>>> can't directly call rt_mutex_lock(&mtx) in the rcu_boost thread;
>>> we split rt_mutex_lock(&mtx) into two steps, just like pi-futex.
>>> This results in an internal state in the rcu_boost thread and makes
>>> the rcu_boost thread a bit more complicated.
>>>
>>> Thanks
>>> Lai
>>>
>>> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
>>> index 5cd0f09..8830874 100644
>>> --- a/include/linux/init_task.h
>>> +++ b/include/linux/init_task.h
>>> @@ -102,7 +102,7 @@ extern struct group_info init_groups;
>>>
>>>  #ifdef CONFIG_RCU_BOOST
>>>  #define INIT_TASK_RCU_BOOST()  \
>>> -   .rcu_boost_mutex = NULL,
>>> +   .rcu_boost_waiter = NULL,
>>>  #else
>>>  #define INIT_TASK_RCU_BOOST()
>>>  #endif
>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>> index e9995eb..1eca99f 100644
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -1078,7 +1078,7 @@ struct task_struct {
>>> struct rcu_node *rcu_blocked_node;
>>>  #endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
>>>  #ifdef CONFIG_RCU_BOOST
>>> -   struct rt_mutex *rcu_boost_mutex;
>>> +   struct rt_mutex_waiter *rcu_boost_waiter;
>>>  #endif /* #ifdef CONFIG_RCU_BOOST */
>>>
>>>  #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
>>> @@ -1723,7 +1723,7 @@ static inline void rcu_copy_process(struct
>>> task_struct *p)
>>> p->rcu_blocked_node = NULL;
>>>  #endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
>>>  #ifdef CONFIG_RCU_BOOST
>>> -   p->rcu_boost_mutex = NULL;
>>> +   p->rcu_boost_waiter = NULL;
>>>  #endif /* #ifdef

Re: [PATCH tip/core/rcu 8/9] nohz_full: Add full-system-idle state machine

2013-08-25 Thread Lai Jiangshan
On 08/20/2013 10:47 AM, Paul E. McKenney wrote:
> From: "Paul E. McKenney" 
> 
> This commit adds the state machine that takes the per-CPU idle data
> as input and produces a full-system-idle indication as output.  This
> state machine is driven out of RCU's quiescent-state-forcing
> mechanism, which invokes rcu_sysidle_check_cpu() to collect per-CPU
> idle state and then rcu_sysidle_report() to drive the state machine.
> 
> The full-system-idle state is sampled using rcu_sys_is_idle(), which
> also drives the state machine if RCU is idle (and does so by forcing
> RCU to become non-idle).  This function returns true if all but the
> timekeeping CPU (tick_do_timer_cpu) are idle and have been idle long
> enough to avoid memory contention on the full_sysidle_state state
> variable.  The rcu_sysidle_force_exit() may be called externally
> to reset the state machine back into non-idle state.
> 
> For large systems the state machine is driven out of RCU's
> force-quiescent-state logic, which provides good scalability at the price
> of millisecond-scale latencies on the transition to full-system-idle
> state.  This is not so good for battery-powered systems, which are usually
> small enough that they don't need to care about scalability, but which
> do care deeply about energy efficiency.  Small systems therefore drive
> the state machine directly out of the idle-entry code.  The number of
> CPUs in a "small" system is defined by a new NO_HZ_FULL_SYSIDLE_SMALL
> Kconfig parameter, which defaults to 8.  Note that this is a build-time
> definition.
> 
> Signed-off-by: Paul E. McKenney 
> Cc: Frederic Weisbecker 
> Cc: Steven Rostedt 
> Cc: Lai Jiangshan 
> [ paulmck: Use true and false for boolean constants per Lai Jiangshan. ]
> Reviewed-by: Josh Triplett 
> ---
>  include/linux/rcupdate.h |  18 +++
>  kernel/rcutree.c |  16 ++-
>  kernel/rcutree.h |   5 +
>  kernel/rcutree_plugin.h  | 284 
> ++-
>  kernel/time/Kconfig  |  27 +
>  5 files changed, 343 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 30bea9c..f1f1bc3 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -1011,4 +1011,22 @@ static inline bool rcu_is_nocb_cpu(int cpu) { return false; }
>  #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
>  
>  
> +/* Only for use by adaptive-ticks code. */
> +#ifdef CONFIG_NO_HZ_FULL_SYSIDLE
> +extern bool rcu_sys_is_idle(void);
> +extern void rcu_sysidle_force_exit(void);
> +#else /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
> +
> +static inline bool rcu_sys_is_idle(void)
> +{
> + return false;
> +}
> +
> +static inline void rcu_sysidle_force_exit(void)
> +{
> +}
> +
> +#endif /* #else #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
> +
> +
>  #endif /* __LINUX_RCUPDATE_H */
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index 7b5be56..eca70f44 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -734,6 +734,7 @@ static int dyntick_save_progress_counter(struct rcu_data *rdp,
>bool *isidle, unsigned long *maxj)
>  {
>   rdp->dynticks_snap = atomic_add_return(0, &rdp->dynticks->dynticks);
> + rcu_sysidle_check_cpu(rdp, isidle, maxj);
>   return (rdp->dynticks_snap & 0x1) == 0;
>  }
>  
> @@ -1373,11 +1374,17 @@ int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
>   rsp->n_force_qs++;
>   if (fqs_state == RCU_SAVE_DYNTICK) {
>   /* Collect dyntick-idle snapshots. */
> + if (is_sysidle_rcu_state(rsp)) {
> + isidle = 1;
> + maxj = jiffies - ULONG_MAX / 4;
> + }
>   force_qs_rnp(rsp, dyntick_save_progress_counter,
>&isidle, &maxj);
> + rcu_sysidle_report_gp(rsp, isidle, maxj);
>   fqs_state = RCU_FORCE_QS;
>   } else {
>   /* Handle dyntick-idle and offline CPUs. */
> + isidle = 0;
>   force_qs_rnp(rsp, rcu_implicit_dynticks_qs, &isidle, &maxj);
>   }
>   /* Clear flag to prevent immediate re-entry. */
> @@ -2103,9 +2110,12 @@ static void force_qs_rnp(struct rcu_state *rsp,
>   cpu = rnp->grplo;
>   bit = 1;
>   for (; cpu <= rnp->grphi; cpu++, bit <<= 1) {
> - if ((rnp->qsmask & bit) != 0 &&
> - f(per_cpu_ptr(rsp->rda, cpu), isidle, maxj))
> - mask |= bit;
> +

Re: [PATCH tip/core/rcu 8/9] nohz_full: Add full-system-idle state machine

2013-08-26 Thread Lai Jiangshan
On 08/27/2013 12:24 AM, Paul E. McKenney wrote:
> On Mon, Aug 26, 2013 at 01:45:32PM +0800, Lai Jiangshan wrote:
>> On 08/20/2013 10:47 AM, Paul E. McKenney wrote:
>>> From: "Paul E. McKenney" 
>>>
>>> This commit adds the state machine that takes the per-CPU idle data
>>> as input and produces a full-system-idle indication as output.  This
>>> state machine is driven out of RCU's quiescent-state-forcing
>>> mechanism, which invokes rcu_sysidle_check_cpu() to collect per-CPU
>>> idle state and then rcu_sysidle_report() to drive the state machine.
>>>
>>> The full-system-idle state is sampled using rcu_sys_is_idle(), which
>>> also drives the state machine if RCU is idle (and does so by forcing
>>> RCU to become non-idle).  This function returns true if all but the
>>> timekeeping CPU (tick_do_timer_cpu) are idle and have been idle long
>>> enough to avoid memory contention on the full_sysidle_state state
>>> variable.  The rcu_sysidle_force_exit() may be called externally
>>> to reset the state machine back into non-idle state.
>>>
>>> For large systems the state machine is driven out of RCU's
>>> force-quiescent-state logic, which provides good scalability at the price
>>> of millisecond-scale latencies on the transition to full-system-idle
>>> state.  This is not so good for battery-powered systems, which are usually
>>> small enough that they don't need to care about scalability, but which
>>> do care deeply about energy efficiency.  Small systems therefore drive
>>> the state machine directly out of the idle-entry code.  The number of
>>> CPUs in a "small" system is defined by a new NO_HZ_FULL_SYSIDLE_SMALL
>>> Kconfig parameter, which defaults to 8.  Note that this is a build-time
>>> definition.
>>>
>>> Signed-off-by: Paul E. McKenney 
>>> Cc: Frederic Weisbecker 
>>> Cc: Steven Rostedt 
>>> Cc: Lai Jiangshan 
>>> [ paulmck: Use true and false for boolean constants per Lai Jiangshan. ]
>>> Reviewed-by: Josh Triplett 
>>> ---
>>>  include/linux/rcupdate.h |  18 +++
>>>  kernel/rcutree.c |  16 ++-
>>>  kernel/rcutree.h |   5 +
>>>  kernel/rcutree_plugin.h  | 284 
>>> ++-
>>>  kernel/time/Kconfig  |  27 +
>>>  5 files changed, 343 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
>>> index 30bea9c..f1f1bc3 100644
>>> --- a/include/linux/rcupdate.h
>>> +++ b/include/linux/rcupdate.h
>>> @@ -1011,4 +1011,22 @@ static inline bool rcu_is_nocb_cpu(int cpu) { return false; }
>>>  #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
>>>  
>>>  
>>> +/* Only for use by adaptive-ticks code. */
>>> +#ifdef CONFIG_NO_HZ_FULL_SYSIDLE
>>> +extern bool rcu_sys_is_idle(void);
>>> +extern void rcu_sysidle_force_exit(void);
>>> +#else /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
>>> +
>>> +static inline bool rcu_sys_is_idle(void)
>>> +{
>>> +   return false;
>>> +}
>>> +
>>> +static inline void rcu_sysidle_force_exit(void)
>>> +{
>>> +}
>>> +
>>> +#endif /* #else #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
>>> +
>>> +
>>>  #endif /* __LINUX_RCUPDATE_H */
>>> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
>>> index 7b5be56..eca70f44 100644
>>> --- a/kernel/rcutree.c
>>> +++ b/kernel/rcutree.c
>>> @@ -734,6 +734,7 @@ static int dyntick_save_progress_counter(struct rcu_data *rdp,
>>>  bool *isidle, unsigned long *maxj)
>>>  {
>>> rdp->dynticks_snap = atomic_add_return(0, &rdp->dynticks->dynticks);
>>> +   rcu_sysidle_check_cpu(rdp, isidle, maxj);
>>> return (rdp->dynticks_snap & 0x1) == 0;
>>>  }
>>>  
>>> @@ -1373,11 +1374,17 @@ int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
>>> rsp->n_force_qs++;
>>> if (fqs_state == RCU_SAVE_DYNTICK) {
>>> /* Collect dyntick-idle snapshots. */
>>> +   if (is_sysidle_rcu_state(rsp)) {
>>> +   isidle = 1;
>>> +   maxj = jiffies - ULONG_MAX / 4;
>>> +   }
>>> force_qs_rnp(rsp, dyntick_save_progress_counter,
>>> 

Re: [PATCH 09/11 V5] workqueue: rebind newly created worker

2012-09-05 Thread Lai Jiangshan
On Wed, Sep 5, 2012 at 6:37 PM, Lai Jiangshan  wrote:
> When the newly created worker needs to be rebound, exile it!
> It will rebind itself in worker_thread().
>
> This is only preparation; this code is unused until
> "we unbind/rebind without manager_mutex".
>
> Signed-off-by: Lai Jiangshan 
> ---
>  kernel/workqueue.c |5 +
>  1 files changed, 5 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 223d128..819c84e 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1838,6 +1838,11 @@ static void manager_start_worker(struct worker *worker, struct worker *manager)
>
> /* copy the flags. also unbind the worker if the manager is unbound */
> worker->flags |= manager->flags & WORKER_COPY_FLAGS;
> +
> +   if (need_to_rebind_worker(worker)) {
> +   /* exile it, and let it rebind itself */
> +   list_del_init(&worker->entry);

Sorry, we should not add it to the idle_list;
"add and then remove" is wrong here.

> +   }
>  }
>
>  /**
> --
> 1.7.4.4
>


Re: [PATCH 05/11 V5] workqueue: Add @bind arguement back without change any thing

2012-09-05 Thread Lai Jiangshan
On 09/06/2012 03:49 AM, Tejun Heo wrote:
> Hello,
> 
> On Wed, Sep 05, 2012 at 06:37:42PM +0800, Lai Jiangshan wrote:
>> Ensure the gcwq->flags is only accessed with gcwq->lock held.
>> And make the code more easier to understand.
>>
>> In all current callsite of create_worker(), DISASSOCIATED can't
>> be flipped while create_worker().
>> So the whole behavior is unchanged with this patch.
> 
> This doesn't change anything.  You're just moving the test to the
> caller with comments there explaining how it won't change even if
> gcwq->lock is released.  It seems more confusing to me.  The flag is
> still protected by manager_mutex.  How is this an improvement?
> 

Some other bits of gcwq->flags are accessed (modified) without manager_mutex.
Making gcwq->flags be accessed only from within the gcwq->lock critical section
will help the reviewer.

I don't like adding special things/code when they are not absolutely required.
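
For illustration, a hedged sketch of the convention being argued for (struct
and flag names as in that era's kernel/workqueue.c; gcwq_set_disassociated()
is a made-up helper, not a real function):

static void gcwq_set_disassociated(struct global_cwq *gcwq)
{
	spin_lock_irq(&gcwq->lock);
	gcwq->flags |= GCWQ_DISASSOCIATED;	/* only touched under gcwq->lock */
	spin_unlock_irq(&gcwq->lock);
}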



Re: [PATCH 02/11 V5] workqueue: async idle rebinding

2012-09-05 Thread Lai Jiangshan
On 09/06/2012 02:06 AM, Tejun Heo wrote:
> Hello, Lai.
> 
> Ooh, I like the approach.  That said, I think it's a bit too invasive
> for 3.6-fixes.  I'll merge the two patches I posted yesterday in
> 3.6-fixes.  Let's do this restructuring in for-3.7.

OK for me; it is too complicated for 3.6.

> 
> On Wed, Sep 05, 2012 at 06:37:39PM +0800, Lai Jiangshan wrote:
>>  static void idle_worker_rebind(struct worker *worker)
>>  {
>>  struct global_cwq *gcwq = worker->pool->gcwq;
>>  
>> -/* CPU must be online at this point */
>> -WARN_ON(!worker_maybe_bind_and_lock(worker));
>> -if (!--worker->idle_rebind->cnt)
>> -complete(&worker->idle_rebind->done);
>> -spin_unlock_irq(&worker->pool->gcwq->lock);
>> +if (worker_maybe_bind_and_lock(worker))
>> +worker_clr_flags(worker, WORKER_UNBOUND);
>>  
>> -/* we did our part, wait for rebind_workers() to finish up */
>> -wait_event(gcwq->rebind_hold, !(worker->flags & WORKER_REBIND));
>> +worker_clr_flags(worker, WORKER_REBIND);
>> +list_add(&worker->entry, &worker->pool->idle_list);
>> +spin_unlock_irq(&gcwq->lock);
> 
> This looks correct to me but it's still a bit scary.  Some comments
> explaining why the above is correct would be nice.

How should I explain the correctness? Could you give some clues?
- correctness for rebinding and the flags: comments are missing (the old code
  missed them too, so I forgot them).
- correctness for idle management: list_del_init() and list_add(); I don't like
  adding comments for self-explanatory code.
- correctness for quick-enabled-CMWQ and local wake-up: the comments are in the
  changelog (I should also add them to the code).
- correctness for integrating the above: ...

> 
> Yeah, other than that, looks good to me.  I'll prepare new for-3.7
> branch this can be based on, so please wait a bit.  Also, I think I'll
> probably update commit description / comments while committing.
> 

I was coding it based on wq/for-3.7, so you can merge it more easily.
I am waiting for your merged result.

Thanks.
Lai



Re: [PATCH 2/2] workqueue: fix possible deadlock in idle worker rebinding

2012-09-05 Thread Lai Jiangshan
On 09/06/2012 07:11 AM, Tejun Heo wrote:
> On Tue, Sep 04, 2012 at 11:16:32PM -0700, Tejun Heo wrote:
>> Currently, rebind_workers() and idle_worker_rebind() are two-way
>> interlocked.  rebind_workers() waits for idle workers to finish
>> rebinding and rebound idle workers wait for rebind_workers() to finish
>> rebinding busy workers before proceeding.
> 
> Applied to wq/for-3.6-fixes.
> 
> Thanks.
> 

OK for me.

Thanks.
Lai


Re: [PATCH 04/10 V4] workqueue: add manage_workers_slowpath()

2012-09-05 Thread Lai Jiangshan
On 09/05/2012 09:12 AM, Tejun Heo wrote:
> Hello, Lai.
> 
> On Sun, Sep 02, 2012 at 12:28:22AM +0800, Lai Jiangshan wrote:
>> If hotplug code grabbed the manager_mutex and worker_thread try to create
>> a worker, the manage_worker() will return false and worker_thread go to
>> process work items. Now, on the CPU, all workers are processing work items,
>> no idle_worker left/ready for managing. It breaks the concept of workqueue
>> and it is bug.
>>
>> So when manage_worker() failed to grab the manager_mutex, it should
>> try to enter normal process contex and then compete on the manager_mutex
>> instead of return false.
>>
>> To safely do this, we add manage_workers_slowpath() and the worker
>> go to process work items mode to do the managing jobs. thus
>> managing jobs are processed via work item and can free to compete
>> on manager_mutex.
> 
> Ummm this seems overly complicated.  How about scheduling rebind
> work to a worker and forcing it to break out of the work processing
> loop?  I think it can be done fairly easily using POOL_MANAGE_WORKERS
> - set it from the rebind function, break out of work processing loop
> if it's set, replace need_to_manage_workers() with POOL_MANAGE_WORKERS
> test (the function really isn't necessary) and always jump back to
> recheck afterwards.  It might need a bit more mangling here and there
> but that should be the essence of it.  I'll give a stab at it later
> today.
> 

This approach is a little like my unsent approach3 (I will explain it in
another mail).
This approach is the most complicated and changes the most code if it is
implemented.

First, we should rebind/unbind separately per pool, because if we queue the
rebind work to the high-pri pool, we will break the normal pool, and vice versa.

This forces us to move DISASSOCIATED into the pool flags, and it forces us to
add more code in the CPU notifier.

Second, reuse POOL_MANAGE_WORKERS, or add a new flag.

Third, we need to restructure rebind/unbind and change a lot in worker_thread().

My partial/unsent approach3 has almost the same problem.
(The difference: my approach3 doesn't use a work item; it is checked and called
from the "recheck" label of worker_thread(). It is called with the WORKER_PREP
bit set, and it uses mutex_trylock() to grab the lock like manage_workers().)

How much code would be changed for just the unbind part of this approach:


 kernel/workqueue.c |  103 ++--
 1 files changed, 76 insertions(+), 27 deletions(-)


Thanks
Lai


Re: [PATCH 03/11 V5] workqueue: new day don't need WORKER_REBIND for busy rebinding

2012-09-05 Thread Lai Jiangshan
On 09/06/2012 02:31 AM, Tejun Heo wrote:
> On Wed, Sep 05, 2012 at 06:37:40PM +0800, Lai Jiangshan wrote:
>> because old busy_worker_rebind_fn() have to wait until all idle worker 
>> finish.
>> so we have to use two flags WORKER_UNBOUND and WORKER_REBIND to avoid
>> prematurely clear all NOT_RUNNING bit when highly frequent offline/online.
>>
>> but current code don't need to wait idle workers. so we don't need to
>> use two flags, just one is enough. remove WORKER_REBIND from busy rebinding.
> 
> ROGUE / REBIND thing existed for busy workers from the beginning when
> there was no idle worker rebinding, so this definitely wasn't about
> whether idle rebind is synchronous or not. 

In the very old days, this definitely wasn't about whether idle rebind is
synchronous or not, but after you reimplemented rebind_workers(), that is the
only remaining reason for WORKER_REBIND in busy rebinding.

If I am missing something, this 03/11 will be wrong. The old code did not
comment on all the reasons why WORKER_REBIND is needed, so we have to think
more about the correctness of this 03/11.

> Trying to remember
> what... ah, okay, setting of DISASSOCIATED and setting of WORKER_ROGUE
> didn't use to happen together with gcwq->lock held.  CPU_DOWN would
> first set ROGUE and then later on set DISASSOCIATED, so if the
> rebind_fn kicks in inbetween that, it would break CPU_DOWN.
> 
> I think now that both CPU_DOWN and UP are done under single holding of
> gcwq->lock, this should be safe.  It would be nice to note what
> changed in the patch description and the atomicity requirement as a
> comment tho.
> 

Oh, I forgot to add a changelog note about the single holding of gcwq->lock.


Thanks
Lai



Re: [RFC v2] memory-hotplug: remove MIGRATE_ISOLATE from free_area->free_list

2012-09-06 Thread Lai Jiangshan
On 09/06/2012 10:53 AM, Minchan Kim wrote:
> Normally, MIGRATE_ISOLATE type is used for memory-hotplug.
> But it's irony type because the pages isolated would exist
> as free page in free_area->free_list[MIGRATE_ISOLATE] so people
> can think of it as allocatable pages but it is *never* allocatable.
> It ends up confusing NR_FREE_PAGES vmstat so it would be
> totally not accurate so some of place which depend on such vmstat
> could reach wrong decision by the context.
> 
> There were already report about it.[1]
> [1] 702d1a6e, memory-hotplug: fix kswapd looping forever problem
> 
> Then, there was other report which is other problem.[2]
> [2] http://www.spinics.net/lists/linux-mm/msg41251.html
> 
> I believe it can make problems in future, too.
> So I hope removing such irony type by another design.
> 
> I hope this patch solves it and let's revert [1] and doesn't need [2].
> 
> * Changelog v1
>  * Fix from Michal's many suggestion
> 
> Cc: Michal Nazarewicz 
> Cc: Mel Gorman 
> Cc: Kamezawa Hiroyuki 
> Cc: Yasuaki Ishimatsu 
> Cc: Wen Congyang 
> Cc: Konrad Rzeszutek Wilk 
> Signed-off-by: Minchan Kim 
> ---

> @@ -180,30 +287,35 @@ int undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>   * all pages in [start_pfn...end_pfn) must be in the same zone.
>   * zone->lock must be held before call this.
>   *
> - * Returns 1 if all pages in the range are isolated.
> + * Returns true if all pages in the range are isolated.
>   */
> -static int
> -__test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn)
> +static bool
> +__test_page_isolated_in_pageblock(unsigned long start_pfn, unsigned long end_pfn)
>  {
> + unsigned long pfn, next_pfn;
>   struct page *page;
>  
> - while (pfn < end_pfn) {
> - if (!pfn_valid_within(pfn)) {
> - pfn++;
> - continue;
> - }
> - page = pfn_to_page(pfn);
> - if (PageBuddy(page))
> - pfn += 1 << page_order(page);
> - else if (page_count(page) == 0 &&
> - page_private(page) == MIGRATE_ISOLATE)
> - pfn += 1;
> - else
> - break;
> + list_for_each_entry(page, &isolated_pages, lru) {

> + if (&page->lru == &isolated_pages)
> + return false;

What is the meaning of this line?

> + pfn = page_to_pfn(page);
> + if (pfn >= end_pfn)
> + return false;
> + if (pfn >= start_pfn)
> + goto found;
> + }
> + return false;
> +
> + list_for_each_entry_continue(page, &isolated_pages, lru) {
> + if (page_to_pfn(page) != next_pfn)
> + return false;

Where does next_pfn get initialized?

> +found:
> + pfn = page_to_pfn(page);
> + next_pfn = pfn + (1UL << page_order(page));
> + if (next_pfn >= end_pfn)
> + return true;
>   }
> - if (pfn < end_pfn)
> - return 0;
> - return 1;
> + return false;
>  }
>  
>  int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn)
> @@ -211,7 +323,7 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn)
>   unsigned long pfn, flags;
>   struct page *page;
>   struct zone *zone;
> - int ret;
> + bool ret;
>  
>   /*
>* Note: pageblock_nr_page != MAX_ORDER. Then, chunks of free page
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index df7a674..bb59ff7 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -616,7 +616,6 @@ static char * const migratetype_names[MIGRATE_TYPES] = {
>  #ifdef CONFIG_CMA
>   "CMA",
>  #endif
> - "Isolate",
>  };
>  
>  static void *frag_start(struct seq_file *m, loff_t *pos)



Re: [RFC v2] memory-hotplug: remove MIGRATE_ISOLATE from free_area->free_list

2012-09-06 Thread Lai Jiangshan
On 09/06/2012 04:18 PM, Minchan Kim wrote:
> Hello Lai,
> 
> On Thu, Sep 06, 2012 at 04:14:51PM +0800, Lai Jiangshan wrote:
>> On 09/06/2012 10:53 AM, Minchan Kim wrote:
>>> Normally, MIGRATE_ISOLATE type is used for memory-hotplug.
>>> But it's irony type because the pages isolated would exist
>>> as free page in free_area->free_list[MIGRATE_ISOLATE] so people
>>> can think of it as allocatable pages but it is *never* allocatable.
>>> It ends up confusing NR_FREE_PAGES vmstat so it would be
>>> totally not accurate so some of place which depend on such vmstat
>>> could reach wrong decision by the context.
>>>
>>> There were already report about it.[1]
>>> [1] 702d1a6e, memory-hotplug: fix kswapd looping forever problem
>>>
>>> Then, there was other report which is other problem.[2]
>>> [2] http://www.spinics.net/lists/linux-mm/msg41251.html
>>>
>>> I believe it can make problems in future, too.
>>> So I hope removing such irony type by another design.
>>>
>>> I hope this patch solves it and let's revert [1] and doesn't need [2].
>>>
>>> * Changelog v1
>>>  * Fix from Michal's many suggestion
>>>
>>> Cc: Michal Nazarewicz 
>>> Cc: Mel Gorman 
>>> Cc: Kamezawa Hiroyuki 
>>> Cc: Yasuaki Ishimatsu 
>>> Cc: Wen Congyang 
>>> Cc: Konrad Rzeszutek Wilk 
>>> Signed-off-by: Minchan Kim 
>>> ---
>>
>>> @@ -180,30 +287,35 @@ int undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>>>   * all pages in [start_pfn...end_pfn) must be in the same zone.
>>>   * zone->lock must be held before call this.
>>>   *
>>> - * Returns 1 if all pages in the range are isolated.
>>> + * Returns true if all pages in the range are isolated.
>>>   */
>>> -static int
>>> -__test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn)
>>> +static bool
>>> +__test_page_isolated_in_pageblock(unsigned long start_pfn, unsigned long end_pfn)
>>>  {
>>> +   unsigned long pfn, next_pfn;
>>> struct page *page;
>>>  
>>> -   while (pfn < end_pfn) {
>>> -   if (!pfn_valid_within(pfn)) {
>>> -   pfn++;
>>> -   continue;
>>> -   }
>>> -   page = pfn_to_page(pfn);
>>> -   if (PageBuddy(page))
>>> -   pfn += 1 << page_order(page);
>>> -   else if (page_count(page) == 0 &&
>>> -   page_private(page) == MIGRATE_ISOLATE)
>>> -   pfn += 1;
>>> -   else
>>> -   break;
>>> +   list_for_each_entry(page, &isolated_pages, lru) {
>>
>>> +   if (&page->lru == &isolated_pages)
>>> +   return false;
>>
>> what's the mean of this line?
> 
> I just copied it from Michal's code but It seem to be not needed.
> I will remove it in next spin.
> 
>>
>>> +   pfn = page_to_pfn(page);
>>> +   if (pfn >= end_pfn)
>>> +   return false;



>>> +   if (pfn >= start_pfn)
>>> +   goto found;

this test is wrong.

if ((pfn <= start_pfn) && (start_pfn < pfn + (1UL << page_order(page))))
	goto found;


>>> +   }
>>> +   return false;
>>> +
>>> +   list_for_each_entry_continue(page, &isolated_pages, lru) {
>>> +   if (page_to_pfn(page) != next_pfn)
>>> +   return false;
>>
>> where is next_pfn init-ed? 
> 
> by "goto found"

don't goto inner label.

move the found label up:

+
+found:
+   next_pfn = page_to_pfn(page);
+   list_for_each_entry_from(page, &isolated_pages, lru) {
+   if (page_to_pfn(page) != next_pfn)
+   return false;
+   pfn = page_to_pfn(page);
+   next_pfn = pfn + (1UL << page_order(page));
+   if (next_pfn >= end_pfn)
+   return true;
}


Re: [RFC v2] memory-hotplug: remove MIGRATE_ISOLATE from free_area->free_list

2012-09-06 Thread Lai Jiangshan
On 09/06/2012 04:18 PM, Minchan Kim wrote:
> Hello Lai,
> 
> On Thu, Sep 06, 2012 at 04:14:51PM +0800, Lai Jiangshan wrote:
>> On 09/06/2012 10:53 AM, Minchan Kim wrote:
>>> Normally, MIGRATE_ISOLATE type is used for memory-hotplug.
>>> But it's irony type because the pages isolated would exist
>>> as free page in free_area->free_list[MIGRATE_ISOLATE] so people
>>> can think of it as allocatable pages but it is *never* allocatable.
>>> It ends up confusing NR_FREE_PAGES vmstat so it would be
>>> totally not accurate so some of place which depend on such vmstat
>>> could reach wrong decision by the context.
>>>
>>> There were already report about it.[1]
>>> [1] 702d1a6e, memory-hotplug: fix kswapd looping forever problem
>>>
>>> Then, there was other report which is other problem.[2]
>>> [2] http://www.spinics.net/lists/linux-mm/msg41251.html
>>>
>>> I believe it can make problems in future, too.
>>> So I hope removing such irony type by another design.
>>>
>>> I hope this patch solves it and let's revert [1] and doesn't need [2].
>>>
>>> * Changelog v1
>>>  * Fix from Michal's many suggestion
>>>
>>> Cc: Michal Nazarewicz 
>>> Cc: Mel Gorman 
>>> Cc: Kamezawa Hiroyuki 
>>> Cc: Yasuaki Ishimatsu 
>>> Cc: Wen Congyang 
>>> Cc: Konrad Rzeszutek Wilk 
>>> Signed-off-by: Minchan Kim 
>>> ---
>>
>>> @@ -180,30 +287,35 @@ int undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>>>   * all pages in [start_pfn...end_pfn) must be in the same zone.
>>>   * zone->lock must be held before call this.
>>>   *
>>> - * Returns 1 if all pages in the range are isolated.
>>> + * Returns true if all pages in the range are isolated.
>>>   */
>>> -static int
>>> -__test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn)
>>> +static bool
>>> +__test_page_isolated_in_pageblock(unsigned long start_pfn, unsigned long end_pfn)
>>>  {
>>> +   unsigned long pfn, next_pfn;
>>> struct page *page;
>>>  
>>> -   while (pfn < end_pfn) {
>>> -   if (!pfn_valid_within(pfn)) {
>>> -   pfn++;
>>> -   continue;
>>> -   }
>>> -   page = pfn_to_page(pfn);
>>> -   if (PageBuddy(page))
>>> -   pfn += 1 << page_order(page);
>>> -   else if (page_count(page) == 0 &&
>>> -   page_private(page) == MIGRATE_ISOLATE)
>>> -   pfn += 1;
>>> -   else
>>> -   break;
>>> +   list_for_each_entry(page, &isolated_pages, lru) {
>>
>>> +   if (&page->lru == &isolated_pages)
>>> +   return false;
>>
>> what's the mean of this line?
> 
> I just copied it from Michal's code but It seem to be not needed.
> I will remove it in next spin.
> 
>>
>>> +   pfn = page_to_pfn(page);
>>> +   if (pfn >= end_pfn)
>>> +   return false;



>>> +   if (pfn >= start_pfn)
>>> +   goto found;

this test is wrong.

use this:

if ((pfn <= start_pfn) && (start_pfn < pfn + (1UL << page_order(page))))
	goto found;

if (pfn > start_pfn)
return false;


>>> +   }
>>> +   return false;
>>> +
>>> +   list_for_each_entry_continue(page, &isolated_pages, lru) {
>>> +   if (page_to_pfn(page) != next_pfn)
>>> +   return false;
>>
>> where is next_pfn init-ed? 
> 
> by "goto found"

don't goto inner label.

move the found label up:

+
+found:
+   next_pfn = page_to_pfn(page);
+   list_for_each_entry_from(page, &isolated_pages, lru) {
+   if (page_to_pfn(page) != next_pfn)
+   return false;
+   pfn = page_to_pfn(page);
+   next_pfn = pfn + (1UL << page_order(page));
+   if (next_pfn >= end_pfn)
+   return true;
}
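
Putting the suggestions above together, a minimal sketch of how the reworked
check could look (illustration only, assuming the RFC's zone-wide
isolated_pages list of isolated buddy pages sorted by pfn):

static bool
__test_page_isolated_in_pageblock(unsigned long start_pfn, unsigned long end_pfn)
{
	unsigned long pfn, next_pfn;
	struct page *page;

	list_for_each_entry(page, &isolated_pages, lru) {
		pfn = page_to_pfn(page);
		if (pfn >= end_pfn)
			return false;
		/* does this isolated buddy page cover start_pfn? */
		if (pfn <= start_pfn &&
		    start_pfn < pfn + (1UL << page_order(page)))
			goto found;
		if (pfn > start_pfn)
			return false;
	}
	return false;

found:
	/* from here the isolated pages must be contiguous up to end_pfn */
	next_pfn = page_to_pfn(page);
	list_for_each_entry_from(page, &isolated_pages, lru) {
		if (page_to_pfn(page) != next_pfn)
			return false;
		pfn = page_to_pfn(page);
		next_pfn = pfn + (1UL << page_order(page));
		if (next_pfn >= end_pfn)
			return true;
	}
	return false;
}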


Re: [PATCH 10/11 V5] workqueue: unbind/rebind without manager_mutex

2012-09-06 Thread Lai Jiangshan
On 09/06/2012 04:04 AM, Tejun Heo wrote:
> Hello, Lai.
> 
> On Wed, Sep 05, 2012 at 06:37:47PM +0800, Lai Jiangshan wrote:
>> gcwq_unbind_fn() unbind manager by ->manager pointer.
>>
>> rebinding-manger, unbinding/rebinding newly created worker are done by
>> other place. so we don't need manager_mutex any more.
>>
>> Also change the comment of @bind accordingly.
> 
> Please don't scatter small prep patches like this.  Each piece in
> isolation doesn't make much sense to me and the patch descriptions
> don't help much.  Please collect the prep patches and explain in more
> detail.

There are 4 different tasks: unbind/rebind the manager and unbind/rebind a
newly created worker.

One task per patch. If I collected them into one patch, it would be hard
to explain which code does which task.

> 
> In general, I'm not sure about this approach.  I'd really like the
> hotplug logic to be contained in hotplug logic proper as much as
> possible.  This scatters around hotplug handling to usual code paths
> and seems too invasive for 3.6-fixes.

I don't expect to fix it in 3.6; no approach is simple.

> 
> Also, can you please talk to me before going ahead and sending me
> completely new 10 patch series every other day?  You're taking
> disproportionate amount of my time and I can't continue to do this.
> Please discuss with me or at least explain the high-level approach in
> the head message in detail.  Going through the patch series to figure
> out high-level design which is constantly flipping is rather
> inefficient and unfortunately your patch descriptions aren't too
> helpful.  :(
> 

I'm not good at English, so I prefer to attach code when I show my idea
(the code can prove the idea). I admit that my changelogs and comments
are always bad.


I have 4 ideas/approaches for the bug of hotplug vs. manage_workers().
They all came to my mind last week.
NOTE: this V5 patchset is my approach 2.

(listed in the order they came to my mind)
Approach 1  V3 patchset     non_manager_role_manager_mutex_unlock()
Approach 2  V5 patchset     "rebind manager, unbind/rebind newbie" are done
                            outside; no manage_mutex for hotplug
Approach 3  un-implemented  move unbind/rebind to worker_thread and handle
                            them as POOL_MANAGE_WORKERS
Approach 4  V4 patchset     manage_workers_slowpath()

Approaches 2 and 3 were partially implemented last week, but approach 2 was
quickly finished yesterday.
Approach 3 is too complicated to finish.


Approach 1: the simplest. After it, we can use manage_mutex anywhere as needed,
but we have to use non_manager_role_manager_mutex_unlock() to unlock.

Approach 2: the binding of the manager and of newly created workers is handled
outside of the hotplug code, so the hotplug code doesn't need manage_mutex.
manage_mutex is the typical protect-the-code pattern, which is not good; we
should use locks to protect data instead of code. Although the kernel has many
locks that only protect code, I think we should reduce them where possible;
the removal of the BIG KERNEL LOCK is an example. This approach also has fewer
lines of code, but it touches 2 places outside the hotplug code and adds
logic/paths. GOOD to me: it keeps manage_mutex out of the hotplug path (for
the future) without too much code.

Approach 3: complicated. It makes the call site and context of unbind/rebind
the same as manage_workers(). BAD: we are not free to use manage_mutex in the
future when needed, and it runs into some other problems (the approach you
suggested will also hit some of the problems I encountered).

Approach 4: the problem comes from manage_workers(), so just add
manage_workers_slowpath() to fix it inside manage_workers(). It fixes the
problem in only one block of code; after it, we can use manage_mutex anywhere
as needed. It has more lines of code, but all in one place. GOOD: the cleanest
approach.

Thanks
Lai



Re: [PATCH wq/for-3.6-fixes 3/3] workqueue: fix possible idle worker depletion during CPU_ONLINE

2012-09-06 Thread Lai Jiangshan
On 09/07/2012 04:08 AM, Tejun Heo wrote:
>>From 985aafbf530834a9ab16348300adc7cbf35aab76 Mon Sep 17 00:00:00 2001
> From: Tejun Heo 
> Date: Thu, 6 Sep 2012 12:50:41 -0700
> 
> To simplify both normal and CPU hotplug paths, while CPU hotplug is in
> progress, manager_mutex is held to prevent one of the workers from
> becoming a manager and creating or destroying workers; unfortunately,
> it currently may lead to idle worker depletion which in turn can lead
> to deadlock under extreme circumstances.
> 
> Idle workers aren't allowed to become busy if there's no other idle
> worker left to create more idle workers, but during CPU_ONLINE
> gcwq_associate() is holding all managerships and all the idle workers
> can proceed to become busy before gcwq_associate() is finished.

Any code which grabs the manage_mutex can cause the bug:
not only rebind_workers(), but also gcwq_unbind_fn();
not only during CPU_ONLINE, but also during CPU_DOWN_PREPARE.

> 
> This patch fixes the bug by releasing manager_mutexes before letting
> the rebound idle workers go.  This ensures that by the time idle
> workers check whether management is necessary, CPU_ONLINE already has
> released the positions.

This can't fix the problem.

+   gcwq_claim_management(gcwq);
+   spin_lock_irq(&gcwq->lock);


If manage_workers() runs between those two lines, the problem occurs.


My non_manager_role_manager_mutex_unlock() approach has the same idea: release
manage_mutex before releasing gcwq->lock. But in that approach, a worker that
fails to grab the manager lock detects the reason for the failure and goes to
sleep, and rebind_workers()/gcwq_unbind_fn() release manage_mutex and then wake
some workers up before releasing gcwq->lock.


==
A "release manage_mutex before release gcwq->lock" approach.(no one likes it, I 
think)


/* claim manager positions of all pools */
static void gcwq_claim_management_and_lock(struct global_cwq *gcwq)
{
struct worker_pool *pool, *pool_fail;

again:
spin_lock_irq(&gcwq->lock);
for_each_worker_pool(pool, gcwq) {
if (!mutex_trylock(&pool->manager_mutex))
goto fail;
}
return;

fail: /* unlikely, because manage_workers() is a very unlikely path in my box */

for_each_worker_pool(pool_fail, gcwq) {
if (pool_fail != pool)
mutex_unlock(&pool->manager_mutex);
else
break;
}
spin_unlock_irq(&gcwq->lock);
cpu_relax();
goto again;
}

/* release manager positions */
static void gcwq_release_management_and_unlock(struct global_cwq *gcwq)
{
struct worker_pool *pool;

for_each_worker_pool(pool, gcwq)
        mutex_unlock(&pool->manager_mutex);
spin_unlock_irq(&gcwq->lock);
}
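
For illustration, a hypothetical caller (the function name and body are
assumptions, not from this thread) showing how the hotplug path would use the
pair:

static void gcwq_unbind_fn_sketch(struct global_cwq *gcwq)
{
	gcwq_claim_management_and_lock(gcwq);

	/*
	 * Unbind/rebind workers here: every pool's manager position and
	 * gcwq->lock are held, so no worker can become the manager meanwhile.
	 */

	gcwq_release_management_and_unlock(gcwq);
}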


> 
> Signed-off-by: Tejun Heo 
> Reported-by: Lai Jiangshan 
> ---
>  kernel/workqueue.c |   20 ++--
>  1 files changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index b19170b..74487ef 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1454,10 +1454,19 @@ retry:
>   }
>  
>   /*
> -  * All idle workers are rebound and waiting for %WORKER_REBIND to
> -  * be cleared inside idle_worker_rebind().  Clear and release.
> -  * Clearing %WORKER_REBIND from this foreign context is safe
> -  * because these workers are still guaranteed to be idle.
> +  * At this point, each pool is guaranteed to have at least one idle
> +  * worker and all idle workers are waiting for WORKER_REBIND to
> +  * clear.  Release management before releasing idle workers;
> +  * otherwise, they can all go become busy as we're holding the
> +  * manager_mutexes, which can lead to deadlock as we don't actually
> +  * create new workers.
> +  */
> + gcwq_release_management(gcwq);
> +
> + /*
> +  * Clear %WORKER_REBIND and release.  Clearing it from this foreign
> +  * context is safe because these workers are still guaranteed to be
> +  * idle.
>*
>* We need to make sure all idle workers passed WORKER_REBIND wait
>* in idle_worker_rebind() before returning; otherwise, workers can
> @@ -1467,6 +1476,7 @@ retry:
>   INIT_COMPLETION(idle_rebind.done);
>  
>   for_each_worker_pool(pool, gcwq) {
> + WARN_ON_ONCE(list_empty(&pool->idle_list));
>   list_for_each_entry(worker, &pool->idle_list, entry) {
>   worker->flags &= ~WORKER_REBIND;
>   idle_rebind.cnt++;
> @@ -1481,8 +1491,6 @@

Re: [PATCH 05/11 V5] workqueue: Add @bind arguement back without change any thing

2012-09-06 Thread Lai Jiangshan
On 09/07/2012 12:51 AM, Tejun Heo wrote:
> Hello, Lai.
> 
> On Thu, Sep 06, 2012 at 09:04:06AM +0800, Lai Jiangshan wrote:
>>> This doesn't change anything.  You're just moving the test to the
>>> caller with comments there explaining how it won't change even if
>>> gcwq->lock is released.  It seems more confusing to me.  The flag is
>>> still protected by manager_mutex.  How is this an improvement?
>>>
>>
>> Some other bit of gcwq->flags is accessed(modified) without manager_mutex.
>> making gcwq->flags be accessed only form gcwq->lock C.S. will help the 
>> reviewer.
>>
>> I don't like adding special things/code when not-absolutely-required.
> 
> I really fail to see this.  The flag has to stay stable while
> manage_mutex is held no matter where you test it. 

Only one bit is stable; the whole flags word can be changed from outside.

I prefer the whole byte/short/int/long to be protected by the same
synchronization. I don't like different bits using different synchronization.

> It doesn't make any
> it any more readable whether you test it inside gcwq->lock with the
> comment saying "this won't change while manager_mutex is held" or just
> test it while manager_mutex is held.  It is a synchronization oddity
> no matter what and as long as it's well documented, I don't really see
> the point in the change.
> 

When I read "gcwq->flags & GCWQ_DISASSOCIATED" in create_worker, I thought:
WTF, gcwq->flags can be change by other, is it correct?. When I saw the 
comments claim
it is correct, I have to use about 30 mins to check whether it is correct in 
several
places of code in workqueue.c(include check does flags has internal state in 
all gcwq->lock).
I'm not good on it, but I think there are some reviewers will be confused like 
me.

Thanks,
Lai




Re: [PATCH wq/for-3.6-fixes 3/3] workqueue: fix possible idle worker depletion during CPU_ONLINE

2012-09-06 Thread Lai Jiangshan
On 09/07/2012 04:08 AM, Tejun Heo wrote:
>>From 985aafbf530834a9ab16348300adc7cbf35aab76 Mon Sep 17 00:00:00 2001
> From: Tejun Heo 
> Date: Thu, 6 Sep 2012 12:50:41 -0700
> 
> To simplify both normal and CPU hotplug paths, while CPU hotplug is in
> progress, manager_mutex is held to prevent one of the workers from
> becoming a manager and creating or destroying workers; unfortunately,
> it currently may lead to idle worker depletion which in turn can lead
> to deadlock under extreme circumstances.
> 
> Idle workers aren't allowed to become busy if there's no other idle
> worker left to create more idle workers, but during CPU_ONLINE
> gcwq_associate() is holding all managerships and all the idle workers
> can proceed to become busy before gcwq_associate() is finished.
> 
> This patch fixes the bug by releasing manager_mutexes before letting
> the rebound idle workers go.  This ensures that by the time idle
> workers check whether management is necessary, CPU_ONLINE already has
> released the positions.
> 

Could you review manage_workers_slowpath() in the V4 patchset?
It has enough changelog and comments.

After the discussion:

We don't move the hotplug handling outside of the hotplug code; that matches
this requirement.

Since we introduced manage_mutex, any place should be allowed to grab it when
its context allows, so this bug is not the hotplug code's responsibility.

manage_workers() just uses mutex_trylock() to grab the lock; it does not try
hard to do its job when needed, and it does not try to find out the reason
for the failure. So I think it is manage_workers()'s responsibility to handle
this bug, and a manage_workers_slowpath() is enough to fix it.

=
manage_workers_slowpath() adds only a little overhead over manage_workers(),
so we can use manage_workers_slowpath() to replace manage_workers(); that lets
us reduce the code of manage_workers() and do more cleanup of the manage path.

Thanks,
Lai


[PATCH 0/7 V6] workqueue: fix hoplug things

2012-09-08 Thread Lai Jiangshan
The patch set is based on 3b07e9ca26866697616097044f25fbe53dbab693 of wq.git

Patches 1 and 2 are accepted. Patch 1 goes to 3.6. tj has a replacement that
goes to 3.6 instead of patch 2, so patch 2 will go to 3.7. Patch 2 will need
to be rebased if the replacement is still there in 3.7.
(tj, could you help me do the rebase if I don't need to respin the patchset
as V7?)

Patches 3 and 4 fix the idle worker depletion problem and are simple enough;
they go to 3.6.

Patches 5, 6 and 7 are cleanups -> 3.7.


Lai Jiangshan (7):
  workqueue: ensure the wq_worker_sleeping() see the right flags
  workqueue: async idle rebinding
  workqueue:  add manager pointer for worker_pool
  workqueue: fix idle worker depletion
  workqueue: rename manager_mutex to assoc_mutex
  workqueue: new day don't need WORKER_REBIND for busy rebinding
  workqueue: remove WORKER_REBIND

 kernel/workqueue.c |  195 +++-
 1 files changed, 85 insertions(+), 110 deletions(-)

-- 
1.7.4.4



[PATCH 3/7 V6] workqueue: add manager pointer for worker_pool

2012-09-08 Thread Lai Jiangshan
We have this plan for manage_workers(): if we fail to grab
manager_mutex via mutex_trylock(), we will release gcwq->lock and then
grab manager_mutex again.

This plan opens a hole: hotplug can run after we release gcwq->lock,
and it would not handle the binding of the manager. So we add a ->manager
pointer to worker_pool and let the hotplug code (gcwq_unbind_fn()) handle it.

Also fix too_many_workers() to use this pointer.

Signed-off-by: Lai Jiangshan 
---
 kernel/workqueue.c |   12 ++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 3dd7ce2..b203806 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -165,6 +165,7 @@ struct worker_pool {
struct timer_list   idle_timer; /* L: worker idle timeout */
struct timer_list   mayday_timer;   /* L: SOS timer for workers */
 
+   struct worker   *manager;   /* L: manager worker */
struct mutex    manager_mutex;  /* mutex manager should hold */
struct ida  worker_ida; /* L: for worker IDs */
 };
@@ -680,7 +681,7 @@ static bool need_to_manage_workers(struct worker_pool *pool)
 /* Do we have too many workers and should some go away? */
 static bool too_many_workers(struct worker_pool *pool)
 {
-   bool managing = mutex_is_locked(&pool->manager_mutex);
+   bool managing = !!pool->manager;
int nr_idle = pool->nr_idle + managing; /* manager is considered idle */
int nr_busy = pool->nr_workers - nr_idle;
 
@@ -2066,6 +2067,7 @@ static bool manage_workers(struct worker *worker)
if (!mutex_trylock(&pool->manager_mutex))
return ret;
 
+   pool->manager = worker;
pool->flags &= ~POOL_MANAGE_WORKERS;
 
/*
@@ -2076,6 +2078,8 @@ static bool manage_workers(struct worker *worker)
ret |= maybe_create_worker(pool);
 
mutex_unlock(&pool->manager_mutex);
+   pool->manager = NULL;
+
return ret;
 }
 
@@ -3438,9 +3442,12 @@ static void gcwq_unbind_fn(struct work_struct *work)
 * ones which are still executing works from before the last CPU
 * down must be on the cpu.  After this, they may become diasporas.
 */
-   for_each_worker_pool(pool, gcwq)
+   for_each_worker_pool(pool, gcwq) {
list_for_each_entry(worker, &pool->idle_list, entry)
worker->flags |= WORKER_UNBOUND;
+   if (pool->manager)
+   pool->manager->flags |= WORKER_UNBOUND;
+   }
 
for_each_busy_worker(worker, i, pos, gcwq)
worker->flags |= WORKER_UNBOUND;
@@ -3760,6 +3767,7 @@ static int __init init_workqueues(void)
setup_timer(&pool->mayday_timer, gcwq_mayday_timeout,
(unsigned long)pool);
 
+   pool->manager = NULL;
mutex_init(&pool->manager_mutex);
ida_init(&pool->worker_ida);
}
-- 
1.7.4.4



[PATCH 1/7 V6] workqueue: ensure the wq_worker_sleeping() see the right flags

2012-09-08 Thread Lai Jiangshan
The compiler may compile this code into TWO write/modify instructions:
worker->flags &= ~WORKER_UNBOUND;
worker->flags |= WORKER_REBIND;

so another CPU may see a temporary value of worker->flags which has
neither WORKER_UNBOUND nor WORKER_REBIND set, and will wrongly do a local
wake-up.

So we explicitly use one write/modify instruction instead.

This bug will not occur for idle workers, because they have another
WORKER_NOT_RUNNING flag set.
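
For illustration only (this just annotates the hunk below, it is not new
code): the racy sequence and the single-store fix look like this:

	/* racy: two separate stores; a concurrent wq_worker_sleeping() can
	 * observe flags with neither WORKER_UNBOUND nor WORKER_REBIND set */
	worker->flags &= ~WORKER_UNBOUND;
	worker->flags |= WORKER_REBIND;

	/* fixed: build the new value locally, publish it with one store */
	unsigned long worker_flags = worker->flags;

	worker_flags &= ~WORKER_UNBOUND;
	worker_flags |= WORKER_REBIND;
	ACCESS_ONCE(worker->flags) = worker_flags;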

Signed-off-by: Lai Jiangshan 
---
 kernel/workqueue.c |7 +--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 85bd340..050b2a5 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1739,10 +1739,13 @@ retry:
for_each_busy_worker(worker, i, pos, gcwq) {
struct work_struct *rebind_work = &worker->rebind_work;
struct workqueue_struct *wq;
+   unsigned long worker_flags = worker->flags;
 
/* morph UNBOUND to REBIND */
-   worker->flags &= ~WORKER_UNBOUND;
-   worker->flags |= WORKER_REBIND;
+   worker_flags &= ~WORKER_UNBOUND;
+   worker_flags |= WORKER_REBIND;
+   /* ensure the wq_worker_sleeping() see the right flags */
+   ACCESS_ONCE(worker->flags) = worker_flags;
 
if (test_and_set_bit(WORK_STRUCT_PENDING_BIT,
 work_data_bits(rebind_work)))
-- 
1.7.4.4



[PATCH 6/7 V6] workqueue: new day don't need WORKER_REBIND for busy rebinding

2012-09-08 Thread Lai Jiangshan
The old busy_worker_rebind_fn() had to wait until all idle workers finished,
so we had to use two flags, WORKER_UNBOUND and WORKER_REBIND, to avoid
prematurely clearing all NOT_RUNNING bits under highly frequent offline/online.

But the current code doesn't need to wait for idle workers, so we don't need
two flags; one is enough. Remove WORKER_REBIND from busy rebinding.

Signed-off-by: Lai Jiangshan 
---
 kernel/workqueue.c |9 +
 1 files changed, 1 insertions(+), 8 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index d9765c4..4863162 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1639,7 +1639,7 @@ static void busy_worker_rebind_fn(struct work_struct *work)
struct global_cwq *gcwq = worker->pool->gcwq;
 
if (worker_maybe_bind_and_lock(worker))
-   worker_clr_flags(worker, WORKER_REBIND);
+   worker_clr_flags(worker, WORKER_UNBOUND);
 
spin_unlock_irq(&gcwq->lock);
 }
@@ -1692,13 +1692,6 @@ static void rebind_workers(struct global_cwq *gcwq)
for_each_busy_worker(worker, i, pos, gcwq) {
struct work_struct *rebind_work = &worker->rebind_work;
struct workqueue_struct *wq;
-   unsigned long worker_flags = worker->flags;
-
-   /* morph UNBOUND to REBIND */
-   worker_flags &= ~WORKER_UNBOUND;
-   worker_flags |= WORKER_REBIND;
-   /* ensure the wq_worker_sleeping() see the right flags */
-   ACCESS_ONCE(worker->flags) = worker_flags;
 
if (test_and_set_bit(WORK_STRUCT_PENDING_BIT,
 work_data_bits(rebind_work)))
-- 
1.7.4.4



[PATCH 5/7 V6] workqueue: rename manager_mutex to assoc_mutex

2012-09-08 Thread Lai Jiangshan
The name assoc_mutex is clearer: it protects GCWQ_DISASSOCIATED.

And the critical section of assoc_mutex is narrowed: it protects
create_worker()+start_worker(), which require GCWQ_DISASSOCIATED to be
stable; it doesn't need to protect the whole of manage_workers().

One result of the narrowed critical section is that maybe_rebind_manager()
has to be moved to the bottom of manage_workers().

Another result of the narrowed critical section is that manage_workers()
becomes simpler.

Signed-off-by: Lai Jiangshan 
---
 kernel/workqueue.c |   63 
 1 files changed, 29 insertions(+), 34 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 207b6a1..d9765c4 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -58,7 +58,7 @@ enum {
 * be executing on any CPU.  The gcwq behaves as an unbound one.
 *
 * Note that DISASSOCIATED can be flipped only while holding
-* managership of all pools on the gcwq to avoid changing binding
+* assoc_mutex of all pools on the gcwq to avoid changing binding
 * state while create_worker() is in progress.
 */
GCWQ_DISASSOCIATED  = 1 << 0,   /* cpu can't serve workers */
@@ -166,7 +166,7 @@ struct worker_pool {
struct timer_list   mayday_timer;   /* L: SOS timer for workers */
 
struct worker   *manager;   /* L: manager worker */
-   struct mutex    manager_mutex;  /* mutex manager should hold */
+   struct mutex    assoc_mutex;    /* protect GCWQ_DISASSOCIATED */
struct ida  worker_ida; /* L: for worker IDs */
 };
 
@@ -1673,7 +1673,7 @@ static void rebind_workers(struct global_cwq *gcwq)
lockdep_assert_held(&gcwq->lock);
 
for_each_worker_pool(pool, gcwq)
-   lockdep_assert_held(&pool->manager_mutex);
+   lockdep_assert_held(&pool->assoc_mutex);
 
/* set REBIND and kick idle ones */
for_each_worker_pool(pool, gcwq) {
@@ -1975,15 +1975,18 @@ restart:
while (true) {
struct worker *worker;
 
+   mutex_lock(&pool->assoc_mutex);
worker = create_worker(pool);
if (worker) {
del_timer_sync(&pool->mayday_timer);
spin_lock_irq(&gcwq->lock);
start_worker(worker);
BUG_ON(need_to_create_worker(pool));
+   mutex_unlock(&pool->assoc_mutex);
return true;
}
 
+   mutex_unlock(&pool->assoc_mutex);
if (!need_to_create_worker(pool))
break;
 
@@ -2040,7 +2043,7 @@ static bool maybe_destroy_workers(struct worker_pool *pool)
 }
 
 /* does the manager need to be rebind after we just release gcwq->lock */
-static void maybe_rebind_manager(struct worker *manager)
+static bool maybe_rebind_manager(struct worker *manager)
 {
struct global_cwq *gcwq = manager->pool->gcwq;
bool assoc = !(gcwq->flags & GCWQ_DISASSOCIATED);
@@ -2050,7 +2053,11 @@ static void maybe_rebind_manager(struct worker *manager)
 
if (worker_maybe_bind_and_lock(manager))
worker_clr_flags(manager, WORKER_UNBOUND);
+
+   return true;
}
+
+   return false;
 }
 
 /**
@@ -2061,9 +2068,7 @@ static void maybe_rebind_manager(struct worker *manager)
  * to.  At any given time, there can be only zero or one manager per
  * gcwq.  The exclusion is handled automatically by this function.
  *
- * The caller can safely start processing works on false return.  On
- * true return, it's guaranteed that need_to_create_worker() is false
- * and may_start_working() is true.
+ * The caller can safely start processing works on false return.
  *
  * CONTEXT:
  * spin_lock_irq(gcwq->lock) which may be released and regrabbed
@@ -2076,29 +2081,12 @@ static void maybe_rebind_manager(struct worker *manager)
 static bool manage_workers(struct worker *worker)
 {
struct worker_pool *pool = worker->pool;
-   struct global_cwq *gcwq = pool->gcwq;
bool ret = false;
 
if (pool->manager)
return ret;
 
pool->manager = worker;
-   if (unlikely(!mutex_trylock(&pool->manager_mutex))) {
-   /*
-* Ouch! rebind_workers() or gcwq_unbind_fn() beats we.
-* it can't return false here, otherwise it will lead to
-* worker depletion. So we release gcwq->lock and then
-* grab manager_mutex again.
-*/
-   spin_unlock_irq(&gcwq->lock);
-   mutex_lock(&pool->manager_mutex);
-   spin_lock_irq(&gcwq->lock);
-
-   /* rebind_workers() can happen when we release gcwq->lock */
-   maybe_rebind_ma

[PATCH 4/7 V6] workqueue: fix idle worker depletion

2012-09-08 Thread Lai Jiangshan
If the hotplug code has grabbed the manager_mutex and worker_thread() tries to
create a worker, manage_workers() will return false and worker_thread() will
go on to process work items. Now, on that CPU, all workers are processing work
items and no idle worker is left/ready for managing. This breaks the concept
of workqueue and is a bug.

So when manage_workers() fails to grab the manager_mutex, it should
release gcwq->lock and try again.

After gcwq->lock is released, hotplug can happen. gcwq_unbind_fn() will
do the right thing for the manager via ->manager. But rebind_workers()
can't rebind the manager directly; the manager rebinds itself when it notices.

The manager notices via the GCWQ_DISASSOCIATED and WORKER_UNBOUND bits,
because the %UNBOUND bit of the manager can't be cleared while it is managing
workers. So maybe_rebind_manager() notices when rebind_workers() has happened.

Signed-off-by: Lai Jiangshan 
---
 kernel/workqueue.c |   33 -
 1 files changed, 32 insertions(+), 1 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index b203806..207b6a1 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2039,6 +2039,20 @@ static bool maybe_destroy_workers(struct worker_pool *pool)
return ret;
 }
 
+/* does the manager need to be rebind after we just release gcwq->lock */
+static void maybe_rebind_manager(struct worker *manager)
+{
+   struct global_cwq *gcwq = manager->pool->gcwq;
+   bool assoc = !(gcwq->flags & GCWQ_DISASSOCIATED);
+
+   if (assoc && (manager->flags & WORKER_UNBOUND)) {
+   spin_unlock_irq(&gcwq->lock);
+
+   if (worker_maybe_bind_and_lock(manager))
+   worker_clr_flags(manager, WORKER_UNBOUND);
+   }
+}
+
 /**
  * manage_workers - manage worker pool
  * @worker: self
@@ -2062,12 +2076,29 @@ static bool maybe_destroy_workers(struct worker_pool *pool)
 static bool manage_workers(struct worker *worker)
 {
struct worker_pool *pool = worker->pool;
+   struct global_cwq *gcwq = pool->gcwq;
bool ret = false;
 
-   if (!mutex_trylock(&pool->manager_mutex))
+   if (pool->manager)
return ret;
 
pool->manager = worker;
+   if (unlikely(!mutex_trylock(&pool->manager_mutex))) {
+   /*
+* Ouch! rebind_workers() or gcwq_unbind_fn() beats we.
+* it can't return false here, otherwise it will lead to
+* worker depletion. So we release gcwq->lock and then
+* grab manager_mutex again.
+*/
+   spin_unlock_irq(&gcwq->lock);
+   mutex_lock(&pool->manager_mutex);
+   spin_lock_irq(&gcwq->lock);
+
+   /* rebind_workers() can happen when we release gcwq->lock */
+   maybe_rebind_manager(worker);
+   ret = true;
+   }
+
pool->flags &= ~POOL_MANAGE_WORKERS;
 
/*
-- 
1.7.4.4



[PATCH 7/7 V6] workqueue: remove WORKER_REBIND

2012-09-08 Thread Lai Jiangshan
The exile operation is list_del_init(&worker->entry), and the destroy
operation also does list_del_init(&worker->entry).

So we can use list_empty(&worker->entry) to know whether the worker
has been exiled or killed.

WORKER_REBIND is not needed any more; remove it to reduce the number of
worker states.

Signed-off-by: Lai Jiangshan 
---
 kernel/workqueue.c |   21 -
 1 files changed, 8 insertions(+), 13 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 4863162..aa46308 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -72,11 +72,10 @@ enum {
WORKER_DIE  = 1 << 1,   /* die die die */
WORKER_IDLE = 1 << 2,   /* is idle */
WORKER_PREP = 1 << 3,   /* preparing to run works */
-   WORKER_REBIND   = 1 << 5,   /* mom is home, come back */
WORKER_CPU_INTENSIVE= 1 << 6,   /* cpu intensive */
WORKER_UNBOUND  = 1 << 7,   /* worker is unbound */
 
-   WORKER_NOT_RUNNING  = WORKER_PREP | WORKER_REBIND | WORKER_UNBOUND |
+   WORKER_NOT_RUNNING  = WORKER_PREP | WORKER_UNBOUND |
  WORKER_CPU_INTENSIVE,
 
NR_WORKER_POOLS = 2,/* # worker pools per gcwq */
@@ -1613,7 +1612,7 @@ __acquires(&gcwq->lock)
 
 /*
  * Rebind an idle @worker to its CPU. worker_thread() will test
- * %WORKER_REBIND before leaving idle and call this function.
+ * worker->entry before leaving idle and call this function.
  */
 static void idle_worker_rebind(struct worker *worker)
 {
@@ -1622,7 +1621,6 @@ static void idle_worker_rebind(struct worker *worker)
if (worker_maybe_bind_and_lock(worker))
worker_clr_flags(worker, WORKER_UNBOUND);
 
-   worker_clr_flags(worker, WORKER_REBIND);
list_add(&worker->entry, &worker->pool->idle_list);
spin_unlock_irq(&gcwq->lock);
 }
@@ -1675,11 +1673,9 @@ static void rebind_workers(struct global_cwq *gcwq)
for_each_worker_pool(pool, gcwq)
lockdep_assert_held(&pool->assoc_mutex);
 
-   /* set REBIND and kick idle ones */
+   /* exile and kick idle ones */
for_each_worker_pool(pool, gcwq) {
list_for_each_entry_safe(worker, n, &pool->idle_list, entry) {
-   worker->flags |= WORKER_REBIND;
-
/* exile idle workers */
list_del_init(&worker->entry);
 
@@ -2145,7 +2141,7 @@ __acquires(&gcwq->lock)
 * necessary to avoid spurious warnings from rescuers servicing the
 * unbound or a disassociated gcwq.
 */
-   WARN_ON_ONCE(!(worker->flags & (WORKER_UNBOUND | WORKER_REBIND)) &&
+   WARN_ON_ONCE(!(worker->flags & WORKER_UNBOUND) &&
 !(gcwq->flags & GCWQ_DISASSOCIATED) &&
 raw_smp_processor_id() != gcwq->cpu);
 
@@ -2269,18 +2265,17 @@ static int worker_thread(void *__worker)
 woke_up:
spin_lock_irq(&gcwq->lock);
 
-   /*
-* DIE can be set only while idle and REBIND set while busy has
-* @worker->rebind_work scheduled.  Checking here is enough.
-*/
-   if (unlikely(worker->flags & (WORKER_REBIND | WORKER_DIE))) {
+   /* Is it still home ? */
+   if (unlikely(list_empty(&worker->entry))) {
spin_unlock_irq(&gcwq->lock);
 
+   /* reason: DIE */
if (worker->flags & WORKER_DIE) {
worker->task->flags &= ~PF_WQ_WORKER;
return 0;
}
 
+   /* reason: idle rebind */
idle_worker_rebind(worker);
goto woke_up;
}
-- 
1.7.4.4



[PATCH 2/7 V6] workqueue: async idle rebinding

2012-09-08 Thread Lai Jiangshan
fix deadlock in rebind_workers()

Current idle_worker_rebind() has a bug.

idle_worker_rebind() path               HOTPLUG path
                                        online
                                        rebind_workers()
wait_event(gcwq->rebind_hold)
woken up but not scheduled              rebind_workers() returns
                                        the same cpu offline
                                        the same cpu online again
                                        rebind_workers()
                                        set WORKER_REBIND
scheduled, see the WORKER_REBIND
wait rebind_workers() clear it <--bug-->  wait idle_worker_rebind() rebound

The two threads wait for each other. It is a bug.

This patch gives up rebinding idle workers synchronously and makes the rebind
asynchronous instead.

To avoid wrongly doing a local wake-up, we add an exile operation for
idle-worker rebinding. When an idle worker is exiled, it is not queued
on @idle_list until it is rebound.

Once we have the exile operation, @nr_idle is not only the count of
@idle_list but also includes exiled idle workers, so I checked all the code
and made it exile-aware (too_many_workers()).

The exile operation is also the core idea for rebinding newly created workers
(patch 9).

rebind_workers() becomes single-pass and doesn't release gcwq->lock.

Signed-off-by: Lai Jiangshan 
---
 kernel/workqueue.c |  100 +---
 1 files changed, 25 insertions(+), 75 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 050b2a5..3dd7ce2 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -125,7 +125,6 @@ enum {
 
 struct global_cwq;
 struct worker_pool;
-struct idle_rebind;
 
 /*
  * The poor guys doing the actual heavy lifting.  All on-duty workers
@@ -149,7 +148,6 @@ struct worker {
int id; /* I: worker id */
 
/* for rebinding worker to CPU */
-   struct idle_rebind  *idle_rebind;   /* L: for idle worker */
struct work_struct  rebind_work;/* L: for busy worker */
 };
 
@@ -159,7 +157,9 @@ struct worker_pool {
 
struct list_headworklist;   /* L: list of pending works */
int nr_workers; /* L: total number of workers */
-   int nr_idle;/* L: currently idle ones */
+   int nr_idle;/* L: currently idle ones,
+ include ones in idle_list
+ and in doing rebind. */
 
struct list_headidle_list;  /* X: list of idle workers */
struct timer_list   idle_timer; /* L: worker idle timeout */
@@ -185,8 +185,6 @@ struct global_cwq {
 
struct worker_pool  pools[NR_WORKER_POOLS];
/* normal and highpri pools */
-
-   wait_queue_head_t   rebind_hold;/* rebind hold wait */
 } cacheline_aligned_in_smp;
 
 /*
@@ -686,6 +684,10 @@ static bool too_many_workers(struct worker_pool *pool)
int nr_idle = pool->nr_idle + managing; /* manager is considered idle */
int nr_busy = pool->nr_workers - nr_idle;
 
+   /* Is any idle home ? */
+   if (unlikely(list_empty(&pool->idle_list)))
+   return false;
+
return nr_idle > 2 && (nr_idle - 2) * MAX_IDLE_WORKERS_RATIO >= nr_busy;
 }
 
@@ -1608,28 +1610,20 @@ __acquires(&gcwq->lock)
}
 }
 
-struct idle_rebind {
-   int cnt;/* # workers to be rebound */
-   struct completion   done;   /* all workers rebound */
-};
-
 /*
- * Rebind an idle @worker to its CPU.  During CPU onlining, this has to
- * happen synchronously for idle workers.  worker_thread() will test
+ * Rebind an idle @worker to its CPU. worker_thread() will test
  * %WORKER_REBIND before leaving idle and call this function.
  */
 static void idle_worker_rebind(struct worker *worker)
 {
struct global_cwq *gcwq = worker->pool->gcwq;
 
-   /* CPU must be online at this point */
-   WARN_ON(!worker_maybe_bind_and_lock(worker));
-   if (!--worker->idle_rebind->cnt)
-   complete(&worker->idle_rebind->done);
-   spin_unlock_irq(&worker->pool->gcwq->lock);
+   if (worker_maybe_bind_and_lock(worker))
+   worker_clr_flags(worker, WORKER_UNBOUND);
 
-   /* we did our part, wait for rebind_workers() to finish up */
-   wait_event(gcwq->rebind_hold, !(worker->flags & WORKER_REBIND));
+   worker_clr_flags(worker, WORKER_REBIND);
+   list_add(&worker->entry, &w

Re: [PATCH wq/for-3.6-fixes 3/3] workqueue: fix possible idle worker depletion during CPU_ONLINE

2012-09-08 Thread Lai Jiangshan
On Sat, Sep 8, 2012 at 7:41 AM, Tejun Heo  wrote:
> I think this should do it.  Can you spot any hole with the following
> patch?
>
> Thanks.
>
> Index: work/kernel/workqueue.c
> ===
> --- work.orig/kernel/workqueue.c
> +++ work/kernel/workqueue.c
> @@ -66,6 +66,7 @@ enum {
>
> /* pool flags */
> POOL_MANAGE_WORKERS = 1 << 0,   /* need to manage workers */
> +   POOL_MANAGING_WORKERS   = 1 << 1,   /* managing workers */
>
> /* worker flags */
> WORKER_STARTED  = 1 << 0,   /* started */
> @@ -165,7 +166,7 @@ struct worker_pool {
> struct timer_list   idle_timer; /* L: worker idle timeout */
> struct timer_list   mayday_timer;   /* L: SOS timer for workers */
>
> -   struct mutex    manager_mutex;  /* mutex manager should hold */
> +   struct mutex    manager_mutex;  /* manager <-> CPU hotplug */
> struct ida  worker_ida; /* L: for worker IDs */
>  };
>
> @@ -480,6 +481,7 @@ static atomic_t unbound_pool_nr_running[
>  };
>
>  static int worker_thread(void *__worker);
> +static void process_scheduled_works(struct worker *worker);
>
>  static int worker_pool_pri(struct worker_pool *pool)
>  {
> @@ -652,7 +654,7 @@ static bool need_to_manage_workers(struc
>  /* Do we have too many workers and should some go away? */
>  static bool too_many_workers(struct worker_pool *pool)
>  {
> -   bool managing = mutex_is_locked(&pool->manager_mutex);
> +   bool managing = pool->flags & POOL_MANAGING_WORKERS;
> int nr_idle = pool->nr_idle + managing; /* manager is considered idle 
> */
> int nr_busy = pool->nr_workers - nr_idle;
>
> @@ -1820,14 +1822,43 @@ static bool maybe_destroy_workers(struct
>   * some action was taken.
>   */
>  static bool manage_workers(struct worker *worker)
> +   __releases(&gcwq->lock) __acquires(&gcwq->lock)
>  {
> struct worker_pool *pool = worker->pool;
> +   struct global_cwq *gcwq = pool->gcwq;
> bool ret = false;
>
> -   if (!mutex_trylock(&pool->manager_mutex))
> -   return ret;
> +   if (pool->flags & POOL_MANAGING_WORKERS)
> +   return ret;
>
> pool->flags &= ~POOL_MANAGE_WORKERS;
> +   pool->flags |= POOL_MANAGING_WORKERS;
> +
> +   /*
> +* To simplify both worker management and CPU hotplug, hold off
> +* management while hotplug is in progress.  CPU hotplug path can't
> +* grab %POOL_MANAGING_WORKERS to achieve this because that can
> +* lead to idle worker depletion (all become busy thinking someone
> +* else is managing) which in turn can result in deadlock under
> +* extreme circumstances.
> +*
> +* manager_mutex would always be free unless CPU hotplug is in
> +* progress.  trylock first without dropping gcwq->lock.
> +*/
> +   if (unlikely(!mutex_trylock(&pool->manager_mutex))) {
> +   spin_unlock_irq(&gcwq->lock);

hotplug can happen here.

> +   mutex_lock(&pool->manager_mutex);
> +   spin_lock_irq(&gcwq->lock);
> +
> +   /*
> +* CPU hotplug could have scheduled rebind_work while we're
> +* waiting for manager_mutex.  Rebind before doing anything
> +* else.  This has to be handled here.  worker_thread()
> +* will be confused by the unexpected work item.
> +*/
> +   process_scheduled_works(worker);

The hotplug code can't reach the manager when it iterates the workers: the manager gets neither rebind_work nor UNBOUND.

> +   ret = true;
> +   }
>
> /*
>  * Destroy and then create so that may_start_working() is true
> @@ -1836,7 +1867,9 @@ static bool manage_workers(struct worker
> ret |= maybe_destroy_workers(pool);
> ret |= maybe_create_worker(pool);
>
> +   pool->flags &= ~POOL_MANAGING_WORKERS;
> mutex_unlock(&pool->manager_mutex);
> +
> return ret;
>  }
>
> @@ -3393,7 +3426,7 @@ EXPORT_SYMBOL_GPL(work_busy);
>   * cpu comes back online.
>   */
>
> -/* claim manager positions of all pools */
> +/* claim manager positions of all pools, see manage_workers() for details */
>  static void gcwq_claim_management_and_lock(struct global_cwq *gcwq)
>  {
> struct worker_pool *pool;


Re: [PATCH 0/7 V6] workqueue: fix hoplug things

2012-09-08 Thread Lai Jiangshan
On Sun, Sep 9, 2012 at 1:27 AM, Lai Jiangshan  wrote:
> On Sun, Sep 9, 2012 at 1:12 AM, Lai Jiangshan  wrote:
>> The patch set is based on 3b07e9ca26866697616097044f25fbe53dbab693 of wq.git
>>
>> Patch 1,2 are accepted. Patch 1 goes to 3.6. tj has a replacement goes
>> to 3.6 instead of Patch 2. so Patch2 will go to 3.7. Patch2 will need
>> to be rebased if the replacement is still in 3.7.
>> (tj, could you help me do the rebase if I don't need to respin the patchset
>> as V7 ?)
>>
>> Patch3,4 fix depletion problem, it is simple enough. it goes to 3.6.
>
> sorry.
> 3.6 is synchronous idles when we use tj's replacement for patch2.
> and maybe_rebind_manager() don't wait for idles rebind. so it can't go to 3.6.
>
> Choice1: also push Patch 2(async idle rebinding) to 3.6? thus patch 4
> can goto 3.6 too.
> Choice2: add workaroud and make patch4 and make it go to 3.6. (add some code.)
>

Sorry again, the above worry is incorrect.
maybe_rebind_manager() DOES wait for the idle workers to rebind, via
mutex_lock(manager_mutex), so it is safe for 3.6.

Sorry, don't worry about anything; I was just thinking about it without
looking at the code.

Thanks
Lai

>
>>
>> Patch 5,6,7 are clean up. -> 3.7
>>
>>
>> Lai Jiangshan (7):
>>   workqueue: ensure the wq_worker_sleeping() see the right flags
>>   workqueue: async idle rebinding
>>   workqueue:  add manager pointer for worker_pool
>>   workqueue: fix idle worker depletion
>>   workqueue: rename manager_mutex to assoc_mutex
>>   workqueue: new day don't need WORKER_REBIND for busy rebinding
>>   workqueue: remove WORKER_REBIND
>>
>>  kernel/workqueue.c |  195 
>> +++-
>>  1 files changed, 85 insertions(+), 110 deletions(-)
>>
>> --
>> 1.7.4.4
>>


Re: [PATCH 0/7 V6] workqueue: fix hoplug things

2012-09-08 Thread Lai Jiangshan
On Sun, Sep 9, 2012 at 1:12 AM, Lai Jiangshan  wrote:
> The patch set is based on 3b07e9ca26866697616097044f25fbe53dbab693 of wq.git
>
> Patch 1,2 are accepted. Patch 1 goes to 3.6. tj has a replacement goes
> to 3.6 instead of Patch 2. so Patch2 will go to 3.7. Patch2 will need
> to be rebased if the replacement is still in 3.7.
> (tj, could you help me do the rebase if I don't need to respin the patchset
> as V7 ?)
>
> Patch3,4 fix depletion problem, it is simple enough. it goes to 3.6.

Sorry.
With tj's replacement for patch 2, 3.6 rebinds idle workers synchronously,
and maybe_rebind_manager() doesn't wait for the idle rebind, so it can't go
to 3.6 as-is.

Choice 1: also push patch 2 (async idle rebinding) to 3.6, so that patch 4
can go to 3.6 too.
Choice 2: add a workaround to patch 4 and let it go to 3.6 (adds some code).

Thanks.
Lai

>
> Patch 5,6,7 are clean up. -> 3.7
>
>
> Lai Jiangshan (7):
>   workqueue: ensure the wq_worker_sleeping() see the right flags
>   workqueue: async idle rebinding
>   workqueue:  add manager pointer for worker_pool
>   workqueue: fix idle worker depletion
>   workqueue: rename manager_mutex to assoc_mutex
>   workqueue: new day don't need WORKER_REBIND for busy rebinding
>   workqueue: remove WORKER_REBIND
>
>  kernel/workqueue.c |  195 
> +++-
>  1 files changed, 85 insertions(+), 110 deletions(-)
>
> --
> 1.7.4.4
>


Re: [PATCH wq/for-3.6-fixes 3/3] workqueue: fix possible idle worker depletion during CPU_ONLINE

2012-09-08 Thread Lai Jiangshan
On Sun, Sep 9, 2012 at 1:32 AM, Tejun Heo  wrote:
> On Sat, Sep 08, 2012 at 10:29:50AM -0700, Tejun Heo wrote:
>> > hotplug code can't iterate manager.  not rebind_work() nor UNBOUND for 
>> > manager.
>>
>> Ah, right.  It isn't either on idle or busy list.  Maybe have
>> pool->manager pointer?
>
> Ooh, this is what you did with the new patchset, right?

I already did it in the V5 patchset, not in the new patchset; I just changed
it the way you like in V6.
I changed the strategy of calling maybe_rebind_manager().

Thanks.
Lai

>
> --
> tejun


Re: [PATCH 0/7 V6] workqueue: fix hoplug things

2012-09-08 Thread Lai Jiangshan
On Sun, Sep 9, 2012 at 1:37 AM, Tejun Heo  wrote:
> Hello, Lai.
>
> On Sun, Sep 09, 2012 at 01:27:37AM +0800, Lai Jiangshan wrote:
>> On Sun, Sep 9, 2012 at 1:12 AM, Lai Jiangshan  wrote:
>> > The patch set is based on 3b07e9ca26866697616097044f25fbe53dbab693 of 
>> > wq.git
>> >
>> > Patch 1,2 are accepted. Patch 1 goes to 3.6. tj has a replacement goes
>> > to 3.6 instead of Patch 2. so Patch2 will go to 3.7. Patch2 will need
>> > to be rebased if the replacement is still in 3.7.
>> > (tj, could you help me do the rebase if I don't need to respin the patchset
>> > as V7 ?)
>> >
>> > Patch3,4 fix depletion problem, it is simple enough. it goes to 3.6.
>>
>> sorry.
>> 3.6 is synchronous idles when we use tj's replacement for patch2.
>> and maybe_rebind_manager() don't wait for idles rebind. so it can't go to 
>> 3.6.
>
> Let's get the fix down first.  I *think* we can do it for 3.6-fixes.
> Can't we do the following?
>
> * Instead of MANAGING, add pool->manager.
>
> * Fix the idle depletion bug by using pool->manager for exclusion and
>   always grabbing pool->manager_mutex.  Hotplug can use pool->manager
>   to schedule rebind work (or UNBIND) to the manager.
>
> Thoughts?

No need; my worry was wrong.

>
> Also, can you please base the fix patches on top of wq/for-3.6-fixes?
> It's getting quite confusing.
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git for-3.6-fixes

I based them on wq/for-3.7 from several days ago.

I can change the base, but which branch should patches 5, 6 and 7 be based on?


Thanks.
Lai


Re: [PATCH 4/7 V6] workqueue: fix idle worker depletion

2012-09-08 Thread Lai Jiangshan
On Sun, Sep 9, 2012 at 1:40 AM, Tejun Heo  wrote:
> Hello, Lai.
>
> On Sun, Sep 09, 2012 at 01:12:53AM +0800, Lai Jiangshan wrote:
>> +/* does the manager need to be rebind after we just release gcwq->lock */
>> +static void maybe_rebind_manager(struct worker *manager)
>> +{
>> + struct global_cwq *gcwq = manager->pool->gcwq;
>> + bool assoc = !(gcwq->flags & GCWQ_DISASSOCIATED);
>> +
>> + if (assoc && (manager->flags & WORKER_UNBOUND)) {
>> + spin_unlock_irq(&gcwq->lock);
>> +
>> + if (worker_maybe_bind_and_lock(manager))
>> + worker_clr_flags(manager, WORKER_UNBOUND);
>> + }
>> +}
>
> We can reuse busy_worker_rebind_fn(), right?

busy_worker_rebind_fn() releases gcwq->lock; we can't release the lock here.

>
>>   pool->manager = worker;
>> + if (unlikely(!mutex_trylock(&pool->manager_mutex))) {
>> + /*
>> +  * Ouch! rebind_workers() or gcwq_unbind_fn() beats we.
>> +  * it can't return false here, otherwise it will lead to
>> +  * worker depletion. So we release gcwq->lock and then
>> +  * grab manager_mutex again.
>> +  */
>> + spin_unlock_irq(&gcwq->lock);
>> + mutex_lock(&pool->manager_mutex);
>> + spin_lock_irq(&gcwq->lock);
>> +
>> + /* rebind_workers() can happen when we release gcwq->lock */
>> + maybe_rebind_manager(worker);
>
> And we can call process_scheduled_works() here and make the CPU
> hotplug check pool->manager and schedule rebind_work there.
>

sorry again. don't need.

Thanks.
Lai

>
> --
> tejun


Re: [PATCH 0/7 V6] workqueue: fix hoplug things

2012-09-08 Thread Lai Jiangshan
On Sun, Sep 9, 2012 at 1:50 AM, Tejun Heo  wrote:
> Hello, Lai.
>
> On Sun, Sep 09, 2012 at 01:46:59AM +0800, Lai Jiangshan wrote:
>> > * Instead of MANAGING, add pool->manager.
>> >
>> > * Fix the idle depletion bug by using pool->manager for exclusion and
>> >   always grabbing pool->manager_mutex.  Hotplug can use pool->manager
>> >   to schedule rebind work (or UNBIND) to the manager.
>> >
>> > Thoughts?
>>
>> Don't need.  my worry is wrong.
>
> So, your worry was incorrect and the above is what we're gonna do, no?

No bug found in patch 4, even when idle rebinding is synchronous.


Re: [PATCH 4/7 V6] workqueue: fix idle worker depletion

2012-09-08 Thread Lai Jiangshan
On Sun, Sep 9, 2012 at 1:53 AM, Tejun Heo  wrote:
> Hello,
>
> On Sun, Sep 09, 2012 at 01:50:41AM +0800, Lai Jiangshan wrote:
>> >> + if (worker_maybe_bind_and_lock(manager))
>> >> + worker_clr_flags(manager, WORKER_UNBOUND);
>> >> + }
>> >> +}
>> >
>> > We can reuse busy_worker_rebind_fn(), right?
>>
>> busy_worker_rebind_fn() releases the gcwq->lock. we can't release
>> the lock here.
>
> Why so?  Can you please elaborate?
>

When we release gcwq->lock and then grab it again, we leave a hole in which
things can change.

I don't want to open such a hole. If the hole has a bug, we have to fix it;
if the hole has no bug, we have to add a lot of comments to explain it.

While writing this reply I was thinking: would the hole have a bug if I
released gcwq->lock here? Result: no. But I don't want to add everything I
had to think through as comments explaining why there is no bug even when we
open the hole; let's not put too much burden on reviewers.

Thanks.
Lai.

>
> --
> tejun


Re: [PATCH 0/7 V6] workqueue: fix hoplug things

2012-09-08 Thread Lai Jiangshan
>
> Hmmm... so, I'm having some difficulty communicating with you.  We
> need two separate patch series.  One for for-3.6-fixes and the other
> for restructuring on top of for-3.7 after the fixes are merged into
> it.
>
> As you currently posted, the patches are based on for-3.7 and fixes
> and restructuring are intermixed, so I'm asking you to separate out
> two patches to fix the idle depletion bug and base them on top of
> for-3.6-fixes.  Am I misunderstanding something?
>


Patches 3 and 4 are ready for 3.6 and can be applied to it directly;
no need to revise or resend.
You can just pick them up into for-3.6-fixes.

Thanks.
Lai


Re: [PATCH 4/7 V6] workqueue: fix idle worker depletion

2012-09-08 Thread Lai Jiangshan
On Sun, Sep 9, 2012 at 2:11 AM, Tejun Heo  wrote:
> On Sun, Sep 09, 2012 at 02:07:50AM +0800, Lai Jiangshan wrote:
>> when we release gcwq->lock and then grab it, we leave a hole that things
>> can be changed.
>>
>> I don't want to open a hole. if the hole has bug we have to fix it.
>> if the hole has no bug, we have to add lot of comments to explain it.
>>
>> When I write this reply. I am thinking: is the hole  has bug if
>> I release gcwq->lock here? result: no. But I don't want to add all things
>> what I have thought as comments to explain there is no bug even when we
>> open a hole. don't leave reviewers too much burden.
>
> We're already releasing gcwq->lock in maybe_create_worker().  That's
> the reason why @ret is set to true.  In addition, we already released
> the lock to grab manager_mutex.  So, you're not adding any burden.
> Please reuse the busy rebinding mechanism.
>

In 3.6, busy_worker_rebind() handles the WORKER_REBIND bit,
not the WORKER_UNBOUND bit.

busy_worker_rebind() takes a struct work_struct *work argument; we would have
to add a new patch first to add a helper and restructure it.

worker_maybe_bind_and_lock()'s meaning is very clear here. busy_worker_rebind()
is for busy workers, and the manager is not a busy worker.

>
> --
> tejun


Re: [PATCH 4/7 V6] workqueue: fix idle worker depletion

2012-09-08 Thread Lai Jiangshan
On Sun, Sep 9, 2012 at 3:02 AM, Tejun Heo  wrote:
> Hello, Lai.
>
> On Sun, Sep 09, 2012 at 02:34:02AM +0800, Lai Jiangshan wrote:
>> in 3.6 busy_worker_rebind() handle WORKER_REBIND bit,
>> not WORKER_UNBOUND bit.
>>
>> busy_worker_rebind() takes struct work_struct *work argument, we have to
>> add new patch to add a helper and restruct it at first.
>
> What's wrong with just treating manager as busy.  Factor out,
> rebind_work scheduling from rebind_workers() and call it for busy
> workers and the manager if it exists.  manage_workers() only need to
> call process_scheduled_works().  Wouldn't that work?
>
>> worker_maybe_bind_and_lock() 's mean is very clear
>> here. busy_worker_rebind() seems for busy workers, manager is not
>> busy workers.
>
> I don't know.  It just seems unnecessarily wordy.  If you don't like
> reusing the busy worker path, how about just calling
> maybe_bind_and_lock() unconditionally after locking manager_mutex?  I
> mean, can't it just do the following?
>
> spin_unlock_irq(&gcwq->lock);
>
> /*
>  * Explain what's going on.
>  */
> mutex_lock(&pool->manager_mutex);
> if (worker_maybe_bind_and_lock(worker))
> worker_clr_flags(worker, WORKER_UNBOUND);
> ret = true;
>


This code is correct; worker_maybe_bind_and_lock() can be called at any time.
But I prefer to call it only when it is really needed.

Thanks.
Lai


[PATCH 1/2 V7 for-3.6-fixes] workqueue: add POOL_MANAGING_WORKERS

2012-09-09 Thread Lai Jiangshan
When hotplug happens, the hotplug code also grabs the manager_mutex. This
breaks too_many_workers()'s assumption that the mutex is only held by a
managing worker and makes its behaviour ugly (the idle timer is kicked
wrongly; no real bug has been observed).

To avoid corrupting that assumption, add the original POOL_MANAGING_WORKERS
flag back.

Signed-off-by: Lai Jiangshan 
---
 kernel/workqueue.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index dc7b845..383548e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -66,6 +66,7 @@ enum {
 
/* pool flags */
POOL_MANAGE_WORKERS = 1 << 0,   /* need to manage workers */
+   POOL_MANAGING_WORKERS   = 1 << 1,   /* managing workers */
 
/* worker flags */
WORKER_STARTED  = 1 << 0,   /* started */
@@ -652,7 +653,7 @@ static bool need_to_manage_workers(struct worker_pool *pool)
 /* Do we have too many workers and should some go away? */
 static bool too_many_workers(struct worker_pool *pool)
 {
-   bool managing = mutex_is_locked(&pool->manager_mutex);
+   bool managing = pool->flags & POOL_MANAGING_WORKERS;
int nr_idle = pool->nr_idle + managing; /* manager is considered idle */
int nr_busy = pool->nr_workers - nr_idle;
 
@@ -1827,6 +1828,7 @@ static bool manage_workers(struct worker *worker)
if (!mutex_trylock(&pool->manager_mutex))
return ret;
 
+   pool->flags |= POOL_MANAGING_WORKERS;
pool->flags &= ~POOL_MANAGE_WORKERS;
 
/*
@@ -1836,6 +1838,7 @@ static bool manage_workers(struct worker *worker)
ret |= maybe_destroy_workers(pool);
ret |= maybe_create_worker(pool);
 
+   pool->flags &= ~POOL_MANAGING_WORKERS;
mutex_unlock(&pool->manager_mutex);
return ret;
 }
-- 
1.7.4.4



[PATCH 2/2 V7 for-3.6-fixes] workqueue: fix idle worker depletion

2012-09-09 Thread Lai Jiangshan
If the hotplug code has grabbed the manager_mutex while worker_thread tries
to create a worker, manage_workers() returns false and worker_thread goes on
to process work items. Now every worker on that CPU is processing work items
and no idle worker is left or ready for managing. This breaks the workqueue
design and is a bug.

So when manage_workers() fails to trylock the manager_mutex, it should
release gcwq->lock and then grab the manager_mutex.

After gcwq->lock is released, hotplug can happen, but the hotplug code cannot
unbind/rebind the manager, so the manager should try to rebind itself
unconditionally; if that fails, it unbinds itself.

Signed-off-by: Lai Jiangshan 
---
 kernel/workqueue.c |   31 ++-
 1 files changed, 30 insertions(+), 1 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 383548e..74434c8 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1825,10 +1825,39 @@ static bool manage_workers(struct worker *worker)
struct worker_pool *pool = worker->pool;
bool ret = false;
 
-   if (!mutex_trylock(&pool->manager_mutex))
+   if (pool->flags & POOL_MANAGING_WORKERS)
return ret;
 
pool->flags |= POOL_MANAGING_WORKERS;
+
+   if (unlikely(!mutex_trylock(&pool->manager_mutex))) {
+   /*
+* Ouch! rebind_workers() or gcwq_unbind_fn() beat us to it.
+* We can't return false here, otherwise it would lead to
+* worker depletion. So release gcwq->lock and then grab
+* manager_mutex again.
+*/
+   spin_unlock_irq(&pool->gcwq->lock);
+   mutex_lock(&pool->manager_mutex);
+
+   /*
+* A hotplug operation may have happened while gcwq->lock
+* was released, so we can't tell whether this worker is
+* still associated. Try to rebind it via
+* worker_maybe_bind_and_lock(). If that returns false,
+* the whole gcwq is disassociated and we must unbind
+* this worker. (The hotplug code can't unbind/rebind the
+* manager itself, because it can't iterate over the
+* manager.)
+*/
+   if (worker_maybe_bind_and_lock(worker))
+   worker->flags &= ~WORKER_UNBOUND;
+   else
+   worker->flags |= WORKER_UNBOUND;
+
+   ret = true;
+   }
+
pool->flags &= ~POOL_MANAGE_WORKERS;
 
/*
-- 
1.7.4.4



[V4 PATCH 00/26] memory,numa: introduce MOVABLE-dedicated node and online_movable for hotplug

2012-09-10 Thread Lai Jiangshan
A) Introduction:

This patchset adds a MOVABLE-dedicated node and an online_movable option for
memory management.

It is useful for anti-fragmentation (hugepages, high-order allocations, ...)
and for hot removal of memory (virtualization, power saving, and moving memory
between systems to make better use of it).

This patchset is based on commit 650470d1da17c20bf9700f9446775a01cbda52c3 of
the latest tip tree.

B) User Interface:

When users (administrators of big systems) need to configure some node/memory
as MOVABLE:
1. Use kernelcore_max_addr=XX at boot time.
2. Use the online_movable hotplug action at run time.
We may introduce a more convenient interface later, such as a
movable_node=NODE_LIST boot option. A short usage sketch follows below.
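
For illustration only (the memory block number below is made up; the sysfs
path is the existing memory-hotplug interface, and kernelcore_max_addr is the
new boot parameter added later in this series):

  Boot with all memory above 4 GiB kept out of kernelcore:

      kernelcore_max_addr=4G

  Then, at run time, online a hot-added memory block into ZONE_MOVABLE:

      % echo online_movable > /sys/devices/system/memory/memory32/state

  A block onlined this way contains only migratable pages, so it can later be
  offlined again ("echo offline") when the memory is to be removed.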

C) Patches

Patch 1-3    Fix problems in the current code (all related to hotplug).
Patch 4      Clean up node_state_attr.
Patch 5      Introduce N_MEMORY.
Patch 6-18   Use N_MEMORY instead of N_HIGH_MEMORY.
             The patches are separated by subsystem;
             these conversions were (and must be) checked carefully.
             Patch 18 also changes the node_states initialization.
Patch 19     Add a config option to allow a MOVABLE-dedicated node.
Patch 20-24  Add kernelcore_max_addr.
Patch 25,26  Add online_movable and online_kernel.


D) Changes

Changes in V4 vs. V3:
rebased;
online_movable/online_kernel can now create a zone from an empty one
or empty a zone.

Changes in V3 vs. V2:
proper nodemask management.

Changes in V2 vs. V1:

The original V1 patchset for the MOVABLE-dedicated node is here:
http://comments.gmane.org/gmane.linux.kernel.mm/78122

The new V2 adds N_MEMORY and the notion of a "MOVABLE-dedicated node",
and fixes some related problems.

The original V1 patchset of "add online_movable" is here:
https://lkml.org/lkml/2012/7/4/145

The new V2 discards the MIGRATE_HOTREMOVE approach and uses a more
straightforward implementation (only one patch).
Lai Jiangshan (22):
  page_alloc.c: don't subtract unrelated memmap from zone's present
pages
  memory_hotplug: fix missing nodemask management
  slub, hotplug: ignore unrelated node's hot-adding and hot-removing
  node: cleanup node_state_attr
  node_states: introduce N_MEMORY
  cpuset: use N_MEMORY instead N_HIGH_MEMORY
  procfs: use N_MEMORY instead N_HIGH_MEMORY
  memcontrol: use N_MEMORY instead N_HIGH_MEMORY
  oom: use N_MEMORY instead N_HIGH_MEMORY
  mm,migrate: use N_MEMORY instead N_HIGH_MEMORY
  mempolicy: use N_MEMORY instead N_HIGH_MEMORY
  hugetlb: use N_MEMORY instead N_HIGH_MEMORY
  vmstat: use N_MEMORY instead N_HIGH_MEMORY
  kthread: use N_MEMORY instead N_HIGH_MEMORY
  init: use N_MEMORY instead N_HIGH_MEMORY
  vmscan: use N_MEMORY instead N_HIGH_MEMORY
  page_alloc: use N_MEMORY instead N_HIGH_MEMORY change the node_states
initialization
  hotplug: update nodemasks management
  numa: add CONFIG_MOVABLE_NODE for movable-dedicated node
  page_alloc: add kernelcore_max_addr
  mm, memory-hotplug: add online_movable and online_kernel
  memory_hotplug: handle empty zone when online_movable/online_kernel

Yasuaki Ishimatsu (4):
  x86: get pg_data_t's memory from other node
  x86: use memblock_set_current_limit() to set memblock.current_limit
  memblock: limit memory address from memblock
  memblock: compare current_limit with end variable at
memblock_find_in_range_node()

 Documentation/cgroups/cpusets.txt   |2 +-
 Documentation/kernel-parameters.txt |9 ++
 Documentation/memory-hotplug.txt|   24 +++-
 arch/x86/kernel/setup.c |4 +-
 arch/x86/mm/init_64.c   |4 +-
 arch/x86/mm/numa.c  |8 +-
 drivers/base/memory.c   |   19 ++-
 drivers/base/node.c |   28 +++--
 fs/proc/kcore.c |2 +-
 fs/proc/task_mmu.c  |4 +-
 include/linux/cpuset.h  |2 +-
 include/linux/memblock.h|1 +
 include/linux/memory.h  |2 +
 include/linux/memory_hotplug.h  |   13 ++-
 include/linux/nodemask.h|5 +
 init/main.c |2 +-
 kernel/cpuset.c |   32 ++--
 kernel/kthread.c|2 +-
 mm/Kconfig  |8 +
 mm/hugetlb.c|   24 ++--
 mm/memblock.c   |   10 +-
 mm/memcontrol.c |   18 ++--
 mm/memory_hotplug.c |  271 ---
 mm/mempolicy.c  |   12 +-
 mm/migrate.c|2 +-
 mm/oom_kill.c   |2 +-
 mm/page_alloc.c |   96 -
 mm/page_cgroup.c|2 +-
 mm/slub.c   |4 +-
 mm/vmscan.c |4 +-
 mm/vmstat.c |4 +-
 31 files changed, 476 insertions(+), 144 deletions(-)


[V4 PATCH 01/26] page_alloc.c: don't subtract unrelated memmap from zone's present pages

2012-09-10 Thread Lai Jiangshan
A)==
Currently, the memory page map (the struct page array) is not defined in
struct zone.  It is defined in several ways:

FLATMEM: a global memmap, which can be allocated from any zone <= ZONE_NORMAL.
CONFIG_DISCONTIGMEM: a node-specific memmap, which can be allocated from any
 zone <= ZONE_NORMAL within that node.
CONFIG_SPARSEMEM: a memory-section-specific memmap, which can be allocated from
  any zone; with CONFIG_SPARSEMEM_VMEMMAP it is not even
  physically contiguous.

So the memmap is not directly tied to the zone, and its memory can be
allocated outside the zone, so it is wrong to subtract the memmap's size from
the zone's present pages.

B)==
When the system has large holes, the present-pages value left after the
subtraction becomes very small or negative, which makes memory management
behave badly for that zone or even makes the zone unusable although the real
amount of present memory is large.

C)==
The subtraction is also a problem for memory hot-remove:
zone->present_pages may wrap around and become a huge (unsigned long) value.

D)==
The memory page map is large, long-lived, unreclaimable memory, so it is
reasonable to subtract it when calculating watermarks.  But a proper new
approach is needed to do that, and it should also handle other long-lived
unreclaimable memory.

The current blind subtraction of the memmap size is wrong; remove it.
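
For illustration, with made-up numbers (4 KiB pages, struct page taken as
roughly 64 bytes): a node that spans 64 GiB but has only 1 GiB actually
present covers 16M pages, so its memmap would need about 16M * 64 bytes =
1 GiB, i.e. 256K pages -- the same as the 1 GiB that is really present.
Nearly all of the present memory would therefore be accounted away, even
though 1 GiB of usable memory exists and, with CONFIG_SPARSEMEM_VMEMMAP, the
memmap may not even have been allocated from this zone.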

Signed-off-by: Lai Jiangshan 
---
 mm/page_alloc.c |   20 +---
 1 files changed, 1 insertions(+), 19 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c66fb87..9e3c8b2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4401,30 +4401,12 @@ static void __paginginit free_area_init_core(struct 
pglist_data *pgdat,
 
for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
-   unsigned long size, realsize, memmap_pages;
+   unsigned long size, realsize;
 
size = zone_spanned_pages_in_node(nid, j, zones_size);
realsize = size - zone_absent_pages_in_node(nid, j,
zholes_size);
 
-   /*
-* Adjust realsize so that it accounts for how much memory
-* is used by this zone for memmap. This affects the watermark
-* and per-cpu initialisations
-*/
-   memmap_pages =
-   PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
-   if (realsize >= memmap_pages) {
-   realsize -= memmap_pages;
-   if (memmap_pages)
-   printk(KERN_DEBUG
-  "  %s zone: %lu pages used for memmap\n",
-  zone_names[j], memmap_pages);
-   } else
-   printk(KERN_WARNING
-   "  %s zone: %lu pages exceeds realsize %lu\n",
-   zone_names[j], memmap_pages, realsize);
-
/* Account for reserved pages */
if (j == 0 && realsize > dma_reserve) {
realsize -= dma_reserve;
-- 
1.7.1



[V4 PATCH 04/26] node: cleanup node_state_attr

2012-09-10 Thread Lai Jiangshan
Use designated initializers ([index] = value) and use the N_x enum values
instead of hard-coded indices.

This makes the table more readable and easier to extend with new states.

Signed-off-by: Lai Jiangshan 
---
 drivers/base/node.c |   20 ++--
 1 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index af1a177..5d7731e 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -614,23 +614,23 @@ static ssize_t show_node_state(struct device *dev,
{ __ATTR(name, 0444, show_node_state, NULL), state }
 
 static struct node_attr node_state_attr[] = {
-   _NODE_ATTR(possible, N_POSSIBLE),
-   _NODE_ATTR(online, N_ONLINE),
-   _NODE_ATTR(has_normal_memory, N_NORMAL_MEMORY),
-   _NODE_ATTR(has_cpu, N_CPU),
+   [N_POSSIBLE] = _NODE_ATTR(possible, N_POSSIBLE),
+   [N_ONLINE] = _NODE_ATTR(online, N_ONLINE),
+   [N_NORMAL_MEMORY] = _NODE_ATTR(has_normal_memory, N_NORMAL_MEMORY),
 #ifdef CONFIG_HIGHMEM
-   _NODE_ATTR(has_high_memory, N_HIGH_MEMORY),
+   [N_HIGH_MEMORY] = _NODE_ATTR(has_high_memory, N_HIGH_MEMORY),
 #endif
+   [N_CPU] = _NODE_ATTR(has_cpu, N_CPU),
 };
 
 static struct attribute *node_state_attrs[] = {
-   &node_state_attr[0].attr.attr,
-   &node_state_attr[1].attr.attr,
-   &node_state_attr[2].attr.attr,
-   &node_state_attr[3].attr.attr,
+   &node_state_attr[N_POSSIBLE].attr.attr,
+   &node_state_attr[N_ONLINE].attr.attr,
+   &node_state_attr[N_NORMAL_MEMORY].attr.attr,
 #ifdef CONFIG_HIGHMEM
-   &node_state_attr[4].attr.attr,
+   &node_state_attr[N_HIGH_MEMORY].attr.attr,
 #endif
+   &node_state_attr[N_CPU].attr.attr,
NULL
 };
 
-- 
1.7.1



[V4 PATCH 06/26] cpuset: use N_MEMORY instead N_HIGH_MEMORY

2012-09-10 Thread Lai Jiangshan
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan 
Acked-by: Hillf Danton 
---
 Documentation/cgroups/cpusets.txt |2 +-
 include/linux/cpuset.h|2 +-
 kernel/cpuset.c   |   32 
 3 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/Documentation/cgroups/cpusets.txt 
b/Documentation/cgroups/cpusets.txt
index cefd3d8..12e01d4 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -218,7 +218,7 @@ and name space for cpusets, with a minimum of additional 
kernel code.
 The cpus and mems files in the root (top_cpuset) cpuset are
 read-only.  The cpus file automatically tracks the value of
 cpu_online_mask using a CPU hotplug notifier, and the mems file
-automatically tracks the value of node_states[N_HIGH_MEMORY]--i.e.,
+automatically tracks the value of node_states[N_MEMORY]--i.e.,
 nodes with memory--using the cpuset_track_online_nodes() hook.
 
 
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 838320f..8c8a60d 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -144,7 +144,7 @@ static inline nodemask_t cpuset_mems_allowed(struct 
task_struct *p)
return node_possible_map;
 }
 
-#define cpuset_current_mems_allowed (node_states[N_HIGH_MEMORY])
+#define cpuset_current_mems_allowed (node_states[N_MEMORY])
 static inline void cpuset_init_current_mems_allowed(void) {}
 
 static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index f33c715..2b133db 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -302,10 +302,10 @@ static void guarantee_online_cpus(const struct cpuset *cs,
  * are online, with memory.  If none are online with memory, walk
  * up the cpuset hierarchy until we find one that does have some
  * online mems.  If we get all the way to the top and still haven't
- * found any online mems, return node_states[N_HIGH_MEMORY].
+ * found any online mems, return node_states[N_MEMORY].
  *
  * One way or another, we guarantee to return some non-empty subset
- * of node_states[N_HIGH_MEMORY].
+ * of node_states[N_MEMORY].
  *
  * Call with callback_mutex held.
  */
@@ -313,14 +313,14 @@ static void guarantee_online_cpus(const struct cpuset *cs,
 static void guarantee_online_mems(const struct cpuset *cs, nodemask_t *pmask)
 {
while (cs && !nodes_intersects(cs->mems_allowed,
-   node_states[N_HIGH_MEMORY]))
+   node_states[N_MEMORY]))
cs = cs->parent;
if (cs)
nodes_and(*pmask, cs->mems_allowed,
-   node_states[N_HIGH_MEMORY]);
+   node_states[N_MEMORY]);
else
-   *pmask = node_states[N_HIGH_MEMORY];
-   BUG_ON(!nodes_intersects(*pmask, node_states[N_HIGH_MEMORY]));
+   *pmask = node_states[N_MEMORY];
+   BUG_ON(!nodes_intersects(*pmask, node_states[N_MEMORY]));
 }
 
 /*
@@ -1100,7 +1100,7 @@ static int update_nodemask(struct cpuset *cs, struct 
cpuset *trialcs,
return -ENOMEM;
 
/*
-* top_cpuset.mems_allowed tracks node_stats[N_HIGH_MEMORY];
+* top_cpuset.mems_allowed tracks node_stats[N_MEMORY];
 * it's read-only
 */
if (cs == &top_cpuset) {
@@ -1122,7 +1122,7 @@ static int update_nodemask(struct cpuset *cs, struct 
cpuset *trialcs,
goto done;
 
if (!nodes_subset(trialcs->mems_allowed,
-   node_states[N_HIGH_MEMORY])) {
+   node_states[N_MEMORY])) {
retval =  -EINVAL;
goto done;
}
@@ -2034,7 +2034,7 @@ static struct cpuset *cpuset_next(struct list_head *queue)
  * before dropping down to the next.  It always processes a node before
  * any of its children.
  *
- * In the case of memory hot-unplug, it will remove nodes from N_HIGH_MEMORY
+ * In the case of memory hot-unplug, it will remove nodes from N_MEMORY
  * if all present pages from a node are offlined.
  */
 static void
@@ -2073,7 +2073,7 @@ scan_cpusets_upon_hotplug(struct cpuset *root, enum 
hotplug_event event)
 
/* Continue past cpusets with all mems online */
if (nodes_subset(cp->mems_allowed,
-   node_states[N_HIGH_MEMORY]))
+   node_states[N_MEMORY]))
continue;
 
oldmems = cp->mems_allowed;
@@ -2081,7 +2081,7 @@ scan_cpusets_upon_hotplug(struct cpuset *root, enum 
hotplug_event 

[V4 PATCH 07/26] procfs: use N_MEMORY instead N_HIGH_MEMORY

2012-09-10 Thread Lai Jiangshan
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan 
Acked-by: Hillf Danton 
---
 fs/proc/kcore.c|2 +-
 fs/proc/task_mmu.c |4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 86c67ee..e96d4f1 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -249,7 +249,7 @@ static int kcore_update_ram(void)
/* Not inializedupdate now */
/* find out "max pfn" */
end_pfn = 0;
-   for_each_node_state(nid, N_HIGH_MEMORY) {
+   for_each_node_state(nid, N_MEMORY) {
unsigned long node_end;
node_end  = NODE_DATA(nid)->node_start_pfn +
NODE_DATA(nid)->node_spanned_pages;
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 4540b8f..ed3d381 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1080,7 +1080,7 @@ static struct page *can_gather_numa_stats(pte_t pte, 
struct vm_area_struct *vma,
return NULL;
 
nid = page_to_nid(page);
-   if (!node_isset(nid, node_states[N_HIGH_MEMORY]))
+   if (!node_isset(nid, node_states[N_MEMORY]))
return NULL;
 
return page;
@@ -1232,7 +1232,7 @@ static int show_numa_map(struct seq_file *m, void *v, int 
is_pid)
if (md->writeback)
seq_printf(m, " writeback=%lu", md->writeback);
 
-   for_each_node_state(n, N_HIGH_MEMORY)
+   for_each_node_state(n, N_MEMORY)
if (md->node[n])
seq_printf(m, " N%d=%lu", n, md->node[n]);
 out:
-- 
1.7.1



[V4 PATCH 11/26] mempolicy: use N_MEMORY instead N_HIGH_MEMORY

2012-09-10 Thread Lai Jiangshan
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan 
---
 mm/mempolicy.c |   12 ++--
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4ada3be..54cf023 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -212,9 +212,9 @@ static int mpol_set_nodemask(struct mempolicy *pol,
/* if mode is MPOL_DEFAULT, pol is NULL. This is right. */
if (pol == NULL)
return 0;
-   /* Check N_HIGH_MEMORY */
+   /* Check N_MEMORY */
nodes_and(nsc->mask1,
- cpuset_current_mems_allowed, node_states[N_HIGH_MEMORY]);
+ cpuset_current_mems_allowed, node_states[N_MEMORY]);
 
VM_BUG_ON(!nodes);
if (pol->mode == MPOL_PREFERRED && nodes_empty(*nodes))
@@ -1363,7 +1363,7 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, 
maxnode,
goto out_put;
}
 
-   if (!nodes_subset(*new, node_states[N_HIGH_MEMORY])) {
+   if (!nodes_subset(*new, node_states[N_MEMORY])) {
err = -EINVAL;
goto out_put;
}
@@ -2320,7 +2320,7 @@ void __init numa_policy_init(void)
 * fall back to the largest node if they're all smaller.
 */
nodes_clear(interleave_nodes);
-   for_each_node_state(nid, N_HIGH_MEMORY) {
+   for_each_node_state(nid, N_MEMORY) {
unsigned long total_pages = node_present_pages(nid);
 
/* Preserve the largest node */
@@ -2401,7 +2401,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, 
int no_context)
*nodelist++ = '\0';
if (nodelist_parse(nodelist, nodes))
goto out;
-   if (!nodes_subset(nodes, node_states[N_HIGH_MEMORY]))
+   if (!nodes_subset(nodes, node_states[N_MEMORY]))
goto out;
} else
nodes_clear(nodes);
@@ -2435,7 +2435,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, 
int no_context)
 * Default to online nodes with memory if no nodelist
 */
if (!nodelist)
-   nodes = node_states[N_HIGH_MEMORY];
+   nodes = node_states[N_MEMORY];
break;
case MPOL_LOCAL:
/*
-- 
1.7.1



[V4 PATCH 09/26] oom: use N_MEMORY instead N_HIGH_MEMORY

2012-09-10 Thread Lai Jiangshan
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan 
Acked-by: Hillf Danton 
---
 mm/oom_kill.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1986008..5269e9d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -257,7 +257,7 @@ static enum oom_constraint constrained_alloc(struct 
zonelist *zonelist,
 * the page allocator means a mempolicy is in effect.  Cpuset policy
 * is enforced in get_page_from_freelist().
 */
-   if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask)) {
+   if (nodemask && !nodes_subset(node_states[N_MEMORY], *nodemask)) {
*totalpages = total_swap_pages;
for_each_node_mask(nid, *nodemask)
*totalpages += node_spanned_pages(nid);
-- 
1.7.1



[V4 PATCH 10/26] mm,migrate: use N_MEMORY instead N_HIGH_MEMORY

2012-09-10 Thread Lai Jiangshan
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan 
Acked-by: Christoph Lameter 
---
 mm/migrate.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 77ed2d7..d595e58 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1201,7 +1201,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t 
task_nodes,
if (node < 0 || node >= MAX_NUMNODES)
goto out_pm;
 
-   if (!node_state(node, N_HIGH_MEMORY))
+   if (!node_state(node, N_MEMORY))
goto out_pm;
 
err = -EACCES;
-- 
1.7.1



[V4 PATCH 23/26] memblock: limit memory address from memblock

2012-09-10 Thread Lai Jiangshan
From: Yasuaki Ishimatsu 

Setting kernelcore_max_addr means that all memory above that address is used
for ZONE_MOVABLE. So memory allocated by memblock should also be limited by
the parameter; for example, booting with kernelcore_max_addr=4G should keep
memblock allocations below 4 GiB even if a caller passes a higher limit.

This patch adds that limit to memblock via memblock_set_current_limit().

Signed-off-by: Yasuaki Ishimatsu 
Signed-off-by: Lai Jiangshan 
---
 include/linux/memblock.h |1 +
 mm/memblock.c|5 -
 mm/page_alloc.c  |6 +-
 3 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 19dc455..f2977ae 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -42,6 +42,7 @@ struct memblock {
 
 extern struct memblock memblock;
 extern int memblock_debug;
+extern phys_addr_t memblock_limit;
 
 #define memblock_dbg(fmt, ...) \
if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
diff --git a/mm/memblock.c b/mm/memblock.c
index 82aa349..fbf5efc 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -932,7 +932,10 @@ int __init_memblock 
memblock_is_region_reserved(phys_addr_t base, phys_addr_t si
 
 void __init_memblock memblock_set_current_limit(phys_addr_t limit)
 {
-   memblock.current_limit = limit;
+   if (!memblock_limit || (memblock_limit > limit))
+   memblock.current_limit = limit;
+   else
+   memblock.current_limit = memblock_limit;
 }
 
 static void __init_memblock memblock_dump(struct memblock_type *type, char 
*name)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c1c5834..3878170 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -208,6 +208,8 @@ static unsigned long __initdata required_kernelcore;
 static unsigned long __initdata required_movablecore;
 static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
 
+phys_addr_t memblock_limit;
+
 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;
 EXPORT_SYMBOL(movable_zone);
@@ -4926,7 +4928,9 @@ static int __init cmdline_parse_core(char *p, unsigned 
long *core)
  */
 static int __init cmdline_parse_kernelcore_max_addr(char *p)
 {
-   return cmdline_parse_core(p, &required_kernelcore_max_pfn);
+   cmdline_parse_core(p, &required_kernelcore_max_pfn);
+   memblock_limit = required_kernelcore_max_pfn << PAGE_SHIFT;
+   return 0;
 }
 early_param("kernelcore_max_addr", cmdline_parse_kernelcore_max_addr);
 #endif
-- 
1.7.1



[V4 PATCH 20/26] page_alloc: add kernelcore_max_addr

2012-09-10 Thread Lai Jiangshan
The current ZONE_MOVABLE setting policy, the kernelcore= boot option, doesn't
meet our requirement. We need something like a kernelcore_max_addr=XX boot
option to limit the upper address of kernelcore memory.

Memory above that address will be migratable (movable), so it is easier to
offline (it is always ready to be offlined when the system doesn't need that
much memory).

This makes dynamic memory hot-add/remove easier, allows better use of memory,
and helps THP.

kernelcore_max_addr=, kernelcore= and movablecore= can all be safely specified
at the same time (or any two of them); an example command line is shown below.
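
For illustration only (the address is made up): booting with

    kernelcore_max_addr=4G

keeps kernelcore (non-movable) allocations below 4 GiB, so memory above that
address ends up in ZONE_MOVABLE and stays migratable, ready to be offlined
later. kernelcore= or movablecore= may be added on the same command line to
tune the split further, with kernelcore_max_addr taking priority as described
in the documentation hunk below.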

Signed-off-by: Lai Jiangshan 
---
 Documentation/kernel-parameters.txt |9 +
 mm/page_alloc.c |   29 -
 2 files changed, 37 insertions(+), 1 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 7aef334..02a2ce9 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1215,6 +1215,15 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
use the HighMem zone if it exists, and the Normal
zone if it does not.
 
+   kernelcore_max_addr=nn[KMG] [KNL,X86,IA-64,PPC] This parameter
+   has the same effect as the kernelcore parameter, except
+   that it specifies the upper physical address of the
+   memory range usable by the kernel for non-movable
+   allocations.  If both kernelcore and
+   kernelcore_max_addr are specified, kernelcore_max_addr
+   takes priority over kernelcore.
+   See the kernelcore parameter.
+
kgdbdbgp=   [KGDB,HW] kgdb over EHCI usb debug port.
Format: [,poll interval]
The controller # is the number of the ehci usb debug
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 621c666..c1c5834 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -203,6 +203,7 @@ static unsigned long __meminitdata dma_reserve;
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 static unsigned long __meminitdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
 static unsigned long __meminitdata 
arch_zone_highest_possible_pfn[MAX_NR_ZONES];
+static unsigned long __initdata required_kernelcore_max_pfn;
 static unsigned long __initdata required_kernelcore;
 static unsigned long __initdata required_movablecore;
 static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
@@ -4650,6 +4651,7 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 {
int i, nid;
unsigned long usable_startpfn;
+   unsigned long kernelcore_max_pfn;
unsigned long kernelcore_node, kernelcore_remaining;
/* save the state before borrow the nodemask */
nodemask_t saved_node_state = node_states[N_MEMORY];
@@ -4678,6 +4680,9 @@ static void __init find_zone_movable_pfns_for_nodes(void)
required_kernelcore = max(required_kernelcore, corepages);
}
 
+   if (required_kernelcore_max_pfn && !required_kernelcore)
+   required_kernelcore = totalpages;
+
/* If kernelcore was not specified, there is no ZONE_MOVABLE */
if (!required_kernelcore)
goto out;
@@ -4686,6 +4691,12 @@ static void __init find_zone_movable_pfns_for_nodes(void)
find_usable_zone_for_movable();
usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
 
+   if (required_kernelcore_max_pfn)
+   kernelcore_max_pfn = required_kernelcore_max_pfn;
+   else
+   kernelcore_max_pfn = ULONG_MAX >> PAGE_SHIFT;
+   kernelcore_max_pfn = max(kernelcore_max_pfn, usable_startpfn);
+
 restart:
/* Spread kernelcore memory as evenly as possible throughout nodes */
kernelcore_node = required_kernelcore / usable_nodes;
@@ -4712,8 +4723,12 @@ restart:
unsigned long size_pages;
 
start_pfn = max(start_pfn, zone_movable_pfn[nid]);
-   if (start_pfn >= end_pfn)
+   end_pfn = min(kernelcore_max_pfn, end_pfn);
+   if (start_pfn >= end_pfn) {
+   if (!zone_movable_pfn[nid])
+   zone_movable_pfn[nid] = start_pfn;
continue;
+   }
 
/* Account for what is only usable for kernelcore */
if (start_pfn < usable_startpfn) {
@@ -4904,6 +4919,18 @@ static int __init cmdline_parse_core(char *p, unsigned 
long *core)
return 0;
 }
 
+#ifdef CONFIG_MOVABLE_NODE
+/*
+ * kernelcore_max_addr=addr sets the upper physical address of the memory
+ * range used for allocations that cannot be reclaimed or migrated.
+ */
+static int __ini

[V4 PATCH 08/26] memcontrol: use N_MEMORY instead N_HIGH_MEMORY

2012-09-10 Thread Lai Jiangshan
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan 
---
 mm/memcontrol.c  |   18 +-
 mm/page_cgroup.c |2 +-
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 795e525..0d42f53 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -801,7 +801,7 @@ static unsigned long mem_cgroup_nr_lru_pages(struct 
mem_cgroup *memcg,
int nid;
u64 total = 0;
 
-   for_each_node_state(nid, N_HIGH_MEMORY)
+   for_each_node_state(nid, N_MEMORY)
total += mem_cgroup_node_nr_lru_pages(memcg, nid, lru_mask);
return total;
 }
@@ -1617,9 +1617,9 @@ static void mem_cgroup_may_update_nodemask(struct 
mem_cgroup *memcg)
return;
 
/* make a nodemask where this memcg uses memory from */
-   memcg->scan_nodes = node_states[N_HIGH_MEMORY];
+   memcg->scan_nodes = node_states[N_MEMORY];
 
-   for_each_node_mask(nid, node_states[N_HIGH_MEMORY]) {
+   for_each_node_mask(nid, node_states[N_MEMORY]) {
 
if (!test_mem_cgroup_node_reclaimable(memcg, nid, false))
node_clear(nid, memcg->scan_nodes);
@@ -1690,7 +1690,7 @@ static bool mem_cgroup_reclaimable(struct mem_cgroup 
*memcg, bool noswap)
/*
 * Check rest of nodes.
 */
-   for_each_node_state(nid, N_HIGH_MEMORY) {
+   for_each_node_state(nid, N_MEMORY) {
if (node_isset(nid, memcg->scan_nodes))
continue;
if (test_mem_cgroup_node_reclaimable(memcg, nid, noswap))
@@ -3765,7 +3765,7 @@ move_account:
drain_all_stock_sync(memcg);
ret = 0;
mem_cgroup_start_move(memcg);
-   for_each_node_state(node, N_HIGH_MEMORY) {
+   for_each_node_state(node, N_MEMORY) {
for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
enum lru_list lru;
for_each_lru(lru) {
@@ -4093,7 +4093,7 @@ static int memcg_numa_stat_show(struct cgroup *cont, 
struct cftype *cft,
 
total_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL);
seq_printf(m, "total=%lu", total_nr);
-   for_each_node_state(nid, N_HIGH_MEMORY) {
+   for_each_node_state(nid, N_MEMORY) {
node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL);
seq_printf(m, " N%d=%lu", nid, node_nr);
}
@@ -4101,7 +4101,7 @@ static int memcg_numa_stat_show(struct cgroup *cont, 
struct cftype *cft,
 
file_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_FILE);
seq_printf(m, "file=%lu", file_nr);
-   for_each_node_state(nid, N_HIGH_MEMORY) {
+   for_each_node_state(nid, N_MEMORY) {
node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid,
LRU_ALL_FILE);
seq_printf(m, " N%d=%lu", nid, node_nr);
@@ -4110,7 +4110,7 @@ static int memcg_numa_stat_show(struct cgroup *cont, 
struct cftype *cft,
 
anon_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_ANON);
seq_printf(m, "anon=%lu", anon_nr);
-   for_each_node_state(nid, N_HIGH_MEMORY) {
+   for_each_node_state(nid, N_MEMORY) {
node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid,
LRU_ALL_ANON);
seq_printf(m, " N%d=%lu", nid, node_nr);
@@ -4119,7 +4119,7 @@ static int memcg_numa_stat_show(struct cgroup *cont, 
struct cftype *cft,
 
unevictable_nr = mem_cgroup_nr_lru_pages(memcg, BIT(LRU_UNEVICTABLE));
seq_printf(m, "unevictable=%lu", unevictable_nr);
-   for_each_node_state(nid, N_HIGH_MEMORY) {
+   for_each_node_state(nid, N_MEMORY) {
node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid,
BIT(LRU_UNEVICTABLE));
seq_printf(m, " N%d=%lu", nid, node_nr);
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 5ddad0c..c1054ad 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -271,7 +271,7 @@ void __init page_cgroup_init(void)
if (mem_cgroup_disabled())
return;
 
-   for_each_node_state(nid, N_HIGH_MEMORY) {
+   for_each_node_state(nid, N_MEMORY) {
unsigned long start_pfn, end_pfn;
 
start_pfn = node_start_pfn(nid);
-- 
1.7.1



[V4 PATCH 17/26] page_alloc: use N_MEMORY instead N_HIGH_MEMORY change the node_states initialization

2012-09-10 Thread Lai Jiangshan
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Since we introduced N_MEMORY, also update the initialization of node_states.

Signed-off-by: Lai Jiangshan 
---
 arch/x86/mm/init_64.c |4 +++-
 mm/page_alloc.c   |   40 ++--
 2 files changed, 25 insertions(+), 19 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 2b6b4a3..005f00c 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -625,7 +625,9 @@ void __init paging_init(void)
 *   numa support is not compiled in, and later node_set_state
 *   will not set it back.
 */
-   node_clear_state(0, N_NORMAL_MEMORY);
+   node_clear_state(0, N_MEMORY);
+   if (N_MEMORY != N_NORMAL_MEMORY)
+   node_clear_state(0, N_NORMAL_MEMORY);
 
zone_sizes_init();
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9e3c8b2..3bb04ed 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1675,7 +1675,7 @@ bool zone_watermark_ok_safe(struct zone *z, int order, 
unsigned long mark,
  *
  * If the zonelist cache is present in the passed in zonelist, then
  * returns a pointer to the allowed node mask (either the current
- * tasks mems_allowed, or node_states[N_HIGH_MEMORY].)
+ * tasks mems_allowed, or node_states[N_MEMORY].)
  *
  * If the zonelist cache is not available for this zonelist, does
  * nothing and returns NULL.
@@ -1704,7 +1704,7 @@ static nodemask_t *zlc_setup(struct zonelist *zonelist, 
int alloc_flags)
 
allowednodes = !in_interrupt() && (alloc_flags & ALLOC_CPUSET) ?
&cpuset_current_mems_allowed :
-   &node_states[N_HIGH_MEMORY];
+   &node_states[N_MEMORY];
return allowednodes;
 }
 
@@ -3132,7 +3132,7 @@ static int find_next_best_node(int node, nodemask_t 
*used_node_mask)
return node;
}
 
-   for_each_node_state(n, N_HIGH_MEMORY) {
+   for_each_node_state(n, N_MEMORY) {
 
/* Don't want a node to appear more than once */
if (node_isset(n, *used_node_mask))
@@ -3274,7 +3274,7 @@ static int default_zonelist_order(void)
 * local memory, NODE_ORDER may be suitable.
  */
average_size = total_size /
-   (nodes_weight(node_states[N_HIGH_MEMORY]) + 1);
+   (nodes_weight(node_states[N_MEMORY]) + 1);
for_each_online_node(nid) {
low_kmem_size = 0;
total_size = 0;
@@ -4619,7 +4619,7 @@ unsigned long __init 
find_min_pfn_with_active_regions(void)
 /*
  * early_calculate_totalpages()
  * Sum pages in active regions for movable zone.
- * Populate N_HIGH_MEMORY for calculating usable_nodes.
+ * Populate N_MEMORY for calculating usable_nodes.
  */
 static unsigned long __init early_calculate_totalpages(void)
 {
@@ -4632,7 +4632,7 @@ static unsigned long __init 
early_calculate_totalpages(void)
 
totalpages += pages;
if (pages)
-   node_set_state(nid, N_HIGH_MEMORY);
+   node_set_state(nid, N_MEMORY);
}
return totalpages;
 }
@@ -4649,9 +4649,9 @@ static void __init find_zone_movable_pfns_for_nodes(void)
unsigned long usable_startpfn;
unsigned long kernelcore_node, kernelcore_remaining;
/* save the state before borrow the nodemask */
-   nodemask_t saved_node_state = node_states[N_HIGH_MEMORY];
+   nodemask_t saved_node_state = node_states[N_MEMORY];
unsigned long totalpages = early_calculate_totalpages();
-   int usable_nodes = nodes_weight(node_states[N_HIGH_MEMORY]);
+   int usable_nodes = nodes_weight(node_states[N_MEMORY]);
 
/*
 * If movablecore was specified, calculate what size of
@@ -4686,7 +4686,7 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 restart:
/* Spread kernelcore memory as evenly as possible throughout nodes */
kernelcore_node = required_kernelcore / usable_nodes;
-   for_each_node_state(nid, N_HIGH_MEMORY) {
+   for_each_node_state(nid, N_MEMORY) {
unsigned long start_pfn, end_pfn;
 
/*
@@ -4778,23 +4778,27 @@ restart:
 
 out:
/* restore the node_state */
-   node_states[N_HIGH_MEMORY] = saved_node_state;
+   node_states[N_MEMORY] = saved_node_state;
 }
 
-/* Any regular memory on that node ? */
-static void __init check_for_regular_memory(pg_data_t *pgdat)
+/* Any regular or high memory on that node ? */
+static void check_for_memory(pg_data_t *pgdat, int nid)
 {
-#ifdef CONFIG_HIGHMEM
enum zone_type zone_type;
 
-   for (zone_type = 0; zone_type <= ZONE_NORMAL; zon

[V4 PATCH 22/26] x86: use memblock_set_current_limit() to set memblock.current_limit

2012-09-10 Thread Lai Jiangshan
From: Yasuaki Ishimatsu 

memblock.current_limit is set directly though memblock_set_current_limit()
is prepared. So fix it.

Signed-off-by: Yasuaki Ishimatsu 
Signed-off-by: Lai Jiangshan 
---
 arch/x86/kernel/setup.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f4b9b80..bb9d9f8 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -889,7 +889,7 @@ void __init setup_arch(char **cmdline_p)
 
cleanup_highmap();
 
-   memblock.current_limit = get_max_mapped();
+   memblock_set_current_limit(get_max_mapped());
memblock_x86_fill();
 
/*
@@ -925,7 +925,7 @@ void __init setup_arch(char **cmdline_p)
max_low_pfn = max_pfn;
}
 #endif
-   memblock.current_limit = get_max_mapped();
+   memblock_set_current_limit(get_max_mapped());
dma_contiguous_reserve(0);
 
/*
-- 
1.7.1



[V4 PATCH 19/26] numa: add CONFIG_MOVABLE_NODE for movable-dedicated node

2012-09-10 Thread Lai Jiangshan
All the preparation is done, so we can actually introduce N_MEMORY as a
separate node state: add CONFIG_MOVABLE_NODE so that it can be used for a
movable-dedicated node.

Signed-off-by: Lai Jiangshan 
---
 drivers/base/node.c  |6 ++
 include/linux/nodemask.h |4 
 mm/Kconfig   |8 
 mm/page_alloc.c  |3 +++
 4 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 4c3aa7c..9cdd66f 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -620,6 +620,9 @@ static struct node_attr node_state_attr[] = {
 #ifdef CONFIG_HIGHMEM
[N_HIGH_MEMORY] = _NODE_ATTR(has_high_memory, N_HIGH_MEMORY),
 #endif
+#ifdef CONFIG_MOVABLE_NODE
+   [N_MEMORY] = _NODE_ATTR(has_memory, N_MEMORY),
+#endif
[N_CPU] = _NODE_ATTR(has_cpu, N_CPU),
 };
 
@@ -630,6 +633,9 @@ static struct attribute *node_state_attrs[] = {
 #ifdef CONFIG_HIGHMEM
&node_state_attr[N_HIGH_MEMORY].attr.attr,
 #endif
+#ifdef CONFIG_MOVABLE_NODE
+   &node_state_attr[N_MEMORY].attr.attr,
+#endif
&node_state_attr[N_CPU].attr.attr,
NULL
 };
diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index c6ebdc9..4e2cbfa 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -380,7 +380,11 @@ enum node_states {
 #else
N_HIGH_MEMORY = N_NORMAL_MEMORY,
 #endif
+#ifdef CONFIG_MOVABLE_NODE
+   N_MEMORY,   /* The node has memory(regular, high, movable) 
*/
+#else
N_MEMORY = N_HIGH_MEMORY,
+#endif
N_CPU,  /* The node has one or more cpus */
NR_NODE_STATES
 };
diff --git a/mm/Kconfig b/mm/Kconfig
index d5c8019..8c14a2c 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -143,6 +143,14 @@ config NO_BOOTMEM
 config MEMORY_ISOLATION
boolean
 
+config MOVABLE_NODE
+   boolean "Enable to assign a node which has only movable memory"
+   depends on HAVE_MEMBLOCK
+   depends on NO_BOOTMEM
+   depends on X86_64
+   depends on NUMA
+   default y
+
 # eventually, we can have this option just 'select SPARSEMEM'
 config MEMORY_HOTPLUG
bool "Allow for memory hot-add"
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3bb04ed..621c666 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -90,6 +90,9 @@ nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
 #ifdef CONFIG_HIGHMEM
[N_HIGH_MEMORY] = { { [0] = 1UL } },
 #endif
+#ifdef CONFIG_MOVABLE_NODE
+   [N_MEMORY] = { { [0] = 1UL } },
+#endif
[N_CPU] = { { [0] = 1UL } },
 #endif /* NUMA */
 };
-- 
1.7.1



[V4 PATCH 14/26] kthread: use N_MEMORY instead N_HIGH_MEMORY

2012-09-10 Thread Lai Jiangshan
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan 
---
 kernel/kthread.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 146a6fa..065a0a8 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -427,7 +427,7 @@ int kthreadd(void *unused)
set_task_comm(tsk, "kthreadd");
ignore_signals(tsk);
set_cpus_allowed_ptr(tsk, cpu_all_mask);
-   set_mems_allowed(node_states[N_HIGH_MEMORY]);
+   set_mems_allowed(node_states[N_MEMORY]);
 
current->flags |= PF_NOFREEZE;
 
-- 
1.7.1



[V4 PATCH 25/26] mm, memory-hotplug: add online_movable and online_kernel

2012-09-10 Thread Lai Jiangshan
When a memory block/section is onlined with "online_movable", the kernel
keeps no direct references to the pages of that memory block, so we can
remove that memory whenever needed.

This makes dynamic memory hot-add/remove easier, allows better use of memory,
and helps THP.

Current constraint: only a memory block that is adjacent to ZONE_MOVABLE can
be onlined from ZONE_NORMAL into ZONE_MOVABLE.

For the opposite behavior, we also introduce "online_kernel", which moves a
memory block from ZONE_MOVABLE back to ZONE_NORMAL when it is onlined.
Signed-off-by: Lai Jiangshan 
---
 Documentation/memory-hotplug.txt |   14 +-
 drivers/base/memory.c|   19 +---
 include/linux/memory_hotplug.h   |   13 +-
 mm/memory_hotplug.c  |  101 +-
 4 files changed, 137 insertions(+), 10 deletions(-)

diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
index 70bc1c7..8e5eacb 100644
--- a/Documentation/memory-hotplug.txt
+++ b/Documentation/memory-hotplug.txt
@@ -161,7 +161,8 @@ a recent addition and not present on older kernels.
in the memory block.
 'state'   : read-write
 at read:  contains online/offline state of memory.
-at write: user can specify "online", "offline" command
+at write: user can specify "online_kernel",
+"online_movable", "online", "offline" command
which will be performed on all sections in the block.
 'phys_device' : read-only: designed to show the name of physical memory
 device.  This is not well implemented now.
@@ -255,6 +256,17 @@ For onlining, you have to write "online" to the section's 
state file as:
 
 % echo online > /sys/devices/system/memory/memoryXXX/state
 
+This onlining will not change the ZONE type of the target memory section.
+If the memory section is in ZONE_NORMAL, you can change it to ZONE_MOVABLE:
+
+% echo online_movable > /sys/devices/system/memory/memoryXXX/state
+(NOTE: current limit: this memory section must be adjacent to ZONE_MOVABLE)
+
+And if the memory section is in ZONE_MOVABLE, you can change it to ZONE_NORMAL:
+
+% echo online_kernel > /sys/devices/system/memory/memoryXXX/state
+(NOTE: current limit: this memory section must be adjacent to ZONE_NORMAL)
+
 After this, section memoryXXX's state will be 'online' and the amount of
 available memory will be increased.
 
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 7dda4f7..1ad2f48 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -246,7 +246,7 @@ static bool pages_correctly_reserved(unsigned long 
start_pfn,
  * OK to have direct references to sparsemem variables in here.
  */
 static int
-memory_block_action(unsigned long phys_index, unsigned long action)
+memory_block_action(unsigned long phys_index, unsigned long action, int 
online_type)
 {
unsigned long start_pfn, start_paddr;
unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
@@ -262,7 +262,7 @@ memory_block_action(unsigned long phys_index, unsigned long 
action)
if (!pages_correctly_reserved(start_pfn, nr_pages))
return -EBUSY;
 
-   ret = online_pages(start_pfn, nr_pages);
+   ret = online_pages(start_pfn, nr_pages, online_type);
break;
case MEM_OFFLINE:
start_paddr = page_to_pfn(first_page) << PAGE_SHIFT;
@@ -279,7 +279,8 @@ memory_block_action(unsigned long phys_index, unsigned long 
action)
 }
 
 static int memory_block_change_state(struct memory_block *mem,
-   unsigned long to_state, unsigned long from_state_req)
+   unsigned long to_state, unsigned long from_state_req,
+   int online_type)
 {
int ret = 0;
 
@@ -293,7 +294,7 @@ static int memory_block_change_state(struct memory_block 
*mem,
if (to_state == MEM_OFFLINE)
mem->state = MEM_GOING_OFFLINE;
 
-   ret = memory_block_action(mem->start_section_nr, to_state);
+   ret = memory_block_action(mem->start_section_nr, to_state, online_type);
 
if (ret) {
mem->state = from_state_req;
@@ -325,10 +326,14 @@ store_mem_state(struct device *dev,
 
mem = container_of(dev, struct memory_block, dev);
 
-   if (!strncmp(buf, "online", min((int)count, 6)))
-   ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE);
+   if (!strncmp(buf, "online_kernel", min((int)count, 13)))
+   ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE, 
ONLINE_KERNEL);
+   else if (!strncmp(buf, "online_mova

[V4 PATCH 18/26] hotplug: update nodemasks management

2012-09-10 Thread Lai Jiangshan
Update the nodemask (node_states[]) management for N_MEMORY.

Signed-off-by: Lai Jiangshan 
---
 Documentation/memory-hotplug.txt |5 +++-
 include/linux/memory.h   |1 +
 mm/memory_hotplug.c  |   49 +
 3 files changed, 48 insertions(+), 7 deletions(-)

diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
index 6e6cbc7..70bc1c7 100644
--- a/Documentation/memory-hotplug.txt
+++ b/Documentation/memory-hotplug.txt
@@ -378,6 +378,7 @@ struct memory_notify {
unsigned long start_pfn;
unsigned long nr_pages;
int status_change_nid_normal;
+   int status_change_nid_high;
int status_change_nid;
 }
 
@@ -385,7 +386,9 @@ start_pfn is start_pfn of online/offline memory.
 nr_pages is # of pages of online/offline memory.
 status_change_nid_normal is set node id when N_NORMAL_MEMORY of nodemask
 is (will be) set/clear, if this is -1, then nodemask status is not changed.
-status_change_nid is set node id when N_HIGH_MEMORY of nodemask is (will be)
+status_change_nid_high is set node id when N_HIGH_MEMORY of nodemask
+is (will be) set/clear, if this is -1, then nodemask status is not changed.
+status_change_nid is set node id when N_MEMORY of nodemask is (will be)
 set/clear. It means a new(memoryless) node gets new memory by online and a
 node loses all memory. If this is -1, then nodemask status is not changed.
 If status_changed_nid* >= 0, callback should create/discard structures for the
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 6b9202b..8089e49 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -54,6 +54,7 @@ struct memory_notify {
unsigned long start_pfn;
unsigned long nr_pages;
int status_change_nid_normal;
+   int status_change_nid_high;
int status_change_nid;
 };
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 8c3bcf6..d2b0158 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -462,7 +462,7 @@ static void check_nodemasks_changes_online(unsigned long 
nr_pages,
int nid = zone_to_nid(zone);
enum zone_type zone_last = ZONE_NORMAL;
 
-   if (N_HIGH_MEMORY == N_NORMAL_MEMORY)
+   if (N_MEMORY == N_NORMAL_MEMORY)
zone_last = ZONE_MOVABLE;
 
if (zone_idx(zone) <= zone_last && !node_state(nid, N_NORMAL_MEMORY))
@@ -470,7 +470,20 @@ static void check_nodemasks_changes_online(unsigned long 
nr_pages,
else
arg->status_change_nid_normal = -1;
 
-   if (!node_state(nid, N_HIGH_MEMORY))
+#ifdef CONFIG_HIGHMEM
+   zone_last = ZONE_HIGHMEM;
+   if (N_MEMORY == N_HIGH_MEMORY)
+   zone_last = ZONE_MOVABLE;
+
+   if (zone_idx(zone) <= zone_last && !node_state(nid, N_HIGH_MEMORY))
+   arg->status_change_nid_high = nid;
+   else
+   arg->status_change_nid_high = -1;
+#else
+   arg->status_change_nid_high = arg->status_change_nid_normal;
+#endif
+
+   if (!node_state(nid, N_MEMORY))
arg->status_change_nid = nid;
else
arg->status_change_nid = -1;
@@ -481,7 +494,10 @@ static void set_nodemasks(int node, struct memory_notify 
*arg)
if (arg->status_change_nid_normal >= 0)
node_set_state(node, N_NORMAL_MEMORY);
 
-   node_set_state(node, N_HIGH_MEMORY);
+   if (arg->status_change_nid_high >= 0)
+   node_set_state(node, N_HIGH_MEMORY);
+
+   node_set_state(node, N_MEMORY);
 }
 
 
@@ -900,7 +916,7 @@ static void check_nodemasks_changes_offline(unsigned long 
nr_pages,
unsigned long present_pages = 0;
enum zone_type zt, zone_last = ZONE_NORMAL;
 
-   if (N_HIGH_MEMORY == N_NORMAL_MEMORY)
+   if (N_MEMORY == N_NORMAL_MEMORY)
zone_last = ZONE_MOVABLE;
 
for (zt = 0; zt <= zone_last; zt++)
@@ -910,6 +926,21 @@ static void check_nodemasks_changes_offline(unsigned long 
nr_pages,
else
arg->status_change_nid_normal = -1;
 
+#ifdef CONFIG_HIGHMEM
+   zone_last = ZONE_HIGHMEM;
+   if (N_MEMORY == N_HIGH_MEMORY)
+   zone_last = ZONE_MOVABLE;
+
+   for (; zt <= zone_last; zt++)
+   present_pages += pgdat->node_zones[zt].present_pages;
+   if (zone_idx(zone) <= zone_last && nr_pages >= present_pages)
+   arg->status_change_nid_high = zone_to_nid(zone);
+   else
+   arg->status_change_nid_high = -1;
+#else
+   arg->status_change_nid_high = arg->status_change_nid_normal;
+#endif
+
zone_last = ZONE_MOVABLE;
for (; zt <= zone_last; zt++)
present_pages += pgdat->node_zones[zt].present_pages;
@@ -924,11 +955,17 @@ static void clear_nodemasks(int node, struct 
memory_notify *arg)
if (arg->status_change_nid_normal >= 0)
node_clear_state(nod

[V4 PATCH 12/26] hugetlb: use N_MEMORY instead N_HIGH_MEMORY

2012-09-10 Thread Lai Jiangshan
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan 
Acked-by: Hillf Danton 
---
 drivers/base/node.c |2 +-
 mm/hugetlb.c|   24 
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 5d7731e..4c3aa7c 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -227,7 +227,7 @@ static node_registration_func_t __hugetlb_unregister_node;
 static inline bool hugetlb_register_node(struct node *node)
 {
if (__hugetlb_register_node &&
-   node_state(node->dev.id, N_HIGH_MEMORY)) {
+   node_state(node->dev.id, N_MEMORY)) {
__hugetlb_register_node(node);
return true;
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bc72712..a254dfb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1052,7 +1052,7 @@ static void return_unused_surplus_pages(struct hstate *h,
 * on-line nodes with memory and will handle the hstate accounting.
 */
while (nr_pages--) {
-   if (!free_pool_huge_page(h, &node_states[N_HIGH_MEMORY], 1))
+   if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
break;
}
 }
@@ -1175,14 +1175,14 @@ static struct page *alloc_huge_page(struct 
vm_area_struct *vma,
 int __weak alloc_bootmem_huge_page(struct hstate *h)
 {
struct huge_bootmem_page *m;
-   int nr_nodes = nodes_weight(node_states[N_HIGH_MEMORY]);
+   int nr_nodes = nodes_weight(node_states[N_MEMORY]);
 
while (nr_nodes) {
void *addr;
 
addr = __alloc_bootmem_node_nopanic(
NODE_DATA(hstate_next_node_to_alloc(h,
-   &node_states[N_HIGH_MEMORY])),
+   &node_states[N_MEMORY])),
huge_page_size(h), huge_page_size(h), 0);
 
if (addr) {
@@ -1254,7 +1254,7 @@ static void __init hugetlb_hstate_alloc_pages(struct 
hstate *h)
if (!alloc_bootmem_huge_page(h))
break;
} else if (!alloc_fresh_huge_page(h,
-&node_states[N_HIGH_MEMORY]))
+&node_states[N_MEMORY]))
break;
}
h->max_huge_pages = i;
@@ -1522,7 +1522,7 @@ static ssize_t nr_hugepages_store_common(bool 
obey_mempolicy,
if (!(obey_mempolicy &&
init_nodemask_of_mempolicy(nodes_allowed))) {
NODEMASK_FREE(nodes_allowed);
-   nodes_allowed = &node_states[N_HIGH_MEMORY];
+   nodes_allowed = &node_states[N_MEMORY];
}
} else if (nodes_allowed) {
/*
@@ -1532,11 +1532,11 @@ static ssize_t nr_hugepages_store_common(bool 
obey_mempolicy,
count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
init_nodemask_of_node(nodes_allowed, nid);
} else
-   nodes_allowed = &node_states[N_HIGH_MEMORY];
+   nodes_allowed = &node_states[N_MEMORY];
 
h->max_huge_pages = set_max_huge_pages(h, count, nodes_allowed);
 
-   if (nodes_allowed != &node_states[N_HIGH_MEMORY])
+   if (nodes_allowed != &node_states[N_MEMORY])
NODEMASK_FREE(nodes_allowed);
 
return len;
@@ -1839,7 +1839,7 @@ static void hugetlb_register_all_nodes(void)
 {
int nid;
 
-   for_each_node_state(nid, N_HIGH_MEMORY) {
+   for_each_node_state(nid, N_MEMORY) {
struct node *node = &node_devices[nid];
if (node->dev.id == nid)
hugetlb_register_node(node);
@@ -1934,8 +1934,8 @@ void __init hugetlb_add_hstate(unsigned order)
for (i = 0; i < MAX_NUMNODES; ++i)
INIT_LIST_HEAD(&h->hugepage_freelists[i]);
INIT_LIST_HEAD(&h->hugepage_activelist);
-   h->next_nid_to_alloc = first_node(node_states[N_HIGH_MEMORY]);
-   h->next_nid_to_free = first_node(node_states[N_HIGH_MEMORY]);
+   h->next_nid_to_alloc = first_node(node_states[N_MEMORY]);
+   h->next_nid_to_free = first_node(node_states[N_MEMORY]);
snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
huge_page_size(h)/1024);
/*
@@ -2030,11 +2030,11 @@ static int hugetlb_sysctl_handler_common(bool 
obey_mempolicy,
if (!(obey_mempolicy &&
   init_nodemask_of_mempolicy(

[V4 PATCH 26/26] memory_hotplug: handle empty zone when online_movable/online_kernel

2012-09-10 Thread Lai Jiangshan
Make online_movable/online_kernel able to empty a zone
or to move memory to an empty zone.

Signed-off-by: Lai Jiangshan 
---
 mm/memory_hotplug.c |   51 +--
 1 files changed, 45 insertions(+), 6 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e691076..1903850 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -216,8 +216,17 @@ static void resize_zone(struct zone *zone, unsigned long 
start_pfn,
 
zone_span_writelock(zone);
 
-   zone->zone_start_pfn = start_pfn;
-   zone->spanned_pages = end_pfn - start_pfn;
+   if (end_pfn - start_pfn) {
+   zone->zone_start_pfn = start_pfn;
+   zone->spanned_pages = end_pfn - start_pfn;
+   } else {
+   /*
+* keep it consistent with free_area_init_core():
+* if spanned_pages == 0, then keep start_pfn == 0
+*/
+   zone->zone_start_pfn = 0;
+   zone->spanned_pages = 0;
+   }
 
zone_span_writeunlock(zone);
 }
@@ -233,10 +242,19 @@ static void fix_zone_id(struct zone *zone, unsigned long 
start_pfn,
set_page_links(pfn_to_page(pfn), zid, nid, pfn);
 }
 
-static int move_pfn_range_left(struct zone *z1, struct zone *z2,
+static int __meminit move_pfn_range_left(struct zone *z1, struct zone *z2,
unsigned long start_pfn, unsigned long end_pfn)
 {
+   int ret;
unsigned long flags;
+   unsigned long z1_start_pfn;
+
+   if (!z1->wait_table) {
+   ret = init_currently_empty_zone(z1, start_pfn,
+   end_pfn - start_pfn, MEMMAP_HOTPLUG);
+   if (ret)
+   return ret;
+   }
 
pgdat_resize_lock(z1->zone_pgdat, &flags);
 
@@ -250,7 +268,13 @@ static int move_pfn_range_left(struct zone *z1, struct 
zone *z2,
if (end_pfn <= z2->zone_start_pfn)
goto out_fail;
 
-   resize_zone(z1, z1->zone_start_pfn, end_pfn);
+   /* use start_pfn for z1's start_pfn if z1 is empty */
+   if (z1->zone_start_pfn)
+   z1_start_pfn = z1->zone_start_pfn;
+   else
+   z1_start_pfn = start_pfn;
+
+   resize_zone(z1, z1_start_pfn, end_pfn);
resize_zone(z2, end_pfn, z2->zone_start_pfn + z2->spanned_pages);
 
pgdat_resize_unlock(z1->zone_pgdat, &flags);
@@ -263,10 +287,19 @@ out_fail:
return -1;
 }
 
-static int move_pfn_range_right(struct zone *z1, struct zone *z2,
+static int __meminit move_pfn_range_right(struct zone *z1, struct zone *z2,
unsigned long start_pfn, unsigned long end_pfn)
 {
+   int ret;
unsigned long flags;
+   unsigned long z2_end_pfn;
+
+   if (!z2->wait_table) {
+   ret = init_currently_empty_zone(z2, start_pfn,
+   end_pfn - start_pfn, MEMMAP_HOTPLUG);
+   if (ret)
+   return ret;
+   }
 
pgdat_resize_lock(z1->zone_pgdat, &flags);
 
@@ -280,8 +313,14 @@ static int move_pfn_range_right(struct zone *z1, struct 
zone *z2,
if (start_pfn >= z1->zone_start_pfn + z1->spanned_pages)
goto out_fail;
 
+   /* use end_pfn for z2's end_pfn if z2 is empty */
+   if (z2->zone_start_pfn)
+   z2_end_pfn = z2->zone_start_pfn + z2->spanned_pages;
+   else
+   z2_end_pfn = end_pfn;
+
resize_zone(z1, z1->zone_start_pfn, start_pfn);
-   resize_zone(z2, start_pfn, z2->zone_start_pfn + z2->spanned_pages);
+   resize_zone(z2, start_pfn, z2_end_pfn);
 
pgdat_resize_unlock(z1->zone_pgdat, &flags);
 
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[V4 PATCH 24/26] memblock: compare current_limit with end variable at memblock_find_in_range_node()

2012-09-10 Thread Lai Jiangshan
From: Yasuaki Ishimatsu 

memblock_find_in_range_node() does not compare memblock.current_limit
with end variable. Thus even if memblock.current_limit is smaller than
end variable, the function allocates memory address that is bigger than
memblock.current_limit.

The patch adds the check to "memblock_find_in_range_node()"

Signed-off-by: Yasuaki Ishimatsu 
Signed-off-by: Lai Jiangshan 
---
 mm/memblock.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index fbf5efc..f726b5e 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -99,11 +99,12 @@ phys_addr_t __init_memblock 
memblock_find_in_range_node(phys_addr_t start,
phys_addr_t align, int nid)
 {
phys_addr_t this_start, this_end, cand;
+   phys_addr_t current_limit = memblock.current_limit;
u64 i;
 
/* pump up @end */
-   if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
-   end = memblock.current_limit;
+   if ((end == MEMBLOCK_ALLOC_ACCESSIBLE) || (end > current_limit))
+   end = current_limit;
 
/* avoid allocating the first page */
start = max_t(phys_addr_t, start, PAGE_SIZE);
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[V4 PATCH 21/26] x86: get pg_data_t's memory from other node

2012-09-10 Thread Lai Jiangshan
From: Yasuaki Ishimatsu 

If system can create movable node which all memory of the
node is allocated as ZONE_MOVABLE, setup_node_data() cannot
allocate memory for the node's pg_data_t.
So when memblock_alloc_nid() fails, setup_node_data() retries
memblock_alloc().

Signed-off-by: Yasuaki Ishimatsu 
Signed-off-by: Lai Jiangshan 
---
 arch/x86/mm/numa.c |8 ++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 2d125be..a86e315 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -223,9 +223,13 @@ static void __init setup_node_data(int nid, u64 start, u64 
end)
remapped = true;
} else {
nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
-   if (!nd_pa) {
-   pr_err("Cannot find %zu bytes in node %d\n",
+   if (!nd_pa)
+   printk(KERN_WARNING "Cannot find %zu bytes in node %d\n",
   nd_size, nid);
+   nd_pa = memblock_alloc(nd_size, SMP_CACHE_BYTES);
+   if (!nd_pa) {
+   pr_err("Cannot find %zu bytes in other node\n",
+  nd_size);
return;
}
nd = __va(nd_pa);
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[V4 PATCH 13/26] vmstat: use N_MEMORY instead N_HIGH_MEMORY

2012-09-10 Thread Lai Jiangshan
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan 
Acked-by: Christoph Lameter 
---
 mm/vmstat.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index df7a674..eeaf4e1 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -918,7 +918,7 @@ static int pagetypeinfo_show(struct seq_file *m, void *arg)
pg_data_t *pgdat = (pg_data_t *)arg;
 
/* check memoryless node */
-   if (!node_state(pgdat->node_id, N_HIGH_MEMORY))
+   if (!node_state(pgdat->node_id, N_MEMORY))
return 0;
 
seq_printf(m, "Page block order: %d\n", pageblock_order);
@@ -1280,7 +1280,7 @@ static int unusable_show(struct seq_file *m, void *arg)
pg_data_t *pgdat = (pg_data_t *)arg;
 
/* check memoryless node */
-   if (!node_state(pgdat->node_id, N_HIGH_MEMORY))
+   if (!node_state(pgdat->node_id, N_MEMORY))
return 0;
 
walk_zones_in_node(m, pgdat, unusable_show_print);
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[V4 PATCH 15/26] init: use N_MEMORY instead N_HIGH_MEMORY

2012-09-10 Thread Lai Jiangshan
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan 
---
 init/main.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/init/main.c b/init/main.c
index b286730..637e604 100644
--- a/init/main.c
+++ b/init/main.c
@@ -848,7 +848,7 @@ static int __init kernel_init(void * unused)
/*
 * init can allocate pages on any node
 */
-   set_mems_allowed(node_states[N_HIGH_MEMORY]);
+   set_mems_allowed(node_states[N_MEMORY]);
/*
 * init can run on any cpu.
 */
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[V4 PATCH 16/26] vmscan: use N_MEMORY instead N_HIGH_MEMORY

2012-09-10 Thread Lai Jiangshan
N_HIGH_MEMORY stands for the nodes that have normal or high memory.
N_MEMORY stands for the nodes that have any memory.

The code here needs to handle the nodes which have memory, so we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan 
Acked-by: Hillf Danton 
---
 mm/vmscan.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8d01243..cb42747 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3071,7 +3071,7 @@ static int __devinit cpu_callback(struct notifier_block 
*nfb,
int nid;
 
if (action == CPU_ONLINE || action == CPU_ONLINE_FROZEN) {
-   for_each_node_state(nid, N_HIGH_MEMORY) {
+   for_each_node_state(nid, N_MEMORY) {
pg_data_t *pgdat = NODE_DATA(nid);
const struct cpumask *mask;
 
@@ -3126,7 +3126,7 @@ static int __init kswapd_init(void)
int nid;
 
swap_setup();
-   for_each_node_state(nid, N_HIGH_MEMORY)
+   for_each_node_state(nid, N_MEMORY)
kswapd_run(nid);
hotcpu_notifier(cpu_callback, 0);
return 0;
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[V4 PATCH 03/26] slub, hotplug: ignore unrelated node's hot-adding and hot-removing

2012-09-10 Thread Lai Jiangshan
SLUB only focuses on the nodes which have normal memory, so ignore the other
nodes' hot-adding and hot-removing.

So we only do something when marg->status_change_nid_normal > 0.

Signed-off-by: Lai Jiangshan 
---
 mm/slub.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 8f78e25..7a1d02c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3572,7 +3572,7 @@ static void slab_mem_offline_callback(void *arg)
struct memory_notify *marg = arg;
int offline_node;
 
-   offline_node = marg->status_change_nid;
+   offline_node = marg->status_change_nid_normal;
 
/*
 * If the node still has available memory. we need kmem_cache_node
@@ -3605,7 +3605,7 @@ static int slab_mem_going_online_callback(void *arg)
struct kmem_cache_node *n;
struct kmem_cache *s;
struct memory_notify *marg = arg;
-   int nid = marg->status_change_nid;
+   int nid = marg->status_change_nid_normal;
int ret = 0;
 
/*
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[V4 PATCH 05/26] node_states: introduce N_MEMORY

2012-09-10 Thread Lai Jiangshan
We have N_NORMAL_MEMORY standing for the nodes that have normal memory with
zone_type <= ZONE_NORMAL.

And we have N_HIGH_MEMORY standing for the nodes that have normal or high
memory.

But we don't have any word to stand for the nodes that have *any* memory.

And we have N_CPU but no N_MEMORY.

Current code reuses N_HIGH_MEMORY for this purpose because any node which
has memory must currently have high memory or normal memory.

A)  But this reuse is bad for *readability*, because the name
N_HIGH_MEMORY just stands for high or normal:

A.example 1)
mem_cgroup_nr_lru_pages():
for_each_node_state(nid, N_HIGH_MEMORY)

The reader will be confused (why does this function only count nodes with
high or normal memory? does it count ZONE_MOVABLE's LRU pages?)
until someone tells them that N_HIGH_MEMORY is reused to stand for
nodes that have any memory.

A.cont) If we introduce N_MEMORY, we can reduce this confusion
AND make the code clearer:

A.example 2) mm/page_cgroup.c uses N_HIGH_MEMORY twice:

One use is in page_cgroup_init(void):
for_each_node_state(nid, N_HIGH_MEMORY) {

It means that if the node has memory, we allocate a page_cgroup map for
the node. We should use N_MEMORY here instead, to gain more clarity.

The second use is in alloc_page_cgroup():
if (node_state(nid, N_HIGH_MEMORY))
addr = vzalloc_node(size, nid);

It means the node has high or normal memory that can be allocated
by the kernel. We should keep N_HIGH_MEMORY here, and it will be better
once the "any memory" semantic of N_HIGH_MEMORY is removed.

B)  This reuse becomes outdated once we introduce MOVABLE-dedicated nodes.
A MOVABLE-dedicated node should appear in neither
node_states[N_HIGH_MEMORY] nor node_states[N_NORMAL_MEMORY],
because a MOVABLE-dedicated node has no high or normal memory.

On x86_64, N_HIGH_MEMORY == N_NORMAL_MEMORY, so if a MOVABLE-dedicated
node is in node_states[N_HIGH_MEMORY], it is also in
node_states[N_NORMAL_MEMORY], which breaks SLUB.

SLUB uses
for_each_node_state(nid, N_NORMAL_MEMORY)
and would create a kmem_cache_node for the MOVABLE-dedicated node,
causing problems.

In one word, we need N_MEMORY. We introduce it here as an alias of
N_HIGH_MEMORY and fix all improper usages of N_HIGH_MEMORY in later patches.

Signed-off-by: Lai Jiangshan 
Acked-by: Christoph Lameter 
Acked-by: Hillf Danton 
---
 include/linux/nodemask.h |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 7afc363..c6ebdc9 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -380,6 +380,7 @@ enum node_states {
 #else
N_HIGH_MEMORY = N_NORMAL_MEMORY,
 #endif
+   N_MEMORY = N_HIGH_MEMORY,
N_CPU,  /* The node has one or more cpus */
NR_NODE_STATES
 };
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[V4 PATCH 02/26] memory_hotplug: fix missing nodemask management

2012-09-10 Thread Lai Jiangshan
Currently memory_hotplug only manages node_states[N_HIGH_MEMORY];
it forgot to manage node_states[N_NORMAL_MEMORY]. Fix it.

Add check_nodemasks_changes_online() and check_nodemasks_changes_offline()
to detect whether node_states[N_HIGH_MEMORY] and node_states[N_NORMAL_MEMORY]
are changed during hotplug.

Also add @status_change_nid_normal to struct memory_notify, so that
memory hotplug callbacks know whether node_states[N_NORMAL_MEMORY]
is changed.
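
For illustration only (not part of this patch; example_mem_callback() is a
made-up name, the types and constants are the existing notifier API): a
minimal sketch of how a hotplug callback is expected to consume the new field:

	static int example_mem_callback(struct notifier_block *self,
					unsigned long action, void *arg)
	{
		struct memory_notify *marg = arg;

		/* -1 means node_states[N_NORMAL_MEMORY] is not being changed */
		if (action == MEM_GOING_ONLINE &&
		    marg->status_change_nid_normal >= 0) {
			/* create per-node structures for the node here */
		}
		return NOTIFY_OK;
	}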

Signed-off-by: Lai Jiangshan 
---
 Documentation/memory-hotplug.txt |5 ++-
 include/linux/memory.h   |1 +
 mm/memory_hotplug.c  |   94 +++--
 3 files changed, 83 insertions(+), 17 deletions(-)

diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
index 6d0c251..6e6cbc7 100644
--- a/Documentation/memory-hotplug.txt
+++ b/Documentation/memory-hotplug.txt
@@ -377,15 +377,18 @@ The third argument is passed by pointer of struct 
memory_notify.
 struct memory_notify {
unsigned long start_pfn;
unsigned long nr_pages;
+   int status_change_nid_normal;
int status_change_nid;
 }
 
 start_pfn is start_pfn of online/offline memory.
 nr_pages is # of pages of online/offline memory.
+status_change_nid_normal is set node id when N_NORMAL_MEMORY of nodemask
+is (will be) set/clear, if this is -1, then nodemask status is not changed.
 status_change_nid is set node id when N_HIGH_MEMORY of nodemask is (will be)
 set/clear. It means a new(memoryless) node gets new memory by online and a
 node loses all memory. If this is -1, then nodemask status is not changed.
-If status_changed_nid >= 0, callback should create/discard structures for the
+If status_changed_nid* >= 0, callback should create/discard structures for the
 node if necessary.
 
 --
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 1ac7f6e..6b9202b 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -53,6 +53,7 @@ int arch_get_memory_phys_device(unsigned long start_pfn);
 struct memory_notify {
unsigned long start_pfn;
unsigned long nr_pages;
+   int status_change_nid_normal;
int status_change_nid;
 };
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 3ad25f9..8c3bcf6 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -456,6 +456,34 @@ static int online_pages_range(unsigned long start_pfn, 
unsigned long nr_pages,
return 0;
 }
 
+static void check_nodemasks_changes_online(unsigned long nr_pages,
+   struct zone *zone, struct memory_notify *arg)
+{
+   int nid = zone_to_nid(zone);
+   enum zone_type zone_last = ZONE_NORMAL;
+
+   if (N_HIGH_MEMORY == N_NORMAL_MEMORY)
+   zone_last = ZONE_MOVABLE;
+
+   if (zone_idx(zone) <= zone_last && !node_state(nid, N_NORMAL_MEMORY))
+   arg->status_change_nid_normal = nid;
+   else
+   arg->status_change_nid_normal = -1;
+
+   if (!node_state(nid, N_HIGH_MEMORY))
+   arg->status_change_nid = nid;
+   else
+   arg->status_change_nid = -1;
+}
+
+static void set_nodemasks(int node, struct memory_notify *arg)
+{
+   if (arg->status_change_nid_normal >= 0)
+   node_set_state(node, N_NORMAL_MEMORY);
+
+   node_set_state(node, N_HIGH_MEMORY);
+}
+
 
 int __ref online_pages(unsigned long pfn, unsigned long nr_pages)
 {
@@ -467,13 +495,18 @@ int __ref online_pages(unsigned long pfn, unsigned long 
nr_pages)
struct memory_notify arg;
 
lock_memory_hotplug();
+   /*
+* This doesn't need a lock to do pfn_to_page().
+* The section can't be removed here because of the
+* memory_block->state_mutex.
+*/
+   zone = page_zone(pfn_to_page(pfn));
+
arg.start_pfn = pfn;
arg.nr_pages = nr_pages;
-   arg.status_change_nid = -1;
+   check_nodemasks_changes_online(nr_pages, zone, &arg);
 
nid = page_to_nid(pfn_to_page(pfn));
-   if (node_present_pages(nid) == 0)
-   arg.status_change_nid = nid;
 
ret = memory_notify(MEM_GOING_ONLINE, &arg);
ret = notifier_to_errno(ret);
@@ -483,12 +516,6 @@ int __ref online_pages(unsigned long pfn, unsigned long 
nr_pages)
return ret;
}
/*
-* This doesn't need a lock to do pfn_to_page().
-* The section can't be removed here because of the
-* memory_block->state_mutex.
-*/
-   zone = page_zone(pfn_to_page(pfn));
-   /*
 * If this zone is not populated, then it is not in zonelist.
 * This means the page allocator ignores this zone.
 * So, zonelist must be updated after online.
@@ -513,7 +540,7 @@ int __ref online_pages(unsigned long pfn, unsigned long 
nr_pages)
zone->present_pages += onlined_pages;
zone->zone_pgdat->node_present_pages

[PATCH] task_work: avoid unneeded cmpxchg() in task_work_run()

2012-10-08 Thread Lai Jiangshan
We only require the cmpxchg()-and-retry loop when the task is exiting.
xchg() is enough in the other cases, as in the original code in ac3d0da8.

So we try our best to use xchg() and avoid contention and latency
from task_work_add().

Also remove the inner loop.

Signed-off-by: Lai Jiangshan 
---
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 65bd3c9..82a42e7 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -56,14 +56,13 @@ void task_work_run(void)
 * work->func() can do task_work_add(), do not set
 * work_exited unless the list is empty.
 */
-   do {
-   work = ACCESS_ONCE(task->task_works);
-   head = !work && (task->flags & PF_EXITING) ?
-   &work_exited : NULL;
-   } while (cmpxchg(&task->task_works, work, head) != work);
-
-   if (!work)
+   if (!ACCESS_ONCE(task->task_works) ||
+   !(work = xchg(&task->task_works, NULL))) {
+   if ((task->flags & PF_EXITING) &&
+   cmpxchg(&task->task_works, NULL, &work_exited))
+   continue;
break;
+   }
/*
 * Synchronize with task_work_cancel(). It can't remove
 * the first entry == work, cmpxchg(task_works) should
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/3] lglock: remove unused DEFINE_LGLOCK_LOCKDEP()

2012-10-08 Thread Lai Jiangshan
lglocks use their own lock_key/lock_dep_map, which are defined
in struct lglock. DEFINE_LGLOCK_LOCKDEP() is now unused, so remove it
and save a small piece of memory.

Signed-off-by: Lai Jiangshan 
---
 include/linux/lglock.h |9 -
 1 files changed, 0 insertions(+), 9 deletions(-)

diff --git a/include/linux/lglock.h b/include/linux/lglock.h
index f01e5f6..45eff71 100644
--- a/include/linux/lglock.h
+++ b/include/linux/lglock.h
@@ -36,16 +36,8 @@
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 #define LOCKDEP_INIT_MAP lockdep_init_map
-
-#define DEFINE_LGLOCK_LOCKDEP(name)\
- struct lock_class_key name##_lock_key;
\
- struct lockdep_map name##_lock_dep_map;   \
- EXPORT_SYMBOL(name##_lock_dep_map)
-
 #else
 #define LOCKDEP_INIT_MAP(a, b, c, d)
-
-#define DEFINE_LGLOCK_LOCKDEP(name)
 #endif
 
 struct lglock {
@@ -57,7 +49,6 @@ struct lglock {
 };
 
 #define DEFINE_LGLOCK(name)\
-   DEFINE_LGLOCK_LOCKDEP(name);\
DEFINE_PER_CPU(arch_spinlock_t, name ## _lock)  \
= __ARCH_SPIN_LOCK_UNLOCKED;\
struct lglock name = { .lock = &name ## _lock }
-- 
1.7.4.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/3] lglock: make the per_cpu locks static

2012-10-08 Thread Lai Jiangshan
The per-CPU locks are neither used outside the defining file nor exported.
Add the "static" linkage keyword to them.

Signed-off-by: Lai Jiangshan 
---
 include/linux/lglock.h |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/linux/lglock.h b/include/linux/lglock.h
index 45eff71..8f97451 100644
--- a/include/linux/lglock.h
+++ b/include/linux/lglock.h
@@ -49,7 +49,7 @@ struct lglock {
 };
 
 #define DEFINE_LGLOCK(name)\
-   DEFINE_PER_CPU(arch_spinlock_t, name ## _lock)  \
+   static DEFINE_PER_CPU(arch_spinlock_t, name ## _lock)   \
= __ARCH_SPIN_LOCK_UNLOCKED;\
struct lglock name = { .lock = &name ## _lock }
 
-- 
1.7.4.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 3/3] lglock: add DEFINE_STATIC_LGLOCK()

2012-10-08 Thread Lai Jiangshan
When the lglock does not need to be exported,
we can use DEFINE_STATIC_LGLOCK().

Signed-off-by: Lai Jiangshan 
---
 fs/file_table.c|2 +-
 include/linux/lglock.h |8 +++-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/file_table.c b/fs/file_table.c
index 701985e..e26fd31 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -36,7 +36,7 @@ struct files_stat_struct files_stat = {
.max_files = NR_FILE
 };
 
-DEFINE_LGLOCK(files_lglock);
+DEFINE_STATIC_LGLOCK(files_lglock);
 
 /* SLAB cache for file structures */
 static struct kmem_cache *filp_cachep __read_mostly;
diff --git a/include/linux/lglock.h b/include/linux/lglock.h
index 8f97451..0d24e93 100644
--- a/include/linux/lglock.h
+++ b/include/linux/lglock.h
@@ -32,7 +32,8 @@
 #define br_write_lock(name)lg_global_lock(name)
 #define br_write_unlock(name)  lg_global_unlock(name)
 
-#define DEFINE_BRLOCK(name)DEFINE_LGLOCK(name)
+#define DEFINE_BRLOCK(name)DEFINE_LGLOCK(name)
+#define DEFINE_STATIC_BRLOCK(name) DEFINE_STATIC_LGLOCK(name)
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 #define LOCKDEP_INIT_MAP lockdep_init_map
@@ -53,6 +54,11 @@ struct lglock {
= __ARCH_SPIN_LOCK_UNLOCKED;\
struct lglock name = { .lock = &name ## _lock }
 
+#define DEFINE_STATIC_LGLOCK(name) \
+   static DEFINE_PER_CPU(arch_spinlock_t, name ## _lock)   \
+   = __ARCH_SPIN_LOCK_UNLOCKED;\
+   static struct lglock name = { .lock = &name ## _lock }
+
 void lg_lock_init(struct lglock *lg, char *name);
 void lg_local_lock(struct lglock *lg);
 void lg_local_unlock(struct lglock *lg);
-- 
1.7.4.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH V2] task_work: avoid unneeded cmpxchg() in task_work_run()

2012-10-09 Thread Lai Jiangshan
On 10/09/2012 07:04 PM, Peter Zijlstra wrote:
> On Mon, 2012-10-08 at 14:38 +0200, Oleg Nesterov wrote:
>> But the code looks more complex, and the only advantage is that
>> non-exiting task does xchg() instead of cmpxchg(). Not sure this
>> worth the trouble, in this case task_work_run() will likey run
>> the callbacks (the caller checks ->task_works != NULL), I do not
>> think this can add any noticeable speedup. 
> 
> Yeah, I agree, the patch doesn't seem worth the trouble. It makes tricky
> code unreadable at best.
> 

To gain better readability, we need to move the work_exited handling
out of task_work_run() too.

Thanks,
Lai

Subject: task_work: avoid unneeded cmpxchg() in task_work_run()

We only require the cmpxchg()-and-retry loop when the task is exiting.
xchg() is enough in the other cases, as in the original code in ac3d0da8.

So we use xchg() in task_work_run() and move the work_exited logic
out of task_work_run() into exit_task_work().

Signed-off-by: Lai Jiangshan 
---

diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index ca5a1cf..1e686a5 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -15,10 +15,6 @@ init_task_work(struct callback_head *twork, task_work_func_t 
func)
 int task_work_add(struct task_struct *task, struct callback_head *twork, bool);
 struct callback_head *task_work_cancel(struct task_struct *, task_work_func_t);
 void task_work_run(void);
-
-static inline void exit_task_work(struct task_struct *task)
-{
-   task_work_run();
-}
+void exit_task_work(struct task_struct *task);
 
 #endif /* _LINUX_TASK_WORK_H */
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 65bd3c9..87ef3b7 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -52,16 +52,7 @@ void task_work_run(void)
struct callback_head *work, *head, *next;
 
for (;;) {
-   /*
-* work->func() can do task_work_add(), do not set
-* work_exited unless the list is empty.
-*/
-   do {
-   work = ACCESS_ONCE(task->task_works);
-   head = !work && (task->flags & PF_EXITING) ?
-   &work_exited : NULL;
-   } while (cmpxchg(&task->task_works, work, head) != work);
-
+   work = xchg(&task->task_works, NULL);
if (!work)
break;
/*
@@ -90,3 +81,17 @@ void task_work_run(void)
} while (work);
}
 }
+
+void exit_task_work(struct task_struct *task)
+{
+   for (;;) {
+   /*
+* work->func() can do task_work_add(), do not set
+* work_exited unless the list is empty.
+*/
+   if (unlikely(task->task_works))
+   task_work_run();
+   if (cmpxchg(&task->task_works, NULL, &work_exited) == NULL)
+   break;
+   }
+}
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] percpu-rwsem: use barrier in unlock path

2012-10-17 Thread Lai Jiangshan
On 10/18/2012 04:28 AM, Steven Rostedt wrote:
> On Wed, Oct 17, 2012 at 11:07:21AM -0400, Mikulas Patocka wrote:
>>>
>>> Even the previous patch is applied, percpu_down_read() still
>>> needs mb() to pair with it.
>>
>> percpu_down_read uses rcu_read_lock which should guarantee that memory 
>> accesses don't escape in front of a rcu-protected section.
> 
> You do realize that rcu_read_lock() does nothing more that a barrier(),
> right?
> 
> Paul worked really hard to get rcu_read_locks() to not call HW barriers.
> 
>>
>> If rcu_read_unlock has only an unlock barrier and not a full barrier, 
>> memory accesses could be moved in front of rcu_read_unlock and reordered 
>> with this_cpu_inc(*p->counters), but it doesn't matter because 
>> percpu_down_write does synchronize_rcu(), so it never sees these accesses 
>> halfway through.
> 
> Looking at the patch, you are correct. The read side doesn't need the
> memory barrier as the worse thing that will happen is that it sees the
> locked = false, and will just grab the mutex unnecessarily.

-
A memory barrier can be added only if these two things are known:
1) which reordering it prevents, i.e. between what and what;
2) which corresponding mb() it pairs with.

You tried to add an mb() in percpu_up_write(); OK, I know it prevents reordering
between the writes to the protected data and the statement "p->locked = false",
but I can't find the corresponding mb() that it pairs with.

percpu_down_read()                              writes to the data
    the CPU caches/prefetches the data          writes to the data
    (which is chaos)                            writes to the data
                                                percpu_up_write()
                                                    mb()
                                                    p->locked = false;
    unlikely(p->locked)
        the CPU sees p->locked == false,
        doesn't discard the cached/prefetched data
    this_cpu_inc(*p->counters);
    the read-access code runs,
    *and we use the chaos data*

So you need to add a mb() after "unlikely(p->locked)".

-

The RCU you use doesn't protect any data. It protects the code of the fast path:
unlikely(p->locked);
this_cpu_inc(*p->counters);

and synchronize_rcu() ensures that all previous fast paths have fully finished
"this_cpu_inc(*p->counters);".

It doesn't protect other code or data; if you want to protect other code or
other data, please add more synchronization or mb()s.

---

I really dislike synchronization that protects code instead of data,
but sometimes I also have to do it.

---

A rough draft example of paired mb()s is here:


diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index cf80f7e..84a93c0 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -12,6 +12,14 @@ struct percpu_rw_semaphore {
struct mutex mtx;
 };
 
+#if 1
+#define light_mb() barrier()
+#define heavy_mb() synchronize_sched()
+#else
+#define light_mb() smp_mb()
+#define heavy_mb() smp_mb();
+#endif
+
 static inline void percpu_down_read(struct percpu_rw_semaphore *p)
 {
rcu_read_lock();
@@ -24,22 +32,12 @@ static inline void percpu_down_read(struct 
percpu_rw_semaphore *p)
}
this_cpu_inc(*p->counters);
rcu_read_unlock();
+   light_mb(); /* A, between read of p->locked and read of data, paired with D */
 }
 
 static inline void percpu_up_read(struct percpu_rw_semaphore *p)
 {
-   /*
-* On X86, write operation in this_cpu_dec serves as a memory unlock
-* barrier (i.e. memory accesses may be moved before the write, but
-* no memory accesses are moved past the write).
-* On other architectures this may not be the case, so we need smp_mb()
-* there.
-*/
-#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
-   barrier();
-#else
-   smp_mb();
-#endif
+   light_mb(); /* B, between read of the data and write to p->counter, paired with C */
this_cpu_dec(*p->counters);
 }
 
@@ -61,11 +59,12 @@ static inline void percpu_down_write(struct 
percpu_rw_semaphore *p)
synchronize_rcu();
while (__percpu_count(p->counters))
msleep(1);
-   smp_rmb(); /* paired with smp_mb() in percpu_sem_up_read() */
+   heavy_mb(); /* C, between read of p->counter and write to data, paired with B */
 }
 
 static inline void percpu_up_write(struct percpu_rw_semaphore *p)
 {
+   heavy_mb(); /* D, between write to data and write to p->locked, paired with A */
p->locked = false;
mutex_unlock(&p->mtx);
 }
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at 

Re: [PATCH 0/8] workqueue: advance concurrency management

2013-04-18 Thread Lai Jiangshan
Ping.

On Mon, Apr 15, 2013 at 12:41 AM, Lai Jiangshan  wrote:
> I found the early-increasing nr_running in wq_worker_waking_up() is useless
> in many cases. it tries to avoid waking up idle workers for pending work item.
> but delay increasing nr_running does not increase waking up idle workers.
>
> so we delay increasing and remove wq_worker_waking_up() and ...
>
> enjoy a simpler concurrency management.
>
> Lai Jiangshan (8):
>   workqueue: remove @cpu from wq_worker_sleeping()
>   workqueue: use create_and_start_worker() in manage_workers()
>   workqueue: remove cpu_intensive from process_one_work()
>   workqueue: quit cm mode when sleeping
>   workqueue: remove disabled wq_worker_waking_up()
>   workqueue: make nr_running non-atomic
>   workqueue: move worker->flags up
>   workqueue: rename ->nr_running to ->nr_cm_workers
>
>  kernel/sched/core.c |6 +-
>  kernel/workqueue.c  |  234 +++---
>  kernel/workqueue_internal.h |9 +-
>  3 files changed, 89 insertions(+), 160 deletions(-)
>
> --
> 1.7.7.6
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/8] workqueue: advance concurrency management

2013-04-20 Thread Lai Jiangshan
On Sat, Apr 20, 2013 at 2:11 AM, Tejun Heo  wrote:
> Hey,
>
> On Fri, Apr 19, 2013 at 06:10:57AM +0800, Lai Jiangshan wrote:
>> Ping.
>
> Sorry, I've been at collab summit / lsf.  Plus, it's a bit too late
> for for-3.10 anyway.  Anyways, after glancing over it, here are my
> preliminary thoughts.  The first one looks good but I'm not sure about
> dropping nr_running adjustment.  The only real benefit coming from
> that is dropping a sched callback and if there's any performance /
> overhead impact, I'm afraid it's gonna be negative.  There are actual
> benefits in using as few tasks as possible -

The waking_up() callback doesn't win much here.


> the cache footprint gets smaller,

The cache footprint is also reduced in a different way by this patchset,
and memory atomic operations are reduced.

> so unless there's a clear indication that the suggested

Only simplicity,
and the removal of an optimization that only matters in rare cases.

> behavior is better in some way, I'm not sure what we're buying with
> the proposed changes.
>
> Thanks.
>
> --
> tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/7] workqueue: add __WQ_FREEZING and remove POOL_FREEZING

2013-04-20 Thread Lai Jiangshan
Please forget all my other patches.

But these 1/7 and 2/7 __WQ_FREEZING patches can still go into 3.10.

On Thu, Apr 4, 2013 at 10:12 PM, Tejun Heo  wrote:
> Hello, Lai.
>
> On Thu, Apr 04, 2013 at 10:05:32AM +0800, Lai Jiangshan wrote:
>> @@ -4757,25 +4747,16 @@ void thaw_workqueues(void)
>>  {
>>   struct workqueue_struct *wq;
>>   struct pool_workqueue *pwq;
>> - struct worker_pool *pool;
>> - int pi;
>>
>>   mutex_lock(&wq_pool_mutex);
>>
>>   if (!workqueue_freezing)
>>   goto out_unlock;
>>
>> - /* clear FREEZING */
>> - for_each_pool(pool, pi) {
>> - spin_lock_irq(&pool->lock);
>> - WARN_ON_ONCE(!(pool->flags & POOL_FREEZING));
>> - pool->flags &= ~POOL_FREEZING;
>> - spin_unlock_irq(&pool->lock);
>> - }
>> -
>>   /* restore max_active and repopulate worklist */
>>   list_for_each_entry(wq, &workqueues, list) {
>>   mutex_lock(&wq->mutex);
>> + wq->flags &= ~__WQ_FREEZING;
>
> I want an assertion here.

The freezing code is very simple to verify.

> Maybe we can fold the next patch into this
> one and add WARN_ON_ONCE() here?

I consider the two patches to have different intents.

Thanks,
Lai

>
>>   for_each_pwq(pwq, wq)
>>   pwq_adjust_max_active(pwq);
>>   mutex_unlock(&wq->mutex);
>
> Thanks.
>
> --
> tejun
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-12 Thread Lai Jiangshan
g cmpxchg.
>>
>> Another point is that the 16 threads maybe setting up the queues in
>> consecutive slots in the head table. This is both a source of
>> contention and a waste of effort. One possible solution is to add
>> one more field (set to cpuid + 1, for example) to indicate that that
>> setup is being done with asp set to the target lock address
>> immediately. We will need to use cmpxchg128() for 64-bit machine,
>> though. Another solution is to have only that thread with ticket
>> number that is a fixed distance from head (e.g. 16*2) to do the
>> queue setup while the rest wait until the setup is done before
>> spinning on the queue.
>>
>> As my colleague Davidlohr had reported there are more regressions
>> than performance improvement in the AIM7 benchmark. I believe that
>> queue setup contention is likely a source of performance regression.
>
> Please see below for a v3 patch that:
>
> 1.  Fixes cpu_relax().
>
> 2.  Tests before doing cmpxchg().
>
> 3.  Reduces the number of CPUs attempting to set up the queue,
> in the common case, to a single CPU.  (Multiple CPUs can
> still be trying to set up the queue given unfortunate
> sequences of concurrent ticket-lock handoffs.)
>
> Please let me know how it goes!
>
> Thanx, Paul
>
> 
>
> ticketlock: Add queued-ticketlock capability
>
> Breaking up locks is better than implementing high-contention locks, but
> if we must have high-contention locks, why not make them automatically
> switch between light-weight ticket locks at low contention and queued
> locks at high contention?  After all, this would remove the need for
> the developer to predict which locks will be highly contended.
>
> This commit allows ticket locks to automatically switch between pure
> ticketlock and queued-lock operation as needed.  If too many CPUs are
> spinning on a given ticket lock, a queue structure will be allocated
> and the lock will switch to queued-lock operation.  When the lock becomes
> free, it will switch back into ticketlock operation.  The low-order bit
> of the head counter is used to indicate that the lock is in queued mode,
> which forces an unconditional mismatch between the head and tail counters.
> This approach means that the common-case code path under conditions of
> low contention is very nearly that of a plain ticket lock.
>
> A fixed number of queueing structures is statically allocated in an
> array.  The ticket-lock address is used to hash into an initial element,
> but if that element is already in use, it moves to the next element.  If
> the entire array is already in use, continue to spin in ticket mode.
>
> Signed-off-by: Paul E. McKenney 
> [ paulmck: Eliminate duplicate code and update comments (Steven Rostedt). ]
> [ paulmck: Address Eric Dumazet review feedback. ]
> [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ]
> [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ]
> [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ]
> [ paulmck: Reduce queue-switch contention (Waiman Long). ]
>
> diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> index 33692ea..509c51a 100644
> --- a/arch/x86/include/asm/spinlock.h
> +++ b/arch/x86/include/asm/spinlock.h
> @@ -34,6 +34,21 @@
>  # define UNLOCK_LOCK_PREFIX
>  #endif
>
> +#ifdef CONFIG_TICKET_LOCK_QUEUED
> +
> +#define __TKT_SPIN_INC 2
> +bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets inc);
> +
> +#else /* #ifdef CONFIG_TICKET_LOCK_QUEUED */
> +
> +#define __TKT_SPIN_INC 1
> +static inline bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets 
> inc)
> +{
> +   return false;
> +}
> +
> +#endif /* #else #ifdef CONFIG_TICKET_LOCK_QUEUED */
> +
>  /*
>   * Ticket locks are conceptually two parts, one indicating the current head 
> of
>   * the queue, and the other indicating the current tail. The lock is acquired
> @@ -49,17 +64,16 @@
>   */
>  static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
>  {
> -   register struct __raw_tickets inc = { .tail = 1 };
> +   register struct __raw_tickets inc = { .tail = __TKT_SPIN_INC };
>
> inc = xadd(&lock->tickets, inc);
> -
> for (;;) {
> -   if (inc.head == inc.tail)
> +   if (inc.head == inc.tail || tkt_spin_pass(lock, inc))
> break;
> cpu_relax();
> inc.head = ACCESS_ONCE(lock->tickets.head);
> }
> -  

Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock

2013-06-12 Thread Lai Jiangshan
On Wed, Jun 12, 2013 at 11:40 PM, Paul E. McKenney
 wrote:
> Breaking up locks is better than implementing high-contention locks, but
> if we must have high-contention locks, why not make them automatically
> switch between light-weight ticket locks at low contention and queued
> locks at high contention?  After all, this would remove the need for
> the developer to predict which locks will be highly contended.
>
> This commit allows ticket locks to automatically switch between pure
> ticketlock and queued-lock operation as needed.  If too many CPUs are
> spinning on a given ticket lock, a queue structure will be allocated
> and the lock will switch to queued-lock operation.  When the lock becomes
> free, it will switch back into ticketlock operation.  The low-order bit
> of the head counter is used to indicate that the lock is in queued mode,
> which forces an unconditional mismatch between the head and tail counters.
> This approach means that the common-case code path under conditions of
> low contention is very nearly that of a plain ticket lock.
>
> A fixed number of queueing structures is statically allocated in an
> array.  The ticket-lock address is used to hash into an initial element,
> but if that element is already in use, it moves to the next element.  If
> the entire array is already in use, continue to spin in ticket mode.
>
> Signed-off-by: Paul E. McKenney 
> [ paulmck: Eliminate duplicate code and update comments (Steven Rostedt). ]
> [ paulmck: Address Eric Dumazet review feedback. ]
> [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ]
> [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ]
> [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ]
> [ paulmck: Reduce queue-switch contention (Waiman Long). ]
> [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen Persvold). ]
> [ paulmck: Type safety fixes (Steven Rostedt). ]
> [ paulmck: Pre-check cmpxchg() value (Waiman Long). ]
> [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ]
>
> diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> index 33692ea..5aa0177 100644
> --- a/arch/x86/include/asm/spinlock.h
> +++ b/arch/x86/include/asm/spinlock.h
> @@ -34,6 +34,21 @@
>  # define UNLOCK_LOCK_PREFIX
>  #endif
>
> +#ifdef CONFIG_TICKET_LOCK_QUEUED
> +
> +#define __TKT_SPIN_INC 2
> +bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets inc);
> +
> +#else /* #ifdef CONFIG_TICKET_LOCK_QUEUED */
> +
> +#define __TKT_SPIN_INC 1
> +static inline bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets 
> inc)
> +{
> +   return false;
> +}
> +
> +#endif /* #else #ifdef CONFIG_TICKET_LOCK_QUEUED */
> +
>  /*
>   * Ticket locks are conceptually two parts, one indicating the current head 
> of
>   * the queue, and the other indicating the current tail. The lock is acquired
> @@ -49,17 +64,16 @@
>   */
>  static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
>  {
> -   register struct __raw_tickets inc = { .tail = 1 };
> +   register struct __raw_tickets inc = { .tail = __TKT_SPIN_INC };
>
> inc = xadd(&lock->tickets, inc);
> -
> for (;;) {
> -   if (inc.head == inc.tail)
> +   if (inc.head == inc.tail || tkt_spin_pass(lock, inc))
> break;
> cpu_relax();
> inc.head = ACCESS_ONCE(lock->tickets.head);
> }
> -   barrier();  /* make sure nothing creeps before the lock 
> is taken */
> +   barrier(); /* Make sure nothing creeps in before the lock is taken. */
>  }
>
>  static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
> @@ -70,17 +84,33 @@ static __always_inline int 
> __ticket_spin_trylock(arch_spinlock_t *lock)
> if (old.tickets.head != old.tickets.tail)
> return 0;
>
> -   new.head_tail = old.head_tail + (1 << TICKET_SHIFT);
> +   new.head_tail = old.head_tail + (__TKT_SPIN_INC << TICKET_SHIFT);
>
> /* cmpxchg is a full barrier, so nothing can move before it */
> return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == 
> old.head_tail;
>  }
>
> +#ifndef CONFIG_TICKET_LOCK_QUEUED
> +
>  static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
>  {
> __add(&lock->tickets.head, 1, UNLOCK_LOCK_PREFIX);
>  }
>
> +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
> +
> +extern void tkt_q_do_wake(arch_spinlock_t *lock);
> +
> +static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
> +{
> +   __ticket_t head = 2;
> +
> +   head = xadd

Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock

2013-06-12 Thread Lai Jiangshan
On 06/12/2013 11:40 PM, Paul E. McKenney wrote:
> Breaking up locks is better than implementing high-contention locks, but
> if we must have high-contention locks, why not make them automatically
> switch between light-weight ticket locks at low contention and queued
> locks at high contention?  After all, this would remove the need for
> the developer to predict which locks will be highly contended.
> 
> This commit allows ticket locks to automatically switch between pure
> ticketlock and queued-lock operation as needed.  If too many CPUs are
> spinning on a given ticket lock, a queue structure will be allocated
> and the lock will switch to queued-lock operation.  When the lock becomes
> free, it will switch back into ticketlock operation.  The low-order bit
> of the head counter is used to indicate that the lock is in queued mode,
> which forces an unconditional mismatch between the head and tail counters.
> This approach means that the common-case code path under conditions of
> low contention is very nearly that of a plain ticket lock.
> 
> A fixed number of queueing structures is statically allocated in an
> array.  The ticket-lock address is used to hash into an initial element,
> but if that element is already in use, it moves to the next element.  If
> the entire array is already in use, continue to spin in ticket mode.
> 
> Signed-off-by: Paul E. McKenney 
> [ paulmck: Eliminate duplicate code and update comments (Steven Rostedt). ]
> [ paulmck: Address Eric Dumazet review feedback. ]
> [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ]
> [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ]
> [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ]
> [ paulmck: Reduce queue-switch contention (Waiman Long). ]
> [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen Persvold). ]
> [ paulmck: Type safety fixes (Steven Rostedt). ]
> [ paulmck: Pre-check cmpxchg() value (Waiman Long). ]
> [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ]


Hi, Paul,

I simplified the code and removed the search by encoding the index of
struct tkt_q_head into lock->tickets.head.

A) lock->tickets.head (when queued-lock):
---------------------------------
 index of struct tkt_q_head |0|1|
---------------------------------

Bit0 = 1 means queued, and bit1 = 0 is reserved for __ticket_spin_unlock(),
thus __ticket_spin_unlock() will not change the higher bits of
lock->tickets.head.

B) tqhp->head holds the real value of lock->tickets.head.
If the last bit of tqhp->head is 1, it marks the head ticket at the time
queuing started.
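
For clarity, a minimal sketch of the encode/decode this implies (my reading,
not part of the patch below; __ticket_t, tkt_q_heads[] and ACCESS_ONCE() are
the existing definitions):

	static inline __ticket_t tkt_q_encode_head(int index)
	{
		/* bit0 = 1: queued; bit1 = 0: reserved for __ticket_spin_unlock() */
		return ((__ticket_t)index << 2) | 0x1;
	}

	static inline struct tkt_q_head *tkt_q_decode_head(arch_spinlock_t *lock)
	{
		/* the queue index lives in the bits above bit1 */
		return &tkt_q_heads[ACCESS_ONCE(lock->tickets.head) >> 2];
	}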

Thanks,
Lai

 kernel/tktqlock.c |   51 +--
 1 files changed, 13 insertions(+), 38 deletions(-)

diff --git a/kernel/tktqlock.c b/kernel/tktqlock.c
index 912817c..1329d0f 100644
--- a/kernel/tktqlock.c
+++ b/kernel/tktqlock.c
@@ -33,7 +33,7 @@ struct tkt_q {
 
 struct tkt_q_head {
arch_spinlock_t *ref;   /* Pointer to spinlock if in use. */
-   s64 head_tkt;   /* Head ticket when started queuing. */
+   __ticket_t head;/* Real head when queued. */
struct tkt_q *spin; /* Head of queue. */
struct tkt_q **spin_tail;   /* Tail of queue. */
 };
@@ -77,15 +77,8 @@ static unsigned long tkt_q_hash(arch_spinlock_t *lock)
  */
 static struct tkt_q_head *tkt_q_find_head(arch_spinlock_t *lock)
 {
-   int i;
-   int start;
-
-   start = i = tkt_q_hash(lock);
-   do
-   if (ACCESS_ONCE(tkt_q_heads[i].ref) == lock)
-   return &tkt_q_heads[i];
-   while ((i = tkt_q_next_slot(i)) != start);
-   return NULL;
+   BUILD_BUG_ON(TKT_Q_NQUEUES > (1 << (TICKET_SHIFT - 2)));
+   return &tkt_q_heads[ACCESS_ONCE(lock->tickets.head) >> 2];
 }
 
 /*
@@ -101,11 +94,11 @@ static bool tkt_q_try_unqueue(arch_spinlock_t *lock, 
struct tkt_q_head *tqhp)
 
/* Pick up the ticket values. */
asold = ACCESS_ONCE(*lock);
-   if ((asold.tickets.head & ~0x1) == asold.tickets.tail) {
+   if (tqhp->head == asold.tickets.tail) {
 
/* Attempt to mark the lock as not having a queue. */
asnew = asold;
-   asnew.tickets.head &= ~0x1;
+   asnew.tickets.head = tqhp->head;
if (cmpxchg(&lock->head_tail,
asold.head_tail,
asnew.head_tail) == asold.head_tail) {
@@ -128,12 +121,9 @@ void tkt_q_do_wake(arch_spinlock_t *lock)
struct tkt_q_head *tqhp;
struct tkt_q *tqp;
 
-   /*
-* If the queue is still being set up, wait for it.  Note that
-* the caller's xadd() provides the needed memory ordering.
-*/
-   while ((tqhp = tkt_q_find_head(loc

Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock

2013-06-13 Thread Lai Jiangshan
On Thu, Jun 13, 2013 at 11:22 PM, Paul E. McKenney
 wrote:
> On Thu, Jun 13, 2013 at 10:55:41AM +0800, Lai Jiangshan wrote:
>> On 06/12/2013 11:40 PM, Paul E. McKenney wrote:
>> > Breaking up locks is better than implementing high-contention locks, but
>> > if we must have high-contention locks, why not make them automatically
>> > switch between light-weight ticket locks at low contention and queued
>> > locks at high contention?  After all, this would remove the need for
>> > the developer to predict which locks will be highly contended.
>> >
>> > This commit allows ticket locks to automatically switch between pure
>> > ticketlock and queued-lock operation as needed.  If too many CPUs are
>> > spinning on a given ticket lock, a queue structure will be allocated
>> > and the lock will switch to queued-lock operation.  When the lock becomes
>> > free, it will switch back into ticketlock operation.  The low-order bit
>> > of the head counter is used to indicate that the lock is in queued mode,
>> > which forces an unconditional mismatch between the head and tail counters.
>> > This approach means that the common-case code path under conditions of
>> > low contention is very nearly that of a plain ticket lock.
>> >
>> > A fixed number of queueing structures is statically allocated in an
>> > array.  The ticket-lock address is used to hash into an initial element,
>> > but if that element is already in use, it moves to the next element.  If
>> > the entire array is already in use, continue to spin in ticket mode.
>> >
>> > Signed-off-by: Paul E. McKenney 
>> > [ paulmck: Eliminate duplicate code and update comments (Steven Rostedt). ]
>> > [ paulmck: Address Eric Dumazet review feedback. ]
>> > [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ]
>> > [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ]
>> > [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ]
>> > [ paulmck: Reduce queue-switch contention (Waiman Long). ]
>> > [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen Persvold). ]
>> > [ paulmck: Type safety fixes (Steven Rostedt). ]
>> > [ paulmck: Pre-check cmpxchg() value (Waiman Long). ]
>> > [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ]
>>
>>
>> Hi, Paul,
>>
>> I simplify the code and remove the search by encoding the index of struct 
>> tkt_q_head
>> into lock->tickets.head.
>>
>> A) lock->tickets.head(when queued-lock):
>> ---------------------------------
>>  index of struct tkt_q_head |0|1|
>> ---------------------------------
>
> Interesting approach!  It might reduce queued-mode overhead a bit in
> some cases, though I bet that in the common case the first queue element
> examined is the right one.  More on this below.
>
>> The bit0 = 1 for queued, and the bit1 = 0 is reserved for 
>> __ticket_spin_unlock(),
>> thus __ticket_spin_unlock() will not change the higher bits of 
>> lock->tickets.head.
>>
>> B) tqhp->head is for the real value of lock->tickets.head.
>> if the last bit of tqhp->head is 1, it means it is the head ticket when 
>> started queuing.
>
> But don't you also need the xadd() in __ticket_spin_unlock() to become
> a cmpxchg() for this to work?  Or is your patch missing your changes to
> arch/x86/include/asm/spinlock.h?  Either way, this is likely to increase
> the no-contention overhead, which might be counterproductive.  Wouldn't
> hurt to get measurements, though.

No, there is no need to change __ticket_spin_unlock() in my idea.
Bit1 in tickets.head is reserved for __ticket_spin_unlock():
__ticket_spin_unlock() only changes bit1 and will not change
the higher bits. tkt_q_do_wake() will restore tickets.head.

This approach avoids a cmpxchg in __ticket_spin_unlock().
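
A minimal sketch of what I mean (assumption: while queued, tickets.head ==
(index << 2) | 0x1, so bit1 is 0; the function name below is made up, the
body mirrors the queued-mode unlock already in your patch):

	static __always_inline void ticket_spin_unlock_sketch(arch_spinlock_t *lock)
	{
		__ticket_t head = 2;

		/* bit1 goes 0 -> 1; no carry can reach the index bits */
		head = xadd(&lock->tickets.head, head);
		if (head & 0x1)			/* bit0 set: lock is in queued mode */
			tkt_q_do_wake(lock);	/* restores the real tickets.head */
	}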

>
> Given the results that Davidlohr posted, I believe that the following
> optimizations would also provide some improvement:
>
> 1.  Move the call to tkt_spin_pass() from __ticket_spin_lock()
> to a separate linker section in order to reduce the icache
> penalty exacted by the spinloop.  This is likely to be causing
> some of the performance reductions in the cases where ticket
> locks are not highly contended.
>
> 2.  Limit the number of elements searched for in the array of
> queues.  However, this would help only if a number of ticket
> locks were in queued mode at the same time.
>
> 3.  Dynamically allocate the queue array at boot.  This might
>

Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock

2013-06-13 Thread Lai Jiangshan
On 06/14/2013 07:57 AM, Paul E. McKenney wrote:
> On Fri, Jun 14, 2013 at 07:25:57AM +0800, Lai Jiangshan wrote:
>> On Thu, Jun 13, 2013 at 11:22 PM, Paul E. McKenney
>>  wrote:
>>> On Thu, Jun 13, 2013 at 10:55:41AM +0800, Lai Jiangshan wrote:
>>>> On 06/12/2013 11:40 PM, Paul E. McKenney wrote:
>>>>> Breaking up locks is better than implementing high-contention locks, but
>>>>> if we must have high-contention locks, why not make them automatically
>>>>> switch between light-weight ticket locks at low contention and queued
>>>>> locks at high contention?  After all, this would remove the need for
>>>>> the developer to predict which locks will be highly contended.
>>>>>
>>>>> This commit allows ticket locks to automatically switch between pure
>>>>> ticketlock and queued-lock operation as needed.  If too many CPUs are
>>>>> spinning on a given ticket lock, a queue structure will be allocated
>>>>> and the lock will switch to queued-lock operation.  When the lock becomes
>>>>> free, it will switch back into ticketlock operation.  The low-order bit
>>>>> of the head counter is used to indicate that the lock is in queued mode,
>>>>> which forces an unconditional mismatch between the head and tail counters.
>>>>> This approach means that the common-case code path under conditions of
>>>>> low contention is very nearly that of a plain ticket lock.
>>>>>
>>>>> A fixed number of queueing structures is statically allocated in an
>>>>> array.  The ticket-lock address is used to hash into an initial element,
>>>>> but if that element is already in use, it moves to the next element.  If
>>>>> the entire array is already in use, continue to spin in ticket mode.
>>>>>
>>>>> Signed-off-by: Paul E. McKenney 
>>>>> [ paulmck: Eliminate duplicate code and update comments (Steven Rostedt). 
>>>>> ]
>>>>> [ paulmck: Address Eric Dumazet review feedback. ]
>>>>> [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ]
>>>>> [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ]
>>>>> [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ]
>>>>> [ paulmck: Reduce queue-switch contention (Waiman Long). ]
>>>>> [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen Persvold). 
>>>>> ]
>>>>> [ paulmck: Type safety fixes (Steven Rostedt). ]
>>>>> [ paulmck: Pre-check cmpxchg() value (Waiman Long). ]
>>>>> [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ]
>>>>
>>>>
>>>> Hi, Paul,
>>>>
>>>> I simplify the code and remove the search by encoding the index of struct 
>>>> tkt_q_head
>>>> into lock->tickets.head.
>>>>
>>>> A) lock->tickets.head(when queued-lock):
>>>> ---------------------------------
>>>>  index of struct tkt_q_head |0|1|
>>>> ---------------------------------
>>>
>>> Interesting approach!  It might reduce queued-mode overhead a bit in
>>> some cases, though I bet that in the common case the first queue element
>>> examined is the right one.  More on this below.
>>>
>>>> The bit0 = 1 for queued, and the bit1 = 0 is reserved for 
>>>> __ticket_spin_unlock(),
>>>> thus __ticket_spin_unlock() will not change the higher bits of 
>>>> lock->tickets.head.
>>>>
>>>> B) tqhp->head is for the real value of lock->tickets.head.
>>>> if the last bit of tqhp->head is 1, it means it is the head ticket when 
>>>> started queuing.
>>>
>>> But don't you also need the xadd() in __ticket_spin_unlock() to become
>>> a cmpxchg() for this to work?  Or is your patch missing your changes to
>>> arch/x86/include/asm/spinlock.h?  Either way, this is likely to increase
>>> the no-contention overhead, which might be counterproductive.  Wouldn't
>>> hurt to get measurements, though.
>>
>> No, don't need to change __ticket_spin_unlock() in my idea.
>> bit1 in the  tickets.head is reserved for __ticket_spin_unlock(),
>> __ticket_spin_unlock() only changes the bit1, it will not change
>> the higher bits. tkt_q_do_wake() will restore the tickets.head.
>>
>> This approach avoids cmpxchg in __ticket_spin_unlock().
> 
> Ah, I did miss that.  But doesn't the adjustment in __ticket_spin_lock()
> need to be atomic in order to handle concurrent invocations of
> __ticket_spin_lock()?

I don't understand. Are we discussing __ticket_spin_unlock() only?
Or does my suggestion hurt __ticket_spin_lock()?

> 
> Either way, it would be good to see the performance effects of this.
> 
>   Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] clk: remove the clk_notifier from clk_notifier_list before free it (was: Re: [BUG] zynq | CCF | SRCU)

2013-06-03 Thread Lai Jiangshan
On 06/01/2013 03:12 AM, Sören Brinkmann wrote:
> Hi,
> 
> we recently encountered some kernel panics when we compiled one of our
> drivers as module and tested inserting/removing the module.
> Trying to debug this issue, I could reproduce it on the mainline kernel
> with a dummy module.
> 
> What happens is, that when on driver remove clk_notifier_unregister() is
> called and no other notifier for that clock is registered, the kernel
> panics.
> I'm not sure what is going wrong here. If there is a bug (and if where)
> or I'm just using the infrastructure the wrong way,... So, any hint is
> appreciated.
> 
> I attach the output from the crashing system. The stacktrace indicates a
> crash in 'srcu_readers_seq_idx()'.
> I also attach the module I used to trigger the issue and a patch on top
> of mainline commit a93cb29acaa8f75618c3f202d1cf43c231984644 which has
> the DT modifications I need to make the module find its clock and boot
> with my initramfs.
> 
> 
>   Thanks,
>   Sören
> 



Hi, Sören Brinkmann

I guess:

modprobe clk_notif_dbg
modprobe clk_notif_dbg -r  # memory corrupted here
modprobe clk_notif_dbg # accesses corrupted memory, but no visible bug
modprobe clk_notif_dbg -r  # accesses corrupted memory, BUG


How the first "modprobe clk_notif_dbg -r" corrupts memory:

=

int clk_notifier_unregister(struct clk *clk, struct notifier_block *nb)
{
struct clk_notifier *cn = NULL;
int ret = -EINVAL;

if (!clk || !nb)
return -EINVAL;

clk_prepare_lock();

list_for_each_entry(cn, &clk_notifier_list, node)
if (cn->clk == clk)
break;

if (cn->clk == clk) {
ret = srcu_notifier_chain_unregister(&cn->notifier_head, nb);

clk->notifier_count--;

/* XXX the notifier code should handle this better */
if (!cn->notifier_head.head) {
srcu_cleanup_notifier_head(&cn->notifier_head);

===> the code forgot to remove @cn from the clk_notifier_list
===> the second "modprobe clk_notif_dbg" will find the same @clk and reuse the
same corrupted @cn

kfree(cn);
}

} else {
ret = -ENOENT;
}

clk_prepare_unlock();

return ret;
}

===
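
(In other words, the unregister path needs to drop @cn from the list before
freeing it. A minimal sketch of the missing step, for illustration only; the
actual patch referenced below may differ in detail:)

        if (!cn->notifier_head.head) {
                srcu_cleanup_notifier_head(&cn->notifier_head);
                /* remove @cn from clk_notifier_list before freeing it, so a
                 * later clk_notifier_register() cannot find the stale entry
                 * and reuse freed memory */
                list_del(&cn->node);
                kfree(cn);
        }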

Could you retry with the following patch?

Thanks,
Lai

From 5e26b626724139070148df9f6bd0607bc7bc3812 Mon Sep 17 00:00:00 2001
From: Lai Jiangshan 
Date: Mon, 3 Jun 2013 16:59:50 +0800
Subject: [PATCH] clk: remove the clk_notifier from clk_notifier_list before
 free it
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The @cn stays in @clk_notifier_list after it is freed, which causes
memory corruption.

For example: @clk is registered (first), unregistered (first),
registered (second), unregistered (second).

The freed @cn will be reused when @clk is registered (second),
and the bug happens when @clk is unregistered (second):

[  517.04] clk_notif_dbg clk_notif_dbg.1: clk_notifier_unregister()
[  517.04] Unable to handle kernel paging request at virtual address 
00df3008
[  517.05] pgd = ed858000
[  517.05] [00df3008] *pgd=
[  517.06] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
[  517.06] Modules linked in: clk_notif_dbg(O-) [last unloaded: 
clk_notif_dbg]
[  517.06] CPU: 1 PID: 499 Comm: modprobe Tainted: G   O 
3.10.0-rc3-00119-ga93cb29-dirty #85
[  517.06] task: ee1e0180 ti: ee3e6000 task.ti: ee3e6000
[  517.06] PC is at srcu_readers_seq_idx+0x48/0x84
[  517.06] LR is at srcu_readers_seq_idx+0x60/0x84
[  517.06] pc : []lr : []psr: 80070013
[  517.06] sp : ee3e7d48  ip :   fp : ee3e7d6c
[  517.06] r10:   r9 : ee3e6000  r8 : 
[  517.06] r7 : ed84fe4c  r6 : c068ec90  r5 : c068e430  r4 : 
[  517.06] r3 : 00df3000  r2 :   r1 : 0002  r0 : 
[  517.06] Flags: Nzcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
[  517.06] Control: 18c5387d  Table: 2d85804a  DAC: 0015
[  517.06] Process modprobe (pid: 499, stack limit = 0xee3e6238)
[  517.06] Stack: (0xee3e7d48 to 0xee3e8000)

[  517.06] [] (srcu_readers_seq_idx+0x48/0x84) from [] 
(try_check_zero+0x34/0xfc)
[  517.06] [] (try_check_zero+0x34/0xfc) from [] 
(srcu_advance_batches+0x58/0x114)
[  517.06] [] (srcu_advance_batches+0x58/0x114) from [] 
(__synchronize_srcu+0x114/0x1ac)
[  517.06] [] (__synchronize_srcu+0x114/0x1ac) from [] 
(synchronize_srcu+0x2c/0x34)
[  517.06] [] (synchronize_srcu+0x2c/0x34) from [] 
(srcu_notifier_chain_unregister+0x68/0x74)
[  517.06] [] (srcu_notifier_chain_unregister+0x68/0x74) from 
[] (clk_notifier_unregister+0x7c/0xc0)
[  517.

Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Lai Jiangshan
On 06/11/2013 08:51 AM, Linus Torvalds wrote:
> On Mon, Jun 10, 2013 at 5:44 PM, Steven Rostedt  wrote:
>>
>> OK, I haven't found a issue here yet, but youss are beiing trickssy! We
>> don't like trickssy, and we must find preiouss!!!
> 
> .. and I personally have my usual reservations. I absolutely hate
> papering over scalability issues, and historically whenever people
> have ever thought that we want complex spinlocks, the problem has
> always been that the locking sucks.
> 
> So reinforced by previous events, I really feel that code that needs
> this kind of spinlock is broken and needs to be fixed, rather than
> actually introduce tricky spinlocks.
> 
> So in order to merge something like this, I want (a) numbers for real
> loads and (b) explanations for why the spinlock users cannot be fixed.
> 
> Because "we might hit loads" is just not good enough. I would counter
> with "hiding problems causes more of them".
> 

Hi, all

Off-topic: although I have been in this community for several years,
I am not exactly clear on this problem.

1) In the general case, which lock is the most contended in the kernel? What
does it protect?
2) In which special cases, which lock is the most contended in the kernel?
What does it protect?
3) In the general case, which list is the hottest list?
4) In which special cases, which list is the hottest list?

thanks,
Lai
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Lai Jiangshan
On Mon, Jun 10, 2013 at 3:36 AM, Paul E. McKenney
 wrote:
> Breaking up locks is better than implementing high-contention locks, but
> if we must have high-contention locks, why not make them automatically
> switch between light-weight ticket locks at low contention and queued
> locks at high contention?
>
> This commit therefore allows ticket locks to automatically switch between
> pure ticketlock and queued-lock operation as needed.  If too many CPUs
> are spinning on a given ticket lock, a queue structure will be allocated
> and the lock will switch to queued-lock operation.  When the lock becomes
> free, it will switch back into ticketlock operation.  The low-order bit
> of the head counter is used to indicate that the lock is in queued mode,
> which forces an unconditional mismatch between the head and tail counters.
> This approach means that the common-case code path under conditions of
> low contention is very nearly that of a plain ticket lock.
>
> A fixed number of queueing structures is statically allocated in an
> array.  The ticket-lock address is used to hash into an initial element,
> but if that element is already in use, it moves to the next element.  If
> the entire array is already in use, continue to spin in ticket mode.
>
> This has been only lightly tested in the kernel, though a userspace
> implementation has survived substantial testing.
>
> Signed-off-by: Paul E. McKenney 
>
> diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> index 33692ea..b4a91b0 100644
> --- a/arch/x86/include/asm/spinlock.h
> +++ b/arch/x86/include/asm/spinlock.h
> @@ -34,6 +34,8 @@
>  # define UNLOCK_LOCK_PREFIX
>  #endif
>
> +#ifndef CONFIG_TICKET_LOCK_QUEUED
> +
>  /*
>   * Ticket locks are conceptually two parts, one indicating the current head 
> of
>   * the queue, and the other indicating the current tail. The lock is acquired
> @@ -62,6 +64,25 @@ static __always_inline void 
> __ticket_spin_lock(arch_spinlock_t *lock)
> barrier();  /* make sure nothing creeps before the lock 
> is taken */
>  }
>
> +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
> +
> +bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets inc);
> +
> +static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
> +{
> +   register struct __raw_tickets inc = { .tail = 2 };
> +
> +   inc = xadd(&lock->tickets, inc);
> +   for (;;) {
> +   if (inc.head == inc.tail || tkt_spin_pass(lock, inc))
> +   break;
> +   inc.head = ACCESS_ONCE(lock->tickets.head);
> +   }
> +   barrier(); /* smp_mb() on Power or ARM. */
> +}
> +
> +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
> +
>  static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
>  {
> arch_spinlock_t old, new;
> @@ -70,17 +91,37 @@ static __always_inline int 
> __ticket_spin_trylock(arch_spinlock_t *lock)
> if (old.tickets.head != old.tickets.tail)
> return 0;
>
> +#ifndef CONFIG_TICKET_LOCK_QUEUED
> new.head_tail = old.head_tail + (1 << TICKET_SHIFT);
> +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
> +   new.head_tail = old.head_tail + (2 << TICKET_SHIFT);
> +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
>
> /* cmpxchg is a full barrier, so nothing can move before it */
> return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == 
> old.head_tail;
>  }
>
> +#ifndef CONFIG_TICKET_LOCK_QUEUED
> +
>  static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
>  {
> __add(&lock->tickets.head, 1, UNLOCK_LOCK_PREFIX);
>  }
>
> +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
> +
> +extern void tkt_q_do_wake(arch_spinlock_t *asp);
> +
> +static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
> +{
> +   __ticket_t head = 2;
> +
> +   head = xadd(&lock->tickets.head, 2);
> +   if (head & 0x1)
> +   tkt_q_do_wake(lock);
> +}
> +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
> +
>  static inline int __ticket_spin_is_locked(arch_spinlock_t *lock)
>  {
> struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);
> diff --git a/arch/x86/include/asm/spinlock_types.h 
> b/arch/x86/include/asm/spinlock_types.h
> index ad0ad07..cdaefdd 100644
> --- a/arch/x86/include/asm/spinlock_types.h
> +++ b/arch/x86/include/asm/spinlock_types.h
> @@ -7,12 +7,18 @@
>
>  #include 
>
> -#if (CONFIG_NR_CPUS < 256)
> +#if (CONFIG_NR_CPUS < 128)
>  typedef u8  __ticket_t;
>  typedef u16 __ticketpair_t;
> -#else
> +#define TICKET_T_CMP_GE(a, b) (UCHAR_MAX / 2 >= (unsigned char)((a) - (b)))
> +#elif (CONFIG_NR_CPUS < 32768)
>  typedef u16 __ticket_t;
>  typedef u32 __ticketpair_t;
> +#define TICKET_T_CMP_GE(a, b) (USHRT_MAX / 2 >= (unsigned short)((a) - (b)))
> +#else
> +typedef u32 __ticket_t;
> +typedef u64 __ticketpair_t;
> +#define TICKET_T_CMP_GE(a, b) (UINT_MAX / 2 >= (unsigned int)((a) - (b)))
>  #endif
>
>  #define TICKET_SH

Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Lai Jiangshan
On Tue, Jun 11, 2013 at 10:48 PM, Lai Jiangshan  wrote:
> On Mon, Jun 10, 2013 at 3:36 AM, Paul E. McKenney
>  wrote:
>> Breaking up locks is better than implementing high-contention locks, but
>> if we must have high-contention locks, why not make them automatically
>> switch between light-weight ticket locks at low contention and queued
>> locks at high contention?
>>
>> This commit therefore allows ticket locks to automatically switch between
>> pure ticketlock and queued-lock operation as needed.  If too many CPUs
>> are spinning on a given ticket lock, a queue structure will be allocated
>> and the lock will switch to queued-lock operation.  When the lock becomes
>> free, it will switch back into ticketlock operation.  The low-order bit
>> of the head counter is used to indicate that the lock is in queued mode,
>> which forces an unconditional mismatch between the head and tail counters.
>> This approach means that the common-case code path under conditions of
>> low contention is very nearly that of a plain ticket lock.
>>
>> A fixed number of queueing structures is statically allocated in an
>> array.  The ticket-lock address is used to hash into an initial element,
>> but if that element is already in use, it moves to the next element.  If
>> the entire array is already in use, continue to spin in ticket mode.
>>
>> This has been only lightly tested in the kernel, though a userspace
>> implementation has survived substantial testing.
>>
>> Signed-off-by: Paul E. McKenney 
>>
>> diff --git a/arch/x86/include/asm/spinlock.h 
>> b/arch/x86/include/asm/spinlock.h
>> index 33692ea..b4a91b0 100644
>> --- a/arch/x86/include/asm/spinlock.h
>> +++ b/arch/x86/include/asm/spinlock.h
>> @@ -34,6 +34,8 @@
>>  # define UNLOCK_LOCK_PREFIX
>>  #endif
>>
>> +#ifndef CONFIG_TICKET_LOCK_QUEUED
>> +
>>  /*
>>   * Ticket locks are conceptually two parts, one indicating the current head 
>> of
>>   * the queue, and the other indicating the current tail. The lock is 
>> acquired
>> @@ -62,6 +64,25 @@ static __always_inline void 
>> __ticket_spin_lock(arch_spinlock_t *lock)
>> barrier();  /* make sure nothing creeps before the lock 
>> is taken */
>>  }
>>
>> +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
>> +
>> +bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets inc);
>> +
>> +static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
>> +{
>> +   register struct __raw_tickets inc = { .tail = 2 };
>> +
>> +   inc = xadd(&lock->tickets, inc);
>> +   for (;;) {
>> +   if (inc.head == inc.tail || tkt_spin_pass(lock, inc))
>> +   break;
>> +   inc.head = ACCESS_ONCE(lock->tickets.head);
>> +   }
>> +   barrier(); /* smp_mb() on Power or ARM. */
>> +}
>> +
>> +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
>> +
>>  static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
>>  {
>> arch_spinlock_t old, new;
>> @@ -70,17 +91,37 @@ static __always_inline int 
>> __ticket_spin_trylock(arch_spinlock_t *lock)
>> if (old.tickets.head != old.tickets.tail)
>> return 0;
>>
>> +#ifndef CONFIG_TICKET_LOCK_QUEUED
>> new.head_tail = old.head_tail + (1 << TICKET_SHIFT);
>> +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
>> +   new.head_tail = old.head_tail + (2 << TICKET_SHIFT);
>> +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
>>
>> /* cmpxchg is a full barrier, so nothing can move before it */
>> return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == 
>> old.head_tail;
>>  }
>>
>> +#ifndef CONFIG_TICKET_LOCK_QUEUED
>> +
>>  static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
>>  {
>> __add(&lock->tickets.head, 1, UNLOCK_LOCK_PREFIX);
>>  }
>>
>> +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
>> +
>> +extern void tkt_q_do_wake(arch_spinlock_t *asp);
>> +
>> +static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
>> +{
>> +   __ticket_t head = 2;
>> +
>> +   head = xadd(&lock->tickets.head, 2);
>> +   if (head & 0x1)
>> +   tkt_q_do_wake(lock);
>> +}
>> +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
>> +
>>  static inline int __ticket_spin_is_locked(arch_spinlock_t *lock)
>>  {

Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Lai Jiangshan
>> consecutive slots in the head table. This is both a source of
>> contention and a waste of effort. One possible solution is to add
>> one more field (set to cpuid + 1, for example) to indicate that that
>> setup is being done with asp set to the target lock address
>> immediately. We will need to use cmpxchg128() for 64-bit machine,
>> though. Another solution is to have only that thread with ticket
>> number that is a fixed distance from head (e.g. 16*2) to do the
>> queue setup while the rest wait until the setup is done before
>> spinning on the queue.
>>
>> As my colleague Davidlohr had reported there are more regressions
>> than performance improvement in the AIM7 benchmark. I believe that
>> queue setup contention is likely a source of performance regression.
>
> Please see below for a v3 patch that:
>
> 1.  Fixes cpu_relax().
>
> 2.  Tests before doing cmpxchg().
>
> 3.  Reduces the number of CPUs attempting to set up the queue,
> in the common case, to a single CPU.  (Multiple CPUs can
> still be trying to set up the queue given unfortunate
> sequences of concurrent ticket-lock handoffs.)
>
> Please let me know how it goes!
>
> Thanx, Paul
>
> 
>
> ticketlock: Add queued-ticketlock capability
>
> Breaking up locks is better than implementing high-contention locks, but
> if we must have high-contention locks, why not make them automatically
> switch between light-weight ticket locks at low contention and queued
> locks at high contention?  After all, this would remove the need for
> the developer to predict which locks will be highly contended.
>
> This commit allows ticket locks to automatically switch between pure
> ticketlock and queued-lock operation as needed.  If too many CPUs are
> spinning on a given ticket lock, a queue structure will be allocated
> and the lock will switch to queued-lock operation.  When the lock becomes
> free, it will switch back into ticketlock operation.  The low-order bit
> of the head counter is used to indicate that the lock is in queued mode,
> which forces an unconditional mismatch between the head and tail counters.
> This approach means that the common-case code path under conditions of
> low contention is very nearly that of a plain ticket lock.
>
> A fixed number of queueing structures is statically allocated in an
> array.  The ticket-lock address is used to hash into an initial element,
> but if that element is already in use, it moves to the next element.  If
> the entire array is already in use, continue to spin in ticket mode.
>
> Signed-off-by: Paul E. McKenney 
> [ paulmck: Eliminate duplicate code and update comments (Steven Rostedt). ]
> [ paulmck: Address Eric Dumazet review feedback. ]
> [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ]
> [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ]
> [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ]
> [ paulmck: Reduce queue-switch contention (Waiman Long). ]
>
> diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> index 33692ea..509c51a 100644
> --- a/arch/x86/include/asm/spinlock.h
> +++ b/arch/x86/include/asm/spinlock.h
> @@ -34,6 +34,21 @@
>  # define UNLOCK_LOCK_PREFIX
>  #endif
>
> +#ifdef CONFIG_TICKET_LOCK_QUEUED
> +
> +#define __TKT_SPIN_INC 2
> +bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets inc);
> +
> +#else /* #ifdef CONFIG_TICKET_LOCK_QUEUED */
> +
> +#define __TKT_SPIN_INC 1
> +static inline bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets 
> inc)
> +{
> +   return false;
> +}
> +
> +#endif /* #else #ifdef CONFIG_TICKET_LOCK_QUEUED */
> +
>  /*
>   * Ticket locks are conceptually two parts, one indicating the current head 
> of
>   * the queue, and the other indicating the current tail. The lock is acquired
> @@ -49,17 +64,16 @@
>   */
>  static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
>  {
> -   register struct __raw_tickets inc = { .tail = 1 };
> +   register struct __raw_tickets inc = { .tail = __TKT_SPIN_INC };
>
> inc = xadd(&lock->tickets, inc);
> -
> for (;;) {
> -   if (inc.head == inc.tail)
> +   if (inc.head == inc.tail || tkt_spin_pass(lock, inc))
> break;
> cpu_relax();
> inc.head = ACCESS_ONCE(lock->tickets.head);
> }
> -   barrier();  /* make sure nothing creeps before the lock 
> is taken */
> +   barrier

Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-12 Thread Lai Jiangshan
On Wed, Jun 12, 2013 at 9:58 AM, Steven Rostedt  wrote:
> On Wed, 2013-06-12 at 09:19 +0800, Lai Jiangshan wrote:
>
>> > +
>> > +/*
>> > + * Hand the lock off to the first CPU on the queue.
>> > + */
>> > +void tkt_q_do_wake(arch_spinlock_t *lock)
>> > +{
>> > +   struct tkt_q_head *tqhp;
>> > +   struct tkt_q *tqp;
>> > +
>> > +   /* If the queue is still being set up, wait for it. */
>> > +   while ((tqhp = tkt_q_find_head(lock)) == NULL)
>> > +   cpu_relax();
>> > +
>> > +   for (;;) {
>> > +
>> > +   /* Find the first queue element. */
>> > +   tqp = ACCESS_ONCE(tqhp->spin);
>> > +   if (tqp != NULL)
>> > +   break;  /* Element exists, hand off lock. */
>> > +   if (tkt_q_try_unqueue(lock, tqhp))
>> > +   return; /* No element, successfully removed queue. 
>> > */
>> > +   cpu_relax();
>> > +   }
>> > +   if (ACCESS_ONCE(tqhp->head_tkt) != -1)
>> > +   ACCESS_ONCE(tqhp->head_tkt) = -1;
>> > +   smp_mb(); /* Order pointer fetch and assignment against handoff. */
>> > +   ACCESS_ONCE(tqp->cpu) = -1;
>> > +}
>> > +EXPORT_SYMBOL(tkt_q_do_wake);
>> > +
>> > +/*
>> > + * Given a lock that already has a queue associated with it, spin on
>> > + * that queue.  Return false if there was no queue (which means we do not
>> > + * hold the lock) and true otherwise (meaning we -do- hold the lock).
>> > + */
>> > +bool tkt_q_do_spin(arch_spinlock_t *lock, struct __raw_tickets inc)
>> > +{
>> > +   struct tkt_q **oldtail;
>> > +   struct tkt_q tq;
>> > +   struct tkt_q_head *tqhp;
>> > +
>> > +   /*
>> > +* Ensure that accesses to queue header happen after sensing
>> > +* the lock's have-queue bit.
>> > +*/
>> > +   smp_mb();  /* See above block comment. */
>> > +
>> > +   /* If there no longer is a queue, leave. */
>> > +   tqhp = tkt_q_find_head(lock);
>> > +   if (tqhp == NULL)
>> > +   return false;
>> > +
>> > +   /* Initialize our queue element. */
>> > +   tq.cpu = raw_smp_processor_id();
>> > +   tq.tail = inc.tail;
>> > +   tq.next = NULL;
>>
>> I guess a mb() is needed here for between read tqhp->ref and read
>> tqhp->head_tkt.
>> you can move the above mb() to here.
>
> Do we?
>
> The only way to get into here is if you either set up the queue
> yourself, or you saw the LSB set in head.
>
> If you were the one to set it up yourself, then there's nothing to worry
> about because you are also the one that set head_tkt.
>
> If you didn't set up the queue, then someone else set the LSB in head,
> which is done with a cmpxchg() which is also a full mb. This would make
> head_tkt visible as well because it's set before cmpxchg is called.
>
> Thus, to come into this function you must have seen head & 1 set, and
> the smp_mb() above will also make head_tkt visible.
>
> The only thing I can see now is that it might not find tqhp because ref
> may not be set yet. If that's the case, then it will fall out back to
> the main loop. But if it finds ref, then I don't see how it can't see
> head_tkt up to date as well.
>
> Maybe I'm missing something.

No, you are right.

When I lay in bed last night I was thinking about V1, and this morning,
without thinking it through, I wrongly assumed V2 has the same problem.

V2 has no such problem. Sorry for the noise.

Thanks,
Lai
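
(For what it's worth, Steve's ordering argument can be modeled in userspace
with C11 atomics; this is only an illustration of the reasoning, with names of
my own choosing, not kernel code:)

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

/*
 * Toy model: the queue creator publishes head_tkt before setting the low
 * "queued" bit with cmpxchg (a full barrier), and the spinner's smp_mb()
 * after observing that bit guarantees it also observes head_tkt.
 */
static atomic_uint head;                /* starts at 0: not queued   */
static atomic_int head_tkt = -1;        /* published by the creator  */

static void *creator(void *unused)
{
        atomic_store_explicit(&head_tkt, 42, memory_order_relaxed);
        unsigned int expected = 0;
        atomic_compare_exchange_strong(&head, &expected, 1);    /* the cmpxchg() */
        return NULL;
}

static void *spinner(void *unused)
{
        while (!(atomic_load_explicit(&head, memory_order_relaxed) & 1))
                ;                                       /* spin until the queued bit shows up */
        atomic_thread_fence(memory_order_seq_cst);      /* the smp_mb() in tkt_q_do_spin() */
        printf("head_tkt = %d\n",                       /* always prints 42 */
               atomic_load_explicit(&head_tkt, memory_order_relaxed));
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, spinner, NULL);
        pthread_create(&b, NULL, creator, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
}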

>
> -- Steve
>
>
>>
>> > +
>> > +   /* Check to see if we already hold the lock. */
>> > +   if (ACCESS_ONCE(tqhp->head_tkt) == inc.tail) {
>> > +   /* The last holder left before queue formed, we hold lock. 
>> > */
>> > +   tqhp->head_tkt = -1;
>> > +   return true;
>> > +   }
>> > +
>> > +   /*
>> > +* Add our element to the tail of the queue.  Note that if the
>> > +* queue is empty, the ->spin_tail pointer will reference
>> > +* the queue's head pointer, namely ->spin.
>> > +*/
>> > +   oldtail = xchg(&tqhp->spin_tail, &tq.next);
>> > +   ACCESS_ONCE(*oldtail) =

Re: [PATCHSET wq/for-3.10] workqueue: NUMA affinity for unbound workqueues

2013-03-24 Thread Lai Jiangshan
Hi, TJ

After thinking about this for a long time (again):

I think this patchset has a problem:
a work item may run on the wrong CPU even when
there is an online CPU in its wq's cpumask.

Example:
node0 (cpu0, cpu1), node1 (cpu2, cpu3),
wq's cpumask: 1,3
the cpumask of this wq's pwq on node1: 3
currently online CPUs: 0-2.
So the cpumask of the worker tasks of the pwq on node1 is actually cpu_all_mask,
and work scheduled from cpu2 can be executed on cpu0 or cpu2.
We expect it to be executed on cpu1 only.
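
(A toy bitmask illustration of the example above; the cpu numbers map to bits
and the variable names are mine, not workqueue code:)

#include <stdio.h>

int main(void)
{
        unsigned int node1_cpus = 0xC;  /* cpu2, cpu3        */
        unsigned int wq_cpumask = 0xA;  /* cpu1, cpu3        */
        unsigned int online     = 0x7;  /* cpu0, cpu1, cpu2  */

        unsigned int node1_pwq  = wq_cpumask & node1_cpus;      /* cpu3 only */
        unsigned int effective  = node1_pwq & online;           /* empty!    */

        printf("node1 pwq mask = 0x%x, online part = 0x%x\n", node1_pwq, effective);
        /*
         * With no online cpu left in the node1 pwq's mask, its workers fall
         * back to cpu_all_mask, so work queued from cpu2 may run on cpu0 or
         * cpu2, even though cpu1 is the only online cpu in the wq's cpumask.
         */
        return 0;
}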

It can be fixed by swapping pwqs (node's pwq <-> default pwq)
at CPU hotplug. But could you reschedule this patchset to wq/for-3.11?
The whole patchset is more complicated than my brain can handle at once.

If you agree, I will rebase my patches again.

Thanks,
Lai

On Thu, Mar 21, 2013 at 2:57 AM, Tejun Heo  wrote:
> On Tue, Mar 19, 2013 at 05:00:19PM -0700, Tejun Heo wrote:
>> and also available in the following git branch.
>>
>>  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git review-numa
>
> Branch rebased on top of the current wq/for-3.10 with updated patches.
> The new comit ID is 9555fbc12d786a9eae7cf7701a6718a0c029173e.
>
> Thanks.
>
> --
> tejun
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: workqueue, pci: INFO: possible recursive locking detected

2013-07-17 Thread Lai Jiangshan
On 07/16/2013 10:41 PM, Srivatsa S. Bhat wrote:
> Hi,
> 
> I have been seeing this warning every time during boot. I haven't
> spent time digging through it though... Please let me know if
> any machine-specific info is needed.
> 
> Regards,
> Srivatsa S. Bhat
> 
> 
> 
> 
> =
> [ INFO: possible recursive locking detected ]
> 3.11.0-rc1-lockdep-fix-a #6 Not tainted
> -
> kworker/0:1/142 is trying to acquire lock:
>  ((&wfc.work)){+.+.+.}, at: [] flush_work+0x0/0xb0
> 
> but task is already holding lock:
>  ((&wfc.work)){+.+.+.}, at: [] process_one_work+0x169/0x610
> 
> other info that might help us debug this:
>  Possible unsafe locking scenario:
> 
>CPU0
>
>   lock((&wfc.work));
>   lock((&wfc.work));


Hi, Srivatsa

This is a false positive: the two "wfc"s are different, they are
both on-stack. flush_work() cannot deadlock in such a case:

long foo(void *arg)
{
        ...
        if (xxx)
                work_on_cpu(..., foo, ...);     /* nested: a second on-stack wfc.work */
        ...
}

bar()
{
        work_on_cpu(..., foo, ...);             /* foo() runs as the work function */
}

The complaint is caused by work_on_cpu() using a static lock_class_key;
we should fix work_on_cpu().
(But the caller should also be careful: the foo()/local_pci_probe() path
re-enters work_on_cpu().)

But I can't find an elegant fix.

long work_on_cpu(int cpu, long (*fn)(void *), void *arg)
{
struct work_for_cpu wfc = { .fn = fn, .arg = arg };

+#ifdef CONFIG_LOCKDEP
+   static struct lock_class_key __key;
+   INIT_WORK_ONSTACK(&wfc.work, work_for_cpu_fn);
+   lockdep_init_map(&wfc.work.lockdep_map, &wfc.work, &__key, 0);
+#else
INIT_WORK_ONSTACK(&wfc.work, work_for_cpu_fn);
+#endif
schedule_work_on(cpu, &wfc.work);
flush_work(&wfc.work);
return wfc.ret;
}


Any thoughts, Tejun?

thanks,
Lai

> 
>  *** DEADLOCK ***
> 
>  May be due to missing lock nesting notation
> 
> 3 locks held by kworker/0:1/142:
>  #0:  (events){.+.+.+}, at: [] process_one_work+0x169/0x610
>  #1:  ((&wfc.work)){+.+.+.}, at: [] 
> process_one_work+0x169/0x610
>  #2:  (&__lockdep_no_validate__){..}, at: [] 
> device_attach+0x2a/0xc0
> 
> stack backtrace:
> CPU: 0 PID: 142 Comm: kworker/0:1 Not tainted 3.11.0-rc1-lockdep-fix-a #6
> Hardware name: IBM  -[8737R2A]-/00Y2738, BIOS -[B2E120RUS-1.20]- 11/30/2012
> Workqueue: events work_for_cpu_fn
>  881036fecd88 881036fef678 8161a919 0003
>  881036fec400 881036fef6a8 810c2234 881036fec400
>  881036fecd88 881036fec400  881036fef708
> Call Trace:
>  [] dump_stack+0x59/0x80
>  [] print_deadlock_bug+0xf4/0x100
>  [] validate_chain+0x504/0x750
>  [] __lock_acquire+0x30d/0x580
>  [] lock_acquire+0x97/0x170
>  [] ? start_flush_work+0x220/0x220
>  [] flush_work+0x48/0xb0
>  [] ? start_flush_work+0x220/0x220
>  [] ? mark_held_locks+0x80/0x130
>  [] ? queue_work_on+0x4b/0xa0
>  [] ? trace_hardirqs_on_caller+0x105/0x1d0
>  [] ? trace_hardirqs_on+0xd/0x10
>  [] work_on_cpu+0x80/0x90
>  [] ? wqattrs_hash+0x190/0x190
>  [] ? pci_pm_prepare+0x60/0x60
>  [] ? cpumask_next_and+0x29/0x50
>  [] __pci_device_probe+0x9a/0xe0
>  [] ? _raw_spin_unlock_irq+0x30/0x50
>  [] ? pci_dev_get+0x22/0x30
>  [] pci_device_probe+0x3a/0x60
>  [] ? _raw_spin_unlock_irq+0x30/0x50
>  [] really_probe+0x6c/0x320
>  [] driver_probe_device+0x47/0xa0
>  [] ? __driver_attach+0xb0/0xb0
>  [] __device_attach+0x53/0x60
>  [] bus_for_each_drv+0x74/0xa0
>  [] device_attach+0xa0/0xc0
>  [] pci_bus_add_device+0x39/0x60
>  [] virtfn_add+0x251/0x3e0
>  [] ? trace_hardirqs_on+0xd/0x10
>  [] sriov_enable+0x22f/0x3d0
>  [] pci_enable_sriov+0x4d/0x60
>  [] be_vf_setup+0x175/0x410 [be2net]
>  [] be_setup+0x37a/0x4b0 [be2net]
>  [] be_probe+0x5c0/0x820 [be2net]
>  [] local_pci_probe+0x4e/0x90
>  [] work_for_cpu_fn+0x18/0x30
>  [] process_one_work+0x1da/0x610
>  [] ? process_one_work+0x169/0x610
>  [] worker_thread+0x28c/0x3a0
>  [] ? process_one_work+0x610/0x610
>  [] kthread+0xee/0x100
>  [] ? __init_kthread_worker+0x70/0x70
>  [] ret_from_fork+0x7c/0xb0
>  [] ? __init_kthread_worker+0x70/0x70
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC nohz_full 1/7] nohz_full: Add Kconfig parameter for scalable detection of all-idle state

2013-07-28 Thread Lai Jiangshan
On 07/27/2013 07:19 AM, Paul E. McKenney wrote:
> From: "Paul E. McKenney" 
> 
> At least one CPU must keep the scheduling-clock tick running for
> timekeeping purposes whenever there is a non-idle CPU.  However, with
> the new nohz_full adaptive-idle machinery, it is difficult to distinguish
> between all CPUs really being idle as opposed to all non-idle CPUs being
> in adaptive-ticks mode.  This commit therefore adds a Kconfig parameter
> as a first step towards enabling a scalable detection of full-system
> idle state.
> 
> Signed-off-by: Paul E. McKenney 
> Cc: Frederic Weisbecker 
> Cc: Steven Rostedt 
> ---
>  kernel/time/Kconfig | 23 +++
>  1 file changed, 23 insertions(+)
> 
> diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
> index 70f27e8..a613c2a 100644
> --- a/kernel/time/Kconfig
> +++ b/kernel/time/Kconfig
> @@ -134,6 +134,29 @@ config NO_HZ_FULL_ALL
>Note the boot CPU will still be kept outside the range to
>handle the timekeeping duty.
>  
> +config NO_HZ_FULL_SYSIDLE
> + bool "Detect full-system idle state for full dynticks system"
> + depends on NO_HZ_FULL
> + default n
> + help
> +  At least one CPU must keep the scheduling-clock tick running
> +  for timekeeping purposes whenever there is a non-idle CPU,
> +  where "non-idle" includes CPUs with a single runnable task
> +  in adaptive-idle mode.  Because the underlying adaptive-tick
> +  support cannot distinguish between all CPUs being idle and
> +  all CPUs each running a single task in adaptive-idle mode,
> +  the underlying support simply ensures that there is always
> +  a CPU handling the scheduling-clock tick, whether or not all
> +  CPUs are idle.  This Kconfig option enables scalable detection
> +  of the all-CPUs-idle state, thus allowing the scheduling-clock
> +  tick to be disabled when all CPUs are idle.  Note that scalable
> +  detection of the all-CPUs-idle state means that larger systems
> +  will be slower to declare the all-CPUs-idle state.
> +
> +  Say Y if you would like to help debug all-CPUs-idle detection.

Is the code needed only for debugging?
I guess not.

> +
> +  Say N if you are unsure.
> +
>  config NO_HZ
>   bool "Old Idle dynticks config"
>   depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

