Re: [RFC PATCH] memcg: oom: fix totalpages calculation for swappiness==0

2012-10-15 Thread KOSAKI Motohiro
>> I think this only correct when memcg. Even if swappiness==0, global reclaim 
>> swap
>> out anon pages before oom.
>
> Right you are (we really do swap when the file pages are really
> low)! Sorry about the confusion. I kind of became if(global_reclaim)
> block blind...
>
> Then this really needs a memcg specific documentation fix. What about
> the following?
> ---
> From 59a60705abd2faf9e266a4270bbf302001845588 Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Mon, 15 Oct 2012 11:43:56 +0200
> Subject: [PATCH] doc: describe memcg swappiness more precisely
>
> since fe35004f (mm: avoid swapping out with swappiness==0) memcg reclaim
> stopped swapping out anon pages completely when 0 value is used.
> Although this is somehow expected it hasn't been done for a really long
> time this way and so it is probably better to be explicit about the
> effect. Moreover global reclaim swapps out even when swappiness is 0
> to prevent from OOM killer.
>
> Signed-off-by: Michal Hocko 
> ---
>  Documentation/cgroups/memory.txt |4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/Documentation/cgroups/memory.txt 
> b/Documentation/cgroups/memory.txt
> index c07f7b4..71c4da4 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -466,6 +466,10 @@ Note:
>  5.3 swappiness
>
>  Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.
> +Please note that unlike the global swappiness, memcg knob set to 0
> +really prevents from any swapping even if there is a swap storage
> +available. This might lead to memcg OOM killer if there are no file
> +pages to reclaim.

Pretty good to me. Thank you!

Acked-by: KOSAKI Motohiro 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] memcg: oom: fix totalpages calculation for swappiness==0

2012-10-15 Thread KOSAKI Motohiro
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 078701f..308fd77 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -640,6 +640,9 @@ swappiness
>  This control is used to define how aggressive the kernel will swap
>  memory pages.  Higher values will increase agressiveness, lower values
>  decrease the amount of swap.
> +The value can be used from the [0, 100] range, where 0 means no swapping
> +at all (even if there is a swap storage enabled) while 100 means that
> +anonymous pages are reclaimed in the same rate as file pages.

I think this only correct when memcg. Even if swappiness==0, global reclaim swap
out anon pages before oom.

see below.

get_scan_count()
(snip)
if (global_reclaim(sc)) {
free  = zone_page_state(zone, NR_FREE_PAGES);
/* If we have very few page cache pages,
   force-scan anon pages. */
if (unlikely(file + free <= high_wmark_pages(zone))) {
fraction[0] = 1;
fraction[1] = 0;
denominator = 1;
goto out;
}
}
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: cleanup register_node()

2012-10-13 Thread KOSAKI Motohiro
On Fri, Oct 12, 2012 at 4:22 AM, Yasuaki Ishimatsu
 wrote:
> register_node() is defined as extern in include/linux/node.h. But the function
> is only called from register_one_node() in driver/base/node.c.
>
> So the patch defines register_node() as static.
>
> CC: David Rientjes 
> CC: Andrew Morton 
> Signed-off-by: Yasuaki Ishimatsu 

Acked-by: KOSAKI Motohiro 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 5/10] memory-hotplug : memory-hotplug: check page type in get_page_bootmem

2012-10-12 Thread KOSAKI Motohiro
On Thu, Oct 4, 2012 at 10:32 PM, Yasuaki Ishimatsu
 wrote:
> The function get_page_bootmem() may be called more than one time to the same
> page. There is no need to set page's type, private if the function is not
> the first time called to the page.
>
> Note: the patch is just optimization and does not fix any problem.
>
> CC: David Rientjes 
> CC: Jiang Liu 
> CC: Len Brown 
> CC: Christoph Lameter 
> Cc: Minchan Kim 
> CC: Andrew Morton 
> CC: KOSAKI Motohiro 
> CC: Wen Congyang 
> Signed-off-by: Yasuaki Ishimatsu 
> ---
>  mm/memory_hotplug.c |   15 +++
>  1 file changed, 11 insertions(+), 4 deletions(-)
>
> Index: linux-3.6/mm/memory_hotplug.c
> ===
> --- linux-3.6.orig/mm/memory_hotplug.c  2012-10-04 18:29:58.284676075 +0900
> +++ linux-3.6/mm/memory_hotplug.c   2012-10-04 18:30:03.454680542 +0900
> @@ -95,10 +95,17 @@ static void release_memory_resource(stru
>  static void get_page_bootmem(unsigned long info,  struct page *page,
>  unsigned long type)
>  {
> -   page->lru.next = (struct list_head *) type;
> -   SetPagePrivate(page);
> -   set_page_private(page, info);
> -   atomic_inc(&page->_count);
> +   unsigned long page_type;
> +
> +   page_type = (unsigned long)page->lru.next;

If I understand correctly, page->lru.next might be uninitialized yet.

Moreover, I have no seen any good effect in this patch. I don't understand
why we need to increase code complexity.



> +   if (page_type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
> +   page_type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE){
> +   page->lru.next = (struct list_head *)type;
> +   SetPagePrivate(page);
> +   set_page_private(page, info);
> +   atomic_inc(&page->_count);
> +   } else
> +   atomic_inc(&page->_count);
>  }
>
>  /* reference to __meminit __free_pages_bootmem is valid
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org";> em...@kvack.org 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/4] acpi,memory-hotplug : add memory offline code to acpi_memory_device_remove()

2012-10-12 Thread KOSAKI Motohiro
>>> -static int acpi_memory_disable_device(struct acpi_memory_device 
>>> *mem_device)
>>> +static int acpi_memory_remove_memory(struct acpi_memory_device *mem_device)
>>>  {
>>> int result;
>>> struct acpi_memory_info *info, *n;
>>>
>>> +   list_for_each_entry_safe(info, n, &mem_device->res_list, list) {
>>
>> Which lock protect this loop?
>
> There is no any lock to protect it now...

When iterate an item removal list, you should use lock for protecting from
memory corruption.




>>> +static int acpi_memory_disable_device(struct acpi_memory_device 
>>> *mem_device)
>>> +{
>>> +   int result;
>>>
>>> /*
>>>  * Ask the VM to offline this memory range.
>>>  * Note: Assume that this function returns zero on success
>>>  */
>>
>> Write function comment instead of this silly comment.
>>
>>> -   list_for_each_entry_safe(info, n, &mem_device->res_list, list) {
>>> -   if (info->enabled) {
>>> -   result = remove_memory(info->start_addr, 
>>> info->length);
>>> -   if (result)
>>> -   return result;
>>> -   }
>>> -   kfree(info);
>>> -   }
>>> +   result = acpi_memory_remove_memory(mem_device);
>>> +   if (result)
>>> +   return result;
>>>
>>> /* Power-off and eject the device */
>>> result = acpi_memory_powerdown_device(mem_device);
>>
>> This patch move acpi_memory_powerdown_device() from ACPI_NOTIFY_EJECT_REQUEST
>> to release callback, but don't explain why.
>
> Hmm, it doesn't move the code. It just reuse the code in 
> acpi_memory_powerdown_device().

Even if reuse or not reuse, you changed the behavior. If any changes
has no good rational, you cannot get an ack.




>>> @@ -473,12 +486,23 @@ static int acpi_memory_device_add(struct
>>>  static int acpi_memory_device_remove(struct acpi_device *device, int type)
>>>  {
>>> struct acpi_memory_device *mem_device = NULL;
>>> -
>>> +   int result;
>>>
>>> if (!device || !acpi_driver_data(device))
>>> return -EINVAL;
>>>
>>> mem_device = acpi_driver_data(device);
>>> +
>>> +   if (type == ACPI_BUS_REMOVAL_EJECT) {
>>> +   /*
>>> +* offline and remove memory only when the memory device is
>>> +* ejected.
>>> +*/
>>
>> This comment explain nothing. A comment should describe _why_ should we do.
>> e.g. Why REMOVAL_NORMAL and REMOVEL_EJECT should be ignored. Why
>> we need remove memory here instead of ACPI_NOTIFY_EJECT_REQUEST.
>
> Hmm, we have 2 ways to remove a memory:
> 1. SCI
> 2. echo 1 >/sys/bus/acpi/devices/PNP0C80:XX/eject
>
> In the 2nd case, there is no ACPI_NOTIFY_EJECT_REQUEST. We should offline
> the memory and remove it from kernel in the release callback. We will poweroff
> the memory device in acpi_bus_hot_remove_device(), so we must offline
> and remove it if the type is ACPI_BUS_REMOVAL_EJECT.
>
> I guess we should not poweroff the memory device when we fail to offline it.
> But device_release_driver() doesn't returns any error...

1) I think /sys/bus/acpi/devices/PNP0C80:XX/eject should emulate acpi
eject. Can't
you make a pseudo acpi eject event and detach device by acpi regular path?

2) Your explanation didn't explain why we should ignore REMOVAL_NORMAL
and REMOVEL_EJECT. As far as reviewers can't track your intention, we
can't maintain
the code and can't ack them.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 4/4] acpi,memory-hotplug : store the node id in acpi_memory_device

2012-10-12 Thread KOSAKI Motohiro
On Mon, Oct 8, 2012 at 2:47 AM, Wen Congyang  wrote:
> At 10/06/2012 02:56 AM, KOSAKI Motohiro Wrote:
>> On Wed, Oct 3, 2012 at 6:11 AM, Yasuaki Ishimatsu
>>  wrote:
>>> From: Wen Congyang 
>>>
>>> The memory device has only one node id. Store the node id when
>>> enable the memory device, and we can reuse it when removing the
>>> memory device.
>>
>> You don't explain why we need this. Then nobody can review nor ack.
>>
>
> This patch doesn't fix any problem. Its purpose is: avoid to calculate
> the node id twice.

ditto.

Please separate patches as logical change. You should make problem fix
patch set.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/4] acpi,memory-hotplug : rename remove_memory() to offline_memory()

2012-10-12 Thread KOSAKI Motohiro
On Mon, Oct 8, 2012 at 2:45 AM, Wen Congyang  wrote:
> At 10/05/2012 05:31 AM, KOSAKI Motohiro Wrote:
>> On Wed, Oct 3, 2012 at 6:02 AM, Yasuaki Ishimatsu
>>  wrote:
>>> From: Yasuaki Ishimatsu 
>>>
>>> add_memory() hot adds a physical memory. But remove_memory does not
>>> hot remove a phsical memory. It only offlines memory. The name
>>> confuse us.
>>>
>>> So the patch renames remove_memory() to offline_memory(). We will
>>> use rename_memory() for hot removing memory.
>>>
>>> CC: David Rientjes 
>>> CC: Jiang Liu 
>>> CC: Len Brown 
>>> CC: Christoph Lameter 
>>> Cc: Minchan Kim 
>>> CC: Andrew Morton 
>>> CC: KOSAKI Motohiro 
>>> Signed-off-by: Yasuaki Ishimatsu 
>>> Signed-off-by: Wen Congyang 
>>> ---
>>>  drivers/acpi/acpi_memhotplug.c |2 +-
>>>  include/linux/memory_hotplug.h |2 +-
>>>  mm/memory_hotplug.c|6 +++---
>>>  3 files changed, 5 insertions(+), 5 deletions(-)
>>
>> Probably, the better way is to just remove remove_memory() and use
>> offline_pages().
>
> we don't notify the userspace that the memory is offlined in offline_pages().
> We reimplement offline_memory(), but ishimatsu doesn't include that patch to
> this series.

Quote from Documentation/SubmittingPatch

> 3) Separate your changes.
> Separate _logical changes_ into a single patch file.

Please send us _logical_ meaningful patch set.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] add some drop_caches documentation and info messsge

2012-10-12 Thread KOSAKI Motohiro
On Fri, Oct 12, 2012 at 8:57 AM, Michal Hocko  wrote:
> Hi,
> I would like to resurrect the following Dave's patch. The last time it
> has been posted was here https://lkml.org/lkml/2010/9/16/250 and there
> didn't seem to be any strong opposition.
> Kosaki was worried about possible excessive logging when somebody drops
> caches too often (but then he claimed he didn't have a strong opinion
> on that) but I would say opposite. If somebody does that then I would
> really like to know that from the log when supporting a system because
> it almost for sure means that there is something fishy going on. It is
> also worth mentioning that only root can write drop caches so this is
> not an flooding attack vector.
> I am bringing that up again because this can be really helpful when
> chasing strange performance issues which (surprise surprise) turn out to
> be related to artificially dropped caches done because the admin thinks
> this would help...
>
> I have just refreshed the original patch on top of the current mm tree
> but I could live with KERN_INFO as well if people think that KERN_NOTICE
> is too hysterical.
> ---
> From 1f4058be9b089bc9d43d71bc63989335d7637d8d Mon Sep 17 00:00:00 2001
> From: Dave Hansen 
> Date: Fri, 12 Oct 2012 14:30:54 +0200
> Subject: [PATCH] add some drop_caches documentation and info messsge
>
> There is plenty of anecdotal evidence and a load of blog posts
> suggesting that using "drop_caches" periodically keeps your system
> running in "tip top shape".  Perhaps adding some kernel
> documentation will increase the amount of accurate data on its use.
>
> If we are not shrinking caches effectively, then we have real bugs.
> Using drop_caches will simply mask the bugs and make them harder
> to find, but certainly does not fix them, nor is it an appropriate
> "workaround" to limit the size of the caches.
>
> It's a great debugging tool, and is really handy for doing things
> like repeatable benchmark runs.  So, add a bit more documentation
> about it, and add a little KERN_NOTICE.  It should help developers
> who are chasing down reclaim-related bugs.
>
> [mho...@suse.cz: refreshed to current -mm tree]
> Signed-off-by: Dave Hansen 
> Reviewed-by: KAMEZAWA Hiroyuki 
> Acked-by: Michal Hocko 

Looks fine.

Acked-by: KOSAKI Motohiro 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] memcg: oom: fix totalpages calculation for memory.swappiness==0

2012-10-11 Thread KOSAKI Motohiro
On Thu, Oct 11, 2012 at 4:57 AM, Michal Hocko  wrote:
> oom_badness takes totalpages argument which says how many pages are
> available and it uses it as a base for the score calculation. The value
> is calculated by mem_cgroup_get_limit which considers both limit and
> total_swap_pages (resp. memsw portion of it).
>
> This is usually correct but since fe35004f (mm: avoid swapping out
> with swappiness==0) we do not swap when swappiness is 0 which means
> that we cannot really use up all the totalpages pages. This in turn
> confuses oom score calculation if the memcg limit is much smaller than
> the available swap because the used memory (capped by the limit) is
> negligible comparing to totalpages so the resulting score is too small
> if adj!=0 (typically task with CAP_SYS_ADMIN or non zero oom_score_adj).
> A wrong process might be selected as result.
>
> The same issue exists for the global oom killer as well but it is not
> that problematic as the amount of the RAM is usually much bigger than
> the swap space.
>
> The problem can be worked around by checking mem_cgroup_swappiness==0
> and not considering swap at all in such a case.
>
> Signed-off-by: Michal Hocko 
> Acked-by: David Rientjes 
> Cc: stable [3.5+]

Acked-by: KOSAKI Motohiro 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2]suppress "Device nodeX does not have a release() function" warning

2012-10-11 Thread KOSAKI Motohiro
On Thu, Oct 11, 2012 at 1:26 AM, Yasuaki Ishimatsu
 wrote:
> When calling unregister_node(), the function shows following message at
> device_release().
>
> "Device 'node2' does not have a release() function, it is broken and must
> be fixed."
>
> The reason is node's device struct does not have a release() function.
>
> So the patch registers node_device_release() to the device's release()
> function for suppressing the warning message. Additionally, the patch adds
> memset() to initialize a node struct into register_node(). Because the node
> struct is part of node_devices[] array and it cannot be freed by
> node_device_release(). So if system reuses the node struct, it has a garbage.
>
> CC: David Rientjes 
> CC: Jiang Liu 
> Cc: Minchan Kim 
> CC: Andrew Morton 
> CC: KOSAKI Motohiro 
> Signed-off-by: Yasuaki Ishimatsu 
> Signed-off-by: Wen Congyang 
> ---
>  drivers/base/node.c |   11 +++
>  1 file changed, 11 insertions(+)
>
> Index: linux-3.6/drivers/base/node.c
> ===
> --- linux-3.6.orig/drivers/base/node.c  2012-10-11 10:04:02.149758748 +0900
> +++ linux-3.6/drivers/base/node.c   2012-10-11 10:20:34.111806931 +0900
> @@ -252,6 +252,14 @@ static inline void hugetlb_register_node
>  static inline void hugetlb_unregister_node(struct node *node) {}
>  #endif
>
> +static void node_device_release(struct device *dev)
> +{
> +#if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HUGETLBFS)
> +   struct node *node_dev = to_node(dev);
> +
> +   flush_work(&node_dev->node_work);
> +#endif
> +}

The patch description don't explain why this flush_work() is needed.


>  /*
>   * register_node - Setup a sysfs device for a node.
> @@ -263,8 +271,11 @@ int register_node(struct node *node, int
>  {
> int error;
>
> +   memset(node, 0, sizeof(*node));
> +

You should add a comment why we need initialize a node here. A lot
of developers don't have hotplug knowledge.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2]suppress "Device memoryX does not have a release() function" warning

2012-10-11 Thread KOSAKI Motohiro
On Thu, Oct 11, 2012 at 1:22 AM, Yasuaki Ishimatsu
 wrote:
> When calling remove_memory_block(), the function shows following message at
> device_release().
>
> "Device 'memory528' does not have a release() function, it is broken and must
> be fixed."
>
> The reason is memory_block's device struct does not have a release() function.
>
> So the patch registers memory_block_release() to the device's release() 
> function
> for suppressing the warning message. Additionally, the patch moves kfree(mem)
> into the release function since the release function is prepared as a means
> to free a memory_block struct.
>
> CC: David Rientjes 
> CC: Jiang Liu 
> Cc: Minchan Kim 
> CC: Andrew Morton 
> CC: KOSAKI Motohiro 
> CC: Wen Congyang 
> Signed-off-by: Yasuaki Ishimatsu 

Acked-by: KOSAKI Motohiro 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/10] memory-hotplug : remove /sys/firmware/memmap/X sysfs

2012-10-05 Thread KOSAKI Motohiro
On Thu, Oct 4, 2012 at 10:26 PM, Yasuaki Ishimatsu
 wrote:
> When (hot)adding memory into system, /sys/firmware/memmap/X/{end, start, type}
> sysfs files are created. But there is no code to remove these files. The patch
> implements the function to remove them.
>
> Note : The code does not free firmware_map_entry since there is no way to free
>memory which is allocated by bootmem.

You have to explain why this is ok. I guess the unfreed
firmware_map_entry is reused
at next online memory and don't make memory leak, right?



>
> CC: David Rientjes 
> CC: Jiang Liu 
> CC: Len Brown 
> CC: Christoph Lameter 
> Cc: Minchan Kim 
> CC: Andrew Morton 
> CC: KOSAKI Motohiro 
> Signed-off-by: Wen Congyang 
> Signed-off-by: Yasuaki Ishimatsu 
>
> ---
>  drivers/firmware/memmap.c|   98 
> ++-
>  include/linux/firmware-map.h |6 ++
>  mm/memory_hotplug.c  |7 ++-
>  3 files changed, 108 insertions(+), 3 deletions(-)
>
> Index: linux-3.6/drivers/firmware/memmap.c
> ===
> --- linux-3.6.orig/drivers/firmware/memmap.c2012-10-04 18:27:05.195500420 
> +0900
> +++ linux-3.6/drivers/firmware/memmap.c 2012-10-04 18:27:18.901514330 +0900
> @@ -21,6 +21,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  /*
>   * Data types 
> --
> @@ -41,6 +42,7 @@ struct firmware_map_entry {
> const char  *type;  /* type of the memory range */
> struct list_headlist;   /* entry for the linked list */
> struct kobject  kobj;   /* kobject for each entry */
> +   unsigned intbootmem:1; /* allocated from bootmem */

Use bool.


>  };
>
>  /*
> @@ -79,7 +81,26 @@ static const struct sysfs_ops memmap_att
> .show = memmap_attr_show,
>  };
>
> +
> +static inline struct firmware_map_entry *
> +to_memmap_entry(struct kobject *kobj)
> +{
> +   return container_of(kobj, struct firmware_map_entry, kobj);
> +}
> +
> +static void release_firmware_map_entry(struct kobject *kobj)
> +{
> +   struct firmware_map_entry *entry = to_memmap_entry(kobj);
> +
> +   if (entry->bootmem)
> +   /* There is no way to free memory allocated from bootmem */
> +   return;
> +
> +   kfree(entry);
> +}
> +
>  static struct kobj_type memmap_ktype = {
> +   .release= release_firmware_map_entry,
> .sysfs_ops  = &memmap_attr_ops,
> .default_attrs  = def_attrs,
>  };
> @@ -94,6 +115,7 @@ static struct kobj_type memmap_ktype = {
>   * in firmware initialisation code in one single thread of execution.
>   */
>  static LIST_HEAD(map_entries);
> +static DEFINE_SPINLOCK(map_entries_lock);
>
>  /**
>   * firmware_map_add_entry() - Does the real work to add a firmware memmap 
> entry.
> @@ -118,11 +140,25 @@ static int firmware_map_add_entry(u64 st
> INIT_LIST_HEAD(&entry->list);
> kobject_init(&entry->kobj, &memmap_ktype);
>
> +   spin_lock(&map_entries_lock);
> list_add_tail(&entry->list, &map_entries);
> +   spin_unlock(&map_entries_lock);
>
> return 0;
>  }
>
> +/**
> + * firmware_map_remove_entry() - Does the real work to remove a firmware
> + * memmap entry.
> + * @entry: removed entry.
> + **/
> +static inline void firmware_map_remove_entry(struct firmware_map_entry 
> *entry)

Don't use inline in *.c file. gcc is wise than you.

> +{
> +   spin_lock(&map_entries_lock);
> +   list_del(&entry->list);
> +   spin_unlock(&map_entries_lock);
> +}
> +
>  /*
>   * Add memmap entry on sysfs
>   */
> @@ -144,6 +180,35 @@ static int add_sysfs_fw_map_entry(struct
> return 0;
>  }
>
> +/*
> + * Remove memmap entry on sysfs
> + */
> +static inline void remove_sysfs_fw_map_entry(struct firmware_map_entry 
> *entry)
> +{
> +   kobject_put(&entry->kobj);
> +}
> +
> +/*
> + * Search memmap entry
> + */
> +
> +static struct firmware_map_entry * __meminit
> +firmware_map_find_entry(u64 start, u64 end, const char *type)
> +{
> +   struct firmware_map_entry *entry;
> +
> +   spin_lock(&map_entries_lock);
> +   list_for_each_entry(entry, &map_entries, list)
> +   if ((entry->start == start) && (entry->end == end) &&
> +   (!strcmp(entry->type, type))) {
> +   spin_unlock(&map_entries_lock);
> +  

Re: [PATCH 1/10] memory-hotplug : check whether memory is offline or not when removing memory

2012-10-05 Thread KOSAKI Motohiro
On Thu, Oct 4, 2012 at 10:25 PM, Yasuaki Ishimatsu
 wrote:
> When calling remove_memory(), the memory should be offline. If the function
> is used to online memory, kernel panic may occur.
>
> So the patch checks whether memory is offline or not.

You don't explain WHY we need the check.


> CC: David Rientjes 
> CC: Jiang Liu 
> CC: Len Brown 
> CC: Christoph Lameter 
> Cc: Minchan Kim 
> CC: Andrew Morton 
> CC: KOSAKI Motohiro 
> Signed-off-by: Wen Congyang 
> Signed-off-by: Yasuaki Ishimatsu 
>
> ---
>  drivers/base/memory.c  |   39 +++
>  include/linux/memory.h |5 +
>  mm/memory_hotplug.c|   17 +++--
>  3 files changed, 59 insertions(+), 2 deletions(-)
>
> Index: linux-3.6/drivers/base/memory.c
> ===
> --- linux-3.6.orig/drivers/base/memory.c2012-10-04 14:22:57.0 
> +0900
> +++ linux-3.6/drivers/base/memory.c 2012-10-04 14:45:46.653585860 +0900
> @@ -70,6 +70,45 @@ void unregister_memory_isolate_notifier(
>  }
>  EXPORT_SYMBOL(unregister_memory_isolate_notifier);
>
> +bool is_memblk_offline(unsigned long start, unsigned long size)

Don't use memblk. Usually memblk mean struct numa_meminfo for x86/numa.
Maybe memory_range_offlined() is better.

And, this function don't take struct memory_block, then this file may be no good
place.

And you need to write down function comment.


> +{
> +   struct memory_block *mem = NULL;
> +   struct mem_section *section;
> +   unsigned long start_pfn, end_pfn;
> +   unsigned long pfn, section_nr;
> +
> +   start_pfn = PFN_DOWN(start);
> +   end_pfn = PFN_UP(start + size);
> +
> +   for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
> +   section_nr = pfn_to_section_nr(pfn);
> +   if (!present_section_nr(section_nr))
> +   continue;
> +
> +   section = __nr_to_section(section_nr);
> +   /* same memblock? */
> +   if (mem)
> +   if ((section_nr >= mem->start_section_nr) &&
> +   (section_nr <= mem->end_section_nr))
> +   continue;
> +
> +   mem = find_memory_block_hinted(section, mem);
> +   if (!mem)
> +   continue;
> +   if (mem->state == MEM_OFFLINE)
> +   continue;
> +
> +   kobject_put(&mem->dev.kobj);
> +   return false;
> +   }
> +
> +   if (mem)
> +   kobject_put(&mem->dev.kobj);
> +
> +   return true;
> +}
> +EXPORT_SYMBOL(is_memblk_offline);
> +
>  /*
>   * register_memory - Setup a sysfs device for a memory block
>   */
> Index: linux-3.6/include/linux/memory.h
> ===
> --- linux-3.6.orig/include/linux/memory.h   2012-10-02 18:00:22.0 
> +0900
> +++ linux-3.6/include/linux/memory.h2012-10-04 14:44:40.902581028 +0900
> @@ -106,6 +106,10 @@ static inline int memory_isolate_notify(
>  {
> return 0;
>  }
> +static inline bool is_memblk_offline(unsigned long start, unsigned long size)
> +{
> +   return false;
> +}
>  #else
>  extern int register_memory_notifier(struct notifier_block *nb);
>  extern void unregister_memory_notifier(struct notifier_block *nb);
> @@ -120,6 +124,7 @@ extern int memory_isolate_notify(unsigne
>  extern struct memory_block *find_memory_block_hinted(struct mem_section *,
> struct memory_block 
> *);
>  extern struct memory_block *find_memory_block(struct mem_section *);
> +extern bool is_memblk_offline(unsigned long start, unsigned long size);
>  #define CONFIG_MEM_BLOCK_SIZE  (PAGES_PER_SECTION<  enum mem_add_context { BOOT, HOTPLUG };
>  #endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
> Index: linux-3.6/mm/memory_hotplug.c
> ===
> --- linux-3.6.orig/mm/memory_hotplug.c  2012-10-04 14:31:08.0 +0900
> +++ linux-3.6/mm/memory_hotplug.c   2012-10-04 14:58:22.449687986 +0900
> @@ -1045,8 +1045,21 @@ int offline_memory(u64 start, u64 size)
>
>  int remove_memory(int nid, u64 start, u64 size)
>  {

Your remove_memory() don't remove anything. that's strange.


> -   /* It is not implemented yet*/
> -   return 0;
> +   int ret = 0;
> +   lock_memory_hotplug();
> +   /*
> +* The memory might become online by other task, even if you offine 
> it.
> + 

Re: [PATCH 0/10] memory-hotplug: hot-remove physical memory

2012-10-05 Thread KOSAKI Motohiro
> Known problems:
> 1. memory can't be offlined when CONFIG_MEMCG is selected.
>For example: there is a memory device on node 1. The address range
>is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
>and memory11 under the directory /sys/devices/system/memory/.
>If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
>when we online pages. When we online memory8, the memory stored page cgroup
>is not provided by this memory device. But when we online memory9, the 
> memory
>stored page cgroup may be provided by memory8. So we can't offline memory8
>now. We should offline the memory in the reversed order.
>When the memory device is hotremoved, we will auto offline memory provided
>by this memory device. But we don't know which memory is onlined first, so
>offlining memory may fail. In such case, you should offline the memory by
>hand before hotremoving the memory device.

Just iterate twice. 1st iterate: offline every non primary memory
block. 2nd iterate:
offline primary (i.e. first added) memory block. It may work.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 4/4] acpi,memory-hotplug : store the node id in acpi_memory_device

2012-10-05 Thread KOSAKI Motohiro
On Wed, Oct 3, 2012 at 6:11 AM, Yasuaki Ishimatsu
 wrote:
> From: Wen Congyang 
>
> The memory device has only one node id. Store the node id when
> enable the memory device, and we can reuse it when removing the
> memory device.

You don't explain why we need this. Then nobody can review nor ack.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/6] acpi,memory-hotplug : add physical memory hotplug code to acpi_memhotplug.c

2012-10-05 Thread KOSAKI Motohiro
On Wed, Oct 3, 2012 at 6:09 AM, Yasuaki Ishimatsu
 wrote:
> From: Yasuaki Ishimatsu 
>
> For hot removing physical memory, the patch adds remove_memory() into
> acpi_memory_remove_memory(). But we cannot support physical memory
> hot remove. So remove_memory() do nothinig.

I don't understand this explanation. Why do we need do nothing change?
I guess you need to fold this patch into another meaningful fix.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/4] memory-hotplug: add node_device_release

2012-10-05 Thread KOSAKI Motohiro
> I have the reason to have to fill the node struct with 0 by memset.
> The node is a part of node struct array (node_devices[]).
> If we add empty release function for suppressing warning,
> some data remains in the node struct after hot removing memory.
> So if we re-hot adds the memory, the node struct is reused by
> register_onde_node(). But the node struct has some data, because
> it was not initialized with 0. As a result, more waning is shown
> by the remained data at hot addinig memory as follows:

Even though you call memset(0) at offline. It doesn't guarantee the memory
keep 0 until online. E.g. physical memory exchange during offline, bit
corruption
by cosmic ray, etc. So, you should fill zero at online phase explicitly if need.

The basic hotplug design is: you should forget everything at offline
and you shouldn't
assume any initialized data at online.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/4] acpi,memory-hotplug : rename remove_memory() to offline_memory()

2012-10-04 Thread KOSAKI Motohiro
On Wed, Oct 3, 2012 at 6:02 AM, Yasuaki Ishimatsu
 wrote:
> From: Yasuaki Ishimatsu 
>
> add_memory() hot adds a physical memory. But remove_memory does not
> hot remove a phsical memory. It only offlines memory. The name
> confuse us.
>
> So the patch renames remove_memory() to offline_memory(). We will
> use rename_memory() for hot removing memory.
>
> CC: David Rientjes 
> CC: Jiang Liu 
> CC: Len Brown 
> CC: Christoph Lameter 
> Cc: Minchan Kim 
> CC: Andrew Morton 
> CC: KOSAKI Motohiro 
> Signed-off-by: Yasuaki Ishimatsu 
> Signed-off-by: Wen Congyang 
> ---
>  drivers/acpi/acpi_memhotplug.c |2 +-
>  include/linux/memory_hotplug.h |2 +-
>  mm/memory_hotplug.c|6 +++---
>  3 files changed, 5 insertions(+), 5 deletions(-)

Probably, the better way is to just remove remove_memory() and use
offline_pages().

btw, current remove_memory() pfn calculation is just buggy.


> int remove_memory(u64 start, u64 size)
> {
>   unsigned long start_pfn, end_pfn;
>
>   start_pfn = PFN_DOWN(start);
>   end_pfn = start_pfn + PFN_DOWN(size);

It should be:

start_pfn = PFN_DOWN(start);
end_pfn = PFN_UP(start + size)

or

start_pfn = PFN_UP(start);
end_pfn = PFN_DOWN(start + size)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/4] acpi,memory-hotplug : add memory offline code to acpi_memory_device_remove()

2012-10-04 Thread KOSAKI Motohiro
On Wed, Oct 3, 2012 at 5:58 AM, Yasuaki Ishimatsu
 wrote:
> From: Yasuaki Ishimatsu 
>
> The memory device can be removed by 2 ways:
> 1. send eject request by SCI
> 2. echo 1 >/sys/bus/pci/devices/PNP0C80:XX/eject
>
> In the 1st case, acpi_memory_disable_device() will be called.
> In the 2nd case, acpi_memory_device_remove() will be called.
> acpi_memory_device_remove() will also be called when we unbind the
> memory device from the driver acpi_memhotplug.
>
> acpi_memory_disable_device() has already implemented a code which
> offlines memory and releases acpi_memory_info struct . But
> acpi_memory_device_remove() has not implemented it yet.
>
> So the patch implements acpi_memory_remove_memory() for offlining
> memory and releasing acpi_memory_info struct. And it is used by both
> acpi_memory_device_remove() and acpi_memory_disable_device().
>
> Additionally, if the type is ACPI_BUS_REMOVAL_EJECT in
> acpi_memory_device_remove() , it means that the user wants to eject
> the memory device. In this case, acpi_memory_device_remove() calls
> acpi_memory_remove_memory().
>
> CC: David Rientjes 
> CC: Jiang Liu 
> CC: Len Brown 
> CC: Christoph Lameter 
> Cc: Minchan Kim 
> CC: Andrew Morton 
> CC: KOSAKI Motohiro 
> Signed-off-by: Yasuaki Ishimatsu 
> Signed-off-by: Wen Congyang 
> ---
>  drivers/acpi/acpi_memhotplug.c |   44 
> +++--
>  1 file changed, 34 insertions(+), 10 deletions(-)
>
> Index: linux-3.6/drivers/acpi/acpi_memhotplug.c
> ===
> --- linux-3.6.orig/drivers/acpi/acpi_memhotplug.c   2012-10-03 
> 18:55:33.386378909 +0900
> +++ linux-3.6/drivers/acpi/acpi_memhotplug.c2012-10-03 18:55:58.624380688 
> +0900
> @@ -306,24 +306,37 @@ static int acpi_memory_powerdown_device(
> return 0;
>  }
>
> -static int acpi_memory_disable_device(struct acpi_memory_device *mem_device)
> +static int acpi_memory_remove_memory(struct acpi_memory_device *mem_device)
>  {
> int result;
> struct acpi_memory_info *info, *n;
>
> +   list_for_each_entry_safe(info, n, &mem_device->res_list, list) {

Which lock protect this loop?


> +   if (!info->enabled)
> +   return -EBUSY;
> +
> +   result = remove_memory(info->start_addr, info->length);
> +   if (result)
> +   return result;

I suspect you need to implement rollback code instead of just return.


> +
> +   list_del(&info->list);
> +   kfree(info);
> +   }
> +
> +   return 0;
> +}
> +
> +static int acpi_memory_disable_device(struct acpi_memory_device *mem_device)
> +{
> +   int result;
>
> /*
>  * Ask the VM to offline this memory range.
>  * Note: Assume that this function returns zero on success
>  */

Write function comment instead of this silly comment.

> -   list_for_each_entry_safe(info, n, &mem_device->res_list, list) {
> -   if (info->enabled) {
> -   result = remove_memory(info->start_addr, 
> info->length);
> -   if (result)
> -   return result;
> -   }
> -   kfree(info);
> -   }
> +   result = acpi_memory_remove_memory(mem_device);
> +   if (result)
> +   return result;
>
> /* Power-off and eject the device */
> result = acpi_memory_powerdown_device(mem_device);

This patch move acpi_memory_powerdown_device() from ACPI_NOTIFY_EJECT_REQUEST
to release callback, but don't explain why.





> @@ -473,12 +486,23 @@ static int acpi_memory_device_add(struct
>  static int acpi_memory_device_remove(struct acpi_device *device, int type)
>  {
> struct acpi_memory_device *mem_device = NULL;
> -
> +   int result;
>
> if (!device || !acpi_driver_data(device))
> return -EINVAL;
>
> mem_device = acpi_driver_data(device);
> +
> +   if (type == ACPI_BUS_REMOVAL_EJECT) {
> +   /*
> +* offline and remove memory only when the memory device is
> +* ejected.
> +*/

This comment explain nothing. A comment should describe _why_ should we do.
e.g. Why REMOVAL_NORMAL and REMOVEL_EJECT should be ignored. Why
we need remove memory here instead of ACPI_NOTIFY_EJECT_REQUEST.


> +   result = acpi_memory_remove_memory(mem_device);
> +   if (result)
> +   return result;
> +   }
> +
> kfree(mem_device);
>
>

Re: [PATCH] mm: use %pK for /proc/vmallocinfo

2012-10-02 Thread KOSAKI Motohiro
On Tue, Oct 2, 2012 at 7:49 PM, Kees Cook  wrote:
> In the paranoid case of sysctl kernel.kptr_restrict=2, mask the kernel
> virtual addresses in /proc/vmallocinfo too.
>
> Reported-by: Brad Spengler 
> Signed-off-by: Kees Cook 
> ---
>  mm/vmalloc.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 2bb90b1..9c871db 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2572,7 +2572,7 @@ static int s_show(struct seq_file *m, void *p)
>  {
> struct vm_struct *v = p;
>
> -   seq_printf(m, "0x%p-0x%p %7ld",
> +   seq_printf(m, "0x%pK-0x%pK %7ld",
> v->addr, v->addr + v->size, v->size);

Looks good.
Acked-by: KOSAKI Motohiro 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v9 PATCH 01/21] memory-hotplug: rename remove_memory() to offline_memory()/offline_pages()

2012-10-02 Thread KOSAKI Motohiro
>> Then, you introduced bisect breakage. It is definitely unacceptable.
>
> What is "bisect breakage" meaning?

Think what's happen when only applying path [1/21].
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v9 PATCH 06/21] memory-hotplug: export the function acpi_bus_remove()

2012-10-02 Thread KOSAKI Motohiro
On Mon, Oct 1, 2012 at 8:34 PM, Ni zhan Chen  wrote:
> On 09/05/2012 05:25 PM, we...@cn.fujitsu.com wrote:
>>
>> From: Wen Congyang 
>>
>> The function acpi_bus_remove() can remove a acpi device from acpi device.
>
> IIUC, s/acpi device/acpi bus

IIUC, acpi_bus_remove() mean "remove the device from a bus".
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/4] memory-hotplug: add node_device_release

2012-10-01 Thread KOSAKI Motohiro
On Mon, Oct 1, 2012 at 2:54 AM, Yasuaki Ishimatsu
 wrote:
> Hi Kosaki-san,
>
>
> 2012/09/29 7:19, KOSAKI Motohiro wrote:
>>>>>
>>>>> I don't understand it. How can we get rid of the warning?
>>>>
>>>>
>>>> See cpu_device_release() for example.
>>>
>>>
>>> If we implement a function like cpu_device_release(), the warning
>>> disappears. But the comment says in the function "Never copy this
>>> way...".
>>> So I think it is illegal way.
>>
>>
>> What does "illegal" mean?
>
>
> The "illegal" means the code should not be mimicked.
>
>
>> You still haven't explain any benefit of your code. If there is zero
>> benefit, just kill it.
>> I believe everybody think so.
>>
>> Again, Which benefit do you have?
>
>
> The patch has a benefit to delets a warning message.
>
>
>>
>>>>>> Why do we need this node_device_release() implementation?
>>>>>
>>>>>
>>>>> I think that this is a manner of releasing object related kobject.
>>>>
>>>>
>>>> No.  Usually we never call memset() from release callback.
>>>
>>>
>>> What we want to release is a part of array, not a pointer.
>>> Therefore, there is only this way instead of kfree().
>>
>>
>> Why? Before your patch, we don't have memset() and did work it.
>
>
> If we does not apply the patch, a warning message is shown.
> So I think it did not work well.
>
>
>> I can't understand what mean "only way".
>
>
> For deleting a warning message, I created a node_device_release().
> In the manner of releasing kobject, the function frees a object related
> to the kobject. So most functions calls kfree() for releasing it.
> In node_device_release(), we need to free a node struct. If the node
> struct is pointer, I can free it by kfree. But the node struct is a part
> of node_devices[] array. I cannot free it. So I filled the node struct
> with 0.
>
> But you think it is not good. Do you have a good solution?

Do nothing. just add empty release function and kill a warning.
Obviously do nothing can't make any performance drop nor any
side effect.

meaningless memset() is just silly from point of cache pollution view.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/4] memory-hotplug: add memory_block_release

2012-09-28 Thread KOSAKI Motohiro
> It is not correct. The kobject_put() is prepared against find_memory_block()
> in remove_memory_block() since kobject->kref is incremented in it.
> So release_memory_block() is called by device_unregister() correctly as
> follows:

Ok. Looks good then.
Please rewrite the description more kindly at next spin. Current one
is really really bad.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] slub, hotplug: ignore unrelated node's hot-adding and hot-removing

2012-09-28 Thread KOSAKI Motohiro
On Fri, Sep 28, 2012 at 3:19 AM, Lai Jiangshan  wrote:
> HI, Christoph, KOSAKI
>
> SLAB always allocates kmem_list3 for all nodes(N_HIGH_MEMORY), also node 
> bug/bad things happens.
> SLUB always requires kmem_cache_node on the correct node, so these fix is 
> needed.
>
> SLAB uses for_each_online_node() to travel nodes and do maintain,
> and it tolerates kmem_list3 on alien nodes.
> SLUB uses for_each_node_state(node, N_NORMAL_MEMORY) to travel nodes and do 
> maintain,
> and it does not tolerate kmem_cache_node on alien nodes.
>
> Maybe we need to change SLAB future and let it use
> for_each_node_state(node, N_NORMAL_MEMORY), But I don't want to change SLAB
> until I find something bad in SLAB.

SLAB can't use highmem. then traverse zones which don't have normal
memory is silly IMHO.
If this is not bug, current slub behavior is also not bug. Is there
any difference?

If I understand correctly, current code may waste some additional
memory on corner case. but it doesn't make memory leak both when slab
and slub.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v9 PATCH 01/21] memory-hotplug: rename remove_memory() to offline_memory()/offline_pages()

2012-09-28 Thread KOSAKI Motohiro
On Thu, Sep 27, 2012 at 11:50 PM, Yasuaki Ishimatsu
 wrote:
> Hi Chen,
>
>
> 2012/09/28 11:22, Ni zhan Chen wrote:
>>
>> On 09/05/2012 05:25 PM, we...@cn.fujitsu.com wrote:
>>>
>>> From: Yasuaki Ishimatsu 
>>>
>>> remove_memory() only try to offline pages. It is called in two cases:
>>> 1. hot remove a memory device
>>> 2. echo offline >/sys/devices/system/memory/memoryXX/state
>>>
>>> In the 1st case, we should also change memory block's state, and notify
>>> the userspace that the memory block's state is changed after offlining
>>> pages.
>>>
>>> So rename remove_memory() to offline_memory()/offline_pages(). And in
>>> the 1st case, offline_memory() will be used. The function
>>> offline_memory()
>>> is not implemented. In the 2nd case, offline_pages() will be used.
>>
>>
>> But this time there is not a function associated with add_memory.
>
>
> To associate with add_memory() later, we renamed it.

Then, you introduced bisect breakage. It is definitely unacceptable.

NAK.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/4] memory-hotplug: add node_device_release

2012-09-28 Thread KOSAKI Motohiro
>>> I don't understand it. How can we get rid of the warning?
>>
>> See cpu_device_release() for example.
>
> If we implement a function like cpu_device_release(), the warning
> disappears. But the comment says in the function "Never copy this way...".
> So I think it is illegal way.

What does "illegal" mean?
You still haven't explain any benefit of your code. If there is zero
benefit, just kill it.
I believe everybody think so.

Again, Which benefit do you have?


 Why do we need this node_device_release() implementation?
>>>
>>> I think that this is a manner of releasing object related kobject.
>>
>> No.  Usually we never call memset() from release callback.
>
> What we want to release is a part of array, not a pointer.
> Therefore, there is only this way instead of kfree().

Why? Before your patch, we don't have memset() and did work it.
I can't understand what mean "only way".
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/4] memory-hotplug: add node_device_release

2012-09-27 Thread KOSAKI Motohiro
>> Moreover, your explanation is still insufficient. Even if
>> node_device_release() is empty function, we can get rid of the
>> warning.
>
>
> I don't understand it. How can we get rid of the warning?

See cpu_device_release() for example.



>> Why do we need this node_device_release() implementation?
>
> I think that this is a manner of releasing object related kobject.

No.  Usually we never call memset() from release callback.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/4] memory-hotplug: add memory_block_release

2012-09-27 Thread KOSAKI Motohiro
On Thu, Sep 27, 2012 at 8:24 PM, Yasuaki Ishimatsu
 wrote:
> Hi Chen,
>
>
> 2012/09/27 19:20, Ni zhan Chen wrote:
>>
>> Hi Congyang,
>>
>> 2012/9/27 
>>
>>> From: Yasuaki Ishimatsu 
>>>
>>> When calling remove_memory_block(), the function shows following message
>>> at
>>> device_release().
>>>
>>> Device 'memory528' does not have a release() function, it is broken and
>>> must
>>> be fixed.
>>>
>>
>> What's the difference between the patch and original implemetation?
>
>
> The implementation is for removing a memory_block. So the purpose is
> same as original one. But original code is bad manner. kobject_cleanup()
> is called by remove_memory_block() at last. But release function for
> releasing memory_block is not registered. As a result, the kernel message
> is shown. IMHO, memory_block should be release by the releae function.

but your patch introduced use after free bug, if i understand correctly.
See unregister_memory() function. After your patch, kobject_put() call
release_memory_block() and kfree(). and then device_unregister() will
touch freed memory.

static void
unregister_memory(struct memory_block *memory)
{
BUG_ON(memory->dev.bus != &memory_subsys);

/* drop the ref. we got in remove_memory_block() */
kobject_put(&memory->dev.kobj);
device_unregister(&memory->dev);
}
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/4] memory-hotplug: add node_device_release

2012-09-27 Thread KOSAKI Motohiro
On Thu, Sep 27, 2012 at 8:07 PM, Yasuaki Ishimatsu
 wrote:
> Hi Kosaki-san,
>
>
> 2012/09/28 5:13, KOSAKI Motohiro wrote:
>>
>> On Thu, Sep 27, 2012 at 1:45 AM,   wrote:
>>>
>>> From: Yasuaki Ishimatsu 
>>>
>>> When calling unregister_node(), the function shows following message at
>>> device_release().
>>
>>
>> This description doesn't have the "following message".
>>
>>
>
>>> Device 'node2' does not have a release() function, it is broken and must
>>> be
>>> fixed.
>
>
> This is the messages. The message is shown by kobject_cleanup(), when
> calling
> unregister_node().

If so, you should quote the message. and don't mix it with your
subject. Moreover
your patch title is too silly. "add node_device_release() function" is
a way. you should
describe the effect of the patch. e.g. suppress "Device 'nodeXX' does
not have a release() function" warning.

Moreover, your explanation is still insufficient. Even if
node_device_release() is empty function, we can get rid of the
warning. Why do we need this node_device_release()
implementation?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] memory_hotplug: Don't modify the zone_start_pfn outside of zone_span_writelock()

2012-09-27 Thread KOSAKI Motohiro
(9/27/12 2:47 AM), Lai Jiangshan wrote:
> The __add_zone() maybe call sleep-able init_currently_empty_zone()
> to init wait_table,

This doesn't explain why sleepable is critical important. I think sleepable
is jsut unrelated. The fact is only: to write zone->zone_start_pfn require
zone_span_writelock, but init_currently_empty_zone() doesn't take it.


> 
> But this function also modifies the zone_start_pfn without any lock.
> It is bugy.

buggy?


> So we move this modification out, and we ensure the modification
> of zone_start_pfn is only done with zone_span_writelock() held or in booting.
> 
> Since zone_start_pfn is not modified by init_currently_empty_zone()
> grow_zone_span() needs to check zone_start_pfn before update it.
> 
> CC: Mel Gorman 
> Signed-off-by: Lai Jiangshan 
> Reported-by: Yasuaki ISIMATU 
> Tested-by: Wen Congyang 
> ---
>  mm/memory_hotplug.c |2 +-
>  mm/page_alloc.c |3 +--
>  2 files changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index b62d429b..790561f 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -205,7 +205,7 @@ static void grow_zone_span(struct zone *zone, unsigned 
> long start_pfn,
>   zone_span_writelock(zone);
>  
>   old_zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> - if (start_pfn < zone->zone_start_pfn)
> + if (!zone->zone_start_pfn || start_pfn < zone->zone_start_pfn)
>   zone->zone_start_pfn = start_pfn;

Wrong. zone->zone_start_pfn==0 may be valid pfn. You shouldn't assume it is 
uninitialized
value.


>  
>   zone->spanned_pages = max(old_zone_end_pfn, end_pfn) -
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c13ea75..2545013 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3997,8 +3997,6 @@ int __meminit init_currently_empty_zone(struct zone 
> *zone,
>   return ret;
>   pgdat->nr_zones = zone_idx(zone) + 1;
>  
> - zone->zone_start_pfn = zone_start_pfn;
> -
>   mminit_dprintk(MMINIT_TRACE, "memmap_init",
>   "Initialising map node %d zone %lu pfns %lu -> %lu\n",
>   pgdat->node_id,
> @@ -4465,6 +4463,7 @@ static void __paginginit free_area_init_core(struct 
> pglist_data *pgdat,
>   ret = init_currently_empty_zone(zone, zone_start_pfn,
>   size, MEMMAP_EARLY);
>   BUG_ON(ret);
> + zone->zone_start_pfn = zone_start_pfn;
>   memmap_init(size, nid, j, zone_start_pfn);
>   zone_start_pfn += size;
>   }
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] slub, hotplug: ignore unrelated node's hot-adding and hot-removing

2012-09-27 Thread KOSAKI Motohiro
(9/27/12 2:47 AM), Lai Jiangshan wrote:
> SLUB only fucus on the nodes which has normal memory, so ignore the other
> node's hot-adding and hot-removing.
> 
> Aka: if some memroy of a node(which has no onlined memory) is online,
> but this new memory onlined is not normal memory(HIGH memory example),
> we should not allocate kmem_cache_node for SLUB.
> 
> And if the last normal memory is offlined, but the node still has memroy,
> we should remove kmem_cache_node for that node.(current code delay it when
> all of the memory is offlined)
> 
> so we only do something when marg->status_change_nid_normal > 0.
> marg->status_change_nid is not suitable here.
> 
> Signed-off-by: Lai Jiangshan 
> ---
>  mm/slub.c |4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 2fdd96f..2d78639 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3577,7 +3577,7 @@ static void slab_mem_offline_callback(void *arg)
>   struct memory_notify *marg = arg;
>   int offline_node;
>  
> - offline_node = marg->status_change_nid;
> + offline_node = marg->status_change_nid_normal;
>  
>   /*
>* If the node still has available memory. we need kmem_cache_node
> @@ -3610,7 +3610,7 @@ static int slab_mem_going_online_callback(void *arg)
>   struct kmem_cache_node *n;
>   struct kmem_cache *s;
>   struct memory_notify *marg = arg;
> - int nid = marg->status_change_nid;
> + int nid = marg->status_change_nid_normal;
>   int ret = 0;

Looks reasonable. I think slab need similar fix too.




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] memory_hotplug: fix stale node_states[N_NORMAL_MEMORY]

2012-09-27 Thread KOSAKI Motohiro
(9/27/12 2:47 AM), Lai Jiangshan wrote:
> Currently memory_hotplug only manages the node_states[N_HIGH_MEMORY],
> it forgets to manage node_states[N_NORMAL_MEMORY]. it causes
> node_states[N_NORMAL_MEMORY] becomes stale.

What's mean 'stale'? I guess

: Currently memory_hotplug doesn't turn on/off node_states[N_NORMAL_MEMORY] and
: then it will be invalid if the platform has highmem. Luckily, almost memory 
: hotplug aware platform don't have highmem, but are not all.

right?
I supporse this patch only meaningful on ARM platform practically.



> We add check_nodemasks_changes_online() and check_nodemasks_changes_offline()
> to detect whether node_states[N_HIGH_MEMORY] and node_states[N_NORMAL_MEMORY]
> are changed while hotpluging.


> Also add @status_change_nid_normal to struct memory_notify, thus
> the memory hotplug callbacks know whether the node_states[N_NORMAL_MEMORY]
> are changed.

status_change_nid_normal is very ugly to me. When status_change_nid and 
status_change_nid_normal has positive value, they are always the same.
nid and flags value are more natual to me.



> 
> Signed-off-by: Lai Jiangshan 
> ---
>  Documentation/memory-hotplug.txt |5 ++-
>  include/linux/memory.h   |1 +
>  mm/memory_hotplug.c  |   94 +++--
>  3 files changed, 83 insertions(+), 17 deletions(-)
> 
> diff --git a/Documentation/memory-hotplug.txt 
> b/Documentation/memory-hotplug.txt
> index 6d0c251..6e6cbc7 100644
> --- a/Documentation/memory-hotplug.txt
> +++ b/Documentation/memory-hotplug.txt
> @@ -377,15 +377,18 @@ The third argument is passed by pointer of struct 
> memory_notify.
>  struct memory_notify {
> unsigned long start_pfn;
> unsigned long nr_pages;
> +   int status_change_nid_normal;
> int status_change_nid;
>  }
>  
>  start_pfn is start_pfn of online/offline memory.
>  nr_pages is # of pages of online/offline memory.
> +status_change_nid_normal is set node id when N_NORMAL_MEMORY of nodemask
> +is (will be) set/clear, if this is -1, then nodemask status is not changed.
>  status_change_nid is set node id when N_HIGH_MEMORY of nodemask is (will be)
>  set/clear. It means a new(memoryless) node gets new memory by online and a
>  node loses all memory. If this is -1, then nodemask status is not changed.
> -If status_changed_nid >= 0, callback should create/discard structures for the
> +If status_changed_nid* >= 0, callback should create/discard structures for 
> the
>  node if necessary.
>  
>  --
> diff --git a/include/linux/memory.h b/include/linux/memory.h
> index ff9a9f8..a09216d 100644
> --- a/include/linux/memory.h
> +++ b/include/linux/memory.h
> @@ -53,6 +53,7 @@ int arch_get_memory_phys_device(unsigned long start_pfn);
>  struct memory_notify {
>   unsigned long start_pfn;
>   unsigned long nr_pages;
> + int status_change_nid_normal;
>   int status_change_nid;
>  };
>  
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 6a5b90d..b62d429b 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -460,6 +460,34 @@ static int online_pages_range(unsigned long start_pfn, 
> unsigned long nr_pages,
>   return 0;
>  }
>  
> +static void check_nodemasks_changes_online(unsigned long nr_pages,
> + struct zone *zone, struct memory_notify *arg)
> +{
> + int nid = zone_to_nid(zone);
> + enum zone_type zone_last = ZONE_NORMAL;
> +
> + if (N_HIGH_MEMORY == N_NORMAL_MEMORY)
> + zone_last = ZONE_MOVABLE;

This is very strange (or ugly) code. ZONE_MOVABLE don't depend on high mem.


> +
> + if (zone_idx(zone) <= zone_last && !node_state(nid, N_NORMAL_MEMORY))
> + arg->status_change_nid_normal = nid;
> + else
> + arg->status_change_nid_normal = -1;

Wrong. The onlined node may only have high mem zone. IOW, think fake numa case 
etc.


> +
> + if (!node_state(nid, N_HIGH_MEMORY))
> + arg->status_change_nid = nid;
> + else
> + arg->status_change_nid = -1;
> +}
> +
> +static void set_nodemasks(int node, struct memory_notify *arg)

Too ugly. just remove this and use node_set_state() directly.

> +{
> + if (arg->status_change_nid_normal >= 0)
> + node_set_state(node, N_NORMAL_MEMORY);
> +
> + node_set_state(node, N_HIGH_MEMORY);
> +}
> +
>  
>  int __ref online_pages(unsigned long pfn, unsigned long nr_pages)
>  {
> @@ -471,13 +499,18 @@ int __ref online_pages(unsigned long pfn, unsigned long 
> nr_pages)
>   struct memory_notify arg;
>  
>   lock_memory_hotplug();
> + /*
> +  * This doesn't need a lock to do pfn_to_page().
> +  * The section can't be removed here because of the
> +  * memory_block->state_mutex.
> +  */

Please explain the intention of this comment. We think lock_memory_hotplug() 
close
a race against memory offline directly.


> + zone = page_zone(pfn_to_page(pfn));
> +
>   arg.start_pfn = pfn;
>   arg.nr_pages = 

Re: [PATCH 4/4] memory-hotplug: auto offline page_cgroup when onlining memory block failed

2012-09-27 Thread KOSAKI Motohiro
On Thu, Sep 27, 2012 at 1:45 AM,   wrote:
> From: Wen Congyang 
>
> When a memory block is onlined, we will try allocate memory on that node
> to store page_cgroup. If onlining the memory block failed, we don't
> offline the page cgroup, and we have no chance to offline this page cgroup
> unless the memory block is onlined successfully again. It will cause
> that we can't hot-remove the memory device on that node, because some
> memory is used to store page cgroup. If onlining the memory block
> is failed, there is no need to stort page cgroup for this memory. So
> auto offline page_cgroup when onlining memory block failed.
>
> CC: David Rientjes 
> CC: Jiang Liu 
> CC: Len Brown 
> CC: Benjamin Herrenschmidt 
> CC: Paul Mackerras 
> Cc: Minchan Kim 
> CC: Andrew Morton 
> CC: KOSAKI Motohiro 
> CC: Yasuaki Ishimatsu 
> Signed-off-by: Wen Congyang 
> ---
>  mm/page_cgroup.c |3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
> index 5ddad0c..44db00e 100644
> --- a/mm/page_cgroup.c
> +++ b/mm/page_cgroup.c
> @@ -251,6 +251,9 @@ static int __meminit page_cgroup_callback(struct 
> notifier_block *self,
> mn->nr_pages, mn->status_change_nid);
> break;
> case MEM_CANCEL_ONLINE:
> +   offline_page_cgroup(mn->start_pfn,
> +   mn->nr_pages, mn->status_change_nid);
> +   break;
>     case MEM_GOING_OFFLINE:
> break;
> case MEM_ONLINE:

Looks straight forward and reasonable.
Acked-by: KOSAKI Motohiro 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/4] memory-hotplug: clear hwpoisoned flag when onlining pages

2012-09-27 Thread KOSAKI Motohiro
On Thu, Sep 27, 2012 at 1:45 AM,   wrote:
> From: Wen Congyang 
>
> hwpoisoned may set when we offline a page by the sysfs interface
> /sys/devices/system/memory/soft_offline_page or
> /sys/devices/system/memory/hard_offline_page. If we don't clear
> this flag when onlining pages, this page can't be freed, and will
> not in free list. So we can't offline these pages again. So we
> should clear this flag when onlining pages.

This seems wrong fix to me.  After offline, memory may or may not
change with new one. Thus we can't assume any memory status. Thus,
we should just forget hwpoison status at _offline_ event.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/4] memory-hotplug: add node_device_release

2012-09-27 Thread KOSAKI Motohiro
On Thu, Sep 27, 2012 at 1:45 AM,   wrote:
> From: Yasuaki Ishimatsu 
>
> When calling unregister_node(), the function shows following message at
> device_release().

This description doesn't have the "following message".


> Device 'node2' does not have a release() function, it is broken and must be
> fixed.
>
> So the patch implements node_device_release()
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/4] memory-hotplug: add memory_block_release

2012-09-27 Thread KOSAKI Motohiro
On Thu, Sep 27, 2012 at 1:45 AM,   wrote:
> From: Yasuaki Ishimatsu 
>
> When calling remove_memory_block(), the function shows following message at
> device_release().
>
> Device 'memory528' does not have a release() function, it is broken and must
> be fixed.
>
> remove_memory_block() calls kfree(mem). I think it shouled be called from
> device_release(). So the patch implements memory_block_release()

Why do you think so? This is terribly bad change log. it has almost
zero information.
We can't review it.




>
> CC: David Rientjes 
> CC: Jiang Liu 
> CC: Len Brown 
> CC: Benjamin Herrenschmidt 
> CC: Paul Mackerras 
> Cc: Minchan Kim 
> CC: Andrew Morton 
> CC: KOSAKI Motohiro 
> CC: Wen Congyang 
> Signed-off-by: Yasuaki Ishimatsu 
> ---
>  drivers/base/memory.c |9 -
>  1 files changed, 8 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 7dda4f7..da457e5 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -70,6 +70,13 @@ void unregister_memory_isolate_notifier(struct 
> notifier_block *nb)
>  }
>  EXPORT_SYMBOL(unregister_memory_isolate_notifier);
>
> +static void release_memory_block(struct device *dev)
> +{
> +   struct memory_block *mem = container_of(dev, struct memory_block, 
> dev);
> +
> +   kfree(mem);
> +}
> +
>  /*
>   * register_memory - Setup a sysfs device for a memory block
>   */
> @@ -80,6 +87,7 @@ int register_memory(struct memory_block *memory)
>
> memory->dev.bus = &memory_subsys;
> memory->dev.id = memory->start_section_nr / sections_per_block;
> +   memory->dev.release = release_memory_block;
>
> error = device_register(&memory->dev);
> return error;
> @@ -630,7 +638,6 @@ int remove_memory_block(unsigned long node_id, struct 
> mem_section *section,
> mem_remove_simple_file(mem, phys_device);
> mem_remove_simple_file(mem, removable);
> unregister_memory(mem);
> -   kfree(mem);
> } else
> kobject_put(&mem->dev.kobj);
>
> --
> 1.7.1
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org";> em...@kvack.org 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] pagemap: fix wrong KPF_THP on slab pages

2012-09-25 Thread KOSAKI Motohiro
On Tue, Sep 25, 2012 at 1:05 PM, Naoya Horiguchi
 wrote:
> On Tue, Sep 25, 2012 at 11:59:51AM -0400, KOSAKI Motohiro wrote:
>> On Tue, Sep 25, 2012 at 9:56 AM, Naoya Horiguchi
>>  wrote:
>> > KPF_THP can be set on non-huge compound pages like slab pages, because
>> > PageTransCompound only sees PG_head and PG_tail. Obviously this is a bug
>> > and breaks user space applications which look for thp via /proc/kpageflags.
>> > Currently thp is constructed only on anonymous pages, so this patch makes
>> > KPF_THP be set when both of PageAnon and PageTransCompound are true.
>>
>> Indeed. Please add some comment too.
>
> Sure. I send revised one.
>
> Thanks,
> Naoya
> ---
> From: Naoya Horiguchi 
> Date: Mon, 24 Sep 2012 16:28:30 -0400
> Subject: [PATCH v2] pagemap: fix wrong KPF_THP on slab pages
>
> KPF_THP can be set on non-huge compound pages like slab pages, because
> PageTransCompound only sees PG_head and PG_tail. Obviously this is a bug
> and breaks user space applications which look for thp via /proc/kpageflags.
> Currently thp is constructed only on anonymous pages, so this patch makes
> KPF_THP be set when both of PageAnon and PageTransCompound are true.
>
> Changelog in v2:
>   - add a comment in code
>
> Signed-off-by: Naoya Horiguchi 
> ---
>  fs/proc/page.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/fs/proc/page.c b/fs/proc/page.c
> index 7fcd0d6..f7cd2f6c 100644
> --- a/fs/proc/page.c
> +++ b/fs/proc/page.c
> @@ -115,7 +115,12 @@ u64 stable_page_flags(struct page *page)
> u |= 1 << KPF_COMPOUND_TAIL;
> if (PageHuge(page))
> u |= 1 << KPF_HUGE;
> -   else if (PageTransCompound(page))
> +   /*
> +* Since THP is relevant only for anonymous pages so far, we check it
> +* explicitly with PageAnon. Otherwise thp is confounded with non-huge
> +    * compound pages like slab pages.
> +*/
> +   else if (PageTransCompound(page) && PageAnon(page))
> u |= 1 << KPF_THP;

Looks good to me.

Acked-by: KOSAKI Motohiro 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] pagemap: fix wrong KPF_THP on slab pages

2012-09-25 Thread KOSAKI Motohiro
On Tue, Sep 25, 2012 at 9:56 AM, Naoya Horiguchi
 wrote:
> KPF_THP can be set on non-huge compound pages like slab pages, because
> PageTransCompound only sees PG_head and PG_tail. Obviously this is a bug
> and breaks user space applications which look for thp via /proc/kpageflags.
> Currently thp is constructed only on anonymous pages, so this patch makes
> KPF_THP be set when both of PageAnon and PageTransCompound are true.

Indeed. Please add some comment too.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix NR_ISOLATED_[ANON|FILE] mismatch

2012-09-20 Thread KOSAKI Motohiro
On Wed, Sep 19, 2012 at 7:51 PM, Minchan Kim  wrote:
> On Wed, Sep 19, 2012 at 02:28:10PM -0400, Johannes Weiner wrote:
>> On Wed, Sep 19, 2012 at 01:04:56PM -0400, KOSAKI Motohiro wrote:
>> > On Wed, Sep 19, 2012 at 3:45 AM, Minchan Kim  wrote:
>> > > When I looked at zone stat mismatch problem, I found
>> > > migrate_to_node doesn't decrease NR_ISOLATED_[ANON|FILE]
>> > > if check_range fails.
>>
>> This is a bit misleading.  It's not that the stats would be
>> inaccurate, it's that the pages would be leaked from the LRU, no?
>>
>> > > It can make system hang out.
>>
>> Did you spot this by code review only or did you actually run into
>> this?  Because...
>>
>> > > Cc: KOSAKI Motohiro 
>> > > Cc: Mel Gorman 
>> > > Cc: Christoph Lameter 
>> > > Signed-off-by: Minchan Kim 
>> > > ---
>> > >  mm/mempolicy.c |   16 
>> > >  1 file changed, 8 insertions(+), 8 deletions(-)
>> > >
>> > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>> > > index 3d64b36..6bf0860 100644
>> > > --- a/mm/mempolicy.c
>> > > +++ b/mm/mempolicy.c
>> > > @@ -953,16 +953,16 @@ static int migrate_to_node(struct mm_struct *mm, 
>> > > int source, int dest,
>> > >
>> > > vma = check_range(mm, mm->mmap->vm_start, mm->task_size, &nmask,
>> > > flags | MPOL_MF_DISCONTIG_OK, &pagelist);
>> > > -   if (IS_ERR(vma))
>> > > -   return PTR_ERR(vma);
>> > > -
>> > > -   if (!list_empty(&pagelist)) {
>> > > +   if (IS_ERR(vma)) {
>> > > +   err = PTR_ERR(vma);
>> > > +   goto out;
>> > > +   }
>> > > +   if (!list_empty(&pagelist))
>> > > err = migrate_pages(&pagelist, new_node_page, dest,
>> > > false, 
>> > > MIGRATE_SYNC);
>> > > -   if (err)
>> > > -   putback_lru_pages(&pagelist);
>> > > -   }
>> > > -
>> > > +out:
>> > > +   if (err)
>> > > +   putback_lru_pages(&pagelist);
>> >
>> > Good catch!
>> > This is a regression since following commit. So, I doubt we need
>> > all or nothing semantics. Can we revert it instead? (and probably
>> > we need more kind comment for preventing an accident)
>>
>> I think it makes sense to revert.  Not because of the semantics, but I
>> just don't see how check_range() could even fail for this callsite:
>>
>> 1. we pass mm->mmap->vm_start in there, so we should not fail due to
>>find_vma()
>>
>> 2. we pass MPOL_MF_DISCONTIG_OK, so the discontig checks do not apply
>>and so can not fail
>>
>> 3. we pass MPOL_MF_MOVE | MPOL_MF_MOVE_ALL, the page table loops will
>>continue until addr == end, so we never fail with -EIO
>>
>> > commit 0def08e3acc2c9c934e4671487029aed52202d42
>> > Author: Vasiliy Kulikov 
>> > Date:   Tue Oct 26 14:21:32 2010 -0700
>> >
>> > mm/mempolicy.c: check return code of check_range
>>
>> We don't use this code to "check" the range, we use it to collect
>> migrate pages.  There is no failure case.
>>
>
> Here it goes.
>
> From c2c21b551811e034eb0ede6806e0314b201d7e5b Mon Sep 17 00:00:00 2001
> From: Minchan Kim 
> Date: Thu, 20 Sep 2012 08:39:52 +0900
> Subject: [PATCH] mm: revert 0def08e3, mm/mempolicy.c: check return code of
>  check_range
>
> This patch reverts 0def08e3 because check_range can't fail in
> migrate_to_node with considering current usecases.
>
> Quote from Johannes
> "
> I think it makes sense to revert.  Not because of the semantics, but I
> just don't see how check_range() could even fail for this callsite:
>
> 1. we pass mm->mmap->vm_start in there, so we should not fail due to
>find_vma()
>
> 2. we pass MPOL_MF_DISCONTIG_OK, so the discontig checks do not apply
>and so can not fail
>
> 3. we pass MPOL_MF_MOVE | MPOL_MF_MOVE_ALL, the page table loops will
>continue until addr == end, so we never fail with -EIO
> "
>
> And I add new VM_BUG_ON for checking migrate_to_node's future usecase
> which might pass to MPOL_MF_STRICT.
>
> Cc: KOSAKI Motohiro 
> Cc: Mel Gorman 
> 

Re: [PATCH] memory-hotplug: fix zone stat mismatch

2012-09-19 Thread KOSAKI Motohiro
On Wed, Sep 19, 2012 at 3:29 AM, Minchan Kim  wrote:
> During memory-hotplug stress test, I found NR_ISOLATED_[ANON|FILE]
> are increasing so that kernel are hang out.
>
> The cause is that when we do memory-hotadd after memory-remove,
> __zone_pcp_update clear out zone's ZONE_STAT_ITEMS in setup_pageset
> without draining vm_stat_diff of all CPU.
>
> This patch fixes it.

zone_pcp_update() is called from online pages path. but IMHO,
the statistics should be drained offline path. isn't it?

thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix NR_ISOLATED_[ANON|FILE] mismatch

2012-09-19 Thread KOSAKI Motohiro
On Wed, Sep 19, 2012 at 3:45 AM, Minchan Kim  wrote:
> When I looked at zone stat mismatch problem, I found
> migrate_to_node doesn't decrease NR_ISOLATED_[ANON|FILE]
> if check_range fails.
>
> It can make system hang out.
>
> Cc: KOSAKI Motohiro 
> Cc: Mel Gorman 
> Cc: Christoph Lameter 
> Signed-off-by: Minchan Kim 
> ---
>  mm/mempolicy.c |   16 
>  1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 3d64b36..6bf0860 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -953,16 +953,16 @@ static int migrate_to_node(struct mm_struct *mm, int 
> source, int dest,
>
> vma = check_range(mm, mm->mmap->vm_start, mm->task_size, &nmask,
> flags | MPOL_MF_DISCONTIG_OK, &pagelist);
> -   if (IS_ERR(vma))
> -   return PTR_ERR(vma);
> -
> -   if (!list_empty(&pagelist)) {
> +   if (IS_ERR(vma)) {
> +   err = PTR_ERR(vma);
> +   goto out;
> +   }
> +   if (!list_empty(&pagelist))
> err = migrate_pages(&pagelist, new_node_page, dest,
> false, MIGRATE_SYNC);
> -   if (err)
> -   putback_lru_pages(&pagelist);
> -   }
> -
> +out:
> +   if (err)
> +   putback_lru_pages(&pagelist);

Good catch!
This is a regression since following commit. So, I doubt we need
all or nothing semantics. Can we revert it instead? (and probably
we need more kind comment for preventing an accident)


commit 0def08e3acc2c9c934e4671487029aed52202d42
Author: Vasiliy Kulikov 
Date:   Tue Oct 26 14:21:32 2010 -0700

mm/mempolicy.c: check return code of check_range
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Why blktrace didn't trace requests merge?

2012-09-17 Thread KOSAKI Motohiro
(9/17/12 1:55 PM), Tejun Heo wrote:
> (cc'ing Jens)
> 
> On Mon, Sep 17, 2012 at 09:22:28AM -0400, Steven Rostedt wrote:
>> On Mon, 2012-09-17 at 19:33 +0800, Jianpeng Ma wrote:
>>> Hi all:
>>> I used blktrace to trace some io.But i can't find requests merge. I 
>>> searched the code and did't not find.
>>> Why? 
>>> 
>>
>> No idea. I don't use blktrace much, but I Cc'd those that understand it
>> better than I.

backmerge/fronmerge event?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix mmap overflow checking

2012-09-07 Thread KOSAKI Motohiro
>> I've seen the exactly same patch from another fujitsu guys several
>> month ago. and as I pointed
>> out at that time, this line don't work when 32bit kernel + mmap2 syscall 
>> case.
>>
>> Please don't think do_mmap_pgoff() is for mmap(2) specific and read a
>> past thread before resend
>> a patch.
>
> So, what's your opinion about this bug? How to fix it in your mind?

Fix glibc instead of kernel.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 5/5] mempolicy: fix a memory corruption by refcount imbalance in alloc_pages_vma()

2012-09-07 Thread KOSAKI Motohiro
> mempolicy: fix a memory corruption by refcount imbalance in alloc_pages_vma()
> 
> [cc9a6c87: cpuset: mm: reduce large amounts of memory barrier related damage
> v3] introduced a potential memory corruption. shmem_alloc_page() uses a
> pseudo vma and it has one significant unique combination, vma->vm_ops=NULL
> and vma->policy->flags & MPOL_F_SHARED.
> 
> get_vma_policy() does NOT increase a policy ref when vma->vm_ops=NULL and
> mpol_cond_put() DOES decrease a policy ref when a policy has MPOL_F_SHARED.
> Therefore, when a cpuset update race occurs, alloc_pages_vma() falls in 'goto
> retry_cpuset' path, decrements the reference count and frees the policy
> prematurely.
> 
> Signed-off-by: KOSAKI Motohiro 
> Signed-off-by: Mel Gorman 
> ---
>  mm/mempolicy.c |   12 +++-
>  1 files changed, 11 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 45f9825..9842ef5 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1552,8 +1552,18 @@ struct mempolicy *get_vma_policy(struct task_struct 
> *task,
>   addr);
>   if (vpol)
>   pol = vpol;
> - } else if (vma->vm_policy)
> + } else if (vma->vm_policy) {
>   pol = vma->vm_policy;
> +
> + /*
> +  * shmem_alloc_page() passes MPOL_F_SHARED policy with
> +  * a pseudo vma whose vma->vm_ops=NULL. Take a reference
> +  * count on these policies which will be dropped by
> +  * mpol_cond_put() later
> +  */
> +     if (mpol_needs_cond_ref(pol))
> + mpol_get(pol);
> + }

Ok, looks sene change. thank you.


Acked-by: KOSAKI Motohiro 

>   }
>   if (!pol)
>   pol = &default_policy;
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/5] mempolicy: fix a race in shared_policy_replace()

2012-09-07 Thread KOSAKI Motohiro
First off, thank you very much for reworking this for me. I haven't got a chance
to get a test machine for this.

> shared_policy_replace() use of sp_alloc() is unsafe. 1) sp_node cannot
> be dereferenced if sp->lock is not held and 2) another thread can modify
> sp_node between spin_unlock for allocating a new sp node and next spin_lock.
> The bug was introduced before 2.6.12-rc2.
> 
> Kosaki's original patch for this problem was to allocate an sp node and policy
> within shared_policy_replace and initialise it when the lock is reacquired. I
> was not keen on this approach because it partially duplicates sp_alloc(). As
> the paths were sp->lock is taken are not that performance critical this
> patch converts sp->lock to sp->mutex so it can sleep when calling sp_alloc().

Looks make sense.

Acked-by: KOSAKI Motohiro 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix mmap overflow checking

2012-09-07 Thread KOSAKI Motohiro
On Tue, Sep 4, 2012 at 5:23 AM, Wanlong Gao  wrote:
> POSIX said that if the file is a regular file and the value of "off"
> plus "len" exceeds the offset maximum established in the open file
> description associated with fildes, mmap should return EOVERFLOW.
>
> The following test from LTP can reproduce this bug.
>
> char tmpfname[256];
> void *pa = NULL;
> void *addr = NULL;
> size_t len;
> int flag;
> int fd;
> off_t off = 0;
> int prot;
>
> long page_size = sysconf(_SC_PAGE_SIZE);
>
> snprintf(tmpfname, sizeof(tmpfname), "/tmp/mmap_test_%d", getpid());
> unlink(tmpfname);
> fd = open(tmpfname, O_CREAT | O_RDWR | O_EXCL, S_IRUSR | S_IWUSR);
> if (fd == -1) {
> printf(" Error at open(): %s\n", strerror(errno));
> return 1;
> }
> unlink(tmpfname);
>
> flag = MAP_SHARED;
> prot = PROT_READ | PROT_WRITE;
>
> /* len + off > maximum offset */
>
> len = ULONG_MAX;
> if (len % page_size) {
> /* Lower boundary */
> len &= ~(page_size - 1);
> }
>
> off = ULONG_MAX;
> if (off % page_size) {
> /* Lower boundary */
> off &= ~(page_size - 1);
> }
>
> printf("off: %lx, len: %lx\n", (unsigned long)off, (unsigned 
> long)len);
> pa = mmap(addr, len, prot, flag, fd, off);
> if (pa == MAP_FAILED && errno == EOVERFLOW) {
> printf("Test Pass: Error at mmap: %s\n", strerror(errno));
> return 0;
> }
>
> if (pa == MAP_FAILED)
> perror("Test FAIL: expect EOVERFLOW but get other error");
> else
> printf("Test FAIL : Expect EOVERFLOW but got no error\n");
>
> close(fd);
> munmap(pa, len);
> return 1;
>
> Cc: Andrew Morton 
> Cc: KOSAKI Motohiro 
> Cc: linux...@kvack.org (open list:MEMORY MANAGEMENT)
> Signed-off-by: Wanlong Gao 
> ---
>  mm/mmap.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index ae18a48..5380764 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -980,6 +980,7 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned 
> long addr,
> struct mm_struct * mm = current->mm;
> struct inode *inode;
> vm_flags_t vm_flags;
> +   off_t off = pgoff << PAGE_SHIFT;

I've seen the exactly same patch from another fujitsu guys several
month ago. and as I pointed
out at that time, this line don't work when 32bit kernel + mmap2 syscall case.

Please don't think do_mmap_pgoff() is for mmap(2) specific and read a
past thread before resend
a patch.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/6][resend] mempolicy memory corruption fixlet

2012-08-06 Thread KOSAKI Motohiro
On 7/31/2012 8:33 AM, Josh Boyer wrote:
> On Mon, Jun 11, 2012 at 5:17 AM,   wrote:
>> From: KOSAKI Motohiro 
>>
>> Hi
>>
>> This is trivial fixes of mempolicy meory corruption issues. There
>> are independent patches each ather. and, they don't change userland
>> ABIs.
>>
>> Thanks.
>>
>> changes from v1: fix some typo of changelogs s.
>>
>> ---
>> KOSAKI Motohiro (6):
>>   Revert "mm: mempolicy: Let vma_merge and vma_split handle
>> vma->vm_policy linkages"
>>   mempolicy: Kill all mempolicy sharing
>>   mempolicy: fix a race in shared_policy_replace()
>>   mempolicy: fix refcount leak in mpol_set_shared_policy()
>>   mempolicy: fix a memory corruption by refcount imbalance in
>> alloc_pages_vma()
>>   MAINTAINERS: Added MEMPOLICY entry
>>
>>  MAINTAINERS|7 +++
>>  mm/mempolicy.c |  151 
>> 
>>  mm/shmem.c |9 ++--
>>  3 files changed, 120 insertions(+), 47 deletions(-)
> 
> I don't see these patches queued anywhere.  They aren't in linux-next,
> mmotm, or Linus' tree.  Did these get dropped?  Is the revert still
> needed?

Sorry. my fault. yes, it is needed. currently, Some LTP was fail since
Mel's "mm: mempolicy: Let vma_merge and vma_split handle vma->vm_policy 
linkages" patch.




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2 v5][resend] tmpfs: interleave the starting node of /dev/shmem

2012-07-25 Thread KOSAKI Motohiro
> Please, what's wrong with the patch below, to replace the current
> two or three?  I don't have real NUMA myself: does it work?
> If it doesn't work, can you see why not?

It works. It doesn't match my preference. but I don't want block your way.
this area is maintained you. please go ahead.

at least, inode bias is better than random.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.25-rc2-mm1 - boot hangs on ia64

2008-02-26 Thread KOSAKI Motohiro
Hi

Fujitsu machine can't boot too.
my bisect indicate git-sched.patch cause regression too.

Thanks.

> 25-rc2-mm1 is hanging early in boot on my HP ia64 numa platform.  I saw
> the "Strange hang on ia64 with CONFIG_PRINTK_TIME=y" thread on lkml:  
> 
>   http://marc.info/?t=12028839681&r=1&w=4
> 
> However, my config does not include PRINTK_TIME=y.  In fact, hang occurs
> with ia64 defconfig as well--right after the "Loading...initrd...done"
> message.  2.6.25-rc2 boots OK.
> 
> Bisecting the broken-out series appears to indict 'git-sched.patch'.  I
> went ahead and added Ingo's patch, discussed in the "strange hang"
> thread, even tho' I hadn't enabled printk timestamps.  No effect.
> 
> Anyone else seeing this?



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ofa-general] Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)

2008-02-26 Thread KOSAKI Motohiro
> > > > Can you change the spec?
> > >
> > > Not really. It will break all existing codes.
> > 
> > I meant as in eg. submit changes to MPI-3
>
> MPI spec tries to be backward compatible. And MPI-2 spec is 10 years
> old, but MPI-1 is still in a wider use. HPC is moving fast in terms of HW
> technology, but slow in terms of SW. Fortran is still hot there :)

Agreed.
many many people dislike incompatible specification change.

We should accept real world spec.


- kosaki


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC][PATCH] page reclaim throttle take2

2008-02-25 Thread KOSAKI Motohiro
Hi

this patch is page reclaim improvement.

o previous discussion:
http://marc.info/?l=linux-mm&m=120339997125985&w=2

o test method
  $ ./hackbench 120 process 1000

o test result (average of 5 times measure)

limit   hackbench sys-time major-fault   max-spent-time 
time(s)   (s)in shrink_zone()
 (jiffies)

3   42.06 378.70   5336  6306


o reason why restrict parallel reclaim 3 task per zone

we tested various parameter.
  - restrict 1 is best major fault.
but worst max spent time.
  - restrict 3 is best max spent reclaim time and hackbench result.

I think "restrict 3" cause most good experience.


limit  hackbench sys-time major-fault   max-spent-time 
   time(s)   (s)in shrink_zone()
(jiffies)

1  48.50 283.89   3690  9057
2  44.43 350.94   5245  7159
3  42.06 378.70   5336  6306
4  48.84 401.87   5474  6669
unlimited  282.301248.47  29026  -



Please any comments!



Signed-off-by: KOSAKI Motohiro <[EMAIL PROTECTED]>
CC: KAMEZAWA Hiroyuki <[EMAIL PROTECTED]>
CC: Balbir Singh <[EMAIL PROTECTED]>
CC: Rik van Riel <[EMAIL PROTECTED]>
CC: Lee Schermerhorn <[EMAIL PROTECTED]>
CC: Nick Piggin <[EMAIL PROTECTED]>


---
 include/linux/mmzone.h |3 +
 mm/page_alloc.c|4 +
 mm/vmscan.c|  101 -
 3 files changed, 99 insertions(+), 9 deletions(-)

Index: b/include/linux/mmzone.h
===
--- a/include/linux/mmzone.h2008-02-25 21:37:49.0 +0900
+++ b/include/linux/mmzone.h2008-02-26 10:12:12.0 +0900
@@ -335,6 +335,9 @@ struct zone {
unsigned long   spanned_pages;  /* total size, including holes 
*/
unsigned long   present_pages;  /* amount of memory (excluding 
holes) */
 
+
+   atomic_tnr_reclaimers;
+   wait_queue_head_t   reclaim_throttle_waitq;
/*
 * rarely used fields:
 */
Index: b/mm/page_alloc.c
===
--- a/mm/page_alloc.c   2008-02-25 21:37:49.0 +0900
+++ b/mm/page_alloc.c   2008-02-26 10:12:12.0 +0900
@@ -3466,6 +3466,10 @@ static void __meminit free_area_init_cor
zone->nr_scan_inactive = 0;
zap_zone_vm_stats(zone);
zone->flags = 0;
+
+   zone->nr_reclaimers = ATOMIC_INIT(0);
+   init_waitqueue_head(&zone->reclaim_throttle_waitq);
+
if (!size)
continue;
 
Index: b/mm/vmscan.c
===
--- a/mm/vmscan.c   2008-02-25 21:37:49.0 +0900
+++ b/mm/vmscan.c   2008-02-26 10:59:38.0 +0900
@@ -1252,6 +1252,55 @@ static unsigned long shrink_zone(int pri
return nr_reclaimed;
 }
 
+
+#define RECLAIM_LIMIT (3)
+
+static int do_shrink_zone_throttled(int priority, struct zone *zone,
+   struct scan_control *sc,
+   unsigned long *ret_reclaimed)
+{
+   u64 start_time;
+   int ret = 0;
+
+   start_time = jiffies_64;
+
+   wait_event(zone->reclaim_throttle_waitq,
+  atomic_add_unless(&zone->nr_reclaimers, 1, RECLAIM_LIMIT));
+
+   /* more reclaim until needed? */
+   if (scan_global_lru(sc) &&
+   !(current->flags & PF_KSWAPD) &&
+   time_after64(jiffies, start_time + HZ/10)) {
+   if (zone_watermark_ok(zone, sc->order, 4*zone->pages_high,
+ MAX_NR_ZONES-1, 0)) {
+   ret = -EAGAIN;
+   goto out;
+   }
+   }
+
+   *ret_reclaimed += shrink_zone(priority, zone, sc);
+
+out:
+   atomic_dec(&zone->nr_reclaimers);
+   wake_up_all(&zone->reclaim_throttle_waitq);
+
+   return ret;
+}
+
+static unsigned long shrink_zone_throttled(int priority, struct zone *zone,
+  struct scan_control *sc)
+{
+   unsigned long nr_reclaimed = 0;
+   int ret;
+
+   ret = do_shrink_zone_throttled(priority, zone, sc, &nr_reclaimed);
+
+   if (ret == -EAGAIN)
+   nr_reclaimed = 1;
+
+   return nr_reclaimed;
+}
+
 /*
  * This is the direct reclaim path, for page-allocating processes.  We only
  * try to reclaim pages from zon

Re: Tiny cpusets -- cpusets for small systems?

2008-02-24 Thread KOSAKI Motohiro
Hi Pual

> Looking at some IA64 sn2 config builds I have laying about, I see the
> following text sizes for a couple of versions, showing the growth of
> the cpuset/cgroup apparatus over time:
> 
>   25933   2.6.18-rc3-mm1/kernel/cpuset.o (Aug 2006)
> vs.
>   37823   2.6.25-rc2-mm1/kernel/cgroup.o (Feb 2008)
>   19558   2.6.25-rc2-mm1/kernel/cpuset.o
> 
> So the total has grown from 25933 to 57381 text bytes (note that
> this is IA64 arch; most arch's will have proportionately smaller
> text sizes.)

hm, interesting.
but unfortunately the cpuset have more than depend.(i.e. CONFIG_SMP)

To more bad thing, some embedded cpu have poor or no atomic instruction
support.
at that, turn on CONFIG_SMP become large performace regression ;)


I am not already embedded engineer.
thus, I might have made a mistake.
(BTW: I am large server engineer now)

but no thinking dependency is wrong, may be.


Pavel, what do you think it?

- kosaki


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2] ResCounter: Use read_uint in memory controller

2008-02-23 Thread KOSAKI Motohiro
Hi Andrew,

> yup, I agree.  Even though I don't know what ILP32 and LP64 are ;)

ILP32: integer and long and pointer size is 32bit
LP64:  long and pointer size is 64bit, but int size is 32bit

linux 32bit kernel obey ILP32 model, 64bit kernel obey LP64.

Thanks.


- kosaki

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Document huge memory/cache overhead of memory controller in Kconfig

2008-02-21 Thread KOSAKI Motohiro
Hi

> > I think one reason of many people easy confusion is caused by bad menu
> > hierarchy.
> > I popose mem-cgroup move to child of cgroup and resource counter
> > (= obey denend on).
> 
> > +config CGROUP_MEM_CONT
> > +   bool "Memory controller for cgroups"
> 
> Memory _resource_ controller for cgroups?

Ahhh
my proposal only change menu hierarchy.
I don't know best name and i hope avoid rename discussion ;-)

Thanks.


- kosaki

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Document huge memory/cache overhead of memory controller in Kconfig

2008-02-21 Thread KOSAKI Motohiro
Hi

>  > >> For ordinary desktop people, memory controller is what developers
>  > >> know as MMU or sometimes even some other mysterious piece of silicon
>  > >> inside the heavy box.
>  > >
>  > >Actually I'd guess 'memory controller' == 'DRAM controller' == part of
>  > >northbridge that talks to DRAM.
>  >
>  > Yeah that must have been it when Windows says it found a new controller
>  > after changing the mainboard underneath.
>
>  Just for fun... this option really has to be renamed:

I think one reason of many people easy confusion is caused by bad menu
hierarchy.
I popose mem-cgroup move to child of cgroup and resource counter
(= obey denend on).

if you don't mind, please try to following patch.
may be, looks good than before.

---
 init/Kconfig |   52 ++--
 1 file changed, 26 insertions(+), 26 deletions(-)

Index: b/init/Kconfig
===
--- a/init/Kconfig  2008-02-17 16:44:46.0 +0900
+++ b/init/Kconfig  2008-02-21 23:33:51.0 +0900
@@ -311,6 +311,32 @@ config CPUSETS

  Say N if unsure.

+config PROC_PID_CPUSET
+   bool "Include legacy /proc//cpuset file"
+   depends on CPUSETS
+   default y
+
+config CGROUP_CPUACCT
+   bool "Simple CPU accounting cgroup subsystem"
+   depends on CGROUPS
+   help
+ Provides a simple Resource Controller for monitoring the
+ total CPU consumed by the tasks in a cgroup
+
+config RESOURCE_COUNTERS
+   bool "Resource counters"
+   help
+ This option enables controller independent resource accounting
+  infrastructure that works with cgroups
+   depends on CGROUPS
+
+config CGROUP_MEM_CONT
+   bool "Memory controller for cgroups"
+   depends on CGROUPS && RESOURCE_COUNTERS
+   help
+ Provides a memory controller that manages both page cache and
+ RSS memory.
+
 config GROUP_SCHED
bool "Group CPU scheduler"
default y
@@ -352,20 +378,6 @@ config CGROUP_SCHED

 endchoice

-config CGROUP_CPUACCT
-   bool "Simple CPU accounting cgroup subsystem"
-   depends on CGROUPS
-   help
- Provides a simple Resource Controller for monitoring the
- total CPU consumed by the tasks in a cgroup
-
-config RESOURCE_COUNTERS
-   bool "Resource counters"
-   help
- This option enables controller independent resource accounting
-  infrastructure that works with cgroups
-   depends on CGROUPS
-
 config SYSFS_DEPRECATED
bool "Create deprecated sysfs files"
depends on SYSFS
@@ -387,18 +399,6 @@ config SYSFS_DEPRECATED
  If you are using a distro that was released in 2006 or later,
  it should be safe to say N here.

-config CGROUP_MEM_CONT
-   bool "Memory controller for cgroups"
-   depends on CGROUPS && RESOURCE_COUNTERS
-   help
- Provides a memory controller that manages both page cache and
- RSS memory.
-
-config PROC_PID_CPUSET
-   bool "Include legacy /proc//cpuset file"
-   depends on CPUSETS
-   default y
-
 config RELAY
bool "Kernel->user space relay support (formerly relayfs)"
help
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH] the proposal of improve page reclaim by throttle

2008-02-21 Thread KOSAKI Motohiro
>  >  please repost question with change subject.
>  >  i don't know reason of vanilla kernel behavior, sorry.
>
>  Normally, embedded linux have only one zone(DMA).
>
>  If your patch isn't applied, several processes can reclaim memory in 
> parallel.
>  then, DMA zone's pages_scanned is suddenly increased largely. Because
>  embedded linux have no swap device,  kernel can't stop to scan lru
>  list until meeting page cache page. so if zone->pages_scanned is
>  greater six time than lru list pages, kernel make the zone with
>  unreclaimable state, As a result, OOM will kill it, too.

sorry, my last mail is easy confusious.
if you want discuss vanilla kernel bug, you shold post mail by another thread.
if not, your mail is only readed by few people.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH] the proposal of improve page reclaim by throttle

2008-02-21 Thread KOSAKI Motohiro
Hi balbir-san

>  It's good to keep the main reclaim code and the memory controller reclaim in
>  sync, so this is a nice effort.

thank you.
I will repost next version (fixed nick's opinion) while a few days.

>  > @@ -1456,7 +1501,7 @@ unsigned long try_to_free_mem_cgroup_pag
>  >   int target_zone = gfp_zone(GFP_HIGHUSER_MOVABLE);
>  >
>  >   zones = NODE_DATA(numa_node_id())->node_zonelists[target_zone].zones;
>  > - if (do_try_to_free_pages(zones, sc.gfp_mask, &sc))
>  > + if (try_to_free_pages_throttled(zones, 0, sc.gfp_mask, &sc))
>  >   return 1;
>  >   return 0;
>  >  }
>
>  try_to_free_pages_throttled checks for zone_watermark_ok(), that will not 
> work
>  in the case that we are reclaiming from a cgroup which over it's limit. We 
> need
>  a different check, to see if the mem_cgroup is still over it's limit or not.

That makes sense.

unfortunately, I don't know mem-cgroup so much.
What do i use function, instead?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH] the proposal of improve page reclaim by throttle

2008-02-21 Thread KOSAKI Motohiro
Hi Kim-san,

Thank you very much.
btw, what different between  and ?

>  It was a very interesting result.
>  In embedded system, your patch improve performance a little in case
>  without noswap(normal case in embedded system).
>  But, more important thing is OOM occured when I made 240 process
>  without swap device and vanilla kernel.
>  Then, I applied your patch, it worked very well without OOM.

Wow, it is very interesting result!
I am very happy.

>  I think that's why zone's page_scanned was six times greater than
>  number of lru pages.
>  At result, OOM happened.

please repost question with change subject.
i don't know reason of vanilla kernel behavior, sorry.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH] the proposal of improve page reclaim by throttle

2008-02-20 Thread KOSAKI Motohiro
Hi

> > >  * max parallel reclaim tasks:
> > >  *  max consumption time of
> > > try_to_free_pages():
> >
> > sorry, I inserted debug code to my patch at that time.
> 
> Could you send me that debug code ?
> If you will send it to me, I will test it my environment (ARM-920T, Core2Duo).
> And I will report test result.

attached it.
but it is very messy ;-)

usage:
./benchloop.sh

sample output
=
max reclaim 2
Running with 120*40 (== 4800) tasks.
Time: 34.177
14.17user 284.38system 1:43.85elapsed 287%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (3813major+148922minor)pagefaults 0swaps
max prepare time: 4599 0
max reclaim time: 2350 5781
total
8271
max reclaimer
4
max overkill
62131
max saved overkill
9740


max reclaimer represent to max parallel reclaim tasks.
total represetnto max consumption time of try_to_free_pages().

Thanks



reclaim-throttle-3.patch
Description: Binary data


benchloop.sh
Description: Binary data


Re: [RFC][PATCH] the proposal of improve page reclaim by throttle

2008-02-20 Thread KOSAKI Motohiro
Hi Kim-san

Do you adjust hackbench parameter?
my parameter adjust my test machine(8GB mem),
if unchanged, maybe doesn't works it because lack memory.

> I am a many interested in your patch. so I want to test it with exact
> same method as you did.
> I will test it in embedded environment(ARM 920T, 32M ram) and my
> desktop machine.(Core2Duo 2.2G, 2G ram)

Hm
I don't have embedded test machine.
but I can desktop.
I will test it about weekend.
if you don't mind, could you please send me .config file
and tell me your test kernel version?

Thanks, interesting report.


> I guess this patch won't be efficient in embedded environment.
> Since many embedded board just have one processor and don't have any
> swap device.

reclaim conflict rarely happened on UP.
thus, my patch expect no improvement.

but (of course) I will fix regression.

> So, How do I evaluate following field as you did ?
> 
>  * elapse (what do you mean it ??)
>  * major fault

/usr/bin/time command output that.


>  * max parallel reclaim tasks:
>  *  max consumption time of
> try_to_free_pages():

sorry, I inserted debug code to my patch at that time.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/8][for -mm] mem_notify v6

2008-02-19 Thread KOSAKI Motohiro
> Did those jobs share nodes -- sometimes two or more jobs using the same
> nodes?  I am sure SGI has such users too, though such job mixes make
> the runtimes of specific jobs less obvious, so customers are more
> tolerant of variations and some inefficiencies, as they get hidden in
> the mix.

Hm
our dedicated ndoe user set memory limit to machine physical memory
size (minus a bit).

I think don't have so much share/dedicate and watch user-defined/swap.
am i misundestand?


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/8][for -mm] mem_notify v6

2008-02-19 Thread KOSAKI Motohiro
Hi Rik

> > Sounds like a job for memory limits (ulimit?), not for OOM
> > notification, right?
> 
> I suspect one problem could be that an HPC job scheduling program
> does not know exactly how much memory each job can take, so it can
> sometimes end up making a mistake and overcommitting the memory on
> one HPC node.
> 
> In that case the user is better off having that job killed and
> restarted elsewhere, than having all of the jobs on that node
> crawl to a halt due to swapping.
> 
> Paul, is this guess correct? :)

Yes.
Fujitsu HPC middleware watching sum of memory consumption of the job
and, if over-consumption happened, kill process and remove job schedule.

I think that is common hpc requirement.
but we watching to user defined memory limit, not swap.

Thanks.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


cpuset trivial documentation fix s/N_MEMORY/N_HIGH_MEMORY/

2008-02-19 Thread KOSAKI Motohiro
Hi,

this is easy documentation fix.

current implementation of cpuset track N_HIGH_MEMORY instead N_MEMORY.
(N_MEMORY doesn't exist in current implementation)


Signed-off-by: KOSAKI Motohiro <[EMAIL PROTECTED]>
CC: Paul Jackson <[EMAIL PROTECTED]>
CC: Christoph Lameter <[EMAIL PROTECTED]>
CC: Paul Menage <[EMAIL PROTECTED]>

---
 Documentation/cpusets.txt |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: b/Documentation/cpusets.txt
===
--- a/Documentation/cpusets.txt 2008-02-14 13:42:22.0 +0900
+++ b/Documentation/cpusets.txt 2008-02-19 19:37:14.0 +0900
@@ -209,7 +209,7 @@ and name space for cpusets, with a minim
 The cpus and mems files in the root (top_cpuset) cpuset are
 read-only.  The cpus file automatically tracks the value of
 cpu_online_map using a CPU hotplug notifier, and the mems file
-automatically tracks the value of node_states[N_MEMORY]--i.e.,
+automatically tracks the value of node_states[N_HIGH_MEMORY]--i.e.,
 nodes with memory--using the cpuset_track_online_nodes() hook.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/8][for -mm] mem_notify v6

2008-02-18 Thread KOSAKI Motohiro
Hi Paul,

Thank you for wonderful interestings comment.
your comment is really nice.

I was HPC guy with large NUMA box at past. 
I promise i don't ignroe hpc user.
but unfortunately I didn't have experience of use CPUSET
because at that point, it was under development yet.

I hope discuss you that CPUSET usage case and mem_notify requirement.
to be honest, I thought hpc user doesn't use mem_notify, sorry.


> I have what seems, intuitively, a similar problem at the opposite
> end of the world, on big-honkin NUMA boxes (hundreds or thousands of
> CPUs, terabytes of main memory.)  The problem there is often best
> resolved if we can kill the offending task, rather than shrink its
> memory footprint.  The situation is that several compute intensive
> multi-threaded jobs are running, each in their own dedicated cpuset.

agreed.

> So we like to identify such jobs as soon as they begin to swap,
> and kill them very very quickly (before the direct reclaim code
> in mm/vmscan.c can push more than a few pages to the swap device.)

you think kill the process just after swap, right?
but unfortunately, almost user hope receive notification before swap ;-)
because avoid swap.

I think we need discuss this point more.


> For a much earlier, unsuccessful, attempt to accomplish this, see:
> 
>   [Patch] cpusets policy kill no swap
>   http://lkml.org/lkml/2005/3/19/148
> 
> Now, it may well be that we are too far apart to share any part of
> a solution; one seldom uses the same technology to build a Tour de
> France bicycle as one uses to build a Lockheed C-5A Galaxy heavy
> cargo transport.
> 
> One clear difference is the policy of what action we desire to take
> when under memory pressure: do we invite user space to free memory so
> as to avoid the wrath of the oom killer, or do we go to the opposite
> extreme, seeking a nearly instantant killing, faster than the oom
> killer can even begin its search for a victim.

Hmm, sorry
I understand your patch yet, because I don't know CPUSET so much.

I learn CPUSET more, about this week and I'll reply again about next week ;-)


> Another clear difference is the use of cpusets, which are a major and
> vital part of administering the big NUMA boxes, and I presume are not
> even compiled into embedded kernels (correct?).  This difference maybe
> unbridgeable ... these big NUMA systems require per-cpuset mechanisms,
> whereas embedded may require builds without cpusets.

Yes, some embedded distribution(i.e. monta vista) distribute as source.
but embedded people strongly dislike bloat code size.
I think they never turn on CPUSET.

I hope mem_notify works fine without CPUSET.


> 1) You have a little bit of code in the kernel to throttle the
>thundering herd problem.  Perhaps this could be moved to user space
>... one user daemon that is always notified of such memory pressure
>alarms, and in turn notifies interested applications.  This might
>avoid the need to add poll_wait_exclusive() to the kernel.  And it
>moves any fussy details of how to tame the thundering herd out of
>the kernel.

I think you talk about user space oom manager.
it and many user process are obviously different.

I doubt memory manager daemon model doesn't works on desktop and
typical server.
thus, current implementaion optimize to no manager environment.

of course, it doesn't mean i refuse add to code for oom manager.
it is very interesting idea.

i hope discussion it more.


> 2) Another possible mechanism for communicating events from
>the kernel to user space is inotify.  For example, I added
>the line:
> 
>   fsnotify_modify(dentry);   # dentry is current tasks cpuset

Excellent!
that is really good idea.

thaks.


> 3) Perhaps, instead of sending simple events, one could update
>a meter of the rate of recent such events, such as the per-cpuset
>'memory_pressure' mechanism does.  This might lead to addressing
>Andrew Morton's comment:
> 
>   If this feature is useful then I'd expect that some
>   applications would want notification at different times, or at
>   different levels of VM distress.  So this semi-randomly-chosen
>   notification point just won't be strong enough in real-world
>   use.

Hmmm, I don't think so.
I think timing of memmory_pressure_notify(1) is already best.

the page move active list to inactive list indicate swap I/O happen
a bit after.

but memmory_pressure_notify(0) is a bit messy.
I'll try to improve more simplify.


> 4) A place that I found well suited for my purposes (watching for
>swapping from direct reclaim) was just before the lines in the
>pageout() routine in mm/vmscan.c:
> 
>   if (clear_page_dirty_for_io(page)) {
>   ...
>   res = mapping->a_ops->writepage(page, &wbc);
> 
>It seemed that testing "PageAnon(page)" here allowed me to easily
>distinguish between dirty pages going back to the file system, and
>pages going to swap (this detail is

Re: [RFC][PATCH] the proposal of improve page reclaim by throttle

2008-02-18 Thread KOSAKI Motohiro
Hi Nick,

> Yeah this is definitely needed and a nice result.
> 
> I'm worried about a) placing a global limit on parallelism, and b)
> placing a limit on parallelism at all.

sorry, i don't understand yet.
a) and b) have any relation?

> 
> I think it should maybe be a per-zone thing...
> 
> What happens if you make it a per-zone mutex, and allow just a single
> process to reclaim pages from a given zone at a time? I guess that is
> going to slow down throughput a little bit in some cases though...

That makes sense.

OK.
I'll repost after 2-3 days.

Thanks.

- kosaki


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC][PATCH] the proposal of improve page reclaim by throttle

2008-02-18 Thread KOSAKI Motohiro

background

current VM implementation doesn't has limit of # of parallel reclaim.
when heavy workload, it bring to 2 bad things
  - heavy lock contention
  - unnecessary swap out

abount 2 month ago, KAMEZA Hiroyuki proposed the patch of page 
reclaim throttle and explain it improve reclaim time.
http://marc.info/?l=linux-mm&m=119667465917215&w=2

but unfortunately it works only memcgroup reclaim.
Today, I implement it again for support global reclaim and mesure it.


test machine, method and result
==

CPU:  IA64 x8
MEM:  8GB
SWAP: 2GB


got hackbench from
http://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c

$ /usr/bin/time hackbench 120 process 1000

this parameter mean consume all physical memory and 
1GB swap space on my test environment.



before:
hackbench result:   282.30
/usr/bin/time result
user:   14.16
sys:1248.47
elapse: 432.93
major fault:29026
max parallel reclaim tasks: 1298
max consumption time of
 try_to_free_pages():   70394 

after:
hackbench result:   30.36
/usr/bin/time result
user:   14.26
sys:294.44
elapse: 118.01
major fault:3064
max parallel reclaim tasks: 4
max consumption time of
 try_to_free_pages():   12234 


conclusion
=
this patch improve 3 things.
1. reduce unnecessary swap
   (see above major fault. about 90% reduced)
2. improve throughput performance
   (see above hackbench result. about 90% reduced)
3. improve interactive performance.
   (see above max consumption of try_to_free_pages.
about 80% reduced)
4. reduce lock contention.
   (see above sys time. about 80% reduced)


Now, we got about 1000% performance improvement of hackbench :)



foture works
==
 - more discussion with memory controller guys.



Signed-off-by: KOSAKI Motohiro <[EMAIL PROTECTED]>
CC: KAMEZAWA Hiroyuki <[EMAIL PROTECTED]>
CC: Balbir Singh <[EMAIL PROTECTED]>
CC: Rik van Riel <[EMAIL PROTECTED]>
CC: Lee Schermerhorn <[EMAIL PROTECTED]>

---
 include/linux/nodemask.h |1 
 mm/vmscan.c  |   49 +--
 2 files changed, 48 insertions(+), 2 deletions(-)

Index: b/include/linux/nodemask.h
===
--- a/include/linux/nodemask.h  2008-02-19 13:58:05.0 +0900
+++ b/include/linux/nodemask.h  2008-02-19 13:58:23.0 +0900
@@ -431,6 +431,7 @@ static inline int num_node_state(enum no
 
 #define num_online_nodes() num_node_state(N_ONLINE)
 #define num_possible_nodes()   num_node_state(N_POSSIBLE)
+#define num_highmem_nodes()num_node_state(N_HIGH_MEMORY)
 #define node_online(node)  node_state((node), N_ONLINE)
 #define node_possible(node)node_state((node), N_POSSIBLE)
 
Index: b/mm/vmscan.c
===
--- a/mm/vmscan.c   2008-02-19 13:58:05.0 +0900
+++ b/mm/vmscan.c   2008-02-19 14:04:06.0 +0900
@@ -127,6 +127,11 @@ long vm_total_pages;   /* The total number
 static LIST_HEAD(shrinker_list);
 static DECLARE_RWSEM(shrinker_rwsem);
 
+static atomic_t nr_reclaimers = ATOMIC_INIT(0);
+static DECLARE_WAIT_QUEUE_HEAD(reclaim_throttle_waitq);
+#define RECLAIM_LIMIT (2 * num_highmem_nodes())
+
+
 #ifdef CONFIG_CGROUP_MEM_CONT
 #define scan_global_lru(sc)(!(sc)->mem_cgroup)
 #else
@@ -1421,6 +1426,46 @@ out:
return ret;
 }
 
+static unsigned long try_to_free_pages_throttled(struct zone **zones,
+int order,
+gfp_t gfp_mask,
+struct scan_control *sc)
+{
+   unsigned long nr_reclaimed = 0;
+   unsigned long start_time;
+   int i;
+
+   start_time = jiffies;
+
+   wait_event(reclaim_throttle_waitq,
+  atomic_add_unless(&nr_reclaimers, 1, RECLAIM_LIMIT));
+
+   /* more reclaim until needed? */
+   if (unlikely(time_after(jiffies, start_time + HZ))) {
+   for (i = 0; zones[i] != NULL; i++) {
+   struct zone *zone = zones[i];
+   int classzone_idx = zone_idx(zones[0]);
+
+   if (!populated_zone(zone))
+   continue;
+
+   if (zone_watermark_ok(zone, order, 4*zone->pages_high,
+ 

Re: [RFC] bitmap onto and fold operators for mempolicy extensions

2008-02-16 Thread KOSAKI Motohiro
Hi Paul,

> > iff?
> > other portions looks good :)
> 
> "iff" -- "if and only if"
> 
> It is from my days, a long long time ago, as a student of mathematical
> logic and set theory.  I was too lazy to spell that one out.  My bad.

not at all.
I did't know the abbreviation, sorry.

I think your patch has no bug.
Thanks.


- kosaki


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] bitmap onto and fold operators for mempolicy extensions

2008-02-16 Thread KOSAKI Motohiro
Hi Paul,

> The bitmap_fold() operator folds a bitmap into a second that
> has bit m set iff the input bitmap has some bit n set, where
> m == n mod sz, for the specified sz value.

iff?
other portions looks good :)

Reviewed-by: KOSAKI Motohiro <[EMAIL PROTECTED]>

Thanks!
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] bitmap relative operator for mempolicy extensions

2008-02-14 Thread KOSAKI Motohiro
Hi Ray,

> >  > i prefer another name [!relative].
> >
> >  Any suggestions?
> >
> >  I'll give the name some thought myself.
> >  I like good names, and this is the right
> >  time to get this one right.
> 
> 'Relative map' implies a constant offset. What you have there is just
> a map as relmap could be sparse (which, btw, would have been nice to
> have in the example).
> 
> map_bitmap violates your naming convention, bitmap_map isn't all that
> clear, bitmap_remap is taken, and while it is one-to-one and onto, I
> think calling it bitmap_bijection would lose everyone except the
> mathematicians. bitmap_onto? bitmap_map_onto? bitmap_map_bitmap_onto?

Thanks for many idea.
I like bitmap_onto and/or bitmap_map_onto. :)


- kosaki

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] bitmap relative operator for mempolicy extensions

2008-02-14 Thread KOSAKI Motohiro
Hi Paul,

>  The following adds one more bitmap operator, with the usual
>  cpumask and nodemask wrappers.  This operator computes one
>  bitmap relative to another.  If the n-th bit in the origin
>  mask is set, then the m-th bit of the destination mask will
>  be set, where m is the position of the n-th set bit in the
>  relative mask.

this patch is very interesting.

BTW:
Are you think this function name must be "relative" ?
I think "relative" implies ordered set.
but linux bitmap is abstraction of unordered set.
if possible, i prefer another name.

end up, bitmap_relative is map as pecial case, i think.

>  This is initially to be used by the MPOL_F_RELATIVE_NODES
>  facility being considered for mm/mempolicy.c.

agreed with node_relative idea.
I think this is very useful.

Thanks!
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 4/8][for -mm] mem_notify v6: memory_pressure_notify() caller

2008-02-13 Thread KOSAKI Motohiro
Hi Andi,

> > to be honest, I don't think at mem-cgroup until now.
> 
> There is not only mem-cgroup BTW, but also NUMA node restrictons from
> NUMA memory policy. So this means a process might not be able to access
> all memory.

you are right.
good point out.

current implementation may cause wake up the no relate process of
memory shortage zone ;-)

but unfortunately, we can't know per zone rss.
(/proc/[pid]/numa_maps is very slow, we can't use it
 at memory shortage emergency)

I think we need develop per zone rss.
it become not only improve mem_notify, but also improve
oom killer of more intelligent process choice.

but it is a bit difficult. (at least for me ;-)
may be, I will implement it a bit later...


Thanks again!
your good opnion may improve my patch.


- kosaki


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: split up feature-removal-schedule.txt

2008-02-12 Thread KOSAKI Motohiro
Hi Greg

> In the big "linux-next" series of emails, David Miller suggested that
> the feature-removal-schedule file be broken up into little pieces, as it
> is causing merge problems for different trees.
> 
> This changeset does just that.  Turns out that this makes things more
> readable, as it's easier to look at a list of filenames for things than
> picking through a 300 line text file.

Excellent!
thank you for good job.


- kosaki


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 4/8][for -mm] mem_notify v6: memory_pressure_notify() caller

2008-02-12 Thread KOSAKI Motohiro
Hi Andrew

> > and, It is judged out of trouble at the fllowing situations.
> >  o memory pressure decrease and stop moves an anonymous page to the
> > inactive list.
> >  o free pages increase than (pages_high+lowmem_reserve)*2.
> 
> This seems rather arbitrary.  Why choose this stage in the page
> reclaimation process rather than some other stage?
> 
> If this feature is useful then I'd expect that some applications would want
> notification at different times, or at different levels of VM distress.  So
> this semi-randomly-chosen notification point just won't be strong enough in
> real-world use.

Hmmm
actually, This portion become code broat through some bug reports.

Yes, I think it again and implement it more simplefy.
Thanks!


> Does this change work correctly and appropriately for processes which are
> running in a cgroup memory controller?

nice point out.

to be honest, I don't think at mem-cgroup until now.
I will implement it at next post.

> Given the amount of code which these patches add, and the subsequent
> maintenance burden, and the unlikelihood of getting many applications to
> actually _use_ the interface, it is not obvious to me that inclusion in the
> kernel is justifiable, sorry.

OK.
I'll implement it again more simplefy.
Thanks.


> memory_pressure_notify() is far too large to be inlined.

OK.
I will fix it.

> Some of the patches were wordwrapped.

Agghh..
I will don't use gmail at next post.
sorry.


and,
I hope merge only poll_wait_exclusive() and wake_up_locked_nr()
if you don't mind.

this 2 portion anybody noclaim about 2 month.
and I think it is useful function by many people.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH for 2.6.24][regression fix] Mempolicy: silently restrict nodemask to allowed nodes V3

2008-02-12 Thread KOSAKI Motohiro
Hi

> > please ack.
> 
> As it's now post -rc1 and not a 100% obvious thing, I tend to hang onto
> such patches for a week or so before sending up to Linus

Thanks, really thanks.


> Should this be backported to 2.6.24.x?  If so, the reasons for such a
> relatively stern step should be spelled out in the changelog for the
> -stable maintiners to evaluate.

Oh,
you think below reason is not enough, really?

1. it is regression.
2. it is very easy reprodusable on memoryless node machine.


if so, i back down on my backport reclaim.
I don't hope increase your headache ;-)

thanks.

-kosaki


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH for 2.6.24][regression fix] Mempolicy: silently restrict nodemask to allowed nodes V3

2008-02-11 Thread KOSAKI Motohiro
Hi Andrew

# this is second post of the same patch.

this is backport from -mm to mainline.
original patch is http://marc.info/?l=linux-kernel&m=12025001182&w=2

my change is only line number change and remove extra space.
please ack.



[PATCH] 2.6.24 - mempolicy:  silently restrict nodemask to allowed nodes

V2 -> V3:
+ As suggested by Kosaki Motohito, fold the "contextualization"
  of policy nodemask into mpol_check_policy().  Looks a little
  cleaner. 

V1 -> V2:
+ Communicate whether or not incoming node mask was empty to
  mpol_check_policy() for better error checking.
+ As suggested by David Rientjes, remove the now unused
   cpuset_nodes_subset_current_mems_allowed() from cpuset.h

Kosaki Motohito noted that "numactl --interleave=all ..." failed in the
presence of memoryless nodes.  This patch attempts to fix that problem.

Some background:  

numactl --interleave=all calls set_mempolicy(2) with a fully
populated [out to MAXNUMNODES] nodemask.  set_mempolicy()
[in do_set_mempolicy()] calls contextualize_policy() which
requires that the nodemask be a subset of the current task's
mems_allowed; else EINVAL will be returned.  A task's
mems_allowed will always be a subset of node_states[N_HIGH_MEMORY]--
i.e., nodes with memory.  So, a fully populated nodemask will
be declared invalid if it includes memoryless nodes.

  NOTE:  the same thing will occur when running in a cpuset
 with restricted mem_allowed--for the same reason:
 node mask contains dis-allowed nodes.

mbind(2), on the other hand, just masks off any nodes in the 
nodemask that are not included in the caller's mems_allowed.

In each case [mbind() and set_mempolicy()], mpol_check_policy()
will complain [again, resulting in EINVAL] if the nodemask contains 
any memoryless nodes.  This is somewhat redundant as mpol_new() 
will remove memoryless nodes for interleave policy, as will 
bind_zonelist()--called by mpol_new() for BIND policy.

Proposed fix:

1) modify contextualize_policy logic to:
   a) remember whether the incoming node mask is empty.
   b) if not, restrict the nodemask to allowed nodes, as is
  currently done in-line for mbind().  This guarantees
  that the resulting mask includes only nodes with memory.

  NOTE:  this is a [benign, IMO] change in behavior for
 set_mempolicy().  Dis-allowed nodes will be
 silently ignored, rather than returning an error.

   c) fold this code into mpol_check_policy(), replace 2 calls to
  contextualize_policy() to call mpol_check_policy() directly
  and remove contextualize_policy().

2) In existing mpol_check_policy() logic, after "contextualization":
   a) MPOL_DEFAULT:  require that in coming mask "was_empty"
   b) MPOL_{BIND|INTERLEAVE}:  require that contextualized nodemask
  contains at least one node.
   c) add a case for MPOL_PREFERRED:  if in coming was not empty
  and resulting mask IS empty, user specified invalid nodes.
  Return EINVAL.
   c) remove the now redundant check for memoryless nodes

3) remove the now redundant masking of policy nodes for interleave
   policy from mpol_new().

4) Now that mpol_check_policy() contextualizes the nodemask, remove
   the in-line nodes_and() from sys_mbind().  I believe that this
   restores mbind() to the behavior before the memoryless-nodes
   patch series.  E.g., we'll no longer treat an invalid nodemask
   with MPOL_PREFERRED as local allocation.

Signed-off-by:  Lee Schermerhorn <[EMAIL PROTECTED]>
Signed-off-by:  KOSAKI Motohiro <[EMAIL PROTECTED]>
Tested-by:  KOSAKI Motohiro <[EMAIL PROTECTED]>
Acked-by:   David Rientjes <[EMAIL PROTECTED]>

 include/linux/cpuset.h |3 --
 mm/mempolicy.c |   61 -
 2 files changed, 36 insertions(+), 28 deletions(-)

Index: b/mm/mempolicy.c
===
--- a/mm/mempolicy.c2008-02-10 14:27:58.0 +0900
+++ b/mm/mempolicy.c2008-02-12 09:37:35.0 +0900
@@ -116,22 +116,51 @@ static void mpol_rebind_policy(struct me
 /* Do sanity checking on a policy */
 static int mpol_check_policy(int mode, nodemask_t *nodes)
 {
-   int empty = nodes_empty(*nodes);
+   int was_empty, is_empty;
+
+   if (!nodes)
+   return 0;
+
+   /*
+* "Contextualize" the in-coming nodemast for cpusets:
+* Remember whether in-coming nodemask was empty,  If not,
+* restrict the nodes to the allowed nodes in the cpuset.
+* This is guaranteed to be a subset of nodes with memory.
+*/
+   cpuset_update_task_memory_state();
+   is_empty = was_empty = nodes_empty(*nodes);
+   if (!was_empty) {
+   nodes_and(*nodes, *nodes, cpuset_current_mems_allowed);
+   is_empty = nodes_empty(*nodes); /* af

Re: [sample] mem_notify v6: usage example

2008-02-11 Thread KOSAKI Motohiro
Hi Andreas,

Thank you very good comment.

> Having such notification handled by glibc to free up unused malloc (or
> any heap allocations) would be very useful, because even if a program
> does "free" there is no guarantee the memory is returned to the kernel.

Yes, no guarantee.
but current glibc-malloc very frequently return memory to kernel.

glibc default behavior

1. over 1M memory: return memory just free(3)  called.
(you can change threshold by MALLOC_MMAP_MAX_ environment)
2. more lower: return memory when exist continuous 128k at heap tail.
(you can change threashold by MALLOC_TRIM_THRESHOLD_ environment)

if you know very memory consumption by already freed memory situation,
please tell me situation detail and consumption memory size.

> I think that having a generic reservation framework is too complex, but
> hiding the details of /dev/mem_notify from applications is desirable.
> A simple wrapper (possibly part of glibc) to return the poll fd, or set
> up the signal is enough.

Agreed.
if large consumption situation exist, I'm behind you.


Thanks!
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/4] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-11 Thread KOSAKI Motohiro
Hi David,

In general, I like this feature.
and I found no bug in patch [1/4] and [2/4].

> @@ -218,21 +167,27 @@ static struct mempolicy *mpol_new(enum mempolicy_mode 
> mode,
> return ERR_PTR(-ENOMEM);
> flags &= MPOL_MODE_FLAGS;
> atomic_set(&policy->refcnt, 1);
> +   cpuset_update_task_memory_state();
> +   nodes_and(cpuset_context_nmask, *nodes, cpuset_current_mems_allowed);
> switch (mode) {
> case MPOL_INTERLEAVE:
> -   policy->v.nodes = *nodes;
> +   if (nodes_empty(*nodes))
> +   return ERR_PTR(-EINVAL);

need kmem_cache_free(policy_cache, policy) before return?

> +   policy->v.nodes = cpuset_context_nmask;
> if (nodes_weight(policy->v.nodes) == 0) {
> kmem_cache_free(policy_cache, policy);
> return ERR_PTR(-EINVAL);
> }
> break;
> case MPOL_PREFERRED:
> -   policy->v.preferred_node = first_node(*nodes);
> +   policy->v.preferred_node = first_node(cpuset_context_nmask);
> if (policy->v.preferred_node >= MAX_NUMNODES)
> policy->v.preferred_node = -1;
> break;
> case MPOL_BIND:
> -   policy->v.zonelist = bind_zonelist(nodes);
> +   if (nodes_empty(*nodes))
> +   return ERR_PTR(-EINVAL);

ditto

Thanks!
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/8][for -mm] mem_notify v6

2008-02-11 Thread KOSAKI Motohiro
> > the Linux Today article is very nice description. (great works by Jake Edge)
> > http://www.linuxworld.com/news/2008/020508-kernel.html
>
> Just for future reference...the above-mentioned article is from LWN,
> syndicated onto LinuxWorld.  It has, so far as I know, never been near
> Linux Today.
>
> Glad you liked it, though :)

Oops, sorry.
I had serious misunderstand ;-)

sorry, again.
and thank you for your helpful message.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] proc: extend /proc//fdinfo/

2008-02-11 Thread KOSAKI Motohiro
Hi Eugene

In general, I think this patch isn't wrong idea.
but it shuld be brush up more, may be.

> kerndev: ~/code/kernel# cat /proc/`pgrep pickup`/fdinfo/6
> mode:   0622

I think this is inode attribute, but not fd attribute.

> dev:253,0
> ino:21463057

may be useful, agreed with you :)

> uid:89
> gid:89
> rdev:   0,0
> pos:0
> flags:  04002 FD_CLOEXEC # if close-on-exec flag is set

agreed with Miklos's opinion.

> path:   /var/spool/postfix/public/pickup

may be, we shold think namespace..

> You can also use this to find out more information about locked open files:

is your requirement only fd -> ino pick up?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.24-mm1] Mempolicy: silently restrict nodemask to allowed nodes V3

2008-02-09 Thread KOSAKI Motohiro
CC'd Greg KH <[EMAIL PROTECTED]>

I tested this patch on fujitsu memoryless node.
(2.6.24 + silently-restrict-nodemask-to-allowed-nodes-V3 insted 2.6.24-mm1)
it seems works good.

Tested-by: KOSAKI Motohiro <[EMAIL PROTECTED]>


Greg, I hope this patch merge to 2.6.24.x stable tree because
this patch is regression fixed patch.
Please tell me what do i doing for it.


[intentional full quote]

> Was "Re: [2.6.24 regression][BUGFIX] numactl --interleave=all doesn't
> works on memoryless node."
> 
> [Aside:  I noticed there were two slightly different distributions for
> this topic.  I've unified the distribution lists w/o dropping anyone, I
> think.  Apologies if you'd rather have been dropped...]
> 
> Here's V3 of the patch, accomodating Kosaki Motohiro's suggestion for
> folding contextualize_policy() into mpol_check_policy() [because my
> "was_empty" argument "was ugly" ;-)].  It does seem to clean up the
> code.
> 
> I'm still deferring David Rientjes' suggestion to fold
> mpol_check_policy() into mpol_new().  We need to sort out whether
> mempolicies specified for tmpfs and hugetlbfs mounts always need the
> same "contextualization" as user/application installed policies.  I
> don't want to hold up this bug fix for that discussion.  This is
> something Paul J will need to address with his cpuset/mempolicy rework,
> so we can sort it out in that context.
> 
> Again, tested with "numactl --interleave=all" and memtoy on ia64 using
> mem= command line argument to simulate memoryless node.
> 
> 
> Lee
> 
> 
> [PATCH] 2.6.24-mm1 - mempolicy:  silently restrict nodemask to allowed nodes
> 
> V2 -> V3:
> + As suggested by Kosaki Motohito, fold the "contextualization"
>   of policy nodemask into mpol_check_policy().  Looks a little
>   cleaner. 
> 
> V1 -> V2:
> + Communicate whether or not incoming node mask was empty to
>   mpol_check_policy() for better error checking.
> + As suggested by David Rientjes, remove the now unused
>cpuset_nodes_subset_current_mems_allowed() from cpuset.h
> 
> Kosaki Motohito noted that "numactl --interleave=all ..." failed in the
> presence of memoryless nodes.  This patch attempts to fix that problem.
> 
> Some background:  
> 
> numactl --interleave=all calls set_mempolicy(2) with a fully
> populated [out to MAXNUMNODES] nodemask.  set_mempolicy()
> [in do_set_mempolicy()] calls contextualize_policy() which
> requires that the nodemask be a subset of the current task's
> mems_allowed; else EINVAL will be returned.  A task's
> mems_allowed will always be a subset of node_states[N_HIGH_MEMORY]--
> i.e., nodes with memory.  So, a fully populated nodemask will
> be declared invalid if it includes memoryless nodes.
> 
>   NOTE:  the same thing will occur when running in a cpuset
>  with restricted mem_allowed--for the same reason:
>  node mask contains dis-allowed nodes.
> 
> mbind(2), on the other hand, just masks off any nodes in the 
> nodemask that are not included in the caller's mems_allowed.
> 
> In each case [mbind() and set_mempolicy()], mpol_check_policy()
> will complain [again, resulting in EINVAL] if the nodemask contains 
> any memoryless nodes.  This is somewhat redundant as mpol_new() 
> will remove memoryless nodes for interleave policy, as will 
> bind_zonelist()--called by mpol_new() for BIND policy.
> 
> Proposed fix:
> 
> 1) modify contextualize_policy logic to:
>a) remember whether the incoming node mask is empty.
>b) if not, restrict the nodemask to allowed nodes, as is
>   currently done in-line for mbind().  This guarantees
>   that the resulting mask includes only nodes with memory.
> 
>   NOTE:  this is a [benign, IMO] change in behavior for
>  set_mempolicy().  Dis-allowed nodes will be
>  silently ignored, rather than returning an error.
> 
>c) fold this code into mpol_check_policy(), replace 2 calls to
>   contextualize_policy() to call mpol_check_policy() directly
>   and remove contextualize_policy().
> 
> 2) In existing mpol_check_policy() logic, after "contextualization":
>a) MPOL_DEFAULT:  require that in coming mask "was_empty"
>b) MPOL_{BIND|INTERLEAVE}:  require that contextualized nodemask
>   contains at least one node.
>c) add a case for MPOL_PREFERRED:  if in coming was not empty
>   and resulting mask IS empty, user specified invalid nodes.
>   Return EINVAL.
>c) remove the now redundant check for memoryless nodes
> 
> 3) remove the now redundant masking of policy n

Re: [PATCH 2.6.24-mm1] Mempolicy: silently restrict nodemask to allowed nodes V3

2008-02-09 Thread KOSAKI Motohiro
Hi Lee-san

looks good for me.
I'll test about the head of week and report it by another mail.

Thanks!

> Was "Re: [2.6.24 regression][BUGFIX] numactl --interleave=all doesn't
> works on memoryless node."
>
> [Aside:  I noticed there were two slightly different distributions for
> this topic.  I've unified the distribution lists w/o dropping anyone, I
> think.  Apologies if you'd rather have been dropped...]
>
> Here's V3 of the patch, accomodating Kosaki Motohiro's suggestion for
> folding contextualize_policy() into mpol_check_policy() [because my
> "was_empty" argument "was ugly" ;-)].  It does seem to clean up the
> code.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/8][for -mm] mem_notify v6

2008-02-09 Thread KOSAKI Motohiro
Hi Rik

> More importantly, all gtk+ programs, as well as most databases and other
> system daemons have a poll() loop as their main loop.

not only gtk+, may be all modern GUI program :)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [sample] mem_notify v6: usage example

2008-02-09 Thread KOSAKI Motohiro
Hi Jon

> This really needs to be triggered via a generic kernel event in the
> final version - I picture glibc having a reservation API and having
> generic support for freeing such reservations.

to be honest, I doubt idea of generic reservation framework.

end up, we hope drop the application cache, not also dataless memory.
but, automatically drop mechanism only able to drop dataless memory.

and, many application have own memory management subsystem.
I afraid to nobody use too complex framework.

What do you think it?
I hope see your API. please post it.

Thanks!
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/8][for -mm] mem_notify v6

2008-02-09 Thread KOSAKI Motohiro
Hi

> Interesting patch series (I am being yuppie and reading this thread
> from my iPhone on a treadmill at the gym - so further comments later).
> I think that this is broadly along the lines that I was thinking, but
> this should be an RFC only patch series for now.

sorry, I fixed at next post.


> Some initial questions:

Thank you.
welcome to any discussion.

> Where is the netlink interface? Polling an FD is so last century :)

to be honest, I don't know anyone use netlink and why hope receive
low memory notify by netlink.

poll() is old way, but it works good enough.

and, netlink have a bit weak point.
end up, netlink philosophy is read/write model.

I afraid to many low-mem message queued in netlink buffer
at under heavy pressure.
it cause degrade memory pressure.


> Still, it is good to start with some code - eventually we might just
> have a full reservation API created. Rik and I and others have bounced
> ideas around for a while and I hope we can pitch in. I will play with
> these patches later.

Great.
Welcome to any idea and any discussion.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[sample] mem_notify v6: usage example

2008-02-09 Thread KOSAKI Motohiro
this is usage example of /dev/mem_notify.

Daniel Spang create original version.
kosaki add fasync related code.


Signed-off-by: Daniel Spang <[EMAIL PROTECTED]>
Signed-off-by: KOSAKI Motohiro <[EMAIL PROTECTED]>

---
 Documentation/mem_notify.c |  120 +
 1 file changed, 120 insertions(+)

Index: b/Documentation/mem_notify.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ b/Documentation/mem_notify.c2008-02-10 00:44:00.0 +0900
@@ -0,0 +1,120 @@
+/*
+ * Allocate 10 MB each second. Exit on notification.
+ */
+
+#define _GNU_SOURCE
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+int count = 0;
+int size = 10;
+
+void *do_alloc()
+{
+for(;;) {
+int *buffer;
+buffer = mmap(NULL,  size*1024*1024,
+  PROT_READ | PROT_WRITE,
+  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+if (buffer == MAP_FAILED) {
+perror("mmap");
+exit(EXIT_FAILURE);
+}
+memset(buffer, 1 , size*1024*1024);
+
+printf("-");
+fflush(stdout);
+
+count++;
+sleep(1);
+}
+}
+
+int wait_for_notification(struct pollfd *pfd)
+{
+int ret;
+read(pfd->fd, 0, 0);
+ret = poll(pfd, 1, -1);  /* wake up when low memory */
+if (ret == -1 && errno != EINTR) {
+perror("poll");
+exit(EXIT_FAILURE);
+}
+return ret;
+}
+
+void do_free()
+{
+   int fd;
+   struct pollfd pfd;
+
+fd = open("/dev/mem_notify", O_RDONLY);
+if (fd == -1) {
+perror("open");
+exit(EXIT_FAILURE);
+}
+
+   pfd.fd = fd;
+pfd.events = POLLIN;
+for(;;)
+if (wait_for_notification(&pfd) > 0) {
+printf("\nGot notification, allocated %d MB\n",
+   size * count);
+exit(EXIT_SUCCESS);
+}
+}
+
+void do_free_signal()
+{
+   int fd;
+   int flags;
+
+fd = open("/dev/mem_notify", O_RDONLY);
+if (fd == -1) {
+perror("open");
+exit(EXIT_FAILURE);
+}
+
+   fcntl(fd, F_SETOWN, getpid());
+   fcntl(fd, F_SETSIG, SIGUSR1);
+
+   flags = fcntl(fd, F_GETFL);
+   fcntl(fd, F_SETFL, flags|FASYNC); /* when low memory, receive SIGUSR1 */
+
+   for(;;)
+   sleep(1);
+}
+
+
+void daniel_exit(int signo)
+{
+   printf("\nGot notification %d, allocated %d MB\n",
+  signo, size * count);
+   exit(EXIT_SUCCESS);
+
+}
+
+int main(int argc, char *argv[])
+{
+pthread_t allocator;
+
+   if(argc == 2 && (strcmp(argv[1], "-sig") == 0)) {
+   printf("run signal mode\n");
+   signal(SIGUSR1, daniel_exit);
+   pthread_create(&allocator, NULL, do_alloc, NULL);
+   do_free_signal();
+   } else {
+   printf("run poll mode\n");
+   pthread_create(&allocator, NULL, do_alloc, NULL);
+   do_free();
+   }
+   return 0;
+}
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 8/8][for -mm] mem_notify v6: support fasync feature

2008-02-09 Thread KOSAKI Motohiro
implement FASYNC capability to /dev/mem_notify.


fd = open("/dev/mem_notify", O_RDONLY);

fcntl(fd, F_SETOWN, getpid());
fcntl(fd, F_SETSIG, SIGUSR1);

flags = fcntl(fd, F_GETFL);
fcntl(fd, F_SETFL, flags|FASYNC);  /* when low memory, receive SIGUSR1 
*/



ChangeLog
v5 -> v6:
   o rewrite usage example
   o cleanups number of wakeup tasks calculation.   

v5:  new

Signed-off-by: KOSAKI Motohiro <[EMAIL PROTECTED]>

---
 mm/mem_notify.c |  109 +---
 1 file changed, 104 insertions(+), 5 deletions(-)

Index: b/mm/mem_notify.c
===
--- a/mm/mem_notify.c   2008-02-03 20:37:25.0 +0900
+++ b/mm/mem_notify.c   2008-02-03 20:48:04.0 +0900
@@ -24,18 +24,58 @@
 #define MAX_WAKEUP_TASKS (100)

 struct mem_notify_file_info {
-   unsigned long last_proc_notify;
+   unsigned long last_proc_notify;
+   struct file  *file;
+
+   /* for fasync */
+   struct list_head  fa_list;
+   int   fa_fd;
 };

 static DECLARE_WAIT_QUEUE_HEAD(mem_wait);
 static atomic_long_t nr_under_memory_pressure_zones = ATOMIC_LONG_INIT(0);
 static atomic_t nr_watcher_task = ATOMIC_INIT(0);
+static LIST_HEAD(mem_notify_fasync_list);
+static DEFINE_SPINLOCK(mem_notify_fasync_lock);
+static atomic_t nr_fasync_task = ATOMIC_INIT(0);

 atomic_long_t last_mem_notify = ATOMIC_LONG_INIT(INITIAL_JIFFIES);

+static void mem_notify_kill_fasync_nr(int nr)
+{
+   struct mem_notify_file_info *iter, *saved_iter;
+   LIST_HEAD(l_fired);
+
+   if (!nr)
+   return;
+
+   spin_lock(&mem_notify_fasync_lock);
+
+   list_for_each_entry_safe_reverse(iter, saved_iter,
+&mem_notify_fasync_list,
+fa_list) {
+   struct fown_struct *fown;
+
+   fown = &iter->file->f_owner;
+   send_sigio(fown, iter->fa_fd, POLL_IN);
+
+   list_del(&iter->fa_list);
+   list_add(&iter->fa_list, &l_fired);
+   if (!--nr)
+   break;
+   }
+
+   /* rotate moving for FIFO wakeup */
+   list_splice(&l_fired, &mem_notify_fasync_list);
+
+   spin_unlock(&mem_notify_fasync_lock);
+}
+
 void __memory_pressure_notify(struct zone *zone, int pressure)
 {
int nr_wakeup;
+   int nr_poll_wakeup = 0;
+   int nr_fasync_wakeup = 0;
int flags;

spin_lock_irqsave(&mem_wait.lock, flags);
@@ -48,6 +88,8 @@ void __memory_pressure_notify(struct zon

if (pressure) {
int nr_watcher = atomic_read(&nr_watcher_task);
+   int nr_fasync_wait_tasks = atomic_read(&nr_fasync_task);
+   int nr_poll_wait_tasks = nr_watcher - nr_fasync_wait_tasks;

atomic_long_set(&last_mem_notify, jiffies);
if (!nr_watcher)
@@ -57,10 +99,27 @@ void __memory_pressure_notify(struct zon
if (unlikely(nr_wakeup > MAX_WAKEUP_TASKS))
nr_wakeup = MAX_WAKEUP_TASKS;

-   wake_up_locked_nr(&mem_wait, nr_wakeup);
+   /*   nr_wakeup
+  nr_fasync_wakeup = nr_fasync_wait_taks x 
+nr_watcher
+   */
+   nr_fasync_wakeup = DIV_ROUND_UP(nr_fasync_wait_tasks *
+   nr_wakeup, nr_watcher);
+   if (unlikely(nr_fasync_wakeup > nr_fasync_wait_tasks))
+   nr_fasync_wakeup = nr_fasync_wait_tasks;
+
+   nr_poll_wakeup = DIV_ROUND_UP(nr_poll_wait_tasks *
+ nr_wakeup, nr_watcher);
+   if (unlikely(nr_poll_wakeup > nr_poll_wait_tasks))
+   nr_poll_wakeup = nr_poll_wait_tasks;
+
+   wake_up_locked_nr(&mem_wait, nr_poll_wakeup);
}
 out:
spin_unlock_irqrestore(&mem_wait.lock, flags);
+
+   if (nr_fasync_wakeup)
+   mem_notify_kill_fasync_nr(nr_fasync_wakeup);
 }

 static int mem_notify_open(struct inode *inode, struct file *file)
@@ -75,6 +134,9 @@ static int mem_notify_open(struct inode
}

info->last_proc_notify = INITIAL_JIFFIES;
+   INIT_LIST_HEAD(&info->fa_list);
+   info->file = file;
+   info->fa_fd = -1;
file->private_data = info;
atomic_inc(&nr_watcher_task);
 out:
@@ -83,7 +145,16 @@ out:

 static int mem_notify_release(struct inode *inode, struct file *file)
 {
-   kfree(file->private_data);
+   struct mem_notify_file_info *info = file->private_data;
+
+   spin_lock(&

[PATCH 7/8][for -mm] mem_notify v6: ignore very small zone for prevent incorrect low mem notify

2008-02-09 Thread KOSAKI Motohiro
on X86, ZONE_DMA is very very small.
it cause undesirable low mem notification.
It should ignored.

but on other some architecture, ZONE_DMA have 4GB.
4GB is large as it is not possible to ignored.

therefore, ignore or not is decided by zone size.

ChangeLog:
v5: new


Signed-off-by: KOSAKI Motohiro <[EMAIL PROTECTED]>

---
 include/linux/mem_notify.h |3 +++
 mm/page_alloc.c|6 +-
 2 files changed, 8 insertions(+), 1 deletion(-)

Index: b/include/linux/mem_notify.h
===
--- a/include/linux/mem_notify.h2008-01-23 22:06:04.0 +0900
+++ b/include/linux/mem_notify.h2008-01-23 22:08:02.0 +0900
@@ -22,6 +22,9 @@ static inline void memory_pressure_notif
unsigned long target;
unsigned long pages_high, pages_free, pages_reserve;

+   if (unlikely(zone->mem_notify_status == -1))
+   return;
+
if (pressure) {
target = atomic_long_read(&last_mem_notify) + MEM_NOTIFY_FREQ;
if (likely(time_before(jiffies, target)))
Index: b/mm/page_alloc.c
===
--- a/mm/page_alloc.c   2008-01-23 22:07:57.0 +0900
+++ b/mm/page_alloc.c   2008-01-23 22:08:02.0 +0900
@@ -3470,7 +3470,11 @@ static void __meminit free_area_init_cor
zone->zone_pgdat = pgdat;

zone->prev_priority = DEF_PRIORITY;
-   zone->mem_notify_status = 0;
+
+   if (zone->present_pages < (pgdat->node_present_pages / 10))
+   zone->mem_notify_status = -1;
+   else
+   zone->mem_notify_status = 0;

zone_pcp_init(zone);
INIT_LIST_HEAD(&zone->active_list);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 6/8][for -mm] mem_notify v6: (optional) fixed incorrect shrink_zone

2008-02-09 Thread KOSAKI Motohiro
on X86, ZONE_DMA is very very small.
It is often no used at all.

Unfortunately,
when NR_ACTIVE==0, NR_INACTIVE==0, shrink_zone() try to reclaim 1 page.
because

zone->nr_scan_active +=
(zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
^

it cause unnecessary low memory notify ;-)
I fixed it.

ChangeLog
v5: new


Signed-off-by: KOSAKI Motohiro <[EMAIL PROTECTED]>

---
 mm/vmscan.c |   21 -
 1 file changed, 16 insertions(+), 5 deletions(-)

Index: b/mm/vmscan.c
===
--- a/mm/vmscan.c   2008-02-03 20:27:53.0 +0900
+++ b/mm/vmscan.c   2008-02-03 20:33:13.0 +0900
@@ -947,7 +947,7 @@ static inline void note_zone_scanning_pr

 static inline int zone_is_near_oom(struct zone *zone)
 {
-   return zone->pages_scanned >= (zone_page_state(zone, NR_ACTIVE)
+   return zone->pages_scanned > (zone_page_state(zone, NR_ACTIVE)
+ zone_page_state(zone, NR_INACTIVE))*3;
 }

@@ -1196,18 +1196,29 @@ static unsigned long shrink_zone(int pri
unsigned long nr_inactive;
unsigned long nr_to_scan;
unsigned long nr_reclaimed = 0;
+   unsigned long tmp;
+   unsigned long zone_active;
+   unsigned long zone_inactive;

if (scan_global_lru(sc)) {
/*
 * Add one to nr_to_scan just to make sure that the kernel
 * will slowly sift through the active list.
 */
-   zone->nr_scan_active +=
-   (zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
+   zone_active = zone_page_state(zone, NR_ACTIVE);
+   tmp = (zone_active >> priority) + 1;
+   if (unlikely(tmp > zone_active))
+   tmp = zone_active;
+   zone->nr_scan_active += tmp;
nr_active = zone->nr_scan_active;
-   zone->nr_scan_inactive +=
-   (zone_page_state(zone, NR_INACTIVE) >> priority) + 1;
+
+   zone_inactive = zone_page_state(zone, NR_INACTIVE);
+   tmp = (zone_inactive >> priority) + 1;
+   if (unlikely(tmp > zone_inactive))
+   tmp = zone_inactive;
+   zone->nr_scan_inactive += tmp;
nr_inactive = zone->nr_scan_inactive;
+
if (nr_inactive >= sc->swap_cluster_max)
zone->nr_scan_inactive = 0;
else
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 5/8][for -mm] mem_notify v6: add new mem_notify field to /proc/zoneinfo

2008-02-09 Thread KOSAKI Motohiro
show new member of zone struct by /proc/zoneinfo.

ChangeLog:
v5: change display order to at last.


Signed-off-by: Marcelo Tosatti <[EMAIL PROTECTED]>
Signed-off-by: KOSAKI Motohiro <[EMAIL PROTECTED]>

---
 mm/vmstat.c |8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

Index: b/mm/vmstat.c
===
--- a/mm/vmstat.c   2008-01-23 22:06:05.0 +0900
+++ b/mm/vmstat.c   2008-01-23 22:08:00.0 +0900
@@ -795,10 +795,12 @@ static void zoneinfo_show_print(struct s
seq_printf(m,
   "\n  all_unreclaimable: %u"
   "\n  prev_priority: %i"
-  "\n  start_pfn: %lu",
-  zone_is_all_unreclaimable(zone),
+  "\n  start_pfn: %lu"
+  "\n  mem_notify_status: %i",
+  zone_is_all_unreclaimable(zone),
   zone->prev_priority,
-  zone->zone_start_pfn);
+  zone->zone_start_pfn,
+  zone->mem_notify_status);
seq_putc(m, '\n');
 }
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 4/8][for -mm] mem_notify v6: memory_pressure_notify() caller

2008-02-09 Thread KOSAKI Motohiro
the notification point to happen whenever the VM moves an
anonymous page to the inactive list - this is a pretty good indication
that there are unused anonymous pages present which will be very likely
swapped out soon.

and, It is judged out of trouble at the fllowing situations.
 o memory pressure decrease and stop moves an anonymous page to the
inactive list.
 o free pages increase than (pages_high+lowmem_reserve)*2.


ChangeLog:
v5: add out of trouble notify to exit of balance_pgdat().


Signed-off-by: Marcelo Tosatti <[EMAIL PROTECTED]>
Signed-off-by: KOSAKI Motohiro <[EMAIL PROTECTED]>

---
 mm/page_alloc.c |   12 
 mm/vmscan.c |   26 ++
 2 files changed, 38 insertions(+)

Index: b/mm/vmscan.c
===
--- a/mm/vmscan.c   2008-01-23 22:06:08.0 +0900
+++ b/mm/vmscan.c   2008-01-23 22:07:57.0 +0900
@@ -39,6 +39,7 @@
 #include 
 #include 
 #include 
+#include 

 #include 
 #include 
@@ -1089,10 +1090,14 @@ static void shrink_active_list(unsigned
struct page *page;
struct pagevec pvec;
int reclaim_mapped = 0;
+   bool inactivated_anon = 0;

if (sc->may_swap)
reclaim_mapped = calc_reclaim_mapped(sc, zone, priority);

+   if (!reclaim_mapped)
+   memory_pressure_notify(zone, 0);
+
lru_add_drain();
spin_lock_irq(&zone->lru_lock);
pgmoved = sc->isolate_pages(nr_pages, &l_hold, &pgscanned, sc->order,
@@ -1116,6 +1121,13 @@ static void shrink_active_list(unsigned
if (!reclaim_mapped ||
(total_swap_pages == 0 && PageAnon(page)) ||
page_referenced(page, 0, sc->mem_cgroup)) {
+   /* deal with the case where there is no
+* swap but an anonymous page would be
+* moved to the inactive list.
+*/
+   if (!total_swap_pages && reclaim_mapped &&
+   PageAnon(page))
+   inactivated_anon = 1;
list_add(&page->lru, &l_active);
continue;
}
@@ -1123,8 +1135,12 @@ static void shrink_active_list(unsigned
list_add(&page->lru, &l_active);
continue;
}
+   if (PageAnon(page))
+   inactivated_anon = 1;
list_add(&page->lru, &l_inactive);
}
+   if (inactivated_anon)
+   memory_pressure_notify(zone, 1);

pagevec_init(&pvec, 1);
pgmoved = 0;
@@ -1158,6 +1174,8 @@ static void shrink_active_list(unsigned
pagevec_strip(&pvec);
spin_lock_irq(&zone->lru_lock);
}
+   if (!reclaim_mapped)
+   memory_pressure_notify(zone, 0);

pgmoved = 0;
while (!list_empty(&l_active)) {
@@ -1659,6 +1677,14 @@ out:
goto loop_again;
}

+   for (i = pgdat->nr_zones - 1; i >= 0; i--) {
+   struct zone *zone = pgdat->node_zones + i;
+
+   if (!populated_zone(zone))
+   continue;
+   memory_pressure_notify(zone, 0);
+   }
+
return nr_reclaimed;
 }

Index: b/mm/page_alloc.c
===
--- a/mm/page_alloc.c   2008-01-23 22:06:08.0 +0900
+++ b/mm/page_alloc.c   2008-01-23 23:09:32.0 +0900
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include 

 #include 
 #include 
@@ -435,6 +436,8 @@ static inline void __free_one_page(struc
unsigned long page_idx;
int order_size = 1 << order;
int migratetype = get_pageblock_migratetype(page);
+   unsigned long prev_free;
+   unsigned long notify_threshold;

if (unlikely(PageCompound(page)))
destroy_compound_page(page, order);
@@ -444,6 +447,7 @@ static inline void __free_one_page(struc
VM_BUG_ON(page_idx & (order_size - 1));
VM_BUG_ON(bad_range(zone, page));

+   prev_free = zone_page_state(zone, NR_FREE_PAGES);
__mod_zone_page_state(zone, NR_FREE_PAGES, order_size);
while (order < MAX_ORDER-1) {
unsigned long combined_idx;
@@ -465,6 +469,14 @@ static inline void __free_one_page(struc
list_add(&page->lru,
&zone->free_area[order].free_list[migratetype]);
zone->free_area[order].nr_free++;
+
+   notify_threshold = (zone->pages_high +
+   zone->lowmem_reserve[MAX_NR_ZONES-1]) * 2;
+
+   if (unlikely((zone-&

[PATCH 3/8][for -mm] mem_notify v6: introduce /dev/mem_notify new device (the core of this patch series)

2008-02-09 Thread KOSAKI Motohiro
the core of this patch series.
add /dev/mem_notify device for notification low memory to user process.



fd = open("/dev/mem_notify", O_RDONLY);
if (fd < 0) {
exit(1);
}
pollfds.fd = fd;
pollfds.events = POLLIN;
pollfds.revents = 0;
err = poll(&pollfds, 1, -1); // wake up at low memory

...


ChangeLog
 v5 -> v6:
 o improve number of wakeup tasks fomula when task is a few.



Signed-off-by: Marcelo Tosatti <[EMAIL PROTECTED]>
Signed-off-by: KOSAKI Motohiro <[EMAIL PROTECTED]>

---
 Documentation/devices.txt  |1
 drivers/char/mem.c |5 +
 include/linux/mem_notify.h |   42 +++
 include/linux/mmzone.h |1
 mm/Makefile|2
 mm/mem_notify.c|  123 +
 mm/page_alloc.c|1
 7 files changed, 174 insertions(+), 1 deletion(-)

Index: b/drivers/char/mem.c
===
--- a/drivers/char/mem.c2008-02-03 20:59:43.0 +0900
+++ b/drivers/char/mem.c2008-02-03 21:00:24.0 +0900
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 

 #include 
 #include 
@@ -869,6 +870,9 @@ static int memory_open(struct inode * in
filp->f_op = &oldmem_fops;
break;
 #endif
+   case 13:
+   filp->f_op = &mem_notify_fops;
+   break;
default:
return -ENXIO;
}
@@ -901,6 +905,7 @@ static const struct {
 #ifdef CONFIG_CRASH_DUMP
{12,"oldmem",S_IRUSR | S_IWUSR | S_IRGRP, &oldmem_fops},
 #endif
+   {13, "mem_notify", S_IRUGO, &mem_notify_fops},
 };

 static struct class *mem_class;
Index: b/include/linux/mem_notify.h
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ b/include/linux/mem_notify.h2008-02-03 21:01:41.0 +0900
@@ -0,0 +1,42 @@
+/*
+ * Notify applications of memory pressure via /dev/mem_notify
+ *
+ * Copyright (C) 2008 Marcelo Tosatti <[EMAIL PROTECTED]>,
+ *KOSAKI Motohiro <[EMAIL PROTECTED]>
+ *
+ * Released under the GPL, see the file COPYING for details.
+ */
+
+#ifndef _LINUX_MEM_NOTIFY_H
+#define _LINUX_MEM_NOTIFY_H
+
+#define MEM_NOTIFY_FREQ (HZ/5)
+
+extern atomic_long_t last_mem_notify;
+extern struct file_operations mem_notify_fops;
+
+extern void __memory_pressure_notify(struct zone *zone, int pressure);
+
+static inline void memory_pressure_notify(struct zone *zone, int pressure)
+{
+   unsigned long target;
+   unsigned long pages_high, pages_free, pages_reserve;
+
+   if (pressure) {
+   target = atomic_long_read(&last_mem_notify) + MEM_NOTIFY_FREQ;
+   if (likely(time_before(jiffies, target)))
+   return;
+
+   pages_high = zone->pages_high;
+   pages_free = zone_page_state(zone, NR_FREE_PAGES);
+   pages_reserve = zone->lowmem_reserve[MAX_NR_ZONES-1];
+   if (unlikely(pages_free > (pages_high+pages_reserve)*2))
+   return;
+
+   } else if (likely(!zone->mem_notify_status))
+   return;
+
+   __memory_pressure_notify(zone, pressure);
+}
+
+#endif /* _LINUX_MEM_NOTIFY_H */
Index: b/include/linux/mmzone.h
===
--- a/include/linux/mmzone.h2008-02-03 20:59:43.0 +0900
+++ b/include/linux/mmzone.h2008-02-03 20:59:46.0 +0900
@@ -283,6 +283,7 @@ struct zone {
 */
int prev_priority;

+   int mem_notify_status;

ZONE_PADDING(_pad2_)
/* Rarely used or read-mostly fields */
Index: b/mm/Makefile
===
--- a/mm/Makefile   2008-02-03 20:59:43.0 +0900
+++ b/mm/Makefile   2008-02-03 20:59:46.0 +0900
@@ -11,7 +11,7 @@ obj-y := bootmem.o filemap.o mempool.o
   page_alloc.o page-writeback.o pdflush.o \
   readahead.o swap.o truncate.o vmscan.o \
   prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
-  page_isolation.o $(mmu-y)
+  page_isolation.o mem_notify.o $(mmu-y)

 obj-$(CONFIG_PROC_PAGE_MONITOR) += pagewalk.o
 obj-$(CONFIG_BOUNCE)   += bounce.o
Index: b/mm/mem_notify.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ b/mm/mem_notify.c   2008-02-03 21:02:30.0 +0900
@@ -0,0 +1,123 @@
+/*
+ * Notify applications of memory pressure via /dev/mem_notify
+ *
+ * Copyright (C) 2008 Marcelo Tosatti <[EMAIL PR

[PATCH 2/8][for -mm] mem_notify v6: introduce wake_up_locked_nr() new API

2008-02-09 Thread KOSAKI Motohiro
introduce new API wake_up_locked_nr() and wake_up_locked_all().
it it similar as wake_up_nr() and wake_up_all(), but it doesn't lock.

Signed-off-by: Marcelo Tosatti <[EMAIL PROTECTED]>
Signed-off-by: KOSAKI Motohiro <[EMAIL PROTECTED]>

---
 include/linux/wait.h |   12 
 kernel/sched.c   |5 +++--
 2 files changed, 11 insertions(+), 6 deletions(-)

Index: b/include/linux/wait.h
===
--- a/include/linux/wait.h  2008-02-03 20:27:54.0 +0900
+++ b/include/linux/wait.h  2008-02-03 20:32:12.0 +0900
@@ -142,7 +142,8 @@ static inline void __remove_wait_queue(w
 }

 void FASTCALL(__wake_up(wait_queue_head_t *q, unsigned int mode, int
nr, void *key));
-extern void FASTCALL(__wake_up_locked(wait_queue_head_t *q, unsigned
int mode));
+void FASTCALL(__wake_up_locked(wait_queue_head_t *q, unsigned int mode,
+  int nr, void *key));
 extern void FASTCALL(__wake_up_sync(wait_queue_head_t *q, unsigned
int mode, int nr));
 void FASTCALL(__wake_up_bit(wait_queue_head_t *, void *, int));
 int FASTCALL(__wait_on_bit(wait_queue_head_t *, struct wait_bit_queue
*, int (*)(void *), unsigned));
@@ -155,10 +156,13 @@ wait_queue_head_t *FASTCALL(bit_waitqueu
 #define wake_up(x) __wake_up(x, TASK_NORMAL, 1, NULL)
 #define wake_up_nr(x, nr)  __wake_up(x, TASK_NORMAL, nr, NULL)
 #define wake_up_all(x) __wake_up(x, TASK_NORMAL, 0, NULL)
-#define wake_up_locked(x)  __wake_up_locked((x), TASK_NORMAL)

-#define wake_up_interruptible(x)   __wake_up(x, TASK_INTERRUPTIBLE, 1, 
NULL)
-#define wake_up_interruptible_nr(x, nr)__wake_up(x,
TASK_INTERRUPTIBLE, nr, NULL)
+#define wake_up_locked(x)  __wake_up_locked((x), TASK_NORMAL, 1, 
NULL)
+#define wake_up_locked_nr(x, nr)__wake_up_locked((x),
TASK_NORMAL, nr, NULL)
+#define wake_up_locked_all(x)  __wake_up_locked((x),
TASK_NORMAL, 0, NULL)
+
+#define wake_up_interruptible(x)   __wake_up(x, TASK_INTERRUPTIBLE, 1, 
NULL)
+#define wake_up_interruptible_nr(x, nr) __wake_up(x,
TASK_INTERRUPTIBLE, nr, NULL)
 #define wake_up_interruptible_all(x)   __wake_up(x, TASK_INTERRUPTIBLE, 0, 
NULL)
 #define wake_up_interruptible_sync(x)  __wake_up_sync((x),
TASK_INTERRUPTIBLE, 1)

Index: b/kernel/sched.c
===
--- a/kernel/sched.c2008-02-03 20:27:54.0 +0900
+++ b/kernel/sched.c2008-02-03 20:29:09.0 +0900
@@ -4115,9 +4115,10 @@ EXPORT_SYMBOL(__wake_up);
 /*
  * Same as __wake_up but called with the spinlock in wait_queue_head_t held.
  */
-void __wake_up_locked(wait_queue_head_t *q, unsigned int mode)
+void __wake_up_locked(wait_queue_head_t *q, unsigned int mode,
+ int nr_exclusive, void *key)
 {
-   __wake_up_common(q, mode, 1, 0, NULL);
+   __wake_up_common(q, mode, nr_exclusive, 0, key);
 }

 /**
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/8][for -mm] mem_notify v6: introduce poll_wait_exclusive()

2008-02-09 Thread KOSAKI Motohiro
There are 2 way of adding item to wait_queue,
  1. add_wait_queue()
  2. add_wait_queue_exclusive()
and add_wait_queue_exclusive() is very useful API.

unforunately, poll_wait_exclusive() against poll_wait() doesn't exist.
it means there is no way that wake up only 1 process where polled.
wake_up() is wake up all sleeping process by poll_wait(), not 1 process.

this patch introduce poll_wait_exclusive() new API for allow wake up
only 1 process.


unsigned int kosaki_poll(struct file *file,
 struct poll_table_struct *wait)
{
poll_wait_exclusive(file, &kosaki_wait_queue, wait);
if (data_exist)
return POLLIN | POLLRDNORM;
return 0;
}


Signed-off-by: Marcelo Tosatti <[EMAIL PROTECTED]>
Signed-off-by: KOSAKI Motohiro <[EMAIL PROTECTED]>

---
 fs/eventpoll.c   |7 +--
 fs/select.c  |9 ++---
 include/linux/poll.h |   13 +++--
 3 files changed, 22 insertions(+), 7 deletions(-)



Index: b/fs/eventpoll.c
===
--- a/fs/eventpoll.c2008-01-23 19:22:19.0 +0900
+++ b/fs/eventpoll.c2008-01-23 21:11:56.0 +0900
@@ -675,7 +675,7 @@ out_unlock:
  * target file wakeup lists.
  */
 static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
-poll_table *pt)
+poll_table *pt, int exclusive)
 {
struct epitem *epi = ep_item_from_epqueue(pt);
struct eppoll_entry *pwq;
@@ -684,7 +684,10 @@ static void ep_ptable_queue_proc(struct
init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
pwq->whead = whead;
pwq->base = epi;
-   add_wait_queue(whead, &pwq->wait);
+   if (exclusive)
+   add_wait_queue_exclusive(whead, &pwq->wait);
+   else
+   add_wait_queue(whead, &pwq->wait);
list_add_tail(&pwq->llink, &epi->pwqlist);
epi->nwait++;
} else {
Index: b/fs/select.c
===
--- a/fs/select.c   2008-01-23 19:22:22.0 +0900
+++ b/fs/select.c   2008-01-23 21:11:56.0 +0900
@@ -48,7 +48,7 @@ struct poll_table_page {
  * poll table.
  */
 static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
-  poll_table *p);
+  poll_table *p, int exclusive);

 void poll_initwait(struct poll_wqueues *pwq)
 {
@@ -117,7 +117,7 @@ static struct poll_table_entry *poll_get

 /* Add a new entry */
 static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
-   poll_table *p)
+  poll_table *p, int exclusive)
 {
struct poll_table_entry *entry = poll_get_entry(p);
if (!entry)
@@ -126,7 +126,10 @@ static void __pollwait(struct file *filp
entry->filp = filp;
entry->wait_address = wait_address;
init_waitqueue_entry(&entry->wait, current);
-   add_wait_queue(wait_address, &entry->wait);
+   if (exclusive)
+   add_wait_queue_exclusive(wait_address, &entry->wait);
+   else
+   add_wait_queue(wait_address, &entry->wait);
 }

 #define FDS_IN(fds, n) (fds->in + n)
Index: b/include/linux/poll.h
===
--- a/include/linux/poll.h  2008-01-23 19:22:55.0 +0900
+++ b/include/linux/poll.h  2008-02-03 20:28:27.0 +0900
@@ -28,7 +28,8 @@ struct poll_table_struct;
 /*
  * structures and helpers for f_op->poll implementations
  */
-typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *,
struct poll_table_struct *);
+typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *,
+   struct poll_table_struct *, int);

 typedef struct poll_table_struct {
poll_queue_proc qproc;
@@ -37,7 +38,15 @@ typedef struct poll_table_struct {
 static inline void poll_wait(struct file * filp, wait_queue_head_t *
wait_address, poll_table *p)
 {
if (p && wait_address)
-   p->qproc(filp, wait_address, p);
+   p->qproc(filp, wait_address, p, 0);
+}
+
+static inline void poll_wait_exclusive(struct file *filp,
+  wait_queue_head_t *wait_address,
+  poll_table *p)
+{
+   if (p && wait_address)
+   p->qproc(filp, wait_address, p, 1);
 }

 static inline void init_poll_funcptr(poll_table *pt, poll_queue_proc qproc)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 0/8][for -mm] mem_notify v6

2008-02-09 Thread KOSAKI Motohiro
Hi

The /dev/mem_notify is low memory notification device.
it can avoid swappness and oom by cooperationg with the user process.

the Linux Today article is very nice description. (great works by Jake Edge)
http://www.linuxworld.com/news/2008/020508-kernel.html


When memory gets tight, it is quite possible that applications have memory
allocated—often caches for better performance—that they could free.
After all, it is generally better to lose some performance than to face the
consequences of being chosen by the OOM killer.
But, currently, there is no way for a process to know that the kernel is
feeling memory pressure.
The patch provides a way for interested programs to monitor the /dev/mem_notify
 file to be notified if memory starts to run low.



You need not be annoyed by OOM any longer :)
please any comments!

patch list
   [1/8] introduce poll_wait_exclusive() new API
   [2/8] introduce wake_up_locked_nr() new API
   [3/8] introduce /dev/mem_notify new device (the core of this
patch series)
   [4/8] memory_pressure_notify() caller
   [5/8] add new mem_notify field to /proc/zoneinfo
   [6/8] (optional) fixed incorrect shrink_zone
   [7/8] ignore very small zone for prevent incorrect low mem notify.
   [8/8] support fasync feature


related discussion:
--
 LKML OOM notifications requirement discussion

http://www.gossamer-threads.com/lists/linux/kernel/832802?nohighlight=1#832802
 OOM notifications patch [Marcelo Tosatti]
http://marc.info/?l=linux-kernel&m=119273914027743&w=2
 mem notifications v3 [Marcelo Tosatti]
http://marc.info/?l=linux-mm&m=119852828327044&w=2
 Thrashing notification patch  [Daniel Spang]
http://marc.info/?l=linux-mm&m=119427416315676&w=2
 mem notification v4
http://marc.info/?l=linux-mm&m=120035840523718&w=2
 mem notification v5
http://marc.info/?l=linux-mm&m=120114835421602&w=2

Changelog
-----
 v5 -> v6 (by KOSAKI Motohiro)
   o rebase to 2.6.24-mm1
   o fixed thundering herd guard formula.

 v4 -> v5 (by KOSAKI Motohiro)
   o rebase to 2.6.24-rc8-mm1
   o change display order of /proc/zoneinfo
   o ignore very small zone
   o support fcntl(F_SETFL, FASYNC)
   o fixed some trivial bugs.

 v3 -> v4 (by KOSAKI Motohiro)
   o rebase to 2.6.24-rc6-mm1
   o avoid wake up all.
   o add judgement point to __free_one_page().
   o add zone awareness.

 v2 -> v3 (by Marcelo Tosatti)
   o changes the notification point to happen whenever
 the VM moves an anonymous page to the inactive list.
   o implement notification rate limit.

 v1(oom notify) -> v2 (by Marcelo Tosatti)
   o name change
   o notify timing change from just swap thrashing to
 just before thrashing.
   o also works with swapless device.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: C++ in linux kernel

2008-02-07 Thread KOSAKI Motohiro
Hi

>  I am a kernel newbie.
>  I tried to insmod a C++ module containing classes, inheritance.
>  I am getting 'unresolved symbol' error when I use the 'new' keyword.
>  What could the problem be?

under using gcc, new operator use malloc by default.
but linux doesn't have malloc.

Could you create custom allocater?

>  What kind of runtime support is needed ( arm linux kernel)? Is a
> patch available for it?

may be nothing.

- kosaki


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [2.6.24 regression][BUGFIX] numactl --interleave=all doesn't works on memoryless node.

2008-02-07 Thread KOSAKI Motohiro
Hi Lee-san

Unfortunately, 2.6.24-mm1 can't boot on fujitsu machine.
(hmm, origin.patch cause regression to pci initialization ;-)

instead, I tested 2.6.24 + your patch.
it seem work good :)

Tested-by: KOSAKI Motohiro <[EMAIL PROTECTED]>

and, I have a bit comment.


>  /* Do sanity checking on a policy */
> -static int mpol_check_policy(int mode, nodemask_t *nodes)
> +static int mpol_check_policy(int mode, nodemask_t *nodes, int was_empty)

was_empty argument is a bit ugly.
Could we unify mpol_check_policy and contextualize_policy?
mpol_check_policy only called from contextualize_policy.

> - return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
> + return 0;

Could we N_POSSIBLE check?

I attached the patch for my idea explain.
on my test environment, your patch and mine works good both.

- kosaki


---
 mm/mempolicy.c |   47 +--
 1 file changed, 21 insertions(+), 26 deletions(-)

Index: b/mm/mempolicy.c
===
--- a/mm/mempolicy.c2008-02-07 17:19:09.0 +0900
+++ b/mm/mempolicy.c2008-02-07 17:24:28.0 +0900
@@ -114,9 +114,25 @@ static void mpol_rebind_policy(struct me
const nodemask_t *newmask);
 
 /* Do sanity checking on a policy */
-static int mpol_check_policy(int mode, nodemask_t *nodes, int was_empty)
+static int mpol_check_policy(int mode, nodemask_t *nodes)
 {
-   int is_empty = nodes_empty(*nodes);
+   int was_empty;
+   int is_empty;
+
+   if (!nodes)
+   return 0;
+
+   /*
+* Remember whether in coming nodemask was empty,  If not,
+* restrict the nodes to the allowed nodes in the cpuset.
+* This is guaranteed to be a subset of nodes with memory.
+*/
+   cpuset_update_task_memory_state();
+   was_empty = nodes_empty(*nodes);
+   if (!was_empty)
+   nodes_and(*nodes, *nodes, cpuset_current_mems_allowed);
+
+   is_empty = nodes_empty(*nodes);
 
switch (mode) {
case MPOL_DEFAULT:
@@ -144,7 +160,7 @@ static int mpol_check_policy(int mode, n
return -EINVAL;
break;
}
-   return 0;
+   return nodes_subset(*nodes, node_states[N_POSSIBLE]) ? 0 : -EINVAL;
 }
 
 /* Generate a custom zonelist for the BIND policy. */
@@ -432,27 +448,6 @@ static int mbind_range(struct vm_area_st
return err;
 }
 
-static int contextualize_policy(int mode, nodemask_t *nodes)
-{
-   int was_empty;
-
-   if (!nodes)
-   return 0;
-
-   /*
-* Remember whether in coming nodemask was empty,  If not,
-* restrict the nodes to the allowed nodes in the cpuset.
-* This is guaranteed to be a subset of nodes with memory.
-*/
-   cpuset_update_task_memory_state();
-   was_empty = nodes_empty(*nodes);
-   if (!was_empty)
-   nodes_and(*nodes, *nodes, cpuset_current_mems_allowed);
-
-   return mpol_check_policy(mode, nodes, was_empty);
-}
-
-
 /*
  * Update task->flags PF_MEMPOLICY bit: set iff non-default
  * mempolicy.  Allows more rapid checking of this (combined perhaps
@@ -488,7 +483,7 @@ static long do_set_mempolicy(int mode, n
 {
struct mempolicy *new;
 
-   if (contextualize_policy(mode, nodes))
+   if (mpol_check_policy(mode, nodes))
return -EINVAL;
new = mpol_new(mode, nodes);
if (IS_ERR(new))
@@ -817,7 +812,7 @@ static long do_mbind(unsigned long start
if (end == start)
return 0;
 
-   if (contextualize_policy(mode, nmask))
+   if (mpol_check_policy(mode, nmask))
return -EINVAL;
 
new = mpol_new(mode, nmask);


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 05/19] split LRU lists into anon & file sets

2008-02-06 Thread KOSAKI Motohiro
Hi Rik

Welcome back :)

> > I found number of scan pages calculation bug.
> 
> My latest version of get_scan_ratio() works differently, with the
> percentages always adding up to 100.  However, your patch gave me
> the inspiration to (hopefully) find the bug in my version of the
> code.

OK.


> > 2. wrong fraction omission
> > 
> > nr[l] = zone->nr_scan[l] * percent[file] / 100;
> > 
> > when percent is very small,
> > nr[l] become 0.
> 
> This is probably where the problem is.  Kind of.
> 
> I believe that the problem is that we scale nr[l] by the percentage,
> instead of scaling the amount we add to zone->nr_scan[l] by the
> percentage!

Aahh,
you are right.


> Index: linux-2.6.24-rc6-mm1/mm/vmscan.c
> ===
> --- linux-2.6.24-rc6-mm1.orig/mm/vmscan.c 2008-02-06 19:23:16.0 
> -0500
> +++ linux-2.6.24-rc6-mm1/mm/vmscan.c  2008-02-06 19:22:55.0 -0500
> @@ -1275,13 +1275,17 @@ static unsigned long shrink_zone(int pri
>   for_each_lru(l) {
>   if (scan_global_lru(sc)) {
>   int file = is_file_lru(l);
> + int scan;
>   /*
>* Add one to nr_to_scan just to make sure that the
> -  * kernel will slowly sift through the active list.
> +  * kernel will slowly sift through each list.
>*/
> - zone->nr_scan[l] += (zone_page_state(zone,
> - NR_INACTIVE_ANON + l) >> priority) + 1;
> - nr[l] = zone->nr_scan[l] * percent[file] / 100;
> + scan = zone_page_state(zone, NR_INACTIVE_ANON + l);
> + scan >>= priority;
> + scan = (scan * percent[file]) / 100;
> +
> + zone->nr_scan[l] += scan + 1;
> + nr[l] = zone->nr_scan[l];
>   if (nr[l] >= sc->swap_cluster_max)
>   zone->nr_scan[l] = 0;
>   else

looks good.
thank you clean up code.


- kosaki


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-mm1

2008-02-06 Thread KOSAKI Motohiro
> should be fixed by the commit below. (already upstream)
>
> Ingo

Oops, sorry ;-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-mm1

2008-02-06 Thread KOSAKI Motohiro
Hi Ingo,

> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24/2.6.24-mm1/
>
>
> - The x86 git tree has been dropped due to runtime failure on one of my test
>   machines

My PC display warning message at boot time.
Could I help your debugging?

--

end_request: I/O error, dev fd0, sector 0
Welcome to  CentOS release 5 (Final)
Press 'I' to enter interactive startup.
[ cut here ]
WARNING: at kernel/lockdep.c:2037 _spin_unlock_irq+0x20/0x23()
Modules linked in: usb_storage ahci ata_piix libata ehci_hcd ohci_hcd uhci_hcd
Pid: 1085, comm: hwclock Not tainted 2.6.24-mm1 #1
 [] warn_on_slowpath+0x40/0x4f
 [] __kernel_text_address+0x18/0x23
 [] dump_trace+0xb3/0xbd
 [] check_usage_forwards+0x16/0x35
 [] mark_lock+0x16d/0x389
 [] _spin_unlock_irq+0x20/0x23
 [] hpet_rtc_interrupt+0xe6/0x2a7
 [] handle_IRQ_event+0x13/0x3d
 [] handle_edge_irq+0xc2/0xff
 [] do_IRQ+0x6e/0x84
 [] common_interrupt+0x2e/0x34
 [] nv_probe+0x1281/0x13bc
 [] do_page_fault+0x15f/0x4ae
 [] do_page_fault+0x0/0x4ae
 [] error_code+0x72/0x78
 ===
---[ end trace 908d580e4399e39b ]---
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


<    1   2   3   4   5   >