Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections

2010-10-03 Thread Avi Kivity

 On 09/29/2010 02:37 PM, Greg KH wrote:

 Thankfully things like rpm, hald, and other miscellaneous commands scan
 that information.

  Really?  Why?  Why would rpm care about this?  hald is dead now so we
  don't need to worry about that anymore,

  That's not what compatibility means.  We can't just support
  latest-and-greatest userspace on latest-and-greatest kernels.

Oh, I know that, that's not what I was getting at at all here, sorry if
it came across that way.

I wanted to know so we could go fix programs that are mucking around in
these files, as odds are they shouldn't be doing that in the first
place.

Like rpm, why would it matter what the memory in the system looks like?



I see, thanks.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections

2010-09-30 Thread Nathan Fontenot
On 09/29/2010 02:28 PM, Robin Holt wrote:
 On Tue, Sep 28, 2010 at 01:17:33PM -0500, Nathan Fontenot wrote:
 On 09/28/2010 07:38 AM, Robin Holt wrote:
 I was tasked with looking at a slowdown in similar sized SGI machines
 booting x86_64.  Jack Steiner had already looked into the memory_dev_init.
 I was looking at link_mem_sections().

 I made a dramatic improvement on a 16TB machine in that function by
 merely caching the most recent memory section and checking to see if
 the next memory section happens to be the subsequent entry in the linked list
 of kobjects.

 That simple cache reduced the time for link_mem_sections from 1 hour 27
 minutes down to 46 seconds.

 Nice!


 I would like to propose we implement something along those lines also,
 but I am currently swamped.  I can probably get you a patch tomorrow
 afternoon that applies at the end of this set.

 Should this be done as a separate patch?  This patch set concentrates on
 updates to the memory code with the node updates only being done due to the
 memory changes.

 I think it's a good idea to do the caching and have no problem adding on to
 this patchset if no one else has any objections.
 
 I am sorry.  I had meant to include you on the Cc: list.  I just posted a
 set of patches (3 small patches) which implement the cache-most-recent bit
 I alluded to above.  Search for a subject of "Speed up link_mem_sections
 during boot" and you will find them.  I did add you to the Cc: list for
 the next time I end up sending the set.
 
 My next task is to implement an x86_64 SGI UV-specific chunk of code
 to memory_block_size_bytes().  Would you consider adding that to your
 patch set?  I expect to have that either later today or early tomorrow.
 

No problem. I'm putting together a new patch set with updates from all of
the comments now so go ahead and send it to me when you have it ready.

-Nathan


Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections

2010-09-30 Thread Robin Holt
On Wed, Sep 29, 2010 at 02:28:30PM -0500, Robin Holt wrote:
 On Tue, Sep 28, 2010 at 01:17:33PM -0500, Nathan Fontenot wrote:
...
 My next task is to implement an x86_64 SGI UV-specific chunk of code
 to memory_block_size_bytes().  Would you consider adding that to your
 patch set?  I expect to have that either later today or early tomorrow.

The patch is below.

I left things at a u32, but I would really like it if you changed to an
unsigned long and adjusted my patch for me.

Thanks,
Robin


Subject: [Patch] Implement memory_block_size_bytes for x86_64 when CONFIG_X86_UV


Nathan Fontenot has implemented a patch set for large memory configuration
systems which combines drivers/base/memory.c memory sections together
into memory blocks, with the default behavior left unchanged from the
current behavior.

In his patch set, he implements a memory_block_size_bytes() function
for PPC.  This is the equivalent patch for x86_64 when it has
CONFIG_X86_UV set.

Signed-off-by: Robin Holt h...@sgi.com
Signed-off-by: Jack Steiner stei...@sgi.com
To: Nathan Fontenot nf...@austin.ibm.com
Cc: Ingo Molnar mi...@elte.hu
Cc: Thomas Gleixner t...@linutronix.de
Cc: H. Peter Anvin h...@zytor.com
Cc: lkml linux-ker...@vger.kernel.org

---

 arch/x86/mm/init_64.c |   15 +++
 1 file changed, 15 insertions(+)

Index: memory_block/arch/x86/mm/init_64.c
===================================================================
--- memory_block.orig/arch/x86/mm/init_64.c 2010-09-29 14:46:50.711824616 -0500
+++ memory_block/arch/x86/mm/init_64.c  2010-09-29 14:46:55.683997672 -0500
@@ -50,6 +50,7 @@
 #include <asm/numa.h>
 #include <asm/cacheflush.h>
 #include <asm/init.h>
+#include <asm/uv/uv.h>
 #include <linux/bootmem.h>
 
 static unsigned long dma_reserve __initdata;
@@ -928,6 +929,20 @@ const char *arch_vma_name(struct vm_area
return NULL;
 }
 
+#ifdef CONFIG_X86_UV
+#define MIN_MEMORY_BLOCK_SIZE   (1 << SECTION_SIZE_BITS)
+
+u32 memory_block_size_bytes(void)
+{
+   if (is_uv_system()) {
+           printk("UV: memory block size 2GB\n");
+           return 2UL * 1024 * 1024 * 1024;
+   }
+   return MIN_MEMORY_BLOCK_SIZE;
+}
+#endif
+
+
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 /*
  * Initialise the sparsemem vmemmap using huge-pages at the PMD level.
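
For reference, here is a hedged sketch of what the hunk above might look like
after the u32 -> unsigned long change Robin asks Nathan to make earlier in this
mail; it is hypothetical and not part of the posted patch:

#ifdef CONFIG_X86_UV
/* Hypothetical unsigned long variant; otherwise mirrors the posted hunk. */
#define MIN_MEMORY_BLOCK_SIZE   (1UL << SECTION_SIZE_BITS)

unsigned long memory_block_size_bytes(void)
{
        if (is_uv_system()) {
                printk("UV: memory block size 2GB\n");
                return 2UL * 1024 * 1024 * 1024;
        }
        return MIN_MEMORY_BLOCK_SIZE;
}
#endif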


Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections

2010-09-29 Thread Avi Kivity

 On 09/29/2010 04:50 AM, Greg KH wrote:


  Because the old ABI creates 129,000+ entries inside
  /sys/devices/system/memory with their associated links from
  /sys/devices/system/node/node*/ back to those directory entries.

  Thankfully things like rpm, hald, and other miscellaneous commands scan
  that information.

Really?  Why?  Why would rpm care about this?  hald is dead now so we
don't need to worry about that anymore,


That's not what compatibility means.  We can't just support
latest-and-greatest userspace on latest-and-greatest kernels.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections

2010-09-29 Thread Greg KH
On Wed, Sep 29, 2010 at 10:32:34AM +0200, Avi Kivity wrote:
  On 09/29/2010 04:50 AM, Greg KH wrote:
 
   Because the old ABI creates 129,000+ entries inside
   /sys/devices/system/memory with their associated links from
   /sys/devices/system/node/node*/ back to those directory entries.
 
   Thankfully things like rpm, hald, and other miscellaneous commands scan
   that information.

 Really?  Why?  Why would rpm care about this?  hald is dead now so we
 don't need to worry about that anymore,

 That's not what compatibility means.  We can't just support
 latest-and-greatest userspace on latest-and-greatest kernels.

Oh, I know that, that's not what I was getting at at all here, sorry if
it came across that way.

I wanted to know so we could go fix programs that are mucking around in
these files, as odds are they shouldn't be doing that in the first
place.

Like rpm, why would it matter what the memory in the system looks like?

thanks,

greg k-h


Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections

2010-09-29 Thread Robin Holt
On Tue, Sep 28, 2010 at 01:17:33PM -0500, Nathan Fontenot wrote:
 On 09/28/2010 07:38 AM, Robin Holt wrote:
  I was tasked with looking at a slowdown in similar sized SGI machines
  booting x86_64.  Jack Steiner had already looked into the memory_dev_init.
  I was looking at link_mem_sections().
  
  I made a dramatic improvement on a 16TB machine in that function by
  merely caching the most recent memory section and checking to see if
  the next memory section happens to be the subsequent entry in the linked list
  of kobjects.
  
  That simple cache reduced the time for link_mem_sections from 1 hour 27
  minutes down to 46 seconds.
 
 Nice!
 
  
  I would like to propose we implement something along those lines also,
  but I am currently swamped.  I can probably get you a patch tomorrow
  afternoon that applies at the end of this set.
 
 Should this be done as a separate patch?  This patch set concentrates on
 updates to the memory code with the node updates only being done due to the
 memory changes.
 
 I think it's a good idea to do the caching and have no problem adding on to
 this patchset if no one else has any objections.

I am sorry.  I had meant to include you on the Cc: list.  I just posted a
set of patches (3 small patches) which implement the cache-most-recent bit
I alluded to above.  Search for a subject of "Speed up link_mem_sections
during boot" and you will find them.  I did add you to the Cc: list for
the next time I end up sending the set.

My next task is to implement an x86_64 SGI UV-specific chunk of code
to memory_block_size_bytes().  Would you consider adding that to your
patch set?  I expect to have that either later today or early tomorrow.

Robin


Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections

2010-09-28 Thread Robin Holt
I was tasked with looking at a slowdown in similar sized SGI machines
booting x86_64.  Jack Steiner had already looked into the memory_dev_init.
I was looking at link_mem_sections().

I made a dramatic improvement on a 16TB machine in that function by
merely caching the most recent memory section and checking to see if
the next memory section happens to be the subsequent entry in the linked list
of kobjects.

That simple cache reduced the time for link_mem_sections from 1 hour 27
minutes down to 46 seconds.
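
For anyone unfamiliar with the trick, a minimal, self-contained userspace
sketch of the cache-most-recent idea follows.  The names are illustrative
only; this is not the actual link_mem_sections() change, which is in the
separately posted patches:

#include <stdio.h>
#include <string.h>

/* One entry per memory section, kept on an unsorted singly linked list. */
struct node {
        const char *name;
        struct node *next;
};

static struct node *cache;      /* most recently found entry */

static struct node *find(struct node *head, const char *name)
{
        struct node *n;

        /* Fast path: the wanted entry is usually right after the last hit. */
        if (cache && cache->next && strcmp(cache->next->name, name) == 0)
                return cache = cache->next;

        /* Slow path: the full linear scan the unpatched code always does. */
        for (n = head; n; n = n->next)
                if (strcmp(n->name, name) == 0)
                        return cache = n;
        return NULL;
}

int main(void)
{
        struct node c = { "memory2", NULL };
        struct node b = { "memory1", &c };
        struct node a = { "memory0", &b };

        /* Sections are registered in order, so later lookups hit the fast path. */
        printf("%s\n", find(&a, "memory0")->name);
        printf("%s\n", find(&a, "memory1")->name);
        printf("%s\n", find(&a, "memory2")->name);
        return 0;
}

Because sections are linked in creation order, almost every lookup is
satisfied by the successor check, so the cost per lookup drops from a walk
of the whole list to a single comparison.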

I would like to propose we implement something along those lines also,
but I am currently swamped.  I can probably get you a patch tomorrow
afternoon that applies at the end of this set.

Thanks,
Robin

On Mon, Sep 27, 2010 at 02:09:31PM -0500, Nathan Fontenot wrote:
 This set of patches decouples the concept that a single memory
 section corresponds to a single directory in 
 /sys/devices/system/memory/.  On systems
 with large amounts of memory (1+ TB) there are performance issues
 related to creating the large number of sysfs directories.  For
 a powerpc machine with 1 TB of memory we are creating 63,000+
 directories.  This is resulting in boot times of around 45-50
 minutes for systems with 1 TB of memory and 8 hours for systems
 with 2 TB of memory.  With this patch set applied I am now seeing
 boot times of 5 minutes or less.
 
 The root of this issue is in sysfs directory creation. Every time
 a directory is created a string compare is done against all sibling
 directories to ensure we do not create duplicates.  The list of
 directory nodes in sysfs is kept as an unsorted list, which results
 in this operation taking progressively longer as more directories
 are created.
 
 The solution provided by this patch set is to allow a single
 directory in sysfs to span multiple memory sections.  This is
 controlled by an optional architecturally defined function
 memory_block_size_bytes().  The default definition of this
 routine returns a memory block size equal to the memory section
 size.  This maintains the current layout of sysfs memory
 directories, so the view presented to userspace remains the same as
 it is today.
 
 For architectures that define their own version of this routine,
 as is done for powerpc in this patchset, the view in userspace
 would change such that each memoryXXX directory would span
 multiple memory sections.  The number of sections spanned would
 depend on the value reported by memory_block_size_bytes.
 
 In both cases a new file 'end_phys_index' is created in each
 memoryXXX directory.  This file will contain the physical id
 of the last memory section covered by the sysfs directory.  For
 the default case, the value in 'end_phys_index' will be the same
 as in the existing 'phys_index' file.
 
 This version of the patch set includes an update to properly
 report block_size_bytes, phys_index, and end_phys_index.  Additionally,
 the patch that adds the end_phys_index sysfs file is now patch 5/8
 instead of being patch 2/8 as in the previous version of the patches.
 
 -Nathan Fontenot
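
To make the block/section arithmetic described in the quoted cover letter
concrete, here is a small sketch; the section and block sizes are assumed
values for illustration, not code from the patch set.  Each memoryXXX
directory covers memory_block_size_bytes() / section_size sections, and
end_phys_index is phys_index plus that count minus one; in the default
case the two files hold the same value.

#include <stdio.h>

#define SECTION_SIZE_BITS 24                    /* assume 16MB sections, as on powerpc */
#define SECTION_SIZE      (1UL << SECTION_SIZE_BITS)

/* Stand-in for the optional arch hook; pretend the arch chose 2GB blocks. */
static unsigned long memory_block_size_bytes(void)
{
        return 2UL * 1024 * 1024 * 1024;
}

int main(void)
{
        unsigned long sections_per_block = memory_block_size_bytes() / SECTION_SIZE;
        unsigned long phys_index = 0;           /* first section covered by some block */

        printf("sections per block: %lu\n", sections_per_block);   /* 128 */
        printf("phys_index:         %lu\n", phys_index);
        printf("end_phys_index:     %lu\n", phys_index + sections_per_block - 1);
        return 0;
}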


Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections

2010-09-28 Thread Avi Kivity

 On 09/27/2010 09:09 PM, Nathan Fontenot wrote:

This set of patches decouples the concept that a single memory
section corresponds to a single directory in
/sys/devices/system/memory/.  On systems
with large amounts of memory (1+ TB) there are performance issues
related to creating the large number of sysfs directories.  For
a powerpc machine with 1 TB of memory we are creating 63,000+
directories.  This is resulting in boot times of around 45-50
minutes for systems with 1 TB of memory and 8 hours for systems
with 2 TB of memory.  With this patch set applied I am now seeing
boot times of 5 minutes or less.

The root of this issue is in sysfs directory creation. Every time
a directory is created a string compare is done against all sibling
directories to ensure we do not create duplicates.  The list of
directory nodes in sysfs is kept as an unsorted list, which results
in this operation taking progressively longer as more directories
are created.

The solution provided by this patch set is to allow a single
directory in sysfs to span multiple memory sections.  This is
controlled by an optional architecturally defined function
memory_block_size_bytes().  The default definition of this
routine returns a memory block size equal to the memory section
size.  This maintains the current layout of sysfs memory
directories, so the view presented to userspace remains the same as
it is today.



Why not update sysfs directory creation to be fast, for example by using 
an rbtree instead of a linked list.  This fixes an implementation 
problem in the kernel instead of working around it and creating a new ABI.


New ABIs mean old tools won't work, and new tools need to understand 
both ABIs.
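
As a rough illustration of the direction suggested here, a userspace sketch
follows; it uses POSIX tsearch() rather than the kernel's rbtree API and is
not actual sysfs code.  Keeping sibling names in a sorted tree makes the
duplicate check on directory creation O(log n) instead of a linear walk of
an unsorted list:

#include <search.h>
#include <stdio.h>
#include <string.h>

static void *root;      /* sorted tree of sibling directory names */

static int cmp(const void *a, const void *b)
{
        return strcmp(a, b);
}

/* Returns 0 on success, -1 if a sibling with this name already exists. */
static int create_dir(const char *name)
{
        if (tfind(name, &root, cmp))
                return -1;                      /* duplicate name */
        tsearch(name, &root, cmp);              /* O(log n) insert in glibc */
        return 0;
}

int main(void)
{
        printf("%d\n", create_dir("memory0"));  /* 0  */
        printf("%d\n", create_dir("memory1"));  /* 0  */
        printf("%d\n", create_dir("memory0"));  /* -1 */
        return 0;
}

Today the same check is a strcmp() against every existing sibling, which is
why creating tens of thousands of memoryXXX directories gets slow.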


--
error compiling committee.c: too many arguments to function



Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections

2010-09-28 Thread Robin Holt
On Tue, Sep 28, 2010 at 02:44:40PM +0200, Avi Kivity wrote:
  On 09/27/2010 09:09 PM, Nathan Fontenot wrote:
 This set of patches decouples the concept that a single memory
 section corresponds to a single directory in
 /sys/devices/system/memory/.  On systems
 with large amounts of memory (1+ TB) there are performance issues
 related to creating the large number of sysfs directories.  For
 a powerpc machine with 1 TB of memory we are creating 63,000+
 directories.  This is resulting in boot times of around 45-50
 minutes for systems with 1 TB of memory and 8 hours for systems
 with 2 TB of memory.  With this patch set applied I am now seeing
 boot times of 5 minutes or less.
 
 The root of this issue is in sysfs directory creation. Every time
 a directory is created a string compare is done against all sibling
 directories to ensure we do not create duplicates.  The list of
 directory nodes in sysfs is kept as an unsorted list, which results
 in this operation taking progressively longer as more directories
 are created.
 
 The solution provided by this patch set is to allow a single
 directory in sysfs to span multiple memory sections.  This is
 controlled by an optional architecturally defined function
 memory_block_size_bytes().  The default definition of this
 routine returns a memory block size equal to the memory section
 size.  This maintains the current layout of sysfs memory
 directories, so the view presented to userspace remains the same as
 it is today.
 
 
 Why not update sysfs directory creation to be fast, for example by
 using an rbtree instead of a linked list.  This fixes an
 implementation problem in the kernel instead of working around it
 and creating a new ABI.

Because the old ABI creates 129,000+ entries inside
/sys/devices/system/memory with their associated links from
/sys/devices/system/node/node*/ back to those directory entries.

Thankfully things like rpm, hald, and other miscellaneous commands scan
that information.  On our 8 TB test machine, hald runs continuously
following boot for nearly an hour mostly scanning useless information
from /sys/.

Robin

 
 New ABIs mean old tools won't work, and new tools need to understand
 both ABIs.
 
 -- 
 error compiling committee.c: too many arguments to function
 


Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections

2010-09-28 Thread Dave Hansen
On Tue, 2010-09-28 at 14:44 +0200, Avi Kivity wrote:
 Why not update sysfs directory creation to be fast, for example by using 
 an rbtree instead of a linked list.  This fixes an implementation 
 problem in the kernel instead of working around it and creating a new ABI.
 
 New ABIs mean old tools won't work, and new tools need to understand 
 both ABIs.

Just to be clear _these_ patches do not change the existing ABI.

They do add a new ABI: the end_phys_index file.  But, it is completely
redundant at the moment.  It could be taken out of these patches.

That said, fixing the directory creation speed is probably a worthwhile
endeavor too.

-- Dave



Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections

2010-09-28 Thread Avi Kivity

 On 09/28/2010 05:12 PM, Robin Holt wrote:

  Why not update sysfs directory creation to be fast, for example by
  using an rbtree instead of a linked list.  This fixes an
  implementation problem in the kernel instead of working around it
  and creating a new ABI.

Because the old ABI creates 129,000+ entries inside
/sys/devices/system/memory with their associated links from
/sys/devices/system/node/node*/ back to those directory entries.

Thankfully things like rpm, hald, and other miscellaneous commands scan
that information.  On our 8 TB test machine, hald runs continuously
following boot for nearly an hour mostly scanning useless information
from /sys/.


I see - so the problem wasn't just kernel internal; the ABI itself was 
unsuitable.  Too bad this wasn't considered at the time it was added.


(129k entries / 1 hour = 35 entries/sec; not very impressive)

--
error compiling committee.c: too many arguments to function



Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections

2010-09-28 Thread Nathan Fontenot
On 09/28/2010 07:38 AM, Robin Holt wrote:
 I was tasked with looking at a slowdown in similar sized SGI machines
 booting x86_64.  Jack Steiner had already looked into the memory_dev_init.
 I was looking at link_mem_sections().
 
 I made a dramatic improvement on a 16TB machine in that function by
 merely caching the most recent memory section and checking to see if
 the next memory section happens to be the subsequent entry in the linked list
 of kobjects.
 
 That simple cache reduced the time for link_mem_sections from 1 hour 27
 minutes down to 46 seconds.

Nice!

 
 I would like to propose we implement something along those lines also,
 but I am currently swamped.  I can probably get you a patch tomorrow
 afternoon that applies at the end of this set.

Should this be done as a separate patch?  This patch set concentrates on
updates to the memory code with the node updates only being done due to the
memory changes.

I think it's a good idea to do the caching and have no problem adding on to
this patchset if no one else has any objections.

-Nathan

 
 Thanks,
 Robin
 
 On Mon, Sep 27, 2010 at 02:09:31PM -0500, Nathan Fontenot wrote:
 This set of patches decouples the concept that a single memory
 section corresponds to a single directory in 
 /sys/devices/system/memory/.  On systems
 with large amounts of memory (1+ TB) there are performance issues
 related to creating the large number of sysfs directories.  For
 a powerpc machine with 1 TB of memory we are creating 63,000+
 directories.  This is resulting in boot times of around 45-50
 minutes for systems with 1 TB of memory and 8 hours for systems
 with 2 TB of memory.  With this patch set applied I am now seeing
 boot times of 5 minutes or less.

 The root of this issue is in sysfs directory creation. Every time
 a directory is created a string compare is done against all sibling
 directories to ensure we do not create duplicates.  The list of
 directory nodes in sysfs is kept as an unsorted list, which results
 in this operation taking progressively longer as more directories
 are created.

 The solution provided by this patch set is to allow a single
 directory in sysfs to span multiple memory sections.  This is
 controlled by an optional architecturally defined function
 memory_block_size_bytes().  The default definition of this
 routine returns a memory block size equal to the memory section
 size.  This maintains the current layout of sysfs memory
 directories, so the view presented to userspace remains the same as
 it is today.

 For architectures that define their own version of this routine,
 as is done for powerpc in this patchset, the view in userspace
 would change such that each memoryXXX directory would span
 multiple memory sections.  The number of sections spanned would
 depend on the value reported by memory_block_size_bytes.

 In both cases a new file 'end_phys_index' is created in each
 memoryXXX directory.  This file will contain the physical id
 of the last memory section covered by the sysfs directory.  For
 the default case, the value in 'end_phys_index' will be the same
 as in the existing 'phys_index' file.

 This version of the patch set includes an update to properly
 report block_size_bytes, phys_index, and end_phys_index.  Additionally,
 the patch that adds the end_phys_index sysfs file is now patch 5/8
 instead of being patch 2/8 as in the previous version of the patches.

 -Nathan Fontenot



Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections

2010-09-28 Thread Greg KH
On Tue, Sep 28, 2010 at 10:12:18AM -0500, Robin Holt wrote:
 On Tue, Sep 28, 2010 at 02:44:40PM +0200, Avi Kivity wrote:
   On 09/27/2010 09:09 PM, Nathan Fontenot wrote:
  This set of patches decouples the concept that a single memory
  section corresponds to a single directory in
  /sys/devices/system/memory/.  On systems
  with large amounts of memory (1+ TB) there are performance issues
  related to creating the large number of sysfs directories.  For
  a powerpc machine with 1 TB of memory we are creating 63,000+
  directories.  This is resulting in boot times of around 45-50
  minutes for systems with 1 TB of memory and 8 hours for systems
  with 2 TB of memory.  With this patch set applied I am now seeing
  boot times of 5 minutes or less.
  
  The root of this issue is in sysfs directory creation. Every time
  a directory is created a string compare is done against all sibling
  directories to ensure we do not create duplicates.  The list of
  directory nodes in sysfs is kept as an unsorted list, which results
  in this operation taking progressively longer as more directories
  are created.
  
  The solution provided by this patch set is to allow a single
  directory in sysfs to span multiple memory sections.  This is
  controlled by an optional architecturally defined function
  memory_block_size_bytes().  The default definition of this
  routine returns a memory block size equal to the memory section
  size.  This maintains the current layout of sysfs memory
  directories, so the view presented to userspace remains the same as
  it is today.
  
  
  Why not update sysfs directory creation to be fast, for example by
  using an rbtree instead of a linked list.  This fixes an
  implementation problem in the kernel instead of working around it
  and creating a new ABI.
 
 Because the old ABI creates 129,000+ entries inside
 /sys/devices/system/memory with their associated links from
 /sys/devices/system/node/node*/ back to those directory entries.
 
 Thankfully things like rpm, hald, and other miscellaneous commands scan
 that information.

Really?  Why?  Why would rpm care about this?  hald is dead now so we
don't need to worry about that anymore, but what other commands/programs
read this information?

thanks,

greg k-h