Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections

2010-10-03 Thread Avi Kivity

 On 09/29/2010 02:37 PM, Greg KH wrote:

>>  >   Thankfully things like rpm, hald, and other miscellaneous commands scan
>>  >   that information.
>>
>>  Really?  Why?  Why would rpm care about this?  hald is dead now so we
>>  don't need to worry about that anymore,
>
>  That's not what compatibility means.  We can't just support
>  latest-and-greatest userspace on latest-and-greatest kernels.

Oh, I know that, that's not what I was getting at at all here, sorry if
it came across that way.

I wanted to know so we could go fix programs that are mucking around in
these files, as odds are, they shouldn't be doing that in the first
place.

Like rpm, why would it matter what the memory in the system looks like?



I see, thanks.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections

2010-09-30 Thread Robin Holt
On Wed, Sep 29, 2010 at 02:28:30PM -0500, Robin Holt wrote:
> On Tue, Sep 28, 2010 at 01:17:33PM -0500, Nathan Fontenot wrote:
...
> My next task is to implement an x86_64 SGI UV specific chunk of code
> for memory_block_size_bytes().  Would you consider adding that to your
> patch set?  I expect to have that either later today or early tomorrow.

The patch is below.

I left things at a u32, but I would really like it if you changed it to an
unsigned long and adjusted my patch for me.

Thanks,
Robin


Subject: [Patch] Implement memory_block_size_bytes for x86_64 when CONFIG_X86_UV


Nathan Fontenot has implemented a patch set for large memory configuration
systems which combines drivers/base/memory.c memory sections into
memory blocks, with the default behavior unchanged from the current
behavior.

In his patch set, he implements a memory_block_size_bytes() function
for PPC.  This is the equivalent patch for x86_64 when it has
CONFIG_X86_UV set.

Signed-off-by: Robin Holt 
Signed-off-by: Jack Steiner 
To: Nathan Fontenot 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: "H. Peter Anvin" 
Cc: lkml 

---

 arch/x86/mm/init_64.c |   15 +++
 1 file changed, 15 insertions(+)

Index: memory_block/arch/x86/mm/init_64.c
===
--- memory_block.orig/arch/x86/mm/init_64.c 2010-09-29 14:46:50.711824616 -0500
+++ memory_block/arch/x86/mm/init_64.c  2010-09-29 14:46:55.683997672 -0500
@@ -50,6 +50,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 static unsigned long dma_reserve __initdata;
@@ -928,6 +929,20 @@ const char *arch_vma_name(struct vm_area
return NULL;
 }
 
+#ifdef CONFIG_X86_UV
+#define MIN_MEMORY_BLOCK_SIZE   (1 << SECTION_SIZE_BITS)
+
+u32 memory_block_size_bytes(void)
+{
+   if (is_uv_system()) {
+   printk("UV: memory block size 2GB\n");
+   return 2UL * 1024 * 1024 * 1024;
+   }
+   return MIN_MEMORY_BLOCK_SIZE;
+}
+#endif
+
+
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 /*
  * Initialise the sparsemem vmemmap using huge-pages at the PMD level.
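
As an aside on the u32 versus unsigned long question above: 2 GB
(0x80000000) still fits in a u32, but any block size of 4 GB or more
would be truncated, which is presumably the motivation for widening the
return type.  A sketch of the widened variant, illustrative only and
not the final patch:

#ifdef CONFIG_X86_UV
/* Sketch only: the same helper widened to unsigned long so block
 * sizes of 4 GB and larger do not truncate. */
unsigned long memory_block_size_bytes(void)
{
        if (is_uv_system()) {
                printk(KERN_INFO "UV: memory block size 2GB\n");
                return 2UL << 30;       /* 2 GB */
        }
        return MIN_MEMORY_BLOCK_SIZE;
}
#endif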


Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections

2010-09-30 Thread Nathan Fontenot
On 09/29/2010 02:28 PM, Robin Holt wrote:
> On Tue, Sep 28, 2010 at 01:17:33PM -0500, Nathan Fontenot wrote:
>> On 09/28/2010 07:38 AM, Robin Holt wrote:
>>> I was tasked with looking at a slowdown in similar sized SGI machines
>>> booting x86_64.  Jack Steiner had already looked into the memory_dev_init.
>>> I was looking at link_mem_sections().
>>>
>>> I made a dramatic improvement on a 16TB machine in that function by
>>> merely caching the most recent memory section and checking to see if
>>> the next memory section happens to follow it in the linked list
>>> of kobjects.
>>>
>>> That simple cache reduced the time for link_mem_sections from 1 hour 27
>>> minutes down to 46 seconds.
>>
>> Nice!
>>
>>>
>>> I would like to propose we implement something along those lines also,
>>> but I am currently swamped.  I can probably get you a patch tomorrow
>>> afternoon that applies at the end of this set.
>>
>> Should this be done as a separate patch?  This patch set concentrates on
>> updates to the memory code with the node updates only being done due to the
>> memory changes.
>>
>> I think it's a good idea to do the caching and have no problem adding on to
>> this patchset if no one else has any objections.
> 
> I am sorry.  I had meant to include you on the Cc: list.  I just posted a
> set of patches (3 small patches) which implement the cache-most-recent bit
> I alluded to above.  Search for a subject of "Speed up link_mem_sections
> during boot" and you will find them.  I did add you to the Cc: list for
> the next time I end up sending the set.
> 
> My next task is to implement an x86_64 SGI UV specific chunk of code
> for memory_block_size_bytes().  Would you consider adding that to your
> patch set?  I expect to have that either later today or early tomorrow.
> 

No problem. I'm putting together a new patch set with updates from all of
the comments now, so go ahead and send it to me when you have it ready.

-Nathan


Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections

2010-09-29 Thread Robin Holt
On Tue, Sep 28, 2010 at 01:17:33PM -0500, Nathan Fontenot wrote:
> On 09/28/2010 07:38 AM, Robin Holt wrote:
> > I was tasked with looking at a slowdown in similar sized SGI machines
> > booting x86_64.  Jack Steiner had already looked into the memory_dev_init.
> > I was looking at link_mem_sections().
> > 
> > I made a dramatic improvement on a 16TB machine in that function by
> > merely caching the most recent memory section and checking to see if
> > the next memory section happens to follow it in the linked list
> > of kobjects.
> > 
> > That simple cache reduced the time for link_mem_sections from 1 hour 27
> > minutes down to 46 seconds.
> 
> Nice!
> 
> > 
> > I would like to propose we implement something along those lines also,
> > but I am currently swamped.  I can probably get you a patch tomorrow
> > afternoon that applies at the end of this set.
> 
> Should this be done as a separate patch?  This patch set concentrates on
> updates to the memory code with the node updates only being done due to the
> memory changes.
> 
> I think it's a good idea to do the caching and have no problem adding on to
> this patchset if no one else has any objections.

I am sorry.  I had meant to include you on the Cc: list.  I just posted a
set of patches (3 small patches) which implement the cache-most-recent bit
I alluded to above.  Search for a subject of "Speed up link_mem_sections
during boot" and you will find them.  I did add you to the Cc: list for
the next time I end up sending the set.

My next task is to implement an x86_64 SGI UV specific chunk of code
for memory_block_size_bytes().  Would you consider adding that to your
patch set?  I expect to have that either later today or early tomorrow.

Robin


Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections

2010-09-29 Thread Kay Sievers
On Wed, Sep 29, 2010 at 14:37, Greg KH  wrote:
> On Wed, Sep 29, 2010 at 10:32:34AM +0200, Avi Kivity wrote:
>>  On 09/29/2010 04:50 AM, Greg KH wrote:
>>> >
>>> >  Because the old ABI creates 129,000+ entries inside
>>> >  /sys/devices/system/memory with their associated links from
>>> >  /sys/devices/system/node/node*/ back to those directory entries.
>>> >
>>> >  Thankfully things like rpm, hald, and other miscellaneous commands scan
>>> >  that information.
>>>
>>> Really?  Why?  Why would rpm care about this?  hald is dead now so we
>>> don't need to worry about that anymore,
>>
>> That's not what compatibility means.  We can't just support
>> latest-and-greatest userspace on latest-and-greatest kernels.
>
> Oh, I know that, that's not what I was getting at at all here, sorry if
> it came across that way.
>
> I wanted to know so we could go fix programs that are mucking around in
> these files, as odds are, they shouldn't be doing that in the first
> place.
>
> Like rpm, why would it matter what the memory in the system looks like?

HAL does many inefficient things, but I don't think it's using
/sys/system/, besides possibly checking the cpufreq governor state
there.

Kay

Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections

2010-09-29 Thread Greg KH
On Wed, Sep 29, 2010 at 10:32:34AM +0200, Avi Kivity wrote:
>  On 09/29/2010 04:50 AM, Greg KH wrote:
>> >
>> >  Because the old ABI creates 129,000+ entries inside
>> >  /sys/devices/system/memory with their associated links from
>> >  /sys/devices/system/node/node*/ back to those directory entries.
>> >
>> >  Thankfully things like rpm, hald, and other miscellaneous commands scan
>> >  that information.
>>
>> Really?  Why?  Why would rpm care about this?  hald is dead now so we
>> don't need to worry about that anymore,
>
> That's not what compatibility means.  We can't just support
> latest-and-greatest userspace on latest-and-greatest kernels.

Oh, I know that, that's not what I was getting at at all here, sorry if
it came across that way.

I wanted to know so we could go fix programs that are mucking around in
these files, as odds are, they shouldn't be doing that in the first
place.

Like rpm, why would it matter what the memory in the system looks like?

thanks,

greg k-h


Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections

2010-09-29 Thread Avi Kivity

 On 09/29/2010 04:50 AM, Greg KH wrote:

>
>  Because the old ABI creates 129,000+ entries inside
>  /sys/devices/system/memory with their associated links from
>  /sys/devices/system/node/node*/ back to those directory entries.
>
>  Thankfully things like rpm, hald, and other miscellaneous commands scan
>  that information.

Really?  Why?  Why would rpm care about this?  hald is dead now so we
don't need to worry about that anymore,


That's not what compatibility means.  We can't just support
latest-and-greatest userspace on latest-and-greatest kernels.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections

2010-09-28 Thread Greg KH
On Tue, Sep 28, 2010 at 10:12:18AM -0500, Robin Holt wrote:
> On Tue, Sep 28, 2010 at 02:44:40PM +0200, Avi Kivity wrote:
> >  On 09/27/2010 09:09 PM, Nathan Fontenot wrote:
> > >This set of patches decouples the concept that a single memory
> > >section corresponds to a single directory in
> > >/sys/devices/system/memory/.  On systems
> > >with large amounts of memory (1+ TB) there are performance issues
> > >related to creating the large number of sysfs directories.  For
> > >a powerpc machine with 1 TB of memory we are creating 63,000+
> > >directories.  This is resulting in boot times of around 45-50
> > >minutes for systems with 1 TB of memory and 8 hours for systems
> > >with 2 TB of memory.  With this patch set applied I am now seeing
> > >boot times of 5 minutes or less.
> > >
> > >The root of this issue is in sysfs directory creation. Every time
> > >a directory is created a string compare is done against all sibling
> > >directories to ensure we do not create duplicates.  The list of
> > >directory nodes in sysfs is kept as an unsorted list which results
> > >in this being an exponentially longer operation as the number of
> > >directories grows.
> > >
> > >The solution implemented by this patch set is to allow a single
> > >directory in sysfs to span multiple memory sections.  This is
> > >controlled by an optional architecturally defined function
> > >memory_block_size_bytes().  The default definition of this
> > >routine returns a memory block size equal to the memory section
> > >size. This maintains the current layout of sysfs memory
> > >directories, so the view presented to userspace remains the same
> > >as it is today.
> > >
> > 
> > Why not update sysfs directory creation to be fast, for example by
> > using an rbtree instead of a linked list.  This fixes an
> > implementation problem in the kernel instead of working around it
> > and creating a new ABI.
> 
> Because the old ABI creates 129,000+ entries inside
> /sys/devices/system/memory with their associated links from
> /sys/devices/system/node/node*/ back to those directory entries.
> 
> Thankfully things like rpm, hald, and other miscellaneous commands scan
> that information.

Really?  Why?  Why would rpm care about this?  hald is dead now so we
don't need to worry about that anymore, but what other commands/programs
read this information?

thanks,

greg k-h


Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections

2010-09-28 Thread Nathan Fontenot
On 09/28/2010 07:38 AM, Robin Holt wrote:
> I was tasked with looking at a slowdown in similar sized SGI machines
> booting x86_64.  Jack Steiner had already looked into the memory_dev_init.
> I was looking at link_mem_sections().
> 
> I made a dramatic improvement on a 16TB machine in that function by
> merely caching the most recent memory section and checking to see if
> the next memory section happens to follow it in the linked list
> of kobjects.
> 
> That simple cache reduced the time for link_mem_sections from 1 hour 27
> minutes down to 46 seconds.

Nice!

> 
> I would like to propose we implement something along those lines also,
> but I am currently swamped.  I can probably get you a patch tomorrow
> afternoon that applies at the end of this set.

Should this be done as a separate patch?  This patch set concentrates on
updates to the memory code with the node updates only being done due to the
memory changes.

I think it's a good idea to do the caching and have no problem adding on to
this patchset if no one else has any objections.

-Nathan

> 
> Thanks,
> Robin
> 
> On Mon, Sep 27, 2010 at 02:09:31PM -0500, Nathan Fontenot wrote:
>> This set of patches decouples the concept that a single memory
>> section corresponds to a single directory in 
>> /sys/devices/system/memory/.  On systems
>> with large amounts of memory (1+ TB) there are performance issues
>> related to creating the large number of sysfs directories.  For
>> a powerpc machine with 1 TB of memory we are creating 63,000+
>> directories.  This is resulting in boot times of around 45-50
>> minutes for systems with 1 TB of memory and 8 hours for systems
>> with 2 TB of memory.  With this patch set applied I am now seeing
>> boot times of 5 minutes or less.
>>
>> The root of this issue is in sysfs directory creation. Every time
>> a directory is created a string compare is done against all sibling
>> directories to ensure we do not create duplicates.  The list of
>> directory nodes in sysfs is kept as an unsorted list which results
>> in this being an exponentially longer operation as the number of
>> directories grows.
>>
>> The solution implemented by this patch set is to allow a single
>> directory in sysfs to span multiple memory sections.  This is
>> controlled by an optional architecturally defined function
>> memory_block_size_bytes().  The default definition of this
>> routine returns a memory block size equal to the memory section
>> size. This maintains the current layout of sysfs memory
>> directories, so the view presented to userspace remains the same
>> as it is today.
>>
>> For architectures that define their own version of this routine,
>> as is done for powerpc in this patchset, the view in userspace
>> would change such that each memoryXXX directory would span
>> multiple memory sections.  The number of sections spanned would
>> depend on the value reported by memory_block_size_bytes.
>>
>> In both cases a new file 'end_phys_index' is created in each
>> memoryXXX directory.  This file will contain the physical id
>> of the last memory section covered by the sysfs directory.  For
>> the default case, the value in 'end_phys_index' will be the same
>> as in the existing 'phys_index' file.
>>
>> This version of the patch set includes an update to properly
>> report block_size_bytes, phys_index, and end_phys_index.  Additionally,
>> the patch that adds the end_phys_index sysfs file is now patch 5/8
>> instead of being patch 2/8 as in the previous version of the patches.
>>
>> -Nathan Fontenot



Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections

2010-09-28 Thread Avi Kivity

 On 09/28/2010 05:12 PM, Robin Holt wrote:

>  Why not update sysfs directory creation to be fast, for example by
>  using an rbtree instead of a linked list.  This fixes an
>  implementation problem in the kernel instead of working around it
>  and creating a new ABI.

Because the old ABI creates 129,000+ entries inside
/sys/devices/system/memory with their associated links from
/sys/devices/system/node/node*/ back to those directory entries.

Thankfully things like rpm, hald, and other miscellaneous commands scan
that information.  On our 8 TB test machine, hald runs continuously
following boot for nearly an hour, mostly scanning useless information
from /sys/.


I see - so the problem wasn't just kernel internal; the ABI itself was 
unsuitable.  Too bad this wasn't considered at the time it was added.


(129k entries / 1 hour = 35 entries/sec; not very impressive)

--
error compiling committee.c: too many arguments to function



Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections

2010-09-28 Thread Dave Hansen
On Tue, 2010-09-28 at 14:44 +0200, Avi Kivity wrote:
> Why not update sysfs directory creation to be fast, for example by using 
> an rbtree instead of a linked list.  This fixes an implementation 
> problem in the kernel instead of working around it and creating a new ABI.
> 
> New ABIs mean old tools won't work, and new tools need to understand 
> both ABIs.

Just to be clear, _these_ patches do not change the existing ABI.

They do add a new ABI: the end_phys_index file.  But, it is completely
redundant at the moment.  It could be taken out of these patches.

That said, fixing the directory creation speed is probably a worthwhile
endeavor too.

-- Dave



Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections

2010-09-28 Thread Robin Holt
On Tue, Sep 28, 2010 at 02:44:40PM +0200, Avi Kivity wrote:
>  On 09/27/2010 09:09 PM, Nathan Fontenot wrote:
> >This set of patches decouples the concept that a single memory
> >section corresponds to a single directory in
> >/sys/devices/system/memory/.  On systems
> >with large amounts of memory (1+ TB) there are performance issues
> >related to creating the large number of sysfs directories.  For
> >a powerpc machine with 1 TB of memory we are creating 63,000+
> >directories.  This is resulting in boot times of around 45-50
> >minutes for systems with 1 TB of memory and 8 hours for systems
> >with 2 TB of memory.  With this patch set applied I am now seeing
> >boot times of 5 minutes or less.
> >
> >The root of this issue is in sysfs directory creation. Every time
> >a directory is created a string compare is done against all sibling
> >directories to ensure we do not create duplicates.  The list of
> >directory nodes in sysfs is kept as an unsorted list which results
> >in this being an exponentially longer operation as the number of
> >directories grows.
> >
> >The solution implemented by this patch set is to allow a single
> >directory in sysfs to span multiple memory sections.  This is
> >controlled by an optional architecturally defined function
> >memory_block_size_bytes().  The default definition of this
> >routine returns a memory block size equal to the memory section
> >size. This maintains the current layout of sysfs memory
> >directories, so the view presented to userspace remains the same
> >as it is today.
> >
> 
> Why not update sysfs directory creation to be fast, for example by
> using an rbtree instead of a linked list.  This fixes an
> implementation problem in the kernel instead of working around it
> and creating a new ABI.

Because the old ABI creates 129,000+ entries inside
/sys/devices/system/memory with their associated links from
/sys/devices/system/node/node*/ back to those directory entries.

Thankfully things like rpm, hald, and other miscellaneous commands scan
that information.  On our 8 TB test machine, hald runs continuously
following boot for nearly an hour, mostly scanning useless information
from /sys/.

Robin

> 
> New ABIs mean old tools won't work, and new tools need to understand
> both ABIs.
> 
> -- 
> error compiling committee.c: too many arguments to function
> 


Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections

2010-09-28 Thread Avi Kivity

 On 09/27/2010 09:09 PM, Nathan Fontenot wrote:

This set of patches decouples the concept that a single memory
section corresponds to a single directory in
/sys/devices/system/memory/.  On systems
with large amounts of memory (1+ TB) there are performance issues
related to creating the large number of sysfs directories.  For
a powerpc machine with 1 TB of memory we are creating 63,000+
directories.  This is resulting in boot times of around 45-50
minutes for systems with 1 TB of memory and 8 hours for systems
with 2 TB of memory.  With this patch set applied I am now seeing
boot times of 5 minutes or less.

The root of this issue is in sysfs directory creation. Every time
a directory is created a string compare is done against all sibling
directories to ensure we do not create duplicates.  The list of
directory nodes in sysfs is kept as an unsorted list which results
in this being an exponentially longer operation as the number of
directories grows.

The solution implemented by this patch set is to allow a single
directory in sysfs to span multiple memory sections.  This is
controlled by an optional architecturally defined function
memory_block_size_bytes().  The default definition of this
routine returns a memory block size equal to the memory section
size. This maintains the current layout of sysfs memory
directories, so the view presented to userspace remains the same
as it is today.



Why not update sysfs directory creation to be fast, for example by using 
an rbtree instead of a linked list.  This fixes an implementation 
problem in the kernel instead of working around it and creating a new ABI.


New ABIs mean old tools won't work, and new tools need to understand 
both ABIs.
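
For reference, a rough sketch of the rbtree idea; the structure and
function names below are made up for illustration and are not the
actual sysfs internals.  Keeping sibling directory entries in a tree
keyed by name turns the duplicate check during creation into an
O(log n) walk instead of a linear string-compare scan:

#include <linux/rbtree.h>
#include <linux/string.h>

/* Hypothetical sibling entry; the real sysfs dirent looks different. */
struct dir_entry {
        const char *name;
        struct rb_node node;
};

/* Look up a sibling by name in O(log n); a NULL result means the new
 * directory name is not a duplicate. */
static struct dir_entry *dir_find(struct rb_root *root, const char *name)
{
        struct rb_node *n = root->rb_node;

        while (n) {
                struct dir_entry *de = rb_entry(n, struct dir_entry, node);
                int cmp = strcmp(name, de->name);

                if (cmp < 0)
                        n = n->rb_left;
                else if (cmp > 0)
                        n = n->rb_right;
                else
                        return de;
        }
        return NULL;
}

Insertion would then use rb_link_node()/rb_insert_color() at the point
where the search falls off the tree.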


--
error compiling committee.c: too many arguments to function



Re: [PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections

2010-09-28 Thread Robin Holt
I was tasked with looking at a slowdown in similar sized SGI machines
booting x86_64.  Jack Steiner had already looked into the memory_dev_init.
I was looking at link_mem_sections().

I made a dramatic improvement on a 16TB machine in that function by
merely caching the most recent memory section and checking to see if
the next memory section happens to follow it in the linked list
of kobjects.

That simple cache reduced the time for link_mem_sections from 1 hour 27
minutes down to 46 seconds.
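
Outside of the kernel context, the idea reduces to a most-recently-used
cache in front of a linear list search.  A self-contained sketch of the
pattern follows; it is illustrative only and is not the actual patch:

#include <string.h>

struct kobj {
        const char *name;
        struct kobj *next;
};

/* Plain lookup: a string compare against every sibling, as in the
 * unsorted sysfs list. */
static struct kobj *find_linear(struct kobj *head, const char *name)
{
        struct kobj *k;

        for (k = head; k; k = k->next)
                if (strcmp(k->name, name) == 0)
                        return k;
        return NULL;
}

/* Cached lookup: remember the last hit and first check whether the
 * requested entry is simply the next one in the list, which is the
 * common case when memory sections are registered in order. */
static struct kobj *find_cached(struct kobj *head, const char *name)
{
        static struct kobj *last;

        if (last && last->next && strcmp(last->next->name, name) == 0)
                return last = last->next;

        return last = find_linear(head, name);
}

Because sections are registered in ascending order during boot, the
fast path hits almost every time, which is where the speedup comes from.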

I would like to propose we implement something along those lines also,
but I am currently swamped.  I can probably get you a patch tomorrow
afternoon that applies at the end of this set.

Thanks,
Robin

On Mon, Sep 27, 2010 at 02:09:31PM -0500, Nathan Fontenot wrote:
> This set of patches decouples the concept that a single memory
> section corresponds to a single directory in 
> /sys/devices/system/memory/.  On systems
> with large amounts of memory (1+ TB) there are performance issues
> related to creating the large number of sysfs directories.  For
> a powerpc machine with 1 TB of memory we are creating 63,000+
> directories.  This is resulting in boot times of around 45-50
> minutes for systems with 1 TB of memory and 8 hours for systems
> with 2 TB of memory.  With this patch set applied I am now seeing
> boot times of 5 minutes or less.
> 
> The root of this issue is in sysfs directory creation. Every time
> a directory is created a string compare is done against all sibling
> directories to ensure we do not create duplicates.  The list of
> directory nodes in sysfs is kept as an unsorted list which results
> in this being an exponentially longer operation as the number of
> directories grows.
> 
> The solution implemented by this patch set is to allow a single
> directory in sysfs to span multiple memory sections.  This is
> controlled by an optional architecturally defined function
> memory_block_size_bytes().  The default definition of this
> routine returns a memory block size equal to the memory section
> size. This maintains the current layout of sysfs memory
> directories, so the view presented to userspace remains the same
> as it is today.
> 
> For architectures that define their own version of this routine,
> as is done for powerpc in this patchset, the view in userspace
> would change such that each memoryXXX directory would span
> multiple memory sections.  The number of sections spanned would
> depend on the value reported by memory_block_size_bytes.
> 
> In both cases a new file 'end_phys_index' is created in each
> memoryXXX directory.  This file will contain the physical id
> of the last memory section covered by the sysfs directory.  For
> the default case, the value in 'end_phys_index' will be the same
> as in the existing 'phys_index' file.
> 
> This version of the patch set includes an update to properly
> report block_size_bytes, phys_index, and end_phys_index.  Additionally,
> the patch that adds the end_phys_index sysfs file is now patch 5/8
> instead of being patch 2/8 as in the previous version of the patches.
> 
> -Nathan Fontenot


[PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections

2010-09-27 Thread Nathan Fontenot
This set of patches decouples the concept that a single memory
section corresponds to a single directory in 
/sys/devices/system/memory/.  On systems
with large amounts of memory (1+ TB) there are performance issues
related to creating the large number of sysfs directories.  For
a powerpc machine with 1 TB of memory we are creating 63,000+
directories.  This is resulting in boot times of around 45-50
minutes for systems with 1 TB of memory and 8 hours for systems
with 2 TB of memory.  With this patch set applied I am now seeing
boot times of 5 minutes or less.

The root of this issue is in sysfs directory creation. Every time
a directory is created a string compare is done against all sibling
directories to ensure we do not create duplicates.  The list of
directory nodes in sysfs is kept as an unsorted list which results
in this being an exponentially longer operation as the number of
directories grows.

The solution implemented by this patch set is to allow a single
directory in sysfs to span multiple memory sections.  This is
controlled by an optional architecturally defined function
memory_block_size_bytes().  The default definition of this
routine returns a memory block size equal to the memory section
size. This maintains the current layout of sysfs memory
directories, so the view presented to userspace remains the same
as it is today.
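
One plausible shape for such an optional, architecture-overridable hook
is a weak default; this is only a sketch of the idea, and the patches
may wire it up differently:

/* Sketch of an overridable default: one memory section per sysfs
 * memory block.  SECTION_SIZE_BITS comes from the sparsemem
 * configuration. */
u32 __weak memory_block_size_bytes(void)
{
        return 1U << SECTION_SIZE_BITS;
}

An architecture can then override the weak default with a larger value,
as Robin Holt's SGI UV patch elsewhere in this thread does with a 2 GB
block.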

For architectures that define their own version of this routine,
as is done for powerpc in this patchset, the view in userspace
would change such that each memoryXXX directory would span
multiple memory sections.  The number of sections spanned would
depend on the value reported by memory_block_size_bytes.

In both cases a new file 'end_phys_index' is created in each
memoryXXX directory.  This file will contain the physical id
of the last memory section covered by the sysfs directory.  For
the default case, the value in 'end_phys_index' will be the same
as in the existing 'phys_index' file.
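
As a toy illustration of the arithmetic, with made-up sizes (the real
values depend on the architecture and on memory_block_size_bytes):

#include <stdio.h>

int main(void)
{
        /* Illustrative sizes only. */
        unsigned long section_size  = 16UL << 20;   /* 16 MB section   */
        unsigned long block_size    = 256UL << 20;  /* 256 MB block    */
        unsigned long first_section = 48;           /* arbitrary start */

        unsigned long per_block = block_size / section_size;

        /* phys_index / end_phys_index as described above: the ids of
         * the first and last sections covered by one memoryXXX
         * directory.  With a one-section block (the default) the two
         * values are equal. */
        printf("sections per block: %lu\n", per_block);
        printf("phys_index=%lu end_phys_index=%lu\n",
               first_section, first_section + per_block - 1);
        return 0;
}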

This version of the patch set includes an update to properly
report block_size_bytes, phys_index, and end_phys_index.  Additionally,
the patch that adds the end_phys_index sysfs file is now patch 5/8
instead of being patch 2/8 as in the previous version of the patches.

-Nathan Fontenot