Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-02-01 Thread Christoph Lameter
On Tue, 30 Jan 2007, Peter Zijlstra wrote:

> I'm guessing this will involve page migration.

Not necessarily. The approach also works without page migration. It depends 
on an intelligent allocation scheme that stays out of the areas of interest 
to callers restricted to low-area allocations as much as possible, and is 
then able to reclaim from a section of a zone if necessary. The 
implementation of alloc_pages_range() that I did way back did not rely on 
page migration.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-02-01 Thread Christoph Lameter
On Mon, 29 Jan 2007, Andrew Morton wrote:

> On Mon, 29 Jan 2007 15:37:29 -0800 (PST)
> Christoph Lameter <[EMAIL PROTECTED]> wrote:
> 
> > With an alloc_pages_range() one would be able to specify upper and lower 
> > boundaries.
> 
> Is there a proposal anywhere regarding how this would be implemented?

Yes, it was discussed a while back in August; look for alloc_pages_range. 
Sadly, I have not been able to work on it since there are too many 
other issues.




Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-01-30 Thread Peter Zijlstra
On Mon, 2007-01-29 at 16:09 -0800, Andrew Morton wrote:
> On Mon, 29 Jan 2007 15:37:29 -0800 (PST)
> Christoph Lameter <[EMAIL PROTECTED]> wrote:
> 
> > With an alloc_pages_range() one would be able to specify upper and lower 
> > boundaries.
> 
> Is there a proposal anywhere regarding how this would be implemented?

I'm guessing this will involve page migration.

Still, would we need to place bounds on non-movable pages, or will it be
best-effort? The current zone approach seems to be best-effort too,
although it does try to keep allocations away from the lower zones as
much as possible.

But I guess we could make a single zone allocator prefer high addresses
too.

So then we'd end up with a single zone, and each allocation would give a
range. Try to pick a free page with as high an address as possible in
the given range. If no pages are available in the given range, try to move
some movable pages out of it.

This does of course involve finding free pages in a given range, and
identifying pages as movable.

And a gazillion trivial but tedious things I've forgotten. Christoph, is
this what you were getting at?






Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-01-29 Thread Andrew Morton
On Mon, 29 Jan 2007 15:37:29 -0800 (PST)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> With an alloc_pages_range() one would be able to specify upper and lower 
> boundaries.

Is there a proposal anywhere regarding how this would be implemented?


Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-01-29 Thread Christoph Lameter
On Mon, 29 Jan 2007, Russell King wrote:

> This sounds like it could help ARM where we have some weird DMA areas.

Some ARM platforms have no need for a ZONE_DMA. The code in mm allows you 
to not compile ZONE_DMA support into these kernels.

> What will help even more is if the block layer can also be persuaded that
> a device dma mask is precisely that - a mask - and not a set of leading
> ones followed by a set of zeros, then we could eliminate the really ugly
> dmabounce code.

With an alloc_pages_range() one would be able to specify upper and lower 
boundaries. The device DMA mask can be translated to a fitting boundary. 
Maybe we can then also get rid of the device mask and specify a boundary 
there instead. There is a lot of ugly code all around that works around 
the existing issues with DMA masks; that would all go away.



Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-01-29 Thread Russell King
On Mon, Jan 29, 2007 at 02:45:06PM -0800, Christoph Lameter wrote:
> On Mon, 29 Jan 2007, Andrew Morton wrote:
> 
> > > All 64-bit machines will only have a single zone if we have such a range 
> > > alloc mechanism. The 32-bit ones with HIGHMEM won't be able to avoid it, 
> > > true. But all arches that do not need gymnastics to access their memory 
> > > will be able to run with a single zone.
> > 
> > What is "such a range alloc mechanism"?
> 
> As I mentioned above: A function that allows an allocation to specify 
> which physical memory ranges are permitted.
> 
> > So please stop telling me what a wonderful world it is to not have multiple
> > zones.  It just isn't going to happen for a long long time.  The
> > multiple-zone kernel is the case we need to care about most by a very large
> > margin indeed.  Single-zone is an infinitesimal corner-case.
> 
> We can still reduce the number of zones to two for those that require 
> highmem, which may allow us to avoid ZONE_DMA/DMA32 issues and let DMA 
> devices that can do I/O to memory ranges not compatible with the current 
> boundaries of DMA/DMA32 avoid bounce buffers. And I am also repeating 
> myself.

This sounds like it could help ARM where we have some weird DMA areas.

What will help even more is if the block layer can also be persuaded that
a device DMA mask is precisely that - a mask - and not a set of leading
ones followed by a set of zeros; then we could eliminate the really ugly
dmabounce code.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:


Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-01-29 Thread Christoph Lameter
On Mon, 29 Jan 2007, Andrew Morton wrote:

> > All 64-bit machines will only have a single zone if we have such a range 
> > alloc mechanism. The 32-bit ones with HIGHMEM won't be able to avoid it, 
> > true. But all arches that do not need gymnastics to access their memory 
> > will be able to run with a single zone.
> 
> What is "such a range alloc mechanism"?

As I mentioned above: A function that allows an allocation to specify 
which physical memory ranges are permitted.

> So please stop telling me what a wonderful world it is to not have multiple
> zones.  It just isn't going to happen for a long long time.  The
> multiple-zone kernel is the case we need to care about most by a very large
> margin indeed.  Single-zone is an infinitesimal corner-case.

We can still reduce the number of zones to two for those that require 
highmem, which may allow us to avoid ZONE_DMA/DMA32 issues and let DMA 
devices that can do I/O to memory ranges not compatible with the current 
boundaries of DMA/DMA32 avoid bounce buffers. And I am also repeating 
myself.



Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-01-29 Thread Andrew Morton
On Mon, 29 Jan 2007 13:54:38 -0800 (PST)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> On Fri, 26 Jan 2007, Andrew Morton wrote:
> 
> > > The main benefit is a significant simplification of the VM, leading to 
> > > robust and reliable operations and a reduction of the maintenance 
> > > headaches coming with the additional zones.
> > > 
> > > If we would introduce the ability of allocating from a range of 
> > > physical addresses then the need for DMA zones would go away allowing 
> > > flexibility for device driver DMA allocations and at the same time we get 
> > > rid of special casing in the VM.
> > 
> > None of this is valid.  The great majority of machines out there will
> > continue to have the same number of zones.  Nothing changes.
> 
> All 64-bit machines will only have a single zone if we have such a range 
> alloc mechanism. The 32-bit ones with HIGHMEM won't be able to avoid it, 
> true. But all arches that do not need gymnastics to access their memory 
> will be able to run with a single zone.

What is "such a range alloc mechanism"?

> > That's all a real cost, so we need to see *good* benefits to outweigh that
> > cost.  Thus far I don't think we've seen that.
> 
> The real savings is the simplicity of VM design, robustness and 
> efficiency. We lose on all these fronts if we keep or add useless zones. 
> 
> The main reason for the recent problems with dirty handling seems to be 
> exactly such a multizone balancing issue, involving ZONE_NORMAL and 
> HIGHMEM. Those problems cannot occur on single-zone arches (right now 
> that means a series of embedded arches, UML and IA64). 
> 
> Multiple zones are a recipe for VM fragility and result in complexity 
> that is difficult to manage.

Why do I have to keep repeating myself?  90% of known FC6-running machines
are x86-32.  90% of vendor-shipped kernels need all three zones.  And the
remaining 10% ship with multiple nodes as well.

So please stop telling me what a wonderful world it is to not have multiple
zones.  It just isn't going to happen for a long long time.  The
multiple-zone kernel is the case we need to care about most by a very large
margin indeed.  Single-zone is an infinitesimal corner-case.





Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-01-29 Thread Christoph Lameter
On Fri, 26 Jan 2007, Andrew Morton wrote:

> > The main benefit is a significant simplification of the VM, leading to 
> > robust and reliable operations and a reduction of the maintenance 
> > headaches coming with the additional zones.
> > 
> > If we would introduce the ability of allocating from a range of 
> > physical addresses then the need for DMA zones would go away allowing 
> > flexibility for device driver DMA allocations and at the same time we get 
> > rid of special casing in the VM.
> 
> None of this is valid.  The great majority of machines out there will
> continue to have the same number of zones.  Nothing changes.

All 64-bit machines will only have a single zone if we have such a range 
alloc mechanism. The 32-bit ones with HIGHMEM won't be able to avoid it, 
true. But all arches that do not need gymnastics to access their memory 
will be able to run with a single zone.
 
> That's all a real cost, so we need to see *good* benefits to outweigh that
> cost.  Thus far I don't think we've seen that.

The real savings is the simplicity of VM design, robustness and 
efficiency. We lose on all these fronts if we keep or add useless zones. 

The main reason for the recent problems with dirty handling seems to be 
exactly such a multizone balancing issue, involving ZONE_NORMAL and 
HIGHMEM. Those problems cannot occur on single-zone arches (right now 
that means a series of embedded arches, UML and IA64). 

Multiple zones are a recipe for VM fragility and result in complexity 
that is difficult to manage.




Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-01-26 Thread Andrew Morton
On Fri, 26 Jan 2007 11:58:18 -0800 (PST)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> > If the only demonstrable benefit is a saving of a few k of text on a small
> > number of machines then things are looking very grim, IMO.
> 
> The main benefit is a significant simplification of the VM, leading to 
> robust and reliable operations and a reduction of the maintenance 
> headaches coming with the additional zones.
> 
> If we would introduce the ability of allocating from a range of 
> physical addresses then the need for DMA zones would go away allowing 
> flexibility for device driver DMA allocations and at the same time we get 
> rid of special casing in the VM.

None of this is valid.  The great majority of machines out there will
continue to have the same number of zones.  Nothing changes.

What will happen is that a small number of machines will have different
runtime behaviour. So they don't benefit from the majority's testing, they
don't contribute to it, and they potentially have unique-to-them problems
which we need to worry about.

That's all a real cost, so we need to see *good* benefits to outweigh that
cost.  Thus far I don't think we've seen that.



Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-01-26 Thread Christoph Lameter
On Fri, 26 Jan 2007, Andrew Morton wrote:

> As Mel points out, distros will ship with CONFIG_ZONE_DMA=y, so the number
> of machines which will actually benefit from this change is really small. 
> And the benefit to those few machines will also, I suspect, be small.
> 
> > > - We kicked around some quite different ways of implementing the same
> > >   things, but nothing came of it.  iirc, one was to remove the hard-coded
> > >   zones altogether and rework all the MM to operate in terms of
> > > 
> > >   for (idx = 0; idx < NUMBER_OF_ZONES; idx++)
> > >   ...
> > 
> > Hmmm.. How would that be simpler?
> 
> Replace a sprinkle of open-coded ifdefs with a regular code sequence which
> everyone uses.  Pretty obvious, I'd thought.

We do use such loops in many places. However, stuff like array 
initialization and special casing cannot use a loop. I am not sure what we 
could change there. The hard coding is necessary because each zone 
currently has invariant characteristics that we need to consider. 
Reducing the number of zones reduces the amount of special casing in the 
VM that must be considered at run time, and that special casing is a 
potential source of trouble.

> Plus it becomes straightforward to extend this from the present four zones
> to a complete 12 zones, which gives us the full set of
> ZONE_DMA20,ZONE_DMA21,...,ZONE_DMA32 for those funny devices.

I just hope we can handle the VM complexity of load balancing etc. that 
this will introduce. Also, each zone has management overhead and causes 
additional cachelines to be touched on many VM operations. Much of that 
management overhead becomes unnecessary if we reduce zones.

> If the only demonstrable benefit is a saving of a few k of text on a small
> number of machines then things are looking very grim, IMO.

The main benefit is a significant simplification of the VM, leading to 
robust and reliable operations and a reduction of the maintenance 
headaches coming with the additional zones.

If we would introduce the ability of allocating from a range of 
physical addresses then the need for DMA zones would go away allowing 
flexibility for device driver DMA allocations and at the same time we get 
rid of special casing in the VM.


Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-01-26 Thread Andrew Morton
On Fri, 26 Jan 2007 07:56:09 -0800 (PST)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> On Fri, 26 Jan 2007, Andrew Morton wrote:
> 
> > - They add zillions of ifdefs
> 
> They just add a few for ZONE_DMA where we already have similar ifdefs for 
> ZONE_DMA32 and ZONE_HIGHMEM.

I refreshed my memory.  It remains awful.

> > - They make the VM's behaviour diverge between different platforms and
> >   between different configs on the same platforms, and hence degrade
> >   maintainability and increase complexity.
> 
> They avoid unnecessary complexity on platforms. They could be made to work 
> on more platforms with measures to deal with what ZONE_DMA 
> provides in different ways. There are 6 or so platforms that do not need 
> ZONE_DMA at all.

As Mel points out, distros will ship with CONFIG_ZONE_DMA=y, so the number
of machines which will actually benefit from this change is really small. 
And the benefit to those few machines will also, I suspect, be small.

> > - We kicked around some quite different ways of implementing the same
> >   things, but nothing came of it.  iirc, one was to remove the hard-coded
> >   zones altogether and rework all the MM to operate in terms of
> > 
> > for (idx = 0; idx < NUMBER_OF_ZONES; idx++)
> > ...
> 
> Hmmm.. How would that be simpler?

Replace a sprinkle of open-coded ifdefs with a regular code sequence which
everyone uses.  Pretty obvious, I'd thought.

Plus it becomes straightforward to extend this from the present four zones
to a complete 12 zones, which gives us the full set of
ZONE_DMA20,ZONE_DMA21,...,ZONE_DMA32 for those funny devices.

> > - I haven't seen any hard numbers to justify the change.
> 
> I have send you numbers showing significant reductions in code size.

If it isn't in the changelog it doesn't exist.  I guess I didn't copy it
into the changelog.

If the only demonstrable benefit is a saving of a few k of text on a small
number of machines then things are looking very grim, IMO.



Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-01-26 Thread Mel Gorman

On Fri, 26 Jan 2007, Christoph Lameter wrote:


On Fri, 26 Jan 2007, Mel Gorman wrote:

> On Fri, 26 Jan 2007, Mel Gorman wrote:
> 
> > > For arches that do not have HIGHMEM other zones would be okay too it
> > > seems.
> > It would, but it'd obscure the code to take advantage of that.
> 
> No MOVABLE memory for 64 bit platforms that do not have HIGHMEM right now?

err, no, I misinterpreted what you meant by "other zones would be ok..". I 
thought you were suggesting the reuse of zone names for some reason.

The zone used for ZONE_MOVABLE is the highest populated zone on the 
architecture. On some architectures, that will be ZONE_HIGHMEM. On others, 
it will be ZONE_DMA. See the function find_usable_zone_for_movable().

ZONE_MOVABLE never spans zones. For example, it will not use some 
ZONE_HIGHMEM and some ZONE_NORMAL memory.

> > The anti-fragmentation code could potentially be used to have subzone
> > groups that kept movable and unmovable allocations as far apart as
> > possible and at opposite ends of a zone. That approach has been kicked
> > a few times because of complexity.
> 
> Hmm... But this patch also introduces additional complexity plus it's 
> difficult to handle for the end user.

It's harder for the user to set up, all right. But it works within limits 
that are known well in advance and doesn't add additional code to the main 
allocator path. Once it's set up, it acts like any other zone and zone 
behavior is better understood than anti-fragmentation's behavior.

> > > There are some NUMA architectures that are not that
> > > symmetric.
> > I know, it's why find_zone_movable_pfns_for_nodes() is as complex as it
> > is. The mechanism spreads the unmovable memory evenly throughout all
> > nodes. In the event some nodes are too small to hold their share, the
> > remaining unmovable memory is divided between the nodes that are larger.
> 
> I would have expected a percentage of a node. If equal amounts of 
> unmovable memory are assigned to all nodes at first then there will be 
> large disparities in the amount of movable memory, e.g. between a node 
> with 8GB of memory and a node with 1GB of memory.

On the other hand, percentages make it harder for the administrator to 
know in advance how much unmovable memory will be available when the 
system starts, even if the machine changes configuration. The absolute 
figure is easier to understand. If there was a requirement, an alternative 
configuration option could be made available that takes a fixed percentage 
of each node with memory.

> How do you handle headless nodes? I.e. memory nodes with no processors?

The code only cares about memory, not processors.

> Those may be particularly large compared to the rest but these are mainly 
> used for movable pages since unmovable things like device driver buffers 
> have to be kept near the processors that take the interrupt.

Then what I'd do is specify kernelcore to be

(number_of_nodes_with_processors * 
largest_amount_of_memory_on_node_with_processors)

That would have all memory near processors available as unmovable memory 
(that movable allocations will still use so they don't always go remote) 
while keeping a large amount of memory on the headless nodes for movable 
allocations only.

If requirements demanded, a configuration option could be made that allows 
the administrator to specify exactly how much unmovable memory he wants on 
a specific node.


--
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab


Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-01-26 Thread Christoph Lameter
On Fri, 26 Jan 2007, Mel Gorman wrote:

> > For arches that do not have HIGHMEM other zones would be okay too it
> > seems.
> It would, but it'd obscure the code to take advantage of that.

No MOVABLE memory for 64 bit platforms that do not have HIGHMEM right now?

> The anti-fragmentation code could potentially be used to have subzone groups
> that kept movable and unmovable allocations as far apart as possible and at
> opposite ends of a zone. That approach has been kicked a few times because of
> complexity.

Hmm... But this patch also introduces additional complexity plus it's 
difficult to handle for the end user.

> > There are some NUMA architectures that are not that
> > symmetric.
> I know, it's why find_zone_movable_pfns_for_nodes() is as complex as it is.
> The mechanism spreads the unmovable memory evenly throughout all nodes. In the
> event some nodes are too small to hold their share, the remaining unmovable
> memory is divided between the nodes that are larger.

I would have expected a percentage of a node. If equal amounts of 
unmovable memory are assigned to all nodes at first then there will be 
large disparities in the amount of movable memory, e.g. between a node 
with 8GB of memory and a node with 1GB of memory.

How do you handle headless nodes? I.e. memory nodes with no processors? 
Those may be particularly large compared to the rest but these are mainly 
used for movable pages since unmovable things like device driver buffers
have to be kept near the processors that take the interrupt.


Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-01-26 Thread Mel Gorman

On Fri, 26 Jan 2007, Christoph Lameter wrote:


On Thu, 25 Jan 2007, Mel Gorman wrote:

> > The following 8 patches against 2.6.20-rc4-mm1 create a zone called
> > ZONE_MOVABLE that is only usable by allocations that specify both
> > __GFP_HIGHMEM and __GFP_MOVABLE. This has the effect of keeping all
> > non-movable pages within a single memory partition while allowing
> > movable allocations to be satisfied from either partition.
> 
> For arches that do not have HIGHMEM other zones would be okay too it
> seems.

It would, but it'd obscure the code to take advantage of that.

> > The size of the zone is determined by a kernelcore= parameter specified
> > at boot-time. This specifies how much memory is usable by non-movable
> > allocations and the remainder is used for ZONE_MOVABLE. Any range of
> > pages within ZONE_MOVABLE can be released by migrating the pages or by
> > reclaiming.
> 
> The user has to manually fiddle around with the size of the unmovable
> partition until it works?

They have to fiddle with the size of the unmovable partition if their 
workload uses more unmovable kernel allocations than expected. This was 
always going to be the restriction with using zones for partitioning 
memory. Resizing zones on the fly is not really an option because the 
resizing would only work reliably in one direction.

The anti-fragmentation code could potentially be used to have subzone 
groups that kept movable and unmovable allocations as far apart as 
possible and at opposite ends of a zone. That approach has been kicked a 
few times because of complexity.

> > When selecting a zone to take pages from for ZONE_MOVABLE, there are
> > two things to consider. First, only memory from the highest populated
> > zone is used for ZONE_MOVABLE. On the x86, this is probably going to be
> > ZONE_HIGHMEM but it would be ZONE_DMA on ppc64 or possibly ZONE_DMA32
> > on x86_64. Second, the amount of memory usable by the kernel will be
> > spread evenly throughout NUMA nodes where possible. If the nodes are
> > not of equal size, the amount of memory usable by the kernel on some
> > nodes may be greater than others.
> 
> So how is the amount of movable memory on a node calculated?

Subtle difference. The amount of unmovable memory is calculated per node.

> Evenly distributed?

As evenly as possible.

> There are some NUMA architectures that are not that
> symmetric.

I know, it's why find_zone_movable_pfns_for_nodes() is as complex as it 
is. The mechanism spreads the unmovable memory evenly throughout all 
nodes. In the event some nodes are too small to hold their share, the 
remaining unmovable memory is divided between the nodes that are larger.

> > By default, the zone is not as useful for hugetlb allocations because
> > they are pinned and non-migratable (currently at least). A sysctl is
> > provided that allows huge pages to be allocated from that zone. This
> > means that the huge page pool can be resized to the size of
> > ZONE_MOVABLE during the lifetime of the system assuming that pages are
> > not mlocked. Despite huge pages being non-movable, we do not introduce
> > additional external fragmentation of note as huge pages are always the
> > largest contiguous block we care about.
> 
> The user already has to specify the partitioning of the system at bootup
> and could take the huge page sizes into account.

Not in all cases. Some systems will not know how many huge pages they need 
in advance because they are used as batch systems running jobs as 
requested. The zone allows an amount of memory to be set aside that can be 
*optionally* used for hugepages if desired or base pages if not. Between 
jobs, the hugepage pool can be resized up to the size of ZONE_MOVABLE.

The other case is ever supporting memory hot-remove. Any memory within 
ZONE_MOVABLE can potentially be removed by migrating pages and off-lined.

> Also huge pages may have variable sizes that can be specified on bootup
> for IA64. The assumption that a huge page is always the largest
> contiguous block is *not true*.

I didn't say they were the largest supported contiguous block, I said they 
were the largest contiguous block we *care* about. Right now, it is 
assumed that variable page sizes are not supported at runtime. If they 
were, some smarts would be needed to keep huge pages of the same size 
together to control external fragmentation but that's about it.

> The huge page sizes on i386 and x86_64 platforms are contingent on
> their page table structure. This can be completely different on other
> platforms.

The size doesn't really make much difference to the mechanism.

--
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab


Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-01-26 Thread Christoph Lameter
On Thu, 25 Jan 2007, Mel Gorman wrote:

> The following 8 patches against 2.6.20-rc4-mm1 create a zone called
> ZONE_MOVABLE that is only usable by allocations that specify both 
> __GFP_HIGHMEM
> and __GFP_MOVABLE. This has the effect of keeping all non-movable pages
> within a single memory partition while allowing movable allocations to be
> satisfied from either partition.

For arches that do not have HIGHMEM other zones would be okay too it 
seems.

> The size of the zone is determined by a kernelcore= parameter specified at
> boot-time. This specifies how much memory is usable by non-movable allocations
> and the remainder is used for ZONE_MOVABLE. Any range of pages within
> ZONE_MOVABLE can be released by migrating the pages or by reclaiming.

The user has to manually fiddle around with the size of the unmovable 
partition until it works?

> When selecting a zone to take pages from for ZONE_MOVABLE, there are two
> things to consider. First, only memory from the highest populated zone is
> used for ZONE_MOVABLE. On the x86, this is probably going to be ZONE_HIGHMEM
> but it would be ZONE_DMA on ppc64 or possibly ZONE_DMA32 on x86_64. Second,
> the amount of memory usable by the kernel will be spread evenly throughout
> NUMA nodes where possible. If the nodes are not of equal size, the amount
> of memory usable by the kernel on some nodes may be greater than others.

So how is the amount of movable memory on a node calculated? Evenly 
distributed? There are some NUMA architectures that are not that 
symmetric.

> By default, the zone is not as useful for hugetlb allocations because they
> are pinned and non-migratable (currently at least). A sysctl is provided that
> allows huge pages to be allocated from that zone. This means that the huge
> page pool can be resized to the size of ZONE_MOVABLE during the lifetime of
> the system assuming that pages are not mlocked. Despite huge pages being
> non-movable, we do not introduce additional external fragmentation of note
> as huge pages are always the largest contiguous block we care about.

The user already has to specify the partitioning of the system at bootup 
and could take the huge page sizes into account.

Also huge pages may have variable sizes that can be specified on bootup 
for IA64. The assumption that a huge page is always the largest 
contiguous block is *not true*.

The huge page sizes on i386 and x86_64 platforms are contingent on 
their page table structure. This can be completely different on other 
platforms.


Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-01-26 Thread Christoph Lameter
On Fri, 26 Jan 2007, Mel Gorman wrote:

> I haven't thought about it much so I probably am missing something. The major
> difference I see is when only one zone is present. In that case, a number of
> loops presumably get optimised away and the behavior is very different
> (presumably better although you point out no figures exist to prove it). Where
> there are two or more zones, the code paths should be similar whether there
> are 2, 3 or 4 zones present.

The balancing of allocations between zones is becoming unnecessary. Also 
in a NUMA system we then have zone == node which allows for a series of 
simplifications.
 
> As the common platforms will always have more than one zone, it'll be heavily
> tested and I'm guessing that distros are always going to have to ship kernels
> with ZONE_DMA for the devices that require it. The only platform I see that
> may have problems at the moment is IA64 which looks like the only platform
> that can have one and only one zone. I am guessing that Christoph will catch
> problems here fairly quickly although a non-optional ZONE_MOVABLE would throw
> a spanner into the works somewhat.

There are 6 platforms that have only one zone. These are not major 
platforms. In order for major platforms to go to a single zone in general 
we would have to implement a generic mechanism to do an allocation where 
one can specify the memory boundaries. Many DMA engines have different
limitations from what ZONE_DMA and ZONE_DMA32 can provide. If such a 
scheme were implemented then those engines would be able to utilize memory 
better and the number of bounce buffers would be reduced.


Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-01-26 Thread Christoph Lameter
On Fri, 26 Jan 2007, Andrew Morton wrote:

> - They add zillions of ifdefs

They just add a few for ZONE_DMA where we already have similar ifdefs for 
ZONE_DMA32 and ZONE_HIGHMEM.

> - They make the VM's behaviour diverge between different platforms and
>   between different configs on the same platforms, and hence degrade
>   maintainability and increase complexity.

They avoid unnecessary complexity on platforms. They could be made to work 
on more platforms with measures to deal with what ZONE_DMA 
provides in different ways. There are 6 or so platforms that do not need 
ZONE_DMA at all.

> - We kicked around some quite different ways of implementing the same
>   things, but nothing came of it.  iirc, one was to remove the hard-coded
>   zones altogether and rework all the MM to operate in terms of
> 
>   for (idx = 0; idx < NUMBER_OF_ZONES; idx++)
>   ...

Hmmm.. How would that be simpler?

> - I haven't seen any hard numbers to justify the change.

I have sent you numbers showing significant reductions in code size.



Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-01-26 Thread Mel Gorman

On Fri, 26 Jan 2007, Andrew Morton wrote:


> On Thu, 25 Jan 2007 23:44:58 +0000 (GMT)
> Mel Gorman <[EMAIL PROTECTED]> wrote:
> 
> > The following 8 patches against 2.6.20-rc4-mm1 create a zone called
> > ZONE_MOVABLE
> 
> Argh.  These surely get all tangled up with the
> make-zones-optional-by-adding-zillions-of-ifdef patches:

There may be some entertainment there all right. I didn't see any obvious 
way of avoiding collisions with those patches but for what it's worth, 
ZONE_MOVABLE could also be made optional.

In this patchset, I made no assumptions about the number of zones other 
than the value of MAX_NR_ZONES. There should be no critical collisions but 
I'll look through this patch list and see what I can spot.

> deal-with-cases-of-zone_dma-meaning-the-first-zone.patch

This patch looks ok and looks like it stands on its own.

> introduce-config_zone_dma.patch

ok, no collisions here but obviously this patch does not stand on its 
own.

> optional-zone_dma-in-the-vm.patch

There are collisions here with the __ZONE_COUNT stuff but it's not 
difficult to work around.

> optional-zone_dma-in-the-vm-no-gfp_dma-check-in-the-slab-if-no-config_zone_dma-is-set.patch
> optional-zone_dma-in-the-vm-no-gfp_dma-check-in-the-slab-if-no-config_zone_dma-is-set-reduce-config_zone_dma-ifdefs.patch

There is no cross-over here with the ZONE_MOVABLE patches. They are 
messing around with slab.

> optional-zone_dma-for-ia64.patch

No collision here.

> remove-zone_dma-remains-from-parisc.patch
> remove-zone_dma-remains-from-sh-sh64.patch

No collisions here either. I see that there were discussions about Power 
potentially doing something similar.

> set-config_zone_dma-for-arches-with-generic_isa_dma.patch

No collisions.

> zoneid-fix-up-calculations-for-zoneid_pgshift.patch

Fun, but no collisions.

To my surprise, I only spotted one major conflict point with 
optional-zone_dma-in-the-vm.patch and that should be easy enough to 
resolve. What I could do is break up one of my patches into 
most-of-the-patch and the-part-that-may-conflict-with-optional-dma-zone. 
The smaller part would then change depending on whether the optional DMA 
zone work is present. Would that be any help?

> My objections to those patches:
> 
> - They add zillions of ifdefs
> 
> - They make the VM's behaviour diverge between different platforms and
>   between different configs on the same platforms, and hence degrade
>   maintainability and increase complexity.

I haven't thought about it much so I probably am missing something. The 
major difference I see is when only one zone is present. In that case, a 
number of loops presumably get optimised away and the behavior is very 
different (presumably better although you point out no figures exist to 
prove it). Where there are two or more zones, the code paths should be 
similar whether there are 2, 3 or 4 zones present.

As the common platforms will always have more than one zone, it'll be 
heavily tested and I'm guessing that distros are always going to have to 
ship kernels with ZONE_DMA for the devices that require it. The only 
platform I see that may have problems at the moment is IA64 which looks 
like the only platform that can have one and only one zone. I am guessing 
that Christoph will catch problems here fairly quickly although a 
non-optional ZONE_MOVABLE would throw a spanner into the works somewhat.

> - We kicked around some quite different ways of implementing the same
>   things, but nothing came of it.  iirc, one was to remove the hard-coded
>   zones altogether and rework all the MM to operate in terms of
> 
>   for (idx = 0; idx < NUMBER_OF_ZONES; idx++)
>   ...

hmm. Assuming the aim is to have a situation where all zone-related loops 
are optimised away at compile-time, it's hard to see an alternative that 
works. Any dynamic way of creating zones at boot time will not have the 
compile-time optimizations and any API that is page-range aware will 
eventually hit the problems zones were made to solve (i.e. unmovable pages 
locked in the lower address ranges).

> - I haven't seen any hard numbers to justify the change.
> 
> So I want to drop them all.



--
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab


Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-01-26 Thread Andrew Morton
On Thu, 25 Jan 2007 23:44:58 +0000 (GMT)
Mel Gorman <[EMAIL PROTECTED]> wrote:

> The following 8 patches against 2.6.20-rc4-mm1 create a zone called
> ZONE_MOVABLE

Argh.  These surely get all tangled up with the
make-zones-optional-by-adding-zillions-of-ifdef patches:

deal-with-cases-of-zone_dma-meaning-the-first-zone.patch
introduce-config_zone_dma.patch
optional-zone_dma-in-the-vm.patch
optional-zone_dma-in-the-vm-no-gfp_dma-check-in-the-slab-if-no-config_zone_dma-is-set.patch
optional-zone_dma-in-the-vm-no-gfp_dma-check-in-the-slab-if-no-config_zone_dma-is-set-reduce-config_zone_dma-ifdefs.patch
optional-zone_dma-for-ia64.patch
remove-zone_dma-remains-from-parisc.patch
remove-zone_dma-remains-from-sh-sh64.patch
set-config_zone_dma-for-arches-with-generic_isa_dma.patch
zoneid-fix-up-calculations-for-zoneid_pgshift.patch

My objections to those patches:

- They add zillions of ifdefs

- They make the VM's behaviour diverge between different platforms and
>   between different configs on the same platforms, and hence degrade
  maintainability and increase complexity.

- We kicked around some quite different ways of implementing the same
  things, but nothing came of it.  iirc, one was to remove the hard-coded
  zones altogether and rework all the MM to operate in terms of

for (idx = 0; idx < NUMBER_OF_ZONES; idx++)
...

- I haven't seen any hard numbers to justify the change.

So I want to drop them all.


Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-01-26 Thread Andrew Morton
On Thu, 25 Jan 2007 23:44:58 + (GMT)
Mel Gorman [EMAIL PROTECTED] wrote:

 The following 8 patches against 2.6.20-rc4-mm1 create a zone called
 ZONE_MOVABLE

Argh.  These surely get all tangled up with the
make-zones-optional-by-adding-zillions-of-ifdef patches:

deal-with-cases-of-zone_dma-meaning-the-first-zone.patch
introduce-config_zone_dma.patch
optional-zone_dma-in-the-vm.patch
optional-zone_dma-in-the-vm-no-gfp_dma-check-in-the-slab-if-no-config_zone_dma-is-set.patch
optional-zone_dma-in-the-vm-no-gfp_dma-check-in-the-slab-if-no-config_zone_dma-is-set-reduce-config_zone_dma-ifdefs.patch
optional-zone_dma-for-ia64.patch
remove-zone_dma-remains-from-parisc.patch
remove-zone_dma-remains-from-sh-sh64.patch
set-config_zone_dma-for-arches-with-generic_isa_dma.patch
zoneid-fix-up-calculations-for-zoneid_pgshift.patch

My objections to those patches:

- They add zillions of ifdefs

- They make the VM's behaviour diverge between different platforms and
  between differen configs on the same platforms, and hence degrade
  maintainability and increase complexity.

- We kicked around some quite different ways of implementing the same
  things, but nothing came of it.  iirc, one was to remove the hard-coded
  zones altogether and rework all the MM to operate in terms of

for (idx = 0; idx  NUMBER_OF_ZONES; idx++)
...

- I haven't seen any hard numbers to justify the change.

So I want to drop them all.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-01-26 Thread Mel Gorman

On Fri, 26 Jan 2007, Andrew Morton wrote:


On Thu, 25 Jan 2007 23:44:58 + (GMT)
Mel Gorman [EMAIL PROTECTED] wrote:


The following 8 patches against 2.6.20-rc4-mm1 create a zone called
ZONE_MOVABLE


Argh.  These surely get all tangled up with the
make-zones-optional-by-adding-zillions-of-ifdef patches:



There may be some entertainment there all right. I didn't see any obvious 
way of avoiding collisions with those patches but for what it's worth, 
ZONE_MOVABLE could also be made optional.


In this patchset, I made no assumptions about the number of zones other 
than the value of MAX_NR_ZONES. There should be no critical collisions but 
I'll look through this patch list and see what I can spot.



deal-with-cases-of-zone_dma-meaning-the-first-zone.patch


This patch looks ok and looks like it stands on it's own.


introduce-config_zone_dma.patch


ok, no collisions here but obviously this patch does not stand on it's 
own.



optional-zone_dma-in-the-vm.patch


There are collisions here with the __ZONE_COUNT stuff but it's not 
difficult to work around.



optional-zone_dma-in-the-vm-no-gfp_dma-check-in-the-slab-if-no-config_zone_dma-is-set.patch
optional-zone_dma-in-the-vm-no-gfp_dma-check-in-the-slab-if-no-config_zone_dma-is-set-reduce-config_zone_dma-ifdefs.patch


There is no cross-over here with the ZONE_MOVABLE patches. They are 
messing around with slab



optional-zone_dma-for-ia64.patch


No collision here


remove-zone_dma-remains-from-parisc.patch
remove-zone_dma-remains-from-sh-sh64.patch


No collisions here either. I see that there were discussions about Power 
potentially doing something similar.



set-config_zone_dma-for-arches-with-generic_isa_dma.patch


No collisions


zoneid-fix-up-calculations-for-zoneid_pgshift.patch



Fun, but no collisions.

To my suprise, I only spotted one major conflict point with 
optional-zone_dma-in-the-vm.patch and that should be easy enough to 
resolve. What I could do is break up one of my patches into 
most-of-the-patch and the-part-that-may-conflict-with-optional-dma-zone . 
The smaller part would then change depending on whether the optional DMA 
zone work is present. Would that be any help?



My objections to those patches:

- They add zillions of ifdefs

- They make the VM's behaviour diverge between different platforms and
 between differen configs on the same platforms, and hence degrade
 maintainability and increase complexity.



I haven't thought about it much so I probably am missing something. The 
major difference I see is when only one zone is present. In that case, a 
number of loops presumably get optimised away and the behavior is very 
different (presumably better although you point out no figures exist to 
prove it). Where there are two or more zones, the code paths should be 
similar whether there are 2, 3 or 4 zones present.


As the common platforms will always have more than one zone, it'll be 
heavily tested and I'm guessing that distros are always going to have to 
ship kernels with ZONE_DMA for the devices that require it. The only 
platform I see that may have problems at the moment is IA64 which looks 
like the only platform that can have one and only one zone. I am guessing 
that Christoph will catch problems here fairly quickly although a 
non-optional ZONE_MOVABLE would throw a spanner into the works somewhat.



- We kicked around some quite different ways of implementing the same
 things, but nothing came of it.  iirc, one was to remove the hard-coded
 zones altogether and rework all the MM to operate in terms of

for (idx = 0; idx  NUMBER_OF_ZONES; idx++)
...



hmm. Assuming the aim is to have a situation where all zone-related loops 
are optimised away at compile-time, it's hard to see an alternative that 
works. Any dynamic way of creating zone at boot time will not have the 
compile-time optimizations and any API that is page-range aware will 
eventually hit the problems zones were made to solve (i.e. unmovable pages 
locked in the lower address ranges).



- I haven't seen any hard numbers to justify the change.

So I want to drop them all.



--
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-01-26 Thread Christoph Lameter
On Fri, 26 Jan 2007, Andrew Morton wrote:

 - They add zillions of ifdefs

They just add a few for ZONE_DMA where we alreaday have similar ifdefs for 
ZONE_DMA32 and ZONE_HIGHMEM.

 - They make the VM's behaviour diverge between different platforms and
   between differen configs on the same platforms, and hence degrade
   maintainability and increase complexity.

They avoid unecessary complexity on platforms. They could be made to work 
on more platforms with measures to deal with what ZONE_DMA 
provides in different ways. There are 6 or so platforms that do not need 
ZONE_DMA at all.

 - We kicked around some quite different ways of implementing the same
   things, but nothing came of it.  iirc, one was to remove the hard-coded
   zones altogether and rework all the MM to operate in terms of
 
   for (idx = 0; idx  NUMBER_OF_ZONES; idx++)
   ...

Hmmm.. How would that be simpler?

 - I haven't seen any hard numbers to justify the change.

I have send you numbers showing significant reductions in code size.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-01-26 Thread Christoph Lameter
On Fri, 26 Jan 2007, Mel Gorman wrote:

 I haven't thought about it much so I probably am missing something. The major
 difference I see is when only one zone is present. In that case, a number of
 loops presumably get optimised away and the behavior is very different
 (presumably better although you point out no figures exist to prove it). Where
 there are two or more zones, the code paths should be similar whether there
 are 2, 3 or 4 zones present.

The balancing of allocations between zones is becoming unnecessary. Also 
in a NUMA system we then have zone == node which allows for a series of 
simplifications.
 
> As the common platforms will always have more than one zone, it'll be heavily
> tested and I'm guessing that distros are always going to have to ship kernels
> with ZONE_DMA for the devices that require it. The only platform I see that
> may have problems at the moment is IA64 which looks like the only platform
> that can have one and only one zone. I am guessing that Christoph will catch
> problems here fairly quickly although a non-optional ZONE_MOVABLE would throw
> a spanner into the works somewhat.

There are 6 platforms that have only one zone. These are not major 
platforms. In order for major platforms to go to a single zone in general 
we would have to implement a generic mechanism to do an allocation where 
one can specify the memory boundaries. Many DMA engines have different
limitations from what ZONE_DMA and ZONE_DMA32 can provide. If such a 
scheme would be implemented then those would be able to utilize memory 
better and the amount of bounce buffers would be reduced.
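[Editorial note: the boundary-constrained allocator Christoph argues for here was never merged in this form. As an illustration only, a toy model of what such an interface might look like; the name alloc_page_range and everything below are hypothetical, not from any posted patch.]

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: one flag per page frame, true = free. */
#define NR_PAGES 64
static bool page_free[NR_PAGES];

/*
 * Hypothetical interface: allocate one free page whose frame number lies
 * in [low_pfn, high_pfn), instead of picking a fixed zone such as
 * ZONE_DMA or ZONE_DMA32.  Returns the pfn, or -1 if none is free.
 */
static long alloc_page_range(unsigned long low_pfn, unsigned long high_pfn)
{
    /* Scan from the top so low frames stay free for restricted callers. */
    for (unsigned long pfn = high_pfn; pfn-- > low_pfn; ) {
        if (pfn < NR_PAGES && page_free[pfn]) {
            page_free[pfn] = false;
            return (long)pfn;
        }
    }
    return -1;
}
```

A device limited to, say, 24-bit DMA could then ask for pages below its own boundary directly, instead of depending on a fixed ZONE_DMA happening to match its limitation.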


Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-01-26 Thread Christoph Lameter
On Thu, 25 Jan 2007, Mel Gorman wrote:

> The following 8 patches against 2.6.20-rc4-mm1 create a zone called
> ZONE_MOVABLE that is only usable by allocations that specify both
> __GFP_HIGHMEM and __GFP_MOVABLE. This has the effect of keeping all
> non-movable pages within a single memory partition while allowing movable
> allocations to be satisfied from either partition.

For arches that do not have HIGHMEM other zones would be okay too it 
seems.

> The size of the zone is determined by a kernelcore= parameter specified at
> boot-time. This specifies how much memory is usable by non-movable allocations
> and the remainder is used for ZONE_MOVABLE. Any range of pages within
> ZONE_MOVABLE can be released by migrating the pages or by reclaiming.

The user has to manually fiddle around with the size of the unmovable 
partition until it works?

> When selecting a zone to take pages from for ZONE_MOVABLE, there are two
> things to consider. First, only memory from the highest populated zone is
> used for ZONE_MOVABLE. On the x86, this is probably going to be ZONE_HIGHMEM
> but it would be ZONE_DMA on ppc64 or possibly ZONE_DMA32 on x86_64. Second,
> the amount of memory usable by the kernel will be spread evenly throughout
> NUMA nodes where possible. If the nodes are not of equal size, the amount
> of memory usable by the kernel on some nodes may be greater than others.

So how is the amount of movable memory on a node calculated? Evenly 
distributed? There are some NUMA architectures that are not that 
symmetric.

> By default, the zone is not as useful for hugetlb allocations because they
> are pinned and non-migratable (currently at least). A sysctl is provided that
> allows huge pages to be allocated from that zone. This means that the huge
> page pool can be resized to the size of ZONE_MOVABLE during the lifetime of
> the system assuming that pages are not mlocked. Despite huge pages being
> non-movable, we do not introduce additional external fragmentation of note
> as huge pages are always the largest contiguous block we care about.

The user already has to specify the partitioning of the system at bootup 
and could take the huge page sizes into account.

Also huge pages may have variable sizes that can be specified on bootup 
for IA64. The assumption that a huge page is always the largest 
contiguous block is *not true*.

The huge page sizes on i386 and x86_64 platforms are contingent on 
their page table structure. This can be completely different on other 
platforms.


Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-01-26 Thread Mel Gorman

On Fri, 26 Jan 2007, Christoph Lameter wrote:


> On Thu, 25 Jan 2007, Mel Gorman wrote:
>
> > The following 8 patches against 2.6.20-rc4-mm1 create a zone called
> > ZONE_MOVABLE that is only usable by allocations that specify both __GFP_HIGHMEM
> > and __GFP_MOVABLE. This has the effect of keeping all non-movable pages
> > within a single memory partition while allowing movable allocations to be
> > satisfied from either partition.
>
> For arches that do not have HIGHMEM other zones would be okay too it
> seems.



It would, but it'd obscure the code to take advantage of that.


> > The size of the zone is determined by a kernelcore= parameter specified at
> > boot-time. This specifies how much memory is usable by non-movable allocations
> > and the remainder is used for ZONE_MOVABLE. Any range of pages within
> > ZONE_MOVABLE can be released by migrating the pages or by reclaiming.
>
> The user has to manually fiddle around with the size of the unmovable
> partition until it works?



They have to fiddle with the size of the unmovable partition if their 
workload uses more unmovable kernel allocations than expected. This was 
always going to be the restriction with using zones for partitioning 
memory. Resizing zones on the fly is not really an option because the 
resizing would only work reliably in one direction.


The anti-fragmentation code could potentially be used to have subzone 
groups that kept movable and unmovable allocations as far apart as 
possible and at opposite ends of a zone. That approach has been kicked a 
few times because of complexity.



> > When selecting a zone to take pages from for ZONE_MOVABLE, there are two
> > things to consider. First, only memory from the highest populated zone is
> > used for ZONE_MOVABLE. On the x86, this is probably going to be ZONE_HIGHMEM
> > but it would be ZONE_DMA on ppc64 or possibly ZONE_DMA32 on x86_64. Second,
> > the amount of memory usable by the kernel will be spread evenly throughout
> > NUMA nodes where possible. If the nodes are not of equal size, the amount
> > of memory usable by the kernel on some nodes may be greater than others.
>
> So how is the amount of movable memory on a node calculated?


Subtle difference. The amount of unmovable memory is calculated per node.


> Evenly
> distributed?


As evenly as possible.


> There are some NUMA architectures that are not that
> symmetric.



I know, it's why find_zone_movable_pfns_for_nodes() is as complex as it 
is. The mechanism spreads the unmovable memory evenly throughout all 
nodes. In the event some nodes are too small to hold their share, the 
remaining unmovable memory is divided between the nodes that are larger.
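[Editorial note: the spreading just described can be sketched as follows. This is a simplification of the idea behind find_zone_movable_pfns_for_nodes(), not the kernel's implementation; all names are illustrative.]

```c
#include <assert.h>
#include <stddef.h>

/*
 * Divide 'kernelcore' pages of unmovable memory across nodes as evenly
 * as possible.  Nodes too small to hold an even share contribute
 * everything they have; the shortfall is redistributed among the nodes
 * that still have room, as Mel describes above.
 */
static void spread_kernelcore(const unsigned long node_pages[],
                              unsigned long node_core[],
                              size_t nr_nodes, unsigned long kernelcore)
{
    unsigned long remaining = kernelcore;
    size_t i;

    for (i = 0; i < nr_nodes; i++)
        node_core[i] = 0;

    while (remaining) {
        /* Count nodes that can still take more unmovable memory. */
        size_t open = 0;
        for (i = 0; i < nr_nodes; i++)
            if (node_core[i] < node_pages[i])
                open++;
        if (!open)
            break; /* kernelcore exceeds total memory */

        /* Hand each open node an even share, capped at its capacity. */
        unsigned long share = remaining / open;
        if (!share)
            share = 1;
        for (i = 0; i < nr_nodes && remaining; i++) {
            unsigned long room = node_pages[i] - node_core[i];
            unsigned long take = share < room ? share : room;
            if (take > remaining)
                take = remaining;
            node_core[i] += take;
            remaining -= take;
        }
    }
}
```

With nodes of 100, 100 and 20 pages and kernelcore of 120 pages, the small node contributes all 20 pages and the two large nodes split the remaining 100 evenly.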



> > By default, the zone is not as useful for hugetlb allocations because they
> > are pinned and non-migratable (currently at least). A sysctl is provided that
> > allows huge pages to be allocated from that zone. This means that the huge
> > page pool can be resized to the size of ZONE_MOVABLE during the lifetime of
> > the system assuming that pages are not mlocked. Despite huge pages being
> > non-movable, we do not introduce additional external fragmentation of note
> > as huge pages are always the largest contiguous block we care about.
>
> The user already has to specify the partitioning of the system at bootup
> and could take the huge page sizes into account.



Not in all cases. Some systems will not know how many huge pages they need 
in advance because they are used as batch systems running jobs as requested. 
The zone allows an amount of memory to be set aside that can be 
*optionally* used for hugepages if desired or base pages if not. Between 
jobs, the hugepage pool can be resized up to the size of ZONE_MOVABLE.


The other case is ever supporting memory hot-remove. Any memory within 
ZONE_MOVABLE can potentially be removed by migrating pages and off-lined.



> Also huge pages may have variable sizes that can be specified on bootup
> for IA64. The assumption that a huge page is always the largest
> contiguous block is *not true*.



I didn't say they were the largest supported contiguous block, I said they 
were the largest contiguous block we *care* about. Right now, it is 
assumed that variable pages are not supported at runtime. If they were, 
some smarts would be needed to keep huge pages of the same size together 
to control external fragmentation but that's about it.



> The huge page sizes on i386 and x86_64 platforms are contingent on
> their page table structure. This can be completely different on other
> platforms.



The size doesn't really make much difference to the mechanism.

--
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab


Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-01-26 Thread Christoph Lameter
On Fri, 26 Jan 2007, Mel Gorman wrote:

> > For arches that do not have HIGHMEM other zones would be okay too it
> > seems.
> It would, but it'd obscure the code to take advantage of that.

No MOVABLE memory for 64 bit platforms that do not have HIGHMEM right now?

> The anti-fragmentation code could potentially be used to have subzone groups
> that kept movable and unmovable allocations as far apart as possible and at
> opposite ends of a zone. That approach has been kicked a few times because of
> complexity.

Hmm... But this patch also introduces additional complexity, plus it's 
difficult to handle for the end user.

> > There are some NUMA architectures that are not that
> > symmetric.
> I know, it's why find_zone_movable_pfns_for_nodes() is as complex as it is.
> The mechanism spreads the unmovable memory evenly throughout all nodes. In the
> event some nodes are too small to hold their share, the remaining unmovable
> memory is divided between the nodes that are larger.

I would have expected a percentage of a node. If equal amounts of 
unmovable memory are assigned to all nodes at first then there will be 
large disparities in the amount of movable memory, e.g. between a node 
with 8G memory and a node with 1GB memory.

How do you handle headless nodes? I.e. memory nodes with no processors? 
Those may be particularly large compared to the rest but these are mainly 
used for movable pages since unmovable things like device driver buffers
have to be kept near the processors that take the interrupt.


Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-01-26 Thread Mel Gorman

On Fri, 26 Jan 2007, Christoph Lameter wrote:


> On Fri, 26 Jan 2007, Mel Gorman wrote:
>
> > > For arches that do not have HIGHMEM other zones would be okay too it
> > > seems.
> >
> > It would, but it'd obscure the code to take advantage of that.
>
> No MOVABLE memory for 64 bit platforms that do not have HIGHMEM right now?



err, no, I misinterpreted what you meant by "other zones would be okay"... I
thought you were suggesting the reuse of zone names for some reason.


The zone used for ZONE_MOVABLE is the highest populated zone on the
architecture. On some architectures, that will be ZONE_HIGHMEM. On others,
it will be ZONE_DMA. See the function find_usable_zone_for_movable()


ZONE_MOVABLE never spans zones. For example, it will not use some 
ZONE_HIGHMEM and some ZONE_NORMAL memory.



> > The anti-fragmentation code could potentially be used to have subzone groups
> > that kept movable and unmovable allocations as far apart as possible and at
> > opposite ends of a zone. That approach has been kicked a few times because of
> > complexity.
>
> Hmm... But this patch also introduces additional complexity, plus it's
> difficult to handle for the end user.



It's harder for the user to set up, all right. But it works within limits 
that are known well in advance and doesn't add additional code to the main 
allocator path. Once it's set up, it acts like any other zone, and zone 
behavior is better understood than anti-fragmentation's behavior.



> > > There are some NUMA architectures that are not that
> > > symmetric.
> >
> > I know, it's why find_zone_movable_pfns_for_nodes() is as complex as it is.
> > The mechanism spreads the unmovable memory evenly throughout all nodes. In the
> > event some nodes are too small to hold their share, the remaining unmovable
> > memory is divided between the nodes that are larger.
>
> I would have expected a percentage of a node. If equal amounts of
> unmovable memory are assigned to all nodes at first then there will be
> large disparities in the amount of movable memory, e.g. between a node
> with 8G memory and a node with 1GB memory.



On the other hand, percentages make it harder for the administrator to 
know in advance how much unmovable memory will be available when the 
system starts even if the machine changes configuration. The absolute 
figure is easier to understand. If there was a requirement, an alternative 
configuration option could be made available that takes a fixed percentage 
of each node with memory.



> How do you handle headless nodes? I.e. memory nodes with no processors?

The code only cares about memory, not processors.

> Those may be particularly large compared to the rest but these are mainly
> used for movable pages since unmovable things like device driver buffers
> have to be kept near the processors that take the interrupt.



Then what I'd do is specify kernelcore to be

(number_of_nodes_with_processors * 
largest_amount_of_memory_on_node_with_processors)

That would have all memory near processors available as unmovable memory 
(that movable allocations will still use so they don't always go remote) 
while keeping a large amount of memory on the headless nodes for movable 
allocations only.
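[Editorial note: Mel's rule of thumb above can be written out directly. The struct and the page counts in the test are illustrative, not from any real topology.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct node_info {
    unsigned long pages;  /* memory on this node, in pages */
    bool has_cpus;        /* false for headless memory-only nodes */
};

/*
 * kernelcore = (number of nodes with processors)
 *            * (largest amount of memory on any node with processors),
 * which leaves the headless nodes' memory to ZONE_MOVABLE.
 */
static unsigned long suggested_kernelcore(const struct node_info nodes[],
                                          size_t nr_nodes)
{
    size_t cpu_nodes = 0;
    unsigned long largest = 0;

    for (size_t i = 0; i < nr_nodes; i++) {
        if (!nodes[i].has_cpus)
            continue;
        cpu_nodes++;
        if (nodes[i].pages > largest)
            largest = nodes[i].pages;
    }
    return cpu_nodes * largest;
}
```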


If requirements demanded, a configuration option could be made that allows 
the administrator to specify exactly how much unmovable memory he wants on 
a specific node.


--
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab


Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-01-26 Thread Andrew Morton
On Fri, 26 Jan 2007 07:56:09 -0800 (PST)
Christoph Lameter [EMAIL PROTECTED] wrote:

> On Fri, 26 Jan 2007, Andrew Morton wrote:
>
> > - They add zillions of ifdefs
>
> They just add a few for ZONE_DMA where we already have similar ifdefs for
> ZONE_DMA32 and ZONE_HIGHMEM.

I refreshed my memory.  It remains awful.

> > - They make the VM's behaviour diverge between different platforms and
> >   between different configs on the same platforms, and hence degrade
> >   maintainability and increase complexity.
>
> They avoid unnecessary complexity on platforms. They could be made to work
> on more platforms with measures to deal with what ZONE_DMA
> provides in different ways. There are 6 or so platforms that do not need
> ZONE_DMA at all.

As Mel points out, distros will ship with CONFIG_ZONE_DMA=y, so the number
of machines which will actually benefit from this change is really small. 
And the benefit to those few machines will also, I suspect, be small.

> > - We kicked around some quite different ways of implementing the same
> >   things, but nothing came of it.  iirc, one was to remove the hard-coded
> >   zones altogether and rework all the MM to operate in terms of
> >
> >   for (idx = 0; idx < NUMBER_OF_ZONES; idx++)
> >   ...
>
> Hmmm.. How would that be simpler?

Replace a sprinkle of open-coded ifdefs with a regular code sequence which
everyone uses.  Pretty obvious, I'd thought.

Plus it becomes straightforward to extend this from the present four zones
to a complete 12 zones, which gives us the full set of
ZONE_DMA20,ZONE_DMA21,...,ZONE_DMA32 for those funny devices.

> > - I haven't seen any hard numbers to justify the change.
>
> I have sent you numbers showing significant reductions in code size.

If it isn't in the changelog it doesn't exist.  I guess I didn't copy it
into the changelog.

If the only demonstrable benefit is a saving of a few k of text on a small
number of machines then things are looking very grim, IMO.



Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-01-26 Thread Christoph Lameter
On Fri, 26 Jan 2007, Andrew Morton wrote:

> As Mel points out, distros will ship with CONFIG_ZONE_DMA=y, so the number
> of machines which will actually benefit from this change is really small.
> And the benefit to those few machines will also, I suspect, be small.
>
> > > - We kicked around some quite different ways of implementing the same
> > >   things, but nothing came of it.  iirc, one was to remove the hard-coded
> > >   zones altogether and rework all the MM to operate in terms of
> > >
> > >   for (idx = 0; idx < NUMBER_OF_ZONES; idx++)
> > >   ...
> >
> > Hmmm.. How would that be simpler?
>
> Replace a sprinkle of open-coded ifdefs with a regular code sequence which
> everyone uses.  Pretty obvious, I'd thought.

We do use such loops in many places. However, stuff like array 
initialization and special casing cannot use a loop. I am not sure what we 
could change there. The hard coding is necessary because each zone 
currently has these invariant characteristics that we need to consider. 
Reducing the number of zones reduces the amount of special casing in the 
VM that needs to be considered at run time, and each such special case is 
a potential source of trouble.

> Plus it becomes straightforward to extend this from the present four zones
> to a complete 12 zones, which gives us the full set of
> ZONE_DMA20,ZONE_DMA21,...,ZONE_DMA32 for those funny devices.

I just hope we can handle the VM complexity of load balancing etc etc that 
this will introduce. Also each zone has management overhead and will cause 
the touching of additional cachelines on many VM operations. Much of that 
management overhead becomes unnecessary if we reduce zones.

> If the only demonstrable benefit is a saving of a few k of text on a small
> number of machines then things are looking very grim, IMO.

The main benefit is a significant simplification of the VM, leading to 
robust and reliable operations and a reduction of the maintenance 
headaches coming with the additional zones.

If we introduced the ability to allocate from a range of physical 
addresses then the need for DMA zones would go away, allowing 
flexibility for device driver DMA allocations, and at the same time we 
would get rid of special casing in the VM.


Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-01-26 Thread Andrew Morton
On Fri, 26 Jan 2007 11:58:18 -0800 (PST)
Christoph Lameter [EMAIL PROTECTED] wrote:

> > If the only demonstrable benefit is a saving of a few k of text on a small
> > number of machines then things are looking very grim, IMO.
>
> The main benefit is a significant simplification of the VM, leading to
> robust and reliable operations and a reduction of the maintenance
> headaches coming with the additional zones.
>
> If we introduced the ability to allocate from a range of physical
> addresses then the need for DMA zones would go away, allowing
> flexibility for device driver DMA allocations, and at the same time we
> would get rid of special casing in the VM.

None of this is valid.  The great majority of machines out there will
continue to have the same number of zones.  Nothing changes.

What will happen is that a small number of machines will have different
runtime behaviour.  So they don't benefit from the majority's testing and
they don't contribute to it, and they potentially have unique-to-them
problems which we need to worry about.

That's all a real cost, so we need to see *good* benefits to outweigh that
cost.  Thus far I don't think we've seen that.



[PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

2007-01-25 Thread Mel Gorman
The following 8 patches against 2.6.20-rc4-mm1 create a zone called
ZONE_MOVABLE that is only usable by allocations that specify both __GFP_HIGHMEM
and __GFP_MOVABLE. This has the effect of keeping all non-movable pages
within a single memory partition while allowing movable allocations to be
satisfied from either partition.
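[Editorial note: the flag gating described above amounts to a simple mask test. The flag values below are illustrative toy constants, not the kernel's actual gfp bits.]

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits; the kernel's real gfp values differ. */
#define TOY_GFP_HIGHMEM 0x01u
#define TOY_GFP_MOVABLE 0x02u

/*
 * A request may be satisfied from ZONE_MOVABLE only when it asserts both
 * that the page needs no permanent kernel mapping (HIGHMEM) and that it
 * can be migrated or reclaimed later (MOVABLE).
 */
static bool can_use_zone_movable(unsigned int gfp_flags)
{
    unsigned int need = TOY_GFP_HIGHMEM | TOY_GFP_MOVABLE;
    return (gfp_flags & need) == need;
}
```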

The size of the zone is determined by a kernelcore= parameter specified at
boot-time. This specifies how much memory is usable by non-movable allocations
and the remainder is used for ZONE_MOVABLE. Any range of pages within
ZONE_MOVABLE can be released by migrating the pages or by reclaiming.

When selecting a zone to take pages from for ZONE_MOVABLE, there are two
things to consider. First, only memory from the highest populated zone is
used for ZONE_MOVABLE. On the x86, this is probably going to be ZONE_HIGHMEM
but it would be ZONE_DMA on ppc64 or possibly ZONE_DMA32 on x86_64. Second,
the amount of memory usable by the kernel will be spread evenly throughout
NUMA nodes where possible. If the nodes are not of equal size, the amount
of memory usable by the kernel on some nodes may be greater than others.

By default, the zone is not as useful for hugetlb allocations because they
are pinned and non-migratable (currently at least). A sysctl is provided that
allows huge pages to be allocated from that zone. This means that the huge
page pool can be resized to the size of ZONE_MOVABLE during the lifetime of
the system assuming that pages are not mlocked. Despite huge pages being
non-movable, we do not introduce additional external fragmentation of note
as huge pages are always the largest contiguous block we care about.

A lot of credit goes to Andy Whitcroft for catching a large variety of
problems during review of the patches.
-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab

