Re: [PATCH] change global zonelist order v4 [0/2]

2007-05-04 Thread Christoph Lameter
On Fri, 4 May 2007, Jesse Barnes wrote:

> You mentioned that if node 0 has a small ZONE_NORMAL and the ZONE_DMA for 
> the system, defaulting to using ZONE_NORMAL on all nodes first would be a 
> bad idea.  Is that really true?  Maybe for ZONE_DMA32 it is since that 
> first node could have a few gigs of memory, but for regular ZONE_DMA it's 
> probably the right thing to do...

If the fallback sequence is, e.g., Node 0 NORMAL (500M), Node 1 NORMAL (4G),
Node 2 NORMAL (4G) ... many more ... Node 0 DMA32 (~4G), Node 0 DMA, then
memory is frequently not going to be optimally placed for allocations from
processes running on node 0, because node 0 is memory starved.
Allocations will be made from node 1, which may create a shortage there
that falls back again to the next node. This could have a cascade effect
because the symmetry in memory is no longer there.
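
To make the placement problem concrete, here is a minimal user-space sketch in Python (a model only, not kernel code). It walks the zone-first fallback list from the example above for a task on node 0. The 16 MB ZONE_DMA size, the 2 GB request and the helper name allocate() are assumptions made for illustration.

```python
# Zone-first fallback list as seen from node 0; free memory in MB.
# Sizes follow the example above; the 16 MB DMA zone is an assumption.
zonelist = [
    ["node0", "NORMAL", 500],
    ["node1", "NORMAL", 4096],
    ["node2", "NORMAL", 4096],
    ["node0", "DMA32", 4096],
    ["node0", "DMA", 16],
]


def allocate(zonelist, mb):
    """Take `mb` megabytes by walking the list in order.

    Returns a dict mapping node name to the MB taken from that node.
    """
    taken = {}
    for entry in zonelist:
        node, _zone, free = entry
        grab = min(free, mb)
        if grab:
            entry[2] -= grab
            taken[node] = taken.get(node, 0) + grab
            mb -= grab
        if mb == 0:
            break
    return taken


# A 2 GB allocation made by a process running on node 0.
placement = allocate(zonelist, 2048)
print(placement)
```

With these numbers only 500 MB of the request is node-local and the rest lands on node 1, even though node 0 still has its whole DMA32 zone free.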

The proposal to create an additional node may solve that to some extent by
placing the DMA node nearer to node 0.

Maybe the best approach is to leave things as is and just be careful with
I/O that is limited to 32 bits? I do not think there is an easy solution.
64 bit NUMA platforms should have I/O that is 64 bit capable and not
restricted to DMA zones.

> So aside from the comment issues Lee already pointed out, I think
> Kamezawa-san's patch from
> http://marc.info/?l=linux-mm&m=117758484122663&w=4 seems reasonable.

If we are going to do this then the patch needs to be fine tuned first and
the impact on core code needs to be minimized. I want to make really sure
that platforms without DMA zones work right and that empty zones are
handled correctly. Weird x86_64 combinations of NORMAL, DMA and DMA32
distributed over various nodes would also need to be covered and tested
first.

How will this affect NUMAQ (32 bit NUMA), where we have HIGHMEM on
(most) nodes and NORMAL/DMA on node 0?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] change global zonelist order v4 [0/2]

2007-05-04 Thread Jesse Barnes
On Friday, May 04, 2007, Christoph Lameter wrote:
> On Fri, 4 May 2007, Lee Schermerhorn wrote:
> > Hmmm...  "serious hackery", indeed!  ;-)
>
> Maybe on the arch level but minimal changes to core code.
> And it is a step towards avoiding zones in NUMA.

You mentioned that if node 0 has a small ZONE_NORMAL and the ZONE_DMA for 
the system, defaulting to using ZONE_NORMAL on all nodes first would be a 
bad idea.  Is that really true?  Maybe for ZONE_DMA32 it is since that 
first node could have a few gigs of memory, but for regular ZONE_DMA it's 
probably the right thing to do...

So aside from the comment issues Lee already pointed out, I think 
Kamezawa-san's patch from 
http://marc.info/?l=linux-mm&m=117758484122663&w=4 seems reasonable.

Jesse


Re: [PATCH] change global zonelist order v4 [0/2]

2007-05-04 Thread Christoph Lameter
On Fri, 4 May 2007, Lee Schermerhorn wrote:

> Hmmm...  "serious hackery", indeed!  ;-)

Maybe on the arch level but minimal changes to core code.
And it is a step towards avoiding zones in NUMA.



Re: [PATCH] change global zonelist order v4 [0/2]

2007-05-04 Thread Lee Schermerhorn
On Fri, 2007-05-04 at 09:18 -0700, Christoph Lameter wrote:
> On Fri, 4 May 2007, Jesse Barnes wrote:
>
> > I think the idea is to avoid exhausting ZONE_DMA on some NUMA boxes by
> > ordering the fallback list first by zone, then by node distance (e.g.
> > ZONE_NORMAL of local node, then ZONE_NORMAL of next nearest node etc.,
> > followed by ZONE_DMA of local node, ZONE_DMA of next nearest node, etc.).
>
> Maybe it would be cleaner to set up a DMA and DMA32 "node" and define
> them at a certain distance to the rest of the nodes that only contain
> ZONE_NORMAL (or the zone that is replicated on all nodes). Then we would
> have that effect without reworking zone list generation. Plus in the long
> run we may then be able to get to 1 zone per node avoiding the
> difficulties coming with zone fallback altogether.
>
> > Another option would be to make this behavior automatic if both ZONE_DMA
> > and ZONE_NORMAL had pages.  I initially wrote this stuff with the idea
> > that machines that really needed it would have all their memory in
> > ZONE_DMA, but obviously that's not the case, so some more smarts are
> > needed.
>
> I think what would work is to first set up nodes that use the highest zone.
> Then add virtual nodes for the lower zones that may only exist on a single
> node.
>
> E.g. a 4 node x86_64 box may have
>
> Node
> 0 ZONE_NORMAL
> 1 ZONE_NORMAL
> 2 ZONE_NORMAL
> 3 ZONE_NORMAL
> 4 ZONE_DMA32
> 5 [additional ZONE_DMA32 if zone DMA32 is split over multiple nodes]
> 6 ZONE_DMA
>
> The SLIT information can be used to control how the nodes fall back to the
> DMA32 nodes on 4 and 5. Node 6 would be given a very high SLIT distance so
> that it would be used only if an actual __GFP_DMA occurs or the system
> really runs into memory difficulties.

Hmmm...  "serious hackery", indeed!  ;-)

Lee



Re: [PATCH] change global zonelist order v4 [0/2]

2007-05-04 Thread Lee Schermerhorn
On Thu, 2007-05-03 at 22:47 -0700, Andrew Morton wrote:
> On Fri, 27 Apr 2007 14:45:30 +0900 KAMEZAWA Hiroyuki <[EMAIL PROTECTED]>
> wrote:
>
> > Hi, this is version 4. including Lee Schermerhorn's good rework.
> > and automatic configuration at boot time.
> 
> hm, this adds rather a lot of code.  Have we established that it's worth
> it?

See below.  Something is needed here on some platforms.  The current
zonelist ordering results in some unfortunate behavior on some
platforms.


> 
> And it's complex - how do poor users know what to do with this new control?
> 
Kame's autoconfig seems to be doing the right thing for our platform.
Might not be the case for other platforms, or some workloads on them.  I
suppose the documentation in sysctl.txt could be expanded to describe
when you might want to select a non-default setting, should we decide to
provide that capability.

>
> This:
>
> + *   = "[dD]efault"|"0" - default, automatic configuration.
> + *   = "[nN]ode"|"1" - order by node locality,
> + * then zone within node.
> + *   = "[zZ]one"|"2" - order by zone, then by locality within zone
> 
> seems a bit excessive.  I think just the 0/1/2 plus documentation would
> suffice?

I agree, but I was considering dropping the "0/1/2" in favor of the more
descriptive [IMO] values ;-).
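
For reference, the values listed in the quoted comment could be parsed along these lines. This is a user-space Python sketch of the accepted spellings only, not the parser from the patch; the function name parse_zonelist_order() and the choice to reject anything else with ValueError are assumptions.

```python
# Descriptive names and their numeric equivalents, per the quoted comment.
ORDERS = {"default": 0, "node": 1, "zone": 2}


def parse_zonelist_order(value):
    """Map an option string to 0 (default), 1 (node) or 2 (zone)."""
    text = value.strip()
    if text in ("0", "1", "2"):
        return int(text)
    # "[dD]efault" etc. allow either case for the first letter only.
    key = text[:1].lower() + text[1:]
    if key in ORDERS:
        return ORDERS[key]
    raise ValueError(f"unrecognized zonelist order: {value!r}")


print(parse_zonelist_order("Zone"))
print(parse_zonelist_order("1"))
```

Dropping the numeric forms, as suggested above, would amount to removing the first branch.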

> 
> 
> I haven't followed this discussion very closely I'm afraid.  If we came up
> with a good reason why Linux needs this feature then could someone please
> (re)describe it?

Kame originally described the need for it in:

http://marc.info/?l=linux-mm&m=117747120307559&w=4

I chimed in with support as we have a similar need for our cell-based
ia64 platforms:

http://marc.info/?l=linux-mm&m=117760331328012&w=4

I can easily consume all of DMA on our platforms [configured as 100%
"cell local memory" -- always leaves some "cache-line interleaved" at
phys addr zero => ZONE_DMA] by allocating, e.g., a shared memory segment
of size > 1 node's memory + size of ZONE_DMA.  This occurs because the
node containing zone DMA is always 2nd in a node's ZONE_NORMAL zonelist
[after the node itself, assuming it has memory].  Then, any driver that
requests memory from ZONE_DMA will be denied, resulting in IO errors,
death of hald [maybe that's a feature? ;-)], ...
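
The exhaustion scenario can be modelled in a few lines of Python. This is a sketch under assumed sizes (4 GB per node, a 512 MB ZONE_DMA on a separate pseudo-node called dma_node); the names are hypothetical and the model ignores watermarks and reclaim.

```python
def fresh():
    """Free memory in MB per (node, zone); the sizes are assumptions."""
    return {
        ("node0", "NORMAL"): 4096,
        ("dma_node", "DMA"): 512,
        ("node1", "NORMAL"): 4096,
    }


# Node order: the DMA node comes 2nd, right after the local node.
NODE_ORDER = [("node0", "NORMAL"), ("dma_node", "DMA"), ("node1", "NORMAL")]
# Zone order: ZONE_NORMAL of every node comes before ZONE_DMA.
ZONE_ORDER = [("node0", "NORMAL"), ("node1", "NORMAL"), ("dma_node", "DMA")]


def dma_free_after(zonelist, request_mb):
    """Return the MB left in ZONE_DMA after serving `request_mb` in list order."""
    free = fresh()
    for key in zonelist:
        grab = min(free[key], request_mb)
        free[key] -= grab
        request_mb -= grab
    return free[("dma_node", "DMA")]


# A segment just larger than one node's memory plus ZONE_DMA.
segment = 4096 + 512 + 1
print(dma_free_after(NODE_ORDER, segment))  # 0
print(dma_free_after(ZONE_ORDER, segment))  # 512
```

Under node order the segment drains ZONE_DMA completely before touching node1, so a later ZONE_DMA request has nothing to take; under zone order ZONE_DMA is untouched.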

I guess I would be happy with Kame's V3 patch that unconditionally
changes the order to be zone first--i.e., ZONE_NORMAL for all nodes
before ZONE_DMA*:

http://marc.info/?l=linux-mm&m=117758484122663&w=4

However, this patch apparently crossed in the mail with Christoph's
observation that making the new order [zone order] the default w/o any
option wouldn't be appropriate for some configurations:

http://marc.info/?l=linux-mm&m=117760245022005&w=4

Meanwhile, I was factoring out common code in Kame's V1/V2 patch and
adding the "excessive" user interface to the boot parameter/sysctl.
After some additional rework, Kame posted this as V4--the one you're
questioning.

If we decide to proceed with this, I have another "cleanup" patch that
eliminates some redundant "estimating of zone order" [autoconfig] and
reports what order was chosen in the "Build %d zonelists..." message.


Lee



Re: [PATCH] change global zonelist order v4 [0/2]

2007-05-04 Thread Christoph Lameter
On Fri, 4 May 2007, Jesse Barnes wrote:

> I think the idea is to avoid exhausting ZONE_DMA on some NUMA boxes by 
> ordering the fallback list first by zone, then by node distance (e.g. 
> ZONE_NORMAL of local node, then ZONE_NORMAL of next nearest node etc., 
> followed by ZONE_DMA of local node, ZONE_DMA of next nearest node, etc.).

Maybe it would be cleaner to set up a DMA and DMA32 "node" and define
them at a certain distance to the rest of the nodes that only contain
ZONE_NORMAL (or the zone that is replicated on all nodes). Then we would
have that effect without reworking zone list generation. Plus in the long
run we may then be able to get to 1 zone per node avoiding the
difficulties coming with zone fallback altogether.

> Another option would be to make this behavior automatic if both ZONE_DMA 
> and ZONE_NORMAL had pages.  I initially wrote this stuff with the idea 
> that machines that really needed it would have all their memory in 
> ZONE_DMA, but obviously that's not the case, so some more smarts are 
> needed.

I think what would work is to first set up nodes that use the highest zone.
Then add virtual nodes for the lower zones that may only exist on a single
node.

E.g. a 4 node x86_64 box may have

Node
0   ZONE_NORMAL
1   ZONE_NORMAL
2   ZONE_NORMAL
3   ZONE_NORMAL
4   ZONE_DMA32
5   [additional ZONE_DMA32 if zone DMA32 is split over multiple nodes]
6   ZONE_DMA

The SLIT information can be used to control how the nodes fall back to the
DMA32 nodes on 4 and 5. Node 6 would be given a very high SLIT distance so 
that it would be used only if an actual __GFP_DMA occurs or the system 
really runs into memory difficulties.
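
A rough Python model of this layout, assuming one zone per node and a fallback order derived purely from distance. Only the local distance of 10 follows the ACPI SLIT convention; the other distances (20, 40, 200) and the names slit_row() and fallback_order() are made up for illustration, and node 5 is left out on the assumption that DMA32 is not split.

```python
LOCAL_DISTANCE = 10  # SLIT convention for a node's distance to itself

# One zone per node, as in the table above (node 5 omitted).
node_zone = {
    0: "NORMAL",
    1: "NORMAL",
    2: "NORMAL",
    3: "NORMAL",
    4: "DMA32",
    6: "DMA",
}


def slit_row(local):
    """Assumed distances from `local` to every node, virtual nodes included."""
    row = {}
    for node, zone in node_zone.items():
        if node == local:
            row[node] = LOCAL_DISTANCE
        elif zone == "NORMAL":
            row[node] = 20   # ordinary remote node
        elif zone == "DMA32":
            row[node] = 40   # farther than any ZONE_NORMAL node
        else:
            row[node] = 200  # ZONE_DMA node: last resort
    return row


def fallback_order(local):
    """Nodes sorted by distance, ties broken by node number."""
    row = slit_row(local)
    return sorted(row, key=lambda n: (row[n], n))


print(fallback_order(2))  # [2, 0, 1, 3, 4, 6]
```

Because each node holds a single zone, sorting by distance alone yields the zone-first behaviour without any change to how zonelists are generated.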



Re: [PATCH] change global zonelist order v4 [0/2]

2007-05-04 Thread Jesse Barnes
On Thursday, May 03, 2007, Andrew Morton wrote:
> On Fri, 27 Apr 2007 14:45:30 +0900 KAMEZAWA Hiroyuki
> <[EMAIL PROTECTED]> wrote:
> > Hi, this is version 4. including Lee Schermerhorn's good rework.
> > and automatic configuration at boot time.
>
> hm, this adds rather a lot of code.  Have we established that it's worth
> it?
>
> And it's complex - how do poor users know what to do with this new
> control?
>
>
> This:
>
> + *   = "[dD]efault"|"0" - default, automatic configuration.
> + *   = "[nN]ode"|"1" - order by node locality,
> + * then zone within node.
> + *   = "[zZ]one"|"2" - order by zone, then by locality within zone
>
> seems a bit excessive.  I think just the 0/1/2 plus documentation would
> suffice?
>
>
> I haven't followed this discussion very closely I'm afraid.  If we came
> up with a good reason why Linux needs this feature then could someone
> please (re)describe it?

I think the idea is to avoid exhausting ZONE_DMA on some NUMA boxes by 
ordering the fallback list first by zone, then by node distance (e.g. 
ZONE_NORMAL of local node, then ZONE_NORMAL of next nearest node etc., 
followed by ZONE_DMA of local node, ZONE_DMA of next nearest node, etc.).
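
The difference between the two orderings can be sketched in Python as follows. This is a user-space model, not the patch itself; build_zonelist() and its arguments are hypothetical names, and the two-node layout is an assumed example.

```python
# Zone types from highest to lowest.
ZONES = ["NORMAL", "DMA32", "DMA"]


def build_zonelist(distance, populated, order):
    """Build the fallback list of (node, zone) pairs for one local node.

    distance:  dict mapping node -> distance from the local node
    populated: set of (node, zone) pairs that actually have pages
    order:     "node" (node first, then zone) or "zone" (zone first, then node)
    """
    nodes = sorted(distance, key=lambda n: distance[n])
    if order == "node":
        pairs = [(n, z) for n in nodes for z in ZONES]
    elif order == "zone":
        pairs = [(n, z) for z in ZONES for n in nodes]
    else:
        raise ValueError(order)
    return [p for p in pairs if p in populated]


# Node 0 (local) holds ZONE_NORMAL and ZONE_DMA; node 1 only ZONE_NORMAL.
distance = {0: 10, 1: 20}
populated = {(0, "NORMAL"), (0, "DMA"), (1, "NORMAL")}

print(build_zonelist(distance, populated, "node"))
# [(0, 'NORMAL'), (0, 'DMA'), (1, 'NORMAL')]
print(build_zonelist(distance, populated, "zone"))
# [(0, 'NORMAL'), (1, 'NORMAL'), (0, 'DMA')]
```

In node order the local ZONE_DMA is tried before any remote memory; in zone order it is the last resort.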

As for documentation, it would be good if the "default" behavior was 
described as well (it's mostly by node first, then by zone iirc, but has a 
few other tweaks).

Another option would be to make this behavior automatic if both ZONE_DMA 
and ZONE_NORMAL had pages.  I initially wrote this stuff with the idea 
that machines that really needed it would have all their memory in 
ZONE_DMA, but obviously that's not the case, so some more smarts are 
needed.
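
One way to read the "automatic" option, as a Python sketch: pick zone order only when both ZONE_DMA and ZONE_NORMAL are populated somewhere in the system. The function name pick_order() and the input layout are assumptions, and this is not the autoconfig logic from the V4 patch.

```python
def pick_order(pages_by_node):
    """Choose "zone" or "node" ordering.

    pages_by_node: dict mapping node -> {zone name: page count}
    """
    totals = {}
    for zones in pages_by_node.values():
        for zone, pages in zones.items():
            totals[zone] = totals.get(zone, 0) + pages
    if totals.get("DMA", 0) > 0 and totals.get("NORMAL", 0) > 0:
        return "zone"
    return "node"


# All memory in ZONE_DMA: nothing to protect, keep node order.
print(pick_order({0: {"DMA": 1000}, 1: {"DMA": 1000}}))
# A small ZONE_DMA on node 0 plus ZONE_NORMAL everywhere: use zone order.
print(pick_order({0: {"DMA": 16, "NORMAL": 500}, 1: {"NORMAL": 1000}}))
```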

Jesse



Re: [PATCH] change global zonelist order v4 [0/2]

2007-05-03 Thread Andrew Morton
On Fri, 27 Apr 2007 14:45:30 +0900 KAMEZAWA Hiroyuki <[EMAIL PROTECTED]> wrote:

> Hi, this is version 4. including Lee Schermerhorn's good rework.
> and automatic configuration at boot time.

hm, this adds rather a lot of code.  Have we established that it's worth
it?

And it's complex - how do poor users know what to do with this new control?


This:

+ * = "[dD]efault"|"0" - default, automatic configuration.
+ * = "[nN]ode"|"1" - order by node locality,
+ *   then zone within node.
+ * = "[zZ]one"|"2" - order by zone, then by locality within zone

seems a bit excessive.  I think just the 0/1/2 plus documentation would
suffice?


I haven't followed this discussion very closely I'm afraid.  If we came up
with a good reason why Linux needs this feature then could someone please
(re)describe it?

Thanks.