RE: IO regression after ab8fabd46f on x86 kernels with high memory

2013-06-02 Thread Yuhong Bao
> We kernel guys have been asking the distros to ship 64-bit kernels even
> in their 32-bit distros for many years, but concerns of compat issues
> and the desire to deprecate 32-bit userspace seem to have kept that
> from happening.

And now there is another reason: to call 64-bit EFI runtime services.
In retrospect, I would have stuck with 32-bit EFI with 64-bit kernels calling 
runtime services in compatibility mode, but of course it is too late for that 
now.

Yuhong Bao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: IO regression after ab8fabd46f on x86 kernels with high memory

2013-05-08 Thread H. Peter Anvin
On 04/29/2013 03:03 PM, Linus Torvalds wrote:
> On Mon, Apr 29, 2013 at 2:53 PM, Pierre-Loup A. Griffais wrote:
>>
>> Other than this particular concern, what's the high-level take-away? Is PAE
>> support in the Linux kernel a false promise that distros should not be
>> shipping by default, if at all? Should it be removed from the kernel
>> entirely if these configurations are knowingly broken by commits like this?
>
> PAE is "make it barely work". The whole concept is fundamentally
> flawed, and anybody who runs a 32-bit kernel with 16GB of RAM doesn't
> even understand *how* flawed and stupid that is.
> 

Let's be straight... the problem isn't PAE per se, the problem is
*HIGHMEM*.  PAE just allows HIGHMEM to stretch further into problematic
territory.

Distros install PAE kernels by default because it is required to support
NX.  That is fine.

The problem is that once your memory crosses the HIGHMEM threshold
-- 896 MiB in the normal configuration -- then you are in "this is going
to hurt" territory.  I have seen HIGHMEM devastate performance without
even crossing the 4 GiB threshold where PAE is required.
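
To make the 896 MiB figure concrete, here is a minimal sketch, not taken from the thread, of how installed RAM divides into lowmem and highmem under that threshold. It assumes a fixed 896 MiB lowmem limit and ignores memory holes and other firmware reservations; `split_memory` and `LOWMEM_LIMIT` are hypothetical names used only for illustration.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB

# Threshold quoted in the message for "the normal configuration".
LOWMEM_LIMIT = 896 * MIB


def split_memory(total_bytes, lowmem_limit=LOWMEM_LIMIT):
    """Split RAM into (lowmem, highmem) bytes under a fixed lowmem limit.

    Simplified model: everything above the limit counts as highmem.
    """
    lowmem = min(total_bytes, lowmem_limit)
    highmem = total_bytes - lowmem
    return lowmem, highmem


if __name__ == "__main__":
    for total in (512 * MIB, 2 * GIB, 4 * GIB, 16 * GIB):
        low, high = split_memory(total)
        print(f"total={total // MIB:6d} MiB  lowmem={low // MIB:4d} MiB  "
              f"highmem={high // MIB:6d} MiB  highmem/lowmem={high / low:.1f}")
```

Under this model a 16 GiB machine has roughly 17 times as much highmem as lowmem, while a 512 MiB machine has no highmem at all.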

We kernel guys have been asking the distros to ship 64-bit kernels even
in their 32-bit distros for many years, but concerns of compat issues
and the desire to deprecate 32-bit userspace seem to have kept that
from happening.

-hpa


Re: IO regression after ab8fabd46f on x86 kernels with high memory

2013-05-01 Thread Sonny Rao
On Mon, Apr 29, 2013 at 3:08 PM, Pierre-Loup A. Griffais wrote:
> On 04/29/2013 03:03 PM, Linus Torvalds wrote:
>>
>> On Mon, Apr 29, 2013 at 2:53 PM, Pierre-Loup A. Griffais wrote:
>>>
>>> Other than this particular concern, what's the high-level take-away? Is
>>> PAE support in the Linux kernel a false promise that distros should not
>>> be shipping by default, if at all? Should it be removed from the kernel
>>> entirely if these configurations are knowingly broken by commits like
>>> this?
>>
>> PAE is "make it barely work". The whole concept is fundamentally
>> flawed, and anybody who runs a 32-bit kernel with 16GB of RAM doesn't
>> even understand *how* flawed and stupid that is.
>>
>> Don't do it. Upgrade to 64-bit, or live with the fact that IO
>> performance will suck. The fact that it happened to work better under
>> your particular load with one particular IO size is entirely just
>> "random noise".
>>
>> Yeah, the difference between "we can cache it" and "we have to do IO"
>> is huge. With a 32-bit kernel, we do IO much earlier now, just to
>> avoid some really nasty situations. That makes you go from the "can
>> sit in the cache" to the "do lots of IO" situation. Tough.
>>
>> Seriously, you can compile yourself a 64-bit kernel and continue to
>> use your 32-bit user-land. And you can complain to whatever distro you
>> used that it didn't do that in the first place. But we're not going to
>> bother with trying to tune PAE for some particular load. It's just not
>> worth it to anybody.
>
>
> All of this came from me trying to reproduce slowdowns reported by other
> people; I personally run a 64-bit kernel and understand how bad of an idea
> it is to attempt to run 32-bit kernels with PAE enabled on modern machines.
> However, my goal is to avoid ending up with a variety of end-users that
> don't necessarily understand this getting bitten by it and breaking their
> systems by upgrading their kernels. I will indeed bring this up with
> distributors and point out that shipping PAE kernels by default is not a
> good idea given these problems and your stance on the matter.
>

Sorry, just saw this (my stupid gmail filters for lkml). The slow-down
we ran into wasn't even on PAE -- it was *just* with highmem on a 2GB
system.  The non-zero amount (90MB? or so) of highmem was enough to
cause major problems due to that particular underflow.

I would say regardless of how much memory you have, if the system can
use a 64-bit kernel, then it almost certainly should.  I've seen some
very minor performance impacts on 64-bit capable Atom systems with
tiny L2 caches, but it's almost in the noise and not worth the pain.

> Thanks,
>  - Pierre-Loup
>
>>
>>  Linus
>>
>


Re: IO regression after ab8fabd46f on x86 kernels with high memory

2013-05-01 Thread Steven Rostedt
On Mon, Apr 29, 2013 at 08:48:17PM -0400, Rik van Riel wrote:
> 
> It could also print out a friendly message, to
> inform the user they should upgrade to a 64 bit
> kernel to enjoy the use of all of their memory.

Oh, oh, oh!!! Can we use my message:

  http://lwn.net/Articles/501769/

OK, maybe it's not so friendly ;-)

-- Steve


Re: IO regression after ab8fabd46f on x86 kernels with high memory

2013-04-29 Thread Pierre-Loup A. Griffais

On 04/29/2013 05:48 PM, Rik van Riel wrote:
> On 04/29/2013 06:03 PM, Linus Torvalds wrote:
>>
>> Seriously, you can compile yourself a 64-bit kernel and continue to
>> use your 32-bit user-land. And you can complain to whatever distro you
>> used that it didn't do that in the first place. But we're not going to
>> bother with trying to tune PAE for some particular load. It's just not
>> worth it to anybody.
>
> I can think of one way to "tune PAE" that will help
> avoid the breakage, and at the same time draw the
> attention of users.
>
> Limit the memory that a 32 bit PAE kernel uses, to
> something small enough where the user will not
> encounter random breakage.  Maybe 8 or 12GB?
>
> It could also print out a friendly message, to
> inform the user they should upgrade to a 64 bit
> kernel to enjoy the use of all of their memory.
>
> It is a bit of a heavy stick, but I suspect that
> it would clue in all of the affected users.
>
> If you have no objection to this, I'll whip up a
> patch.



That would be pretty useful, especially if I can then convince
distributors to apply it and roll it out ASAP. I haven't personally
observed any problems with mem=15G, whereas mem=16G exhibits the IO issue
upfront, and anything more than that exhibits the OOM-killer / low memory
starvation issue that existed before Johannes' change.
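
For readers comparing the `mem=15G` and `mem=16G` cases, the following is a minimal sketch of turning such a boot parameter into a byte count. It is a simplified illustration, not the kernel's parser: it only handles the plain `mem=<number>[K|M|G]` form, letting the last occurrence win is a choice made for this sketch, and `parse_mem_limit` is a hypothetical name.

```python
import re

# Binary suffixes handled by this sketch (an assumption: only K, M and G).
_SHIFT = {"": 0, "k": 10, "m": 20, "g": 30}
_MEM_RE = re.compile(r"mem=(\d+)([kmg]?)", re.IGNORECASE)


def parse_mem_limit(cmdline):
    """Return the mem= limit in bytes, or None if no plain mem= token exists."""
    limit = None
    for token in cmdline.split():
        match = _MEM_RE.fullmatch(token)
        if match:
            # Later tokens overwrite earlier ones (a choice made for this sketch).
            limit = int(match.group(1)) << _SHIFT[match.group(2).lower()]
    return limit


if __name__ == "__main__":
    print(parse_mem_limit("root=/dev/sda1 ro quiet mem=15G"))
```

With the command line above, the function returns 15 * 1024**3 bytes; a command line without a `mem=` token yields `None`.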


Thanks,
 - Pierre-Loup


Re: IO regression after ab8fabd46f on x86 kernels with high memory

2013-04-29 Thread Rik van Riel

On 04/29/2013 06:03 PM, Linus Torvalds wrote:


> Seriously, you can compile yourself a 64-bit kernel and continue to
> use your 32-bit user-land. And you can complain to whatever distro you
> used that it didn't do that in the first place. But we're not going to
> bother with trying to tune PAE for some particular load. It's just not
> worth it to anybody.


I can think of one way to "tune PAE" that will help
avoid the breakage, and at the same time draw the
attention of users.

Limit the memory that a 32 bit PAE kernel uses, to
something small enough where the user will not
encounter random breakage.  Maybe 8 or 12GB?

It could also print out a friendly message, to
inform the user they should upgrade to a 64 bit
kernel to enjoy the use of all of their memory.

It is a bit of a heavy stick, but I suspect that
it would clue in all of the affected users.

If you have no objection to this, I'll whip up a
patch.
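
The proposal above could be sketched roughly as follows. This is only an illustration of the idea of capping usable memory and telling the user why; it is not the patch offered in this message. The 8 GiB default is one of the two values floated above, and `cap_pae_memory` and the message wording are hypothetical.

```python
GIB = 1024 ** 3


def cap_pae_memory(detected_bytes, cap_bytes=8 * GIB):
    """Return (usable_bytes, message); message is None when nothing is capped."""
    if detected_bytes <= cap_bytes:
        return detected_bytes, None
    ignored_gib = (detected_bytes - cap_bytes) / GIB
    # Hypothetical wording for the "friendly message" suggested above.
    message = (f"32-bit PAE kernel: using {cap_bytes / GIB:.0f} GiB, "
               f"ignoring {ignored_gib:.0f} GiB; "
               f"boot a 64-bit kernel to use all memory")
    return cap_bytes, message


if __name__ == "__main__":
    usable, message = cap_pae_memory(16 * GIB)
    print(usable // GIB, "GiB usable")
    print(message)
```

A machine at or below the cap is left alone and gets no message, so only the affected configurations would see the warning.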



Re: IO regression after ab8fabd46f on x86 kernels with high memory

2013-04-29 Thread Pierre-Loup A. Griffais

On 04/26/2013 07:42 PM, Johannes Weiner wrote:

> On Fri, Apr 26, 2013 at 09:53:56PM -0400, Rik van Riel wrote:
>> On 04/26/2013 07:44 PM, Pierre-Loup A. Griffais wrote:
>>> I initially observed this between kernels 3.2 and 3.5: on 3.2, copying a
>>> 180M shared object on the same ext4 filesystem takes 0.6s. On 3.5, it
>>> takes between two and three minutes. It looks like a similar throughput
>>> regression happens on any machine running an i386 PAE kernel with high
>>> amounts of memory; the threshold seems to be 16G; passing mem=15G to the
>>> kernel commandline fixes it.
>>
>> If you have that much memory in the system, you will
>> want to run a 64 bit kernel to avoid all kinds of
>> memory management corner cases.
>
> Agreed.  You can even keep your 32 bit userland, just swap the
> kernel...
>
>>> I bisected it to the following change:
>>>
>>> commit ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d
>>> Author: Johannes Weiner
>>> Date:   Tue Jan 10 15:07:42 2012 -0800
>>>
>>>     mm: exclude reserved pages from dirtyable memory
>>>
>>> I realize running x86 kernels against high amounts of memory is not
>>> advised for various reasons, but I would assume that such a big
>>> regression in basic functionality is not part of them. Is that
>>> accurate, or are these configurations expected to become unusable from
>>> 3.3 onwards?
>>
>> Reverting that patch would probably break i686 PAE systems with
>> lots of memory at a different threshold.
>
> It would also re-introduce the reclaim stalls when zones with very
> little page cache due to lowmem reserves end up with a large
> percentage of their LRU dirty.  And that affects modern machines too,
> because of the lowmem reserves in DMA32 due to relatively bigger
> Normal zones.
>
> On such large highmem machines, however, the imbalance between highmem
> and lowmem is so enormous that the lowmem reserves basically exclude
> all of lowmem from page cache usage.
>
> But because dirty highmem creates lowmem pressure, and the amount of
> sanely allowable dirty memory is actually a function of lowmem, not
> highmem, highmem is not included in the amount of dirtyable memory.
>
> So because your lowmem is not available for page cache and highmem is
> not considered dirtyable out of the box, the amount of dirtyable
> memory on your machine is 0.  You can work around this by setting
> vm.highmem_is_dirtyable=1.
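
The reasoning quoted above can be sketched as a toy model. This is a deliberately simplified illustration rather than the kernel's accounting, which is done per zone: all function and parameter names are hypothetical, the page counts are made up, and the only behaviour taken from the explanation is that the lowmem reserve is subtracted from lowmem and that highmem counts only when `vm.highmem_is_dirtyable` is set.

```python
def dirtyable_pages(lowmem_free, lowmem_reclaimable, lowmem_reserve,
                    highmem_free, highmem_reclaimable,
                    highmem_is_dirtyable=False):
    """Toy estimate of dirtyable memory, in pages."""
    # Lowmem contributes only what is left after the reserve; clamp at zero
    # so a reserve larger than the available lowmem cannot go negative.
    lowmem_part = max(0, lowmem_free + lowmem_reclaimable - lowmem_reserve)
    # Highmem is counted only when the sysctl-style switch is enabled.
    highmem_part = (highmem_free + highmem_reclaimable
                    if highmem_is_dirtyable else 0)
    return lowmem_part + highmem_part


if __name__ == "__main__":
    # Made-up page counts for a large-highmem machine.
    args = dict(lowmem_free=200_000, lowmem_reclaimable=20_000,
                lowmem_reserve=230_000,
                highmem_free=3_900_000, highmem_reclaimable=50_000)
    print(dirtyable_pages(**args))
    print(dirtyable_pages(**args, highmem_is_dirtyable=True))
```

With a reserve that exceeds the available lowmem, the first call returns 0, which matches the "dirtyable memory on your machine is 0" case; enabling the switch brings the highmem pages back into the total.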


I understand the technical concerns; we had some existing issues on 3.2
with 24/32GB machines where the kernel would start erroneously
OOM-killing new processes after a while; booting with mem=16G solved
that. But now this goes a level further, since the machine is unusable
upfront, right at boot, even with mem=16G. As such, this clearly seems
like a regression more than a tradeoff.


We're in a situation where popular distros ship 32-bit as the default
"use this if you're not sure what to get" option, with PAE also enabled
by default. Most modern computers are shipping with more than 16G of RAM,
especially for gaming. Looking at the Steam HW survey data we have
hundreds of users using this combination; this commit means that
installing package updates that pull in a new kernel will immediately
cause their system to become unusable.


Other than this particular concern, what's the high-level take-away? Is
PAE support in the Linux kernel a false promise that distros should not
be shipping by default, if at all? Should it be removed from the kernel
entirely if these configurations are knowingly broken by commits like this?


Thanks,
 - Pierre-Loup


Re: IO regression after ab8fabd46f on x86 kernels with high memory

2013-04-29 Thread Pierre-Loup A. Griffais

On 04/29/2013 03:03 PM, Linus Torvalds wrote:
> On Mon, Apr 29, 2013 at 2:53 PM, Pierre-Loup A. Griffais wrote:
>>
>> Other than this particular concern, what's the high-level take-away? Is PAE
>> support in the Linux kernel a false promise that distros should not be
>> shipping by default, if at all? Should it be removed from the kernel
>> entirely if these configurations are knowingly broken by commits like this?
>
> PAE is "make it barely work". The whole concept is fundamentally
> flawed, and anybody who runs a 32-bit kernel with 16GB of RAM doesn't
> even understand *how* flawed and stupid that is.
>
> Don't do it. Upgrade to 64-bit, or live with the fact that IO
> performance will suck. The fact that it happened to work better under
> your particular load with one particular IO size is entirely just
> "random noise".
>
> Yeah, the difference between "we can cache it" and "we have to do IO"
> is huge. With a 32-bit kernel, we do IO much earlier now, just to
> avoid some really nasty situations. That makes you go from the "can
> sit in the cache" to the "do lots of IO" situation. Tough.
>
> Seriously, you can compile yourself a 64-bit kernel and continue to
> use your 32-bit user-land. And you can complain to whatever distro you
> used that it didn't do that in the first place. But we're not going to
> bother with trying to tune PAE for some particular load. It's just not
> worth it to anybody.


All of this came from me trying to reproduce slowdowns reported by other
people; I personally run a 64-bit kernel and understand how bad of an
idea it is to attempt to run 32-bit kernels with PAE enabled on modern
machines. However, my goal is to avoid ending up with a variety of
end-users that don't necessarily understand this getting bitten by it
and breaking their systems by upgrading their kernels. I will indeed
bring this up with distributors and point out that shipping PAE kernels
by default is not a good idea given these problems and your stance on
the matter.


Thanks,
 - Pierre-Loup

>
>  Linus
>


Re: IO regression after ab8fabd46f on x86 kernels with high memory

2013-04-29 Thread Linus Torvalds
On Mon, Apr 29, 2013 at 2:53 PM, Pierre-Loup A. Griffais wrote:
>
> Other than this particular concern, what's the high-level take-away? Is PAE
> support in the Linux kernel a false promise that distros should not be
> shipping by default, if at all? Should it be removed from the kernel
> entirely if these configurations are knowingly broken by commits like this?

PAE is "make it barely work". The whole concept is fundamentally
flawed, and anybody who runs a 32-bit kernel with 16GB of RAM doesn't
even understand *how* flawed and stupid that is.

Don't do it. Upgrade to 64-bit, or live with the fact that IO
performance will suck. The fact that it happened to work better under
your particular load with one particular IO size is entirely just
"random noise".

Yeah, the difference between "we can cache it" and "we have to do IO"
is huge. With a 32-bit kernel, we do IO much earlier now, just to
avoid some really nasty situations. That makes you go from the "can
sit in the cache" to the "do lots of IO" situation. Tough.

Seriously, you can compile yourself a 64-bit kernel and continue to
use your 32-bit user-land. And you can complain to whatever distro you
used that it didn't do that in the first place. But we're not going to
bother with trying to tune PAE for some particular load. It's just not
worth it to anybody.

Linus


Re: IO regression after ab8fabd46f on x86 kernels with high memory

2013-04-29 Thread Linus Torvalds
On Mon, Apr 29, 2013 at 2:53 PM, Pierre-Loup A. Griffais
pgriff...@valvesoftware.com wrote:

 Other than this particular concern, what's the high-level take-away? Is PAE
 support in the Linux kernel a false promise than distros should not be
 shipping by default, if at all? Should it be removed from the kernel
 entirely if these configurations are knowingly broken by commits like this?

PAE is make it barely work. The whole concept is fundamentally
flawed, and anybody who runs a 32-bit kernel with 16GB or RAM doesn't
even understand *how* flawed and stupid that is.

Don't do it. Upgrade to 64-bit, or live with the fact that IO
performance will suck. The fact that it happened to work better under
your particular load with one particular IO size is entirely just
random noise.

Yeah, the difference between we can cache it and we have to do IO
is huge. With a 32-bit kernel, we do IO much earlier now, just to
avoid some really nasty situations. That makes you go from the can
sit in the cache to the do lots of IO situation. Tough.

Seriously, you can compile yourself a 64-bit kernel and continue to
use your 32-bit user-land. And you can complain to whatever distro you
used that it didn't do that in the first place. But we're not going to
bother with trying to tune PAE for some particular load. It's just not
worth it to anybody.

Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: IO regression after ab8fabd46f on x86 kernels with high memory

2013-04-29 Thread Pierre-Loup A. Griffais

On 04/29/2013 03:03 PM, Linus Torvalds wrote:

On Mon, Apr 29, 2013 at 2:53 PM, Pierre-Loup A. Griffais
pgriff...@valvesoftware.com wrote:


Other than this particular concern, what's the high-level take-away? Is PAE
support in the Linux kernel a false promise than distros should not be
shipping by default, if at all? Should it be removed from the kernel
entirely if these configurations are knowingly broken by commits like this?


PAE is make it barely work. The whole concept is fundamentally
flawed, and anybody who runs a 32-bit kernel with 16GB or RAM doesn't
even understand *how* flawed and stupid that is.

Don't do it. Upgrade to 64-bit, or live with the fact that IO
performance will suck. The fact that it happened to work better under
your particular load with one particular IO size is entirely just
random noise.

Yeah, the difference between we can cache it and we have to do IO
is huge. With a 32-bit kernel, we do IO much earlier now, just to
avoid some really nasty situations. That makes you go from the can
sit in the cache to the do lots of IO situation. Tough.

Seriously, you can compile yourself a 64-bit kernel and continue to
use your 32-bit user-land. And you can complain to whatever distro you
used that it didn't do that in the first place. But we're not going to
bother with trying to tune PAE for some particular load. It's just not
worth it to anybody.


All of this came from me trying to reproduce slowdowns reported by other 
people; I personally run a 64-bit kernel and understand how bad of an 
idea it is to attempt to run 32-bit kernels with PAE enabled on modern 
machines. However, my goal is to avoid ending up with a variety of 
end-users that don't necessarily understand this getting bitten by it 
and breaking their systems by upgrading their kernels. I will indeed 
bring this up with distributors and point out than shipping PAE kernels 
by default is not a good idea given these problems and your stance on 
the matter.


Thanks,
 - Pierre-Loup



 Linus



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: IO regression after ab8fabd46f on x86 kernels with high memory

2013-04-29 Thread Pierre-Loup A. Griffais

On 04/26/2013 07:42 PM, Johannes Weiner wrote:

On Fri, Apr 26, 2013 at 09:53:56PM -0400, Rik van Riel wrote:

On 04/26/2013 07:44 PM, Pierre-Loup A. Griffais wrote:

I initially observed this between kernels 3.2 and 3.5: on 3.2, copying a
180M shared object on the same ext4 filesystem takes 0.6s. On 3.5, it
takes between two and three minutes. It looks like a similar throughput
regression happens on any machine running an i386 PAE kernel with high
amounts of memory; the threshold seems to be 16G; passing mem=15G to the
kernel commandline fixes it.


If you have that much memory in the system, you will
want to run a 64 bit kernel to avoid all kinds of
memory management corner cases.


Agreed.  You can even keep your 32 bit userland, just swap the
kernel...


I bisected it to the following change:

commit ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d
Author: Johannes Weiner jwei...@redhat.com
Date:   Tue Jan 10 15:07:42 2012 -0800

 mm: exclude reserved pages from dirtyable memory

I realize running x86 kernels against high amounts of memory is not
advised for various reasons, but I would assume that such a big
regression in basic functionality to not be part of them. Is that
accurate, or are these configurations expected to become unusable from
3.3 onwards?


Reverting that patch would probably break i686 PAE systems with
lots of memory at a different threshold.


It would also re-introduce the reclaim stalls when zones with very
little page cache due to lowmem reserves end up with a large
percentage of their LRU dirty.  And that affects modern machines too,
because of the lowmem reserves in DMA32 due to relatively bigger
Normal zones.

On such large highmem machines, however, the imbalance between highmem
and lowmem is so enormous that the lowmem reserves basically exclude
all of lowmem from page cache usage.

But because dirty highmem creates lowmem pressure, and the amount of
sanely allowable dirty memory is actually a function of lowmem, not
highmem, highmem is not included in the amount of dirtyable memory.

So because your lowmem is not available for page cache and highmem is
not considered dirtyable out of the box, the amount of dirtyable
memory on your machine is 0.  You can workaround this by setting
vm.highmem_is_dirtyable=1.


I understand the technical concerns; we had some existing issues on 3.2 
with 24/32GB machines where the kernel would start erroneously 
OOM-killing new processes after a while; booting with mem=16G solved 
that. But now this goes a level further, since the machine is unusable 
upfront, right at boot, even with mem=16G. As such this is clearly seems 
like a regression more than a tradeoff.


We're in a situation where popular distros ship 32-bit as the default 
use this if you're not sure what to get option, with PAE also enabled 
by default. most modern computers shipping with more than 16G of RAM, 
especially for gaming. Looking at the Steam HW survey data we have 
hundreds of users using this combination; this commit means that 
installing package updates that pull in a new kernel will immediately 
cause their system to become unusable.


Other than this particular concern, what's the high-level take-away? Is 
PAE support in the Linux kernel a false promise than distros should not 
be shipping by default, if at all? Should it be removed from the kernel 
entirely if these configurations are knowingly broken by commits like this?


Thanks,
 - Pierre-Loup


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: IO regression after ab8fabd46f on x86 kernels with high memory

2013-04-29 Thread Rik van Riel

On 04/29/2013 06:03 PM, Linus Torvalds wrote:


Seriously, you can compile yourself a 64-bit kernel and continue to
use your 32-bit user-land. And you can complain to whatever distro you
used that it didn't do that in the first place. But we're not going to
bother with trying to tune PAE for some particular load. It's just not
worth it to anybody.


I can think of one way to tune PAE that will help
avoid the breakage, and at the same time draw the
attention of users.

Limit the memory that a 32 bit PAE kernel uses, to
something small enough where the user will not
encounter random breakage.  Maybe 8 or 12GB?

It could also print out a friendly message, to
inform the user they should upgrade to a 64 bit
kernel to enjoy the use of all of their memory.

It is a bit of a heavy stick, but I suspect that
it would clue in all of the affected users.

If you have no objection to this, I'll whip up a
patch.
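
The proposal above can be illustrated with a minimal sketch. This is not the patch mentioned in the message; it is a hypothetical Python model of the idea, and the function name `clamp_pae_memory`, the 12 GiB default and the wording of the warning are all assumptions chosen for illustration.

```python
GIB = 1024 ** 3


def clamp_pae_memory(detected_bytes, limit_bytes=12 * GIB):
    """Model of a 32-bit PAE kernel capping the RAM it is willing to use.

    Returns (usable_bytes, warning). The warning is None when no memory
    is ignored. The 12 GiB default is an assumption taken from the
    "8 or 12GB" range floated in the message, not a tested threshold.
    """
    if detected_bytes <= limit_bytes:
        return detected_bytes, None

    ignored_gib = (detected_bytes - limit_bytes) / GIB
    warning = (
        f"32-bit PAE kernel: using {limit_bytes // GIB} GiB, ignoring "
        f"{ignored_gib:.1f} GiB; boot a 64-bit kernel to use all memory"
    )
    return limit_bytes, warning


usable, warning = clamp_pae_memory(16 * GIB)
print(usable // GIB, warning)
```

A real implementation would live in the kernel's early memory setup and would have to pick the limit from measured behaviour; the sketch only shows the clamp-and-warn shape of the idea.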



Re: IO regression after ab8fabd46f on x86 kernels with high memory

2013-04-29 Thread Pierre-Loup A. Griffais

On 04/29/2013 05:48 PM, Rik van Riel wrote:

> On 04/29/2013 06:03 PM, Linus Torvalds wrote:
>
> > Seriously, you can compile yourself a 64-bit kernel and continue to
> > use your 32-bit user-land. And you can complain to whatever distro you
> > used that it didn't do that in the first place. But we're not going to
> > bother with trying to tune PAE for some particular load. It's just not
> > worth it to anybody.
>
> I can think of one way to tune PAE that will help
> avoid the breakage, and at the same time draw the
> attention of users.
>
> Limit the memory that a 32 bit PAE kernel uses, to
> something small enough where the user will not
> encounter random breakage.  Maybe 8 or 12GB?
>
> It could also print out a friendly message, to
> inform the user they should upgrade to a 64 bit
> kernel to enjoy the use of all of their memory.
>
> It is a bit of a heavy stick, but I suspect that
> it would clue in all of the affected users.
>
> If you have no objection to this, I'll whip up a
> patch.



That would be pretty useful, especially if I can then convince 
distributors to apply it and roll it out ASAP. I haven't personally 
observed any problems with mem=15G, whereas mem=16G exhibits the IO issue 
upfront and, more than that, exhibits the OOM-killer / low memory 
starvation issue that existed before Johannes' change.


Thanks,
 - Pierre-Loup


Re: IO regression after ab8fabd46f on x86 kernels with high memory

2013-04-26 Thread Johannes Weiner
On Fri, Apr 26, 2013 at 09:53:56PM -0400, Rik van Riel wrote:
> On 04/26/2013 07:44 PM, Pierre-Loup A. Griffais wrote:
> >I initially observed this between kernels 3.2 and 3.5: on 3.2, copying a
> >180M shared object on the same ext4 filesystem takes 0.6s. On 3.5, it
> >takes between two and three minutes. It looks like a similar throughput
> >regression happens on any machine running an i386 PAE kernel with high
> >amounts of memory; the threshold seems to be 16G; passing mem=15G to the
> >kernel commandline fixes it.
> 
> If you have that much memory in the system, you will
> want to run a 64 bit kernel to avoid all kinds of
> memory management corner cases.

Agreed.  You can even keep your 32 bit userland, just swap the
kernel...

> >I bisected it to the following change:
> >
> >commit ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d
> >Author: Johannes Weiner 
> >Date:   Tue Jan 10 15:07:42 2012 -0800
> >
> > mm: exclude reserved pages from dirtyable memory
> >
> >I realize running x86 kernels against high amounts of memory is not
> >advised for various reasons, but I would assume that such a big
> >regression in basic functionality to not be part of them. Is that
> >accurate, or are these configurations expected to become unusable from
> >3.3 onwards?
> 
> Reverting that patch would probably break i686 PAE systems with
> lots of memory at a different threshold.

It would also re-introduce the reclaim stalls when zones with very
little page cache due to lowmem reserves end up with a large
percentage of their LRU dirty.  And that affects modern machines too,
because of the lowmem reserves in DMA32 due to relatively bigger
Normal zones.

On such large highmem machines, however, the imbalance between highmem
and lowmem is so enormous that the lowmem reserves basically exclude
all of lowmem from page cache usage.

But because dirty highmem creates lowmem pressure, and the amount of
sanely allowable dirty memory is actually a function of lowmem, not
highmem, highmem is not included in the amount of dirtyable memory.

So because your lowmem is not available for page cache and highmem is
not considered dirtyable out of the box, the amount of dirtyable
memory on your machine is 0.  You can work around this by setting
vm.highmem_is_dirtyable=1.
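
The reasoning above can be followed with a small worked model. This is a simplified sketch, not the kernel's actual accounting: the `Zone` fields, the per-zone subtraction and every page count below are assumptions picked to resemble a 16G PAE machine with 4 KiB pages.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    name: str
    free_and_cache_pages: int  # free pages plus reclaimable page cache
    reserve_pages: int         # lowmem reserve and watermarks, unusable for cache
    is_highmem: bool = False


def dirtyable_pages(zones, highmem_is_dirtyable=False):
    """Sum the pages that may hold dirty page cache in this model."""
    total = 0
    for zone in zones:
        # Highmem only counts when the administrator opts in.
        if zone.is_highmem and not highmem_is_dirtyable:
            continue
        # Reserved pages are excluded; a zone never contributes less than 0.
        total += max(0, zone.free_and_cache_pages - zone.reserve_pages)
    return total


# Hypothetical numbers: the reserves swallow all of lowmem.
zones = [
    Zone("DMA", 3_000, 4_000),
    Zone("Normal", 200_000, 230_000),
    Zone("HighMem", 3_900_000, 0, is_highmem=True),
]

print(dirtyable_pages(zones))                             # 0
print(dirtyable_pages(zones, highmem_is_dirtyable=True))  # 3900000
```

With zero dirtyable pages, any percentage-based dirty limit also comes out as zero, which in this model is why writers are throttled almost immediately; counting highmem restores a non-zero limit.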


Re: IO regression after ab8fabd46f on x86 kernels with high memory

2013-04-26 Thread Rik van Riel

On 04/26/2013 07:44 PM, Pierre-Loup A. Griffais wrote:

> I initially observed this between kernels 3.2 and 3.5: on 3.2, copying a
> 180M shared object on the same ext4 filesystem takes 0.6s. On 3.5, it
> takes between two and three minutes. It looks like a similar throughput
> regression happens on any machine running an i386 PAE kernel with high
> amounts of memory; the threshold seems to be 16G; passing mem=15G to the
> kernel commandline fixes it.

If you have that much memory in the system, you will
want to run a 64 bit kernel to avoid all kinds of
memory management corner cases.

> I bisected it to the following change:
>
> commit ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d
> Author: Johannes Weiner
> Date:   Tue Jan 10 15:07:42 2012 -0800
>
>  mm: exclude reserved pages from dirtyable memory
>
> I realize running x86 kernels against high amounts of memory is not
> advised for various reasons, but I would assume that such a big
> regression in basic functionality to not be part of them. Is that
> accurate, or are these configurations expected to become unusable from
> 3.3 onwards?

Reverting that patch would probably break i686 PAE systems with
lots of memory at a different threshold.

With more than 8-12GB of memory, an i686 kernel is between a
rock and a hard place. Whether you move it closer to the rock,
or closer to the hard place, all you do is change the way in
which it breaks.

> Also CCing Sonny since it looks like he tried to fix an overflow issue
> related to the same change with commit c8b74c2f66049, but I'm still
> experiencing the problem with a kernel built from master.
>
> Thanks,
>   - Pierre-Loup



--
All rights reversed

