RE: IO regression after ab8fabd46f on x86 kernels with high memory
> We kernel guys have been asking the distros to ship 64-bit kernels even
> in their 32-bit distros for many years, but concerns of compat issues
> and the desire to deprecate 32-bit userspace seem to have kept that
> from happening.

And now there is another reason: to call 64-bit EFI runtime services. In retrospect, I would have stuck with 32-bit EFI with 64-bit kernels calling runtime services in compatibility mode, but of course it is too late for that now.

Yuhong Bao

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: IO regression after ab8fabd46f on x86 kernels with high memory
On 04/29/2013 03:03 PM, Linus Torvalds wrote:
> On Mon, Apr 29, 2013 at 2:53 PM, Pierre-Loup A. Griffais wrote:
>>
>> Other than this particular concern, what's the high-level take-away? Is PAE
>> support in the Linux kernel a false promise that distros should not be
>> shipping by default, if at all? Should it be removed from the kernel
>> entirely if these configurations are knowingly broken by commits like this?
>
> PAE is "make it barely work". The whole concept is fundamentally
> flawed, and anybody who runs a 32-bit kernel with 16GB of RAM doesn't
> even understand *how* flawed and stupid that is.
>

Let's be straight... the problem isn't PAE per se, the problem is *HIGHMEM*. PAE just allows HIGHMEM to stretch further into problematic territory.

Distros install PAE kernels by default because it is required to support NX. That is fine. The problem is that once your memory crosses the HIGHMEM threshold -- 896 MiB in the normal configuration -- then you are in "this is going to hurt" territory. I have seen HIGHMEM devastate performance without even crossing the 4 GiB threshold where PAE is required.

We kernel guys have been asking the distros to ship 64-bit kernels even in their 32-bit distros for many years, but concerns of compat issues and the desire to deprecate 32-bit userspace seem to have kept that from happening.

	-hpa
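To put rough numbers on the threshold described in the message above, here is a minimal sketch of the lowmem/highmem split. It assumes only the 896 MiB figure quoted there and ignores everything else a real kernel reserves or remaps; the function name and the sample RAM sizes are made up for illustration.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB

# 896 MiB is the threshold quoted above for the normal 32-bit configuration.
LOWMEM_LIMIT = 896 * MIB


def split_lowmem_highmem(total_ram_bytes, lowmem_limit=LOWMEM_LIMIT):
    """Return (lowmem, highmem) in bytes for a given amount of RAM.

    Simplified model (an assumption, not kernel code): everything up to
    the limit counts as lowmem, everything beyond it as highmem.
    """
    lowmem = min(total_ram_bytes, lowmem_limit)
    highmem = total_ram_bytes - lowmem
    return lowmem, highmem


if __name__ == "__main__":
    # Illustrative RAM sizes only.
    for ram in (512 * MIB, 2 * GIB, 4 * GIB, 16 * GIB):
        low, high = split_lowmem_highmem(ram)
        print(f"{ram // MIB} MiB RAM: lowmem {low // MIB} MiB, "
              f"highmem {high // MIB} MiB")
```

Under this simple model a 16 GiB machine ends up with 15488 MiB of highmem against 896 MiB of lowmem, which is the kind of imbalance the thread is about. Other address-space splits would move the threshold.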
Re: IO regression after ab8fabd46f on x86 kernels with high memory
On Mon, Apr 29, 2013 at 3:08 PM, Pierre-Loup A. Griffais wrote:
> On 04/29/2013 03:03 PM, Linus Torvalds wrote:
>>
>> On Mon, Apr 29, 2013 at 2:53 PM, Pierre-Loup A. Griffais wrote:
>>>
>>> Other than this particular concern, what's the high-level take-away? Is
>>> PAE support in the Linux kernel a false promise that distros should not be
>>> shipping by default, if at all? Should it be removed from the kernel
>>> entirely if these configurations are knowingly broken by commits like
>>> this?
>>
>> PAE is "make it barely work". The whole concept is fundamentally
>> flawed, and anybody who runs a 32-bit kernel with 16GB of RAM doesn't
>> even understand *how* flawed and stupid that is.
>>
>> Don't do it. Upgrade to 64-bit, or live with the fact that IO
>> performance will suck. The fact that it happened to work better under
>> your particular load with one particular IO size is entirely just
>> "random noise".
>>
>> Yeah, the difference between "we can cache it" and "we have to do IO"
>> is huge. With a 32-bit kernel, we do IO much earlier now, just to
>> avoid some really nasty situations. That makes you go from the "can
>> sit in the cache" to the "do lots of IO" situation. Tough.
>>
>> Seriously, you can compile yourself a 64-bit kernel and continue to
>> use your 32-bit user-land. And you can complain to whatever distro you
>> used that it didn't do that in the first place. But we're not going to
>> bother with trying to tune PAE for some particular load. It's just not
>> worth it to anybody.
>
> All of this came from me trying to reproduce slowdowns reported by other
> people; I personally run a 64-bit kernel and understand how bad of an idea
> it is to attempt to run 32-bit kernels with PAE enabled on modern machines.
> However, my goal is to avoid ending up with a variety of end-users that
> don't necessarily understand this getting bitten by it and breaking their
> systems by upgrading their kernels. I will indeed bring this up with
> distributors and point out that shipping PAE kernels by default is not a
> good idea given these problems and your stance on the matter.
>

Sorry, just saw this (my stupid gmail filters for lkml).

The slow-down we ran into wasn't even on PAE -- it was *just* with highmem on a 2GB system. The non-zero amount (90MB? or so) of highmem was enough to cause major problems due to that particular underflow.

I would say regardless of how much memory you have, if the system can use a 64-bit kernel, then it almost certainly should. I've seen some very minor performance impacts on 64-bit capable Atom systems with tiny L2 caches, but it's almost in the noise and not worth the pain.

> Thanks,
> - Pierre-Loup
>
>>
>> Linus
>>
Re: IO regression after ab8fabd46f on x86 kernels with high memory
On Mon, Apr 29, 2013 at 08:48:17PM -0400, Rik van Riel wrote:
>
> It could also print out a friendly message, to
> inform the user they should upgrade to a 64 bit
> kernel to enjoy the use of all of their memory.

Oh, oh, oh!!! Can we use my message:

  http://lwn.net/Articles/501769/

OK, maybe it's not so friendly ;-)

-- Steve
Re: IO regression after ab8fabd46f on x86 kernels with high memory
On 04/29/2013 05:48 PM, Rik van Riel wrote:
> On 04/29/2013 06:03 PM, Linus Torvalds wrote:
>
>> Seriously, you can compile yourself a 64-bit kernel and continue to
>> use your 32-bit user-land. And you can complain to whatever distro you
>> used that it didn't do that in the first place. But we're not going to
>> bother with trying to tune PAE for some particular load. It's just not
>> worth it to anybody.
>
> I can think of one way to "tune PAE" that will help avoid the breakage,
> and at the same time draw the attention of users.
>
> Limit the memory that a 32 bit PAE kernel uses, to something small
> enough where the user will not encounter random breakage. Maybe 8 or
> 12GB?
>
> It could also print out a friendly message, to inform the user they
> should upgrade to a 64 bit kernel to enjoy the use of all of their
> memory.
>
> It is a bit of a heavy stick, but I suspect that it would clue in all
> of the affected users.
>
> If you have no objection to this, I'll whip up a patch.

That would be pretty useful, especially if I can then convince distributors to apply it and roll it out ASAP. I haven't personally observed any problems with mem=15G, whereas mem=16G exhibits the IO issue upfront, and anything more than that exhibits the OOM-killer / low memory starvation issue that existed before Johannes' change.

Thanks,
 - Pierre-Loup
Re: IO regression after ab8fabd46f on x86 kernels with high memory
On 04/29/2013 06:03 PM, Linus Torvalds wrote:

> Seriously, you can compile yourself a 64-bit kernel and continue to
> use your 32-bit user-land. And you can complain to whatever distro you
> used that it didn't do that in the first place. But we're not going to
> bother with trying to tune PAE for some particular load. It's just not
> worth it to anybody.

I can think of one way to "tune PAE" that will help avoid the breakage, and at the same time draw the attention of users.

Limit the memory that a 32 bit PAE kernel uses, to something small enough where the user will not encounter random breakage. Maybe 8 or 12GB?

It could also print out a friendly message, to inform the user they should upgrade to a 64 bit kernel to enjoy the use of all of their memory.

It is a bit of a heavy stick, but I suspect that it would clue in all of the affected users.

If you have no objection to this, I'll whip up a patch.
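The policy proposed in the message above (cap the memory a 32-bit PAE kernel uses and tell the user why) can be sketched in a few lines. This is a plain Python model of the decision only, not the patch being offered: the 8 GiB default is just one of the two values floated above, and the function name and message wording are hypothetical.

```python
GIB = 1024 ** 3

# Hypothetical cap; the message above only suggests "maybe 8 or 12GB".
PAE_MEMORY_CAP = 8 * GIB


def clamp_pae_memory(detected_bytes, cap_bytes=PAE_MEMORY_CAP):
    """Return (usable_bytes, warning) under the proposed policy.

    warning is None when the detected memory fits under the cap,
    otherwise it is a message pointing the user at a 64-bit kernel.
    """
    if detected_bytes <= cap_bytes:
        return detected_bytes, None
    warning = (
        f"Only using {cap_bytes // GIB} GiB of "
        f"{detected_bytes // GIB} GiB RAM; "
        "upgrade to a 64-bit kernel to use all of your memory."
    )
    return cap_bytes, warning


if __name__ == "__main__":
    # Illustrative machine sizes only.
    for ram in (4 * GIB, 16 * GIB):
        usable, warning = clamp_pae_memory(ram)
        print(usable // GIB, "GiB usable;", warning or "no warning")
```

The design choice being illustrated is that the cap trades some usable memory for predictable behaviour, and the warning is what makes the trade visible to the affected users.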
Re: IO regression after ab8fabd46f on x86 kernels with high memory
On 04/26/2013 07:42 PM, Johannes Weiner wrote:
> On Fri, Apr 26, 2013 at 09:53:56PM -0400, Rik van Riel wrote:
>> On 04/26/2013 07:44 PM, Pierre-Loup A. Griffais wrote:
>>> I initially observed this between kernels 3.2 and 3.5: on 3.2, copying a
>>> 180M shared object on the same ext4 filesystem takes 0.6s. On 3.5, it
>>> takes between two and three minutes. It looks like a similar throughput
>>> regression happens on any machine running an i386 PAE kernel with high
>>> amounts of memory; the threshold seems to be 16G; passing mem=15G to the
>>> kernel commandline fixes it.
>>
>> If you have that much memory in the system, you will
>> want to run a 64 bit kernel to avoid all kinds of
>> memory management corner cases.
>
> Agreed. You can even keep your 32 bit userland, just swap the
> kernel...
>
>>> I bisected it to the following change:
>>>
>>> commit ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d
>>> Author: Johannes Weiner
>>> Date: Tue Jan 10 15:07:42 2012 -0800
>>>
>>>     mm: exclude reserved pages from dirtyable memory
>>>
>>> I realize running x86 kernels against high amounts of memory is not
>>> advised for various reasons, but I would not expect such a big
>>> regression in basic functionality to be part of them. Is that
>>> accurate, or are these configurations expected to become unusable from
>>> 3.3 onwards?
>>
>> Reverting that patch would probably break i686 PAE systems with
>> lots of memory at a different threshold.
>
> It would also re-introduce the reclaim stalls when zones with very
> little page cache due to lowmem reserves end up with a large
> percentage of their LRU dirty. And that affects modern machines too,
> because of the lowmem reserves in DMA32 due to relatively bigger
> Normal zones.
>
> On such large highmem machines, however, the imbalance between highmem
> and lowmem is so enormous that the lowmem reserves basically exclude
> all of lowmem from page cache usage. But because dirty highmem creates
> lowmem pressure, and the amount of sanely allowable dirty memory is
> actually a function of lowmem, not highmem, highmem is not included in
> the amount of dirtyable memory.
>
> So because your lowmem is not available for page cache and highmem is
> not considered dirtyable out of the box, the amount of dirtyable
> memory on your machine is 0.
>
> You can work around this by setting vm.highmem_is_dirtyable=1.

I understand the technical concerns; we had some existing issues on 3.2 with 24/32GB machines where the kernel would start erroneously OOM-killing new processes after a while; booting with mem=16G solved that. But now this goes a level further, since the machine is unusable upfront, right at boot, even with mem=16G. As such this clearly seems like a regression more than a tradeoff.

We're in a situation where popular distros ship 32-bit as the default "use this if you're not sure what to get" option, with PAE also enabled by default. Most modern computers are shipping with more than 16G of RAM, especially for gaming. Looking at the Steam HW survey data we have hundreds of users using this combination; this commit means that installing package updates that pull in a new kernel will immediately cause their system to become unusable.

Other than this particular concern, what's the high-level take-away? Is PAE support in the Linux kernel a false promise that distros should not be shipping by default, if at all? Should it be removed from the kernel entirely if these configurations are knowingly broken by commits like this?

Thanks,
 - Pierre-Loup
Re: IO regression after ab8fabd46f on x86 kernels with high memory
On 04/29/2013 03:03 PM, Linus Torvalds wrote:
> On Mon, Apr 29, 2013 at 2:53 PM, Pierre-Loup A. Griffais wrote:
>>
>> Other than this particular concern, what's the high-level take-away? Is PAE
>> support in the Linux kernel a false promise that distros should not be
>> shipping by default, if at all? Should it be removed from the kernel
>> entirely if these configurations are knowingly broken by commits like this?
>
> PAE is "make it barely work". The whole concept is fundamentally
> flawed, and anybody who runs a 32-bit kernel with 16GB of RAM doesn't
> even understand *how* flawed and stupid that is.
>
> Don't do it. Upgrade to 64-bit, or live with the fact that IO
> performance will suck. The fact that it happened to work better under
> your particular load with one particular IO size is entirely just
> "random noise".
>
> Yeah, the difference between "we can cache it" and "we have to do IO"
> is huge. With a 32-bit kernel, we do IO much earlier now, just to
> avoid some really nasty situations. That makes you go from the "can
> sit in the cache" to the "do lots of IO" situation. Tough.
>
> Seriously, you can compile yourself a 64-bit kernel and continue to
> use your 32-bit user-land. And you can complain to whatever distro you
> used that it didn't do that in the first place. But we're not going to
> bother with trying to tune PAE for some particular load. It's just not
> worth it to anybody.

All of this came from me trying to reproduce slowdowns reported by other people; I personally run a 64-bit kernel and understand how bad of an idea it is to attempt to run 32-bit kernels with PAE enabled on modern machines. However, my goal is to avoid ending up with a variety of end-users that don't necessarily understand this getting bitten by it and breaking their systems by upgrading their kernels. I will indeed bring this up with distributors and point out that shipping PAE kernels by default is not a good idea given these problems and your stance on the matter.

Thanks,
 - Pierre-Loup

>
>         Linus
Re: IO regression after ab8fabd46f on x86 kernels with high memory
On Mon, Apr 29, 2013 at 2:53 PM, Pierre-Loup A. Griffais wrote:
>
> Other than this particular concern, what's the high-level take-away? Is PAE
> support in the Linux kernel a false promise that distros should not be
> shipping by default, if at all? Should it be removed from the kernel
> entirely if these configurations are knowingly broken by commits like this?

PAE is "make it barely work". The whole concept is fundamentally flawed, and anybody who runs a 32-bit kernel with 16GB of RAM doesn't even understand *how* flawed and stupid that is.

Don't do it. Upgrade to 64-bit, or live with the fact that IO performance will suck. The fact that it happened to work better under your particular load with one particular IO size is entirely just "random noise".

Yeah, the difference between "we can cache it" and "we have to do IO" is huge. With a 32-bit kernel, we do IO much earlier now, just to avoid some really nasty situations. That makes you go from the "can sit in the cache" to the "do lots of IO" situation. Tough.

Seriously, you can compile yourself a 64-bit kernel and continue to use your 32-bit user-land. And you can complain to whatever distro you used that it didn't do that in the first place. But we're not going to bother with trying to tune PAE for some particular load. It's just not worth it to anybody.

        Linus
Re: IO regression after ab8fabd46f on x86 kernels with high memory
On Fri, Apr 26, 2013 at 09:53:56PM -0400, Rik van Riel wrote:
> On 04/26/2013 07:44 PM, Pierre-Loup A. Griffais wrote:
> >I initially observed this between kernels 3.2 and 3.5: on 3.2, copying a
> >180M shared object on the same ext4 filesystem takes 0.6s. On 3.5, it
> >takes between two and three minutes. It looks like a similar throughput
> >regression happens on any machine running an i386 PAE kernel with high
> >amounts of memory; the threshold seems to be 16G; passing mem=15G to the
> >kernel commandline fixes it.
>
> If you have that much memory in the system, you will
> want to run a 64 bit kernel to avoid all kinds of
> memory management corner cases.

Agreed. You can even keep your 32 bit userland, just swap the kernel...

> >I bisected it to the following change:
> >
> >commit ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d
> >Author: Johannes Weiner
> >Date: Tue Jan 10 15:07:42 2012 -0800
> >
> > mm: exclude reserved pages from dirtyable memory
> >
> >I realize running x86 kernels against high amounts of memory is not
> >advised for various reasons, but I would assume that such a big
> >regression in basic functionality to not be part of them. Is that
> >accurate, or are these configurations expected to become unusable from
> >3.3 onwards?
>
> Reverting that patch would probably break i686 PAE systems with
> lots of memory at a different threshold.

It would also re-introduce the reclaim stalls when zones with very little page cache due to lowmem reserves end up with a large percentage of their LRU dirty. And that affects modern machines too, because of the lowmem reserves in DMA32 due to relatively bigger Normal zones.

On such large highmem machines, however, the imbalance between highmem and lowmem is so enormous that the lowmem reserves basically exclude all of lowmem from page cache usage.

But because dirty highmem creates lowmem pressure, and the amount of sanely allowable dirty memory is actually a function of lowmem, not highmem, highmem is not included in the amount of dirtyable memory.

So because your lowmem is not available for page cache and highmem is not considered dirtyable out of the box, the amount of dirtyable memory on your machine is 0.

You can work around this by setting vm.highmem_is_dirtyable=1.
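The accounting described in this message can be illustrated with a toy model. The Python sketch below is not the kernel's code: the Zone class, both function names, the per-zone subtraction, and every page count are assumptions chosen only to reproduce the reasoning (reserves swallow all of lowmem, and highmem is skipped unless the sysctl is set).

```python
from dataclasses import dataclass


@dataclass
class Zone:
    name: str
    free_pages: int
    reclaimable_pages: int
    reserve_pages: int       # pages held back by reserves in this zone
    is_highmem: bool = False


def zone_dirtyable(zone):
    """Pages of one zone that may hold dirty page cache, never negative."""
    return max(0, zone.free_pages + zone.reclaimable_pages - zone.reserve_pages)


def global_dirtyable(zones, highmem_is_dirtyable=False):
    """Sum dirtyable pages, skipping highmem zones unless explicitly allowed."""
    return sum(
        zone_dirtyable(z)
        for z in zones
        if highmem_is_dirtyable or not z.is_highmem
    )


# Hypothetical page counts for a large-memory 32-bit PAE machine.
zones = [
    Zone("Normal", free_pages=150_000, reclaimable_pages=50_000,
         reserve_pages=220_000),
    Zone("HighMem", free_pages=3_500_000, reclaimable_pages=400_000,
         reserve_pages=0, is_highmem=True),
]

default_total = global_dirtyable(zones)                                # 0
workaround_total = global_dirtyable(zones, highmem_is_dirtyable=True)  # 3900000
```

In this model the default total is 0 because the reserve exceeds everything the Normal zone could offer, which matches the explanation above. Flipping the flag corresponds to setting vm.highmem_is_dirtyable=1 and makes the highmem pages count again.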
Re: IO regression after ab8fabd46f on x86 kernels with high memory
On 04/26/2013 07:44 PM, Pierre-Loup A. Griffais wrote:

> I initially observed this between kernels 3.2 and 3.5: on 3.2, copying a
> 180M shared object on the same ext4 filesystem takes 0.6s. On 3.5, it
> takes between two and three minutes. It looks like a similar throughput
> regression happens on any machine running an i386 PAE kernel with high
> amounts of memory; the threshold seems to be 16G; passing mem=15G to the
> kernel commandline fixes it.

If you have that much memory in the system, you will want to run a 64 bit kernel to avoid all kinds of memory management corner cases.

> I bisected it to the following change:
>
> commit ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d
> Author: Johannes Weiner
> Date: Tue Jan 10 15:07:42 2012 -0800
>
>  mm: exclude reserved pages from dirtyable memory
>
> I realize running x86 kernels against high amounts of memory is not
> advised for various reasons, but I would assume that such a big
> regression in basic functionality to not be part of them. Is that
> accurate, or are these configurations expected to become unusable from
> 3.3 onwards?

Reverting that patch would probably break i686 PAE systems with lots of memory at a different threshold.

With more than 8-12GB of memory, an i686 kernel is between a rock and a hard place. Whether you move it closer to the rock, or closer to the hard place, all you do is change the way in which it breaks.

> Also CCing Sonny since it looks like he tried to fix an overflow issue
> related to the same change with commit c8b74c2f66049, but I'm still
> experiencing the problem with a kernel built from master.
>
> Thanks,
>  - Pierre-Loup

--
All rights reversed