Re: [PATCH 00/11] Use global pages with PTI
On 03/30/2018 10:39 PM, Ingo Molnar wrote:
> There were a couple of valid review comments which need to be addressed as
> well, but other than that it all looks good to me and I plan to apply the
> next iteration.

Testing on that non-PCID system showed an oddity with parts of the kernel
image that are modified later in boot (when we set the kernel image
read-only).  We split a few of the PMD entries and the old (early boot)
values were being used for userspace.  I don't think this is a big deal.
The most annoying thing is that it makes it harder to quickly validate
that all of the things we set to global *should* be global.

I'll put some examples of how this looks in the patch when I repost.
Re: [PATCH 00/11] Use global pages with PTI
* Dave Hansen wrote:

> On 03/30/2018 01:32 PM, Thomas Gleixner wrote:
> > On Fri, 30 Mar 2018, Dave Hansen wrote:
> >
> >> On 03/30/2018 05:17 AM, Ingo Molnar wrote:
> >>> BTW., the expectation on !PCID Intel hardware would be for global pages
> >>> to help even more than the 0.6% and 1.7% you measured on PCID hardware:
> >>> PCID already _reduces_ the cost of TLB flushes - so if there's not even
> >>> PCID then global pages should help even more.
> >>>
> >>> In theory at least. Would still be nice to measure it.
> >>
> >> I did the lseek test on a modern, non-PCID system:
> >>
> >> No Global pages (baseline): 6077741 lseeks/sec
> >> 94 Global pages (this set): 8433111 lseeks/sec
> >>                            +2355370 lseeks/sec (+38.8%)
> >
> > That's all kernel text, right? What's the result for the case where global
> > is only set for all user/kernel shared pages?
>
> Yes, that's all kernel text (94 global entries).  Here's the number with
> just the entry data/text set global (88 global entries on this system):
>
> No Global pages (baseline): 6077741 lseeks/sec
> 88 Global Pages (kentry  ): 7528609 lseeks/sec (+23.9%)
> 94 Global pages (this set): 8433111 lseeks/sec (+38.8%)

Very impressive!

Please incorporate the performance numbers in patches #9 and #11.

There were a couple of valid review comments which need to be addressed as
well, but other than that it all looks good to me and I plan to apply the
next iteration.

In fact I think I'll try to put it into the backporting tree: as PGE was
really the pre PTI status quo and thus we should expect few quirks/bugs in
this area, plus we still want to share as much core PTI logic with the
-stable kernels as possible.

The performance plus doesn't hurt either ... after so much lost
performance.

Thanks,

        Ingo
Re: [PATCH 00/11] Use global pages with PTI
On 03/30/2018 01:32 PM, Thomas Gleixner wrote:
> On Fri, 30 Mar 2018, Dave Hansen wrote:
>
>> On 03/30/2018 05:17 AM, Ingo Molnar wrote:
>>> BTW., the expectation on !PCID Intel hardware would be for global pages
>>> to help even more than the 0.6% and 1.7% you measured on PCID hardware:
>>> PCID already _reduces_ the cost of TLB flushes - so if there's not even
>>> PCID then global pages should help even more.
>>>
>>> In theory at least. Would still be nice to measure it.
>>
>> I did the lseek test on a modern, non-PCID system:
>>
>> No Global pages (baseline): 6077741 lseeks/sec
>> 94 Global pages (this set): 8433111 lseeks/sec
>>                            +2355370 lseeks/sec (+38.8%)
>
> That's all kernel text, right? What's the result for the case where global
> is only set for all user/kernel shared pages?

Yes, that's all kernel text (94 global entries).  Here's the number with
just the entry data/text set global (88 global entries on this system):

No Global pages (baseline): 6077741 lseeks/sec
88 Global Pages (kentry  ): 7528609 lseeks/sec (+23.9%)
94 Global pages (this set): 8433111 lseeks/sec (+38.8%)
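For anyone re-deriving the percentages in the message above, here is a minimal shell sketch. The helper name pct_change is made up for illustration and is not part of the patch set or of will-it-scale; it only assumes that awk is available.

```shell
#!/usr/bin/env bash
# pct_change BASE NEW: print the relative change from BASE to NEW,
# signed and rounded to one decimal place.
# (Hypothetical helper, only for checking the arithmetic in the thread.)
pct_change() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%+.1f%%\n", (new - base) / base * 100 }'
}

# lseeks/sec figures from the non-PCID system quoted above.
pct_change 6077741 7528609   # 88 global entries (entry text/data only)
pct_change 6077741 8433111   # 94 global entries (all kernel text)
```

The two calls print +23.9% and +38.8%, matching the reported figures.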
Re: [PATCH 00/11] Use global pages with PTI
On Fri, 30 Mar 2018, Dave Hansen wrote:

> On 03/30/2018 05:17 AM, Ingo Molnar wrote:
> > BTW., the expectation on !PCID Intel hardware would be for global pages
> > to help even more than the 0.6% and 1.7% you measured on PCID hardware:
> > PCID already _reduces_ the cost of TLB flushes - so if there's not even
> > PCID then global pages should help even more.
> >
> > In theory at least. Would still be nice to measure it.
>
> I did the lseek test on a modern, non-PCID system:
>
> No Global pages (baseline): 6077741 lseeks/sec
> 94 Global pages (this set): 8433111 lseeks/sec
>                            +2355370 lseeks/sec (+38.8%)

That's all kernel text, right? What's the result for the case where global
is only set for all user/kernel shared pages?

Thanks,

        tglx
Re: [PATCH 00/11] Use global pages with PTI
On 03/30/2018 05:17 AM, Ingo Molnar wrote:
> BTW., the expectation on !PCID Intel hardware would be for global pages
> to help even more than the 0.6% and 1.7% you measured on PCID hardware:
> PCID already _reduces_ the cost of TLB flushes - so if there's not even
> PCID then global pages should help even more.
>
> In theory at least. Would still be nice to measure it.

I did the lseek test on a modern, non-PCID system:

No Global pages (baseline): 6077741 lseeks/sec
94 Global pages (this set): 8433111 lseeks/sec
                           +2355370 lseeks/sec (+38.8%)
Re: [PATCH 00/11] Use global pages with PTI
* Ingo Molnar wrote:

> > No Global pages (baseline): 186.951 seconds time elapsed  ( +-  0.35% )
> > 28 Global pages (this set): 185.756 seconds time elapsed  ( +-  0.09% )
> >                              -1.195 seconds (-0.64%)
> >
> > Lower is better here, obviously.
> >
> > I also re-checked everything using will-it-scale's lseek1 test[2] which
> > is basically a microbenchmark of a halfway reasonable syscall.  Higher
> > here is better.
> >
> > No Global pages (baseline): 15783951 lseeks/sec
> > 28 Global pages (this set): 16054688 lseeks/sec
> >                              +270737 lseeks/sec (+1.71%)
> >
> > So, both the kernel compile and the microbenchmark got measurably faster.
>
> Ok, cool, this is much better!
>
> Mind re-sending the patch-set against latest -tip so it can be merged?
>
> At this point !PCID Intel hardware is not a primary concern, if something bad
> happens on them with global pages we can quirk global pages off on them in
> some way, or so.

BTW., the expectation on !PCID Intel hardware would be for global pages to
help even more than the 0.6% and 1.7% you measured on PCID hardware: PCID
already _reduces_ the cost of TLB flushes - so if there's not even PCID
then global pages should help even more.

In theory at least. Would still be nice to measure it.

Thanks,

        Ingo
Re: [PATCH 00/11] Use global pages with PTI
* Dave Hansen wrote:

> On 03/27/2018 01:07 PM, Ingo Molnar wrote:
> > * Thomas Gleixner wrote:
> >>> systems.  Atoms are going to be the easiest thing to get my hands on,
> >>> but I tend to shy away from them for performance work.
> >> What I have in mind is that I wonder whether the whole circus is worth it
> >> when there is no performance advantage on PCID systems.
>
> I was waiting on trying to find a relatively recent Atom system (they
> actually come in reasonably sized servers [1]), but I'm hitting a snag
> there, so I figured I'd just share a kernel compile using Ingo's
> perf-based methodology on a Skylake desktop system with PCIDs.
>
> Here's the kernel compile:
>
> No Global pages (baseline): 186.951 seconds time elapsed  ( +-  0.35% )
> 28 Global pages (this set): 185.756 seconds time elapsed  ( +-  0.09% )
>                              -1.195 seconds (-0.64%)
>
> Lower is better here, obviously.
>
> I also re-checked everything using will-it-scale's lseek1 test[2] which
> is basically a microbenchmark of a halfway reasonable syscall.  Higher
> here is better.
>
> No Global pages (baseline): 15783951 lseeks/sec
> 28 Global pages (this set): 16054688 lseeks/sec
>                              +270737 lseeks/sec (+1.71%)
>
> So, both the kernel compile and the microbenchmark got measurably faster.

Ok, cool, this is much better!

Mind re-sending the patch-set against latest -tip so it can be merged?

At this point !PCID Intel hardware is not a primary concern, if something bad
happens on them with global pages we can quirk global pages off on them in
some way, or so.

Thanks,

        Ingo
Re: [PATCH 00/11] Use global pages with PTI
On 03/27/2018 01:07 PM, Ingo Molnar wrote:
> * Thomas Gleixner wrote:
>>> systems.  Atoms are going to be the easiest thing to get my hands on,
>>> but I tend to shy away from them for performance work.
>> What I have in mind is that I wonder whether the whole circus is worth it
>> when there is no performance advantage on PCID systems.

I was waiting on trying to find a relatively recent Atom system (they
actually come in reasonably sized servers [1]), but I'm hitting a snag
there, so I figured I'd just share a kernel compile using Ingo's
perf-based methodology on a Skylake desktop system with PCIDs.

Here's the kernel compile:

No Global pages (baseline): 186.951 seconds time elapsed  ( +-  0.35% )
28 Global pages (this set): 185.756 seconds time elapsed  ( +-  0.09% )
                             -1.195 seconds (-0.64%)

Lower is better here, obviously.

I also re-checked everything using will-it-scale's lseek1 test[2] which
is basically a microbenchmark of a halfway reasonable syscall.  Higher
here is better.

No Global pages (baseline): 15783951 lseeks/sec
28 Global pages (this set): 16054688 lseeks/sec
                             +270737 lseeks/sec (+1.71%)

So, both the kernel compile and the microbenchmark got measurably faster.

1. https://ark.intel.com/products/97933/Intel-Atom-Processor-C3955-16M-Cache-up-to-2_40-GHz
2. https://github.com/antonblanchard/will-it-scale/blob/master/tests/lseek1.c
Re: [PATCH 00/11] Use global pages with PTI
On 03/27/2018 01:07 PM, Ingo Molnar wrote:
>  - To see at minimum stddev numbers, to make sure we are not looking at
>    some weird statistical artifact. (I also outlined a more robust
>    measurement method.)
>
>  - If the numbers are right, a CPU engineer should have a look if possible,
>    because frankly this effect is not expected and is not intuitive. Where
>    global pages can be used safely they are almost always an unconditional
>    win. Maybe we are missing some limitation or some interaction with PCID.
>
> Since we'll be using PCID even on Meltdown-fixed hardware, maybe the same
> negative performance effect already exists on non-PTI kernels as well, we
> just never noticed?

Yep, totally agree.

I'll do the more robust collection and also explore on "real" !PCID
hardware.  I also know the right CPU folks to go ask about this, I just
want to do the second round of robust data collection before I bug them.
Re: [PATCH 00/11] Use global pages with PTI
* Thomas Gleixner wrote:

> > systems.  Atoms are going to be the easiest thing to get my hands on,
> > but I tend to shy away from them for performance work.
>
> What I have in mind is that I wonder whether the whole circus is worth it
> when there is no performance advantage on PCID systems.

I'd still love to:

 - To see at minimum stddev numbers, to make sure we are not looking at
   some weird statistical artifact. (I also outlined a more robust
   measurement method.)

 - If the numbers are right, a CPU engineer should have a look if possible,
   because frankly this effect is not expected and is not intuitive. Where
   global pages can be used safely they are almost always an unconditional
   win. Maybe we are missing some limitation or some interaction with PCID.

Since we'll be using PCID even on Meltdown-fixed hardware, maybe the same
negative performance effect already exists on non-PTI kernels as well, we
just never noticed?

I.e. there are multiple grounds to get to the bottom of this.

Thanks,

        Ingo
Re: [PATCH 00/11] Use global pages with PTI
On Tue, 27 Mar 2018, Dave Hansen wrote:

> On 03/27/2018 06:36 AM, Thomas Gleixner wrote:
> >>                         User Time       Kernel Time     Clock Elapsed
> >> Baseline ( 0 GLB PTEs)  803.79          67.77           237.30
> >> w/series (28 GLB PTEs)  807.70 (+0.7%)  68.07 (+0.7%)   238.07 (+0.3%)
> >>
> >> Without PCIDs, it behaves the way I would expect.
> > What's the performance benefit on !PCID systems? And I mean systems which
> > actually do not have PCID, not a PCID system with 'nopcid' on the command
> > line.
>
> Do you have something in mind for this?  Basically *all* of the servers
> that I have access to have PCID because they are newer than ~7 years old.
>
> That leaves *some* Ivybridge and earlier desktops, Atoms and AMD

AMD is not interesting as it's not PTI and uses GLOBAL anyway.

> systems.  Atoms are going to be the easiest thing to get my hands on,
> but I tend to shy away from them for performance work.

What I have in mind is that I wonder whether the whole circus is worth it
when there is no performance advantage on PCID systems.

Thanks,

        tglx
Re: [PATCH 00/11] Use global pages with PTI
On 03/27/2018 06:36 AM, Thomas Gleixner wrote:
>>                         User Time       Kernel Time     Clock Elapsed
>> Baseline ( 0 GLB PTEs)  803.79          67.77           237.30
>> w/series (28 GLB PTEs)  807.70 (+0.7%)  68.07 (+0.7%)   238.07 (+0.3%)
>>
>> Without PCIDs, it behaves the way I would expect.
> What's the performance benefit on !PCID systems? And I mean systems which
> actually do not have PCID, not a PCID system with 'nopcid' on the command
> line.

Do you have something in mind for this?  Basically *all* of the servers
that I have access to have PCID because they are newer than ~7 years old.

That leaves *some* Ivybridge and earlier desktops, Atoms and AMD
systems.  Atoms are going to be the easiest thing to get my hands on,
but I tend to shy away from them for performance work.
Re: [PATCH 00/11] Use global pages with PTI
On Fri, 23 Mar 2018, Dave Hansen wrote:

> On 03/23/2018 11:26 AM, Linus Torvalds wrote:
> > On Fri, Mar 23, 2018 at 10:44 AM, Dave Hansen wrote:
> >>
> >> This adds one major change from the last version of the patch set
> >> (present in the last patch).  It makes all kernel text global for
> >> non-PCID systems.  This keeps kernel data protected always, but means
> >> that it will be easier to find kernel gadgets via meltdown on old
> >> systems without PCIDs.  This heuristic is, I think, a reasonable one
> >> and it keeps us from having to create any new pti=foo options
> >
> > Sounds sane.
> >
> > The patches look reasonable, but I hate seeing a patch series like
> > this where the only ostensible reason is performance, and there are no
> > performance numbers anywhere..
>
> Well, rats.  This somehow makes things slower with PCIDs on.  I thought
> I reversed the numbers, but I actually do a "grep -c GLB
> /sys/kernel/debug/page_tables/kernel" and record that in my logs right
> next to the output of time(1), so it's awfully hard to screw up.
>
> This is time doing a modestly-sized kernel compile on a 4-core Skylake
> desktop.
>
>                         User Time       Kernel Time     Clock Elapsed
> Baseline ( 0 GLB PTEs)  803.79          67.77           237.30
> w/series (28 GLB PTEs)  807.70 (+0.7%)  68.07 (+0.7%)   238.07 (+0.3%)
>
> Without PCIDs, it behaves the way I would expect.

What's the performance benefit on !PCID systems? And I mean systems which
actually do not have PCID, not a PCID system with 'nopcid' on the command
line.

Thanks,

        tglx
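The logging habit quoted above (counting GLB entries and recording the count next to the timing output) could be wrapped roughly as follows. This is only a sketch: the function names count_glb and log_glb_and_time are invented here and are not Dave's actual scripts. The debugfs path is the one quoted in the thread; reading it is assumed to need root and a kernel that exposes the page-table dump in debugfs.

```shell
#!/usr/bin/env bash
# count_glb [DUMP]: count page-table dump lines that carry the GLB flag.
# DUMP defaults to the debugfs file named in the thread.
count_glb() {
    local dump="${1:-/sys/kernel/debug/page_tables/kernel}"
    grep -c GLB "$dump"
}

# log_glb_and_time DUMP CMD...: print the GLB count, then time CMD, so the
# two values end up next to each other in the log.
# (Hypothetical helper, not taken from the patch set.)
log_glb_and_time() {
    local dump="$1"
    shift
    echo "GLB entries: $(count_glb "$dump")"
    time "$@"
}
```

A run might look like `log_glb_and_time /sys/kernel/debug/page_tables/kernel make -j4`, where the job count is a placeholder to adjust for the machine.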
Re: [PATCH 00/11] Use global pages with PTI
* Dave Hansen wrote:

> This is time doing a modestly-sized kernel compile on a 4-core Skylake
> desktop.
>
>                         User Time       Kernel Time     Clock Elapsed
> Baseline ( 0 GLB PTEs)  803.79          67.77           237.30
> w/series (28 GLB PTEs)  807.70 (+0.7%)  68.07 (+0.7%)   238.07 (+0.3%)
>
> Without PCIDs, it behaves the way I would expect.
>
> I'll ask around, but I'm open to any ideas about what the heck might be
> causing this.

Hm, so it's a bit weird that while user time and kernel time both increased
by about 0.7%, elapsed time only increased by 0.3%? Kernel builds are
normally much too parallel for that to be typical, so maybe there's some
noise in the measurement?

Before spending too much time on the global-TLB patch angle I'd suggest
investing a bit of time into making sure that the regression you are seeing
is actually real:

You haven't described how you have measured kernel build times and "+0.7%
regression" might turn out to be the real number, but sub-1% accuracy kernel
build times are *awfully* susceptible to:

 - various sources of noise

 - systematic statistical errors which don't show up as
   measurement-to-measurement noise but which skew the results: such as the
   boot-to-boot memory layout of the source code and object files.

 - cpufreq artifacts

Even repeated builds with 'make clean' in between can be misleading because
the exact layout of key include files and binaries which get accessed the
most often during a build are set into stone once they've been read into the
page cache for the first time after bootup. Automated reboots between
measurements can be misleading as well, if the file layout after bootup is
too deterministic.

So here's a pretty reliable way to measure kernel build time, which tries to
avoid the various pitfalls of caching.

First I make sure that cpufreq is set to 'performance':

  for ((cpu=0; cpu<120; cpu++)); do
    G=/sys/devices/system/cpu/cpu$cpu/cpufreq/scaling_governor
    [ -f $G ] && echo performance > $G
  done

[ ... because it can be *really* annoying to discover that an ostensible
  performance regression was a cpufreq artifact ... again. ;-) ]

Then I copy a kernel tree to /tmp (ramfs) as root:

  cd /tmp
  rm -rf linux
  git clone ~/linux linux
  cd linux
  make defconfig >/dev/null

... and then we can build the kernel in such a loop (as root again):

  perf stat --repeat 10 --null --pre '\
        cp -a kernel ../kernel.copy.$(date +%s); \
        rm -rf *; \
        git checkout .; \
        echo 1 > /proc/sys/vm/drop_caches; \
        find ../kernel* -type f | xargs cat >/dev/null; \
        make -j kernel >/dev/null; \
        make clean >/dev/null 2>&1; \
        sync' \
        \
        make -j16 >/dev/null

( I have tested these by pasting them into a terminal. Adjust the ~/linux
  source git tree and the '-j16' to your system. )

Notes:

 - the 'pre' script portion is not timed by 'perf stat', only the raw build
   times

 - we flush all caches via drop_caches and re-establish everything again,
   but:

 - we also introduce an intentional memory leak by slowly filling up ramfs
   with copies of 'kernel/', thus continuously changing the layout of free
   memory, cached data such as compiler binaries and the source code
   hierarchy. (Note that the leak is about 8MB per iteration, so it isn't
   massive.)

With 10 iterations this is the statistical stability I get on a big box:

 Performance counter stats for 'make -j128 kernel' (10 runs):

      26.346436425 seconds time elapsed    ( +-  0.19% )

... which, despite a high iteration count of 10, is still surprisingly
noisy, right?

A 0.2% stddev is probably not enough to call a 0.7% regression with good
confidence, so I had to use *30* iterations to bring measurement noise down
to about an order of magnitude lower than the effect I'm trying to measure:

 Performance counter stats for 'make -j128' (30 runs):

      26.334767571 seconds time elapsed    ( +-  0.09% )

i.e. "26.334 +- 0.023" seconds is a number we can have pretty high
confidence in, on this system.
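As a rough cross-check of the noise figures above, the relative "+-" value from perf stat can be turned into seconds and compared against the 0.7% effect. The helper names abs_spread and effect_to_noise are invented for this sketch, which assumes awk and nothing else; they are not part of perf.

```shell
#!/usr/bin/env bash
# abs_spread MEAN REL_PCT: convert a relative "+-" percentage into the same
# unit as MEAN (seconds here), printed with four decimals.
abs_spread() {
    awk -v mean="$1" -v pct="$2" 'BEGIN { printf "%.4f\n", mean * pct / 100 }'
}

# effect_to_noise EFFECT_PCT NOISE_PCT: how many times larger the effect is
# than the measurement noise.
# (Both helpers are hypothetical, written only for this arithmetic.)
effect_to_noise() {
    awk -v e="$1" -v n="$2" 'BEGIN { printf "%.1f\n", e / n }'
}

abs_spread 26.334767571 0.09   # 30-run measurement
effect_to_noise 0.7 0.19       # 10 runs
effect_to_noise 0.7 0.09       # 30 runs
```

The first call prints 0.0237, i.e. the roughly 0.023 seconds quoted above. The last two print 3.7 and 7.8: the 30-run noise is close to an order of magnitude below the 0.7% effect, while the 10-run noise is not.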
And just to demonstrate that it's all real, I repeated the whole 30-iteration measurement again: Performance counter stats for 'make -j128' (30 runs): 26.311166142 seconds time elapsed(+- 0.07%) Even if in the end you get a similar result, close to the +0.7% overhead you already measured, we should have more confidence in blaming global TLBs for the performance regression. BYMMV. Thanks, Ingo [*] Note that even this doesn't eliminate certain sources of measurement error: such as the boot-to-boot variance in the layout of
Re: [PATCH 00/11] Use global pages with PTI
* Dave Hansen wrote:

> This is time doing a modestly-sized kernel compile on a 4-core Skylake
> desktop.
>
>                         User Time       Kernel Time     Clock Elapsed
> Baseline ( 0 GLB PTEs)  803.79          67.77           237.30
> w/series (28 GLB PTEs)  807.70 (+0.7%)  68.07 (+0.7%)   238.07 (+0.3%)
>
> Without PCIDs, it behaves the way I would expect.
>
> I'll ask around, but I'm open to any ideas about what the heck might be
> causing this.

Hm, so it's a bit weird that while user time and kernel time both
increased by about 0.7%, elapsed time only increased by 0.3%? Kernel
builds are normally far too parallel for that to be typical, so maybe
there's some noise in the measurement?

Before spending too much time on the global-TLB patch angle I'd suggest
investing a bit of time into making sure that the regression you are
seeing is actually real:

You haven't described how you have measured kernel build times and
"+0.7% regression" might turn out to be the real number, but sub-1%
accuracy kernel build times are *awfully* susceptible to:

 - various sources of noise

 - systematic statistical errors which don't show up as
   measurement-to-measurement noise but which skew the results: such as
   the boot-to-boot memory layout of the source code and object files.

 - cpufreq artifacts

Even repeated builds with 'make clean' in between can be misleading,
because the exact layout of key include files and binaries which get
accessed the most often during a build is set in stone once they've
been read into the page cache for the first time after bootup.
Automated reboots between measurements can be misleading as well, if
the file layout after bootup is too deterministic.

So here's a pretty reliable way to measure kernel build time, which
tries to avoid the various pitfalls of caching.

First I make sure that cpufreq is set to 'performance':

  for ((cpu=0; cpu<120; cpu++)); do
    G=/sys/devices/system/cpu/cpu$cpu/cpufreq/scaling_governor
    [ -f $G ] && echo performance > $G
  done

[ ... because it can be *really* annoying to discover that an
  ostensible performance regression was a cpufreq artifact ... again.
  ;-) ]

Then I copy a kernel tree to /tmp (ramfs) as root:

  cd /tmp
  rm -rf linux
  git clone ~/linux linux
  cd linux
  make defconfig >/dev/null

... and then we can build the kernel in such a loop (as root again):

  perf stat --repeat 10 --null --pre '\
      cp -a kernel ../kernel.copy.$(date +%s);          \
      rm -rf *;                                         \
      git checkout .;                                   \
      echo 1 > /proc/sys/vm/drop_caches;                \
      find ../kernel* -type f | xargs cat >/dev/null;   \
      make -j kernel >/dev/null;                        \
      make clean >/dev/null 2>&1;                       \
      sync'                                             \
                                                        \
      make -j16 >/dev/null

( I have tested these by pasting them into a terminal. Adjust the
  ~/linux source git tree and the '-j16' to your system. )

Notes:

 - the 'pre' script portion is not timed by 'perf stat', only the raw
   build times

 - we flush all caches via drop_caches and re-establish everything
   again, but:

 - we also introduce an intentional memory leak by slowly filling up
   ramfs with copies of 'kernel/', thus continuously changing the
   layout of free memory, cached data such as compiler binaries and the
   source code hierarchy. (Note that the leak is about 8MB per
   iteration, so it isn't massive.)

With 10 iterations this is the statistical stability I get on a big
box:

  Performance counter stats for 'make -j128 kernel' (10 runs):

    26.346436425 seconds time elapsed    ( +- 0.19% )

... which, despite a high iteration count of 10, is still surprisingly
noisy, right?

A 0.2% stddev is probably not enough to call a 0.7% regression with
good confidence, so I had to use *30* iterations to get measurement
noise down to about an order of magnitude lower than the effect I'm
trying to measure:

  Performance counter stats for 'make -j128' (30 runs):

    26.334767571 seconds time elapsed    ( +- 0.09% )

i.e. "26.334 +- 0.023" seconds is a number we can have pretty high
confidence in, on this system.

And just to demonstrate that it's all real, I repeated the whole
30-iteration measurement again:

  Performance counter stats for 'make -j128' (30 runs):

    26.311166142 seconds time elapsed    ( +- 0.07% )

Even if in the end you get a similar result, close to the +0.7%
overhead you already measured, we should have more confidence in
blaming global TLBs for the performance regression.

BYMMV.

Thanks,

	Ingo

[*] Note that even this doesn't eliminate certain sources of
    measurement error: such as the boot-to-boot variance in the layout
    of certain key kernel data
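The noise arithmetic in the message above can be checked with a few lines of shell. This is only a sketch: the helper names noise_abs and effect_over_noise are made up for illustration and appear nowhere in the thread, and the "+-" figure is simply treated as a percentage of the mean.

```shell
#!/bin/sh
# Hypothetical helpers (names are not from the thread) for the noise
# arithmetic discussed above.

# noise_abs MEAN PCT: absolute size, in seconds, of a "+- PCT%" figure.
noise_abs() {
    awk -v mean="$1" -v pct="$2" 'BEGIN { printf "%.3f\n", mean * pct / 100 }'
}

# effect_over_noise EFFECT_PCT NOISE_PCT: how many times larger the
# suspected effect is than the measurement noise.
effect_over_noise() {
    awk -v effect="$1" -v noise="$2" 'BEGIN { printf "%.1f\n", effect / noise }'
}

noise_abs 26.334767571 0.09     # the 30-run figure quoted above
effect_over_noise 0.7 0.19      # 0.7% effect vs. the 10-run noise
effect_over_noise 0.7 0.09      # 0.7% effect vs. the 30-run noise
```

With the figures quoted in the mail this prints about 0.024 seconds (the mail's 0.023 looks like the same value truncated), then 3.7 and 7.8. That is the point being made: with 10 runs the suspected effect is less than four times the noise, and with 30 runs it approaches an order of magnitude.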
Re: [PATCH 00/11] Use global pages with PTI
On Fri, Mar 23, 2018 at 5:46 PM, Linus Torvalds wrote:
>
> It is, of course, possible that I misunderstood what you actually
> benchmarked. But I assume the above benchmark numbers are with the
> whole "don't even do global entries if you have PCID".

Oh, I went back and read your description, and realized that I _had_
misunderstood what you did.

I thought you didn't bother with global pages at all when you had PCID.

But that's not what you meant. You always do global for the actual
user-mapped kernel pages, but when you don't have PCID you do *all*
kernel text as global, whether shared or not.

So I entirely misread what the latest change was.

              Linus
Re: [PATCH 00/11] Use global pages with PTI
On Fri, Mar 23, 2018 at 5:40 PM, Dave Hansen wrote:
>
> Well, rats.  This somehow makes things slower with PCIDs on.

.. what happens when you enable global pages with PCID? You disabled
them explicitly because you thought they wouldn't matter..

Even with PCID, a global TLB entry for the shared pages would make
sense, because it's now just *one* entry in the TLB rather than "one
per PCID and one for the kernel mapping". So even if in theory the
lifetime of the TLB entry is the same, when you have capacity misses it
most definitely isn't.

And for process tear-down and build-up the per-PCID TLB entry does
nothing at all. While for a true global entry, it gets shared even
across process creation/deletion.

So even ignoring TLB capacity issues, with lots of short-lived
processes global TLB entries are much better.

It is, of course, possible that I misunderstood what you actually
benchmarked. But I assume the above benchmark numbers are with the
whole "don't even do global entries if you have PCID".

              Linus
Re: [PATCH 00/11] Use global pages with PTI
On 03/23/2018 11:26 AM, Linus Torvalds wrote:
> On Fri, Mar 23, 2018 at 10:44 AM, Dave Hansen wrote:
>>
>> This adds one major change from the last version of the patch set
>> (present in the last patch).  It makes all kernel text global for
>> non-PCID systems.  This keeps kernel data protected always, but means
>> that it will be easier to find kernel gadgets via meltdown on old
>> systems without PCIDs.  This heuristic is, I think, a reasonable one
>> and it keeps us from having to create any new pti=foo options
>
> Sounds sane.
>
> The patches look reasonable, but I hate seeing a patch series like
> this where the only ostensible reason is performance, and there are no
> performance numbers anywhere..

Well, rats.  This somehow makes things slower with PCIDs on.  I thought
I reversed the numbers, but I actually do a "grep -c GLB
/sys/kernel/debug/page_tables/kernel" and record that in my logs right
next to the output of time(1), so it's awfully hard to screw up.

This is time doing a modestly-sized kernel compile on a 4-core Skylake
desktop.

                        User Time       Kernel Time     Clock Elapsed
Baseline ( 0 GLB PTEs)  803.79          67.77           237.30
w/series (28 GLB PTEs)  807.70 (+0.7%)  68.07 (+0.7%)   238.07 (+0.3%)

Without PCIDs, it behaves the way I would expect.

I'll ask around, but I'm open to any ideas about what the heck might be
causing this.
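The bookkeeping described above, recording the GLB count right next to the timing of each run, might look roughly like the following. This is a sketch only: count_glb and timed_with_glb are hypothetical names, not the script actually used, and wall-clock seconds from date(1) stand in for the time(1) output mentioned in the mail.

```shell
#!/bin/sh
# Hypothetical helpers; not the logging script used in the thread.

# count_glb [FILE]: count page-table dump lines carrying the GLB flag.
# FILE defaults to the debugfs path quoted in the mail, which needs
# root and a kernel that exposes the page-table dump in debugfs.
count_glb() {
    grep -c GLB "${1:-/sys/kernel/debug/page_tables/kernel}"
}

# timed_with_glb FILE COMMAND...: run COMMAND and print the GLB count
# from FILE on the same line as the elapsed wall-clock seconds, so the
# two cannot get separated in the logs.
timed_with_glb() {
    dump=$1
    shift
    glb=$(count_glb "$dump") || true   # grep -c exits 1 when the count is 0
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "glb=$glb elapsed=$((end - start))s status=$status"
}

# Example (as root):
#   timed_with_glb /sys/kernel/debug/page_tables/kernel make -j16
```

Keeping both values on one log line is the design point: a baseline run and a patched run can then only be confused if the count itself is wrong.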
Re: [PATCH 00/11] Use global pages with PTI
On Fri, Mar 23, 2018 at 10:44 AM, Dave Hansen wrote:
>
> This adds one major change from the last version of the patch set
> (present in the last patch).  It makes all kernel text global for
> non-PCID systems.  This keeps kernel data protected always, but means
> that it will be easier to find kernel gadgets via meltdown on old
> systems without PCIDs.  This heuristic is, I think, a reasonable one
> and it keeps us from having to create any new pti=foo options

Sounds sane.

The patches look reasonable, but I hate seeing a patch series like
this where the only ostensible reason is performance, and there are no
performance numbers anywhere..

              Linus
[PATCH 00/11] Use global pages with PTI
The later versions of the KAISER patches (pre-PTI) allowed the
user/kernel shared areas to be GLOBAL.  The thought was that this would
reduce the TLB overhead of keeping two copies of these mappings.

During the switch over to PTI, we seem to have lost our ability to have
GLOBAL mappings.  This adds them back.

This adds one major change from the last version of the patch set
(present in the last patch).  It makes all kernel text global for
non-PCID systems.  This keeps kernel data protected always, but means
that it will be easier to find kernel gadgets via meltdown on old
systems without PCIDs.  This heuristic is, I think, a reasonable one
and it keeps us from having to create any new pti=foo options.

Cc: Andrea Arcangeli
Cc: Andy Lutomirski
Cc: Linus Torvalds
Cc: Kees Cook
Cc: Hugh Dickins
Cc: Juergen Gross
Cc: x...@kernel.org
Cc: Nadav Amit
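To see which side of the PCID heuristic a given test machine is likely to fall on, one can look at the CPU flags. The sketch below is an approximation under stated assumptions: has_cpu_flag and describe_global_policy are hypothetical helper names, and the CPU advertising "pcid" is only a proxy, since what the patches key off is whether the kernel actually uses PCID.

```shell
#!/bin/sh
# Hypothetical helpers; the flag check only approximates the kernel's
# own decision.

# has_cpu_flag FLAG [FILE]: succeed if FLAG appears as a whole word on
# a "flags" line. FILE defaults to /proc/cpuinfo. The -w matters here:
# without it "pcid" would also match "invpcid".
has_cpu_flag() {
    grep '^flags' "${2:-/proc/cpuinfo}" | grep -qw -- "$1"
}

# describe_global_policy [FILE]: print which case of the cover letter's
# heuristic the flags suggest.
describe_global_policy() {
    if has_cpu_flag pcid "$@"; then
        echo "PCID: only user/kernel shared areas global"
    else
        echo "no PCID: all kernel text global as well"
    fi
}

# Example:
#   describe_global_policy
```

Comparing this against the GLB count in the page-table dump is a quick sanity check that a benchmark ran in the configuration it was meant to.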