Re: [PATCH 00/11] Use global pages with PTI

2018-03-31 Thread Dave Hansen
On 03/30/2018 10:39 PM, Ingo Molnar wrote:
> There were a couple of valid review comments which need to be addressed as
> well, but other than that it all looks good to me and I plan to apply the
> next iteration.

Testing on that non-PCID system showed an oddity with parts of the
kernel image that are modified later in boot (when we set the kernel
image read-only).  We split a few of the PMD entries and the old
(early boot) values were being used for userspace.

I don't think this is a big deal.  The most annoying thing is that it
makes it harder to quickly validate that all of the things we set to
global *should* be global.  I'll put some examples of how this looks in
the patch when I repost.


Re: [PATCH 00/11] Use global pages with PTI

2018-03-30 Thread Ingo Molnar

* Dave Hansen  wrote:

> On 03/30/2018 01:32 PM, Thomas Gleixner wrote:
> > On Fri, 30 Mar 2018, Dave Hansen wrote:
> > 
> >> On 03/30/2018 05:17 AM, Ingo Molnar wrote:
> >>> BTW., the expectation on !PCID Intel hardware would be for global pages
> >>> to help even more than the 0.6% and 1.7% you measured on PCID hardware:
> >>> PCID already _reduces_ the cost of TLB flushes - so if there's not even
> >>> PCID then global pages should help even more.
> >>>
> >>> In theory at least. Would still be nice to measure it.
> >>
> >> I did the lseek test on a modern, non-PCID system:
> >>
> >> No Global pages (baseline): 6077741 lseeks/sec
> >> 94 Global pages (this set): 8433111 lseeks/sec
> >>   +2355370 lseeks/sec (+38.8%)
> > 
> > That's all kernel text, right? What's the result for the case where global
> > is only set for all user/kernel shared pages?
> 
> Yes, that's all kernel text (94 global entries).  Here's the number with
> just the entry data/text set global (88 global entries on this system):
> 
> No Global pages (baseline): 6077741 lseeks/sec
> 88 Global Pages (kentry  ): 7528609 lseeks/sec (+23.9%)
> 94 Global pages (this set): 8433111 lseeks/sec (+38.8%)

Very impressive!

Please incorporate the performance numbers in patches #9 and #11.

There were a couple of valid review comments which need to be addressed as
well, but other than that it all looks good to me and I plan to apply the next
iteration.

In fact I think I'll try to put it into the backporting tree: as PGE was really 
the pre PTI status quo and thus we should expect few quirks/bugs in this area, 
plus we still want to share as much core PTI logic with the -stable kernels as 
possible. The performance plus doesn't hurt either ... after so much lost 
performance.

Thanks,

Ingo


Re: [PATCH 00/11] Use global pages with PTI

2018-03-30 Thread Dave Hansen
On 03/30/2018 01:32 PM, Thomas Gleixner wrote:
> On Fri, 30 Mar 2018, Dave Hansen wrote:
> 
>> On 03/30/2018 05:17 AM, Ingo Molnar wrote:
>>> BTW., the expectation on !PCID Intel hardware would be for global pages to
>>> help even more than the 0.6% and 1.7% you measured on PCID hardware: PCID
>>> already _reduces_ the cost of TLB flushes - so if there's not even PCID
>>> then global pages should help even more.
>>>
>>> In theory at least. Would still be nice to measure it.
>>
>> I did the lseek test on a modern, non-PCID system:
>>
>> No Global pages (baseline): 6077741 lseeks/sec
>> 94 Global pages (this set): 8433111 lseeks/sec
>> +2355370 lseeks/sec (+38.8%)
> 
> That's all kernel text, right? What's the result for the case where global
> is only set for all user/kernel shared pages?

Yes, that's all kernel text (94 global entries).  Here's the number with
just the entry data/text set global (88 global entries on this system):

No Global pages (baseline): 6077741 lseeks/sec
88 Global Pages (kentry  ): 7528609 lseeks/sec (+23.9%)
94 Global pages (this set): 8433111 lseeks/sec (+38.8%)
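
For anyone re-deriving the percentages in the table above, here is a minimal
sketch of the arithmetic in shell. The pct_change helper name is made up for
illustration and is not part of the patch set or of will-it-scale; it just
prints (new - old) / old as a signed percentage.

```shell
#!/bin/sh
# pct_change OLD NEW: print the relative change from OLD to NEW as a
# signed percentage with one decimal place.
# (Hypothetical helper, for illustration only.)
pct_change() {
    awk -v old="$1" -v new="$2" \
        'BEGIN { printf "%+.1f%%\n", (new - old) / old * 100 }'
}

pct_change 6077741 7528609   # entry text/data global vs. baseline
pct_change 6077741 8433111   # all kernel text global vs. baseline
```

The same helper works for "lower is better" numbers such as elapsed build
time; an improvement then shows up with a negative sign.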



Re: [PATCH 00/11] Use global pages with PTI

2018-03-30 Thread Thomas Gleixner
On Fri, 30 Mar 2018, Dave Hansen wrote:

> On 03/30/2018 05:17 AM, Ingo Molnar wrote:
> > BTW., the expectation on !PCID Intel hardware would be for global pages to
> > help even more than the 0.6% and 1.7% you measured on PCID hardware: PCID
> > already _reduces_ the cost of TLB flushes - so if there's not even PCID
> > then global pages should help even more.
> > 
> > In theory at least. Would still be nice to measure it.
> 
> I did the lseek test on a modern, non-PCID system:
> 
> No Global pages (baseline): 6077741 lseeks/sec
> 94 Global pages (this set): 8433111 lseeks/sec
>  +2355370 lseeks/sec (+38.8%)

That's all kernel text, right? What's the result for the case where global
is only set for all user/kernel shared pages?

Thanks,

tglx




Re: [PATCH 00/11] Use global pages with PTI

2018-03-30 Thread Dave Hansen
On 03/30/2018 05:17 AM, Ingo Molnar wrote:
> BTW., the expectation on !PCID Intel hardware would be for global pages to
> help even more than the 0.6% and 1.7% you measured on PCID hardware: PCID
> already _reduces_ the cost of TLB flushes - so if there's not even PCID then
> global pages should help even more.
> 
> In theory at least. Would still be nice to measure it.

I did the lseek test on a modern, non-PCID system:

No Global pages (baseline): 6077741 lseeks/sec
94 Global pages (this set): 8433111 lseeks/sec
   +2355370 lseeks/sec (+38.8%)


Re: [PATCH 00/11] Use global pages with PTI

2018-03-30 Thread Ingo Molnar

* Ingo Molnar  wrote:

> > No Global pages (baseline): 186.951 seconds time elapsed  ( +-  0.35% )
> > 28 Global pages (this set): 185.756 seconds time elapsed  ( +-  0.09% )
> >  -1.195 seconds (-0.64%)
> > 
> > Lower is better here, obviously.
> > 
> > I also re-checked everything using will-it-scale's llseek1 test[2] which
> > is basically a microbenchmark of a halfway reasonable syscall.  Higher
> > here is better.
> > 
> > No Global pages (baseline): 15783951 lseeks/sec
> > 28 Global pages (this set): 16054688 lseeks/sec
> >  +270737 lseeks/sec (+1.71%)
> > 
> > So, both the kernel compile and the microbenchmark got measurably faster.
> 
> Ok, cool, this is much better!
> 
> Mind re-sending the patch-set against latest -tip so it can be merged?
> 
> At this point !PCID Intel hardware is not a primary concern, if something bad
> happens on them with global pages we can quirk global pages off on them in
> some way, or so.

BTW., the expectation on !PCID Intel hardware would be for global pages to help
even more than the 0.6% and 1.7% you measured on PCID hardware: PCID already
_reduces_ the cost of TLB flushes - so if there's not even PCID then global
pages should help even more.

In theory at least. Would still be nice to measure it.

Thanks,

Ingo


Re: [PATCH 00/11] Use global pages with PTI

2018-03-30 Thread Ingo Molnar

* Dave Hansen  wrote:

> On 03/27/2018 01:07 PM, Ingo Molnar wrote:
> > * Thomas Gleixner  wrote:
> >>> systems.  Atoms are going to be the easiest thing to get my hands on,
> >>> but I tend to shy away from them for performance work.
> >> What I have in mind is that I wonder whether the whole circus is worth it
> >> when there is no performance advantage on PCID systems.
> 
> I was waiting on trying to find a relatively recent Atom system (they
> actually come in reasonably sized servers [1]), but I'm hitting a snag
> there, so I figured I'd just share a kernel compile using Ingo's
> perf-based methodology on a Skylake desktop system with PCIDs.
>
> Here's the kernel compile:
> 
> No Global pages (baseline): 186.951 seconds time elapsed  ( +-  0.35% )
> 28 Global pages (this set): 185.756 seconds time elapsed  ( +-  0.09% )
>  -1.195 seconds (-0.64%)
> 
> Lower is better here, obviously.
> 
> I also re-checked everything using will-it-scale's llseek1 test[2] which
> is basically a microbenchmark of a halfway reasonable syscall.  Higher
> here is better.
> 
> No Global pages (baseline): 15783951 lseeks/sec
> 28 Global pages (this set): 16054688 lseeks/sec
>  +270737 lseeks/sec (+1.71%)
> 
> So, both the kernel compile and the microbenchmark got measurably faster.

Ok, cool, this is much better!

Mind re-sending the patch-set against latest -tip so it can be merged?

At this point !PCID Intel hardware is not a primary concern, if something bad 
happens on them with global pages we can quirk global pages off on them in some 
way, or so.

Thanks,

Ingo


Re: [PATCH 00/11] Use global pages with PTI

2018-03-28 Thread Dave Hansen
On 03/27/2018 01:07 PM, Ingo Molnar wrote:
> * Thomas Gleixner  wrote:
>>> systems.  Atoms are going to be the easiest thing to get my hands on,
>>> but I tend to shy away from them for performance work.
>> What I have in mind is that I wonder whether the whole circus is worth it
>> when there is no performance advantage on PCID systems.

I was waiting on trying to find a relatively recent Atom system (they
actually come in reasonably sized servers [1]), but I'm hitting a snag
there, so I figured I'd just share a kernel compile using Ingo's
perf-based methodology on a Skylake desktop system with PCIDs.  Here's
the kernel compile:

No Global pages (baseline): 186.951 seconds time elapsed  ( +-  0.35% )
28 Global pages (this set): 185.756 seconds time elapsed  ( +-  0.09% )
 -1.195 seconds (-0.64%)

Lower is better here, obviously.

I also re-checked everything using will-it-scale's llseek1 test[2] which
is basically a microbenchmark of a halfway reasonable syscall.  Higher
here is better.

No Global pages (baseline): 15783951 lseeks/sec
28 Global pages (this set): 16054688 lseeks/sec
 +270737 lseeks/sec (+1.71%)

So, both the kernel compile and the microbenchmark got measurably faster.

1.
https://ark.intel.com/products/97933/Intel-Atom-Processor-C3955-16M-Cache-up-to-2_40-GHz
2.
https://github.com/antonblanchard/will-it-scale/blob/master/tests/lseek1.c



Re: [PATCH 00/11] Use global pages with PTI

2018-03-27 Thread Dave Hansen
On 03/27/2018 01:07 PM, Ingo Molnar wrote:
>  - To see at minimum stddev numbers, to make sure we are not looking at
>    some weird statistical artifact. (I also outlined a more robust
>    measurement method.)
>
>  - If the numbers are right, a CPU engineer should have a look if possible,
>    because frankly this effect is not expected and is not intuitive. Where
>    global pages can be used safely they are almost always an unconditional
>    win. Maybe we are missing some limitation or some interaction with PCID.
>
> Since we'll be using PCID even on Meltdown-fixed hardware, maybe the same
> negative performance effect already exists on non-PTI kernels as well, we
> just never noticed?

Yep, totally agree.  I'll do the more robust collection and also explore
on "real" !PCID hardware.  I also know the right CPU folks to go ask
about this, I just want to do the second round of robust data collection
before I bug them.


Re: [PATCH 00/11] Use global pages with PTI

2018-03-27 Thread Ingo Molnar

* Thomas Gleixner  wrote:

> > systems.  Atoms are going to be the easiest thing to get my hands on,
> > but I tend to shy away from them for performance work.
> 
> What I have in mind is that I wonder whether the whole circus is worth it
> when there is no performance advantage on PCID systems.

I'd still love to:

 - See at minimum stddev numbers, to make sure we are not looking at some
   weird statistical artifact. (I also outlined a more robust measurement
   method.)

 - If the numbers are right, have a CPU engineer look at this if possible,
   because frankly this effect is not expected and is not intuitive. Where
   global pages can be used safely they are almost always an unconditional
   win. Maybe we are missing some limitation or some interaction with PCID.

Since we'll be using PCID even on Meltdown-fixed hardware, maybe the same
negative performance effect already exists on non-PTI kernels as well, we just
never noticed?

I.e. there are multiple grounds to get to the bottom of this.

Thanks,

Ingo
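
A rough sketch of how the mean and standard deviation of repeated timings
could be summarized in shell, assuming one measurement per line on stdin. The
summarize name and the output format are made up here; 'perf stat --repeat'
prints its own '+-' figure and this is not a reimplementation of it.

```shell
#!/bin/sh
# summarize: read one number per line from stdin and print the sample
# count, the mean, the sample standard deviation (n - 1 in the
# denominator) and the standard deviation relative to the mean.
# (Hypothetical helper, for illustration only.)
summarize() {
    awk '{ n++; sum += $1; sumsq += $1 * $1 }
         END {
             if (n < 2) { print "need at least 2 samples"; exit 1 }
             mean = sum / n
             var = (sumsq - n * mean * mean) / (n - 1)
             if (var < 0) var = 0    # guard against rounding error
             sd = sqrt(var)
             printf "n=%d mean=%.3f stddev=%.3f (%.2f%%)\n", \
                    n, mean, sd, 100 * sd / mean
         }'
}

# Usage: feed it a file with one elapsed time per line, for example
#   summarize < baseline-times.txt
```

If the difference between two configurations is smaller than the relative
stddev of either one, it is hard to call it anything other than noise.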


Re: [PATCH 00/11] Use global pages with PTI

2018-03-27 Thread Thomas Gleixner
On Tue, 27 Mar 2018, Dave Hansen wrote:

> On 03/27/2018 06:36 AM, Thomas Gleixner wrote:
> >>                         User Time       Kernel Time     Clock Elapsed
> >> Baseline ( 0 GLB PTEs)  803.79          67.77           237.30
> >> w/series (28 GLB PTEs)  807.70 (+0.7%)  68.07 (+0.7%)   238.07 (+0.3%)
> >>
> >> Without PCIDs, it behaves the way I would expect.
> > What's the performance benefit on !PCID systems? And I mean systems which
> > actually do not have PCID, not a PCID system with 'nopcid' on the command
> > line.
> 
> Do you have something in mind for this?  Basically *all* of the servers
> that I have access to have PCID because they are newer than ~7 years old.
> 
> That leaves *some* Ivybridge and earlier desktops, Atoms and AMD

AMD is not interesting as it's not PTI and uses GLOBAL anyway.

> systems.  Atoms are going to be the easiest thing to get my hands on,
> but I tend to shy away from them for performance work.

What I have in mind is that I wonder whether the whole circus is worth it
when there is no performance advantage on PCID systems.

Thanks,

tglx



Re: [PATCH 00/11] Use global pages with PTI

2018-03-27 Thread Dave Hansen
On 03/27/2018 06:36 AM, Thomas Gleixner wrote:
>>                         User Time       Kernel Time     Clock Elapsed
>> Baseline ( 0 GLB PTEs)  803.79          67.77           237.30
>> w/series (28 GLB PTEs)  807.70 (+0.7%)  68.07 (+0.7%)   238.07 (+0.3%)
>>
>> Without PCIDs, it behaves the way I would expect.
> What's the performance benefit on !PCID systems? And I mean systems which
> actually do not have PCID, not a PCID system with 'nopcid' on the command
> line.

Do you have something in mind for this?  Basically *all* of the servers
that I have access to have PCID because they are newer than ~7 years old.

That leaves *some* Ivybridge and earlier desktops, Atoms and AMD
systems.  Atoms are going to be the easiest thing to get my hands on,
but I tend to shy away from them for performance work.


Re: [PATCH 00/11] Use global pages with PTI

2018-03-27 Thread Thomas Gleixner
On Fri, 23 Mar 2018, Dave Hansen wrote:
> On 03/23/2018 11:26 AM, Linus Torvalds wrote:
> > On Fri, Mar 23, 2018 at 10:44 AM, Dave Hansen
> >  wrote:
> >>
> >> This adds one major change from the last version of the patch set
> >> (present in the last patch).  It makes all kernel text global for non-
> >> PCID systems.  This keeps kernel data protected always, but means that
> >> it will be easier to find kernel gadgets via meltdown on old systems
> >> without PCIDs.  This heuristic is, I think, a reasonable one and it
> >> keeps us from having to create any new pti=foo options
> > 
> > Sounds sane.
> > 
> > The patches look reasonable, but I hate seeing a patch series like
> > this where the only ostensible reason is performance, and there are no
> > performance numbers anywhere..
> 
> Well, rats.  This somehow makes things slower with PCIDs on.  I thought
> I reversed the numbers, but I actually do a "grep -c GLB
> /sys/kernel/debug/page_tables/kernel" and record that in my logs right
> next to the output of time(1), so it's awfully hard to screw up.
> 
> This is time doing a modestly-sized kernel compile on a 4-core Skylake
> desktop.
> 
>                         User Time       Kernel Time     Clock Elapsed
> Baseline ( 0 GLB PTEs)  803.79          67.77           237.30
> w/series (28 GLB PTEs)  807.70 (+0.7%)  68.07 (+0.7%)   238.07 (+0.3%)
> 
> Without PCIDs, it behaves the way I would expect.

What's the performance benefit on !PCID systems? And I mean systems which
actually do not have PCID, not a PCID system with 'nopcid' on the command
line.

Thanks,

tglx
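
The GLB count quoted above can be wrapped in a small helper so that it lands
in the log next to each timing run. A minimal sketch, assuming debugfs is
mounted at /sys/kernel/debug and the page table dump is enabled in the kernel
config; count_glb is a hypothetical name, and the optional file argument
exists only so the helper can be tried on a saved dump.

```shell
#!/bin/sh
# count_glb [FILE]: print how many lines of the kernel page table dump
# mention GLB. FILE defaults to the debugfs path used in the thread.
# (Hypothetical helper, for illustration only.)
count_glb() {
    # grep -c prints the number of matching lines; it exits non-zero
    # when there are none, so mask that to keep "0" a normal result.
    grep -c GLB "${1:-/sys/kernel/debug/page_tables/kernel}" || true
}

# Usage (as root), right before a timed run:
#   echo "GLB entries: $(count_glb)"
```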


Re: [PATCH 00/11] Use global pages with PTI

2018-03-24 Thread Ingo Molnar

* Dave Hansen  wrote:

> This is time doing a modestly-sized kernel compile on a 4-core Skylake
> desktop.
> 
>                         User Time       Kernel Time     Clock Elapsed
> Baseline ( 0 GLB PTEs)  803.79          67.77           237.30
> w/series (28 GLB PTEs)  807.70 (+0.7%)  68.07 (+0.7%)   238.07 (+0.3%)
> 
> Without PCIDs, it behaves the way I would expect.
>
> I'll ask around, but I'm open to any ideas about what the heck might be
> causing this.

Hm, so it's a bit weird that while user time and kernel time both increased by
about 0.7%, elapsed time only increased by 0.3%? Typically kernel builds are
much more parallel for that to be typical, so maybe there's some noise in the
measurement?

Before spending too much time on the global-TLB patch angle I'd suggest
investing a bit of time into making sure that the regression you are seeing is
actually real:

You haven't described how you have measured kernel build times and "+0.7%
regression" might turn out to be the real number, but sub-1% accuracy kernel
build times are *awfully* susceptible to:

 - various sources of noise

 - systematic statistical errors which doesn't show up as 
   measurement-to-measurement noise but which skews the results:
   such as the boot-to-boot memory layout of the source code and
   object files.

 - cpufreq artifacts

Even repeated builds with 'make clean' inbetween can be misleading because the 
exact layout of key include files and binaries which get accessed the most 
often 
during a build are set into stone once they've been read into the page cache 
for 
the first time after bootup. Automated reboots between measurements can be 
misleading as well, if the file layout after bootup is too deterministic.

So here's a pretty reliable way to measure kernel build time, which tries to 
avoid 
the various pitfalls of caching.

First I make sure that cpufreq is set to 'performance':

  for ((cpu=0; cpu<120; cpu++)); do
G=/sys/devices/system/cpu/cpu$cpu/cpufreq/scaling_governor
[ -f $G ] && echo performance > $G
  done

[ ... because it can be *really* annoying to discover that an ostensible 
  performance regression was a cpufreq artifact ... again. ;-) ]

Then I copy a kernel tree to /tmp (ramfs) as root:

cd /tmp
rm -rf linux
git clone ~/linux linux
cd linux
make defconfig >/dev/null

... and then we can build the kernel in such a loop (as root again):

  perf stat --repeat 10 --null --pre'\
cp -a kernel ../kernel.copy.$(date +%s); \
rm -rf *;\
git checkout .;  \
echo 1 > /proc/sys/vm/drop_caches;   \
find ../kernel* -type f | xargs cat >/dev/null;  \
make -j kernel >/dev/null;   \
make clean >/dev/null 2>&1;  \
sync'\
 \
make -j16 >/dev/null

( I have tested these by pasting them into a terminal. Adjust the ~/linux 
source 
  git tree and the '-j16' to your system. )

Notes:

 - the 'pre' script portion is not timed by 'perf stat', only the raw build 
times

 - we flush all caches via drop_caches and re-establish everything again, but:

 - we also introduce an intentional memory leak by slowly filling up ramfs with 
   copies of 'kernel/', thus continously changing the layout of free memory, 
   cached data such as compiler binaries and the source code hierarchy. (Note 
   that the leak is about 8MB per iteration, so it isn't massive.)

With 10 iterations this is the statistical stability I get this on a big box:

 Performance counter stats for 'make -j128 kernel' (10 runs):

  26.346436425 seconds time elapsed(+- 0.19%)

... which, despite a high iteration count of 10, is still surprisingly noisy, 
right?

A 0.2% stddev is probably not enough to call a 0.7% regression with good 
confidence, so I had to use *30* iterations to make measurement noise to be 
about 
an order of magnitude lower than the effect I'm trying to measure:

 Performance counter stats for 'make -j128' (30 runs):

  26.334767571 seconds time elapsed(+- 0.09% )

i.e. "26.334 +- 0.023" seconds is a number we can have pretty high confidence 
in, 
on this system.

And just to demonstrate that it's all real, I repeated the whole 30-iteration 
measurement again:

 Performance counter stats for 'make -j128' (30 runs):

  26.311166142 seconds time elapsed(+- 0.07%)

Even if in the end you get a similar result, close to the +0.7% overhead you 
already measured, we should have more confidence in blaming global TLBs for the 
performance regression.

BYMMV.

Thanks,

Ingo

[*] Note that even this doesn't eliminate certain sources of measurement error: 
such as the boot-to-boot variance in the layout of 

Re: [PATCH 00/11] Use global pages with PTI

2018-03-24 Thread Ingo Molnar

* Dave Hansen  wrote:

> This is time doing a modestly-sized kernel compile on a 4-core Skylake
> desktop.
> 
>                         User Time        Kernel Time      Clock Elapsed
> Baseline ( 0 GLB PTEs)  803.79           67.77            237.30
> w/series (28 GLB PTEs)  807.70 (+0.7%)   68.07 (+0.7%)    238.07 (+0.3%)
> 
> Without PCIDs, it behaves the way I would expect.
>
> I'll ask around, but I'm open to any ideas about what the heck might be
> causing this.

Hm, so it's a bit weird that while user time and kernel time both increased
by about 0.7%, elapsed time only increased by 0.3%? Kernel builds are
typically much too parallel for that to be normal, so maybe there's some
noise in the measurement?

Before spending too much time on the global-TLB patch angle I'd suggest
investing a bit of time into making sure that the regression you are seeing
is actually real:

You haven't described how you have measured kernel build times and "+0.7%
regression" might turn out to be the real number, but sub-1% accuracy kernel
build times are *awfully* susceptible to:

 - various sources of noise

 - systematic statistical errors which don't show up as
   measurement-to-measurement noise but which skew the results,
   such as the boot-to-boot memory layout of the source code and
   object files.

 - cpufreq artifacts

Even repeated builds with 'make clean' in between can be misleading because
the exact layout of key include files and binaries which get accessed the
most often during a build is set in stone once they've been read into the
page cache for the first time after bootup. Automated reboots between
measurements can be misleading as well, if the file layout after bootup is
too deterministic.

So here's a pretty reliable way to measure kernel build time, which tries to
avoid the various pitfalls of caching.

First I make sure that cpufreq is set to 'performance':

  for ((cpu=0; cpu<120; cpu++)); do
    G=/sys/devices/system/cpu/cpu$cpu/cpufreq/scaling_governor
    [ -f $G ] && echo performance > $G
  done

[ ... because it can be *really* annoying to discover that an ostensible 
  performance regression was a cpufreq artifact ... again. ;-) ]
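
The same step can be sketched without the hard-coded CPU count by globbing the
per-CPU directories. The set_governor name and its optional root-directory
argument are assumptions added here so that the function can be pointed at a
scratch directory; on a real system the default path applies and it has to run
as root.

```sh
# set_governor GOVERNOR [sysfs-cpu-root]
# Writes GOVERNOR into every scaling_governor file found under the root.
# Hypothetical helper; the second argument exists only for dry runs.
set_governor() {
  governor="$1"
  root="${2:-/sys/devices/system/cpu}"
  for g in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
    # An unmatched glob stays a literal pattern, which the -f test skips.
    if [ -f "$g" ]; then
      echo "$governor" > "$g"
    fi
  done
}
```

Calling "set_governor performance" before a measurement run has the same
effect as the loop above on machines of any CPU count.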

Then I copy a kernel tree to /tmp (ramfs) as root:

  cd /tmp
  rm -rf linux
  git clone ~/linux linux
  cd linux
  make defconfig >/dev/null

... and then we can build the kernel in such a loop (as root again):

  perf stat --repeat 10 --null --pre '\
    cp -a kernel ../kernel.copy.$(date +%s);         \
    rm -rf *;                                        \
    git checkout .;                                  \
    echo 1 > /proc/sys/vm/drop_caches;               \
    find ../kernel* -type f | xargs cat >/dev/null;  \
    make -j kernel >/dev/null;                       \
    make clean >/dev/null 2>&1;                      \
    sync'                                            \
                                                     \
    make -j16 >/dev/null

( I have tested these by pasting them into a terminal. Adjust the ~/linux
  source git tree and the '-j16' to your system. )

Notes:

 - the 'pre' script portion is not timed by 'perf stat', only the raw build
   times

 - we flush all caches via drop_caches and re-establish everything again, but:

 - we also introduce an intentional memory leak by slowly filling up ramfs with
   copies of 'kernel/', thus continuously changing the layout of free memory,
   cached data such as compiler binaries and the source code hierarchy. (Note
   that the leak is about 8MB per iteration, so it isn't massive.)

With 10 iterations this is the statistical stability I get on a big box:

 Performance counter stats for 'make -j128 kernel' (10 runs):

  26.346436425 seconds time elapsed(+- 0.19%)

... which, despite a high iteration count of 10, is still surprisingly
noisy, right?
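
To compare several such runs it helps to pull the two numbers out of saved
'perf stat' output. The following is a small sketch that assumes the summary
line contains the text "seconds time elapsed" and a single percentage, as in
the output quoted here; parse_elapsed is a made-up name.

```sh
# parse_elapsed: read 'perf stat' output on stdin and print
# "<seconds> <relative-error-in-percent>" for each summary line.
# Hypothetical helper; it only relies on the line layout shown above.
parse_elapsed() {
  awk '/seconds time elapsed/ {
         secs = $1                       # first field is the mean elapsed time
         pct = ""
         if (match($0, /[0-9.]+%/))      # the "+- N%" figure, if present
           pct = substr($0, RSTART, RLENGTH - 1)
         print secs, pct
       }'
}
```

Since 'perf stat' prints its summary on stderr, something like
"perf stat ... 2>run.log" followed by "parse_elapsed < run.log" would be one
way to feed it.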

A 0.2% stddev is probably not enough to call a 0.7% regression with good
confidence, so I had to use *30* iterations to get measurement noise down to
about an order of magnitude lower than the effect I'm trying to measure:

 Performance counter stats for 'make -j128' (30 runs):

  26.334767571 seconds time elapsed(+- 0.09% )

i.e. "26.334 +- 0.023" seconds is a number we can have pretty high confidence
in, on this system.
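
For build times collected by hand rather than by 'perf stat', the same kind of
"mean +- error" summary can be sketched with awk. This assumes the '+-' figure
is the standard error of the mean (sample standard deviation divided by the
square root of the run count); summarize is a made-up name.

```sh
# summarize: read one elapsed time per line on stdin and print
# "mean +- standard-error (relative error in percent)".
# Hypothetical helper; needs at least two samples.
summarize() {
  awk '{ n++; s += $1; ss += $1 * $1 }
       END {
         if (n < 2) exit 1
         mean = s / n
         var = (ss - n * mean * mean) / (n - 1)   # sample variance
         if (var < 0) var = 0                     # guard against rounding
         sem = sqrt(var / n)                      # standard error of the mean
         printf "%.3f +- %.3f (%.2f%%)\n", mean, sem, 100 * sem / mean
       }'
}
```

Because the standard error shrinks with the square root of the run count,
going from 10 to 30 runs is expected to cut it by a bit less than half, which
is roughly what the 0.19% and 0.09% figures show.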

And just to demonstrate that it's all real, I repeated the whole 30-iteration 
measurement again:

 Performance counter stats for 'make -j128' (30 runs):

  26.311166142 seconds time elapsed(+- 0.07%)
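
Whether two such results differ by more than their noise can be checked with
a little arithmetic. The sketch below combines the two relative errors in
quadrature and calls a difference significant only if it exceeds twice that
combined figure. Both the factor of two and the compare_runs name are
assumptions made for this illustration.

```sh
# compare_runs BASE_SECS BASE_PCT NEW_SECS NEW_PCT
# Prints the relative change, the combined noise and a rough verdict.
# Hypothetical helper; the 2x threshold is an arbitrary rule of thumb.
compare_runs() {
  awk -v b="$1" -v bp="$2" -v n="$3" -v np="$4" 'BEGIN {
    delta = 100 * (n - b) / b            # relative change in percent
    noise = sqrt(bp * bp + np * np)      # errors combined in quadrature
    mag = (delta < 0) ? -delta : delta
    verdict = (mag > 2 * noise) ? "significant" : "within-noise"
    printf "delta=%.3f%% noise=%.3f%% verdict=%s\n", delta, noise, verdict
  }'
}
```

Fed the two 30-run results above,
"compare_runs 26.334767571 0.09 26.311166142 0.07" prints
"delta=-0.090% noise=0.114% verdict=within-noise", i.e. the repeated
measurement agrees with the first one.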

Even if in the end you get a similar result, close to the +0.7% overhead you 
already measured, we should have more confidence in blaming global TLBs for the 
performance regression.

BYMMV.

Thanks,

Ingo

[*] Note that even this doesn't eliminate certain sources of measurement error: 
such as the boot-to-boot variance in the layout of certain key kernel data

Re: [PATCH 00/11] Use global pages with PTI

2018-03-23 Thread Linus Torvalds
On Fri, Mar 23, 2018 at 5:46 PM, Linus Torvalds wrote:
>
> It is, of course, possible that I misunderstood what you actually
> benchmarked. But I assume the above benchmark numbers are with the
> whole "don't even do global entries if you have PCID".

Oh, I went back and read your description, and realized that I _had_
misunderstood what you did.

I thought you didn't bother with global pages at all when you had PCID.

But that's not what you meant. You always do global for the actual
user-mapped kernel pages, but when you don't have PCID you do *all*
kernel text as global, whether shared or not.

So I entirely misread what the latest change was.

 Linus


Re: [PATCH 00/11] Use global pages with PTI

2018-03-23 Thread Linus Torvalds
On Fri, Mar 23, 2018 at 5:40 PM, Dave Hansen wrote:
>
> Well, rats.  This somehow makes things slower with PCIDs on.

.. what happens when you enable global pages with PCID? You disabled
them explicitly because you thought they wouldn't matter..

Even with PCID, a global TLB entry for the shared pages would make
sense, because it's now just *one* entry in the TLB rather than "one
per PCID and one for the kernel mapping".

So even if in theory the lifetime of the TLB entry is the same, when
you have capacity misses it most definitely isn't.
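
The capacity argument can be put as a toy count. This is only an illustrative
model, not a description of how any real TLB allocates entries: it assumes
that a non-global translation needs one entry per address-space tag it was
used under, while a global one needs a single entry. The tlb_entries name and
its arguments are made up.

```sh
# tlb_entries PAGES TAGS GLOBAL
# Toy model: TLB entries needed to keep PAGES shared pages cached.
#   TAGS   = number of PCID-tagged address spaces that touched them
#   GLOBAL = 1 if the pages are mapped global, 0 otherwise
tlb_entries() {
  pages="$1"; tags="$2"; global="$3"
  if [ "$global" = 1 ]; then
    echo "$pages"               # one entry per page, shared by every tag
  else
    echo $((pages * tags))      # one copy of each page per tag
  fi
}
```

In this model 28 shared pages used under 4 tags take 112 entries when they
are not global and 28 when they are.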

And for process tear-down and build-up the per-PCID TLB entry does
nothing at all. While for a true global entry, it gets shared even
across process creation/deletion. So even ignoring TLB capacity
issues, with lots of short-lived processes global TLB entries are much
better.

It is, of course, possible that I misunderstood what you actually
benchmarked. But I assume the above benchmark numbers are with the
whole "don't even do global entries if you have PCID".

   Linus


Re: [PATCH 00/11] Use global pages with PTI

2018-03-23 Thread Dave Hansen
On 03/23/2018 11:26 AM, Linus Torvalds wrote:
> On Fri, Mar 23, 2018 at 10:44 AM, Dave Hansen wrote:
>>
>> This adds one major change from the last version of the patch set
>> (present in the last patch).  It makes all kernel text global for non-
>> PCID systems.  This keeps kernel data protected always, but means that
>> it will be easier to find kernel gadgets via meltdown on old systems
>> without PCIDs.  This heuristic is, I think, a reasonable one and it
>> keeps us from having to create any new pti=foo options
> 
> Sounds sane.
> 
> The patches look reasonable, but I hate seeing a patch series like
> this where the only ostensible reason is performance, and there are no
> performance numbers anywhere..

Well, rats.  This somehow makes things slower with PCIDs on.  I thought
I reversed the numbers, but I actually do a "grep -c GLB
/sys/kernel/debug/page_tables/kernel" and record that in my logs right
next to the output of time(1), so it's awfully hard to screw up.
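
The bookkeeping described here can be wrapped in a small helper. It is a
sketch only: count_glb is a made-up name, the dump path is the one quoted
above, and reading it needs root plus a kernel that provides the page-table
dump in debugfs.

```sh
# count_glb [page-table-dump]
# Prints how many lines of the dump carry the GLB flag.
# Hypothetical helper; the file argument exists so a saved dump can be used.
count_glb() {
  dump="${1:-/sys/kernel/debug/page_tables/kernel}"
  # grep -c still prints 0 when nothing matches but exits non-zero,
  # so the exit status is ignored here.
  grep -c GLB "$dump" || true
}
```

Writing the count into the same log as the timing output ties each result to
the page-table state it was measured with.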

This is time doing a modestly-sized kernel compile on a 4-core Skylake
desktop.

                        User Time        Kernel Time      Clock Elapsed
Baseline ( 0 GLB PTEs)  803.79           67.77            237.30
w/series (28 GLB PTEs)  807.70 (+0.7%)   68.07 (+0.7%)    238.07 (+0.3%)

Without PCIDs, it behaves the way I would expect.

I'll ask around, but I'm open to any ideas about what the heck might be
causing this.


Re: [PATCH 00/11] Use global pages with PTI

2018-03-23 Thread Linus Torvalds
On Fri, Mar 23, 2018 at 10:44 AM, Dave Hansen wrote:
>
> This adds one major change from the last version of the patch set
> (present in the last patch).  It makes all kernel text global for non-
> PCID systems.  This keeps kernel data protected always, but means that
> it will be easier to find kernel gadgets via meltdown on old systems
> without PCIDs.  This heuristic is, I think, a reasonable one and it
> keeps us from having to create any new pti=foo options

Sounds sane.

The patches look reasonable, but I hate seeing a patch series like
this where the only ostensible reason is performance, and there are no
performance numbers anywhere..

 Linus


[PATCH 00/11] Use global pages with PTI

2018-03-23 Thread Dave Hansen
The later versions of the KAISER patches (pre-PTI) allowed the user/kernel
shared areas to be GLOBAL.  The thought was that this would reduce the
TLB overhead of keeping two copies of these mappings.

During the switch over to PTI, we seem to have lost our ability to have
GLOBAL mappings.  This adds them back.

This adds one major change from the last version of the patch set
(present in the last patch).  It makes all kernel text global for non-
PCID systems.  This keeps kernel data protected always, but means that
it will be easier to find kernel gadgets via meltdown on old systems
without PCIDs.  This heuristic is, I think, a reasonable one and it
keeps us from having to create any new pti=foo options.
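
As a userspace illustration of that heuristic (this is not the kernel code
from the series), the choice can be written as a function of whether the pcid
flag shows up in /proc/cpuinfo. The global_policy name, its file argument and
the wording of its output are made up.

```sh
# global_policy [cpuinfo-file]
# Prints which mappings the heuristic described above would make global.
# Hypothetical helper that only restates the cover letter's rule.
global_policy() {
  cpuinfo="${1:-/proc/cpuinfo}"
  if grep '^flags' "$cpuinfo" | grep -qw pcid; then
    echo "global: user/kernel shared areas only"
  else
    echo "global: user/kernel shared areas plus all kernel text"
  fi
}
```

Kernel data stays non-global in both branches, which matches the "keeps
kernel data protected always" statement.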

Cc: Andrea Arcangeli 
Cc: Andy Lutomirski 
Cc: Linus Torvalds 
Cc: Kees Cook 
Cc: Hugh Dickins 
Cc: Juergen Gross 
Cc: x...@kernel.org
Cc: Nadav Amit 

