Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled

2015-09-29 Thread H. Peter Anvin
No, it is a natural result of an implemention which treats setting the A bit as 
an abnormal flow (e.g. in microcode as opposed to hardware).

On September 29, 2015 7:11:59 PM PDT, ebied...@xmission.com wrote:
>"H. Peter Anvin"  writes:
>
>> On 09/29/2015 06:20 PM, Eric W. Biederman wrote:
>>> Linus Torvalds  writes:
>>> 
 On Tue, Sep 29, 2015 at 1:35 PM, Andy Lutomirski
> wrote:
>
> Does anyone know what happens if you stick a non-accessed segment
>in
> the GDT, map the GDT RO, and access it?

 You should get a #PF, as you guess, but go ahead and test it if you
 want to make sure.
>>> 
>>> I tested this by accident once when workinng on what has become
>known
>>> as coreboot.  Early in boot with your GDT in a EEPROM switching from
>>> real mode to 32bit protected mode causes a write and locks up the
>>> machine when the hardware declines the write to the GDT to set the
>>> accessed bit.  As I recall the write kept being retried and retried
>and
>>> retried...
>>> 
>>> Setting the access bit in the GDT cleared up the problem and I did
>not
>>> look back.
>>> 
>>> Way up in 64bit mode something might be different, but I don't know
>why
>>> cpu designeres would waste the silicon.
>>> 
>>
>> This is totally different from a TLB violation.  In your case, the
>write
>> goes through as far as the CPU is concerned, but when the data is
>> fetched back, it hasn't changed.  A write to a TLB-protected location
>> will #PF.
>
>The key point is that a write is generated when the cpu needs to set
>the
>access bit.  I agree the failure points are different.  A TLB fault vs
>a
>case where the hardware did not accept the write.
>
>The idea of a cpu reading back data (and not trusting it's cache
>coherency controls) to verify the access bit gets set seems mind
>boggling.  That is slow, stupid, racy and incorrect.  Incorrect as the
>cpu should not only set the access bit once per segment register load.
>
>In my case I am pretty certain it was something very weird with the
>hardware not acceppting the write and either not acknowledging the bus
>transaction or cancelling it.  In which case the cpu knew the write had
>not made it to the ``memory'' and was trying to cope.
>
>Eric

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled

2015-09-29 Thread Eric W. Biederman
"H. Peter Anvin"  writes:

> On 09/29/2015 06:20 PM, Eric W. Biederman wrote:
>> Linus Torvalds  writes:
>> 
>>> On Tue, Sep 29, 2015 at 1:35 PM, Andy Lutomirski  
>>> wrote:

 Does anyone know what happens if you stick a non-accessed segment in
 the GDT, map the GDT RO, and access it?
>>>
>>> You should get a #PF, as you guess, but go ahead and test it if you
>>> want to make sure.
>> 
>> I tested this by accident once when workinng on what has become known
>> as coreboot.  Early in boot with your GDT in a EEPROM switching from
>> real mode to 32bit protected mode causes a write and locks up the
>> machine when the hardware declines the write to the GDT to set the
>> accessed bit.  As I recall the write kept being retried and retried and
>> retried...
>> 
>> Setting the access bit in the GDT cleared up the problem and I did not
>> look back.
>> 
>> Way up in 64bit mode something might be different, but I don't know why
>> cpu designeres would waste the silicon.
>> 
>
> This is totally different from a TLB violation.  In your case, the write
> goes through as far as the CPU is concerned, but when the data is
> fetched back, it hasn't changed.  A write to a TLB-protected location
> will #PF.

The key point is that a write is generated when the cpu needs to set the
access bit.  I agree the failure points are different.  A TLB fault vs a
case where the hardware did not accept the write.

The idea of a cpu reading back data (and not trusting it's cache
coherency controls) to verify the access bit gets set seems mind
boggling.  That is slow, stupid, racy and incorrect.  Incorrect as the
cpu should not only set the access bit once per segment register load.

In my case I am pretty certain it was something very weird with the
hardware not acceppting the write and either not acknowledging the bus
transaction or cancelling it.  In which case the cpu knew the write had
not made it to the ``memory'' and was trying to cope.

Eric



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled

2015-09-29 Thread H. Peter Anvin
On 09/29/2015 06:20 PM, Eric W. Biederman wrote:
> Linus Torvalds  writes:
> 
>> On Tue, Sep 29, 2015 at 1:35 PM, Andy Lutomirski  wrote:
>>>
>>> Does anyone know what happens if you stick a non-accessed segment in
>>> the GDT, map the GDT RO, and access it?
>>
>> You should get a #PF, as you guess, but go ahead and test it if you
>> want to make sure.
> 
> I tested this by accident once when workinng on what has become known
> as coreboot.  Early in boot with your GDT in a EEPROM switching from
> real mode to 32bit protected mode causes a write and locks up the
> machine when the hardware declines the write to the GDT to set the
> accessed bit.  As I recall the write kept being retried and retried and
> retried...
> 
> Setting the access bit in the GDT cleared up the problem and I did not
> look back.
> 
> Way up in 64bit mode something might be different, but I don't know why
> cpu designeres would waste the silicon.
> 

This is totally different from a TLB violation.  In your case, the write
goes through as far as the CPU is concerned, but when the data is
fetched back, it hasn't changed.  A write to a TLB-protected location
will #PF.

-hpa


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled

2015-09-29 Thread Eric W. Biederman
Linus Torvalds  writes:

> On Tue, Sep 29, 2015 at 1:35 PM, Andy Lutomirski  wrote:
>>
>> Does anyone know what happens if you stick a non-accessed segment in
>> the GDT, map the GDT RO, and access it?
>
> You should get a #PF, as you guess, but go ahead and test it if you
> want to make sure.

I tested this by accident once when workinng on what has become known
as coreboot.  Early in boot with your GDT in a EEPROM switching from
real mode to 32bit protected mode causes a write and locks up the
machine when the hardware declines the write to the GDT to set the
accessed bit.  As I recall the write kept being retried and retried and
retried...

Setting the access bit in the GDT cleared up the problem and I did not
look back.

Way up in 64bit mode something might be different, but I don't know why
cpu designeres would waste the silicon.

Eric
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled

2015-09-29 Thread H. Peter Anvin
On 09/29/2015 11:02 AM, Andy Lutomirski wrote:
> On Tue, Sep 29, 2015 at 10:50 AM, Linus Torvalds
>  wrote:
>> On Tue, Sep 29, 2015 at 1:35 PM, Andy Lutomirski  wrote:
>>>
>>> Does anyone know what happens if you stick a non-accessed segment in
>>> the GDT, map the GDT RO, and access it?
>>
>> You should get a #PF, as you guess, but go ahead and test it if you
>> want to make sure.
> 
> Then I think that, if we do this, the patch series should first make
> it percpu and fixmapped but RW and then flip it RO as a separate patch
> in case we need to revert the actual RO bit.  I don't want to break
> Wine or The Witcher 2 because of this, and we might need various
> fixups.  I really hope that no one uses get_thread_area to check
> whether TLS has been accessed.
> 

Of course.

-hpa


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled

2015-09-29 Thread H. Peter Anvin
Ugh.  Didn't realize that.

On September 29, 2015 11:22:04 AM PDT, Andy Lutomirski  
wrote:
>On Tue, Sep 29, 2015 at 11:18 AM, H. Peter Anvin  wrote:
>> SGDT would be easy to use, and it is logical that it is faster since
>it reads an internal register.  SIDT does too but unlike the GDT has a
>secondary limit (it can never be larger than 4096 bytes) and so all
>limits in the range 4095-65535 are exactly equivalent.
>>
>
>Using the IDT limit would have been a great ideal if Intel hadn't
>decided to clobber it on every VM exit.
>
>--Andy

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled

2015-09-29 Thread Andy Lutomirski
On Tue, Sep 29, 2015 at 11:18 AM, H. Peter Anvin  wrote:
> SGDT would be easy to use, and it is logical that it is faster since it reads 
> an internal register.  SIDT does too but unlike the GDT has a secondary limit 
> (it can never be larger than 4096 bytes) and so all limits in the range 
> 4095-65535 are exactly equivalent.
>

Using the IDT limit would have been a great ideal if Intel hadn't
decided to clobber it on every VM exit.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled

2015-09-29 Thread H. Peter Anvin
SGDT would be easy to use, and it is logical that it is faster since it reads 
an internal register.  SIDT does too but unlike the GDT has a secondary limit 
(it can never be larger than 4096 bytes) and so all limits in the range 
4095-65535 are exactly equivalent.

Anything that causes a write to the GDT will #PF if read-only.  So yes, we need 
to force the accessed bit to set.  This shouldn't be a problem and in fact 
ought to be a performance improvement.

On September 29, 2015 10:35:38 AM PDT, Andy Lutomirski  
wrote:
>On Sep 29, 2015 2:01 AM, "Ingo Molnar"  wrote:
>>
>>
>> * Denys Vlasenko  wrote:
>>
>> > On 09/28/2015 09:58 AM, Ingo Molnar wrote:
>> > >
>> > > * Denys Vlasenko  wrote:
>> > >
>> > >> On 09/26/2015 09:50 PM, H. Peter Anvin wrote:
>> > >>> NAK.  We really should map the GDT read-only on all 64 bit
>systems,
>> > >>> since we can't hide the address from SLDT.  Same with the IDT.
>> > >>
>> > >> Sorry, I don't understand your point.
>> > >
>> > > So the problem is that right now the SGDT instruction (which is
>unprivileged)
>> > > leaks the real address of the kernel image:
>> > >
>> > >  fomalhaut:~> ./sgdt
>> > >  SGDT: 88303fd89000 / 007f
>> > >
>> > > that '88303fd89000' is a kernel address.
>> >
>> > Thank you.
>> > I do know that SGDT and friends are unprivileged on x86
>> > and thus they allow userspace (and guest kernels in paravirt)
>> > learn things they don't need to know.
>> >
>> > I don't see how making GDT page-aligned and page-sized
>> > changes anything in this regard. SGDT will still work,
>> > and still leak GDT address.
>>
>> Well, as I try to explain it in the other part of my mail, doing so
>enables us to
>> remap the GDT to a less security sensitive virtual address that does
>not leak the
>> kernel's randomized address:
>>
>> > > Your observation in the changelog and your patch:
>> > >
>> >  It is page-sized because of paravirt. [...]
>> > >
>> > > ... conflicts with the intention to mark (remap) the primary GDT
>address read-only
>> > > on native kernels as well.
>> > >
>> > > So what we should do instead is to use the page alignment
>properly and remap the
>> > > GDT to a read-only location, and load that one.
>> >
>> > If we'd have a small GDT (i.e. what my patch does), we still can
>remap the
>> > entire page which contains small GDT, and simply don't care that
>some other data
>> > is also visible through that RO page.
>>
>> That's generally considered fragile: suppose an attacker has a
>limited information
>> leak that can read absolute addresses with system privilege but he
>doesn't know
>> the kernel's randomized base offset. With a 'partial page' mapping
>there could be
>> function pointers near the GDT, part of the page the GDT happens to
>be on, that
>> leak this information.
>>
>> (Same goes for crypto keys or other critical information (like canary
>information,
>> salts, etc.) accidentally ending up nearby.)
>>
>> Arguably it's a bit tenuous, but when playing remapping games it's
>generally
>> considered good to be page aligned and page sized, with zero padding.
>>
>> > > This would have a couple of advantages:
>> > >
>> > >  - This would give kernel address randomization more teeth on
>x86.
>> > >
>> > >  - An additional advantage would be that rootkits overwriting the
>GDT would have
>> > >a bit more work to do.
>> > >
>> > >  - A third advantage would be that for NUMA systems we could
>'mirror' the GDT into
>> > >node-local memory and load those. This makes GDT load
>cache-misses a bit less
>> > >expensive.
>> >
>> > GDT is per-cpu. Isn't per-cpu memory already NUMA-local?
>>
>> Indeed it is:
>>
>> fomalhaut:~> for ((cpu=1; cpu<9; cpu++)); do taskset $cpu ./sgdt ;
>done
>> SGDT: 88103fa09000 / 007f
>> SGDT: 88103fa29000 / 007f
>> SGDT: 88103fa29000 / 007f
>> SGDT: 88103fa49000 / 007f
>> SGDT: 88103fa49000 / 007f
>> SGDT: 88103fa49000 / 007f
>> SGDT: 88103fa29000 / 007f
>> SGDT: 88103fa69000 / 007f
>>
>> I confused it with the IDT, which is still global.
>>
>> This also means that the GDT in itself does not leak kernel addresses
>at the
>> moment, except it leaks the layout of the percpu area.
>>
>> So my suggestion would be to:
>>
>>  - make the GDT unconditionally page aligned and sized, then remap it
>to a
>>read-only address unconditionally as well, like we do it for the
>IDT.
>
>Does anyone know what happens if you stick a non-accessed segment in
>the GDT, map the GDT RO, and access it?  The docs are extremely vague
>on the interplay between segmentation and paging on the segmentation
>structures themselves.  My guess is that it causes #PF.  This might
>break set_thread_area users unless we change set_thread_area to force
>the accessed bit on.
>
>There's a possible worse failure mode: if someone pokes an un-accessed
>segment into SS or CS using sigreturn, then it's within the realm of
>possibility that IRET would generate #PF (hey Intel and AMD, please
>document this!).  I don't think th

Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled

2015-09-29 Thread Andy Lutomirski
On Tue, Sep 29, 2015 at 10:50 AM, Linus Torvalds
 wrote:
> On Tue, Sep 29, 2015 at 1:35 PM, Andy Lutomirski  wrote:
>>
>> Does anyone know what happens if you stick a non-accessed segment in
>> the GDT, map the GDT RO, and access it?
>
> You should get a #PF, as you guess, but go ahead and test it if you
> want to make sure.
>

Then I think that, if we do this, the patch series should first make
it percpu and fixmapped but RW and then flip it RO as a separate patch
in case we need to revert the actual RO bit.  I don't want to break
Wine or The Witcher 2 because of this, and we might need various
fixups.  I really hope that no one uses get_thread_area to check
whether TLS has been accessed.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled

2015-09-29 Thread Linus Torvalds
On Tue, Sep 29, 2015 at 1:35 PM, Andy Lutomirski  wrote:
>
> Does anyone know what happens if you stick a non-accessed segment in
> the GDT, map the GDT RO, and access it?

You should get a #PF, as you guess, but go ahead and test it if you
want to make sure.

We do something very similar for the old Pentium F0 0F bug - we mark
the IDT read-only, which causes the (bogus) locked read of the IDT
entry that the F00F bug resulted in to be caught as a page fault
instead.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled

2015-09-29 Thread Andy Lutomirski
On Sep 29, 2015 2:01 AM, "Ingo Molnar"  wrote:
>
>
> * Denys Vlasenko  wrote:
>
> > On 09/28/2015 09:58 AM, Ingo Molnar wrote:
> > >
> > > * Denys Vlasenko  wrote:
> > >
> > >> On 09/26/2015 09:50 PM, H. Peter Anvin wrote:
> > >>> NAK.  We really should map the GDT read-only on all 64 bit systems,
> > >>> since we can't hide the address from SLDT.  Same with the IDT.
> > >>
> > >> Sorry, I don't understand your point.
> > >
> > > So the problem is that right now the SGDT instruction (which is 
> > > unprivileged)
> > > leaks the real address of the kernel image:
> > >
> > >  fomalhaut:~> ./sgdt
> > >  SGDT: 88303fd89000 / 007f
> > >
> > > that '88303fd89000' is a kernel address.
> >
> > Thank you.
> > I do know that SGDT and friends are unprivileged on x86
> > and thus they allow userspace (and guest kernels in paravirt)
> > learn things they don't need to know.
> >
> > I don't see how making GDT page-aligned and page-sized
> > changes anything in this regard. SGDT will still work,
> > and still leak GDT address.
>
> Well, as I try to explain it in the other part of my mail, doing so enables 
> us to
> remap the GDT to a less security sensitive virtual address that does not leak 
> the
> kernel's randomized address:
>
> > > Your observation in the changelog and your patch:
> > >
> >  It is page-sized because of paravirt. [...]
> > >
> > > ... conflicts with the intention to mark (remap) the primary GDT address 
> > > read-only
> > > on native kernels as well.
> > >
> > > So what we should do instead is to use the page alignment properly and 
> > > remap the
> > > GDT to a read-only location, and load that one.
> >
> > If we'd have a small GDT (i.e. what my patch does), we still can remap the
> > entire page which contains small GDT, and simply don't care that some other 
> > data
> > is also visible through that RO page.
>
> That's generally considered fragile: suppose an attacker has a limited 
> information
> leak that can read absolute addresses with system privilege but he doesn't 
> know
> the kernel's randomized base offset. With a 'partial page' mapping there 
> could be
> function pointers near the GDT, part of the page the GDT happens to be on, 
> that
> leak this information.
>
> (Same goes for crypto keys or other critical information (like canary 
> information,
> salts, etc.) accidentally ending up nearby.)
>
> Arguably it's a bit tenuous, but when playing remapping games it's generally
> considered good to be page aligned and page sized, with zero padding.
>
> > > This would have a couple of advantages:
> > >
> > >  - This would give kernel address randomization more teeth on x86.
> > >
> > >  - An additional advantage would be that rootkits overwriting the GDT 
> > > would have
> > >a bit more work to do.
> > >
> > >  - A third advantage would be that for NUMA systems we could 'mirror' the 
> > > GDT into
> > >node-local memory and load those. This makes GDT load cache-misses a 
> > > bit less
> > >expensive.
> >
> > GDT is per-cpu. Isn't per-cpu memory already NUMA-local?
>
> Indeed it is:
>
> fomalhaut:~> for ((cpu=1; cpu<9; cpu++)); do taskset $cpu ./sgdt ; done
> SGDT: 88103fa09000 / 007f
> SGDT: 88103fa29000 / 007f
> SGDT: 88103fa29000 / 007f
> SGDT: 88103fa49000 / 007f
> SGDT: 88103fa49000 / 007f
> SGDT: 88103fa49000 / 007f
> SGDT: 88103fa29000 / 007f
> SGDT: 88103fa69000 / 007f
>
> I confused it with the IDT, which is still global.
>
> This also means that the GDT in itself does not leak kernel addresses at the
> moment, except it leaks the layout of the percpu area.
>
> So my suggestion would be to:
>
>  - make the GDT unconditionally page aligned and sized, then remap it to a
>read-only address unconditionally as well, like we do it for the IDT.

Does anyone know what happens if you stick a non-accessed segment in
the GDT, map the GDT RO, and access it?  The docs are extremely vague
on the interplay between segmentation and paging on the segmentation
structures themselves.  My guess is that it causes #PF.  This might
break set_thread_area users unless we change set_thread_area to force
the accessed bit on.

There's a possible worse failure mode: if someone pokes an un-accessed
segment into SS or CS using sigreturn, then it's within the realm of
possibility that IRET would generate #PF (hey Intel and AMD, please
document this!).  I don't think that would be rootable, but at the
very least we'd want to make sure it doesn't OOPS by either making it
impossible or adding an explicit test to sigreturn.c.

hpa pointed out in another thread that the GDT *must* be writable on
32-bit kernels because we use a task gate for NMI and jumping through
a task gate writes to the GDT.

On another note, SGDT is considerably faster than LSL, at least on
Sandy Bridge.  The vdso might be able to take advantage of that for
getcpu.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.k

Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled

2015-09-29 Thread Ingo Molnar

* Denys Vlasenko  wrote:

> On 09/28/2015 09:58 AM, Ingo Molnar wrote:
> > 
> > * Denys Vlasenko  wrote:
> > 
> >> On 09/26/2015 09:50 PM, H. Peter Anvin wrote:
> >>> NAK.  We really should map the GDT read-only on all 64 bit systems,
> >>> since we can't hide the address from SLDT.  Same with the IDT.
> >>
> >> Sorry, I don't understand your point.
> > 
> > So the problem is that right now the SGDT instruction (which is 
> > unprivileged) 
> > leaks the real address of the kernel image:
> > 
> >  fomalhaut:~> ./sgdt 
> >  SGDT: 88303fd89000 / 007f
> > 
> > that '88303fd89000' is a kernel address.
> 
> Thank you.
> I do know that SGDT and friends are unprivileged on x86
> and thus they allow userspace (and guest kernels in paravirt)
> learn things they don't need to know.
> 
> I don't see how making GDT page-aligned and page-sized
> changes anything in this regard. SGDT will still work,
> and still leak GDT address.

Well, as I try to explain it in the other part of my mail, doing so enables us 
to 
remap the GDT to a less security sensitive virtual address that does not leak 
the 
kernel's randomized address:

> > Your observation in the changelog and your patch:
> > 
>  It is page-sized because of paravirt. [...]
> > 
> > ... conflicts with the intention to mark (remap) the primary GDT address 
> > read-only 
> > on native kernels as well.
> > 
> > So what we should do instead is to use the page alignment properly and 
> > remap the 
> > GDT to a read-only location, and load that one.
> 
> If we'd have a small GDT (i.e. what my patch does), we still can remap the 
> entire page which contains small GDT, and simply don't care that some other 
> data 
> is also visible through that RO page.

That's generally considered fragile: suppose an attacker has a limited 
information 
leak that can read absolute addresses with system privilege but he doesn't know 
the kernel's randomized base offset. With a 'partial page' mapping there could 
be 
function pointers near the GDT, part of the page the GDT happens to be on, that 
leak this information.

(Same goes for crypto keys or other critical information (like canary 
information, 
salts, etc.) accidentally ending up nearby.)

Arguably it's a bit tenuous, but when playing remapping games it's generally 
considered good to be page aligned and page sized, with zero padding.

> > This would have a couple of advantages:
> > 
> >  - This would give kernel address randomization more teeth on x86.
> > 
> >  - An additional advantage would be that rootkits overwriting the GDT would 
> > have 
> >a bit more work to do.
> > 
> >  - A third advantage would be that for NUMA systems we could 'mirror' the 
> > GDT into
> >node-local memory and load those. This makes GDT load cache-misses a bit 
> > less
> >expensive.
> 
> GDT is per-cpu. Isn't per-cpu memory already NUMA-local?

Indeed it is:

fomalhaut:~> for ((cpu=1; cpu<9; cpu++)); do taskset $cpu ./sgdt ; done
SGDT: 88103fa09000 / 007f
SGDT: 88103fa29000 / 007f
SGDT: 88103fa29000 / 007f
SGDT: 88103fa49000 / 007f
SGDT: 88103fa49000 / 007f
SGDT: 88103fa49000 / 007f
SGDT: 88103fa29000 / 007f
SGDT: 88103fa69000 / 007f

I confused it with the IDT, which is still global.

This also means that the GDT in itself does not leak kernel addresses at the 
moment, except it leaks the layout of the percpu area.

So my suggestion would be to:

 - make the GDT unconditionally page aligned and sized, then remap it to a
   read-only address unconditionally as well, like we do it for the IDT.

 - make the IDT per CPU as well, for performance reasons.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled

2015-09-28 Thread Denys Vlasenko
On 09/28/2015 09:58 AM, Ingo Molnar wrote:
> 
> * Denys Vlasenko  wrote:
> 
>> On 09/26/2015 09:50 PM, H. Peter Anvin wrote:
>>> NAK.  We really should map the GDT read-only on all 64 bit systems,
>>> since we can't hide the address from SLDT.  Same with the IDT.
>>
>> Sorry, I don't understand your point.
> 
> So the problem is that right now the SGDT instruction (which is unprivileged) 
> leaks the real address of the kernel image:
> 
>  fomalhaut:~> ./sgdt 
>  SGDT: 88303fd89000 / 007f
> 
> that '88303fd89000' is a kernel address.

Thank you.
I do know that SGDT and friends are unprivileged on x86
and thus they allow userspace (and guest kernels in paravirt)
learn things they don't need to know.

I don't see how making GDT page-aligned and page-sized
changes anything in this regard. SGDT will still work,
and still leak GDT address.

> Your observation in the changelog and your patch:
> 
 It is page-sized because of paravirt. [...]
> 
> ... conflicts with the intention to mark (remap) the primary GDT address 
> read-only 
> on native kernels as well.
> 
> So what we should do instead is to use the page alignment properly and remap 
> the 
> GDT to a read-only location, and load that one.

If we'd have a small GDT (i.e. what my patch does), we still can remap the 
entire page
which contains small GDT, and simply don't care that some other data is also 
visible
through that RO page.

> This would have a couple of advantages:
> 
>  - This would give kernel address randomization more teeth on x86.
> 
>  - An additional advantage would be that rootkits overwriting the GDT would 
> have 
>a bit more work to do.
> 
>  - A third advantage would be that for NUMA systems we could 'mirror' the GDT 
> into
>node-local memory and load those. This makes GDT load cache-misses a bit 
> less
>expensive.

GDT is per-cpu. Isn't per-cpu memory already NUMA-local?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled

2015-09-28 Thread Ingo Molnar

* Denys Vlasenko  wrote:

> On 09/26/2015 09:50 PM, H. Peter Anvin wrote:
> > NAK.  We really should map the GDT read-only on all 64 bit systems,
> > since we can't hide the address from SLDT.  Same with the IDT.
> 
> Sorry, I don't understand your point.

So the problem is that right now the SGDT instruction (which is unprivileged) 
leaks the real address of the kernel image:

 fomalhaut:~> ./sgdt 
 SGDT: 88303fd89000 / 007f

that '88303fd89000' is a kernel address.

 fomalhaut:~> cat sgdt.c 
 #include 
 #include 

 int main(void)
 {
 struct gdt_desc {
 unsigned short  limit;
 unsigned long   addr;
 } __attribute__((packed)) gdt_desc = { -1, -1 };

 asm volatile("sgdt %0": "=m" (gdt_desc));

 printf("SGDT: %016lx / %04x\n", gdt_desc.addr, gdt_desc.limit);

 return 0;
 }

Your observation in the changelog and your patch:

> >> It is page-sized because of paravirt. [...]

... conflicts with the intention to mark (remap) the primary GDT address 
read-only 
on native kernels as well.

So what we should do instead is to use the page alignment properly and remap 
the 
GDT to a read-only location, and load that one.

This would have a couple of advantages:

 - This would give kernel address randomization more teeth on x86.

 - An additional advantage would be that rootkits overwriting the GDT would 
have 
   a bit more work to do.

 - A third advantage would be that for NUMA systems we could 'mirror' the GDT 
into
   node-local memory and load those. This makes GDT load cache-misses a bit less
   expensive.

The IDT is already remapped:

 fomalhaut:~> ./sidt 
 Sidt: ff57b000 / 0fff
 fomalhaut:~> cat sidt.c
 #include 
 #include 

 int main(void)
 {
 struct idt_desc {
 unsigned short  limit;
 unsigned long   addr;
 } __attribute__((packed)) idt_desc = { -1, -1 };

 asm volatile("sidt %0": "=m" (idt_desc));

 printf("Sidt: %016lx / %04x\n", idt_desc.addr, idt_desc.limit);

 return 0;
 }

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled

2015-09-26 Thread Denys Vlasenko
On 09/26/2015 09:50 PM, H. Peter Anvin wrote:
> NAK.  We really should map the GDT read-only on all 64 bit systems,
> since we can't hide the address from SLDT.  Same with the IDT.

Sorry, I don't understand your point.


> On September 26, 2015 11:00:40 AM PDT, Denys Vlasenko  
> wrote:
>> We have our GDT in a page-sized per-cpu structure, gdt_page.
>>
>> On x86_64 kernel, GDT is 128 bytes - only ~3% of that page is used.
>>
>> It is page-sized because of paravirt. Hypervisors need to know when
>> GDT is changed, so they remap it read-only and handle write faults.
>> If it's not in its own page, other writes nearby will cause
>> those faults too.
>>
>> In other words, we need GDT to live in a separate page
>> only if CONFIG_HYPERVISOR_GUEST=y.
>>
>> This patch reduces GDT alignment to cacheline-aligned
>> if CONFIG_HYPERVISOR_GUEST is not set.
>>
>> Patch also renames gdt_page to cpu_gdt (mimicking naming of existing
>> cpu_tss per-cpu variable), since now it is not always a full page.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled

2015-09-26 Thread H. Peter Anvin
NAK.  We really should map the GDT read-only on all 64 bit systems, since we 
can't hide the address from SLDT.  Same with the IDT.

On September 26, 2015 11:00:40 AM PDT, Denys Vlasenko  
wrote:
>We have our GDT in a page-sized per-cpu structure, gdt_page.
>
>On x86_64 kernel, GDT is 128 bytes - only ~3% of that page is used.
>
>It is page-sized because of paravirt. Hypervisors need to know when
>GDT is changed, so they remap it read-only and handle write faults.
>If it's not in its own page, other writes nearby will cause
>those faults too.
>
>In other words, we need GDT to live in a separate page
>only if CONFIG_HYPERVISOR_GUEST=y.
>
>This patch reduces GDT alignment to cacheline-aligned
>if CONFIG_HYPERVISOR_GUEST is not set.
>
>Patch also renames gdt_page to cpu_gdt (mimicking naming of existing
>cpu_tss per-cpu variable), since now it is not always a full page.
>
>$ objdump -x vmlinux | grep .data..percpu | sort
>Before:
>(offset)(size)
>  wO .data..percpu  4000
>irq_stack_union
>4000  wO .data..percpu  5000
>exception_stacks
>9000  wO .data..percpu  1000 gdt_page  <<<
>HERE
>  a000  wO .data..percpu  0008 espfix_waddr
>  a008  wO .data..percpu  0008 espfix_stack
>...
> 00019398 g   .data..percpu   __per_cpu_end
>After:
>  wO .data..percpu  4000
>irq_stack_union
>4000  wO .data..percpu  5000
>exception_stacks
>  9000  wO .data..percpu  0008 espfix_waddr
>  9008  wO .data..percpu  0008 espfix_stack
>...
>00013c80  wO .data..percpu  0040 cyc2ns
>00013cc0  wO .data..percpu  22c0 cpu_tss
>00015f80  wO .data..percpu  0080 cpu_gdt  <<<
>HERE
>  00016000  wO .data..percpu  0018 cpu_tlbstate
>...
> 00018418 g   .data..percpu   __per_cpu_end
>
>Run-tested on a 144 CPU machine.
>
>Signed-off-by: Denys Vlasenko 
>CC: Ingo Molnar 
>CC: H. Peter Anvin 
>CC: Konrad Rzeszutek Wilk 
>CC: Boris Ostrovsky 
>CC: David Vrabel 
>CC: Joerg Roedel 
>CC: Gleb Natapov 
>CC: Paolo Bonzini 
>CC: kvm@vger.kernel.org
>CC: x...@kernel.org
>CC: linux-ker...@vger.kernel.org
>---
> arch/x86/entry/entry_32.S|  2 +-
> arch/x86/include/asm/desc.h  | 16 +++-
> arch/x86/kernel/cpu/common.c | 10 --
> arch/x86/kernel/cpu/perf_event.c |  2 +-
> arch/x86/kernel/head_32.S|  4 ++--
> arch/x86/kernel/head_64.S|  2 +-
> arch/x86/kernel/vmlinux.lds.S|  2 +-
> arch/x86/tools/relocs.c  |  2 +-
> arch/x86/xen/enlighten.c |  4 ++--
> 9 files changed, 28 insertions(+), 16 deletions(-)
>
>diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
>index b2909bf..bc6ae1c 100644
>--- a/arch/x86/entry/entry_32.S
>+++ b/arch/x86/entry/entry_32.S
>@@ -429,7 +429,7 @@ ldt_ss:
>  * compensating for the offset by changing to the ESPFIX segment with
>  * a base address that matches for the difference.
>  */
>-#define GDT_ESPFIX_SS PER_CPU_VAR(gdt_page) + (GDT_ENTRY_ESPFIX_SS *
>8)
>+#define GDT_ESPFIX_SS PER_CPU_VAR(cpu_gdt) + (GDT_ENTRY_ESPFIX_SS * 8)
>   mov %esp, %edx  /* load kernel esp */
>   mov PT_OLDESP(%esp), %eax   /* load userspace esp */
>   mov %dx, %ax/* eax: new kernel esp */
>diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
>index 4e10d73..76de300 100644
>--- a/arch/x86/include/asm/desc.h
>+++ b/arch/x86/include/asm/desc.h
>@@ -39,15 +39,21 @@ extern gate_desc idt_table[];
> extern struct desc_ptr debug_idt_descr;
> extern gate_desc debug_idt_table[];
> 
>-struct gdt_page {
>+struct cpu_gdt {
>   struct desc_struct gdt[GDT_ENTRIES];
>-} __attribute__((aligned(PAGE_SIZE)));
>-
>-DECLARE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page);
>+}
>+#ifdef CONFIG_HYPERVISOR_GUEST
>+/* Xen et al want GDT to have its own page. They remap it read-only */
>+__attribute__((aligned(PAGE_SIZE)));
>+DECLARE_PER_CPU_PAGE_ALIGNED(struct cpu_gdt, cpu_gdt);
>+#else
>+cacheline_aligned;
>+DECLARE_PER_CPU_ALIGNED(struct cpu_gdt, cpu_gdt);
>+#endif
> 
> static inline struct desc_struct *get_cpu_gdt_table(unsigned int cpu)
> {
>-  return per_cpu(gdt_page, cpu).gdt;
>+  return per_cpu(cpu_gdt, cpu).gdt;
> }
> 
> #ifdef CONFIG_X86_64
>diff --git a/arch/x86/kernel/cpu/common.c
>b/arch/x86/kernel/cpu/common.c
>index de22ea7..6b90785 100644
>--- a/arch/x86/kernel/cpu/common.c
>+++ b/arch/x86/kernel/cpu/common.c
>@@ -92,7 +92,13 @@ static const struct cpu_dev default_cpu = {
> 
> static const struct cpu_dev *this_cpu = &default_cpu;
> 
>-DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
>+#ifdef CONFIG_HYP