Re: [PATCH] maintainers: drop Chris Wright from pvops

2017-10-26 Thread Rusty Russell
Chris CC'd: He wasn't that hard to find.

(LinkedIn says he's CTO of Red Hat now.  I feel like an underachiever!)

Cheers,
Rusty.

Juergen Gross  writes:

> Mails to chr...@sous-sol.org are not deliverable since several months.
> Drop him as PARAVIRT_OPS maintainer.
>
> Signed-off-by: Juergen Gross 
> ---
>  MAINTAINERS | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index d85c08956875..af0cb69f6a3e 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -10179,7 +10179,6 @@ F:Documentation/parport*.txt
>  
>  PARAVIRT_OPS INTERFACE
>  M:   Juergen Gross 
> -M:   Chris Wright 
>  M:   Alok Kataria 
>  M:   Rusty Russell 
>  L:   virtualization@lists.linux-foundation.org
> -- 
> 2.12.3


Re: [RFC] virtio-iommu version 0.5

2017-10-26 Thread Linu Cherian
Hi Jean,

On Wed Oct 25, 2017 at 10:07:53AM +0100, Jean-Philippe Brucker wrote:
> On 25/10/17 08:07, Linu Cherian wrote:
> > Hi Jean,
> > 
> > On Tue Oct 24, 2017 at 10:28:59PM +0530, Linu Cherian wrote:
> >> Hi Jean,
> >> Thanks for your reply.
> >>
> >> On Tue Oct 24, 2017 at 09:37:12AM +0100, Jean-Philippe Brucker wrote:
> >>> Hi Linu,
> >>>
> >>> On 24/10/17 07:27, Linu Cherian wrote:
>  Hi Jean,
> 
>  On Mon Oct 23, 2017 at 10:32:41AM +0100, Jean-Philippe Brucker wrote:
> > This is version 0.5 of the virtio-iommu specification, the 
> > paravirtualized
> > IOMMU. This version addresses feedback from v0.4 and adds an event 
> > virtqueue.
> > Please find the specification, LaTeX sources and pdf, at:
> > git://linux-arm.org/virtio-iommu.git viommu/v0.5
> > http://linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.5/virtio-iommu-v0.5.pdf
> >
> > A detailed changelog since v0.4 follows. You can find the pdf diff at:
> > http://linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/diffs/virtio-iommu-pdf-diff-v0.4-v0.5.pdf
> >
> > * Add an event virtqueue for the device to report translation faults to
> >   the driver. For the moment only unrecoverable faults are available but
> >   future versions will extend it.
> > * Simplify PROBE request by removing the ack part, and flattening RESV
> >   properties.
> > * Rename "address space" to "domain". The change might seem futile but
> >   allows to introduce PASIDs and other features cleanly in the next
> >   versions. In the same vein, the few remaining "device" occurrences 
> > were
> >   replaced by "endpoint", to avoid any confusion with "the device"
> >   referring to the virtio device across the document.
> > * Add implementation notes for RESV_MEM properties.
> > * Update ACPI table definition.
> > * Fix typos and clarify a few things.
> >
> > I will publish the Linux driver for v0.5 shortly. Then for next versions
> > I'll focus on optimizations and adding support for hardware 
> > acceleration.
> >
> > Existing implementations are simple and can certainly be optimized, even
> > without architectural changes. But the architecture itself can also be
> > improved in a number of ways. Currently it is designed to work well with
> > VFIO. However, having explicit MAP requests is less efficient* than page
> > tables for emulated and PV endpoints, and the current architecture 
> > doesn't
> > address this. Binding page tables is an obvious way to improve 
> > throughput
> > in that case, but we can explore cleverer (and possibly simpler) ways to
> > do it.
> >
> > So first we'll work on getting the base device and driver merged, then
> > we'll analyze and compare several ideas for improving performance.
> >
> > Thanks,
> > Jean
> >
> > * I have yet to study this behaviour, and would be interested in any
> > prior art on the subject of analyzing devices DMA patterns (virtio and
> > others)
> 
> 
>  From the spec,
>  Under future extensions.
> 
>  "Page Table Handover, to allow guests to manage their own page tables 
>  and share them with the MMU"
> 
>  Had few questions on this.
> 
>  1. Did you mean SVM support for vfio-pci devices attached to guest 
>  processes here.
> >>>
> >>> Yes, using the VFIO BIND and INVALIDATE ioctls that Intel is working on,
> >>> and adding requests in pretty much the same format to virtio-iommu.
> >>>
>  2. Can you give some hints on how this is going to work , since 
>  virtio-iommu guest kernel 
> driver need to create stage 1 page table as required by hardware 
>  which is not the case now. 
> CMIIW. 
> >>>
> >>> The virtio-iommu device advertises which PASID/page table format is
> >>> supported by the host (obtained via sysfs and communicated in the PROBE
> >>> request), then the guest binds page tables or PASID tables to a domain and
> >>> populates it. Binding page tables alone is easy because we already have
> >>> the required drivers in the guest (io-pgtable or arch/* for SVM) and code
> >>> in the host to manage PASID tables. But since the PASID table pointer is
> >>> translated by stage-2, it would requires a little more work in the host
> >>> for obtaining GPA buffers from the guest on demand.
> >>   Is this for resolving PCI PRI requests ?. 
> >>   IIUC, PCI PRI requests for devices owned by guest need to be resolved
> >>   by guest itself.
> 
> Supporting PCI PRI is a separate problem, that will be implemented by
> extending the event queue proposed in v0.5. Once the guest bound the PASID
> table and created the page tables, it will start some DMA job in the
> device. If a page isn't mapped, the pIOMMU sends a PRI Request (a page
> fault) to its driver, which is relayed to userspace by VFIO, then to the
> guest via virtio-iommu. The 

Re: [RFC] virtio-iommu version 0.5

2017-10-26 Thread Linu Cherian
Hi Jean,
Thanks for your reply.

On Tue Oct 24, 2017 at 09:37:12AM +0100, Jean-Philippe Brucker wrote:
> Hi Linu,
> 
> On 24/10/17 07:27, Linu Cherian wrote:
> > Hi Jean,
> > 
> > On Mon Oct 23, 2017 at 10:32:41AM +0100, Jean-Philippe Brucker wrote:
> >> This is version 0.5 of the virtio-iommu specification, the paravirtualized
> >> IOMMU. This version addresses feedback from v0.4 and adds an event 
> >> virtqueue.
> >> Please find the specification, LaTeX sources and pdf, at:
> >> git://linux-arm.org/virtio-iommu.git viommu/v0.5
> >> http://linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.5/virtio-iommu-v0.5.pdf
> >>
> >> A detailed changelog since v0.4 follows. You can find the pdf diff at:
> >> http://linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/diffs/virtio-iommu-pdf-diff-v0.4-v0.5.pdf
> >>
> >> * Add an event virtqueue for the device to report translation faults to
> >>   the driver. For the moment only unrecoverable faults are available but
> >>   future versions will extend it.
> >> * Simplify PROBE request by removing the ack part, and flattening RESV
> >>   properties.
> >> * Rename "address space" to "domain". The change might seem futile but
> >>   allows to introduce PASIDs and other features cleanly in the next
> >>   versions. In the same vein, the few remaining "device" occurrences were
> >>   replaced by "endpoint", to avoid any confusion with "the device"
> >>   referring to the virtio device across the document.
> >> * Add implementation notes for RESV_MEM properties.
> >> * Update ACPI table definition.
> >> * Fix typos and clarify a few things.
> >>
> >> I will publish the Linux driver for v0.5 shortly. Then for next versions
> >> I'll focus on optimizations and adding support for hardware acceleration.
> >>
> >> Existing implementations are simple and can certainly be optimized, even
> >> without architectural changes. But the architecture itself can also be
> >> improved in a number of ways. Currently it is designed to work well with
> >> VFIO. However, having explicit MAP requests is less efficient* than page
> >> tables for emulated and PV endpoints, and the current architecture doesn't
> >> address this. Binding page tables is an obvious way to improve throughput
> >> in that case, but we can explore cleverer (and possibly simpler) ways to
> >> do it.
> >>
> >> So first we'll work on getting the base device and driver merged, then
> >> we'll analyze and compare several ideas for improving performance.
> >>
> >> Thanks,
> >> Jean
> >>
> >> * I have yet to study this behaviour, and would be interested in any
> >> prior art on the subject of analyzing devices DMA patterns (virtio and
> >> others)
> > 
> > 
> > From the spec,
> > Under future extensions.
> > 
> > "Page Table Handover, to allow guests to manage their own page tables and 
> > share them with the MMU"
> > 
> > Had few questions on this.
> > 
> > 1. Did you mean SVM support for vfio-pci devices attached to guest 
> > processes here.
> 
> Yes, using the VFIO BIND and INVALIDATE ioctls that Intel is working on,
> and adding requests in pretty much the same format to virtio-iommu.
> 
> > 2. Can you give some hints on how this is going to work , since 
> > virtio-iommu guest kernel 
> >driver need to create stage 1 page table as required by hardware which 
> > is not the case now. 
> >CMIIW. 
> 
> The virtio-iommu device advertises which PASID/page table format is
> supported by the host (obtained via sysfs and communicated in the PROBE
> request), then the guest binds page tables or PASID tables to a domain and
> populates it. Binding page tables alone is easy because we already have
> the required drivers in the guest (io-pgtable or arch/* for SVM) and code
> in the host to manage PASID tables. But since the PASID table pointer is
> translated by stage-2, it would requires a little more work in the host
> for obtaining GPA buffers from the guest on demand.
  Is this for resolving PCI PRI requests?
  IIUC, PCI PRI requests for devices owned by the guest need to be resolved
  by the guest itself.


>  In addition the BIND
> ioctl is different from the one used by VT-d, so this solution didn't get
> much appreciation.

Could you please share the links on this?

> 
> The alternative is to bind PASID tables. 

Sorry, I didn't get the difference here.

> It requires to factor the guest
> PASID handling code into a library, which is difficult for SMMU. Luckily
> I'm still working on adding PASID code for SMMUv3, so extracting it out of
> the driver isn't a big overhead. The good thing about this solution is
> that it reuses any specification work done for VFIO (and vice versa) and
> any host driver change made for vSMMU/VT-d emulations.
> 
> Thanks,
> Jean

-- 
Linu cherian
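
For illustration only, a sketch of what a "bind PASID table" style request
could look like; this structure is purely hypothetical and is NOT part of the
virtio-iommu v0.5 specification, which only lists page table handover as a
future extension:

#include <stdint.h>

/* Hypothetical request: hand the device the guest-physical address of a
 * PASID table, in a format previously advertised through PROBE, and attach
 * it to a domain. All field names are illustrative. */
struct viommu_bind_pasid_table_sketch {
	uint32_t domain;	/* domain the table is bound to */
	uint32_t format;	/* table format advertised by the device */
	uint64_t table_gpa;	/* guest-physical address of the PASID table */
};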


Re: [PATCH] maintainers: drop Chris Wright from pvops

2017-10-26 Thread Chris Wright
(resend w/out html damage that triggers lkml reject)

On Thu, Oct 26, 2017 at 3:17 PM, Rusty Russell  wrote:
> Chris CC'd: He wasn't that hard to find.
>
> (linkedin says he's CTO of RedHat now.  I feel like an underachiever!)
>
> Cheers,
> Rusty.
>
> Juergen Gross  writes:
>
>> Mails to chr...@sous-sol.org are not deliverable since several months.
>> Drop him as PARAVIRT_OPS maintainer.
>>
>> Signed-off-by: Juergen Gross 

Acked-by: Chris Wright 

;)

thanks,
-chris

>> ---
>>  MAINTAINERS | 1 -
>>  1 file changed, 1 deletion(-)
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index d85c08956875..af0cb69f6a3e 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -10179,7 +10179,6 @@ F:Documentation/parport*.txt
>>
>>  PARAVIRT_OPS INTERFACE
>>  M:   Juergen Gross 
>> -M:   Chris Wright 
>>  M:   Alok Kataria 
>>  M:   Rusty Russell 
>>  L:   virtualization@lists.linux-foundation.org
>> --
>> 2.12.3


Re: [RFC] virtio-iommu version 0.5

2017-10-26 Thread Linu Cherian
Hi Jean,

On Tue Oct 24, 2017 at 10:28:59PM +0530, Linu Cherian wrote:
> Hi Jean,
> Thanks for your reply.
> 
> On Tue Oct 24, 2017 at 09:37:12AM +0100, Jean-Philippe Brucker wrote:
> > Hi Linu,
> > 
> > On 24/10/17 07:27, Linu Cherian wrote:
> > > Hi Jean,
> > > 
> > > On Mon Oct 23, 2017 at 10:32:41AM +0100, Jean-Philippe Brucker wrote:
> > >> This is version 0.5 of the virtio-iommu specification, the 
> > >> paravirtualized
> > >> IOMMU. This version addresses feedback from v0.4 and adds an event 
> > >> virtqueue.
> > >> Please find the specification, LaTeX sources and pdf, at:
> > >> git://linux-arm.org/virtio-iommu.git viommu/v0.5
> > >> http://linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.5/virtio-iommu-v0.5.pdf
> > >>
> > >> A detailed changelog since v0.4 follows. You can find the pdf diff at:
> > >> http://linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/diffs/virtio-iommu-pdf-diff-v0.4-v0.5.pdf
> > >>
> > >> * Add an event virtqueue for the device to report translation faults to
> > >>   the driver. For the moment only unrecoverable faults are available but
> > >>   future versions will extend it.
> > >> * Simplify PROBE request by removing the ack part, and flattening RESV
> > >>   properties.
> > >> * Rename "address space" to "domain". The change might seem futile but
> > >>   allows to introduce PASIDs and other features cleanly in the next
> > >>   versions. In the same vein, the few remaining "device" occurrences were
> > >>   replaced by "endpoint", to avoid any confusion with "the device"
> > >>   referring to the virtio device across the document.
> > >> * Add implementation notes for RESV_MEM properties.
> > >> * Update ACPI table definition.
> > >> * Fix typos and clarify a few things.
> > >>
> > >> I will publish the Linux driver for v0.5 shortly. Then for next versions
> > >> I'll focus on optimizations and adding support for hardware acceleration.
> > >>
> > >> Existing implementations are simple and can certainly be optimized, even
> > >> without architectural changes. But the architecture itself can also be
> > >> improved in a number of ways. Currently it is designed to work well with
> > >> VFIO. However, having explicit MAP requests is less efficient* than page
> > >> tables for emulated and PV endpoints, and the current architecture 
> > >> doesn't
> > >> address this. Binding page tables is an obvious way to improve throughput
> > >> in that case, but we can explore cleverer (and possibly simpler) ways to
> > >> do it.
> > >>
> > >> So first we'll work on getting the base device and driver merged, then
> > >> we'll analyze and compare several ideas for improving performance.
> > >>
> > >> Thanks,
> > >> Jean
> > >>
> > >> * I have yet to study this behaviour, and would be interested in any
> > >> prior art on the subject of analyzing devices DMA patterns (virtio and
> > >> others)
> > > 
> > > 
> > > From the spec,
> > > Under future extensions.
> > > 
> > > "Page Table Handover, to allow guests to manage their own page tables and 
> > > share them with the MMU"
> > > 
> > > Had few questions on this.
> > > 
> > > 1. Did you mean SVM support for vfio-pci devices attached to guest 
> > > processes here.
> > 
> > Yes, using the VFIO BIND and INVALIDATE ioctls that Intel is working on,
> > and adding requests in pretty much the same format to virtio-iommu.
> > 
> > > 2. Can you give some hints on how this is going to work , since 
> > > virtio-iommu guest kernel 
> > >driver need to create stage 1 page table as required by hardware which 
> > > is not the case now. 
> > >CMIIW. 
> > 
> > The virtio-iommu device advertises which PASID/page table format is
> > supported by the host (obtained via sysfs and communicated in the PROBE
> > request), then the guest binds page tables or PASID tables to a domain and
> > populates it. Binding page tables alone is easy because we already have
> > the required drivers in the guest (io-pgtable or arch/* for SVM) and code
> > in the host to manage PASID tables. But since the PASID table pointer is
> > translated by stage-2, it would requires a little more work in the host
> > for obtaining GPA buffers from the guest on demand.
>   Is this for resolving PCI PRI requests ?. 
>   IIUC, PCI PRI requests for devices owned by guest need to be resolved
>   by guest itself.
> 
> 
>  In addition the BIND
> > ioctl is different from the one used by VT-d, so this solution didn't get
> > much appreciation.
> 
> Could you please share the links on this ?
> 
> > 
> > The alternative is to bind PASID tables. 
> 
> Sorry, i didnt get the difference here.
>

Also, does this solution intend to cover page table sharing for non-SVM
cases? For example, if we need to share the IOMMU page table for
a device used in the guest kernel, so that map/unmap gets handled directly by
the guest and only TLB invalidates happen through a virtio-iommu channel.
 
> It requires to factor the guest
> > PASID handling code into a library, which is difficult for 

Re: [PATCH] virtio/ringtest: fix up need_event math

2017-10-26 Thread Cornelia Huck
On Thu, 26 Oct 2017 04:48:01 +0300
"Michael S. Tsirkin"  wrote:

> last kicked event index must be updated unconditionally:
> even if we don't need to kick, we do not want to re-check
> the same entry for events.
> 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  tools/virtio/ringtest/ring.c | 24 +++-
>  1 file changed, 15 insertions(+), 9 deletions(-)

Acked-by: Cornelia Huck 

I think virtio_ring_0_9 has the same issue?
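
For reference, a minimal C sketch of the event-index check this patch is
about, modelled on vring_need_event() from include/uapi/linux/virtio_ring.h
(the helper name and header are assumptions; the ringtest code may open-code
the comparison differently):

#include <stdint.h>

static inline int need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old_idx)
{
	/* Kick only if event_idx falls in the window (old_idx, new_idx]. */
	return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}

/* Whatever the result, the caller must record new_idx as the last checked
 * index; otherwise the same entries get re-checked on the next pass, which
 * is exactly what the patch fixes. */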


Re: [PATCH v1 15/27] compiler: Option to default to hidden symbols

2017-10-26 Thread Luis R. Rodriguez
On Wed, Oct 18, 2017 at 04:15:10PM -0700, Thomas Garnier wrote:
> On Thu, Oct 12, 2017 at 1:02 PM, Luis R. Rodriguez  wrote:
> > On Wed, Oct 11, 2017 at 01:30:15PM -0700, Thomas Garnier wrote:
> >> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> >> index e95a2631e545..6997716f73bf 100644
> >> --- a/include/linux/compiler.h
> >> +++ b/include/linux/compiler.h
> >> @@ -78,6 +78,14 @@ extern void __chk_io_ptr(const volatile void __iomem *);
> >>  #include 
> >>  #endif
> >>
> >> +/* Useful for Position Independent Code to reduce global references */
> >> +#ifdef CONFIG_DEFAULT_HIDDEN
> >> +#pragma GCC visibility push(hidden)
> >> +#define __default_visibility  __attribute__((visibility ("default")))
> >
> > Does this still work with CONFIG_LD_DEAD_CODE_DATA_ELIMINATION ?
> 
> I cannot make it work with or without this change. How is it supposed
> to be used?

Sadly I don't think much documentation was really added as part of Nick's
commits about the feature, even though commit b67067f1176 ("kbuild: allow archs to
select link dead code/data elimination") *does* say this was documented.

Side rant: the whole CONFIG_LTO removal was merged in the same commit rather
than going in as a separate atomic patch.

Nick, can you provide a bit more guidance about how to get this feature going or
tested on an architecture? Or are you just assuming that folks using the
linker / compiler flags will know what to do? *Some* guidance could help.

> For me with, it crashes with a bad consdev at:
> http://elixir.free-electrons.com/linux/latest/source/drivers/tty/tty_io.c#L3194

From my reading of the commit log he only tested it with powerpc64le;
every other architecture would have to do work to get as far as even booting.

It would then require someone to test Nick's patches against a working
powerpc setup to ensure we don't regress there.

> >> diff --git a/init/Kconfig b/init/Kconfig
> >> index ccb1d8daf241..b640201fcff7 100644
> >> --- a/init/Kconfig
> >> +++ b/init/Kconfig
> >> @@ -1649,6 +1649,13 @@ config PROFILING
> >>  config TRACEPOINTS
> >>   bool
> >>
> >> +#
> >> +# Default to hidden visibility for all symbols.
> >> +# Useful for Position Independent Code to reduce global references.
> >> +#
> >> +config DEFAULT_HIDDEN
> >> + bool
> >
> > Note it is default.
> >
> > Has 0-day ran through this git tree? It should be easy to get it added for
> > testing. Also, even though most changes are x86 based there are some generic
> > changes and I'd love a warm fuzzy this won't break odd / random builds.
> > Although 0-day does cover a lot of test cases, it only has limited run time
> > tests. There are some other test beds which also cover some more obscure
> > architectures. Having a test pass on Guenter's test bed would be nice to
> > see. For that please coordinate with Guenter if he's willing to run this
> > a test for you.
> 
> Not yet, plan to give a v1.5 to Kees Cook to keep in one of his tree
> for couple weeks. I expect it will identify interesting issues.

I bet :)

  Luis


Re: [PATCH v1 06/27] x86/entry/64: Adapt assembly for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
On Fri, Oct 20, 2017 at 1:26 AM, Ingo Molnar  wrote:
>
> * Thomas Garnier  wrote:
>
>> Change the assembly code to use only relative references of symbols for the
>> kernel to be PIE compatible.
>>
>> Position Independent Executable (PIE) support will allow to extended the
>> KASLR randomization range below the -2G memory limit.
>>
>> Signed-off-by: Thomas Garnier 
>> ---
>>  arch/x86/entry/entry_64.S | 22 +++---
>>  1 file changed, 15 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>> index 49167258d587..15bd5530d2ae 100644
>> --- a/arch/x86/entry/entry_64.S
>> +++ b/arch/x86/entry/entry_64.S
>> @@ -194,12 +194,15 @@ entry_SYSCALL_64_fastpath:
>>   ja  1f  /* return -ENOSYS (already in 
>> pt_regs->ax) */
>>   movq%r10, %rcx
>>
>> + /* Ensures the call is position independent */
>> + leaqsys_call_table(%rip), %r11
>> +
>>   /*
>>* This call instruction is handled specially in stub_ptregs_64.
>>* It might end up jumping to the slow path.  If it jumps, RAX
>>* and all argument registers are clobbered.
>>*/
>> - call*sys_call_table(, %rax, 8)
>> + call*(%r11, %rax, 8)
>>  .Lentry_SYSCALL_64_after_fastpath_call:
>>
>>   movq%rax, RAX(%rsp)
>> @@ -334,7 +337,8 @@ ENTRY(stub_ptregs_64)
>>* RAX stores a pointer to the C function implementing the syscall.
>>* IRQs are on.
>>*/
>> - cmpq$.Lentry_SYSCALL_64_after_fastpath_call, (%rsp)
>> + leaq.Lentry_SYSCALL_64_after_fastpath_call(%rip), %r11
>> + cmpq%r11, (%rsp)
>>   jne 1f
>>
>>   /*
>> @@ -1172,7 +1176,8 @@ ENTRY(error_entry)
>>   movl%ecx, %eax  /* zero extend */
>>   cmpq%rax, RIP+8(%rsp)
>>   je  .Lbstep_iret
>> - cmpq$.Lgs_change, RIP+8(%rsp)
>> + leaq.Lgs_change(%rip), %rcx
>> + cmpq%rcx, RIP+8(%rsp)
>>   jne .Lerror_entry_done
>>
>>   /*
>> @@ -1383,10 +1388,10 @@ ENTRY(nmi)
>>* resume the outer NMI.
>>*/
>>
>> - movq$repeat_nmi, %rdx
>> + leaqrepeat_nmi(%rip), %rdx
>>   cmpq8(%rsp), %rdx
>>   ja  1f
>> - movq$end_repeat_nmi, %rdx
>> + leaqend_repeat_nmi(%rip), %rdx
>>   cmpq8(%rsp), %rdx
>>   ja  nested_nmi_out
>>  1:
>> @@ -1440,7 +1445,8 @@ nested_nmi:
>>   pushq   %rdx
>>   pushfq
>>   pushq   $__KERNEL_CS
>> - pushq   $repeat_nmi
>> + leaqrepeat_nmi(%rip), %rdx
>> + pushq   %rdx
>>
>>   /* Put stack back */
>>   addq$(6*8), %rsp
>> @@ -1479,7 +1485,9 @@ first_nmi:
>>   addq$8, (%rsp)  /* Fix up RSP */
>>   pushfq  /* RFLAGS */
>>   pushq   $__KERNEL_CS/* CS */
>> - pushq   $1f /* RIP */
>> + pushq   %rax/* Support Position Independent Code */
>> + leaq1f(%rip), %rax  /* RIP */
>> + xchgq   %rax, (%rsp)/* Restore RAX, put 1f */
>>   INTERRUPT_RETURN/* continues at repeat_nmi below */
>>   UNWIND_HINT_IRET_REGS
>
> This patch seems to add extra overhead to the syscall fast-path even when PIE 
> is
> disabled, right?

It does add extra instructions where a single one is not possible; I preferred
that over ifdefing, but I can change it.

>
> Thanks,
>
> Ingo



-- 
Thomas


Re: [RFC] virtio-iommu version 0.5

2017-10-26 Thread Linu Cherian
Hi Jean,

On Mon Oct 23, 2017 at 10:32:41AM +0100, Jean-Philippe Brucker wrote:
> This is version 0.5 of the virtio-iommu specification, the paravirtualized
> IOMMU. This version addresses feedback from v0.4 and adds an event virtqueue.
> Please find the specification, LaTeX sources and pdf, at:
> git://linux-arm.org/virtio-iommu.git viommu/v0.5
> http://linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.5/virtio-iommu-v0.5.pdf
> 
> A detailed changelog since v0.4 follows. You can find the pdf diff at:
> http://linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/diffs/virtio-iommu-pdf-diff-v0.4-v0.5.pdf
> 
> * Add an event virtqueue for the device to report translation faults to
>   the driver. For the moment only unrecoverable faults are available but
>   future versions will extend it.
> * Simplify PROBE request by removing the ack part, and flattening RESV
>   properties.
> * Rename "address space" to "domain". The change might seem futile but
>   allows to introduce PASIDs and other features cleanly in the next
>   versions. In the same vein, the few remaining "device" occurrences were
>   replaced by "endpoint", to avoid any confusion with "the device"
>   referring to the virtio device across the document.
> * Add implementation notes for RESV_MEM properties.
> * Update ACPI table definition.
> * Fix typos and clarify a few things.
> 
> I will publish the Linux driver for v0.5 shortly. Then for next versions
> I'll focus on optimizations and adding support for hardware acceleration.
> 
> Existing implementations are simple and can certainly be optimized, even
> without architectural changes. But the architecture itself can also be
> improved in a number of ways. Currently it is designed to work well with
> VFIO. However, having explicit MAP requests is less efficient* than page
> tables for emulated and PV endpoints, and the current architecture doesn't
> address this. Binding page tables is an obvious way to improve throughput
> in that case, but we can explore cleverer (and possibly simpler) ways to
> do it.
> 
> So first we'll work on getting the base device and driver merged, then
> we'll analyze and compare several ideas for improving performance.
> 
> Thanks,
> Jean
> 
> * I have yet to study this behaviour, and would be interested in any
> prior art on the subject of analyzing devices DMA patterns (virtio and
> others)


From the spec, under Future Extensions:

"Page Table Handover, to allow guests to manage their own page tables and share
them with the MMU"

I had a few questions on this.

1. Did you mean SVM support for vfio-pci devices attached to guest processes
   here?

2. Can you give some hints on how this is going to work, since the virtio-iommu
   guest kernel driver would need to create the stage 1 page tables required by
   the hardware, which is not the case now. CMIIW.


-- 
Linu cherian


Re: [PATCH v1 01/27] x86/crypto: Adapt assembly for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
On Fri, Oct 20, 2017 at 1:28 AM, Ard Biesheuvel
 wrote:
> On 20 October 2017 at 09:24, Ingo Molnar  wrote:
>>
>> * Thomas Garnier  wrote:
>>
>>> Change the assembly code to use only relative references of symbols for the
>>> kernel to be PIE compatible.
>>>
>>> Position Independent Executable (PIE) support will allow to extended the
>>> KASLR randomization range below the -2G memory limit.
>>
>>> diff --git a/arch/x86/crypto/aes-x86_64-asm_64.S 
>>> b/arch/x86/crypto/aes-x86_64-asm_64.S
>>> index 8739cf7795de..86fa068e5e81 100644
>>> --- a/arch/x86/crypto/aes-x86_64-asm_64.S
>>> +++ b/arch/x86/crypto/aes-x86_64-asm_64.S
>>> @@ -48,8 +48,12 @@
>>>  #define R10  %r10
>>>  #define R11  %r11
>>>
>>> +/* Hold global for PIE suport */
>>> +#define RBASE%r12
>>> +
>>>  #define prologue(FUNC,KEY,B128,B192,r1,r2,r5,r6,r7,r8,r9,r10,r11) \
>>>   ENTRY(FUNC);\
>>> + pushq   RBASE;  \
>>>   movqr1,r2;  \
>>>   leaqKEY+48(r8),r9;  \
>>>   movqr10,r11;\
>>> @@ -74,54 +78,63 @@
>>>   movlr6 ## E,4(r9);  \
>>>   movlr7 ## E,8(r9);  \
>>>   movlr8 ## E,12(r9); \
>>> + popqRBASE;  \
>>>   ret;\
>>>   ENDPROC(FUNC);
>>>
>>> +#define round_mov(tab_off, reg_i, reg_o) \
>>> + leaqtab_off(%rip), RBASE; \
>>> + movl(RBASE,reg_i,4), reg_o;
>>> +
>>> +#define round_xor(tab_off, reg_i, reg_o) \
>>> + leaqtab_off(%rip), RBASE; \
>>> + xorl(RBASE,reg_i,4), reg_o;
>>> +
>>>  #define round(TAB,OFFSET,r1,r2,r3,r4,r5,r6,r7,r8,ra,rb,rc,rd) \
>>>   movzbl  r2 ## H,r5 ## E;\
>>>   movzbl  r2 ## L,r6 ## E;\
>>> - movlTAB+1024(,r5,4),r5 ## E;\
>>> + round_mov(TAB+1024, r5, r5 ## E)\
>>>   movwr4 ## X,r2 ## X;\
>>> - movlTAB(,r6,4),r6 ## E; \
>>> + round_mov(TAB, r6, r6 ## E) \
>>>   roll$16,r2 ## E;\
>>>   shrl$16,r4 ## E;\
>>>   movzbl  r4 ## L,r7 ## E;\
>>>   movzbl  r4 ## H,r4 ## E;\
>>>   xorlOFFSET(r8),ra ## E; \
>>>   xorlOFFSET+4(r8),rb ## E;   \
>>> - xorlTAB+3072(,r4,4),r5 ## E;\
>>> - xorlTAB+2048(,r7,4),r6 ## E;\
>>> + round_xor(TAB+3072, r4, r5 ## E)\
>>> + round_xor(TAB+2048, r7, r6 ## E)\
>>>   movzbl  r1 ## L,r7 ## E;\
>>>   movzbl  r1 ## H,r4 ## E;\
>>> - movlTAB+1024(,r4,4),r4 ## E;\
>>> + round_mov(TAB+1024, r4, r4 ## E)\
>>>   movwr3 ## X,r1 ## X;\
>>>   roll$16,r1 ## E;\
>>>   shrl$16,r3 ## E;\
>>> - xorlTAB(,r7,4),r5 ## E; \
>>> + round_xor(TAB, r7, r5 ## E) \
>>>   movzbl  r3 ## L,r7 ## E;\
>>>   movzbl  r3 ## H,r3 ## E;\
>>> - xorlTAB+3072(,r3,4),r4 ## E;\
>>> - xorlTAB+2048(,r7,4),r5 ## E;\
>>> + round_xor(TAB+3072, r3, r4 ## E)\
>>> + round_xor(TAB+2048, r7, r5 ## E)\
>>>   movzbl  r1 ## L,r7 ## E;\
>>>   movzbl  r1 ## H,r3 ## E;\
>>>   shrl$16,r1 ## E;\
>>> - xorlTAB+3072(,r3,4),r6 ## E;\
>>> - movlTAB+2048(,r7,4),r3 ## E;\
>>> + round_xor(TAB+3072, r3, r6 ## E)\
>>> + round_mov(TAB+2048, r7, r3 ## E)\
>>>   movzbl  r1 ## L,r7 ## E;\
>>>   movzbl  r1 ## H,r1 ## E;\
>>> - xorlTAB+1024(,r1,4),r6 ## E;\
>>> - xorlTAB(,r7,4),r3 ## E; \
>>> + round_xor(TAB+1024, r1, r6 ## E)\
>>> + round_xor(TAB, r7, r3 ## E) \
>>>   movzbl  r2 ## H,r1 ## E;\
>>>   movzbl  r2 ## L,r7 ## E;\
>>>   shrl$16,r2 ## E;\
>>> - xorlTAB+3072(,r1,4),r3 ## E;\
>>> - xorlTAB+2048(,r7,4),r4 ## E;\
>>> + round_xor(TAB+3072, r1, r3 ## E)\
>>> + round_xor(TAB+2048, r7, r4 ## E)\
>>>   movzbl  r2 ## H,r1 ## E;\
>>>   movzbl  r2 ## L,r2 ## E;\
>>>   xorlOFFSET+8(r8),rc ## E;   \
>>>   xorlOFFSET+12(r8),rd ## E;  \
>>> - xorlTAB+1024(,r1,4),r3 ## E;\
>>> - xorlTAB(,r2,4),r4 ## E;
>>> + round_xor(TAB+1024, r1, r3 ## E)\
>>> + round_xor(TAB, r2, r4 ## E)
>>
>> This appears to be adding unconditional overhead to a function that was 
>> moved to
>> assembly to improve its performance.
>>

It adds a couple of extra instructions; how much overhead that creates is
hard for me to tell. It would increase the code complexity if
everything were wrapped in ifdefs.

>
> I did some benchmarking on this code a while ago and, interestingly,
> it was slower than the generic C implementation (on a Pentium E2200),
> so we may want to consider whether we still need this driver in the
> first place.

Interesting.

-- 
Thomas

Re: [PATCH v2 1/1] virtio_balloon: include buffers and cached memory statistics

2017-10-26 Thread Tomáš Golembiovský
On Thu, 19 Oct 2017 16:12:20 +0300
"Michael S. Tsirkin"  wrote:

> On Thu, Sep 21, 2017 at 02:55:41PM +0200, Tomáš Golembiovský wrote:
> > Add a new fields, VIRTIO_BALLOON_S_BUFFERS and VIRTIO_BALLOON_S_CACHED,
> > to virtio_balloon memory statistics protocol. The values correspond to
> > 'Buffers' and 'Cached' in /proc/meminfo.
> > 
> > To be able to compute the value of 'Cached' memory it is necessary to
> > export total_swapcache_pages() to modules.
> > 
> > Signed-off-by: Tomáš Golembiovský 
> 
> Does 'Buffers' actually make sense? It's a temporary storage -
> wouldn't it be significantly out of date by the time
> host receives it?

That would be best answered by somebody on the kernel side. But my personal
opinion is that it would not be out of date. The amount of memory
dedicated to Buffers does not seem to fluctuate too much.

Tomas


> > ---
> >  drivers/virtio/virtio_balloon.c | 11 +++
> >  include/uapi/linux/virtio_balloon.h |  4 +++-
> >  mm/swap_state.c |  1 +
> >  3 files changed, 15 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/virtio/virtio_balloon.c 
> > b/drivers/virtio/virtio_balloon.c
> > index f0b3a0b9d42f..c2558ec47a62 100644
> > --- a/drivers/virtio/virtio_balloon.c
> > +++ b/drivers/virtio/virtio_balloon.c
> > @@ -244,12 +244,19 @@ static unsigned int update_balloon_stats(struct 
> > virtio_balloon *vb)
> > struct sysinfo i;
> > unsigned int idx = 0;
> > long available;
> > +   long cached;
> >  
> > all_vm_events(events);
> > si_meminfo();
> >  
> > available = si_mem_available();
> >  
> > +   cached = global_node_page_state(NR_FILE_PAGES) -
> > +   total_swapcache_pages() - i.bufferram;
> > +   if (cached < 0)
> > +   cached = 0;
> > +
> > +
> >  #ifdef CONFIG_VM_EVENT_COUNTERS
> > update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_IN,
> > pages_to_bytes(events[PSWPIN]));
> > @@ -264,6 +271,10 @@ static unsigned int update_balloon_stats(struct 
> > virtio_balloon *vb)
> > pages_to_bytes(i.totalram));
> > update_stat(vb, idx++, VIRTIO_BALLOON_S_AVAIL,
> > pages_to_bytes(available));
> > +   update_stat(vb, idx++, VIRTIO_BALLOON_S_BUFFERS,
> > +   pages_to_bytes(i.bufferram));
> > +   update_stat(vb, idx++, VIRTIO_BALLOON_S_CACHED,
> > +   pages_to_bytes(cached));
> >  
> > return idx;
> >  }
> > diff --git a/include/uapi/linux/virtio_balloon.h 
> > b/include/uapi/linux/virtio_balloon.h
> > index 343d7ddefe04..d5dc8a56a497 100644
> > --- a/include/uapi/linux/virtio_balloon.h
> > +++ b/include/uapi/linux/virtio_balloon.h
> > @@ -52,7 +52,9 @@ struct virtio_balloon_config {
> >  #define VIRTIO_BALLOON_S_MEMFREE  4   /* Total amount of free memory */
> >  #define VIRTIO_BALLOON_S_MEMTOT   5   /* Total amount of memory */
> >  #define VIRTIO_BALLOON_S_AVAIL6   /* Available memory as in /proc */
> > -#define VIRTIO_BALLOON_S_NR   7
> > +#define VIRTIO_BALLOON_S_BUFFERS  7   /* Buffers memory as in /proc */
> > +#define VIRTIO_BALLOON_S_CACHED   8   /* Cached memory as in /proc */
> > +#define VIRTIO_BALLOON_S_NR   9
> >  
> >  /*
> >   * Memory statistics structure.
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index 71ce2d1ccbf7..f3a4ff7d6c52 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -95,6 +95,7 @@ unsigned long total_swapcache_pages(void)
> > rcu_read_unlock();
> > return ret;
> >  }
> > +EXPORT_SYMBOL_GPL(total_swapcache_pages);
> >  
> >  static atomic_t swapin_readahead_hits = ATOMIC_INIT(4);
> 
> Need an ack from MM crowd on that.
> 
> > -- 
> > 2.14.1


-- 
Tomáš Golembiovský 

Re: [PATCH v2 0/1] linux: Buffers/caches in VirtIO Balloon driver stats

2017-10-26 Thread Tomáš Golembiovský
On Thu, 5 Oct 2017 15:51:18 +0200
Tomáš Golembiovský  wrote:

> On Thu, 21 Sep 2017 14:55:40 +0200
> Tomáš Golembiovský  wrote:
> 
> > Linux driver part
> > 
> > v2:
> > - fixed typos
> > 
> > Tomáš Golembiovský (1):
> >   virtio_balloon: include buffers and cached memory statistics
> > 
> >  drivers/virtio/virtio_balloon.c | 11 +++
> >  include/uapi/linux/virtio_balloon.h |  4 +++-
> >  mm/swap_state.c |  1 +
> >  3 files changed, 15 insertions(+), 1 deletion(-)
> > 
> > -- 
> > 2.14.1
> > 
> 
> ping

ping

-- 
Tomáš Golembiovský 

Re: [PATCH v1 15/27] compiler: Option to default to hidden symbols

2017-10-26 Thread Thomas Garnier via Virtualization
On Thu, Oct 12, 2017 at 1:02 PM, Luis R. Rodriguez  wrote:
> On Wed, Oct 11, 2017 at 01:30:15PM -0700, Thomas Garnier wrote:
>> Provide an option to default visibility to hidden except for key
>> symbols. This option is disabled by default and will be used by x86_64
>> PIE support to remove errors between compilation units.
>>
>> The default visibility is also enabled for external symbols that are
>> compared as they maybe equals (start/end of sections). In this case,
>> older versions of GCC will remove the comparison if the symbols are
>> hidden. This issue exists at least on gcc 4.9 and before.
>>
>> Signed-off-by: Thomas Garnier 
>
> <-- snip -->
>
>> diff --git a/arch/x86/kernel/cpu/microcode/core.c 
>> b/arch/x86/kernel/cpu/microcode/core.c
>> index 86e8f0b2537b..8f021783a929 100644
>> --- a/arch/x86/kernel/cpu/microcode/core.c
>> +++ b/arch/x86/kernel/cpu/microcode/core.c
>> @@ -144,8 +144,8 @@ static bool __init check_loader_disabled_bsp(void)
>>   return *res;
>>  }
>>
>> -extern struct builtin_fw __start_builtin_fw[];
>> -extern struct builtin_fw __end_builtin_fw[];
>> +extern struct builtin_fw __start_builtin_fw[] __default_visibility;
>> +extern struct builtin_fw __end_builtin_fw[] __default_visibility;
>>
>>  bool get_builtin_firmware(struct cpio_data *cd, const char *name)
>>  {
>
> <-- snip -->
>
>> diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h
>> index e5da44eddd2f..1aa5d6dac9e1 100644
>> --- a/include/asm-generic/sections.h
>> +++ b/include/asm-generic/sections.h
>> @@ -30,6 +30,9 @@
>>   *   __irqentry_text_start, __irqentry_text_end
>>   *   __softirqentry_text_start, __softirqentry_text_end
>>   */
>> +#ifdef CONFIG_DEFAULT_HIDDEN
>> +#pragma GCC visibility push(default)
>> +#endif
>>  extern char _text[], _stext[], _etext[];
>>  extern char _data[], _sdata[], _edata[];
>>  extern char __bss_start[], __bss_stop[];
>> @@ -46,6 +49,9 @@ extern char __softirqentry_text_start[], 
>> __softirqentry_text_end[];
>>
>>  /* Start and end of .ctors section - used for constructor calls. */
>>  extern char __ctors_start[], __ctors_end[];
>> +#ifdef CONFIG_DEFAULT_HIDDEN
>> +#pragma GCC visibility pop
>> +#endif
>>
>>  extern __visible const void __nosave_begin, __nosave_end;
>>
>> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
>> index e95a2631e545..6997716f73bf 100644
>> --- a/include/linux/compiler.h
>> +++ b/include/linux/compiler.h
>> @@ -78,6 +78,14 @@ extern void __chk_io_ptr(const volatile void __iomem *);
>>  #include 
>>  #endif
>>
>> +/* Useful for Position Independent Code to reduce global references */
>> +#ifdef CONFIG_DEFAULT_HIDDEN
>> +#pragma GCC visibility push(hidden)
>> +#define __default_visibility  __attribute__((visibility ("default")))
>
> Does this still work with CONFIG_LD_DEAD_CODE_DATA_ELIMINATION ?

I cannot make it work with or without this change. How is it supposed
to be used?

For me, it crashes with a bad consdev at:
http://elixir.free-electrons.com/linux/latest/source/drivers/tty/tty_io.c#L3194

>
>> +#else
>> +#define __default_visibility
>> +#endif
>> +
>>  /*
>>   * Generic compiler-dependent macros required for kernel
>>   * build go below this comment. Actual compiler/compiler version
>> diff --git a/init/Kconfig b/init/Kconfig
>> index ccb1d8daf241..b640201fcff7 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -1649,6 +1649,13 @@ config PROFILING
>>  config TRACEPOINTS
>>   bool
>>
>> +#
>> +# Default to hidden visibility for all symbols.
>> +# Useful for Position Independent Code to reduce global references.
>> +#
>> +config DEFAULT_HIDDEN
>> + bool
>
> Note it is default.
>
> Has 0-day ran through this git tree? It should be easy to get it added for
> testing. Also, even though most changes are x86 based there are some generic
> changes and I'd love a warm fuzzy this won't break odd / random builds.
> Although 0-day does cover a lot of test cases, it only has limited run time
> tests. There are some other test beds which also cover some more obscure
> architectures. Having a test pass on Guenter's test bed would be nice to
> see. For that please coordinate with Guenter if he's willing to run this
> a test for you.

Not yet; I plan to give a v1.5 to Kees Cook to keep in one of his trees
for a couple of weeks. I expect it will identify interesting issues.

>
>   Luis



-- 
Thomas


Re: [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure

2017-10-26 Thread Josh Poimboeuf
On Tue, Oct 17, 2017 at 04:59:41PM -0400, Boris Ostrovsky wrote:
> On 10/17/2017 04:50 PM, Josh Poimboeuf wrote:
> > On Tue, Oct 17, 2017 at 04:36:00PM -0400, Boris Ostrovsky wrote:
> >> On 10/17/2017 04:17 PM, Josh Poimboeuf wrote:
> >>> On Tue, Oct 17, 2017 at 11:36:57AM -0400, Boris Ostrovsky wrote:
>  On 10/17/2017 10:36 AM, Josh Poimboeuf wrote:
> > Maybe we can add a new field to the alternatives entry struct which
> > specifies the offset to the CALL instruction, so apply_alternatives()
> > can find it.
>  We'd also have to assume that the restore part of an alternative entry
>  is the same size as the save part. Which is true now.
> >>> Why?
> >>>
> >> Don't you need to know the size of the instruction without save and
> >> restore part?
> >>
> >> + if (a->replacementlen == 6 && *insnbuf == 0xff && *(insnbuf+1) == 0x15)
> >>
> >> Otherwise you'd need another field for the actual instruction length.
> > If we know where the CALL instruction starts, and can verify that it
> > starts with "ff 15", then we know the instruction length: 6 bytes.
> > Right?
> >
> 
> Oh, OK. Then you shouldn't need a->replacementlen test(s?) in
> apply_alternatives()?

Right.  Though in the above code it was needed for a different reason,
to make sure it wasn't reading past the end of the buffer.

-- 
Josh
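
As a rough sketch (the field and function names are illustrative, not the
actual struct alt_instr layout), the check being discussed could look like:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Recognise an indirect RIP-relative CALL, "ff 15 <rel32>", i.e.
 * call *disp32(%rip), at a known offset inside the replacement buffer.
 * If it matches, the instruction is exactly 6 bytes long:
 * opcode + ModRM + 4-byte displacement. */
static bool is_indirect_rip_call(const uint8_t *insnbuf, size_t call_off, size_t buflen)
{
	return buflen >= call_off + 6 &&
	       insnbuf[call_off] == 0xff &&
	       insnbuf[call_off + 1] == 0x15;
}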


Re: [PATCH v1 00/27] x86: PIE support and option to extend KASLR randomization

2017-10-26 Thread Thomas Garnier via Virtualization
On Thu, Oct 12, 2017 at 9:28 AM, Tom Lendacky  wrote:
> On 10/12/2017 10:34 AM, Thomas Garnier wrote:
>>
>> On Wed, Oct 11, 2017 at 2:34 PM, Tom Lendacky 
>> wrote:
>>>
>>> On 10/11/2017 3:30 PM, Thomas Garnier wrote:

 Changes:
- patch v1:
  - Simplify ftrace implementation.
  - Use gcc mstack-protector-guard-reg=%gs with PIE when possible.
- rfc v3:
  - Use --emit-relocs instead of -pie to reduce dynamic relocation
 space on
mapped memory. It also simplifies the relocation process.
  - Move the start the module section next to the kernel. Remove the
 need for
-mcmodel=large on modules. Extends module space from 1 to 2G
 maximum.
  - Support for XEN PVH as 32-bit relocations can be ignored with
--emit-relocs.
  - Support for GOT relocations previously done automatically with
 -pie.
  - Remove need for dynamic PLT in modules.
  - Support dymamic GOT for modules.
- rfc v2:
  - Add support for global stack cookie while compiler default to fs
 without
mcmodel=kernel
  - Change patch 7 to correctly jump out of the identity mapping on
 kexec load
preserve.

 These patches make the changes necessary to build the kernel as Position
 Independent Executable (PIE) on x86_64. A PIE kernel can be relocated
 below
 the top 2G of the virtual address space. It allows to optionally extend
 the
 KASLR randomization range from 1G to 3G.
>>>
>>>
>>> Hi Thomas,
>>>
>>> I've applied your patches so that I can verify that SME works with PIE.
>>> Unfortunately, I'm running into build warnings and errors when I enable
>>> PIE.
>>>
>>> With CONFIG_STACK_VALIDATION=y I receive lots of messages like this:
>>>
>>>drivers/scsi/libfc/fc_exch.o: warning: objtool:
>>> fc_destroy_exch_mgr()+0x0: call without frame pointer save/setup
>>>
>>> Disabling CONFIG_STACK_VALIDATION suppresses those.
>>
>>
>> I ran into that, I plan to fix it in the next iteration.
>>
>>>
>>> But near the end of the build, I receive errors like this:
>>>
>>>arch/x86/kernel/setup.o: In function `dump_kernel_offset':
>>>.../arch/x86/kernel/setup.c:801:(.text+0x32): relocation truncated to
>>> fit: R_X86_64_32S against symbol `_text' defined in .text section in
>>> .tmp_vmlinux1
>>>.
>>>. about 10 more of the above type messages
>>>.
>>>make: *** [vmlinux] Error 1
>>>Error building kernel, exiting
>>>
>>> Are there any config options that should or should not be enabled when
>>> building with PIE enabled?  Is there a compiler requirement for PIE (I'm
>>> using gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5))?
>>
>>
>> I never ran into these ones and I tested compilers older and newer.
>> What was your exact configuration?
>
>
> I'll send you the config in a separate email.
>
> Thanks,
> Tom

Thanks for your feedback (Tom and Markus). The issue was linked to
using a modern gcc with a modern linker; I managed to repro and fix it
on my current version.

I will create a v1.5 for Kees Cook to keep on one of his branches for a
few weeks so I can collect as much feedback as possible from 0-day. After that I
will send v2.

>
>
>>
>>>
>>> Thanks,
>>> Tom
>>>

 Thanks a lot to Ard Biesheuvel & Kees Cook on their feedback on compiler
 changes, PIE support and KASLR in general. Thanks to Roland McGrath on
 his
 feedback for using -pie versus --emit-relocs and details on compiler
 code
 generation.

 The patches:
- 1-3, 5-1#, 17-18: Change in assembly code to be PIE compliant.
- 4: Add a new _ASM_GET_PTR macro to fetch a symbol address
 generically.
- 14: Adapt percpu design to work correctly when PIE is enabled.
- 15: Provide an option to default visibility to hidden except for
 key symbols.
  It removes errors between compilation units.
- 16: Adapt relocation tool to handle PIE binary correctly.
- 19: Add support for global cookie.
- 20: Support ftrace with PIE (used on Ubuntu config).
- 21: Fix incorrect address marker on dump_pagetables.
- 22: Add option to move the module section just after the kernel.
- 23: Adapt module loading to support PIE with dynamic GOT.
- 24: Make the GOT read-only.
- 25: Add the CONFIG_X86_PIE option (off by default).
- 26: Adapt relocation tool to generate a 64-bit relocation table.
- 27: Add the CONFIG_RANDOMIZE_BASE_LARGE option to increase
 relocation range
  from 1G to 3G (off by default).

 Performance/Size impact:

 Size of vmlinux (Default configuration):
File size:
- PIE disabled: +0.31%
- PIE enabled: -3.210% (less relocations)
.text section:
- PIE disabled: +0.000644%
- PIE enabled: +0.837%


Re: [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure

2017-10-26 Thread Josh Poimboeuf
On Tue, Oct 17, 2017 at 04:36:00PM -0400, Boris Ostrovsky wrote:
> On 10/17/2017 04:17 PM, Josh Poimboeuf wrote:
> > On Tue, Oct 17, 2017 at 11:36:57AM -0400, Boris Ostrovsky wrote:
> >> On 10/17/2017 10:36 AM, Josh Poimboeuf wrote:
> >>> Maybe we can add a new field to the alternatives entry struct which
> >>> specifies the offset to the CALL instruction, so apply_alternatives()
> >>> can find it.
> >> We'd also have to assume that the restore part of an alternative entry
> >> is the same size as the save part. Which is true now.
> > Why?
> >
> 
> Don't you need to know the size of the instruction without save and
> restore part?
> 
> + if (a->replacementlen == 6 && *insnbuf == 0xff && *(insnbuf+1) == 0x15)
> 
> Otherwise you'd need another field for the actual instruction length.

If we know where the CALL instruction starts, and can verify that it
starts with "ff 15", then we know the instruction length: 6 bytes.
Right?

-- 
Josh


Re: [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure

2017-10-26 Thread Josh Poimboeuf
On Tue, Oct 17, 2017 at 11:36:57AM -0400, Boris Ostrovsky wrote:
> On 10/17/2017 10:36 AM, Josh Poimboeuf wrote:
> >
> > Maybe we can add a new field to the alternatives entry struct which
> > specifies the offset to the CALL instruction, so apply_alternatives()
> > can find it.
> 
> We'd also have to assume that the restore part of an alternative entry
> is the same size as the save part. Which is true now.

Why?

-- 
Josh


Re: [PATCH v1 15/27] compiler: Option to default to hidden symbols

2017-10-26 Thread Luis R. Rodriguez
On Wed, Oct 11, 2017 at 01:30:15PM -0700, Thomas Garnier wrote:
> Provide an option to default visibility to hidden except for key
> symbols. This option is disabled by default and will be used by x86_64
> PIE support to remove errors between compilation units.
> 
> The default visibility is also enabled for external symbols that are
> compared as they maybe equals (start/end of sections). In this case,
> older versions of GCC will remove the comparison if the symbols are
> hidden. This issue exists at least on gcc 4.9 and before.
> 
> Signed-off-by: Thomas Garnier 

<-- snip -->

> diff --git a/arch/x86/kernel/cpu/microcode/core.c 
> b/arch/x86/kernel/cpu/microcode/core.c
> index 86e8f0b2537b..8f021783a929 100644
> --- a/arch/x86/kernel/cpu/microcode/core.c
> +++ b/arch/x86/kernel/cpu/microcode/core.c
> @@ -144,8 +144,8 @@ static bool __init check_loader_disabled_bsp(void)
>   return *res;
>  }
>  
> -extern struct builtin_fw __start_builtin_fw[];
> -extern struct builtin_fw __end_builtin_fw[];
> +extern struct builtin_fw __start_builtin_fw[] __default_visibility;
> +extern struct builtin_fw __end_builtin_fw[] __default_visibility;
>  
>  bool get_builtin_firmware(struct cpio_data *cd, const char *name)
>  {

<-- snip -->

> diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h
> index e5da44eddd2f..1aa5d6dac9e1 100644
> --- a/include/asm-generic/sections.h
> +++ b/include/asm-generic/sections.h
> @@ -30,6 +30,9 @@
>   *   __irqentry_text_start, __irqentry_text_end
>   *   __softirqentry_text_start, __softirqentry_text_end
>   */
> +#ifdef CONFIG_DEFAULT_HIDDEN
> +#pragma GCC visibility push(default)
> +#endif
>  extern char _text[], _stext[], _etext[];
>  extern char _data[], _sdata[], _edata[];
>  extern char __bss_start[], __bss_stop[];
> @@ -46,6 +49,9 @@ extern char __softirqentry_text_start[], 
> __softirqentry_text_end[];
>  
>  /* Start and end of .ctors section - used for constructor calls. */
>  extern char __ctors_start[], __ctors_end[];
> +#ifdef CONFIG_DEFAULT_HIDDEN
> +#pragma GCC visibility pop
> +#endif
>  
>  extern __visible const void __nosave_begin, __nosave_end;
>  
> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> index e95a2631e545..6997716f73bf 100644
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -78,6 +78,14 @@ extern void __chk_io_ptr(const volatile void __iomem *);
>  #include 
>  #endif
>  
> +/* Useful for Position Independent Code to reduce global references */
> +#ifdef CONFIG_DEFAULT_HIDDEN
> +#pragma GCC visibility push(hidden)
> +#define __default_visibility  __attribute__((visibility ("default")))

Does this still work with CONFIG_LD_DEAD_CODE_DATA_ELIMINATION?

> +#else
> +#define __default_visibility
> +#endif
> +
>  /*
>   * Generic compiler-dependent macros required for kernel
>   * build go below this comment. Actual compiler/compiler version
> diff --git a/init/Kconfig b/init/Kconfig
> index ccb1d8daf241..b640201fcff7 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1649,6 +1649,13 @@ config PROFILING
>  config TRACEPOINTS
>   bool
>  
> +#
> +# Default to hidden visibility for all symbols.
> +# Useful for Position Independent Code to reduce global references.
> +#
> +config DEFAULT_HIDDEN
> + bool

Note it is default.

Has 0-day run through this git tree? It should be easy to get it added for
testing. Also, even though most changes are x86-based, there are some generic
changes, and I'd love a warm fuzzy that this won't break odd / random builds.
Although 0-day does cover a lot of test cases, it only has limited run-time
tests. There are some other test beds which also cover some more obscure
architectures. Having a pass on Guenter's test bed would be nice to see.
For that, please coordinate with Guenter to see if he's willing to run a
test of this for you.
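
To make the visibility mechanism concrete, here is a minimal user-space
sketch of the pattern the patch relies on (the boundary symbols are the
same ones as in the microcode hunk above; the rest is made up):

/* visibility-sketch.c: everything in this unit becomes hidden by
 * default, except the two boundary symbols another unit compares. */
#pragma GCC visibility push(hidden)

#define __default_visibility __attribute__((visibility("default")))

/* Provided by a linker script in the real kernel; extern here. */
extern char __start_builtin_fw[] __default_visibility;
extern char __end_builtin_fw[] __default_visibility;

int have_builtin_fw(void)
{
	/* With hidden visibility, an older GCC may assume two distinct
	 * arrays can never share an address and fold this test to a
	 * constant, which is exactly the issue described above. */
	return __start_builtin_fw != __end_builtin_fw;
}

#pragma GCC visibility pop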

  Luis


Re: [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure

2017-10-26 Thread Josh Poimboeuf
On Tue, Oct 17, 2017 at 09:58:59AM -0400, Boris Ostrovsky wrote:
> On 10/17/2017 01:24 AM, Josh Poimboeuf wrote:
> > On Mon, Oct 16, 2017 at 02:18:48PM -0400, Boris Ostrovsky wrote:
> >> On 10/12/2017 03:53 PM, Boris Ostrovsky wrote:
> >>> On 10/12/2017 03:27 PM, Andrew Cooper wrote:
>  On 12/10/17 20:11, Boris Ostrovsky wrote:
> > There is also another problem:
> >
> > [1.312425] general protection fault:  [#1] SMP
> > [1.312901] Modules linked in:
> > [1.313389] CPU: 0 PID: 1 Comm: init Not tainted 4.14.0-rc4+ #6
> > [1.313878] task: 88003e2c task.stack: c938c000
> > [1.314360] RIP: 1e030:entry_SYSCALL_64_fastpath+0x1/0xa5
> > [1.314854] RSP: e02b:c938ff50 EFLAGS: 00010046
> > [1.315336] RAX: 000c RBX: 55f550168040 RCX:
> > 7fcfc959f59a
> > [1.315827] RDX:  RSI:  RDI:
> > 
> > [1.316315] RBP: 000a R08: 037f R09:
> > 0064
> > [1.316805] R10: 1f89cbf5 R11: 88003e2c R12:
> > 7fcfc958ad60
> > [1.317300] R13:  R14: 55f550185954 R15:
> > 1000
> > [1.317801] FS:  () GS:88003f80()
> > knlGS:
> > [1.318267] CS:  e033 DS:  ES:  CR0: 80050033
> > [1.318750] CR2: 7fcfc97ab218 CR3: 3c88e000 CR4:
> > 00042660
> > [1.319235] Call Trace:
> > [1.319700] Code: 51 50 57 56 52 51 6a da 41 50 41 51 41 52 41 53 48
> > 83 ec 30 65 4c 8b 1c 25 c0 d2 00 00 41 f7 03 df 39 08 90 0f 85 a5 00 00
> > 00 50  15 9c 95 d0 ff 58 48 3d 4c 01 00 00 77 0f 4c 89 d1 ff 14 c5
> > [1.321161] RIP: entry_SYSCALL_64_fastpath+0x1/0xa5 RSP: 
> > c938ff50
> > [1.344255] ---[ end trace d7cb8cd6cd7c294c ]---
> > [1.345009] Kernel panic - not syncing: Attempted to kill init!
> > exitcode=0x000b
> >
> >
> > All code
> > 
> >0:51   push   %rcx
> >1:50   push   %rax
> >2:57   push   %rdi
> >3:56   push   %rsi
> >4:52   push   %rdx
> >5:51   push   %rcx
> >6:6a dapushq  $0xffda
> >8:41 50push   %r8
> >a:41 51push   %r9
> >c:41 52push   %r10
> >e:41 53push   %r11
> >   10:48 83 ec 30  sub$0x30,%rsp
> >   14:65 4c 8b 1c 25 c0 d2 mov%gs:0xd2c0,%r11
> >   1b:00 00
> >   1d:41 f7 03 df 39 08 90 testl  $0x900839df,(%r11)
> >   24:0f 85 a5 00 00 00jne0xcf
> >   2a:50   push   %rax
> >   2b:*ff 15 9c 95 d0 ffcallq  *-0x2f6a64(%rip)#
> > 0xffd095cd<-- trapping instruction
> >   31:58   pop%rax
> >   32:48 3d 4c 01 00 00cmp$0x14c,%rax
> >   38:77 0fja 0x49
> >   3a:4c 89 d1 mov%r10,%rcx
> >   3d:ff   .byte 0xff
> >   3e:14 c5adc$0xc5,%al
> >
> >
> > so the original 'cli' was replaced with the pv call but to me the offset
> > looks a bit off, no? Shouldn't it always be positive?
>  callq takes a 32bit signed displacement, so jumping back by up to 2G is
>  perfectly legitimate.
> >>> Yes, but
> >>>
> >>> ostr@workbase> nm vmlinux | grep entry_SYSCALL_64_fastpath
> >>> 817365dd t entry_SYSCALL_64_fastpath
> >>> ostr@workbase> nm vmlinux | grep " pv_irq_ops"
> >>> 81c2dbc0 D pv_irq_ops
> >>> ostr@workbase>
> >>>
> >>> so pv_irq_ops.irq_disable is about 5MB ahead of where we are now. (I
> >>> didn't mean that x86 instruction set doesn't allow negative
> >>> displacement, I was trying to say that pv_irq_ops always live further 
> >>> down)
> >> I believe the problem is this:
> >>
> >> #define PV_INDIRECT(addr)   *addr(%rip)
> >>
> >> The displacement that the linker computes will be relative to the where
> >> this instruction is placed at the time of linking, which is in
> >> .pv_altinstructions (and not .text). So when we copy it into .text the
> >> displacement becomes bogus.
> > apply_alternatives() is supposed to adjust that displacement based on
> > the new IP, though it could be messing that up somehow.  (See patch
> > 10/13.)
> >
> 
> That patch doesn't take into account the fact that replacement
> instructions may have to save/restore registers. So, for example,
> 
> 
> -if (a->replacementlen && is_jmp(replacement[0]))
> +} 

Re: [PATCH v1 00/27] x86: PIE support and option to extend KASLR randomization

2017-10-26 Thread Tom Lendacky

On 10/12/2017 10:34 AM, Thomas Garnier wrote:

On Wed, Oct 11, 2017 at 2:34 PM, Tom Lendacky  wrote:

On 10/11/2017 3:30 PM, Thomas Garnier wrote:

Changes:
   - patch v1:
 - Simplify ftrace implementation.
 - Use gcc mstack-protector-guard-reg=%gs with PIE when possible.
   - rfc v3:
 - Use --emit-relocs instead of -pie to reduce dynamic relocation space on
   mapped memory. It also simplifies the relocation process.
 - Move the start the module section next to the kernel. Remove the need for
   -mcmodel=large on modules. Extends module space from 1 to 2G maximum.
 - Support for XEN PVH as 32-bit relocations can be ignored with
   --emit-relocs.
 - Support for GOT relocations previously done automatically with -pie.
 - Remove need for dynamic PLT in modules.
 - Support dymamic GOT for modules.
   - rfc v2:
 - Add support for global stack cookie while compiler default to fs without
   mcmodel=kernel
 - Change patch 7 to correctly jump out of the identity mapping on kexec 
load
   preserve.

These patches make the changes necessary to build the kernel as Position
Independent Executable (PIE) on x86_64. A PIE kernel can be relocated below
the top 2G of the virtual address space. It allows to optionally extend the
KASLR randomization range from 1G to 3G.


Hi Thomas,

I've applied your patches so that I can verify that SME works with PIE.
Unfortunately, I'm running into build warnings and errors when I enable
PIE.

With CONFIG_STACK_VALIDATION=y I receive lots of messages like this:

   drivers/scsi/libfc/fc_exch.o: warning: objtool: fc_destroy_exch_mgr()+0x0: 
call without frame pointer save/setup

Disabling CONFIG_STACK_VALIDATION suppresses those.


I ran into that, I plan to fix it in the next iteration.



But near the end of the build, I receive errors like this:

   arch/x86/kernel/setup.o: In function `dump_kernel_offset':
   .../arch/x86/kernel/setup.c:801:(.text+0x32): relocation truncated to fit: 
R_X86_64_32S against symbol `_text' defined in .text section in .tmp_vmlinux1
   .
   . about 10 more of the above type messages
   .
   make: *** [vmlinux] Error 1
   Error building kernel, exiting

Are there any config options that should or should not be enabled when
building with PIE enabled?  Is there a compiler requirement for PIE (I'm
using gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5))?


I never ran into these ones and I tested compilers older and newer.
What was your exact configuration?


I'll send you the config in a separate email.

Thanks,
Tom





Thanks,
Tom



Thanks a lot to Ard Biesheuvel & Kees Cook on their feedback on compiler
changes, PIE support and KASLR in general. Thanks to Roland McGrath on his
feedback for using -pie versus --emit-relocs and details on compiler code
generation.

The patches:
   - 1-3, 5-13, 17-18: Change in assembly code to be PIE compliant.
   - 4: Add a new _ASM_GET_PTR macro to fetch a symbol address generically.
   - 14: Adapt percpu design to work correctly when PIE is enabled.
   - 15: Provide an option to default visibility to hidden except for key 
symbols.
 It removes errors between compilation units.
   - 16: Adapt relocation tool to handle PIE binary correctly.
   - 19: Add support for global cookie.
   - 20: Support ftrace with PIE (used on Ubuntu config).
   - 21: Fix incorrect address marker on dump_pagetables.
   - 22: Add option to move the module section just after the kernel.
   - 23: Adapt module loading to support PIE with dynamic GOT.
   - 24: Make the GOT read-only.
   - 25: Add the CONFIG_X86_PIE option (off by default).
   - 26: Adapt relocation tool to generate a 64-bit relocation table.
   - 27: Add the CONFIG_RANDOMIZE_BASE_LARGE option to increase relocation range
 from 1G to 3G (off by default).

Performance/Size impact:

Size of vmlinux (Default configuration):
   File size:
   - PIE disabled: +0.31%
   - PIE enabled: -3.210% (less relocations)
   .text section:
   - PIE disabled: +0.000644%
   - PIE enabled: +0.837%

Size of vmlinux (Ubuntu configuration):
   File size:
   - PIE disabled: -0.201%
   - PIE enabled: -0.082%
   .text section:
   - PIE disabled: same
   - PIE enabled: +1.319%

Size of vmlinux (Default configuration + ORC):
   File size:
   - PIE enabled: -3.167%
   .text section:
   - PIE enabled: +0.814%

Size of vmlinux (Ubuntu configuration + ORC):
   File size:
   - PIE enabled: -3.167%
   .text section:
   - PIE enabled: +1.26%

The size increase is mainly due to not having access to the 32-bit signed
relocation that can be used with mcmodel=kernel. A small part is due to reduced
optimization for PIE code. This bug [1] was opened with gcc to provide a better
code generation for kernel PIE.

Hackbench (50% and 1600% on thread/process for pipe/sockets):
   - PIE disabled: no significant change (avg +0.1% on latest test).
   - PIE enabled: between -0.50% to +0.86% in average (default and Ubuntu 

Re: [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure

2017-10-26 Thread Josh Poimboeuf
On Mon, Oct 16, 2017 at 02:18:48PM -0400, Boris Ostrovsky wrote:
> On 10/12/2017 03:53 PM, Boris Ostrovsky wrote:
> > On 10/12/2017 03:27 PM, Andrew Cooper wrote:
> >> On 12/10/17 20:11, Boris Ostrovsky wrote:
> >>> There is also another problem:
> >>>
> >>> [1.312425] general protection fault:  [#1] SMP
> >>> [1.312901] Modules linked in:
> >>> [1.313389] CPU: 0 PID: 1 Comm: init Not tainted 4.14.0-rc4+ #6
> >>> [1.313878] task: 88003e2c task.stack: c938c000
> >>> [1.314360] RIP: 1e030:entry_SYSCALL_64_fastpath+0x1/0xa5
> >>> [1.314854] RSP: e02b:c938ff50 EFLAGS: 00010046
> >>> [1.315336] RAX: 000c RBX: 55f550168040 RCX:
> >>> 7fcfc959f59a
> >>> [1.315827] RDX:  RSI:  RDI:
> >>> 
> >>> [1.316315] RBP: 000a R08: 037f R09:
> >>> 0064
> >>> [1.316805] R10: 1f89cbf5 R11: 88003e2c R12:
> >>> 7fcfc958ad60
> >>> [1.317300] R13:  R14: 55f550185954 R15:
> >>> 1000
> >>> [1.317801] FS:  () GS:88003f80()
> >>> knlGS:
> >>> [1.318267] CS:  e033 DS:  ES:  CR0: 80050033
> >>> [1.318750] CR2: 7fcfc97ab218 CR3: 3c88e000 CR4:
> >>> 00042660
> >>> [1.319235] Call Trace:
> >>> [1.319700] Code: 51 50 57 56 52 51 6a da 41 50 41 51 41 52 41 53 48
> >>> 83 ec 30 65 4c 8b 1c 25 c0 d2 00 00 41 f7 03 df 39 08 90 0f 85 a5 00 00
> >>> 00 50  15 9c 95 d0 ff 58 48 3d 4c 01 00 00 77 0f 4c 89 d1 ff 14 c5
> >>> [1.321161] RIP: entry_SYSCALL_64_fastpath+0x1/0xa5 RSP: 
> >>> c938ff50
> >>> [1.344255] ---[ end trace d7cb8cd6cd7c294c ]---
> >>> [1.345009] Kernel panic - not syncing: Attempted to kill init!
> >>> exitcode=0x000b
> >>>
> >>>
> >>> All code
> >>> 
> >>>0:51   push   %rcx
> >>>1:50   push   %rax
> >>>2:57   push   %rdi
> >>>3:56   push   %rsi
> >>>4:52   push   %rdx
> >>>5:51   push   %rcx
> >>>6:6a dapushq  $0xffda
> >>>8:41 50push   %r8
> >>>a:41 51push   %r9
> >>>c:41 52push   %r10
> >>>e:41 53push   %r11
> >>>   10:48 83 ec 30  sub$0x30,%rsp
> >>>   14:65 4c 8b 1c 25 c0 d2 mov%gs:0xd2c0,%r11
> >>>   1b:00 00
> >>>   1d:41 f7 03 df 39 08 90 testl  $0x900839df,(%r11)
> >>>   24:0f 85 a5 00 00 00jne0xcf
> >>>   2a:50   push   %rax
> >>>   2b:*ff 15 9c 95 d0 ffcallq  *-0x2f6a64(%rip)#
> >>> 0xffd095cd<-- trapping instruction
> >>>   31:58   pop%rax
> >>>   32:48 3d 4c 01 00 00cmp$0x14c,%rax
> >>>   38:77 0fja 0x49
> >>>   3a:4c 89 d1 mov%r10,%rcx
> >>>   3d:ff   .byte 0xff
> >>>   3e:14 c5adc$0xc5,%al
> >>>
> >>>
> >>> so the original 'cli' was replaced with the pv call but to me the offset
> >>> looks a bit off, no? Shouldn't it always be positive?
> >> callq takes a 32bit signed displacement, so jumping back by up to 2G is
> >> perfectly legitimate.
> > Yes, but
> >
> > ostr@workbase> nm vmlinux | grep entry_SYSCALL_64_fastpath
> > 817365dd t entry_SYSCALL_64_fastpath
> > ostr@workbase> nm vmlinux | grep " pv_irq_ops"
> > 81c2dbc0 D pv_irq_ops
> > ostr@workbase>
> >
> > so pv_irq_ops.irq_disable is about 5MB ahead of where we are now. (I
> > didn't mean that x86 instruction set doesn't allow negative
> > displacement, I was trying to say that pv_irq_ops always live further down)
> 
> I believe the problem is this:
> 
> #define PV_INDIRECT(addr)   *addr(%rip)
> 
> The displacement that the linker computes will be relative to the where
> this instruction is placed at the time of linking, which is in
> .pv_altinstructions (and not .text). So when we copy it into .text the
> displacement becomes bogus.

apply_alternatives() is supposed to adjust that displacement based on
the new IP, though it could be messing that up somehow.  (See patch
10/13.)
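
For what it's worth, the arithmetic that has to happen when an
instruction with a RIP-relative operand is copied is roughly this (a
sketch with made-up names, not the actual patch 10/13 code):

#include <stdint.h>

/*
 * A rel32 displacement encodes (target - next_ip).  If the instruction
 * is copied from 'src' to 'dst' (same length), keeping the same target
 * means the displacement must grow by (src - dst).
 */
static inline void fixup_rip_rel32(uint8_t *dst, uint8_t *src,
				   int32_t *disp)
{
	*disp += (int32_t)(src - dst);
}

If that adjustment is skipped, or applied at the wrong byte offset
within the instruction, you end up with exactly the kind of bogus
displacement shown in the decode above.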

-- 
Josh


Re: [PATCH v1 00/27] x86: PIE support and option to extend KASLR randomization

2017-10-26 Thread Markus Trippelsdorf
On 2017.10.12 at 08:34 -0700, Thomas Garnier wrote:
> On Wed, Oct 11, 2017 at 2:34 PM, Tom Lendacky  wrote:
> > On 10/11/2017 3:30 PM, Thomas Garnier wrote:
> >> Changes:
> >>   - patch v1:
> >> - Simplify ftrace implementation.
> >> - Use gcc mstack-protector-guard-reg=%gs with PIE when possible.
> >>   - rfc v3:
> >> - Use --emit-relocs instead of -pie to reduce dynamic relocation space 
> >> on
> >>   mapped memory. It also simplifies the relocation process.
> >> - Move the start the module section next to the kernel. Remove the 
> >> need for
> >>   -mcmodel=large on modules. Extends module space from 1 to 2G maximum.
> >> - Support for XEN PVH as 32-bit relocations can be ignored with
> >>   --emit-relocs.
> >> - Support for GOT relocations previously done automatically with -pie.
> >> - Remove need for dynamic PLT in modules.
> >> - Support dymamic GOT for modules.
> >>   - rfc v2:
> >> - Add support for global stack cookie while compiler default to fs 
> >> without
> >>   mcmodel=kernel
> >> - Change patch 7 to correctly jump out of the identity mapping on 
> >> kexec load
> >>   preserve.
> >>
> >> These patches make the changes necessary to build the kernel as Position
> >> Independent Executable (PIE) on x86_64. A PIE kernel can be relocated below
> >> the top 2G of the virtual address space. It allows to optionally extend the
> >> KASLR randomization range from 1G to 3G.
> >
> > Hi Thomas,
> >
> > I've applied your patches so that I can verify that SME works with PIE.
> > Unfortunately, I'm running into build warnings and errors when I enable
> > PIE.
> >
> > With CONFIG_STACK_VALIDATION=y I receive lots of messages like this:
> >
> >   drivers/scsi/libfc/fc_exch.o: warning: objtool: 
> > fc_destroy_exch_mgr()+0x0: call without frame pointer save/setup
> >
> > Disabling CONFIG_STACK_VALIDATION suppresses those.
> 
> I ran into that, I plan to fix it in the next iteration.
> 
> >
> > But near the end of the build, I receive errors like this:
> >
> >   arch/x86/kernel/setup.o: In function `dump_kernel_offset':
> >   .../arch/x86/kernel/setup.c:801:(.text+0x32): relocation truncated to 
> > fit: R_X86_64_32S against symbol `_text' defined in .text section in 
> > .tmp_vmlinux1
> >   .
> >   . about 10 more of the above type messages
> >   .
> >   make: *** [vmlinux] Error 1
> >   Error building kernel, exiting
> >
> > Are there any config options that should or should not be enabled when
> > building with PIE enabled?  Is there a compiler requirement for PIE (I'm
> > using gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5))?
> 
> I never ran into these ones and I tested compilers older and newer.
> What was your exact configuration?

I get with gcc trunk and CONFIG_RANDOMIZE_BASE_LARGE=y:

...
  MODPOST vmlinux.o 
  ld: failed to convert GOTPCREL relocation; relink with --no-relax

and after adding --no-relax to vmlinux_link() in scripts/link-vmlinux.sh:

  MODPOST vmlinux.o
virt/kvm/vfio.o: In function `kvm_vfio_update_coherency.isra.4':
vfio.c:(.text+0x63): relocation truncated to fit: R_X86_64_PLT32 against 
undefined symbol `vfio_external_check_extension'
virt/kvm/vfio.o: In function `kvm_vfio_destroy':
vfio.c:(.text+0xf7): relocation truncated to fit: R_X86_64_PLT32 against 
undefined symbol `vfio_group_set_kvm'
vfio.c:(.text+0x10a): relocation truncated to fit: R_X86_64_PLT32 against 
undefined symbol `vfio_group_put_external_user'
virt/kvm/vfio.o: In function `kvm_vfio_set_attr':
vfio.c:(.text+0x2bc): relocation truncated to fit: R_X86_64_PLT32 against 
undefined symbol `vfio_external_group_match_file'
vfio.c:(.text+0x307): relocation truncated to fit: R_X86_64_PLT32 against 
undefined symbol `vfio_group_set_kvm'
vfio.c:(.text+0x31a): relocation truncated to fit: R_X86_64_PLT32 against 
undefined symbol `vfio_group_put_external_user'
vfio.c:(.text+0x3b9): relocation truncated to fit: R_X86_64_PLT32 against 
undefined symbol `vfio_group_get_external_user'
vfio.c:(.text+0x462): relocation truncated to fit: R_X86_64_PLT32 against 
undefined symbol `vfio_group_set_kvm'
vfio.c:(.text+0x4bd): relocation truncated to fit: R_X86_64_PLT32 against 
undefined symbol `vfio_group_put_external_user'
make: *** [Makefile:1000: vmlinux] Error 1

Works fine with CONFIG_RANDOMIZE_BASE_LARGE unset.

-- 
Markus


config.gz
Description: application/gunzip

Re: [PATCH v1 00/27] x86: PIE support and option to extend KASLR randomization

2017-10-26 Thread Tom Lendacky
On 10/11/2017 3:30 PM, Thomas Garnier wrote:
> Changes:
>   - patch v1:
> - Simplify ftrace implementation.
> - Use gcc mstack-protector-guard-reg=%gs with PIE when possible.
>   - rfc v3:
> - Use --emit-relocs instead of -pie to reduce dynamic relocation space on
>   mapped memory. It also simplifies the relocation process.
> - Move the start the module section next to the kernel. Remove the need 
> for
>   -mcmodel=large on modules. Extends module space from 1 to 2G maximum.
> - Support for XEN PVH as 32-bit relocations can be ignored with
>   --emit-relocs.
> - Support for GOT relocations previously done automatically with -pie.
> - Remove need for dynamic PLT in modules.
> - Support dymamic GOT for modules.
>   - rfc v2:
> - Add support for global stack cookie while compiler default to fs without
>   mcmodel=kernel
> - Change patch 7 to correctly jump out of the identity mapping on kexec 
> load
>   preserve.
> 
> These patches make the changes necessary to build the kernel as Position
> Independent Executable (PIE) on x86_64. A PIE kernel can be relocated below
> the top 2G of the virtual address space. It allows to optionally extend the
> KASLR randomization range from 1G to 3G.

Hi Thomas,

I've applied your patches so that I can verify that SME works with PIE.
Unfortunately, I'm running into build warnings and errors when I enable
PIE.

With CONFIG_STACK_VALIDATION=y I receive lots of messages like this:

  drivers/scsi/libfc/fc_exch.o: warning: objtool: fc_destroy_exch_mgr()+0x0: 
call without frame pointer save/setup

Disabling CONFIG_STACK_VALIDATION suppresses those.

But near the end of the build, I receive errors like this:

  arch/x86/kernel/setup.o: In function `dump_kernel_offset':
  .../arch/x86/kernel/setup.c:801:(.text+0x32): relocation truncated to fit: 
R_X86_64_32S against symbol `_text' defined in .text section in .tmp_vmlinux1
  .
  . about 10 more of the above type messages
  .
  make: *** [vmlinux] Error 1
  Error building kernel, exiting

Are there any config options that should or should not be enabled when
building with PIE enabled?  Is there a compiler requirement for PIE (I'm
using gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5))?

Thanks,
Tom

> 
> Thanks a lot to Ard Biesheuvel & Kees Cook on their feedback on compiler
> changes, PIE support and KASLR in general. Thanks to Roland McGrath on his
> feedback for using -pie versus --emit-relocs and details on compiler code
> generation.
> 
> The patches:
>   - 1-3, 5-13, 17-18: Change in assembly code to be PIE compliant.
>   - 4: Add a new _ASM_GET_PTR macro to fetch a symbol address generically.
>   - 14: Adapt percpu design to work correctly when PIE is enabled.
>   - 15: Provide an option to default visibility to hidden except for key 
> symbols.
> It removes errors between compilation units.
>   - 16: Adapt relocation tool to handle PIE binary correctly.
>   - 19: Add support for global cookie.
>   - 20: Support ftrace with PIE (used on Ubuntu config).
>   - 21: Fix incorrect address marker on dump_pagetables.
>   - 22: Add option to move the module section just after the kernel.
>   - 23: Adapt module loading to support PIE with dynamic GOT.
>   - 24: Make the GOT read-only.
>   - 25: Add the CONFIG_X86_PIE option (off by default).
>   - 26: Adapt relocation tool to generate a 64-bit relocation table.
>   - 27: Add the CONFIG_RANDOMIZE_BASE_LARGE option to increase relocation 
> range
> from 1G to 3G (off by default).
> 
> Performance/Size impact:
> 
> Size of vmlinux (Default configuration):
>   File size:
>   - PIE disabled: +0.31%
>   - PIE enabled: -3.210% (less relocations)
>   .text section:
>   - PIE disabled: +0.000644%
>   - PIE enabled: +0.837%
> 
> Size of vmlinux (Ubuntu configuration):
>   File size:
>   - PIE disabled: -0.201%
>   - PIE enabled: -0.082%
>   .text section:
>   - PIE disabled: same
>   - PIE enabled: +1.319%
> 
> Size of vmlinux (Default configuration + ORC):
>   File size:
>   - PIE enabled: -3.167%
>   .text section:
>   - PIE enabled: +0.814%
> 
> Size of vmlinux (Ubuntu configuration + ORC):
>   File size:
>   - PIE enabled: -3.167%
>   .text section:
>   - PIE enabled: +1.26%
> 
> The size increase is mainly due to not having access to the 32-bit signed
> relocation that can be used with mcmodel=kernel. A small part is due to 
> reduced
> optimization for PIE code. This bug [1] was opened with gcc to provide a 
> better
> code generation for kernel PIE.
> 
> Hackbench (50% and 1600% on thread/process for pipe/sockets):
>   - PIE disabled: no significant change (avg +0.1% on latest test).
>   - PIE enabled: between -0.50% to +0.86% in average (default and Ubuntu 
> config).
> 
> slab_test (average of 10 runs):
>   - PIE disabled: no significant change (-2% on latest run, likely noise).
>   - PIE enabled: between -1% and +0.8% on latest runs.
> 
> Kernbench (average of 10 Half and Optimal runs):
>   

[PATCH v1 27/27] x86/kaslr: Add option to extend KASLR range from 1GB to 3GB

2017-10-26 Thread Thomas Garnier via Virtualization
Add a new CONFIG_RANDOMIZE_BASE_LARGE option to benefit from PIE
support. It increases the KASLR range from 1GB to 3GB. The new range
starts at 0x just above the EFI memory region. This
option is off by default.

The boot code is adapted to create the appropriate page table spanning
three PUD pages.

The relocation table uses 64-bit integers generated with the updated
relocation tool with the large-reloc option.
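
Back-of-the-envelope for the page-table part (illustrative arithmetic
only, not code from this patch):

#include <stdio.h>

int main(void)
{
	unsigned long gib = 1024UL * 1024 * 1024;
	unsigned long image = 3 * gib;              /* new KERNEL_IMAGE_SIZE */
	unsigned long pmd_map = 2UL * 1024 * 1024;  /* one 2 MiB PMD entry   */

	/* 1536 PMD entries: three full PMD tables, i.e. three 1 GiB
	 * ranges to wire up at the PUD level instead of one. */
	printf("%lu PMD entries, %lu GiB\n", image / pmd_map, image / gib);
	return 0;
}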

Signed-off-by: Thomas Garnier 
---
 arch/x86/Kconfig | 21 +
 arch/x86/boot/compressed/Makefile|  5 +
 arch/x86/boot/compressed/misc.c  | 10 +-
 arch/x86/include/asm/page_64_types.h |  9 +
 arch/x86/kernel/head64.c | 15 ---
 arch/x86/kernel/head_64.S| 11 ++-
 6 files changed, 66 insertions(+), 5 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index bbd28a46ab55..54627dd06348 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2155,6 +2155,27 @@ config X86_PIE
select DYNAMIC_MODULE_BASE
select MODULE_REL_CRCS if MODVERSIONS
 
+config RANDOMIZE_BASE_LARGE
+   bool "Increase the randomization range of the kernel image"
+   depends on X86_64 && RANDOMIZE_BASE
+   select X86_PIE
+   select X86_MODULE_PLTS if MODULES
+   default n
+   ---help---
+ Build the kernel as a Position Independent Executable (PIE) and
+ increase the available randomization range from 1GB to 3GB.
+
+ This option impacts performance on kernel CPU intensive workloads up
+ to 10% due to PIE generated code. Impact on user-mode processes and
+ typical usage would be significantly less (0.50% when you build the
+ kernel).
+
+ The kernel and modules will generate slightly more assembly (1 to 2%
+ increase on the .text sections). The vmlinux binary will be
+ significantly smaller due to less relocations.
+
+ If unsure say N
+
 config HOTPLUG_CPU
bool "Support for hot-pluggable CPUs"
depends on SMP
diff --git a/arch/x86/boot/compressed/Makefile 
b/arch/x86/boot/compressed/Makefile
index 8a958274b54c..94dfee5a7cd2 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -112,7 +112,12 @@ $(obj)/vmlinux.bin: vmlinux FORCE
 
 targets += $(patsubst $(obj)/%,%,$(vmlinux-objs-y)) vmlinux.bin.all 
vmlinux.relocs
 
+# Large randomization require bigger relocation table
+ifeq ($(CONFIG_RANDOMIZE_BASE_LARGE),y)
+CMD_RELOCS = arch/x86/tools/relocs --large-reloc
+else
 CMD_RELOCS = arch/x86/tools/relocs
+endif
 quiet_cmd_relocs = RELOCS  $@
   cmd_relocs = $(CMD_RELOCS) $< > $@;$(CMD_RELOCS) --abs-relocs $<
 $(obj)/vmlinux.relocs: vmlinux FORCE
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index c14217cd0155..c1ac9f2e283d 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -169,10 +169,18 @@ void __puthex(unsigned long value)
 }
 
 #if CONFIG_X86_NEED_RELOCS
+
+/* Large randomization go lower than -2G and use large relocation table */
+#ifdef CONFIG_RANDOMIZE_BASE_LARGE
+typedef long rel_t;
+#else
+typedef int rel_t;
+#endif
+
 static void handle_relocations(void *output, unsigned long output_len,
   unsigned long virt_addr)
 {
-   int *reloc;
+   rel_t *reloc;
unsigned long delta, map, ptr;
unsigned long min_addr = (unsigned long)output;
unsigned long max_addr = min_addr + (VO___bss_start - VO__text);
diff --git a/arch/x86/include/asm/page_64_types.h 
b/arch/x86/include/asm/page_64_types.h
index 3f5f08b010d0..6b65f846dd64 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -48,7 +48,11 @@
 #define __PAGE_OFFSET   __PAGE_OFFSET_BASE
 #endif /* CONFIG_RANDOMIZE_MEMORY */
 
+#ifdef CONFIG_RANDOMIZE_BASE_LARGE
+#define __START_KERNEL_map _AC(0x, UL)
+#else
 #define __START_KERNEL_map _AC(0x8000, UL)
+#endif /* CONFIG_RANDOMIZE_BASE_LARGE */
 
 /* See Documentation/x86/x86_64/mm.txt for a description of the memory map. */
 #ifdef CONFIG_X86_5LEVEL
@@ -65,9 +69,14 @@
  * 512MiB by default, leaving 1.5GiB for modules once the page tables
  * are fully set up. If kernel ASLR is configured, it can extend the
  * kernel page table mapping, reducing the size of the modules area.
+ * On PIE, we relocate the binary 2G lower so add this extra space.
  */
 #if defined(CONFIG_RANDOMIZE_BASE)
+#ifdef CONFIG_RANDOMIZE_BASE_LARGE
+#define KERNEL_IMAGE_SIZE  (_AC(3, UL) * 1024 * 1024 * 1024)
+#else
 #define KERNEL_IMAGE_SIZE  (1024 * 1024 * 1024)
+#endif
 #else
 #define KERNEL_IMAGE_SIZE  (512 * 1024 * 1024)
 #endif
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index b6363f0d11a7..d603d0f5a40a 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -39,6 +39,7 @@ static 

[PATCH v1 25/27] x86/pie: Add option to build the kernel as PIE

2017-10-26 Thread Thomas Garnier via Virtualization
Add the CONFIG_X86_PIE option which builds the kernel as a Position
Independent Executable (PIE). The kernel is currently built with the
mcmodel=kernel option which forces it to stay in the top 2G of the
virtual address space. With PIE, the kernel will be able to move below
the current limit.

The --emit-relocs linker option was kept instead of using -pie to limit
the impact on mapped sections. Any incompatible relocation will be
caught by the arch/x86/tools/relocs binary at compile time.

If segment-based stack cookies are enabled, try to use the compiler
option to select the segment register. If not available, automatically
enable the global stack cookie in auto mode. Otherwise, recommend a
compiler update or the global stack cookie option.

Performance/Size impact:
Size of vmlinux (Default configuration):
 File size:
 - PIE disabled: +0.31%
 - PIE enabled: -3.210% (less relocations)
 .text section:
 - PIE disabled: +0.000644%
 - PIE enabled: +0.837%

Size of vmlinux (Ubuntu configuration):
 File size:
 - PIE disabled: -0.201%
 - PIE enabled: -0.082%
 .text section:
 - PIE disabled: same
 - PIE enabled: +1.319%

Size of vmlinux (Default configuration + ORC):
 File size:
 - PIE enabled: -3.167%
 .text section:
 - PIE enabled: +0.814%

Size of vmlinux (Ubuntu configuration + ORC):
 File size:
 - PIE enabled: -3.167%
 .text section:
 - PIE enabled: +1.26%

The size increase is mainly due to not having access to the 32-bit signed
relocation that can be used with mcmodel=kernel. A small part is due to reduced
optimization for PIE code. This bug [1] was opened with gcc to provide a better
code generation for kernel PIE.

Hackbench (50% and 1600% on thread/process for pipe/sockets):
 - PIE disabled: no significant change (avg +0.1% on latest test).
 - PIE enabled: between -0.50% to +0.86% in average (default and Ubuntu config).

slab_test (average of 10 runs):
 - PIE disabled: no significant change (-2% on latest run, likely noise).
 - PIE enabled: between -1% and +0.8% on latest runs.

Kernbench (average of 10 Half and Optimal runs):
 Elapsed Time:
 - PIE disabled: no significant change (avg -0.239%)
 - PIE enabled: average +0.07%
 System Time:
 - PIE disabled: no significant change (avg -0.277%)
 - PIE enabled: average +0.7%

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82303

Signed-off-by: Thomas Garnier 

merge PIE
---
 arch/x86/Kconfig  |  7 +++
 arch/x86/Makefile | 27 +++
 2 files changed, 34 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1f2b731c9ec3..bbd28a46ab55 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2148,6 +2148,13 @@ config X86_GLOBAL_STACKPROTECTOR
 
   If unsure, say N
 
+config X86_PIE
+   bool
+   depends on X86_64
+   select DEFAULT_HIDDEN
+   select DYNAMIC_MODULE_BASE
+   select MODULE_REL_CRCS if MODVERSIONS
+
 config HOTPLUG_CPU
bool "Support for hot-pluggable CPUs"
depends on SMP
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index b592d57c531b..beae9504c3f4 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -135,7 +135,34 @@ else
 
 KBUILD_CFLAGS += -mno-red-zone
 ifdef CONFIG_X86_PIE
+KBUILD_CFLAGS += -fPIC
 KBUILD_LDFLAGS_MODULE += -T $(srctree)/arch/x86/kernel/module.lds
+
+ifndef CONFIG_CC_STACKPROTECTOR_NONE
+ifndef CONFIG_X86_GLOBAL_STACKPROTECTOR
+stackseg-flag := -mstack-protector-guard-reg=%gs
+ifeq ($(call cc-option-yn,$(stackseg-flag)),n)
+ifdef CONFIG_CC_STACKPROTECTOR_AUTO
+$(warning Cannot use CONFIG_CC_STACKPROTECTOR_* while \
+building a position independent kernel. \
+Default to global stack protector \
+(CONFIG_X86_GLOBAL_STACKPROTECTOR).)
+CONFIG_X86_GLOBAL_STACKPROTECTOR := y
+KBUILD_CFLAGS += -DCONFIG_X86_GLOBAL_STACKPROTECTOR
+KBUILD_AFLAGS += -DCONFIG_X86_GLOBAL_STACKPROTECTOR
+else
+$(error echo Cannot use \
+CONFIG_CC_STACKPROTECTOR_(REGULAR|STRONG) \
+while building a position independent binary \
+Update your compiler or use \
+CONFIG_X86_GLOBAL_STACKPROTECTOR)
+endif
+else
+KBUILD_CFLAGS += $(stackseg-flag)
+endif
+endif
+endif
+
 else
 KBUILD_CFLAGS += -mcmodel=kernel
 endif
-- 
2.15.0.rc0.271.g36b669edcc-goog



Re: [PATCH v1 00/27] x86: PIE support and option to extend KASLR randomization

2017-10-26 Thread Thomas Garnier via Virtualization
On Wed, Oct 11, 2017 at 2:34 PM, Tom Lendacky  wrote:
> On 10/11/2017 3:30 PM, Thomas Garnier wrote:
>> Changes:
>>   - patch v1:
>> - Simplify ftrace implementation.
>> - Use gcc mstack-protector-guard-reg=%gs with PIE when possible.
>>   - rfc v3:
>> - Use --emit-relocs instead of -pie to reduce dynamic relocation space on
>>   mapped memory. It also simplifies the relocation process.
>> - Move the start the module section next to the kernel. Remove the need 
>> for
>>   -mcmodel=large on modules. Extends module space from 1 to 2G maximum.
>> - Support for XEN PVH as 32-bit relocations can be ignored with
>>   --emit-relocs.
>> - Support for GOT relocations previously done automatically with -pie.
>> - Remove need for dynamic PLT in modules.
>> - Support dymamic GOT for modules.
>>   - rfc v2:
>> - Add support for global stack cookie while compiler default to fs 
>> without
>>   mcmodel=kernel
>> - Change patch 7 to correctly jump out of the identity mapping on kexec 
>> load
>>   preserve.
>>
>> These patches make the changes necessary to build the kernel as Position
>> Independent Executable (PIE) on x86_64. A PIE kernel can be relocated below
>> the top 2G of the virtual address space. It allows to optionally extend the
>> KASLR randomization range from 1G to 3G.
>
> Hi Thomas,
>
> I've applied your patches so that I can verify that SME works with PIE.
> Unfortunately, I'm running into build warnings and errors when I enable
> PIE.
>
> With CONFIG_STACK_VALIDATION=y I receive lots of messages like this:
>
>   drivers/scsi/libfc/fc_exch.o: warning: objtool: fc_destroy_exch_mgr()+0x0: 
> call without frame pointer save/setup
>
> Disabling CONFIG_STACK_VALIDATION suppresses those.

I ran into that, I plan to fix it in the next iteration.

>
> But near the end of the build, I receive errors like this:
>
>   arch/x86/kernel/setup.o: In function `dump_kernel_offset':
>   .../arch/x86/kernel/setup.c:801:(.text+0x32): relocation truncated to fit: 
> R_X86_64_32S against symbol `_text' defined in .text section in .tmp_vmlinux1
>   .
>   . about 10 more of the above type messages
>   .
>   make: *** [vmlinux] Error 1
>   Error building kernel, exiting
>
> Are there any config options that should or should not be enabled when
> building with PIE enabled?  Is there a compiler requirement for PIE (I'm
> using gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5))?

I never ran into these ones and I tested compilers older and newer.
What was your exact configuration?

>
> Thanks,
> Tom
>
>>
>> Thanks a lot to Ard Biesheuvel & Kees Cook on their feedback on compiler
>> changes, PIE support and KASLR in general. Thanks to Roland McGrath on his
>> feedback for using -pie versus --emit-relocs and details on compiler code
>> generation.
>>
>> The patches:
>>   - 1-3, 5-13, 17-18: Change in assembly code to be PIE compliant.
>>   - 4: Add a new _ASM_GET_PTR macro to fetch a symbol address generically.
>>   - 14: Adapt percpu design to work correctly when PIE is enabled.
>>   - 15: Provide an option to default visibility to hidden except for key 
>> symbols.
>> It removes errors between compilation units.
>>   - 16: Adapt relocation tool to handle PIE binary correctly.
>>   - 19: Add support for global cookie.
>>   - 20: Support ftrace with PIE (used on Ubuntu config).
>>   - 21: Fix incorrect address marker on dump_pagetables.
>>   - 22: Add option to move the module section just after the kernel.
>>   - 23: Adapt module loading to support PIE with dynamic GOT.
>>   - 24: Make the GOT read-only.
>>   - 25: Add the CONFIG_X86_PIE option (off by default).
>>   - 26: Adapt relocation tool to generate a 64-bit relocation table.
>>   - 27: Add the CONFIG_RANDOMIZE_BASE_LARGE option to increase relocation 
>> range
>> from 1G to 3G (off by default).
>>
>> Performance/Size impact:
>>
>> Size of vmlinux (Default configuration):
>>   File size:
>>   - PIE disabled: +0.31%
>>   - PIE enabled: -3.210% (less relocations)
>>   .text section:
>>   - PIE disabled: +0.000644%
>>   - PIE enabled: +0.837%
>>
>> Size of vmlinux (Ubuntu configuration):
>>   File size:
>>   - PIE disabled: -0.201%
>>   - PIE enabled: -0.082%
>>   .text section:
>>   - PIE disabled: same
>>   - PIE enabled: +1.319%
>>
>> Size of vmlinux (Default configuration + ORC):
>>   File size:
>>   - PIE enabled: -3.167%
>>   .text section:
>>   - PIE enabled: +0.814%
>>
>> Size of vmlinux (Ubuntu configuration + ORC):
>>   File size:
>>   - PIE enabled: -3.167%
>>   .text section:
>>   - PIE enabled: +1.26%
>>
>> The size increase is mainly due to not having access to the 32-bit signed
>> relocation that can be used with mcmodel=kernel. A small part is due to 
>> reduced
>> optimization for PIE code. This bug [1] was opened with gcc to provide a 
>> better
>> code generation for kernel PIE.
>>
>> Hackbench (50% and 1600% on thread/process for pipe/sockets):
>> 

[PATCH v1 26/27] x86/relocs: Add option to generate 64-bit relocations

2017-10-26 Thread Thomas Garnier via Virtualization
The x86 relocation tool generates a list of 32-bit signed integers. There
was no need to use 64-bit integers because all addresses were in the top
2G of memory.

This change adds a large-reloc option to generate 64-bit unsigned integers.
It can be used when the kernel plans to go below the top 2G and 32-bit
integers are not enough.
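
To illustrate why 32 bits stop being enough, a quick user-space check
(purely illustrative; the addresses are just examples):

#include <stdint.h>
#include <stdio.h>

/* A value fits a 32-bit signed slot iff sign-extending its low 32 bits
 * gives the value back. */
static int fits_s32(uint64_t v)
{
	return (uint64_t)(int64_t)(int32_t)v == v;
}

int main(void)
{
	/* Typical address in the top 2G: fits. */
	printf("%d\n", fits_s32(0xffffffff81000000ULL));
	/* Below the top 2G, as with the extended KASLR range: does not. */
	printf("%d\n", fits_s32(0xffffffff01000000ULL));
	return 0;
}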

Signed-off-by: Thomas Garnier 
---
 arch/x86/tools/relocs.c| 60 +-
 arch/x86/tools/relocs.h|  4 +--
 arch/x86/tools/relocs_common.c | 15 +++
 3 files changed, 60 insertions(+), 19 deletions(-)

diff --git a/arch/x86/tools/relocs.c b/arch/x86/tools/relocs.c
index bc032ad88b22..e7497ea1fe76 100644
--- a/arch/x86/tools/relocs.c
+++ b/arch/x86/tools/relocs.c
@@ -12,8 +12,14 @@
 
 static Elf_Ehdr ehdr;
 
+#if ELF_BITS == 64
+typedef uint64_t rel_off_t;
+#else
+typedef uint32_t rel_off_t;
+#endif
+
 struct relocs {
-   uint32_t*offset;
+   rel_off_t   *offset;
unsigned long   count;
unsigned long   size;
 };
@@ -684,7 +690,7 @@ static void print_absolute_relocs(void)
printf("\n");
 }
 
-static void add_reloc(struct relocs *r, uint32_t offset)
+static void add_reloc(struct relocs *r, rel_off_t offset)
 {
if (r->count == r->size) {
unsigned long newsize = r->size + 5;
@@ -1058,26 +1064,48 @@ static void sort_relocs(struct relocs *r)
qsort(r->offset, r->count, sizeof(r->offset[0]), cmp_relocs);
 }
 
-static int write32(uint32_t v, FILE *f)
+static int write32(rel_off_t rel, FILE *f)
 {
-   unsigned char buf[4];
+   unsigned char buf[sizeof(uint32_t)];
+   uint32_t v = (uint32_t)rel;
 
put_unaligned_le32(v, buf);
-   return fwrite(buf, 1, 4, f) == 4 ? 0 : -1;
+   return fwrite(buf, 1, sizeof(buf), f) == sizeof(buf) ? 0 : -1;
 }
 
-static int write32_as_text(uint32_t v, FILE *f)
+static int write32_as_text(rel_off_t rel, FILE *f)
 {
+   uint32_t v = (uint32_t)rel;
return fprintf(f, "\t.long 0x%08"PRIx32"\n", v) > 0 ? 0 : -1;
 }
 
-static void emit_relocs(int as_text, int use_real_mode)
+static int write64(rel_off_t rel, FILE *f)
+{
+   unsigned char buf[sizeof(uint64_t)];
+   uint64_t v = (uint64_t)rel;
+
+   put_unaligned_le64(v, buf);
+   return fwrite(buf, 1, sizeof(buf), f) == sizeof(buf) ? 0 : -1;
+}
+
+static int write64_as_text(rel_off_t rel, FILE *f)
+{
+   uint64_t v = (uint64_t)rel;
+   return fprintf(f, "\t.quad 0x%016"PRIx64"\n", v) > 0 ? 0 : -1;
+}
+
+static void emit_relocs(int as_text, int use_real_mode, int use_large_reloc)
 {
int i;
-   int (*write_reloc)(uint32_t, FILE *) = write32;
+   int (*write_reloc)(rel_off_t, FILE *);
int (*do_reloc)(struct section *sec, Elf_Rel *rel, Elf_Sym *sym,
const char *symname);
 
+   if (use_large_reloc)
+   write_reloc = write64;
+   else
+   write_reloc = write32;
+
 #if ELF_BITS == 64
if (!use_real_mode)
do_reloc = do_reloc64;
@@ -1088,6 +1116,9 @@ static void emit_relocs(int as_text, int use_real_mode)
do_reloc = do_reloc32;
else
do_reloc = do_reloc_real;
+
+   /* Large relocations only for 64-bit */
+   use_large_reloc = 0;
 #endif
 
/* Collect up the relocations */
@@ -1111,8 +1142,13 @@ static void emit_relocs(int as_text, int use_real_mode)
 * gas will like.
 */
printf(".section \".data.reloc\",\"a\"\n");
-   printf(".balign 4\n");
-   write_reloc = write32_as_text;
+   if (use_large_reloc) {
+   printf(".balign 8\n");
+   write_reloc = write64_as_text;
+   } else {
+   printf(".balign 4\n");
+   write_reloc = write32_as_text;
+   }
}
 
if (use_real_mode) {
@@ -1180,7 +1216,7 @@ static void print_reloc_info(void)
 
 void process(FILE *fp, int use_real_mode, int as_text,
 int show_absolute_syms, int show_absolute_relocs,
-int show_reloc_info)
+int show_reloc_info, int use_large_reloc)
 {
regex_init(use_real_mode);
read_ehdr(fp);
@@ -1203,5 +1239,5 @@ void process(FILE *fp, int use_real_mode, int as_text,
print_reloc_info();
return;
}
-   emit_relocs(as_text, use_real_mode);
+   emit_relocs(as_text, use_real_mode, use_large_reloc);
 }
diff --git a/arch/x86/tools/relocs.h b/arch/x86/tools/relocs.h
index 1d23bf953a4a..cb771cc4412d 100644
--- a/arch/x86/tools/relocs.h
+++ b/arch/x86/tools/relocs.h
@@ -30,8 +30,8 @@ enum symtype {
 
 void process_32(FILE *fp, int use_real_mode, int as_text,
int show_absolute_syms, int show_absolute_relocs,
-   int show_reloc_info);
+   int show_reloc_info, int 

[PATCH v1 24/27] x86/mm: Make the x86 GOT read-only

2017-10-26 Thread Thomas Garnier via Virtualization
The GOT is changed during early boot when relocations are applied, so it
can be made read-only directly. This table exists only for a PIE binary.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.
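
Conceptually, the early-boot change amounts to something like this sketch
(illustrative only; the boundary symbol names match the linker script
hunk below, the rest is made up):

extern unsigned long __start_got[], __end_got[];

/* Each GOT slot holds an absolute 64-bit address, so every slot has to
 * be adjusted by the KASLR displacement while relocations are applied,
 * before the section is treated as plain read-only data. */
static void adjust_got(unsigned long load_delta)
{
	unsigned long *slot;

	for (slot = __start_got; slot < __end_got; slot++)
		*slot += load_delta;
}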

Signed-off-by: Thomas Garnier 
---
 include/asm-generic/vmlinux.lds.h | 12 
 1 file changed, 12 insertions(+)

diff --git a/include/asm-generic/vmlinux.lds.h 
b/include/asm-generic/vmlinux.lds.h
index e549bff87c5b..a2301c292e26 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -279,6 +279,17 @@
VMLINUX_SYMBOL(__end_ro_after_init) = .;
 #endif
 
+#ifdef CONFIG_X86_PIE
+#define RO_GOT_X86 \
+   .got: AT(ADDR(.got) - LOAD_OFFSET) {\
+   VMLINUX_SYMBOL(__start_got) = .;\
+   *(.got);\
+   VMLINUX_SYMBOL(__end_got) = .;  \
+   }
+#else
+#define RO_GOT_X86
+#endif
+
 /*
  * Read only Data
  */
@@ -335,6 +346,7 @@
VMLINUX_SYMBOL(__end_builtin_fw) = .;   \
}   \
\
+   RO_GOT_X86  \
TRACEDATA   \
\
/* Kernel symbol table: Normal symbols */   \
-- 
2.15.0.rc0.271.g36b669edcc-goog



[PATCH v1 22/27] x86/modules: Add option to start module section after kernel

2017-10-26 Thread Thomas Garnier via Virtualization
Add an option so the module section is placed just after the mapped kernel.
It will ensure position independent modules are always at the right
distance from the kernel and do not require mcmodel=large. It also
optimizes the available size for modules by getting rid of the empty
space in the kernel randomization range.
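
For illustration, the new module base boils down to something like this
(a standalone sketch of the MODULES_VADDR definition in the diff below,
not kernel code):

#define PAGE_SIZE	4096UL
#define PMD_SIZE	(2UL * 1024 * 1024)	/* 2 MiB on x86-64 */
#define ALIGN_UP(x, a)	(((x) + (a) - 1) & ~((a) - 1))

/* With DYNAMIC_MODULE_BASE, modules start one page past the end of the
 * mapped kernel image, rounded up to a 2 MiB boundary. */
static unsigned long modules_vaddr(unsigned long kernel_end)
{
	return ALIGN_UP(kernel_end + PAGE_SIZE, PMD_SIZE);
}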

Signed-off-by: Thomas Garnier 
---
 Documentation/x86/x86_64/mm.txt | 3 +++
 arch/x86/Kconfig| 4 
 arch/x86/include/asm/pgtable_64_types.h | 6 +-
 arch/x86/kernel/head64.c| 5 -
 arch/x86/mm/dump_pagetables.c   | 4 ++--
 5 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
index b0798e281aa6..b51d66386e32 100644
--- a/Documentation/x86/x86_64/mm.txt
+++ b/Documentation/x86/x86_64/mm.txt
@@ -73,4 +73,7 @@ Note that if CONFIG_RANDOMIZE_MEMORY is enabled, the direct 
mapping of all
 physical memory, vmalloc/ioremap space and virtual memory map are randomized.
 Their order is preserved but their base will be offset early at boot time.
 
+If CONFIG_DYNAMIC_MODULE_BASE is enabled, the module section follows the end of
+the mapped kernel.
+
 -Andi Kleen, Jul 2004
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 772ff3e0f623..1f2b731c9ec3 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2133,6 +2133,10 @@ config RANDOMIZE_MEMORY_PHYSICAL_PADDING
 
   If unsure, leave at the default value.
 
+# Module section starts just after the end of the kernel module
+config DYNAMIC_MODULE_BASE
+   bool
+
 config X86_GLOBAL_STACKPROTECTOR
bool "Stack cookie using a global variable"
select CC_STACKPROTECTOR
diff --git a/arch/x86/include/asm/pgtable_64_types.h 
b/arch/x86/include/asm/pgtable_64_types.h
index 06470da156ba..e00fc429b898 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -6,6 +6,7 @@
 #ifndef __ASSEMBLY__
 #include 
 #include 
+#include 
 
 /*
  * These are used to make use of C type-checking..
@@ -18,7 +19,6 @@ typedef unsigned long pgdval_t;
 typedef unsigned long  pgprotval_t;
 
 typedef struct { pteval_t pte; } pte_t;
-
 #endif /* !__ASSEMBLY__ */
 
 #define SHARED_KERNEL_PMD  0
@@ -93,7 +93,11 @@ typedef struct { pteval_t pte; } pte_t;
 #define VMEMMAP_START  __VMEMMAP_BASE
 #endif /* CONFIG_RANDOMIZE_MEMORY */
 #define VMALLOC_END(VMALLOC_START + _AC((VMALLOC_SIZE_TB << 40) - 1, UL))
+#ifdef CONFIG_DYNAMIC_MODULE_BASE
+#define MODULES_VADDR   ALIGN(((unsigned long)_end + PAGE_SIZE), PMD_SIZE)
+#else
 #define MODULES_VADDR(__START_KERNEL_map + KERNEL_IMAGE_SIZE)
+#endif
 /* The module sections ends with the start of the fixmap */
 #define MODULES_END   __fix_to_virt(__end_of_fixed_addresses + 1)
 #define MODULES_LEN   (MODULES_END - MODULES_VADDR)
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 675f1dba3b21..b6363f0d11a7 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -321,12 +321,15 @@ asmlinkage __visible void __init x86_64_start_kernel(char 
* real_mode_data)
 * Build-time sanity checks on the kernel image and module
 * area mappings. (these are purely build-time and produce no code)
 */
+#ifndef CONFIG_DYNAMIC_MODULE_BASE
BUILD_BUG_ON(MODULES_VADDR < __START_KERNEL_map);
BUILD_BUG_ON(MODULES_VADDR - __START_KERNEL_map < KERNEL_IMAGE_SIZE);
-   BUILD_BUG_ON(MODULES_LEN + KERNEL_IMAGE_SIZE > 2*PUD_SIZE);
+   BUILD_BUG_ON(!IS_ENABLED(CONFIG_RANDOMIZE_BASE_LARGE) &&
+MODULES_LEN + KERNEL_IMAGE_SIZE > 2*PUD_SIZE);
BUILD_BUG_ON((__START_KERNEL_map & ~PMD_MASK) != 0);
BUILD_BUG_ON((MODULES_VADDR & ~PMD_MASK) != 0);
BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL));
+#endif
BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) ==
(__START_KERNEL & PGDIR_MASK)));
BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 8691a57da63e..8565b2b45848 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -95,7 +95,7 @@ static struct addr_marker address_markers[] = {
{ EFI_VA_END,   "EFI Runtime Services" },
 # endif
{ __START_KERNEL_map,   "High Kernel Mapping" },
-   { MODULES_VADDR,"Modules" },
+   { 0/* MODULES_VADDR */, "Modules" },
{ MODULES_END,  "End Modules" },
 #else
{ PAGE_OFFSET,  "Kernel Mapping" },
@@ -529,7 +529,7 @@ static int __init pt_dump_init(void)
 # endif
address_markers[FIXADDR_START_NR].start_address = FIXADDR_START;
 #endif
-
+   address_markers[MODULES_VADDR_NR].start_address = MODULES_VADDR;
return 0;
 }
 __initcall(pt_dump_init);
-- 
2.15.0.rc0.271.g36b669edcc-goog


[PATCH v1 23/27] x86/modules: Adapt module loading for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
Adapt module loading to support PIE relocations. Generate a dynamic GOT if
a symbol requires it but no entry exists in the kernel GOT.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.
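
As a rough picture of what the GOT indirection buys (simplified,
illustrative C, not the code added below):

#include <stdint.h>

/*
 * Resolving an R_X86_64_GOTPCREL-style relocation: 'loc' is the 32-bit
 * field being patched, 'got_entry' is the address of the slot (kernel
 * GOT or the module's generated GOT) that holds the symbol's full
 * 64-bit address.  Only the distance from the instruction to the GOT
 * has to fit in 32 bits, not the distance to the symbol itself.
 */
static void patch_gotpcrel(int32_t *loc, uint64_t got_entry, int64_t addend)
{
	int64_t val = (int64_t)got_entry + addend - (int64_t)(uintptr_t)loc;

	*loc = (int32_t)val;	/* real code should check that it fits */
}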

Signed-off-by: Thomas Garnier 
---
 arch/x86/Makefile   |   4 +
 arch/x86/include/asm/module.h   |  11 +++
 arch/x86/include/asm/sections.h |   4 +
 arch/x86/kernel/module.c| 182 ++--
 arch/x86/kernel/module.lds  |   3 +
 5 files changed, 199 insertions(+), 5 deletions(-)
 create mode 100644 arch/x86/kernel/module.lds

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index de228200ef2a..b592d57c531b 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -134,7 +134,11 @@ else
 KBUILD_CFLAGS += $(cflags-y)
 
 KBUILD_CFLAGS += -mno-red-zone
+ifdef CONFIG_X86_PIE
+KBUILD_LDFLAGS_MODULE += -T $(srctree)/arch/x86/kernel/module.lds
+else
 KBUILD_CFLAGS += -mcmodel=kernel
+endif
 
 # -funit-at-a-time shrinks the kernel .text considerably
 # unfortunately it makes reading oopses harder.
diff --git a/arch/x86/include/asm/module.h b/arch/x86/include/asm/module.h
index 9eb7c718aaf8..21e0e02c0343 100644
--- a/arch/x86/include/asm/module.h
+++ b/arch/x86/include/asm/module.h
@@ -4,12 +4,23 @@
 #include 
 #include 
 
+#ifdef CONFIG_X86_PIE
+struct mod_got_sec {
+   struct elf64_shdr   *got;
+   int got_num_entries;
+   int got_max_entries;
+};
+#endif
+
 struct mod_arch_specific {
 #ifdef CONFIG_ORC_UNWINDER
unsigned int num_orcs;
int *orc_unwind_ip;
struct orc_entry *orc_unwind;
 #endif
+#ifdef CONFIG_X86_PIE
+   struct mod_got_sec  core;
+#endif
 };
 
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
index 6b2d496cf1aa..92d796109da1 100644
--- a/arch/x86/include/asm/sections.h
+++ b/arch/x86/include/asm/sections.h
@@ -15,4 +15,8 @@ extern char __end_rodata_hpage_align[];
 extern char __start_got[], __end_got[];
 #endif
 
+#if defined(CONFIG_X86_PIE)
+extern char __start_got[], __end_got[];
+#endif
+
 #endif /* _ASM_X86_SECTIONS_H */
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index 62e7d70aadd5..aed24dfac1d3 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -77,6 +78,173 @@ static unsigned long int get_module_load_offset(void)
 }
 #endif
 
+#ifdef CONFIG_X86_PIE
+static u64 find_got_kernel_entry(Elf64_Sym *sym, const Elf64_Rela *rela)
+{
+   u64 *pos;
+
+   for (pos = (u64*)__start_got; pos < (u64*)__end_got; pos++) {
+   if (*pos == sym->st_value)
+   return (u64)pos + rela->r_addend;
+   }
+
+   return 0;
+}
+
+static u64 module_emit_got_entry(struct module *mod, void *loc,
+const Elf64_Rela *rela, Elf64_Sym *sym)
+{
+   struct mod_got_sec *gotsec = &mod->arch.core;
+   u64 *got = (u64*)gotsec->got->sh_addr;
+   int i = gotsec->got_num_entries;
+   u64 ret;
+
+   /* Check if we can use the kernel GOT */
+   ret = find_got_kernel_entry(sym, rela);
+   if (ret)
+   return ret;
+
+   got[i] = sym->st_value;
+
+   /*
+* Check if the entry we just created is a duplicate. Given that the
+* relocations are sorted, this will be the last entry we allocated.
+* (if one exists).
+*/
+   if (i > 0 && got[i] == got[i - 2]) {
+   ret = (u64)&got[i - 1];
+   } else {
+   gotsec->got_num_entries++;
+   BUG_ON(gotsec->got_num_entries > gotsec->got_max_entries);
+   ret = (u64)&got[i];
+   }
+
+   return ret + rela->r_addend;
+}
+
+#define cmp_3way(a,b)  ((a) < (b) ? -1 : (a) > (b))
+
+static int cmp_rela(const void *a, const void *b)
+{
+   const Elf64_Rela *x = a, *y = b;
+   int i;
+
+   /* sort by type, symbol index and addend */
+   i = cmp_3way(ELF64_R_TYPE(x->r_info), ELF64_R_TYPE(y->r_info));
+   if (i == 0)
+   i = cmp_3way(ELF64_R_SYM(x->r_info), ELF64_R_SYM(y->r_info));
+   if (i == 0)
+   i = cmp_3way(x->r_addend, y->r_addend);
+   return i;
+}
+
+static bool duplicate_rel(const Elf64_Rela *rela, int num)
+{
+   /*
+* Entries are sorted by type, symbol index and addend. That means
+* that, if a duplicate entry exists, it must be in the preceding
+* slot.
+*/
+   return num > 0 && cmp_rela(rela + num, rela + num - 1) == 0;
+}
+
+static unsigned int count_gots(Elf64_Sym *syms, Elf64_Rela *rela, int num)
+{
+   unsigned int ret = 0;
+   Elf64_Sym *s;
+   int i;
+
+   for (i = 0; i < num; i++) {
+   switch (ELF64_R_TYPE(rela[i].r_info)) {
+   

[PATCH v1 21/27] x86/mm/dump_pagetables: Fix address markers index on x86_64

2017-10-26 Thread Thomas Garnier via Virtualization
The address_markers_idx enum is not aligned with the table when EFI is
enabled. Add an EFI_VA_END_NR entry in this case.

Signed-off-by: Thomas Garnier 
---
 arch/x86/mm/dump_pagetables.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 5e3ac6fe6c9e..8691a57da63e 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -52,12 +52,15 @@ enum address_markers_idx {
LOW_KERNEL_NR,
VMALLOC_START_NR,
VMEMMAP_START_NR,
-#ifdef CONFIG_KASAN
+# ifdef CONFIG_KASAN
KASAN_SHADOW_START_NR,
KASAN_SHADOW_END_NR,
-#endif
+# endif
 # ifdef CONFIG_X86_ESPFIX64
ESPFIX_START_NR,
+# endif
+# ifdef CONFIG_EFI
+   EFI_VA_END_NR,
 # endif
HIGH_KERNEL_NR,
MODULES_VADDR_NR,
-- 
2.15.0.rc0.271.g36b669edcc-goog



[PATCH v1 18/27] kvm: Adapt assembly for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
Change the assembly code to use only relative references to symbols so the
kernel can be PIE compatible. The new __ASM_GET_PTR_PRE macro is used to
get the address of a symbol on both 32 and 64-bit with PIE support.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.
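
The flavour of the change, reduced to a standalone example (not from the
patch; the symbol is a made-up stand-in):

extern char some_symbol[];	/* stands in for e.g. kvm_rebooting */

static unsigned long addr_of_sym(void)
{
	unsigned long p;

	/*
	 * mcmodel=kernel code can embed a 32-bit absolute address
	 * ("movq $some_symbol, %rax"), which only works in the top 2G.
	 * PIE-compatible code asks the CPU to compute it from %rip:
	 */
	asm ("leaq some_symbol(%%rip), %0" : "=r" (p));
	return p;
}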

Signed-off-by: Thomas Garnier 
---
 arch/x86/include/asm/kvm_host.h | 6 --
 arch/x86/kernel/kvm.c   | 6 --
 arch/x86/kvm/svm.c  | 4 ++--
 3 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9d7d856b2d89..14073fda75fb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1342,9 +1342,11 @@ asmlinkage void kvm_spurious_fault(void);
".pushsection .fixup, \"ax\" \n" \
"667: \n\t" \
cleanup_insn "\n\t"   \
-   "cmpb $0, kvm_rebooting \n\t" \
+   "cmpb $0, kvm_rebooting" __ASM_SEL(,(%%rip)) " \n\t" \
"jne 668b \n\t"   \
-   __ASM_SIZE(push) " $666b \n\t"\
+   __ASM_SIZE(push) "%%" _ASM_AX " \n\t"   \
+   __ASM_GET_PTR_PRE(666b) "%%" _ASM_AX "\n\t" \
+   "xchg %%" _ASM_AX ", (%%" _ASM_SP ") \n\t"  \
"call kvm_spurious_fault \n\t"\
".popsection \n\t" \
_ASM_EXTABLE(666b, 667b)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 8bb9594d0761..4464c3667831 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -627,8 +627,10 @@ asm(
 ".global __raw_callee_save___kvm_vcpu_is_preempted;"
 ".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
 "__raw_callee_save___kvm_vcpu_is_preempted:"
-"movq  __per_cpu_offset(,%rdi,8), %rax;"
-"cmpb  $0, " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rax);"
+"leaq  __per_cpu_offset(%rip), %rax;"
+"movq  (%rax,%rdi,8), %rax;"
+"addq  " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rip), %rax;"
+"cmpb  $0, (%rax);"
 "setne %al;"
 "ret;"
 ".popsection");
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 0e68f0b3cbf7..364536080438 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -568,12 +568,12 @@ static u32 svm_msrpm_offset(u32 msr)
 
 static inline void clgi(void)
 {
-   asm volatile (__ex(SVM_CLGI));
+   asm volatile (__ex(SVM_CLGI) : :);
 }
 
 static inline void stgi(void)
 {
-   asm volatile (__ex(SVM_STGI));
+   asm volatile (__ex(SVM_STGI) : :);
 }
 
 static inline void invlpga(unsigned long addr, u32 asid)
-- 
2.15.0.rc0.271.g36b669edcc-goog



[PATCH v1 17/27] xen: Adapt assembly for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
Change the assembly code to use the new _ASM_GET_PTR macro which gets a
symbol reference while being PIE compatible. Adapt the relocation tool
to ignore 32-bit Xen code.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.

Signed-off-by: Thomas Garnier 
---
 arch/x86/tools/relocs.c | 16 +++-
 arch/x86/xen/xen-head.S |  9 +
 arch/x86/xen/xen-pvh.S  | 13 +
 3 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/arch/x86/tools/relocs.c b/arch/x86/tools/relocs.c
index 5d3eb2760198..bc032ad88b22 100644
--- a/arch/x86/tools/relocs.c
+++ b/arch/x86/tools/relocs.c
@@ -831,6 +831,16 @@ static int is_percpu_sym(ElfW(Sym) *sym, const char 
*symname)
strncmp(symname, "init_per_cpu_", 13);
 }
 
+/*
+ * Check if the 32-bit relocation is within the xenpvh 32-bit code.
+ * If so, ignores it.
+ */
+static int is_in_xenpvh_assembly(ElfW(Addr) offset)
+{
+   ElfW(Sym) *sym = sym_lookup("pvh_start_xen");
+   return sym && (offset >= sym->st_value) &&
+   (offset < (sym->st_value + sym->st_size));
+}
 
 static int do_reloc64(struct section *sec, Elf_Rel *rel, ElfW(Sym) *sym,
  const char *symname)
@@ -892,8 +902,12 @@ static int do_reloc64(struct section *sec, Elf_Rel *rel, 
ElfW(Sym) *sym,
 * the relocations are processed.
 * Make sure that the offset will fit.
 */
-   if (r_type != R_X86_64_64 && (int32_t)offset != (int64_t)offset)
+   if (r_type != R_X86_64_64 &&
+   (int32_t)offset != (int64_t)offset) {
+   if (is_in_xenpvh_assembly(offset))
+   break;
die("Relocation offset doesn't fit in 32 bits\n");
+   }
 
if (r_type == R_X86_64_64)
add_reloc(&relocs64, offset);
diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S
index 124941d09b2b..e5b7b9566191 100644
--- a/arch/x86/xen/xen-head.S
+++ b/arch/x86/xen/xen-head.S
@@ -25,14 +25,15 @@ ENTRY(startup_xen)
 
/* Clear .bss */
xor %eax,%eax
-   mov $__bss_start, %_ASM_DI
-   mov $__bss_stop, %_ASM_CX
+   _ASM_GET_PTR(__bss_start, %_ASM_DI)
+   _ASM_GET_PTR(__bss_stop, %_ASM_CX)
sub %_ASM_DI, %_ASM_CX
shr $__ASM_SEL(2, 3), %_ASM_CX
rep __ASM_SIZE(stos)
 
-   mov %_ASM_SI, xen_start_info
-   mov $init_thread_union+THREAD_SIZE, %_ASM_SP
+   _ASM_GET_PTR(xen_start_info, %_ASM_AX)
+   mov %_ASM_SI, (%_ASM_AX)
+   _ASM_GET_PTR(init_thread_union+THREAD_SIZE, %_ASM_SP)
 
jmp xen_start_kernel
 END(startup_xen)
diff --git a/arch/x86/xen/xen-pvh.S b/arch/x86/xen/xen-pvh.S
index e1a5fbeae08d..43e234c7c2de 100644
--- a/arch/x86/xen/xen-pvh.S
+++ b/arch/x86/xen/xen-pvh.S
@@ -101,8 +101,8 @@ ENTRY(pvh_start_xen)
call xen_prepare_pvh
 
/* startup_64 expects boot_params in %rsi. */
-   mov $_pa(pvh_bootparams), %rsi
-   mov $_pa(startup_64), %rax
+   movabs $_pa(pvh_bootparams), %rsi
+   movabs $_pa(startup_64), %rax
jmp *%rax
 
 #else /* CONFIG_X86_64 */
@@ -137,10 +137,15 @@ END(pvh_start_xen)
 
.section ".init.data","aw"
.balign 8
+   /*
+* Use a quad for _pa(gdt_start) because PIE does not understand a
+* long is enough. The resulting value will still be in the lower long
+* part.
+*/
 gdt:
.word gdt_end - gdt_start
-   .long _pa(gdt_start)
-   .word 0
+   .quad _pa(gdt_start)
+   .balign 8
 gdt_start:
.quad 0x/* NULL descriptor */
.quad 0x/* reserved */
-- 
2.15.0.rc0.271.g36b669edcc-goog

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v1 19/27] x86: Support global stack cookie

2017-10-26 Thread Thomas Garnier via Virtualization
Add an off-by-default configuration option to use a global stack cookie
instead of the default TLS. This configuration option will only be used
with PIE binaries.

For the kernel stack cookie, the compiler uses mcmodel=kernel to switch
from the fs segment to the gs segment. A PIE binary does not use
mcmodel=kernel because it can be relocated anywhere, so the compiler
defaults to the fs segment register. This will be fixed by a compiler
change that allows the segment register to be picked, as is done on
PowerPC. In the meantime, this configuration can be used to support
older compilers.
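
As a rough illustration of what a global guard amounts to (a sketch only,
not code from this patch; __stack_chk_guard and __stack_chk_fail are the
usual compiler-side names):

extern unsigned long __stack_chk_guard;		/* single global canary */
extern void __stack_chk_fail(void);

void example(void)
{
	unsigned long canary = __stack_chk_guard;	/* prologue: copy guard */
	char buf[64];

	/* ... function body using buf ... */

	if (canary != __stack_chk_guard)		/* epilogue: check guard */
		__stack_chk_fail();
}

With the default TLS guard the same check instead reads the canary through
a segment register, which is the part this option works around for PIE.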

Signed-off-by: Thomas Garnier 
---
 arch/x86/Kconfig  | 11 +++
 arch/x86/Makefile |  9 +
 arch/x86/entry/entry_32.S |  3 ++-
 arch/x86/entry/entry_64.S |  3 ++-
 arch/x86/include/asm/processor.h  |  3 ++-
 arch/x86/include/asm/stackprotector.h | 19 ++-
 arch/x86/kernel/asm-offsets.c |  3 ++-
 arch/x86/kernel/asm-offsets_32.c  |  3 ++-
 arch/x86/kernel/asm-offsets_64.c  |  3 ++-
 arch/x86/kernel/cpu/common.c  |  3 ++-
 arch/x86/kernel/head_32.S |  3 ++-
 arch/x86/kernel/process.c |  5 +
 12 files changed, 55 insertions(+), 13 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 063f1e0d51aa..772ff3e0f623 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2133,6 +2133,17 @@ config RANDOMIZE_MEMORY_PHYSICAL_PADDING
 
   If unsure, leave at the default value.
 
+config X86_GLOBAL_STACKPROTECTOR
+   bool "Stack cookie using a global variable"
+   select CC_STACKPROTECTOR
+   ---help---
+  This option turns on the "stack-protector" GCC feature using a global
+  variable instead of a segment register. It is useful when the
+  compiler does not support custom segment registers when building a
+  position independent (PIE) binary.
+
+  If unsure, say N
+
 config HOTPLUG_CPU
bool "Support for hot-pluggable CPUs"
depends on SMP
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 6276572259c8..de228200ef2a 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -141,6 +141,15 @@ else
 KBUILD_CFLAGS += $(call cc-option,-funit-at-a-time)
 endif
 
+ifdef CONFIG_X86_GLOBAL_STACKPROTECTOR
+ifeq ($(call cc-option, -mstack-protector-guard=global),)
+$(error Cannot use CONFIG_X86_GLOBAL_STACKPROTECTOR: \
+-mstack-protector-guard=global not supported \
+by compiler)
+endif
+KBUILD_CFLAGS += -mstack-protector-guard=global
+endif
+
 ifdef CONFIG_X86_X32
x32_ld_ok := $(call try-run,\
/bin/echo -e '1: .quad 1b' | \
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 8a13d468635a..ab3e5056722f 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -237,7 +237,8 @@ ENTRY(__switch_to_asm)
movl%esp, TASK_threadsp(%eax)
movlTASK_threadsp(%edx), %esp
 
-#ifdef CONFIG_CC_STACKPROTECTOR
+#if defined(CONFIG_CC_STACKPROTECTOR) && \
+   !defined(CONFIG_X86_GLOBAL_STACKPROTECTOR)
movlTASK_stack_canary(%edx), %ebx
movl%ebx, PER_CPU_VAR(stack_canary)+stack_canary_offset
 #endif
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index d3a52d2342af..01be62c1b436 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -390,7 +390,8 @@ ENTRY(__switch_to_asm)
movq%rsp, TASK_threadsp(%rdi)
movqTASK_threadsp(%rsi), %rsp
 
-#ifdef CONFIG_CC_STACKPROTECTOR
+#if defined(CONFIG_CC_STACKPROTECTOR) && \
+   !defined(CONFIG_X86_GLOBAL_STACKPROTECTOR)
movqTASK_stack_canary(%rsi), %rbx
movq%rbx, PER_CPU_VAR(irq_stack_union + stack_canary_offset)
 #endif
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index b09bd50b06c7..e3a7ef8d5fb8 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -394,7 +394,8 @@ DECLARE_PER_CPU(char *, irq_stack_ptr);
 DECLARE_PER_CPU(unsigned int, irq_count);
 extern asmlinkage void ignore_sysret(void);
 #else  /* X86_64 */
-#ifdef CONFIG_CC_STACKPROTECTOR
+#if defined(CONFIG_CC_STACKPROTECTOR) && \
+   defined(CONFIG_X86_GLOBAL_STACKPROTECTOR)
 /*
  * Make sure stack canary segment base is cached-aligned:
  *   "For Intel Atom processors, avoid non zero segment base address
diff --git a/arch/x86/include/asm/stackprotector.h 
b/arch/x86/include/asm/stackprotector.h
index 8abedf1d650e..66462d778dc5 100644
--- a/arch/x86/include/asm/stackprotector.h
+++ b/arch/x86/include/asm/stackprotector.h
@@ -51,6 +51,10 @@
 #define GDT_STACK_CANARY_INIT  \
[GDT_ENTRY_STACK_CANARY] = GDT_ENTRY_INIT(0x4090, 0, 0x18),
 
+#ifdef CONFIG_X86_GLOBAL_STACKPROTECTOR
+extern 

[PATCH v1 20/27] x86/ftrace: Adapt function tracing for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
When using -fPIE/PIC with function tracing, the compiler generates a
call through the GOT (call *__fentry__@GOTPCREL). This instruction
takes 6 bytes instead of the 5 of the usual relative call.

If PIE is enabled, replace the 6th byte of the GOT call with a 1-byte nop
so ftrace can handle the preceding 5 bytes as before.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.
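
To make the byte layout concrete, here is a small hypothetical helper
(names are illustrative, not from the patch) that recognizes a GOT-based
__fentry__ call site before the usual 5-byte handling is applied:

#include <stdbool.h>
#include <string.h>

/*
 * Relative call:	e8 xx xx xx xx		(5 bytes)
 * Call through GOT:	ff 15 xx xx xx xx	call *disp32(%rip) (6 bytes)
 */
static bool is_got_fentry_call(const unsigned char *site)
{
	static const unsigned char got_call_prefix[] = { 0xff, 0x15 };

	return memcmp(site, got_call_prefix, sizeof(got_call_prefix)) == 0;
}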

Signed-off-by: Thomas Garnier 
---
 arch/x86/include/asm/ftrace.h   |  6 --
 arch/x86/include/asm/sections.h |  4 
 arch/x86/kernel/ftrace.c| 42 +++--
 3 files changed, 48 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
index eccd0ac6bc38..183990157a5e 100644
--- a/arch/x86/include/asm/ftrace.h
+++ b/arch/x86/include/asm/ftrace.h
@@ -24,9 +24,11 @@ extern void __fentry__(void);
 static inline unsigned long ftrace_call_adjust(unsigned long addr)
 {
/*
-* addr is the address of the mcount call instruction.
-* recordmcount does the necessary offset calculation.
+* addr is the address of the mcount call instruction. PIE has always a
+* byte added to the start of the function.
 */
+   if (IS_ENABLED(CONFIG_X86_PIE))
+   addr -= 1;
return addr;
 }
 
diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
index 2f75f30cb2f6..6b2d496cf1aa 100644
--- a/arch/x86/include/asm/sections.h
+++ b/arch/x86/include/asm/sections.h
@@ -11,4 +11,8 @@ extern struct exception_table_entry __stop___ex_table[];
 extern char __end_rodata_hpage_align[];
 #endif
 
+#if defined(CONFIG_X86_PIE)
+extern char __start_got[], __end_got[];
+#endif
+
 #endif /* _ASM_X86_SECTIONS_H */
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 9bef1bbeba63..a253601e783b 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -101,7 +101,7 @@ static const unsigned char *ftrace_nop_replace(void)
 
 static int
 ftrace_modify_code_direct(unsigned long ip, unsigned const char *old_code,
-  unsigned const char *new_code)
+ unsigned const char *new_code)
 {
unsigned char replaced[MCOUNT_INSN_SIZE];
 
@@ -134,6 +134,44 @@ ftrace_modify_code_direct(unsigned long ip, unsigned const 
char *old_code,
return 0;
 }
 
+/* Bytes before call GOT offset */
+const unsigned char got_call_preinsn[] = { 0xff, 0x15 };
+
+static int
+ftrace_modify_initial_code(unsigned long ip, unsigned const char *old_code,
+  unsigned const char *new_code)
+{
+   unsigned char replaced[MCOUNT_INSN_SIZE + 1];
+
+   ftrace_expected = old_code;
+
+   /*
+* If PIE is not enabled or no GOT call was found, default to the
+* original approach to code modification.
+*/
+   if (!IS_ENABLED(CONFIG_X86_PIE)
+   || probe_kernel_read(replaced, (void *)ip, sizeof(replaced))
+   || memcmp(replaced, got_call_preinsn, sizeof(got_call_preinsn)))
+   return ftrace_modify_code_direct(ip, old_code, new_code);
+
+   /*
+* Build a nop slide with a 5-byte nop and 1-byte nop to keep the ftrace
+* hooking algorithm working with the expected 5 bytes instruction.
+*/
+   memcpy(replaced, new_code, MCOUNT_INSN_SIZE);
+   replaced[MCOUNT_INSN_SIZE] = ideal_nops[1][0];
+
+   ip = text_ip_addr(ip);
+
+   if (probe_kernel_write((void *)ip, replaced, sizeof(replaced)))
+   return -EPERM;
+
+   sync_core();
+
+   return 0;
+
+}
+
 int ftrace_make_nop(struct module *mod,
struct dyn_ftrace *rec, unsigned long addr)
 {
@@ -152,7 +190,7 @@ int ftrace_make_nop(struct module *mod,
 * just modify the code directly.
 */
if (addr == MCOUNT_ADDR)
-   return ftrace_modify_code_direct(rec->ip, old, new);
+   return ftrace_modify_initial_code(rec->ip, old, new);
 
ftrace_expected = NULL;
 
-- 
2.15.0.rc0.271.g36b669edcc-goog

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v1 15/27] compiler: Option to default to hidden symbols

2017-10-26 Thread Thomas Garnier via Virtualization
Provide an option to default visibility to hidden except for key
symbols. This option is disabled by default and will be used by x86_64
PIE support to remove errors between compilation units.

The default visibility is also kept for external symbols that are
compared against each other, as they may be equal (start/end of sections).
In this case, older versions of GCC will remove the comparison if the
symbols are hidden. This issue exists at least in gcc 4.9 and earlier.
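
A minimal sketch of the scheme (illustrative only; __default_visibility
matches the macro added here, the example symbols are made up):

/* Everything below defaults to hidden visibility. */
#pragma GCC visibility push(hidden)

#define __default_visibility	__attribute__((visibility("default")))

extern int some_internal_state;			/* hidden: direct reference, no GOT */
extern char _text[] __default_visibility;	/* compared across units: keep default */

#pragma GCC visibility pop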

Signed-off-by: Thomas Garnier 
---
 arch/x86/boot/boot.h |  2 +-
 arch/x86/include/asm/setup.h |  2 +-
 arch/x86/kernel/cpu/microcode/core.c |  4 ++--
 drivers/base/firmware_class.c|  4 ++--
 include/asm-generic/sections.h   |  6 ++
 include/linux/compiler.h |  8 
 init/Kconfig |  7 +++
 kernel/kallsyms.c| 16 
 kernel/trace/trace.h |  4 ++--
 lib/dynamic_debug.c  |  4 ++--
 10 files changed, 39 insertions(+), 18 deletions(-)

diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h
index ef5a9cc66fb8..d726c35bdd96 100644
--- a/arch/x86/boot/boot.h
+++ b/arch/x86/boot/boot.h
@@ -193,7 +193,7 @@ static inline bool memcmp_gs(const void *s1, addr_t s2, 
size_t len)
 }
 
 /* Heap -- available for dynamic lists. */
-extern char _end[];
+extern char _end[] __default_visibility;
 extern char *HEAP;
 extern char *heap_end;
 #define RESET_HEAP() ((void *)( HEAP = _end ))
diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index a65cf544686a..7e0b54f605c6 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -67,7 +67,7 @@ static inline void x86_ce4100_early_setup(void) { }
  * This is set up by the setup-routine at boot-time
  */
 extern struct boot_params boot_params;
-extern char _text[];
+extern char _text[] __default_visibility;
 
 static inline bool kaslr_enabled(void)
 {
diff --git a/arch/x86/kernel/cpu/microcode/core.c 
b/arch/x86/kernel/cpu/microcode/core.c
index 86e8f0b2537b..8f021783a929 100644
--- a/arch/x86/kernel/cpu/microcode/core.c
+++ b/arch/x86/kernel/cpu/microcode/core.c
@@ -144,8 +144,8 @@ static bool __init check_loader_disabled_bsp(void)
return *res;
 }
 
-extern struct builtin_fw __start_builtin_fw[];
-extern struct builtin_fw __end_builtin_fw[];
+extern struct builtin_fw __start_builtin_fw[] __default_visibility;
+extern struct builtin_fw __end_builtin_fw[] __default_visibility;
 
 bool get_builtin_firmware(struct cpio_data *cd, const char *name)
 {
diff --git a/drivers/base/firmware_class.c b/drivers/base/firmware_class.c
index 4b57cf5bc81d..77d4727f6594 100644
--- a/drivers/base/firmware_class.c
+++ b/drivers/base/firmware_class.c
@@ -45,8 +45,8 @@ MODULE_LICENSE("GPL");
 
 #ifdef CONFIG_FW_LOADER
 
-extern struct builtin_fw __start_builtin_fw[];
-extern struct builtin_fw __end_builtin_fw[];
+extern struct builtin_fw __start_builtin_fw[] __default_visibility;
+extern struct builtin_fw __end_builtin_fw[] __default_visibility;
 
 static bool fw_get_builtin_firmware(struct firmware *fw, const char *name,
void *buf, size_t size)
diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h
index e5da44eddd2f..1aa5d6dac9e1 100644
--- a/include/asm-generic/sections.h
+++ b/include/asm-generic/sections.h
@@ -30,6 +30,9 @@
  * __irqentry_text_start, __irqentry_text_end
  * __softirqentry_text_start, __softirqentry_text_end
  */
+#ifdef CONFIG_DEFAULT_HIDDEN
+#pragma GCC visibility push(default)
+#endif
 extern char _text[], _stext[], _etext[];
 extern char _data[], _sdata[], _edata[];
 extern char __bss_start[], __bss_stop[];
@@ -46,6 +49,9 @@ extern char __softirqentry_text_start[], 
__softirqentry_text_end[];
 
 /* Start and end of .ctors section - used for constructor calls. */
 extern char __ctors_start[], __ctors_end[];
+#ifdef CONFIG_DEFAULT_HIDDEN
+#pragma GCC visibility pop
+#endif
 
 extern __visible const void __nosave_begin, __nosave_end;
 
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index e95a2631e545..6997716f73bf 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -78,6 +78,14 @@ extern void __chk_io_ptr(const volatile void __iomem *);
 #include 
 #endif
 
+/* Useful for Position Independent Code to reduce global references */
+#ifdef CONFIG_DEFAULT_HIDDEN
+#pragma GCC visibility push(hidden)
+#define __default_visibility  __attribute__((visibility ("default")))
+#else
+#define __default_visibility
+#endif
+
 /*
  * Generic compiler-dependent macros required for kernel
  * build go below this comment. Actual compiler/compiler version
diff --git a/init/Kconfig b/init/Kconfig
index ccb1d8daf241..b640201fcff7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1649,6 +1649,13 @@ config PROFILING
 config TRACEPOINTS
bool
 
+#
+# Default to hidden visibility for all symbols.
+# Useful for Position Independent 

[PATCH v1 16/27] x86/relocs: Handle PIE relocations

2017-10-26 Thread Thomas Garnier via Virtualization
Change the relocation tool to correctly handle relocations generated by
the -fPIE option:

 - Add a relocation for each entry of the .got section, given the linker does
   not generate R_X86_64_GLOB_DAT on a simple link.
 - Ignore R_X86_64_GOTPCREL and R_X86_64_PLT32.

Signed-off-by: Thomas Garnier 
---
 arch/x86/tools/relocs.c | 94 -
 1 file changed, 93 insertions(+), 1 deletion(-)

diff --git a/arch/x86/tools/relocs.c b/arch/x86/tools/relocs.c
index 73eb7fd4aec4..5d3eb2760198 100644
--- a/arch/x86/tools/relocs.c
+++ b/arch/x86/tools/relocs.c
@@ -31,6 +31,7 @@ struct section {
Elf_Sym*symtab;
Elf_Rel*reltab;
char   *strtab;
+   Elf_Addr   *got;
 };
 static struct section *secs;
 
@@ -292,6 +293,35 @@ static Elf_Sym *sym_lookup(const char *symname)
return 0;
 }
 
+static Elf_Sym *sym_lookup_addr(Elf_Addr addr, const char **name)
+{
+   int i;
+   for (i = 0; i < ehdr.e_shnum; i++) {
+   struct section *sec = &secs[i];
+   long nsyms;
+   Elf_Sym *symtab;
+   Elf_Sym *sym;
+
+   if (sec->shdr.sh_type != SHT_SYMTAB)
+   continue;
+
+   nsyms = sec->shdr.sh_size/sizeof(Elf_Sym);
+   symtab = sec->symtab;
+
+   for (sym = symtab; --nsyms >= 0; sym++) {
+   if (sym->st_value == addr) {
+   if (name) {
+   *name = sym_name(sec->link->strtab,
+sym);
+   }
+   return sym;
+   }
+   }
+   }
+   return 0;
+}
+
+
 #if BYTE_ORDER == LITTLE_ENDIAN
 #define le16_to_cpu(val) (val)
 #define le32_to_cpu(val) (val)
@@ -512,6 +542,33 @@ static void read_relocs(FILE *fp)
}
 }
 
+static void read_got(FILE *fp)
+{
+   int i;
+   for (i = 0; i < ehdr.e_shnum; i++) {
+   struct section *sec = &secs[i];
+   sec->got = NULL;
+   if (sec->shdr.sh_type != SHT_PROGBITS ||
+   strcmp(sec_name(i), ".got")) {
+   continue;
+   }
+   sec->got = malloc(sec->shdr.sh_size);
+   if (!sec->got) {
+   die("malloc of %d bytes for got failed\n",
+   sec->shdr.sh_size);
+   }
+   if (fseek(fp, sec->shdr.sh_offset, SEEK_SET) < 0) {
+   die("Seek to %d failed: %s\n",
+   sec->shdr.sh_offset, strerror(errno));
+   }
+   if (fread(sec->got, 1, sec->shdr.sh_size, fp)
+   != sec->shdr.sh_size) {
+   die("Cannot read got: %s\n",
+   strerror(errno));
+   }
+   }
+}
+
 
 static void print_absolute_symbols(void)
 {
@@ -642,6 +699,32 @@ static void add_reloc(struct relocs *r, uint32_t offset)
r->offset[r->count++] = offset;
 }
 
+/*
+ * The linker does not generate relocations for the GOT for the kernel.
+ * If a GOT is found, simulate the relocations that should have been included.
+ */
+static void walk_got_table(int (*process)(struct section *sec, Elf_Rel *rel,
+ Elf_Sym *sym, const char *symname),
+  struct section *sec)
+{
+   int i;
+   Elf_Addr entry;
+   Elf_Sym *sym;
+   const char *symname;
+   Elf_Rel rel;
+
+   for (i = 0; i < sec->shdr.sh_size/sizeof(Elf_Addr); i++) {
+   entry = sec->got[i];
+   sym = sym_lookup_addr(entry, &symname);
+   if (!sym)
+   die("Could not found got symbol for entry %d\n", i);
+   rel.r_offset = sec->shdr.sh_addr + i * sizeof(Elf_Addr);
+   rel.r_info = ELF_BITS == 64 ? R_X86_64_GLOB_DAT
+: R_386_GLOB_DAT;
+   process(sec, &rel, sym, symname);
+   }
+}
+
 static void walk_relocs(int (*process)(struct section *sec, Elf_Rel *rel,
Elf_Sym *sym, const char *symname))
 {
@@ -655,6 +738,8 @@ static void walk_relocs(int (*process)(struct section *sec, 
Elf_Rel *rel,
struct section *sec = &secs[i];
 
if (sec->shdr.sh_type != SHT_REL_TYPE) {
+   if (sec->got)
+   walk_got_table(process, sec);
continue;
}
sec_symtab  = sec->link;
@@ -764,6 +849,8 @@ static int do_reloc64(struct section *sec, Elf_Rel *rel, 
ElfW(Sym) *sym,
offset += per_cpu_load_addr;
 
switch (r_type) {
+   case R_X86_64_PLT32:
+   case R_X86_64_GOTPCREL:
case R_X86_64_NONE:
/* NONE can be ignored. */
break;
@@ -805,7 

[PATCH v1 13/27] x86/boot/64: Use _text in a global for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
By default, PIE-generated code creates only relative references, so _text
points to the temporary virtual address. Instead, use a global variable so
the relocation is done as expected.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.
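
A simplified sketch of the failure mode and the workaround (illustrative
only; _text is the real linker symbol, the rest is made up for the example):

extern char _text[];

/* RIP-relative under PIE: early in boot this yields the temporary mapping. */
static unsigned long delta_buggy(unsigned long physaddr)
{
	return physaddr - (unsigned long)_text;
}

/*
 * An initialized global is fixed up by the relocation tooling, so it keeps
 * the expected link-time value even though the code itself is PIE.
 */
unsigned long text_addr = (unsigned long)_text;

static unsigned long delta_fixed(unsigned long physaddr)
{
	return physaddr - text_addr;
}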

Signed-off-by: Thomas Garnier 
---
 arch/x86/kernel/head64.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index bab4fa579450..675f1dba3b21 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -45,8 +45,14 @@ static void __head *fixup_pointer(void *ptr, unsigned long 
physaddr)
return ptr - (void *)_text + (void *)physaddr;
 }
 
-unsigned long __head __startup_64(unsigned long physaddr,
- struct boot_params *bp)
+/*
+ * Use a global variable to properly calculate _text delta on PIE. By default
+ * a PIE binary do a RIP relative difference instead of the relocated address.
+ */
+unsigned long _text_offset = (unsigned long)(_text - __START_KERNEL_map);
+
+unsigned long __head notrace __startup_64(unsigned long physaddr,
+ struct boot_params *bp)
 {
unsigned long load_delta, *p;
unsigned long pgtable_flags;
@@ -65,7 +71,7 @@ unsigned long __head __startup_64(unsigned long physaddr,
 * Compute the delta between the address I am compiled to run at
 * and the address I am actually running at.
 */
-   load_delta = physaddr - (unsigned long)(_text - __START_KERNEL_map);
+   load_delta = physaddr - _text_offset;
 
/* Is the address not 2M aligned? */
if (load_delta & ~PMD_PAGE_MASK)
-- 
2.15.0.rc0.271.g36b669edcc-goog

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v1 14/27] x86/percpu: Adapt percpu for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
Percpu uses a clever design where the .percpu ELF section has a virtual
address of zero and the relocation code avoids relocating specific
symbols. It makes the code simple and easily adaptable with or without
SMP support.

This design is incompatible with PIE because generated code always tries
to access the zero virtual address relative to the default mapping
address. It becomes impossible when KASLR is configured to go below -2G.
This patch solves the problem by removing the zero mapping and adapting
the GS base to be relative to the expected address. These changes are
done only when PIE is enabled. The original implementation is kept as-is
by default.

The assembly and PER_CPU macros are changed to use relative references
when PIE is enabled.

The KALLSYMS_ABSOLUTE_PERCPU configuration is disabled with PIE given
percpu symbols are not absolute in this case.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.
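
A hypothetical side-by-side of the two addressing forms (these are not the
real kernel macros; assume an 8-byte per-cpu variable whose symbol name is
visible to the assembler):

#define this_cpu_read_abs(var) ({					\
	unsigned long v_;						\
	/* absolute zero-based offset: relies on .data..percpu at 0 */	\
	asm("movq %%gs:" #var ", %0" : "=r" (v_));			\
	v_; })

#define this_cpu_read_pie(var) ({					\
	unsigned long v_;						\
	/* RIP-relative reference; %gs base absorbs the link address */	\
	asm("movq %%gs:" #var "(%%rip), %0" : "=r" (v_));		\
	v_; })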

Signed-off-by: Thomas Garnier 
---
 arch/x86/entry/entry_64.S  |  4 ++--
 arch/x86/include/asm/percpu.h  | 25 +++--
 arch/x86/kernel/cpu/common.c   |  4 +++-
 arch/x86/kernel/head_64.S  |  4 
 arch/x86/kernel/setup_percpu.c |  2 +-
 arch/x86/kernel/vmlinux.lds.S  | 13 +++--
 arch/x86/lib/cmpxchg16b_emu.S  |  8 
 arch/x86/xen/xen-asm.S | 12 ++--
 init/Kconfig   |  2 +-
 9 files changed, 51 insertions(+), 23 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 15bd5530d2ae..d3a52d2342af 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -392,7 +392,7 @@ ENTRY(__switch_to_asm)
 
 #ifdef CONFIG_CC_STACKPROTECTOR
movqTASK_stack_canary(%rsi), %rbx
-   movq%rbx, PER_CPU_VAR(irq_stack_union)+stack_canary_offset
+   movq%rbx, PER_CPU_VAR(irq_stack_union + stack_canary_offset)
 #endif
 
/* restore callee-saved registers */
@@ -808,7 +808,7 @@ apicinterrupt IRQ_WORK_VECTOR   
irq_work_interrupt  smp_irq_work_interrupt
 /*
  * Exception entry points.
  */
-#define CPU_TSS_IST(x) PER_CPU_VAR(cpu_tss) + (TSS_ist + ((x) - 1) * 8)
+#define CPU_TSS_IST(x) PER_CPU_VAR(cpu_tss + (TSS_ist + ((x) - 1) * 8))
 
 .macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
 ENTRY(\sym)
diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index b21a475fd7ed..07250f1099b5 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -4,9 +4,11 @@
 #ifdef CONFIG_X86_64
 #define __percpu_seg   gs
 #define __percpu_mov_opmovq
+#define __percpu_rel   (%rip)
 #else
 #define __percpu_seg   fs
 #define __percpu_mov_opmovl
+#define __percpu_rel
 #endif
 
 #ifdef __ASSEMBLY__
@@ -27,10 +29,14 @@
 #define PER_CPU(var, reg)  \
__percpu_mov_op %__percpu_seg:this_cpu_off, reg;\
lea var(reg), reg
-#define PER_CPU_VAR(var)   %__percpu_seg:var
+/* Compatible with Position Independent Code */
+#define PER_CPU_VAR(var)   %__percpu_seg:(var)##__percpu_rel
+/* Rare absolute reference */
+#define PER_CPU_VAR_ABS(var)   %__percpu_seg:var
 #else /* ! SMP */
 #define PER_CPU(var, reg)  __percpu_mov_op $var, reg
-#define PER_CPU_VAR(var)   var
+#define PER_CPU_VAR(var)   (var)##__percpu_rel
+#define PER_CPU_VAR_ABS(var)   var
 #endif /* SMP */
 
 #ifdef CONFIG_X86_64_SMP
@@ -208,27 +214,34 @@ do {  
\
pfo_ret__;  \
 })
 
+/* Position Independent code uses relative addresses only */
+#ifdef CONFIG_X86_PIE
+#define __percpu_stable_arg __percpu_arg(a1)
+#else
+#define __percpu_stable_arg __percpu_arg(P1)
+#endif
+
 #define percpu_stable_op(op, var)  \
 ({ \
typeof(var) pfo_ret__;  \
switch (sizeof(var)) {  \
case 1: \
-   asm(op "b "__percpu_arg(P1)",%0"\
+   asm(op "b "__percpu_stable_arg ",%0"\
: "=q" (pfo_ret__)  \
: "p" (&(var)));\
break;  \
case 2: \
-   asm(op "w "__percpu_arg(P1)",%0"\
+   asm(op "w "__percpu_stable_arg ",%0"\
: "=r" (pfo_ret__)  \
: "p" (&(var)));\
break;  \
case 4: \
-   asm(op "l "__percpu_arg(P1)",%0"

[PATCH v1 12/27] x86/paravirt: Adapt assembly for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
If PIE is enabled, switch the paravirt assembly constraints to be
compatible. The %c/i constraints generate smaller code, so they are kept
by default.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.

Signed-off-by: Thomas Garnier 
---
 arch/x86/include/asm/paravirt_types.h | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 280d94c36dad..e6961f8a74aa 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -335,9 +335,17 @@ extern struct pv_lock_ops pv_lock_ops;
 #define PARAVIRT_PATCH(x)  \
(offsetof(struct paravirt_patch_template, x) / sizeof(void *))
 
+#ifdef CONFIG_X86_PIE
+#define paravirt_opptr_call "a"
+#define paravirt_opptr_type "p"
+#else
+#define paravirt_opptr_call "c"
+#define paravirt_opptr_type "i"
+#endif
+
 #define paravirt_type(op)  \
[paravirt_typenum] "i" (PARAVIRT_PATCH(op)),\
-   [paravirt_opptr] "i" (&(op))
+   [paravirt_opptr] paravirt_opptr_type (&(op))
 #define paravirt_clobber(clobber)  \
[paravirt_clobber] "i" (clobber)
 
@@ -391,7 +399,7 @@ int paravirt_disable_iospace(void);
  * offset into the paravirt_patch_template structure, and can therefore be
  * freely converted back into a structure offset.
  */
-#define PARAVIRT_CALL  "call *%c[paravirt_opptr];"
+#define PARAVIRT_CALL  "call *%" paravirt_opptr_call "[paravirt_opptr];"
 
 /*
  * These macros are intended to wrap calls through one of the paravirt
-- 
2.15.0.rc0.271.g36b669edcc-goog

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v1 11/27] x86/power/64: Adapt assembly for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
Change the assembly code to use only relative references to symbols so
that the kernel is PIE compatible.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.

Signed-off-by: Thomas Garnier 
---
 arch/x86/power/hibernate_asm_64.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/power/hibernate_asm_64.S 
b/arch/x86/power/hibernate_asm_64.S
index ce8da3a0412c..6fdd7bbc3c33 100644
--- a/arch/x86/power/hibernate_asm_64.S
+++ b/arch/x86/power/hibernate_asm_64.S
@@ -24,7 +24,7 @@
 #include 
 
 ENTRY(swsusp_arch_suspend)
-   movq$saved_context, %rax
+   leaqsaved_context(%rip), %rax
movq%rsp, pt_regs_sp(%rax)
movq%rbp, pt_regs_bp(%rax)
movq%rsi, pt_regs_si(%rax)
@@ -115,7 +115,7 @@ ENTRY(restore_registers)
movq%rax, %cr4;  # turn PGE back on
 
/* We don't restore %rax, it must be 0 anyway */
-   movq$saved_context, %rax
+   leaqsaved_context(%rip), %rax
movqpt_regs_sp(%rax), %rsp
movqpt_regs_bp(%rax), %rbp
movqpt_regs_si(%rax), %rsi
-- 
2.15.0.rc0.271.g36b669edcc-goog

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v1 10/27] x86/boot/64: Adapt assembly for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
Change the assembly code to use only relative references to symbols so
that the kernel is PIE compatible.

Early in boot, the kernel is mapped at a temporary address while preparing
the page table. To know the changes needed for the page table with KASLR,
the boot code calculates the difference between the expected address of
the kernel and the one chosen by KASLR. This does not work with PIE
because all symbol references in code are relative. Instead of getting the
future relocated virtual address, you get the current temporary mapping.
The solution is to use global variables that will be relocated as
expected.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.

Signed-off-by: Thomas Garnier 
---
 arch/x86/kernel/head_64.S | 26 --
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 42e32c2e51bb..32d1899f48df 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -86,8 +86,21 @@ startup_64:
popq%rsi
 
/* Form the CR3 value being sure to include the CR3 modifier */
-   addq$(early_top_pgt - __START_KERNEL_map), %rax
+   addq_early_top_pgt_offset(%rip), %rax
jmp 1f
+
+   /*
+* Position Independent Code takes only relative references in code
+* meaning a global variable address is relative to RIP and not its
+* future virtual address. Global variables can be used instead as they
+* are still relocated on the expected kernel mapping address.
+*/
+   .align 8
+_early_top_pgt_offset:
+   .quad early_top_pgt - __START_KERNEL_map
+_init_top_offset:
+   .quad init_top_pgt - __START_KERNEL_map
+
 ENTRY(secondary_startup_64)
UNWIND_HINT_EMPTY
/*
@@ -116,7 +129,7 @@ ENTRY(secondary_startup_64)
popq%rsi
 
/* Form the CR3 value being sure to include the CR3 modifier */
-   addq$(init_top_pgt - __START_KERNEL_map), %rax
+   addq_init_top_offset(%rip), %rax
 1:
 
/* Enable PAE mode, PGE and LA57 */
@@ -131,7 +144,7 @@ ENTRY(secondary_startup_64)
movq%rax, %cr3
 
/* Ensure I am executing from virtual addresses */
-   movq$1f, %rax
+   movabs  $1f, %rax
jmp *%rax
 1:
UNWIND_HINT_EMPTY
@@ -230,11 +243,12 @@ ENTRY(secondary_startup_64)
 *  REX.W + FF /5 JMP m16:64 Jump far, absolute indirect,
 *  address given in m16:64.
 */
-   pushq   $.Lafter_lret   # put return address on stack for unwinder
+   leaq.Lafter_lret(%rip), %rax
+   pushq   %rax# put return address on stack for unwinder
xorq%rbp, %rbp  # clear frame pointer
-   movqinitial_code(%rip), %rax
+   leaqinitial_code(%rip), %rax
pushq   $__KERNEL_CS# set correct cs
-   pushq   %rax# target address in negative space
+   pushq   (%rax)  # target address in negative space
lretq
 .Lafter_lret:
 END(secondary_startup_64)
-- 
2.15.0.rc0.271.g36b669edcc-goog

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v1 08/27] x86/CPU: Adapt assembly for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
Change the assembly code to use only relative references to symbols so
that the kernel is PIE compatible. Use the new _ASM_GET_PTR macro instead
of the 'mov $symbol, %dst' construct to avoid an absolute reference.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.

Signed-off-by: Thomas Garnier 
---
 arch/x86/include/asm/processor.h | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index b446c5a082ad..b09bd50b06c7 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -49,7 +49,7 @@ static inline void *current_text_addr(void)
 {
void *pc;
 
-   asm volatile("mov $1f, %0; 1:":"=r" (pc));
+   asm volatile(_ASM_GET_PTR(1f, %0) "; 1:":"=r" (pc));
 
return pc;
 }
@@ -695,6 +695,7 @@ static inline void sync_core(void)
: ASM_CALL_CONSTRAINT : : "memory");
 #else
unsigned int tmp;
+   unsigned long tmp2;
 
asm volatile (
UNWIND_HINT_SAVE
@@ -705,11 +706,13 @@ static inline void sync_core(void)
"pushfq\n\t"
"mov %%cs, %0\n\t"
"pushq %q0\n\t"
-   "pushq $1f\n\t"
+   "leaq 1f(%%rip), %1\n\t"
+   "pushq %1\n\t"
"iretq\n\t"
UNWIND_HINT_RESTORE
"1:"
-   : "=" (tmp), ASM_CALL_CONSTRAINT : : "cc", "memory");
+   : "=" (tmp), "=" (tmp2), ASM_CALL_CONSTRAINT
+   : : "cc", "memory");
 #endif
 }
 
-- 
2.15.0.rc0.271.g36b669edcc-goog

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v1 05/27] x86: relocate_kernel - Adapt assembly for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
Change the assembly code to use only relative references to symbols so
that the kernel is PIE compatible.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.

Signed-off-by: Thomas Garnier 
---
 arch/x86/kernel/relocate_kernel_64.S | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/relocate_kernel_64.S 
b/arch/x86/kernel/relocate_kernel_64.S
index 307d3bac5f04..2ecbdcbe985b 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -200,9 +200,11 @@ identity_mapped:
movq%rax, %cr3
lea PAGE_SIZE(%r8), %rsp
callswap_pages
-   movq$virtual_mapped, %rax
-   pushq   %rax
-   ret
+   jmp *virtual_mapped_addr(%rip)
+
+   /* Absolute value for PIE support */
+virtual_mapped_addr:
+   .quad virtual_mapped
 
 virtual_mapped:
movqRSP(%r8), %rsp
-- 
2.15.0.rc0.271.g36b669edcc-goog

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v1 07/27] x86: pm-trace - Adapt assembly for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
Change the assembly to use the new _ASM_GET_PTR macro instead of _ASM_MOV
so that it is PIE compatible.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.

Signed-off-by: Thomas Garnier 
---
 arch/x86/include/asm/pm-trace.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pm-trace.h b/arch/x86/include/asm/pm-trace.h
index 7b7ac42c3661..a3801261f0dd 100644
--- a/arch/x86/include/asm/pm-trace.h
+++ b/arch/x86/include/asm/pm-trace.h
@@ -7,7 +7,7 @@
 do {   \
if (pm_trace_enabled) { \
const void *tracedata;  \
-   asm volatile(_ASM_MOV " $1f,%0\n"   \
+   asm volatile(_ASM_GET_PTR(1f, %0) "\n"  \
 ".section .tracedata,\"a\"\n"  \
 "1:\t.word %c1\n\t"\
 _ASM_PTR " %c2\n"  \
-- 
2.15.0.rc0.271.g36b669edcc-goog

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v1 06/27] x86/entry/64: Adapt assembly for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
Change the assembly code to use only relative references to symbols so
that the kernel is PIE compatible.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.

Signed-off-by: Thomas Garnier 
---
 arch/x86/entry/entry_64.S | 22 +++---
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 49167258d587..15bd5530d2ae 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -194,12 +194,15 @@ entry_SYSCALL_64_fastpath:
ja  1f  /* return -ENOSYS (already in 
pt_regs->ax) */
movq%r10, %rcx
 
+   /* Ensures the call is position independent */
+   leaqsys_call_table(%rip), %r11
+
/*
 * This call instruction is handled specially in stub_ptregs_64.
 * It might end up jumping to the slow path.  If it jumps, RAX
 * and all argument registers are clobbered.
 */
-   call*sys_call_table(, %rax, 8)
+   call*(%r11, %rax, 8)
 .Lentry_SYSCALL_64_after_fastpath_call:
 
movq%rax, RAX(%rsp)
@@ -334,7 +337,8 @@ ENTRY(stub_ptregs_64)
 * RAX stores a pointer to the C function implementing the syscall.
 * IRQs are on.
 */
-   cmpq$.Lentry_SYSCALL_64_after_fastpath_call, (%rsp)
+   leaq.Lentry_SYSCALL_64_after_fastpath_call(%rip), %r11
+   cmpq%r11, (%rsp)
jne 1f
 
/*
@@ -1172,7 +1176,8 @@ ENTRY(error_entry)
movl%ecx, %eax  /* zero extend */
cmpq%rax, RIP+8(%rsp)
je  .Lbstep_iret
-   cmpq$.Lgs_change, RIP+8(%rsp)
+   leaq.Lgs_change(%rip), %rcx
+   cmpq%rcx, RIP+8(%rsp)
jne .Lerror_entry_done
 
/*
@@ -1383,10 +1388,10 @@ ENTRY(nmi)
 * resume the outer NMI.
 */
 
-   movq$repeat_nmi, %rdx
+   leaqrepeat_nmi(%rip), %rdx
cmpq8(%rsp), %rdx
ja  1f
-   movq$end_repeat_nmi, %rdx
+   leaqend_repeat_nmi(%rip), %rdx
cmpq8(%rsp), %rdx
ja  nested_nmi_out
 1:
@@ -1440,7 +1445,8 @@ nested_nmi:
pushq   %rdx
pushfq
pushq   $__KERNEL_CS
-   pushq   $repeat_nmi
+   leaqrepeat_nmi(%rip), %rdx
+   pushq   %rdx
 
/* Put stack back */
addq$(6*8), %rsp
@@ -1479,7 +1485,9 @@ first_nmi:
addq$8, (%rsp)  /* Fix up RSP */
pushfq  /* RFLAGS */
pushq   $__KERNEL_CS/* CS */
-   pushq   $1f /* RIP */
+   pushq   %rax/* Support Position Independent Code */
+   leaq1f(%rip), %rax  /* RIP */
+   xchgq   %rax, (%rsp)/* Restore RAX, put 1f */
INTERRUPT_RETURN/* continues at repeat_nmi below */
UNWIND_HINT_IRET_REGS
 1:
-- 
2.15.0.rc0.271.g36b669edcc-goog

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v1 09/27] x86/acpi: Adapt assembly for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
Change the assembly code to use only relative references to symbols so
that the kernel is PIE compatible.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.

Signed-off-by: Thomas Garnier 
---
 arch/x86/kernel/acpi/wakeup_64.S | 31 ---
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/acpi/wakeup_64.S b/arch/x86/kernel/acpi/wakeup_64.S
index 50b8ed0317a3..472659c0f811 100644
--- a/arch/x86/kernel/acpi/wakeup_64.S
+++ b/arch/x86/kernel/acpi/wakeup_64.S
@@ -14,7 +14,7 @@
 * Hooray, we are in Long 64-bit mode (but still running in low memory)
 */
 ENTRY(wakeup_long64)
-   movqsaved_magic, %rax
+   movqsaved_magic(%rip), %rax
movq$0x123456789abcdef0, %rdx
cmpq%rdx, %rax
jne bogus_64_magic
@@ -25,14 +25,14 @@ ENTRY(wakeup_long64)
movw%ax, %es
movw%ax, %fs
movw%ax, %gs
-   movqsaved_rsp, %rsp
+   movqsaved_rsp(%rip), %rsp
 
-   movqsaved_rbx, %rbx
-   movqsaved_rdi, %rdi
-   movqsaved_rsi, %rsi
-   movqsaved_rbp, %rbp
+   movqsaved_rbx(%rip), %rbx
+   movqsaved_rdi(%rip), %rdi
+   movqsaved_rsi(%rip), %rsi
+   movqsaved_rbp(%rip), %rbp
 
-   movqsaved_rip, %rax
+   movqsaved_rip(%rip), %rax
jmp *%rax
 ENDPROC(wakeup_long64)
 
@@ -45,7 +45,7 @@ ENTRY(do_suspend_lowlevel)
xorl%eax, %eax
callsave_processor_state
 
-   movq$saved_context, %rax
+   leaqsaved_context(%rip), %rax
movq%rsp, pt_regs_sp(%rax)
movq%rbp, pt_regs_bp(%rax)
movq%rsi, pt_regs_si(%rax)
@@ -64,13 +64,14 @@ ENTRY(do_suspend_lowlevel)
pushfq
popqpt_regs_flags(%rax)
 
-   movq$.Lresume_point, saved_rip(%rip)
+   leaq.Lresume_point(%rip), %rax
+   movq%rax, saved_rip(%rip)
 
-   movq%rsp, saved_rsp
-   movq%rbp, saved_rbp
-   movq%rbx, saved_rbx
-   movq%rdi, saved_rdi
-   movq%rsi, saved_rsi
+   movq%rsp, saved_rsp(%rip)
+   movq%rbp, saved_rbp(%rip)
+   movq%rbx, saved_rbx(%rip)
+   movq%rdi, saved_rdi(%rip)
+   movq%rsi, saved_rsi(%rip)
 
addq$8, %rsp
movl$3, %edi
@@ -82,7 +83,7 @@ ENTRY(do_suspend_lowlevel)
.align 4
 .Lresume_point:
/* We don't restore %rax, it must be 0 anyway */
-   movq$saved_context, %rax
+   leaqsaved_context(%rip), %rax
movqsaved_context_cr4(%rax), %rbx
movq%rbx, %cr4
movqsaved_context_cr3(%rax), %rbx
-- 
2.15.0.rc0.271.g36b669edcc-goog

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [virtio-dev] packed ring layout proposal v3

2017-10-26 Thread Jens Freimann

On Tue, Oct 10, 2017 at 09:56:44AM +, Liang, Cunming wrote:

> > DESC_WRAP: used by device to poll. Driver sets it to a *different*
> > value every time it overwrites a descriptor. How to achieve it?
> > since descriptors are written out in ring order, simply maintain the
> > current value internally (start value 1) and flip it every time you
> > overwrite the first descriptor.
> > Device leaves it intact when overwriting a descriptor.

Ok, get it now.


>
> This is confusing me a bit.
>
> My understanding is: 1. the internally kept wrap value only flipped
> when the first descriptor is overwritten
>
> 2. the moment the first descriptor is written the internal wrap value
> is flipped 0->1 or 1->0 and this value is written to every descriptor
> DESC_WRAP until we reach the first descriptor again

That's right, it's also my take.
DESC_WRAP is only used by driver, device does nothing with that flag.



Yes this is what I tried to say. Can you suggest a better wording then?


I'll give it a try.


The term of DESC_WRAP is fine to me.


Couldn't think of a better name either. 


> > After writing down this explanation, I think the names aren't great.
> >
> > Let me try an alternative explanation.
> >
> > ---
> > A two-bit field, DRIVER_OWNER, signals the buffer ownership.
> > It has 4 possible values:
> > values 0x1, 0x11 are written by driver values 0x0, 0x10 are written
> > by device
>
> The 0x prefix might add to the confusion here. It is really just two
> bits, no?

Ouch. Yes I meant 0b. Thanks!

0b00, 0b10 are written by the device?
I suppose the device can only clear the high bit and keep the low bit unchanged.
Then the value written by the device can be either 0b01 or 0b00, but 0b10 would
mean setting the high bit, no?



> > each time driver writes out a descriptor, it must make sure that the
> > high bit in OWNER changes.
> >
> > each time device writes out a descriptor, it must make sure that the
> > high bit in OWNER does not change.

Typo here? It should be "..., it must make sure that the low bit in OWNER
does not change."?
For the high bit in OWNER, each time the device writes out a descriptor, it
must make sure to clear the high bit in OWNER.


> >
> > this is exactly the same functionally, DRIVER is high bit and WRAP
> > is the low bit.  Does this make things clearer?
>
> So far it makes sense to me.

It sounds good.


So I implemented two ideas in the DPDK prototype code. The code is
very rough and simple. I'll describe again how I understood the ideas.

1. The one discussed in this thread: Adding two flags DESC_DRIVER and DESC_WRAP. 


Driver code: keeps an internal wrap value. Every time we cross the ring
boundary at descriptor 0, the wrap value is flipped. For all descriptors
used from that point on, the DESC_WRAP flag is written with the new wrap
value (set if it was previously clear, cleared if it was previously set).
Driver code looks at the DESC_DRIVER flag to check if the descriptor is
available for it to use.


Device code: when dequeuing descriptors from the ring it keeps going
until it sees a different value in the DESC_WRAP bit. Device only
checks this bit but doesn't change it. Device clears DESC_DRIVER when
done with descriptor to signal to driver that descriptor can be reused.

https://github.com/jensfr/dpdk/tree/add-driver-and-wrap-flag
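
A minimal sketch of the driver-side bookkeeping described above (field and
flag names are made up for the example; this is not the prototype code, and
a real implementation would also need a write barrier before publishing the
flags):

#define DESC_DRIVER	(1 << 0)	/* set by driver, cleared by device when done */
#define DESC_WRAP	(1 << 1)	/* carries the driver's current wrap value */

struct desc {
	unsigned long long addr;
	unsigned int len;
	unsigned short id;
	unsigned short flags;
};

struct ring {
	struct desc *desc;
	unsigned int size;
	unsigned int head;	/* next slot the driver fills */
	unsigned int wrap;	/* starts at 1, flipped at each wrap-around */
};

static void driver_publish(struct ring *r, const struct desc *d)
{
	unsigned short flags = DESC_DRIVER | (r->wrap ? DESC_WRAP : 0);

	r->desc[r->head] = *d;
	/* flags are written last so the device never sees a half-written entry */
	r->desc[r->head].flags = flags;

	if (++r->head == r->size) {	/* crossed descriptor 0: flip the wrap value */
		r->head = 0;
		r->wrap ^= 1;
	}
}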

2. Driver writes ID of last written descriptor into index field of first
descriptor and turns on a flag DESC_SKIP in the descriptor. (This idea is from 
Michael)

Driver code: Let's say driver adds 32 descriptors to the ring at once.
It fills in starting at ring position 0 to 31. It will write 31 to the
index field of descriptor 0. 


Device code: When dequeueing descriptors from the ring it looks into
the first descriptor. For example, it looks at desc[0] and the index field is 31
instead of 0. In addition to this the flag DESC_SKIP is set. The
device can expect all descriptors in the ring from 0 to 31 to be ready
for it to dequeue and doesn't have to check the DESC_HW flag.

https://github.com/jensfr/dpdk/tree/desc-hw-index-hint
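
Again only as a rough sketch (hypothetical names, minimal descriptor
layout), the device-side shortcut for this variant could look like:

#define DESC_SKIP	(1 << 2)	/* desc[0].id holds the last ready index */

struct skip_desc {
	unsigned short id;
	unsigned short flags;
};

/*
 * Returns the last index the device may consume without per-descriptor
 * ownership checks, or 0 if it must fall back to checking each flag.
 */
static unsigned int device_ready_up_to(const struct skip_desc *ring)
{
	if (ring[0].flags & DESC_SKIP)
		return ring[0].id;	/* e.g. 31 -> slots 0..31 are ready */
	return 0;
}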


I tried implementing both ideas on top of the DPDK prototype code
(without the DESC_WB code) and ran a quick test with two testpmd
instances, one with a vhost-user interface and the other one with a
virtio-user device. 


From a performance point of view I saw no difference between the two
implementations.


I understand that DESC_DRIVER/DESC_WRAP would be better for virtio
hardware implementations?

regards,
Jens 


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v1 03/27] x86: Use symbol name in jump table for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
Replace the %c constraint with %P. The %c is incompatible with PIE
because it implies an immediate value whereas %P references a symbol.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.

Signed-off-by: Thomas Garnier 
---
 arch/x86/include/asm/jump_label.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/jump_label.h 
b/arch/x86/include/asm/jump_label.h
index adc54c12cbd1..6e558e4524dc 100644
--- a/arch/x86/include/asm/jump_label.h
+++ b/arch/x86/include/asm/jump_label.h
@@ -36,9 +36,9 @@ static __always_inline bool arch_static_branch(struct 
static_key *key, bool bran
".byte " __stringify(STATIC_KEY_INIT_NOP) "\n\t"
".pushsection __jump_table,  \"aw\" \n\t"
_ASM_ALIGN "\n\t"
-   _ASM_PTR "1b, %l[l_yes], %c0 + %c1 \n\t"
+   _ASM_PTR "1b, %l[l_yes], %P0 \n\t"
".popsection \n\t"
-   : :  "i" (key), "i" (branch) : : l_yes);
+   : :  "X" (&((char *)key)[branch]) : : l_yes);
 
return false;
 l_yes:
@@ -52,9 +52,9 @@ static __always_inline bool arch_static_branch_jump(struct 
static_key *key, bool
"2:\n\t"
".pushsection __jump_table,  \"aw\" \n\t"
_ASM_ALIGN "\n\t"
-   _ASM_PTR "1b, %l[l_yes], %c0 + %c1 \n\t"
+   _ASM_PTR "1b, %l[l_yes], %P0 \n\t"
".popsection \n\t"
-   : :  "i" (key), "i" (branch) : : l_yes);
+   : :  "X" (&((char *)key)[branch]) : : l_yes);
 
return false;
 l_yes:
-- 
2.15.0.rc0.271.g36b669edcc-goog

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v1 04/27] x86: Add macro to get symbol address for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
Add a new _ASM_GET_PTR macro to fetch a symbol address. It will be used
to replace the "_ASM_MOV $, %dst" code construct, which is not compatible
with PIE.
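
As an illustration of the difference (a sketch, not the macro's literal
expansion; 'some_symbol' is a placeholder):

extern char some_symbol[];

/* What _ASM_GET_PTR boils down to on x86_64: a RIP-relative lea. */
static inline void *get_ptr_pie(void)
{
	void *p;

	asm("leaq some_symbol(%%rip), %0" : "=r" (p));
	return p;
}

/* The construct being replaced: an absolute immediate, not PIE-safe. */
static inline void *get_ptr_absolute(void)
{
	void *p;

	asm("movq $some_symbol, %0" : "=r" (p));
	return p;
}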

Signed-off-by: Thomas Garnier 
---
 arch/x86/include/asm/asm.h | 13 +
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index b0dc91f4bedc..6de365b8e3fd 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -57,6 +57,19 @@
 # define CC_OUT(c) [_cc_ ## c] "=qm"
 #endif
 
+/* Macros to get a global variable address with PIE support on 64-bit */
+#ifdef CONFIG_X86_32
+#define __ASM_GET_PTR_PRE(_src) __ASM_FORM_COMMA(movl $##_src)
+#else
+#ifdef __ASSEMBLY__
+#define __ASM_GET_PTR_PRE(_src) __ASM_FORM_COMMA(leaq (_src)(%rip))
+#else
+#define __ASM_GET_PTR_PRE(_src) __ASM_FORM_COMMA(leaq (_src)(%%rip))
+#endif
+#endif
+#define _ASM_GET_PTR(_src, _dst) \
+   __ASM_GET_PTR_PRE(_src) __ASM_FORM(_dst)
+
 /* Exception table entry */
 #ifdef __ASSEMBLY__
 # define _ASM_EXTABLE_HANDLE(from, to, handler)\
-- 
2.15.0.rc0.271.g36b669edcc-goog

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v1 00/27] x86: PIE support and option to extend KASLR randomization

2017-10-26 Thread Thomas Garnier via Virtualization
Changes:
 - patch v1:
   - Simplify ftrace implementation.
   - Use gcc mstack-protector-guard-reg=%gs with PIE when possible.
 - rfc v3:
   - Use --emit-relocs instead of -pie to reduce dynamic relocation space on
 mapped memory. It also simplifies the relocation process.
   - Move the start of the module section next to the kernel. Remove the need for
 -mcmodel=large on modules. Extends module space from 1 to 2G maximum.
   - Support for XEN PVH as 32-bit relocations can be ignored with
 --emit-relocs.
   - Support for GOT relocations previously done automatically with -pie.
   - Remove need for dynamic PLT in modules.
   - Support dynamic GOT for modules.
 - rfc v2:
   - Add support for a global stack cookie while the compiler defaults to fs
 without mcmodel=kernel
   - Change patch 7 to correctly jump out of the identity mapping on kexec load
 preserve.

These patches make the changes necessary to build the kernel as a Position
Independent Executable (PIE) on x86_64. A PIE kernel can be relocated below
the top 2G of the virtual address space, which makes it possible to
optionally extend the KASLR randomization range from 1G to 3G.

Thanks a lot to Ard Biesheuvel & Kees Cook for their feedback on compiler
changes, PIE support and KASLR in general. Thanks to Roland McGrath for his
feedback on using -pie versus --emit-relocs and details on compiler code
generation.

The patches:
 - 1-3, 5-13, 17-18: Changes in assembly code to be PIE compliant.
 - 4: Add a new _ASM_GET_PTR macro to fetch a symbol address generically.
 - 14: Adapt percpu design to work correctly when PIE is enabled.
 - 15: Provide an option to default visibility to hidden except for key symbols.
   It removes errors between compilation units.
 - 16: Adapt relocation tool to handle PIE binary correctly.
 - 19: Add support for global cookie.
 - 20: Support ftrace with PIE (used on Ubuntu config).
 - 21: Fix incorrect address marker on dump_pagetables.
 - 22: Add option to move the module section just after the kernel.
 - 23: Adapt module loading to support PIE with dynamic GOT.
 - 24: Make the GOT read-only.
 - 25: Add the CONFIG_X86_PIE option (off by default).
 - 26: Adapt relocation tool to generate a 64-bit relocation table.
 - 27: Add the CONFIG_RANDOMIZE_BASE_LARGE option to increase relocation range
   from 1G to 3G (off by default).

Performance/Size impact:

Size of vmlinux (Default configuration):
 File size:
 - PIE disabled: +0.31%
 - PIE enabled: -3.210% (less relocations)
 .text section:
 - PIE disabled: +0.000644%
 - PIE enabled: +0.837%

Size of vmlinux (Ubuntu configuration):
 File size:
 - PIE disabled: -0.201%
 - PIE enabled: -0.082%
 .text section:
 - PIE disabled: same
 - PIE enabled: +1.319%

Size of vmlinux (Default configuration + ORC):
 File size:
 - PIE enabled: -3.167%
 .text section:
 - PIE enabled: +0.814%

Size of vmlinux (Ubuntu configuration + ORC):
 File size:
 - PIE enabled: -3.167%
 .text section:
 - PIE enabled: +1.26%

The size increase is mainly due to not having access to the 32-bit signed
relocation that can be used with mcmodel=kernel. A small part is due to reduced
optimization for PIE code. This bug [1] was opened with gcc to provide
better code generation for kernel PIE.

Hackbench (50% and 1600% on thread/process for pipe/sockets):
 - PIE disabled: no significant change (avg +0.1% on latest test).
 - PIE enabled: between -0.50% to +0.86% in average (default and Ubuntu config).

slab_test (average of 10 runs):
 - PIE disabled: no significant change (-2% on latest run, likely noise).
 - PIE enabled: between -1% and +0.8% on latest runs.

Kernbench (average of 10 Half and Optimal runs):
 Elapsed Time:
 - PIE disabled: no significant change (avg -0.239%)
 - PIE enabled: average +0.07%
 System Time:
 - PIE disabled: no significant change (avg -0.277%)
 - PIE enabled: average +0.7%

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82303

diffstat:
 Documentation/x86/x86_64/mm.txt  |3 
 arch/x86/Kconfig |   43 ++
 arch/x86/Makefile|   40 +
 arch/x86/boot/boot.h |2 
 arch/x86/boot/compressed/Makefile|5 
 arch/x86/boot/compressed/misc.c  |   10 +
 arch/x86/crypto/aes-x86_64-asm_64.S  |   45 --
 arch/x86/crypto/aesni-intel_asm.S|   14 +-
 arch/x86/crypto/aesni-intel_avx-x86_64.S |6 
 arch/x86/crypto/camellia-aesni-avx-asm_64.S  |   42 +++---
 arch/x86/crypto/camellia-aesni-avx2-asm_64.S |   44 +++---
 arch/x86/crypto/camellia-x86_64-asm_64.S |8 -
 arch/x86/crypto/cast5-avx-x86_64-asm_64.S|   50 ---
 arch/x86/crypto/cast6-avx-x86_64-asm_64.S|   44 +++---
 arch/x86/crypto/des3_ede-asm_64.S|   96 +-
 arch/x86/crypto/ghash-clmulni-intel_asm.S|4 
 arch/x86/crypto/glue_helper-asm-avx.S|4 
 arch/x86/crypto/glue_helper-asm-avx2.S   |6 
 arch/x86/entry/entry_32.S 

[PATCH v1 01/27] x86/crypto: Adapt assembly for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
Change the assembly code to use only relative references to symbols so
that the kernel is PIE compatible.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.

Signed-off-by: Thomas Garnier 
---
 arch/x86/crypto/aes-x86_64-asm_64.S  | 45 -
 arch/x86/crypto/aesni-intel_asm.S| 14 ++--
 arch/x86/crypto/aesni-intel_avx-x86_64.S |  6 +-
 arch/x86/crypto/camellia-aesni-avx-asm_64.S  | 42 ++--
 arch/x86/crypto/camellia-aesni-avx2-asm_64.S | 44 ++---
 arch/x86/crypto/camellia-x86_64-asm_64.S |  8 ++-
 arch/x86/crypto/cast5-avx-x86_64-asm_64.S| 50 ---
 arch/x86/crypto/cast6-avx-x86_64-asm_64.S| 44 +++--
 arch/x86/crypto/des3_ede-asm_64.S| 96 ++--
 arch/x86/crypto/ghash-clmulni-intel_asm.S|  4 +-
 arch/x86/crypto/glue_helper-asm-avx.S|  4 +-
 arch/x86/crypto/glue_helper-asm-avx2.S   |  6 +-
 12 files changed, 211 insertions(+), 152 deletions(-)

diff --git a/arch/x86/crypto/aes-x86_64-asm_64.S 
b/arch/x86/crypto/aes-x86_64-asm_64.S
index 8739cf7795de..86fa068e5e81 100644
--- a/arch/x86/crypto/aes-x86_64-asm_64.S
+++ b/arch/x86/crypto/aes-x86_64-asm_64.S
@@ -48,8 +48,12 @@
 #define R10%r10
 #define R11%r11
 
+/* Hold global for PIE suport */
+#define RBASE  %r12
+
 #define prologue(FUNC,KEY,B128,B192,r1,r2,r5,r6,r7,r8,r9,r10,r11) \
ENTRY(FUNC);\
+   pushq   RBASE;  \
movqr1,r2;  \
leaqKEY+48(r8),r9;  \
movqr10,r11;\
@@ -74,54 +78,63 @@
movlr6 ## E,4(r9);  \
movlr7 ## E,8(r9);  \
movlr8 ## E,12(r9); \
+   popqRBASE;  \
ret;\
ENDPROC(FUNC);
 
+#define round_mov(tab_off, reg_i, reg_o) \
+   leaqtab_off(%rip), RBASE; \
+   movl(RBASE,reg_i,4), reg_o;
+
+#define round_xor(tab_off, reg_i, reg_o) \
+   leaqtab_off(%rip), RBASE; \
+   xorl(RBASE,reg_i,4), reg_o;
+
 #define round(TAB,OFFSET,r1,r2,r3,r4,r5,r6,r7,r8,ra,rb,rc,rd) \
movzbl  r2 ## H,r5 ## E;\
movzbl  r2 ## L,r6 ## E;\
-   movlTAB+1024(,r5,4),r5 ## E;\
+   round_mov(TAB+1024, r5, r5 ## E)\
movwr4 ## X,r2 ## X;\
-   movlTAB(,r6,4),r6 ## E; \
+   round_mov(TAB, r6, r6 ## E) \
roll$16,r2 ## E;\
shrl$16,r4 ## E;\
movzbl  r4 ## L,r7 ## E;\
movzbl  r4 ## H,r4 ## E;\
xorlOFFSET(r8),ra ## E; \
xorlOFFSET+4(r8),rb ## E;   \
-   xorlTAB+3072(,r4,4),r5 ## E;\
-   xorlTAB+2048(,r7,4),r6 ## E;\
+   round_xor(TAB+3072, r4, r5 ## E)\
+   round_xor(TAB+2048, r7, r6 ## E)\
movzbl  r1 ## L,r7 ## E;\
movzbl  r1 ## H,r4 ## E;\
-   movlTAB+1024(,r4,4),r4 ## E;\
+   round_mov(TAB+1024, r4, r4 ## E)\
movwr3 ## X,r1 ## X;\
roll$16,r1 ## E;\
shrl$16,r3 ## E;\
-   xorlTAB(,r7,4),r5 ## E; \
+   round_xor(TAB, r7, r5 ## E) \
movzbl  r3 ## L,r7 ## E;\
movzbl  r3 ## H,r3 ## E;\
-   xorlTAB+3072(,r3,4),r4 ## E;\
-   xorlTAB+2048(,r7,4),r5 ## E;\
+   round_xor(TAB+3072, r3, r4 ## E)\
+   round_xor(TAB+2048, r7, r5 ## E)\
movzbl  r1 ## L,r7 ## E;\
movzbl  r1 ## H,r3 ## E;\
shrl$16,r1 ## E;\
-   xorlTAB+3072(,r3,4),r6 ## E;\
-   movlTAB+2048(,r7,4),r3 ## E;\
+   round_xor(TAB+3072, r3, r6 ## E)\
+   round_mov(TAB+2048, r7, r3 ## E)\
movzbl  r1 ## L,r7 ## E;\
movzbl  r1 ## H,r1 ## E;\
-   xorlTAB+1024(,r1,4),r6 ## E;\
-   xorlTAB(,r7,4),r3 ## E; \
+   round_xor(TAB+1024, r1, r6 ## E)\
+   round_xor(TAB, r7, r3 ## E) \
movzbl  r2 ## H,r1 ## E;\
movzbl  r2 ## L,r7 ## E;\
shrl$16,r2 ## E;\
-   xorlTAB+3072(,r1,4),r3 ## E;\
-   xorlTAB+2048(,r7,4),r4 ## E;\
+   round_xor(TAB+3072, r1, r3 ## E)\
+   round_xor(TAB+2048, r7, r4 ## E)\
movzbl  r2 ## H,r1 ## E;\
movzbl  r2 ## L,r2 ## E;\
xorlOFFSET+8(r8),rc ## E;   \
xorlOFFSET+12(r8),rd ## E;  \
-   xorlTAB+1024(,r1,4),r3 ## E;\
-   xorlTAB(,r2,4),r4 ## E;
+   round_xor(TAB+1024, r1, r3 ## E)\
+   round_xor(TAB, r2, r4 ## E)
 
 #define move_regs(r1,r2,r3,r4) \
movlr3 ## E,r1 ## E;\
diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index 16627fec80b2..5f73201dff32 100644
--- 

[PATCH v1 02/27] x86: Use symbol name on bug table for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
Replace the %c constraint with %P. The %c is incompatible with PIE
because it implies an immediate value whereas %P references a symbol.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.

Signed-off-by: Thomas Garnier 
---
 arch/x86/include/asm/bug.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/bug.h b/arch/x86/include/asm/bug.h
index aa6b2023d8f8..1210d22ad547 100644
--- a/arch/x86/include/asm/bug.h
+++ b/arch/x86/include/asm/bug.h
@@ -37,7 +37,7 @@ do {  
\
asm volatile("1:\t" ins "\n"\
 ".pushsection __bug_table,\"aw\"\n"\
 "2:\t" __BUG_REL(1b) "\t# bug_entry::bug_addr\n"   \
-"\t"  __BUG_REL(%c0) "\t# bug_entry::file\n"   \
+"\t"  __BUG_REL(%P0) "\t# bug_entry::file\n"   \
 "\t.word %c1""\t# bug_entry::line\n"   \
 "\t.word %c2""\t# bug_entry::flags\n"  \
 "\t.org 2b+%c3\n"  \
-- 
2.15.0.rc0.271.g36b669edcc-goog

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [PATCH 00/13] x86/paravirt: Make pv ops code generation more closely match reality

2017-10-26 Thread Josh Poimboeuf
On Fri, Oct 06, 2017 at 09:35:16AM +0200, Vitaly Kuznetsov wrote:
> Josh Poimboeuf  writes:
> 
> > - For the most common runtime cases (everything except Xen and vSMP),
> >   vmlinux disassembly now matches what the actual runtime code looks
> >   like.  This improves debuggability and kernel developer sanity (a
> >   precious resource).
> >
> > ...
> >
> > - It's hopefully a first step in simplifying paravirt patching by
> >   getting rid of .parainstructions, pv ops, and apply_paravirt()
> >   completely.  (I think Xen can be changed to set CPU feature bits to
> >   specify which ops it needs during early boot, then those ops can be
> >   patched in using early alternatives.)
> 
> JFYI starting 4.14 Xen PV is not the only user of pv_mmu_ops, Hyper-V
> uses it for TLB shootdown now.

Yeah, I saw that.  It should be fine because the pv_alternatives get
patched before the Hyper-V code sets up pv_mmu_ops.

-- 
Josh
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure

2017-10-26 Thread Josh Poimboeuf
On Thu, Oct 05, 2017 at 04:35:03PM -0400, Boris Ostrovsky wrote:
> 
> >  #ifdef CONFIG_PARAVIRT
> > +/*
> > + * Paravirt alternatives are applied much earlier than normal alternatives.
> > + * They are only applied when running on a hypervisor.  They replace some
> > + * native instructions with calls to pv ops.
> > + */
> > +void __init apply_pv_alternatives(void)
> > +{
> > +   setup_force_cpu_cap(X86_FEATURE_PV_OPS);
> 
> Not for Xen HVM guests.

From what I can tell, HVM guests still use pv_time_ops and
pv_mmu_ops.exit_mmap, right?

> > +   apply_alternatives(__pv_alt_instructions, __pv_alt_instructions_end);
> > +}
> 
> 
> This is a problem (at least for Xen PV guests):
> apply_alternatives()->text_poke_early()->local_irq_save()->...'cli'->death.

Ah, right.

> It might be possible not to turn off/on the interrupts in this
> particular case since the guest probably won't be able to handle an
> interrupt at this point anyway.

Yeah, that should work.  For Xen and for the other hypervisors, this is
called well before irq init, so interrupts can't be handled yet anyway.

> > +
> >  void __init_or_module apply_paravirt(struct paravirt_patch_site *start,
> >  struct paravirt_patch_site *end)
> >  {
> > diff --git a/arch/x86/kernel/cpu/hypervisor.c 
> > b/arch/x86/kernel/cpu/hypervisor.c
> > index 4fa90006ac68..17243fe0f5ce 100644
> > --- a/arch/x86/kernel/cpu/hypervisor.c
> > +++ b/arch/x86/kernel/cpu/hypervisor.c
> > @@ -71,6 +71,8 @@ void __init init_hypervisor_platform(void)
> > if (!x86_hyper)
> > return;
> >  
> > +   apply_pv_alternatives();
> 
> Not for Xen PV guests who have already done this.

I think it would be harmless, but yeah, it's probably best to only write
it once.

Thanks for the review!

-- 
Josh
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure

2017-10-26 Thread Josh Poimboeuf
On Fri, Oct 06, 2017 at 11:29:52AM -0400, Boris Ostrovsky wrote:
> >>> +
> >>>  void __init_or_module apply_paravirt(struct paravirt_patch_site *start,
> >>>struct paravirt_patch_site *end)
> >>>  {
> >>> diff --git a/arch/x86/kernel/cpu/hypervisor.c 
> >>> b/arch/x86/kernel/cpu/hypervisor.c
> >>> index 4fa90006ac68..17243fe0f5ce 100644
> >>> --- a/arch/x86/kernel/cpu/hypervisor.c
> >>> +++ b/arch/x86/kernel/cpu/hypervisor.c
> >>> @@ -71,6 +71,8 @@ void __init init_hypervisor_platform(void)
> >>>   if (!x86_hyper)
> >>>   return;
> >>>  
> >>> + apply_pv_alternatives();
> >> Not for Xen PV guests who have already done this.
> > I think it would be harmless, but yeah, it's probably best to only write
> > it once.
> 
> I also wonder whether calling apply_pv_alternatives() here before
> x86_hyper->init_platform() will work since the latter may be setting
> those op. In fact, that's what Xen HVM does for pv_mmu_ops.exit_mmap.

apply_pv_alternatives() changes:

  (native code)

to

  call *pv_whatever_ops.whatever

So apply_pv_alternatives() should be called *before* any of the ops are
set up.

-- 
Josh
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC v3 20/27] x86/ftrace: Adapt function tracing for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
On Thu, Oct 5, 2017 at 9:11 AM, Steven Rostedt  wrote:
> On Thu, 5 Oct 2017 09:01:14 -0700
> Thomas Garnier  wrote:
>
>> On Thu, Oct 5, 2017 at 6:06 AM, Steven Rostedt  wrote:
>> > On Wed,  4 Oct 2017 14:19:56 -0700
>> > Thomas Garnier  wrote:
>> >
>> >> When using -fPIE/PIC with function tracing, the compiler generates a
>> >> call through the GOT (call *__fentry__@GOTPCREL). This instruction
>> >> takes 6 bytes instead of 5 on the usual relative call.
>> >>
>> >> With this change, function tracing supports 6 bytes on traceable
>> >> function and can still replace relative calls on the ftrace assembly
>> >> functions.
>> >>
>> >> Position Independent Executable (PIE) support will allow to extended the
>> >> KASLR randomization range below the -2G memory limit.
>> >
>> > Question: This 6 bytes is only the initial call that gcc creates. When
>> > function tracing is enabled, the calls are back to the normal call to
>> > the ftrace trampoline?
>>
>> That is correct.
>>
>
> Then I think a better idea is to simply nop them out at compile time,
> and have the code that updates them to nops to know about it.
>
> See scripts/recordmcount.c
>
> Could we simply add a 5 byte nop followed by a 1 byte nop, and treat it
> the same as if it didn't exist? This code can be a little complex, and
> can cause really nasty side effects if things go wrong. I would like to
> keep from adding more variables to the changes here.

Sure, I will simplify it for the next iteration.

Thanks for the feedback.

>
> -- Steve



-- 
Thomas
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC v3 20/27] x86/ftrace: Adapt function tracing for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
On Thu, Oct 5, 2017 at 6:06 AM, Steven Rostedt  wrote:
> On Wed,  4 Oct 2017 14:19:56 -0700
> Thomas Garnier  wrote:
>
>> When using -fPIE/PIC with function tracing, the compiler generates a
>> call through the GOT (call *__fentry__@GOTPCREL). This instruction
>> takes 6 bytes instead of 5 on the usual relative call.
>>
>> With this change, function tracing supports 6 bytes on traceable
>> function and can still replace relative calls on the ftrace assembly
>> functions.
>>
>> Position Independent Executable (PIE) support will allow to extended the
>> KASLR randomization range below the -2G memory limit.
>
> Question: This 6 bytes is only the initial call that gcc creates. When
> function tracing is enabled, the calls are back to the normal call to
> the ftrace trampoline?

That is correct.

>
> -- Steve
>



-- 
Thomas
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 0/1] linux: Buffers/caches in VirtIO Balloon driver stats

2017-10-26 Thread Tomáš Golembiovský
On Thu, 21 Sep 2017 14:55:40 +0200
Tomáš Golembiovský  wrote:

> Linux driver part
> 
> v2:
> - fixed typos
> 
> Tomáš Golembiovský (1):
>   virtio_balloon: include buffers and cached memory statistics
> 
>  drivers/virtio/virtio_balloon.c | 11 +++
>  include/uapi/linux/virtio_balloon.h |  4 +++-
>  mm/swap_state.c |  1 +
>  3 files changed, 15 insertions(+), 1 deletion(-)
> 
> -- 
> 2.14.1
> 

ping

-- 
Tomáš Golembiovský 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[RFC v3 27/27] x86/kaslr: Add option to extend KASLR range from 1GB to 3GB

2017-10-26 Thread Thomas Garnier via Virtualization
Add a new CONFIG_RANDOMIZE_BASE_LARGE option to benefit from PIE
support. It increases the KASLR range from 1GB to 3GB. The new range
starts at 0x just above the EFI memory region. This
option is off by default.

The boot code is adapted to create the appropriate page table spanning
three PUD pages.

The relocation table uses 64-bit integers generated with the updated
relocation tool with the large-reloc option.

Signed-off-by: Thomas Garnier 
---
 arch/x86/Kconfig | 21 +
 arch/x86/boot/compressed/Makefile|  5 +
 arch/x86/boot/compressed/misc.c  | 10 +-
 arch/x86/include/asm/page_64_types.h |  9 +
 arch/x86/kernel/head64.c | 15 ---
 arch/x86/kernel/head_64.S| 11 ++-
 6 files changed, 66 insertions(+), 5 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b92f96923712..81f4512549d1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2149,6 +2149,27 @@ config X86_PIE
select MODULE_REL_CRCS if MODVERSIONS
select X86_GLOBAL_STACKPROTECTOR if CC_STACKPROTECTOR
 
+config RANDOMIZE_BASE_LARGE
+   bool "Increase the randomization range of the kernel image"
+   depends on X86_64 && RANDOMIZE_BASE
+   select X86_PIE
+   select X86_MODULE_PLTS if MODULES
+   default n
+   ---help---
+ Build the kernel as a Position Independent Executable (PIE) and
+ increase the available randomization range from 1GB to 3GB.
+
+ This option impacts performance on kernel CPU intensive workloads up
+ to 10% due to PIE generated code. Impact on user-mode processes and
+ typical usage would be significantly less (0.50% when you build the
+ kernel).
+
+ The kernel and modules will generate slightly more assembly (1 to 2%
+ increase on the .text sections). The vmlinux binary will be
+ significantly smaller due to less relocations.
+
+ If unsure say N
+
 config HOTPLUG_CPU
bool "Support for hot-pluggable CPUs"
depends on SMP
diff --git a/arch/x86/boot/compressed/Makefile 
b/arch/x86/boot/compressed/Makefile
index 8a958274b54c..94dfee5a7cd2 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -112,7 +112,12 @@ $(obj)/vmlinux.bin: vmlinux FORCE
 
 targets += $(patsubst $(obj)/%,%,$(vmlinux-objs-y)) vmlinux.bin.all 
vmlinux.relocs
 
+# Large randomization requires a bigger relocation table
+ifeq ($(CONFIG_RANDOMIZE_BASE_LARGE),y)
+CMD_RELOCS = arch/x86/tools/relocs --large-reloc
+else
 CMD_RELOCS = arch/x86/tools/relocs
+endif
 quiet_cmd_relocs = RELOCS  $@
   cmd_relocs = $(CMD_RELOCS) $< > $@;$(CMD_RELOCS) --abs-relocs $<
 $(obj)/vmlinux.relocs: vmlinux FORCE
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index c14217cd0155..c1ac9f2e283d 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -169,10 +169,18 @@ void __puthex(unsigned long value)
 }
 
 #if CONFIG_X86_NEED_RELOCS
+
+/* Large randomization goes lower than -2G and uses a large relocation table */
+#ifdef CONFIG_RANDOMIZE_BASE_LARGE
+typedef long rel_t;
+#else
+typedef int rel_t;
+#endif
+
 static void handle_relocations(void *output, unsigned long output_len,
   unsigned long virt_addr)
 {
-   int *reloc;
+   rel_t *reloc;
unsigned long delta, map, ptr;
unsigned long min_addr = (unsigned long)output;
unsigned long max_addr = min_addr + (VO___bss_start - VO__text);
diff --git a/arch/x86/include/asm/page_64_types.h 
b/arch/x86/include/asm/page_64_types.h
index 3f5f08b010d0..6b65f846dd64 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -48,7 +48,11 @@
 #define __PAGE_OFFSET   __PAGE_OFFSET_BASE
 #endif /* CONFIG_RANDOMIZE_MEMORY */
 
+#ifdef CONFIG_RANDOMIZE_BASE_LARGE
+#define __START_KERNEL_map _AC(0x, UL)
+#else
 #define __START_KERNEL_map _AC(0x8000, UL)
+#endif /* CONFIG_RANDOMIZE_BASE_LARGE */
 
 /* See Documentation/x86/x86_64/mm.txt for a description of the memory map. */
 #ifdef CONFIG_X86_5LEVEL
@@ -65,9 +69,14 @@
  * 512MiB by default, leaving 1.5GiB for modules once the page tables
  * are fully set up. If kernel ASLR is configured, it can extend the
  * kernel page table mapping, reducing the size of the modules area.
+ * On PIE, we relocate the binary 2G lower so add this extra space.
  */
 #if defined(CONFIG_RANDOMIZE_BASE)
+#ifdef CONFIG_RANDOMIZE_BASE_LARGE
+#define KERNEL_IMAGE_SIZE  (_AC(3, UL) * 1024 * 1024 * 1024)
+#else
 #define KERNEL_IMAGE_SIZE  (1024 * 1024 * 1024)
+#endif
 #else
 #define KERNEL_IMAGE_SIZE  (512 * 1024 * 1024)
 #endif
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index b6363f0d11a7..d603d0f5a40a 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ 

[RFC v3 25/27] x86/pie: Add option to build the kernel as PIE

2017-10-26 Thread Thomas Garnier via Virtualization
Add the CONFIG_X86_PIE option which builds the kernel as a Position
Independent Executable (PIE). The kernel is currently built with the
mcmodel=kernel option, which forces it to stay in the top 2G of the
virtual address space. With PIE, the kernel will be able to move below
the current limit.

The --emit-relocs linker option was kept instead of using -pie to limit
the impact on mapped sections. Any incompatible relocation will be
caught by the arch/x86/tools/relocs binary at compile time.

Performance/Size impact:
Size of vmlinux (Default configuration):
 File size:
 - PIE disabled: +0.31%
 - PIE enabled: -3.210% (less relocations)
 .text section:
 - PIE disabled: +0.000644%
 - PIE enabled: +0.837%

Size of vmlinux (Ubuntu configuration):
 File size:
 - PIE disabled: -0.201%
 - PIE enabled: -0.082%
 .text section:
 - PIE disabled: same
 - PIE enabled: +1.319%

Size of vmlinux (Default configuration + ORC):
 File size:
 - PIE enabled: -3.167%
 .text section:
 - PIE enabled: +0.814%

Size of vmlinux (Ubuntu configuration + ORC):
 File size:
 - PIE enabled: -3.167%
 .text section:
 - PIE enabled: +1.26%

The size increase is mainly due to not having access to the 32-bit signed
relocation that can be used with mcmodel=kernel. A small part is due to reduced
optimization for PIE code. This bug [1] was opened with gcc to provide better
code generation for kernel PIE.

Hackbench (50% and 1600% on thread/process for pipe/sockets):
 - PIE disabled: no significant change (avg +0.1% on latest test).
 - PIE enabled: between -0.50% to +0.86% in average (default and Ubuntu config).

slab_test (average of 10 runs):
 - PIE disabled: no significant change (-2% on latest run, likely noise).
 - PIE enabled: between -1% and +0.8% on latest runs.

Kernbench (average of 10 Half and Optimal runs):
 Elapsed Time:
 - PIE disabled: no significant change (avg -0.239%)
 - PIE enabled: average +0.07%
 System Time:
 - PIE disabled: no significant change (avg -0.277%)
 - PIE enabled: average +0.7%

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82303

Signed-off-by: Thomas Garnier 
---
 arch/x86/Kconfig  | 8 
 arch/x86/Makefile | 1 +
 2 files changed, 9 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1e4b399c64e5..b92f96923712 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2141,6 +2141,14 @@ config X86_GLOBAL_STACKPROTECTOR
bool
depends on CC_STACKPROTECTOR
 
+config X86_PIE
+   bool
+   depends on X86_64
+   select DEFAULT_HIDDEN
+   select DYNAMIC_MODULE_BASE
+   select MODULE_REL_CRCS if MODVERSIONS
+   select X86_GLOBAL_STACKPROTECTOR if CC_STACKPROTECTOR
+
 config HOTPLUG_CPU
bool "Support for hot-pluggable CPUs"
depends on SMP
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 42774185a58a..c49855b4b1be 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -144,6 +144,7 @@ else
 
 KBUILD_CFLAGS += -mno-red-zone
 ifdef CONFIG_X86_PIE
+KBUILD_CFLAGS += -fPIC
 KBUILD_LDFLAGS_MODULE += -T $(srctree)/arch/x86/kernel/module.lds
 else
 KBUILD_CFLAGS += -mcmodel=kernel
-- 
2.14.2.920.gcf0c67979c-goog

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[RFC v3 26/27] x86/relocs: Add option to generate 64-bit relocations

2017-10-26 Thread Thomas Garnier via Virtualization
The x86 relocation tool generates a list of 32-bit signed integers. There
was no need to use 64-bit integers because all addresses were above the top
2G of the memory.

This change adds a large-reloc option to generate 64-bit unsigned integers.
It can be used when the kernel plans to go below the top 2G and 32-bit
integers are not enough.
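
A minimal user-space sketch of the overflow this guards against (the address
is made up for illustration; the check is the same fits-in-32-bits test the
tool performs before emitting an entry):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        /* hypothetical kernel virtual address below the -2G boundary */
        uint64_t offset = 0xffffff8000100000ULL;

        /* offsets that no longer fit in a signed 32-bit entry */
        if ((int32_t)offset != (int64_t)offset)
                printf("offset %#llx needs 64-bit entries (--large-reloc)\n",
                       (unsigned long long)offset);
        return 0;
}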

Signed-off-by: Thomas Garnier 
---
 arch/x86/tools/relocs.c| 60 +-
 arch/x86/tools/relocs.h|  4 +--
 arch/x86/tools/relocs_common.c | 15 +++
 3 files changed, 60 insertions(+), 19 deletions(-)

diff --git a/arch/x86/tools/relocs.c b/arch/x86/tools/relocs.c
index bc032ad88b22..e7497ea1fe76 100644
--- a/arch/x86/tools/relocs.c
+++ b/arch/x86/tools/relocs.c
@@ -12,8 +12,14 @@
 
 static Elf_Ehdr ehdr;
 
+#if ELF_BITS == 64
+typedef uint64_t rel_off_t;
+#else
+typedef uint32_t rel_off_t;
+#endif
+
 struct relocs {
-   uint32_t*offset;
+   rel_off_t   *offset;
unsigned long   count;
unsigned long   size;
 };
@@ -684,7 +690,7 @@ static void print_absolute_relocs(void)
printf("\n");
 }
 
-static void add_reloc(struct relocs *r, uint32_t offset)
+static void add_reloc(struct relocs *r, rel_off_t offset)
 {
if (r->count == r->size) {
unsigned long newsize = r->size + 5;
@@ -1058,26 +1064,48 @@ static void sort_relocs(struct relocs *r)
qsort(r->offset, r->count, sizeof(r->offset[0]), cmp_relocs);
 }
 
-static int write32(uint32_t v, FILE *f)
+static int write32(rel_off_t rel, FILE *f)
 {
-   unsigned char buf[4];
+   unsigned char buf[sizeof(uint32_t)];
+   uint32_t v = (uint32_t)rel;
 
put_unaligned_le32(v, buf);
-   return fwrite(buf, 1, 4, f) == 4 ? 0 : -1;
+   return fwrite(buf, 1, sizeof(buf), f) == sizeof(buf) ? 0 : -1;
 }
 
-static int write32_as_text(uint32_t v, FILE *f)
+static int write32_as_text(rel_off_t rel, FILE *f)
 {
+   uint32_t v = (uint32_t)rel;
return fprintf(f, "\t.long 0x%08"PRIx32"\n", v) > 0 ? 0 : -1;
 }
 
-static void emit_relocs(int as_text, int use_real_mode)
+static int write64(rel_off_t rel, FILE *f)
+{
+   unsigned char buf[sizeof(uint64_t)];
+   uint64_t v = (uint64_t)rel;
+
+   put_unaligned_le64(v, buf);
+   return fwrite(buf, 1, sizeof(buf), f) == sizeof(buf) ? 0 : -1;
+}
+
+static int write64_as_text(rel_off_t rel, FILE *f)
+{
+   uint64_t v = (uint64_t)rel;
+   return fprintf(f, "\t.quad 0x%016"PRIx64"\n", v) > 0 ? 0 : -1;
+}
+
+static void emit_relocs(int as_text, int use_real_mode, int use_large_reloc)
 {
int i;
-   int (*write_reloc)(uint32_t, FILE *) = write32;
+   int (*write_reloc)(rel_off_t, FILE *);
int (*do_reloc)(struct section *sec, Elf_Rel *rel, Elf_Sym *sym,
const char *symname);
 
+   if (use_large_reloc)
+   write_reloc = write64;
+   else
+   write_reloc = write32;
+
 #if ELF_BITS == 64
if (!use_real_mode)
do_reloc = do_reloc64;
@@ -1088,6 +1116,9 @@ static void emit_relocs(int as_text, int use_real_mode)
do_reloc = do_reloc32;
else
do_reloc = do_reloc_real;
+
+   /* Large relocations only for 64-bit */
+   use_large_reloc = 0;
 #endif
 
/* Collect up the relocations */
@@ -,8 +1142,13 @@ static void emit_relocs(int as_text, int use_real_mode)
 * gas will like.
 */
printf(".section \".data.reloc\",\"a\"\n");
-   printf(".balign 4\n");
-   write_reloc = write32_as_text;
+   if (use_large_reloc) {
+   printf(".balign 8\n");
+   write_reloc = write64_as_text;
+   } else {
+   printf(".balign 4\n");
+   write_reloc = write32_as_text;
+   }
}
 
if (use_real_mode) {
@@ -1180,7 +1216,7 @@ static void print_reloc_info(void)
 
 void process(FILE *fp, int use_real_mode, int as_text,
 int show_absolute_syms, int show_absolute_relocs,
-int show_reloc_info)
+int show_reloc_info, int use_large_reloc)
 {
regex_init(use_real_mode);
read_ehdr(fp);
@@ -1203,5 +1239,5 @@ void process(FILE *fp, int use_real_mode, int as_text,
print_reloc_info();
return;
}
-   emit_relocs(as_text, use_real_mode);
+   emit_relocs(as_text, use_real_mode, use_large_reloc);
 }
diff --git a/arch/x86/tools/relocs.h b/arch/x86/tools/relocs.h
index 1d23bf953a4a..cb771cc4412d 100644
--- a/arch/x86/tools/relocs.h
+++ b/arch/x86/tools/relocs.h
@@ -30,8 +30,8 @@ enum symtype {
 
 void process_32(FILE *fp, int use_real_mode, int as_text,
int show_absolute_syms, int show_absolute_relocs,
-   int show_reloc_info);
+   int show_reloc_info, int 

[RFC v3 24/27] x86/mm: Make the x86 GOT read-only

2017-10-26 Thread Thomas Garnier via Virtualization
The GOT is changed during early boot when relocations are applied. Make
it read-only directly. This table exists only for a PIE binary.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.

Signed-off-by: Thomas Garnier 
---
 include/asm-generic/vmlinux.lds.h | 12 
 1 file changed, 12 insertions(+)

diff --git a/include/asm-generic/vmlinux.lds.h 
b/include/asm-generic/vmlinux.lds.h
index e549bff87c5b..a2301c292e26 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -279,6 +279,17 @@
VMLINUX_SYMBOL(__end_ro_after_init) = .;
 #endif
 
+#ifdef CONFIG_X86_PIE
+#define RO_GOT_X86 \
+   .got: AT(ADDR(.got) - LOAD_OFFSET) {\
+   VMLINUX_SYMBOL(__start_got) = .;\
+   *(.got);\
+   VMLINUX_SYMBOL(__end_got) = .;  \
+   }
+#else
+#define RO_GOT_X86
+#endif
+
 /*
  * Read only Data
  */
@@ -335,6 +346,7 @@
VMLINUX_SYMBOL(__end_builtin_fw) = .;   \
}   \
\
+   RO_GOT_X86  \
TRACEDATA   \
\
/* Kernel symbol table: Normal symbols */   \
-- 
2.14.2.920.gcf0c67979c-goog

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[RFC v3 23/27] x86/modules: Adapt module loading for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
Adapt module loading to support PIE relocations. Generate a dynamic GOT if
a symbol requires it but no entry exists in the kernel GOT.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.
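
For context, a hedged user-space sketch of what applying such a relocation
boils down to once module_emit_got_entry() has produced a slot; the helper
name and types here are made up, only the generic GOTPCREL arithmetic
(GOT slot address + addend - place) is assumed:

#include <errno.h>
#include <stdint.h>

/* write a 32-bit PC-relative reference to the GOT slot into the code */
int apply_gotpcrel(void *loc, uint64_t got_entry_plus_addend)
{
        int64_t value = (int64_t)(got_entry_plus_addend - (uint64_t)loc);

        if (value != (int64_t)(int32_t)value)
                return -ERANGE;         /* slot must stay within +/-2G of the site */
        *(uint32_t *)loc = (uint32_t)value;
        return 0;
}

int main(void)
{
        uint32_t insn_field = 0;

        /* made-up addresses: a slot 0x100 bytes past the patched location */
        return apply_gotpcrel(&insn_field, (uint64_t)&insn_field + 0x100) ? 1 : 0;
}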

Signed-off-by: Thomas Garnier 
---
 arch/x86/Makefile |   4 +
 arch/x86/include/asm/module.h |  14 +++
 arch/x86/kernel/module.c  | 204 --
 3 files changed, 217 insertions(+), 5 deletions(-)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 4cb4f0495ddc..42774185a58a 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -143,7 +143,11 @@ else
 KBUILD_CFLAGS += $(cflags-y)
 
 KBUILD_CFLAGS += -mno-red-zone
+ifdef CONFIG_X86_PIE
+KBUILD_LDFLAGS_MODULE += -T $(srctree)/arch/x86/kernel/module.lds
+else
 KBUILD_CFLAGS += -mcmodel=kernel
+endif
 
 # -funit-at-a-time shrinks the kernel .text considerably
 # unfortunately it makes reading oopses harder.
diff --git a/arch/x86/include/asm/module.h b/arch/x86/include/asm/module.h
index 9eb7c718aaf8..8e0bd52bbadf 100644
--- a/arch/x86/include/asm/module.h
+++ b/arch/x86/include/asm/module.h
@@ -4,12 +4,23 @@
 #include 
 #include 
 
+#ifdef CONFIG_X86_PIE
+struct mod_got_sec {
+   struct elf64_shdr   *got;
+   int got_num_entries;
+   int got_max_entries;
+};
+#endif
+
 struct mod_arch_specific {
 #ifdef CONFIG_ORC_UNWINDER
unsigned int num_orcs;
int *orc_unwind_ip;
struct orc_entry *orc_unwind;
 #endif
+#ifdef CONFIG_X86_PIE
+   struct mod_got_sec  core;
+#endif
 };
 
 #ifdef CONFIG_X86_64
@@ -70,4 +81,7 @@ struct mod_arch_specific {
 # define MODULE_ARCH_VERMAGIC MODULE_PROC_FAMILY
 #endif
 
+
+u64 module_find_got_entry(struct module *mod, u64 addr);
+
 #endif /* _ASM_X86_MODULE_H */
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index 62e7d70aadd5..3b9b43a9d63b 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -77,6 +78,195 @@ static unsigned long int get_module_load_offset(void)
 }
 #endif
 
+#ifdef CONFIG_X86_PIE
+static u64 find_got_kernel_entry(Elf64_Sym *sym, const Elf64_Rela *rela)
+{
+   u64 *pos;
+
+   for (pos = (u64*)__start_got; pos < (u64*)__end_got; pos++) {
+   if (*pos == sym->st_value)
+   return (u64)pos + rela->r_addend;
+   }
+
+   return 0;
+}
+
+/* Search the GOT entry for a specific address and a module (optional) */
+u64 module_find_got_entry(struct module *mod, u64 addr)
+{
+   int i;
+   u64 *pos;
+
+   for (pos = (u64*)__start_got; pos < (u64*)__end_got; pos++) {
+   if (*pos == addr)
+   return (u64)pos;
+   }
+
+   if (!mod)
+   return 0;
+
+   pos = (u64*)mod->arch.core.got->sh_addr;
+   for (i = 0; i < mod->arch.core.got_num_entries; i++) {
+   if (pos[i] == addr)
+   return (u64)&pos[i];
+   }
+   return 0;
+}
+
+static u64 module_emit_got_entry(struct module *mod, void *loc,
+const Elf64_Rela *rela, Elf64_Sym *sym)
+{
+   struct mod_got_sec *gotsec = &mod->arch.core;
+   u64 *got = (u64*)gotsec->got->sh_addr;
+   int i = gotsec->got_num_entries;
+   u64 ret;
+
+   /* Check if we can use the kernel GOT */
+   ret = find_got_kernel_entry(sym, rela);
+   if (ret)
+   return ret;
+
+   got[i] = sym->st_value;
+
+   /*
+* Check if the entry we just created is a duplicate. Given that the
+* relocations are sorted, this will be the last entry we allocated.
+* (if one exists).
+*/
+   if (i > 0 && got[i] == got[i - 2]) {
+   ret = (u64)&got[i - 1];
+   } else {
+   gotsec->got_num_entries++;
+   BUG_ON(gotsec->got_num_entries > gotsec->got_max_entries);
+   ret = (u64)&got[i];
+   }
+
+   return ret + rela->r_addend;
+}
+
+#define cmp_3way(a,b)  ((a) < (b) ? -1 : (a) > (b))
+
+static int cmp_rela(const void *a, const void *b)
+{
+   const Elf64_Rela *x = a, *y = b;
+   int i;
+
+   /* sort by type, symbol index and addend */
+   i = cmp_3way(ELF64_R_TYPE(x->r_info), ELF64_R_TYPE(y->r_info));
+   if (i == 0)
+   i = cmp_3way(ELF64_R_SYM(x->r_info), ELF64_R_SYM(y->r_info));
+   if (i == 0)
+   i = cmp_3way(x->r_addend, y->r_addend);
+   return i;
+}
+
+static bool duplicate_rel(const Elf64_Rela *rela, int num)
+{
+   /*
+* Entries are sorted by type, symbol index and addend. That means
+* that, if a duplicate entry exists, it must be in the preceding
+* slot.
+*/
+   return num > 0 && cmp_rela(rela + num, rela + num - 1) == 0;
+}
+
+static 

[RFC v3 21/27] x86/mm/dump_pagetables: Fix address markers index on x86_64

2017-10-26 Thread Thomas Garnier via Virtualization
The address_markers_idx enum is not aligned with the table when EFI is
enabled. Add an EFI_VA_END_NR entry in this case.

Signed-off-by: Thomas Garnier 
---
 arch/x86/mm/dump_pagetables.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 5e3ac6fe6c9e..8691a57da63e 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -52,12 +52,15 @@ enum address_markers_idx {
LOW_KERNEL_NR,
VMALLOC_START_NR,
VMEMMAP_START_NR,
-#ifdef CONFIG_KASAN
+# ifdef CONFIG_KASAN
KASAN_SHADOW_START_NR,
KASAN_SHADOW_END_NR,
-#endif
+# endif
 # ifdef CONFIG_X86_ESPFIX64
ESPFIX_START_NR,
+# endif
+# ifdef CONFIG_EFI
+   EFI_VA_END_NR,
 # endif
HIGH_KERNEL_NR,
MODULES_VADDR_NR,
-- 
2.14.2.920.gcf0c67979c-goog

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[RFC v3 22/27] x86/modules: Add option to start module section after kernel

2017-10-26 Thread Thomas Garnier via Virtualization
Add an option so the module section is just after the mapped kernel. It
ensures position independent modules are always at the right distance
from the kernel and do not require mcmodel=large. It also optimizes the
available size for modules by getting rid of the empty space in the
kernel randomization range.
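
For reference, the arithmetic behind the new MODULES_VADDR as a quick
user-space sketch (the kernel_end value is made up; ALIGN, PMD_SIZE and
PAGE_SIZE follow their usual kernel definitions):

#include <stdio.h>

#define PAGE_SIZE       (1UL << 12)
#define PMD_SIZE        (1UL << 21)                     /* 2 MiB */
#define ALIGN(x, a)     (((x) + (a) - 1) & ~((a) - 1))  /* round up to a power-of-two boundary */

int main(void)
{
        unsigned long kernel_end = 0xffffffff826e4000UL;        /* hypothetical _end */

        /* MODULES_VADDR with CONFIG_DYNAMIC_MODULE_BASE */
        printf("modules start at %#lx\n", ALIGN(kernel_end + PAGE_SIZE, PMD_SIZE));
        return 0;
}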

Signed-off-by: Thomas Garnier 
---
 Documentation/x86/x86_64/mm.txt | 3 +++
 arch/x86/Kconfig| 4 
 arch/x86/include/asm/pgtable_64_types.h | 6 +-
 arch/x86/kernel/head64.c| 5 -
 arch/x86/mm/dump_pagetables.c   | 4 ++--
 5 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
index b0798e281aa6..b51d66386e32 100644
--- a/Documentation/x86/x86_64/mm.txt
+++ b/Documentation/x86/x86_64/mm.txt
@@ -73,4 +73,7 @@ Note that if CONFIG_RANDOMIZE_MEMORY is enabled, the direct 
mapping of all
 physical memory, vmalloc/ioremap space and virtual memory map are randomized.
 Their order is preserved but their base will be offset early at boot time.
 
+If CONFIG_DYNAMIC_MODULE_BASE is enabled, the module section follows the end of
+the mapped kernel.
+
 -Andi Kleen, Jul 2004
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 777197fab6dd..1e4b399c64e5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2133,6 +2133,10 @@ config RANDOMIZE_MEMORY_PHYSICAL_PADDING
 
   If unsure, leave at the default value.
 
+# Module section starts just after the end of the kernel module
+config DYNAMIC_MODULE_BASE
+   bool
+
 config X86_GLOBAL_STACKPROTECTOR
bool
depends on CC_STACKPROTECTOR
diff --git a/arch/x86/include/asm/pgtable_64_types.h 
b/arch/x86/include/asm/pgtable_64_types.h
index 06470da156ba..e00fc429b898 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -6,6 +6,7 @@
 #ifndef __ASSEMBLY__
 #include 
 #include 
+#include 
 
 /*
  * These are used to make use of C type-checking..
@@ -18,7 +19,6 @@ typedef unsigned long pgdval_t;
 typedef unsigned long  pgprotval_t;
 
 typedef struct { pteval_t pte; } pte_t;
-
 #endif /* !__ASSEMBLY__ */
 
 #define SHARED_KERNEL_PMD  0
@@ -93,7 +93,11 @@ typedef struct { pteval_t pte; } pte_t;
 #define VMEMMAP_START  __VMEMMAP_BASE
 #endif /* CONFIG_RANDOMIZE_MEMORY */
 #define VMALLOC_END(VMALLOC_START + _AC((VMALLOC_SIZE_TB << 40) - 1, UL))
+#ifdef CONFIG_DYNAMIC_MODULE_BASE
+#define MODULES_VADDR   ALIGN(((unsigned long)_end + PAGE_SIZE), PMD_SIZE)
+#else
 #define MODULES_VADDR(__START_KERNEL_map + KERNEL_IMAGE_SIZE)
+#endif
 /* The module sections ends with the start of the fixmap */
 #define MODULES_END   __fix_to_virt(__end_of_fixed_addresses + 1)
 #define MODULES_LEN   (MODULES_END - MODULES_VADDR)
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 675f1dba3b21..b6363f0d11a7 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -321,12 +321,15 @@ asmlinkage __visible void __init x86_64_start_kernel(char 
* real_mode_data)
 * Build-time sanity checks on the kernel image and module
 * area mappings. (these are purely build-time and produce no code)
 */
+#ifndef CONFIG_DYNAMIC_MODULE_BASE
BUILD_BUG_ON(MODULES_VADDR < __START_KERNEL_map);
BUILD_BUG_ON(MODULES_VADDR - __START_KERNEL_map < KERNEL_IMAGE_SIZE);
-   BUILD_BUG_ON(MODULES_LEN + KERNEL_IMAGE_SIZE > 2*PUD_SIZE);
+   BUILD_BUG_ON(!IS_ENABLED(CONFIG_RANDOMIZE_BASE_LARGE) &&
+MODULES_LEN + KERNEL_IMAGE_SIZE > 2*PUD_SIZE);
BUILD_BUG_ON((__START_KERNEL_map & ~PMD_MASK) != 0);
BUILD_BUG_ON((MODULES_VADDR & ~PMD_MASK) != 0);
BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL));
+#endif
BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) ==
(__START_KERNEL & PGDIR_MASK)));
BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 8691a57da63e..8565b2b45848 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -95,7 +95,7 @@ static struct addr_marker address_markers[] = {
{ EFI_VA_END,   "EFI Runtime Services" },
 # endif
{ __START_KERNEL_map,   "High Kernel Mapping" },
-   { MODULES_VADDR,"Modules" },
+   { 0/* MODULES_VADDR */, "Modules" },
{ MODULES_END,  "End Modules" },
 #else
{ PAGE_OFFSET,  "Kernel Mapping" },
@@ -529,7 +529,7 @@ static int __init pt_dump_init(void)
 # endif
address_markers[FIXADDR_START_NR].start_address = FIXADDR_START;
 #endif
-
+   address_markers[MODULES_VADDR_NR].start_address = MODULES_VADDR;
return 0;
 }
 __initcall(pt_dump_init);
-- 
2.14.2.920.gcf0c67979c-goog

___
Virtualization mailing list

[RFC v3 20/27] x86/ftrace: Adapt function tracing for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
When using -fPIE/PIC with function tracing, the compiler generates a
call through the GOT (call *__fentry__@GOTPCREL). This instruction
takes 6 bytes instead of the 5 bytes of the usual relative call.

With this change, function tracing supports the 6-byte call on traceable
functions and can still replace relative calls in the ftrace assembly
functions.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.
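
For reference, a small user-space sketch of the 6-byte encoding the tracer now
has to build and recognise (the struct mirrors ftrace_code_got_union from the
patch; the addresses in main() are made up): a relative call is e8 <rel32>
(5 bytes), a call through the GOT is ff 15 <rip-rel32> (6 bytes), which is why
MCOUNT_INSN_SIZE grows to 6 with X86_PIE.

#include <stdint.h>
#include <string.h>

struct got_call {
        uint16_t opcode;        /* 0x15ff: "call *disp32(%rip)", stored little endian */
        int32_t  offset;        /* displacement to the GOT slot, from the next instruction */
} __attribute__((packed));

/* build the 6-byte GOT call, same arithmetic as ftrace_original_call() below */
static void encode_got_call(uint8_t buf[6], uint64_t ip, uint64_t got_entry)
{
        struct got_call c = {
                .opcode = 0x15ff,
                .offset = (int32_t)(got_entry - (ip + 6)),
        };
        memcpy(buf, &c, sizeof(c));
}

int main(void)
{
        uint8_t buf[6];

        encode_got_call(buf, 0x1000, 0x2000);   /* made-up addresses */
        return (buf[0] == 0xff && buf[1] == 0x15) ? 0 : 1;
}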

Signed-off-by: Thomas Garnier 
---
 arch/x86/include/asm/ftrace.h   |  23 +-
 arch/x86/include/asm/sections.h |   4 +
 arch/x86/kernel/ftrace.c| 168 ++--
 arch/x86/kernel/module.lds  |   3 +
 4 files changed, 139 insertions(+), 59 deletions(-)
 create mode 100644 arch/x86/kernel/module.lds

diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
index eccd0ac6bc38..b8bbcc7fad7f 100644
--- a/arch/x86/include/asm/ftrace.h
+++ b/arch/x86/include/asm/ftrace.h
@@ -1,6 +1,7 @@
 #ifndef _ASM_X86_FTRACE_H
 #define _ASM_X86_FTRACE_H
 
+
 #ifdef CONFIG_FUNCTION_TRACER
 #ifdef CC_USING_FENTRY
 # define MCOUNT_ADDR   ((unsigned long)(__fentry__))
@@ -8,7 +9,19 @@
 # define MCOUNT_ADDR   ((unsigned long)(mcount))
 # define HAVE_FUNCTION_GRAPH_FP_TEST
 #endif
-#define MCOUNT_INSN_SIZE   5 /* sizeof mcount call */
+
+#define MCOUNT_RELINSN_SIZE5 /* sizeof relative (call or jump) */
+#define MCOUNT_GOTCALL_SIZE6 /* sizeof call *got */
+
+/*
+ * MCOUNT_INSN_SIZE is the highest size of instructions based on the
+ * configuration.
+ */
+#ifdef CONFIG_X86_PIE
+#define MCOUNT_INSN_SIZE   MCOUNT_GOTCALL_SIZE
+#else
+#define MCOUNT_INSN_SIZE   MCOUNT_RELINSN_SIZE
+#endif
 
 #ifdef CONFIG_DYNAMIC_FTRACE
 #define ARCH_SUPPORTS_FTRACE_OPS 1
@@ -17,6 +30,8 @@
 #define HAVE_FUNCTION_GRAPH_RET_ADDR_PTR
 
 #ifndef __ASSEMBLY__
+#include 
+
 extern void mcount(void);
 extern atomic_t modifying_ftrace_code;
 extern void __fentry__(void);
@@ -24,9 +39,11 @@ extern void __fentry__(void);
 static inline unsigned long ftrace_call_adjust(unsigned long addr)
 {
/*
-* addr is the address of the mcount call instruction.
-* recordmcount does the necessary offset calculation.
+* addr is the address of the mcount call instruction. PIE always has a
+* byte added to the start of the function.
 */
+   if (IS_ENABLED(CONFIG_X86_PIE))
+   addr -= 1;
return addr;
 }
 
diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
index 2f75f30cb2f6..6b2d496cf1aa 100644
--- a/arch/x86/include/asm/sections.h
+++ b/arch/x86/include/asm/sections.h
@@ -11,4 +11,8 @@ extern struct exception_table_entry __stop___ex_table[];
 extern char __end_rodata_hpage_align[];
 #endif
 
+#if defined(CONFIG_X86_PIE)
+extern char __start_got[], __end_got[];
+#endif
+
 #endif /* _ASM_X86_SECTIONS_H */
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 9bef1bbeba63..41d8c4c4306d 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -58,12 +58,17 @@ static int ftrace_calc_offset(long ip, long addr)
return (int)(addr - ip);
 }
 
-static unsigned char *ftrace_call_replace(unsigned long ip, unsigned long addr)
+static unsigned char *ftrace_call_replace(unsigned long ip, unsigned long addr,
+ unsigned int size)
 {
static union ftrace_code_union calc;
 
+   /* On PIE, fill the rest of the buffer with nops */
+   if (IS_ENABLED(CONFIG_X86_PIE))
+   memset(calc.code, ideal_nops[1][0], sizeof(calc.code));
+
calc.e8 = 0xe8;
-   calc.offset = ftrace_calc_offset(ip + MCOUNT_INSN_SIZE, addr);
+   calc.offset = ftrace_calc_offset(ip + MCOUNT_RELINSN_SIZE, addr);
 
/*
 * No locking needed, this must be called via kstop_machine
@@ -72,6 +77,44 @@ static unsigned char *ftrace_call_replace(unsigned long ip, 
unsigned long addr)
return calc.code;
 }
 
+#ifdef CONFIG_X86_PIE
+union ftrace_code_got_union {
+   char code[MCOUNT_INSN_SIZE];
+   struct {
+   unsigned short ff15;
+   int offset;
+   } __attribute__((packed));
+};
+
+/* Used to identify a mcount GOT call on PIE */
+static unsigned char *ftrace_original_call(struct module* mod, unsigned long 
ip,
+  unsigned long addr,
+  unsigned int size)
+{
+   static union ftrace_code_got_union calc;
+   unsigned long gotaddr;
+
+   calc.ff15 = 0x15ff;
+
+   gotaddr = module_find_got_entry(mod, addr);
+   if (!gotaddr) {
+   pr_err("Failed to find GOT entry for 0x%lx\n", addr);
+   return NULL;
+   }
+
+   calc.offset = ftrace_calc_offset(ip + MCOUNT_GOTCALL_SIZE, gotaddr);
+   return calc.code;
+}
+#else
+static unsigned char 

[RFC v3 18/27] kvm: Adapt assembly for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
Change the assembly code to use only relative references to symbols so the
kernel can be PIE compatible. The new __ASM_GET_PTR_PRE macro is used to
get the address of a symbol on both 32 and 64-bit with PIE support.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.

Signed-off-by: Thomas Garnier 
---
 arch/x86/include/asm/kvm_host.h | 6 --
 arch/x86/kernel/kvm.c   | 6 --
 arch/x86/kvm/svm.c  | 4 ++--
 3 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9d7d856b2d89..14073fda75fb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1342,9 +1342,11 @@ asmlinkage void kvm_spurious_fault(void);
".pushsection .fixup, \"ax\" \n" \
"667: \n\t" \
cleanup_insn "\n\t"   \
-   "cmpb $0, kvm_rebooting \n\t" \
+   "cmpb $0, kvm_rebooting" __ASM_SEL(,(%%rip)) " \n\t" \
"jne 668b \n\t"   \
-   __ASM_SIZE(push) " $666b \n\t"\
+   __ASM_SIZE(push) "%%" _ASM_AX " \n\t"   \
+   __ASM_GET_PTR_PRE(666b) "%%" _ASM_AX "\n\t" \
+   "xchg %%" _ASM_AX ", (%%" _ASM_SP ") \n\t"  \
"call kvm_spurious_fault \n\t"\
".popsection \n\t" \
_ASM_EXTABLE(666b, 667b)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index aa60a08b65b1..07176bfc188b 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -620,8 +620,10 @@ asm(
 ".global __raw_callee_save___kvm_vcpu_is_preempted;"
 ".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
 "__raw_callee_save___kvm_vcpu_is_preempted:"
-"movq  __per_cpu_offset(,%rdi,8), %rax;"
-"cmpb  $0, " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rax);"
+"leaq  __per_cpu_offset(%rip), %rax;"
+"movq  (%rax,%rdi,8), %rax;"
+"addq  " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rip), %rax;"
+"cmpb  $0, (%rax);"
 "setne %al;"
 "ret;"
 ".popsection");
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 0e68f0b3cbf7..364536080438 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -568,12 +568,12 @@ static u32 svm_msrpm_offset(u32 msr)
 
 static inline void clgi(void)
 {
-   asm volatile (__ex(SVM_CLGI));
+   asm volatile (__ex(SVM_CLGI) : :);
 }
 
 static inline void stgi(void)
 {
-   asm volatile (__ex(SVM_STGI));
+   asm volatile (__ex(SVM_STGI) : :);
 }
 
 static inline void invlpga(unsigned long addr, u32 asid)
-- 
2.14.2.920.gcf0c67979c-goog

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[RFC v3 17/27] xen: Adapt assembly for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
Change the assembly code to use the new _ASM_GET_PTR macro, which gets a
symbol reference while being PIE compatible. Adapt the relocation tool
to ignore 32-bit Xen code.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.

Signed-off-by: Thomas Garnier 
---
 arch/x86/tools/relocs.c | 16 +++-
 arch/x86/xen/xen-head.S |  9 +
 arch/x86/xen/xen-pvh.S  | 13 +
 3 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/arch/x86/tools/relocs.c b/arch/x86/tools/relocs.c
index 5d3eb2760198..bc032ad88b22 100644
--- a/arch/x86/tools/relocs.c
+++ b/arch/x86/tools/relocs.c
@@ -831,6 +831,16 @@ static int is_percpu_sym(ElfW(Sym) *sym, const char 
*symname)
strncmp(symname, "init_per_cpu_", 13);
 }
 
+/*
+ * Check if the 32-bit relocation is within the xenpvh 32-bit code.
+ * If so, ignores it.
+ */
+static int is_in_xenpvh_assembly(ElfW(Addr) offset)
+{
+   ElfW(Sym) *sym = sym_lookup("pvh_start_xen");
+   return sym && (offset >= sym->st_value) &&
+   (offset < (sym->st_value + sym->st_size));
+}
 
 static int do_reloc64(struct section *sec, Elf_Rel *rel, ElfW(Sym) *sym,
  const char *symname)
@@ -892,8 +902,12 @@ static int do_reloc64(struct section *sec, Elf_Rel *rel, 
ElfW(Sym) *sym,
 * the relocations are processed.
 * Make sure that the offset will fit.
 */
-   if (r_type != R_X86_64_64 && (int32_t)offset != (int64_t)offset)
+   if (r_type != R_X86_64_64 &&
+   (int32_t)offset != (int64_t)offset) {
+   if (is_in_xenpvh_assembly(offset))
+   break;
die("Relocation offset doesn't fit in 32 bits\n");
+   }
 
if (r_type == R_X86_64_64)
add_reloc(&relocs64, offset);
diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S
index 124941d09b2b..e5b7b9566191 100644
--- a/arch/x86/xen/xen-head.S
+++ b/arch/x86/xen/xen-head.S
@@ -25,14 +25,15 @@ ENTRY(startup_xen)
 
/* Clear .bss */
xor %eax,%eax
-   mov $__bss_start, %_ASM_DI
-   mov $__bss_stop, %_ASM_CX
+   _ASM_GET_PTR(__bss_start, %_ASM_DI)
+   _ASM_GET_PTR(__bss_stop, %_ASM_CX)
sub %_ASM_DI, %_ASM_CX
shr $__ASM_SEL(2, 3), %_ASM_CX
rep __ASM_SIZE(stos)
 
-   mov %_ASM_SI, xen_start_info
-   mov $init_thread_union+THREAD_SIZE, %_ASM_SP
+   _ASM_GET_PTR(xen_start_info, %_ASM_AX)
+   mov %_ASM_SI, (%_ASM_AX)
+   _ASM_GET_PTR(init_thread_union+THREAD_SIZE, %_ASM_SP)
 
jmp xen_start_kernel
 END(startup_xen)
diff --git a/arch/x86/xen/xen-pvh.S b/arch/x86/xen/xen-pvh.S
index e1a5fbeae08d..43e234c7c2de 100644
--- a/arch/x86/xen/xen-pvh.S
+++ b/arch/x86/xen/xen-pvh.S
@@ -101,8 +101,8 @@ ENTRY(pvh_start_xen)
call xen_prepare_pvh
 
/* startup_64 expects boot_params in %rsi. */
-   mov $_pa(pvh_bootparams), %rsi
-   mov $_pa(startup_64), %rax
+   movabs $_pa(pvh_bootparams), %rsi
+   movabs $_pa(startup_64), %rax
jmp *%rax
 
 #else /* CONFIG_X86_64 */
@@ -137,10 +137,15 @@ END(pvh_start_xen)
 
.section ".init.data","aw"
.balign 8
+   /*
+* Use a quad for _pa(gdt_start) because PIE does not understand a
+* long is enough. The resulting value will still be in the lower long
+* part.
+*/
 gdt:
.word gdt_end - gdt_start
-   .long _pa(gdt_start)
-   .word 0
+   .quad _pa(gdt_start)
+   .balign 8
 gdt_start:
.quad 0x/* NULL descriptor */
.quad 0x/* reserved */
-- 
2.14.2.920.gcf0c67979c-goog

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[RFC v3 16/27] x86/relocs: Handle PIE relocations

2017-10-26 Thread Thomas Garnier via Virtualization
Change the relocation tool to correctly handle relocations generated by
the -fPIE option:

 - Add a relocation for each entry of the .got section, given the linker does
   not generate R_X86_64_GLOB_DAT on a simple link (see the sketch below).
 - Ignore R_X86_64_GOTPCREL and R_X86_64_PLT32.
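
The net effect of those simulated entries at boot time, as a hedged sketch
(not code from the patch): every .got slot holds an absolute symbol address,
so each one simply gets shifted by the load delta like any other 64-bit
relocation.

#include <stdint.h>

static void relocate_got(uint64_t *got_start, uint64_t *got_end, uint64_t load_delta)
{
        for (uint64_t *slot = got_start; slot < got_end; slot++)
                *slot += load_delta;    /* slot held the link-time address of the symbol */
}

int main(void)
{
        uint64_t got[2] = { 0xffffffff81000000ULL, 0xffffffff81001000ULL };     /* made up */

        relocate_got(got, got + 2, 0x200000);   /* pretend the kernel moved up by 2 MiB */
        return got[0] == 0xffffffff81200000ULL ? 0 : 1;
}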

Signed-off-by: Thomas Garnier 
---
 arch/x86/tools/relocs.c | 94 -
 1 file changed, 93 insertions(+), 1 deletion(-)

diff --git a/arch/x86/tools/relocs.c b/arch/x86/tools/relocs.c
index 73eb7fd4aec4..5d3eb2760198 100644
--- a/arch/x86/tools/relocs.c
+++ b/arch/x86/tools/relocs.c
@@ -31,6 +31,7 @@ struct section {
Elf_Sym*symtab;
Elf_Rel*reltab;
char   *strtab;
+   Elf_Addr   *got;
 };
 static struct section *secs;
 
@@ -292,6 +293,35 @@ static Elf_Sym *sym_lookup(const char *symname)
return 0;
 }
 
+static Elf_Sym *sym_lookup_addr(Elf_Addr addr, const char **name)
+{
+   int i;
+   for (i = 0; i < ehdr.e_shnum; i++) {
+   struct section *sec = [i];
+   long nsyms;
+   Elf_Sym *symtab;
+   Elf_Sym *sym;
+
+   if (sec->shdr.sh_type != SHT_SYMTAB)
+   continue;
+
+   nsyms = sec->shdr.sh_size/sizeof(Elf_Sym);
+   symtab = sec->symtab;
+
+   for (sym = symtab; --nsyms >= 0; sym++) {
+   if (sym->st_value == addr) {
+   if (name) {
+   *name = sym_name(sec->link->strtab,
+sym);
+   }
+   return sym;
+   }
+   }
+   }
+   return 0;
+}
+
+
 #if BYTE_ORDER == LITTLE_ENDIAN
 #define le16_to_cpu(val) (val)
 #define le32_to_cpu(val) (val)
@@ -512,6 +542,33 @@ static void read_relocs(FILE *fp)
}
 }
 
+static void read_got(FILE *fp)
+{
+   int i;
+   for (i = 0; i < ehdr.e_shnum; i++) {
+   struct section *sec = [i];
+   sec->got = NULL;
+   if (sec->shdr.sh_type != SHT_PROGBITS ||
+   strcmp(sec_name(i), ".got")) {
+   continue;
+   }
+   sec->got = malloc(sec->shdr.sh_size);
+   if (!sec->got) {
+   die("malloc of %d bytes for got failed\n",
+   sec->shdr.sh_size);
+   }
+   if (fseek(fp, sec->shdr.sh_offset, SEEK_SET) < 0) {
+   die("Seek to %d failed: %s\n",
+   sec->shdr.sh_offset, strerror(errno));
+   }
+   if (fread(sec->got, 1, sec->shdr.sh_size, fp)
+   != sec->shdr.sh_size) {
+   die("Cannot read got: %s\n",
+   strerror(errno));
+   }
+   }
+}
+
 
 static void print_absolute_symbols(void)
 {
@@ -642,6 +699,32 @@ static void add_reloc(struct relocs *r, uint32_t offset)
r->offset[r->count++] = offset;
 }
 
+/*
+ * The linker does not generate relocations for the GOT for the kernel.
+ * If a GOT is found, simulate the relocations that should have been included.
+ */
+static void walk_got_table(int (*process)(struct section *sec, Elf_Rel *rel,
+ Elf_Sym *sym, const char *symname),
+  struct section *sec)
+{
+   int i;
+   Elf_Addr entry;
+   Elf_Sym *sym;
+   const char *symname;
+   Elf_Rel rel;
+
+   for (i = 0; i < sec->shdr.sh_size/sizeof(Elf_Addr); i++) {
+   entry = sec->got[i];
+   sym = sym_lookup_addr(entry, &symname);
+   if (!sym)
+   die("Could not found got symbol for entry %d\n", i);
+   rel.r_offset = sec->shdr.sh_addr + i * sizeof(Elf_Addr);
+   rel.r_info = ELF_BITS == 64 ? R_X86_64_GLOB_DAT
+: R_386_GLOB_DAT;
+   process(sec, &rel, sym, symname);
+   }
+}
+
 static void walk_relocs(int (*process)(struct section *sec, Elf_Rel *rel,
Elf_Sym *sym, const char *symname))
 {
@@ -655,6 +738,8 @@ static void walk_relocs(int (*process)(struct section *sec, 
Elf_Rel *rel,
struct section *sec = [i];
 
if (sec->shdr.sh_type != SHT_REL_TYPE) {
+   if (sec->got)
+   walk_got_table(process, sec);
continue;
}
sec_symtab  = sec->link;
@@ -764,6 +849,8 @@ static int do_reloc64(struct section *sec, Elf_Rel *rel, 
ElfW(Sym) *sym,
offset += per_cpu_load_addr;
 
switch (r_type) {
+   case R_X86_64_PLT32:
+   case R_X86_64_GOTPCREL:
case R_X86_64_NONE:
/* NONE can be ignored. */
break;
@@ -805,7 

[RFC v3 13/27] x86/boot/64: Use _text in a global for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
By default, PIE-generated code creates only relative references, so _text
points to the temporary virtual address. Instead, use a global variable
so the relocation is done as expected.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.

Signed-off-by: Thomas Garnier 
---
 arch/x86/kernel/head64.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index bab4fa579450..675f1dba3b21 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -45,8 +45,14 @@ static void __head *fixup_pointer(void *ptr, unsigned long 
physaddr)
return ptr - (void *)_text + (void *)physaddr;
 }
 
-unsigned long __head __startup_64(unsigned long physaddr,
- struct boot_params *bp)
+/*
+ * Use a global variable to properly calculate _text delta on PIE. By default
+ * a PIE binary does a RIP relative difference instead of the relocated address.
+ */
+unsigned long _text_offset = (unsigned long)(_text - __START_KERNEL_map);
+
+unsigned long __head notrace __startup_64(unsigned long physaddr,
+ struct boot_params *bp)
 {
unsigned long load_delta, *p;
unsigned long pgtable_flags;
@@ -65,7 +71,7 @@ unsigned long __head __startup_64(unsigned long physaddr,
 * Compute the delta between the address I am compiled to run at
 * and the address I am actually running at.
 */
-   load_delta = physaddr - (unsigned long)(_text - __START_KERNEL_map);
+   load_delta = physaddr - _text_offset;
 
/* Is the address not 2M aligned? */
if (load_delta & ~PMD_PAGE_MASK)
-- 
2.14.2.920.gcf0c67979c-goog

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[RFC v3 14/27] x86/percpu: Adapt percpu for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
Percpu uses a clever design where the .percpu ELF section has a virtual
address of zero and the relocation code avoids relocating specific
symbols. It makes the code simple and easily adaptable with or without
SMP support.

This design is incompatible with PIE because generated code always tries to
access the zero virtual address relative to the default mapping address.
It becomes impossible when KASLR is configured to go below -2G. This
patch solves this problem by removing the zero mapping and adapting the GS
base to be relative to the expected address. These changes are done only
when PIE is enabled. The original implementation is kept as-is
by default.

The assembly and PER_CPU macros are changed to use relative references
when PIE is enabled.

The KALLSYMS_ABSOLUTE_PERCPU configuration is disabled with PIE given
percpu symbols are not absolute in this case.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.
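
A user-space analogue of the addressing change (illustrative only; the kernel
variants additionally carry the %gs segment prefix and the real symbol names):
the absolute form stops linking once everything is built with -fPIE, while the
RIP-relative form that PER_CPU_VAR() now expands to keeps working.

long var;       /* stand-in for a per-cpu variable */

/* absolute reference: fine with -mcmodel=kernel, rejected when linked as PIE */
long load_abs(void)
{
        long v;
        asm volatile ("movq var, %0" : "=r" (v) : : "memory");
        return v;
}

/* RIP-relative reference: what PER_CPU_VAR() generates with CONFIG_X86_PIE */
long load_rel(void)
{
        long v;
        asm volatile ("movq var(%%rip), %0" : "=r" (v) : : "memory");
        return v;
}

int main(void)
{
        var = 42;
        return load_rel() == 42 ? 0 : 1;
}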

Signed-off-by: Thomas Garnier 
---
 arch/x86/entry/entry_64.S  |  4 ++--
 arch/x86/include/asm/percpu.h  | 25 +++--
 arch/x86/kernel/cpu/common.c   |  4 +++-
 arch/x86/kernel/head_64.S  |  4 
 arch/x86/kernel/setup_percpu.c |  2 +-
 arch/x86/kernel/vmlinux.lds.S  | 13 +++--
 arch/x86/lib/cmpxchg16b_emu.S  |  8 
 arch/x86/xen/xen-asm.S | 12 ++--
 init/Kconfig   |  2 +-
 9 files changed, 51 insertions(+), 23 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 15bd5530d2ae..d3a52d2342af 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -392,7 +392,7 @@ ENTRY(__switch_to_asm)
 
 #ifdef CONFIG_CC_STACKPROTECTOR
movqTASK_stack_canary(%rsi), %rbx
-   movq%rbx, PER_CPU_VAR(irq_stack_union)+stack_canary_offset
+   movq%rbx, PER_CPU_VAR(irq_stack_union + stack_canary_offset)
 #endif
 
/* restore callee-saved registers */
@@ -808,7 +808,7 @@ apicinterrupt IRQ_WORK_VECTOR   
irq_work_interrupt  smp_irq_work_interrupt
 /*
  * Exception entry points.
  */
-#define CPU_TSS_IST(x) PER_CPU_VAR(cpu_tss) + (TSS_ist + ((x) - 1) * 8)
+#define CPU_TSS_IST(x) PER_CPU_VAR(cpu_tss + (TSS_ist + ((x) - 1) * 8))
 
 .macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
 ENTRY(\sym)
diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index b21a475fd7ed..07250f1099b5 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -4,9 +4,11 @@
 #ifdef CONFIG_X86_64
 #define __percpu_seg   gs
 #define __percpu_mov_opmovq
+#define __percpu_rel   (%rip)
 #else
 #define __percpu_seg   fs
 #define __percpu_mov_opmovl
+#define __percpu_rel
 #endif
 
 #ifdef __ASSEMBLY__
@@ -27,10 +29,14 @@
 #define PER_CPU(var, reg)  \
__percpu_mov_op %__percpu_seg:this_cpu_off, reg;\
lea var(reg), reg
-#define PER_CPU_VAR(var)   %__percpu_seg:var
+/* Compatible with Position Independent Code */
+#define PER_CPU_VAR(var)   %__percpu_seg:(var)##__percpu_rel
+/* Rare absolute reference */
+#define PER_CPU_VAR_ABS(var)   %__percpu_seg:var
 #else /* ! SMP */
 #define PER_CPU(var, reg)  __percpu_mov_op $var, reg
-#define PER_CPU_VAR(var)   var
+#define PER_CPU_VAR(var)   (var)##__percpu_rel
+#define PER_CPU_VAR_ABS(var)   var
 #endif /* SMP */
 
 #ifdef CONFIG_X86_64_SMP
@@ -208,27 +214,34 @@ do {  
\
pfo_ret__;  \
 })
 
+/* Position Independent code uses relative addresses only */
+#ifdef CONFIG_X86_PIE
+#define __percpu_stable_arg __percpu_arg(a1)
+#else
+#define __percpu_stable_arg __percpu_arg(P1)
+#endif
+
 #define percpu_stable_op(op, var)  \
 ({ \
typeof(var) pfo_ret__;  \
switch (sizeof(var)) {  \
case 1: \
-   asm(op "b "__percpu_arg(P1)",%0"\
+   asm(op "b "__percpu_stable_arg ",%0"\
: "=q" (pfo_ret__)  \
: "p" (&(var)));\
break;  \
case 2: \
-   asm(op "w "__percpu_arg(P1)",%0"\
+   asm(op "w "__percpu_stable_arg ",%0"\
: "=r" (pfo_ret__)  \
: "p" (&(var)));\
break;  \
case 4: \
-   asm(op "l "__percpu_arg(P1)",%0"

[RFC v3 15/27] compiler: Option to default to hidden symbols

2017-10-26 Thread Thomas Garnier via Virtualization
Provide an option to default symbol visibility to hidden except for key
symbols. This option is disabled by default and will be used by x86_64
PIE support to remove errors between compilation units.

Default visibility is also kept for external symbols whose addresses are
compared, as they may be equal (start/end of sections). In this case,
older versions of GCC will remove the comparison if the symbols are
hidden. This issue exists at least in gcc 4.9 and earlier.
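
A small stand-alone illustration of the pattern (symbol names other than
_text and __default_visibility are made up): everything defaults to hidden,
and the few symbols whose addresses are compared across objects are pushed
back to default visibility.

#pragma GCC visibility push(hidden)

#define __default_visibility __attribute__((visibility("default")))

/* hidden by default: accessed directly, no GOT indirection even with -fPIE */
int some_internal_state = 1;

/* kept default-visible, e.g. section bounds compared across compilation units */
char _text[] __default_visibility = "";

int uses_them(void)
{
        return some_internal_state + (_text[0] == '\0');
}

#pragma GCC visibility pop

int main(void)
{
        return uses_them() == 2 ? 0 : 1;
}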

Signed-off-by: Thomas Garnier 
---
 arch/x86/boot/boot.h |  2 +-
 arch/x86/include/asm/setup.h |  2 +-
 arch/x86/kernel/cpu/microcode/core.c |  4 ++--
 drivers/base/firmware_class.c|  4 ++--
 include/asm-generic/sections.h   |  6 ++
 include/linux/compiler.h |  8 
 init/Kconfig |  7 +++
 kernel/kallsyms.c| 16 
 kernel/trace/trace.h |  4 ++--
 lib/dynamic_debug.c  |  4 ++--
 10 files changed, 39 insertions(+), 18 deletions(-)

diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h
index ef5a9cc66fb8..d726c35bdd96 100644
--- a/arch/x86/boot/boot.h
+++ b/arch/x86/boot/boot.h
@@ -193,7 +193,7 @@ static inline bool memcmp_gs(const void *s1, addr_t s2, 
size_t len)
 }
 
 /* Heap -- available for dynamic lists. */
-extern char _end[];
+extern char _end[] __default_visibility;
 extern char *HEAP;
 extern char *heap_end;
 #define RESET_HEAP() ((void *)( HEAP = _end ))
diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index a65cf544686a..7e0b54f605c6 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -67,7 +67,7 @@ static inline void x86_ce4100_early_setup(void) { }
  * This is set up by the setup-routine at boot-time
  */
 extern struct boot_params boot_params;
-extern char _text[];
+extern char _text[] __default_visibility;
 
 static inline bool kaslr_enabled(void)
 {
diff --git a/arch/x86/kernel/cpu/microcode/core.c 
b/arch/x86/kernel/cpu/microcode/core.c
index 86e8f0b2537b..8f021783a929 100644
--- a/arch/x86/kernel/cpu/microcode/core.c
+++ b/arch/x86/kernel/cpu/microcode/core.c
@@ -144,8 +144,8 @@ static bool __init check_loader_disabled_bsp(void)
return *res;
 }
 
-extern struct builtin_fw __start_builtin_fw[];
-extern struct builtin_fw __end_builtin_fw[];
+extern struct builtin_fw __start_builtin_fw[] __default_visibility;
+extern struct builtin_fw __end_builtin_fw[] __default_visibility;
 
 bool get_builtin_firmware(struct cpio_data *cd, const char *name)
 {
diff --git a/drivers/base/firmware_class.c b/drivers/base/firmware_class.c
index 4b57cf5bc81d..77d4727f6594 100644
--- a/drivers/base/firmware_class.c
+++ b/drivers/base/firmware_class.c
@@ -45,8 +45,8 @@ MODULE_LICENSE("GPL");
 
 #ifdef CONFIG_FW_LOADER
 
-extern struct builtin_fw __start_builtin_fw[];
-extern struct builtin_fw __end_builtin_fw[];
+extern struct builtin_fw __start_builtin_fw[] __default_visibility;
+extern struct builtin_fw __end_builtin_fw[] __default_visibility;
 
 static bool fw_get_builtin_firmware(struct firmware *fw, const char *name,
void *buf, size_t size)
diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h
index e5da44eddd2f..1aa5d6dac9e1 100644
--- a/include/asm-generic/sections.h
+++ b/include/asm-generic/sections.h
@@ -30,6 +30,9 @@
  * __irqentry_text_start, __irqentry_text_end
  * __softirqentry_text_start, __softirqentry_text_end
  */
+#ifdef CONFIG_DEFAULT_HIDDEN
+#pragma GCC visibility push(default)
+#endif
 extern char _text[], _stext[], _etext[];
 extern char _data[], _sdata[], _edata[];
 extern char __bss_start[], __bss_stop[];
@@ -46,6 +49,9 @@ extern char __softirqentry_text_start[], 
__softirqentry_text_end[];
 
 /* Start and end of .ctors section - used for constructor calls. */
 extern char __ctors_start[], __ctors_end[];
+#ifdef CONFIG_DEFAULT_HIDDEN
+#pragma GCC visibility pop
+#endif
 
 extern __visible const void __nosave_begin, __nosave_end;
 
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index e95a2631e545..6997716f73bf 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -78,6 +78,14 @@ extern void __chk_io_ptr(const volatile void __iomem *);
 #include 
 #endif
 
+/* Useful for Position Independent Code to reduce global references */
+#ifdef CONFIG_DEFAULT_HIDDEN
+#pragma GCC visibility push(hidden)
+#define __default_visibility  __attribute__((visibility ("default")))
+#else
+#define __default_visibility
+#endif
+
 /*
  * Generic compiler-dependent macros required for kernel
  * build go below this comment. Actual compiler/compiler version
diff --git a/init/Kconfig b/init/Kconfig
index ccb1d8daf241..b640201fcff7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1649,6 +1649,13 @@ config PROFILING
 config TRACEPOINTS
bool
 
+#
+# Default to hidden visibility for all symbols.
+# Useful for Position Independent 

[RFC v3 12/27] x86/paravirt: Adapt assembly for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
If PIE is enabled, switch the paravirt assembly constraints to be
compatible. The %c/i constraints generate smaller code, so they are kept
by default.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.
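
As a hedged, user-space sketch of what the two constraint styles expand
to (the symbol name is illustrative, and the call sites omit the register
clobber handling a real pv call needs):

  extern void (*pv_op)(void);

  static void call_absolute(void)
  {
          /* "%c0" prints the constant bare: call *pv_op (absolute) */
          asm volatile("call *%c0" : : "i" (&pv_op) : "memory");
  }

  static void call_pie_friendly(void)
  {
          /* "%a0" prints an address operand, which the assembler can
           * encode RIP-relative when building with -fpie:
           * call *pv_op(%rip) */
          asm volatile("call *%a0" : : "p" (&pv_op) : "memory");
  }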

Signed-off-by: Thomas Garnier 
---
 arch/x86/include/asm/paravirt_types.h | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 280d94c36dad..e6961f8a74aa 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -335,9 +335,17 @@ extern struct pv_lock_ops pv_lock_ops;
 #define PARAVIRT_PATCH(x)  \
(offsetof(struct paravirt_patch_template, x) / sizeof(void *))
 
+#ifdef CONFIG_X86_PIE
+#define paravirt_opptr_call "a"
+#define paravirt_opptr_type "p"
+#else
+#define paravirt_opptr_call "c"
+#define paravirt_opptr_type "i"
+#endif
+
 #define paravirt_type(op)  \
[paravirt_typenum] "i" (PARAVIRT_PATCH(op)),\
-   [paravirt_opptr] "i" (&(op))
+   [paravirt_opptr] paravirt_opptr_type (&(op))
 #define paravirt_clobber(clobber)  \
[paravirt_clobber] "i" (clobber)
 
@@ -391,7 +399,7 @@ int paravirt_disable_iospace(void);
  * offset into the paravirt_patch_template structure, and can therefore be
  * freely converted back into a structure offset.
  */
-#define PARAVIRT_CALL  "call *%c[paravirt_opptr];"
+#define PARAVIRT_CALL  "call *%" paravirt_opptr_call "[paravirt_opptr];"
 
 /*
  * These macros are intended to wrap calls through one of the paravirt
-- 
2.14.2.920.gcf0c67979c-goog



[RFC v3 11/27] x86/power/64: Adapt assembly for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
Change the assembly code to use only relative references to symbols for
the kernel to be PIE compatible.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.

Signed-off-by: Thomas Garnier 
---
 arch/x86/power/hibernate_asm_64.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/power/hibernate_asm_64.S 
b/arch/x86/power/hibernate_asm_64.S
index ce8da3a0412c..6fdd7bbc3c33 100644
--- a/arch/x86/power/hibernate_asm_64.S
+++ b/arch/x86/power/hibernate_asm_64.S
@@ -24,7 +24,7 @@
 #include 
 
 ENTRY(swsusp_arch_suspend)
-   movq$saved_context, %rax
+   leaqsaved_context(%rip), %rax
movq%rsp, pt_regs_sp(%rax)
movq%rbp, pt_regs_bp(%rax)
movq%rsi, pt_regs_si(%rax)
@@ -115,7 +115,7 @@ ENTRY(restore_registers)
movq%rax, %cr4;  # turn PGE back on
 
/* We don't restore %rax, it must be 0 anyway */
-   movq$saved_context, %rax
+   leaqsaved_context(%rip), %rax
movqpt_regs_sp(%rax), %rsp
movqpt_regs_bp(%rax), %rbp
movqpt_regs_si(%rax), %rsi
-- 
2.14.2.920.gcf0c67979c-goog



[RFC v3 08/27] x86/CPU: Adapt assembly for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
Change the assembly code to use only relative references to symbols for
the kernel to be PIE compatible. Use the new _ASM_GET_PTR macro instead
of the 'mov $symbol, %dst' construct to avoid an absolute reference.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.
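
A small user-space sketch of the difference (the variable name is
hypothetical; the first form needs an absolute relocation and so fails to
link as PIE):

  unsigned long some_global;

  static void *addr_absolute(void)
  {
          void *p;

          /* absolute immediate; needs a non-PIE relocation */
          asm("movq $some_global, %0" : "=r" (p));
          return p;
  }

  static void *addr_rip_relative(void)
  {
          void *p;

          /* RIP-relative; roughly what _ASM_GET_PTR expands to on x86-64 */
          asm("leaq some_global(%%rip), %0" : "=r" (p));
          return p;
  }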

Signed-off-by: Thomas Garnier 
---
 arch/x86/include/asm/processor.h | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index b446c5a082ad..b09bd50b06c7 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -49,7 +49,7 @@ static inline void *current_text_addr(void)
 {
void *pc;
 
-   asm volatile("mov $1f, %0; 1:":"=r" (pc));
+   asm volatile(_ASM_GET_PTR(1f, %0) "; 1:":"=r" (pc));
 
return pc;
 }
@@ -695,6 +695,7 @@ static inline void sync_core(void)
: ASM_CALL_CONSTRAINT : : "memory");
 #else
unsigned int tmp;
+   unsigned long tmp2;
 
asm volatile (
UNWIND_HINT_SAVE
@@ -705,11 +706,13 @@ static inline void sync_core(void)
"pushfq\n\t"
"mov %%cs, %0\n\t"
"pushq %q0\n\t"
-   "pushq $1f\n\t"
+   "leaq 1f(%%rip), %1\n\t"
+   "pushq %1\n\t"
"iretq\n\t"
UNWIND_HINT_RESTORE
"1:"
-   : "=" (tmp), ASM_CALL_CONSTRAINT : : "cc", "memory");
+   : "=" (tmp), "=" (tmp2), ASM_CALL_CONSTRAINT
+   : : "cc", "memory");
 #endif
 }
 
-- 
2.14.2.920.gcf0c67979c-goog



[RFC v3 10/27] x86/boot/64: Adapt assembly for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
Change the assembly code to use only relative references to symbols for
the kernel to be PIE compatible.

Early at boot, the kernel is mapped at a temporary address while preparing
the page table. To know the changes needed for the page table with KASLR,
the boot code calculates the difference between the expected address of the
kernel and the one chosen by KASLR. This does not work with PIE because all
symbol references in code are relative: instead of the future relocated
virtual address, they yield the current temporary mapping. The solution is
to use global variables, which are still relocated as expected.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.

Signed-off-by: Thomas Garnier 
---
 arch/x86/kernel/head_64.S | 26 --
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 42e32c2e51bb..32d1899f48df 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -86,8 +86,21 @@ startup_64:
popq%rsi
 
/* Form the CR3 value being sure to include the CR3 modifier */
-   addq$(early_top_pgt - __START_KERNEL_map), %rax
+   addq_early_top_pgt_offset(%rip), %rax
jmp 1f
+
+   /*
+* Position Independent Code takes only relative references in code
+* meaning a global variable address is relative to RIP and not its
+* future virtual address. Global variables can be used instead as they
+* are still relocated on the expected kernel mapping address.
+*/
+   .align 8
+_early_top_pgt_offset:
+   .quad early_top_pgt - __START_KERNEL_map
+_init_top_offset:
+   .quad init_top_pgt - __START_KERNEL_map
+
 ENTRY(secondary_startup_64)
UNWIND_HINT_EMPTY
/*
@@ -116,7 +129,7 @@ ENTRY(secondary_startup_64)
popq%rsi
 
/* Form the CR3 value being sure to include the CR3 modifier */
-   addq$(init_top_pgt - __START_KERNEL_map), %rax
+   addq_init_top_offset(%rip), %rax
 1:
 
/* Enable PAE mode, PGE and LA57 */
@@ -131,7 +144,7 @@ ENTRY(secondary_startup_64)
movq%rax, %cr3
 
/* Ensure I am executing from virtual addresses */
-   movq$1f, %rax
+   movabs  $1f, %rax
jmp *%rax
 1:
UNWIND_HINT_EMPTY
@@ -230,11 +243,12 @@ ENTRY(secondary_startup_64)
 *  REX.W + FF /5 JMP m16:64 Jump far, absolute indirect,
 *  address given in m16:64.
 */
-   pushq   $.Lafter_lret   # put return address on stack for unwinder
+   leaq.Lafter_lret(%rip), %rax
+   pushq   %rax# put return address on stack for unwinder
xorq%rbp, %rbp  # clear frame pointer
-   movqinitial_code(%rip), %rax
+   leaqinitial_code(%rip), %rax
pushq   $__KERNEL_CS# set correct cs
-   pushq   %rax# target address in negative space
+   pushq   (%rax)  # target address in negative space
lretq
 .Lafter_lret:
 END(secondary_startup_64)
-- 
2.14.2.920.gcf0c67979c-goog



[RFC v3 07/27] x86: pm-trace - Adapt assembly for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
Change the assembly to use the new _ASM_GET_PTR macro instead of _ASM_MOV
so that it is PIE compatible.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.

Signed-off-by: Thomas Garnier 
---
 arch/x86/include/asm/pm-trace.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pm-trace.h b/arch/x86/include/asm/pm-trace.h
index 7b7ac42c3661..a3801261f0dd 100644
--- a/arch/x86/include/asm/pm-trace.h
+++ b/arch/x86/include/asm/pm-trace.h
@@ -7,7 +7,7 @@
 do {   \
if (pm_trace_enabled) { \
const void *tracedata;  \
-   asm volatile(_ASM_MOV " $1f,%0\n"   \
+   asm volatile(_ASM_GET_PTR(1f, %0) "\n"  \
 ".section .tracedata,\"a\"\n"  \
 "1:\t.word %c1\n\t"\
 _ASM_PTR " %c2\n"  \
-- 
2.14.2.920.gcf0c67979c-goog



[RFC v3 06/27] x86/entry/64: Adapt assembly for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
Change the assembly code to use only relative references to symbols for
the kernel to be PIE compatible.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.

Signed-off-by: Thomas Garnier 
---
 arch/x86/entry/entry_64.S | 22 +++---
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 49167258d587..15bd5530d2ae 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -194,12 +194,15 @@ entry_SYSCALL_64_fastpath:
ja  1f  /* return -ENOSYS (already in 
pt_regs->ax) */
movq%r10, %rcx
 
+   /* Ensures the call is position independent */
+   leaqsys_call_table(%rip), %r11
+
/*
 * This call instruction is handled specially in stub_ptregs_64.
 * It might end up jumping to the slow path.  If it jumps, RAX
 * and all argument registers are clobbered.
 */
-   call*sys_call_table(, %rax, 8)
+   call*(%r11, %rax, 8)
 .Lentry_SYSCALL_64_after_fastpath_call:
 
movq%rax, RAX(%rsp)
@@ -334,7 +337,8 @@ ENTRY(stub_ptregs_64)
 * RAX stores a pointer to the C function implementing the syscall.
 * IRQs are on.
 */
-   cmpq$.Lentry_SYSCALL_64_after_fastpath_call, (%rsp)
+   leaq.Lentry_SYSCALL_64_after_fastpath_call(%rip), %r11
+   cmpq%r11, (%rsp)
jne 1f
 
/*
@@ -1172,7 +1176,8 @@ ENTRY(error_entry)
movl%ecx, %eax  /* zero extend */
cmpq%rax, RIP+8(%rsp)
je  .Lbstep_iret
-   cmpq$.Lgs_change, RIP+8(%rsp)
+   leaq.Lgs_change(%rip), %rcx
+   cmpq%rcx, RIP+8(%rsp)
jne .Lerror_entry_done
 
/*
@@ -1383,10 +1388,10 @@ ENTRY(nmi)
 * resume the outer NMI.
 */
 
-   movq$repeat_nmi, %rdx
+   leaqrepeat_nmi(%rip), %rdx
cmpq8(%rsp), %rdx
ja  1f
-   movq$end_repeat_nmi, %rdx
+   leaqend_repeat_nmi(%rip), %rdx
cmpq8(%rsp), %rdx
ja  nested_nmi_out
 1:
@@ -1440,7 +1445,8 @@ nested_nmi:
pushq   %rdx
pushfq
pushq   $__KERNEL_CS
-   pushq   $repeat_nmi
+   leaqrepeat_nmi(%rip), %rdx
+   pushq   %rdx
 
/* Put stack back */
addq$(6*8), %rsp
@@ -1479,7 +1485,9 @@ first_nmi:
addq$8, (%rsp)  /* Fix up RSP */
pushfq  /* RFLAGS */
pushq   $__KERNEL_CS/* CS */
-   pushq   $1f /* RIP */
+   pushq   %rax/* Support Position Independent Code */
+   leaq1f(%rip), %rax  /* RIP */
+   xchgq   %rax, (%rsp)/* Restore RAX, put 1f */
INTERRUPT_RETURN/* continues at repeat_nmi below */
UNWIND_HINT_IRET_REGS
 1:
-- 
2.14.2.920.gcf0c67979c-goog



[RFC v3 09/27] x86/acpi: Adapt assembly for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
Change the assembly code to use only relative references to symbols for
the kernel to be PIE compatible.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.

Signed-off-by: Thomas Garnier 
---
 arch/x86/kernel/acpi/wakeup_64.S | 31 ---
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/acpi/wakeup_64.S b/arch/x86/kernel/acpi/wakeup_64.S
index 50b8ed0317a3..472659c0f811 100644
--- a/arch/x86/kernel/acpi/wakeup_64.S
+++ b/arch/x86/kernel/acpi/wakeup_64.S
@@ -14,7 +14,7 @@
 * Hooray, we are in Long 64-bit mode (but still running in low memory)
 */
 ENTRY(wakeup_long64)
-   movqsaved_magic, %rax
+   movqsaved_magic(%rip), %rax
movq$0x123456789abcdef0, %rdx
cmpq%rdx, %rax
jne bogus_64_magic
@@ -25,14 +25,14 @@ ENTRY(wakeup_long64)
movw%ax, %es
movw%ax, %fs
movw%ax, %gs
-   movqsaved_rsp, %rsp
+   movqsaved_rsp(%rip), %rsp
 
-   movqsaved_rbx, %rbx
-   movqsaved_rdi, %rdi
-   movqsaved_rsi, %rsi
-   movqsaved_rbp, %rbp
+   movqsaved_rbx(%rip), %rbx
+   movqsaved_rdi(%rip), %rdi
+   movqsaved_rsi(%rip), %rsi
+   movqsaved_rbp(%rip), %rbp
 
-   movqsaved_rip, %rax
+   movqsaved_rip(%rip), %rax
jmp *%rax
 ENDPROC(wakeup_long64)
 
@@ -45,7 +45,7 @@ ENTRY(do_suspend_lowlevel)
xorl%eax, %eax
callsave_processor_state
 
-   movq$saved_context, %rax
+   leaqsaved_context(%rip), %rax
movq%rsp, pt_regs_sp(%rax)
movq%rbp, pt_regs_bp(%rax)
movq%rsi, pt_regs_si(%rax)
@@ -64,13 +64,14 @@ ENTRY(do_suspend_lowlevel)
pushfq
popqpt_regs_flags(%rax)
 
-   movq$.Lresume_point, saved_rip(%rip)
+   leaq.Lresume_point(%rip), %rax
+   movq%rax, saved_rip(%rip)
 
-   movq%rsp, saved_rsp
-   movq%rbp, saved_rbp
-   movq%rbx, saved_rbx
-   movq%rdi, saved_rdi
-   movq%rsi, saved_rsi
+   movq%rsp, saved_rsp(%rip)
+   movq%rbp, saved_rbp(%rip)
+   movq%rbx, saved_rbx(%rip)
+   movq%rdi, saved_rdi(%rip)
+   movq%rsi, saved_rsi(%rip)
 
addq$8, %rsp
movl$3, %edi
@@ -82,7 +83,7 @@ ENTRY(do_suspend_lowlevel)
.align 4
 .Lresume_point:
/* We don't restore %rax, it must be 0 anyway */
-   movq$saved_context, %rax
+   leaqsaved_context(%rip), %rax
movqsaved_context_cr4(%rax), %rbx
movq%rbx, %cr4
movqsaved_context_cr3(%rax), %rbx
-- 
2.14.2.920.gcf0c67979c-goog



[RFC v3 01/27] x86/crypto: Adapt assembly for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
Change the assembly code to use only relative references to symbols for
the kernel to be PIE compatible.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.

Signed-off-by: Thomas Garnier 
---
 arch/x86/crypto/aes-x86_64-asm_64.S  | 45 -
 arch/x86/crypto/aesni-intel_asm.S| 14 ++--
 arch/x86/crypto/aesni-intel_avx-x86_64.S |  6 +-
 arch/x86/crypto/camellia-aesni-avx-asm_64.S  | 42 ++--
 arch/x86/crypto/camellia-aesni-avx2-asm_64.S | 44 ++---
 arch/x86/crypto/camellia-x86_64-asm_64.S |  8 ++-
 arch/x86/crypto/cast5-avx-x86_64-asm_64.S| 50 ---
 arch/x86/crypto/cast6-avx-x86_64-asm_64.S| 44 +++--
 arch/x86/crypto/des3_ede-asm_64.S| 96 ++--
 arch/x86/crypto/ghash-clmulni-intel_asm.S|  4 +-
 arch/x86/crypto/glue_helper-asm-avx.S|  4 +-
 arch/x86/crypto/glue_helper-asm-avx2.S   |  6 +-
 12 files changed, 211 insertions(+), 152 deletions(-)

diff --git a/arch/x86/crypto/aes-x86_64-asm_64.S 
b/arch/x86/crypto/aes-x86_64-asm_64.S
index 8739cf7795de..86fa068e5e81 100644
--- a/arch/x86/crypto/aes-x86_64-asm_64.S
+++ b/arch/x86/crypto/aes-x86_64-asm_64.S
@@ -48,8 +48,12 @@
 #define R10%r10
 #define R11%r11
 
+/* Hold global for PIE support */
+#define RBASE  %r12
+
 #define prologue(FUNC,KEY,B128,B192,r1,r2,r5,r6,r7,r8,r9,r10,r11) \
ENTRY(FUNC);\
+   pushq   RBASE;  \
movqr1,r2;  \
leaqKEY+48(r8),r9;  \
movqr10,r11;\
@@ -74,54 +78,63 @@
movlr6 ## E,4(r9);  \
movlr7 ## E,8(r9);  \
movlr8 ## E,12(r9); \
+   popqRBASE;  \
ret;\
ENDPROC(FUNC);
 
+#define round_mov(tab_off, reg_i, reg_o) \
+   leaqtab_off(%rip), RBASE; \
+   movl(RBASE,reg_i,4), reg_o;
+
+#define round_xor(tab_off, reg_i, reg_o) \
+   leaqtab_off(%rip), RBASE; \
+   xorl(RBASE,reg_i,4), reg_o;
+
 #define round(TAB,OFFSET,r1,r2,r3,r4,r5,r6,r7,r8,ra,rb,rc,rd) \
movzbl  r2 ## H,r5 ## E;\
movzbl  r2 ## L,r6 ## E;\
-   movlTAB+1024(,r5,4),r5 ## E;\
+   round_mov(TAB+1024, r5, r5 ## E)\
movwr4 ## X,r2 ## X;\
-   movlTAB(,r6,4),r6 ## E; \
+   round_mov(TAB, r6, r6 ## E) \
roll$16,r2 ## E;\
shrl$16,r4 ## E;\
movzbl  r4 ## L,r7 ## E;\
movzbl  r4 ## H,r4 ## E;\
xorlOFFSET(r8),ra ## E; \
xorlOFFSET+4(r8),rb ## E;   \
-   xorlTAB+3072(,r4,4),r5 ## E;\
-   xorlTAB+2048(,r7,4),r6 ## E;\
+   round_xor(TAB+3072, r4, r5 ## E)\
+   round_xor(TAB+2048, r7, r6 ## E)\
movzbl  r1 ## L,r7 ## E;\
movzbl  r1 ## H,r4 ## E;\
-   movlTAB+1024(,r4,4),r4 ## E;\
+   round_mov(TAB+1024, r4, r4 ## E)\
movwr3 ## X,r1 ## X;\
roll$16,r1 ## E;\
shrl$16,r3 ## E;\
-   xorlTAB(,r7,4),r5 ## E; \
+   round_xor(TAB, r7, r5 ## E) \
movzbl  r3 ## L,r7 ## E;\
movzbl  r3 ## H,r3 ## E;\
-   xorlTAB+3072(,r3,4),r4 ## E;\
-   xorlTAB+2048(,r7,4),r5 ## E;\
+   round_xor(TAB+3072, r3, r4 ## E)\
+   round_xor(TAB+2048, r7, r5 ## E)\
movzbl  r1 ## L,r7 ## E;\
movzbl  r1 ## H,r3 ## E;\
shrl$16,r1 ## E;\
-   xorlTAB+3072(,r3,4),r6 ## E;\
-   movlTAB+2048(,r7,4),r3 ## E;\
+   round_xor(TAB+3072, r3, r6 ## E)\
+   round_mov(TAB+2048, r7, r3 ## E)\
movzbl  r1 ## L,r7 ## E;\
movzbl  r1 ## H,r1 ## E;\
-   xorlTAB+1024(,r1,4),r6 ## E;\
-   xorlTAB(,r7,4),r3 ## E; \
+   round_xor(TAB+1024, r1, r6 ## E)\
+   round_xor(TAB, r7, r3 ## E) \
movzbl  r2 ## H,r1 ## E;\
movzbl  r2 ## L,r7 ## E;\
shrl$16,r2 ## E;\
-   xorlTAB+3072(,r1,4),r3 ## E;\
-   xorlTAB+2048(,r7,4),r4 ## E;\
+   round_xor(TAB+3072, r1, r3 ## E)\
+   round_xor(TAB+2048, r7, r4 ## E)\
movzbl  r2 ## H,r1 ## E;\
movzbl  r2 ## L,r2 ## E;\
xorlOFFSET+8(r8),rc ## E;   \
xorlOFFSET+12(r8),rd ## E;  \
-   xorlTAB+1024(,r1,4),r3 ## E;\
-   xorlTAB(,r2,4),r4 ## E;
+   round_xor(TAB+1024, r1, r3 ## E)\
+   round_xor(TAB, r2, r4 ## E)
 
 #define move_regs(r1,r2,r3,r4) \
movlr3 ## E,r1 ## E;\
diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index 16627fec80b2..5f73201dff32 100644
--- 

[RFC v3 05/27] x86: relocate_kernel - Adapt assembly for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
Change the assembly code to use only relative references to symbols for
the kernel to be PIE compatible.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.

Signed-off-by: Thomas Garnier 
---
 arch/x86/kernel/relocate_kernel_64.S | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/relocate_kernel_64.S 
b/arch/x86/kernel/relocate_kernel_64.S
index 307d3bac5f04..2ecbdcbe985b 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -200,9 +200,11 @@ identity_mapped:
movq%rax, %cr3
lea PAGE_SIZE(%r8), %rsp
callswap_pages
-   movq$virtual_mapped, %rax
-   pushq   %rax
-   ret
+   jmp *virtual_mapped_addr(%rip)
+
+   /* Absolute value for PIE support */
+virtual_mapped_addr:
+   .quad virtual_mapped
 
 virtual_mapped:
movqRSP(%r8), %rsp
-- 
2.14.2.920.gcf0c67979c-goog



[RFC v3 04/27] x86: Add macro to get symbol address for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
Add a new _ASM_GET_PTR macro to fetch a symbol address. It will be used
to replace the "_ASM_MOV $, %dst" code constructs that are not compatible
with PIE.

Signed-off-by: Thomas Garnier 
---
 arch/x86/include/asm/asm.h | 13 +
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index c1eadbaf1115..dddcb8a3b777 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -55,6 +55,19 @@
 # define CC_OUT(c) [_cc_ ## c] "=qm"
 #endif
 
+/* Macros to get a global variable address with PIE support on 64-bit */
+#ifdef CONFIG_X86_32
+#define __ASM_GET_PTR_PRE(_src) __ASM_FORM_COMMA(movl $##_src)
+#else
+#ifdef __ASSEMBLY__
+#define __ASM_GET_PTR_PRE(_src) __ASM_FORM_COMMA(leaq (_src)(%rip))
+#else
+#define __ASM_GET_PTR_PRE(_src) __ASM_FORM_COMMA(leaq (_src)(%%rip))
+#endif
+#endif
+#define _ASM_GET_PTR(_src, _dst) \
+   __ASM_GET_PTR_PRE(_src) __ASM_FORM(_dst)
+
 /* Exception table entry */
 #ifdef __ASSEMBLY__
 # define _ASM_EXTABLE_HANDLE(from, to, handler)\
-- 
2.14.2.920.gcf0c67979c-goog



[RFC v3 02/27] x86: Use symbol name on bug table for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
Replace the %c constraint with %P. The %c is incompatible with PIE
because it implies an immediate value, whereas %P references a symbol.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.
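
As a rough user-space sketch of the difference (the section and symbol
names are made up; real bug-table entries go through the kernel's
__BUG_REL helper):

  static const char some_file[] = "example.c";

  static void record_site(void)
  {
          /* "%P0" emits the symbol name, which the linker can relocate
           * even in a PIE build; "%c0" would emit it as a bare constant
           * meant to be used as an immediate. */
          asm volatile(".pushsection .example_table, \"a\"\n\t"
                       ".quad %P0\n\t"
                       ".popsection"
                       : : "i" (some_file));
  }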

Signed-off-by: Thomas Garnier 
---
 arch/x86/include/asm/bug.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/bug.h b/arch/x86/include/asm/bug.h
index aa6b2023d8f8..1210d22ad547 100644
--- a/arch/x86/include/asm/bug.h
+++ b/arch/x86/include/asm/bug.h
@@ -37,7 +37,7 @@ do {  
\
asm volatile("1:\t" ins "\n"\
 ".pushsection __bug_table,\"aw\"\n"\
 "2:\t" __BUG_REL(1b) "\t# bug_entry::bug_addr\n"   \
-"\t"  __BUG_REL(%c0) "\t# bug_entry::file\n"   \
+"\t"  __BUG_REL(%P0) "\t# bug_entry::file\n"   \
 "\t.word %c1""\t# bug_entry::line\n"   \
 "\t.word %c2""\t# bug_entry::flags\n"  \
 "\t.org 2b+%c3\n"  \
-- 
2.14.2.920.gcf0c67979c-goog



[RFC v3 03/27] x86: Use symbol name in jump table for PIE support

2017-10-26 Thread Thomas Garnier via Virtualization
Replace the %c constraint with %P. The %c is incompatible with PIE
because it implies an immediate value, whereas %P references a symbol.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.

Signed-off-by: Thomas Garnier 
---
 arch/x86/include/asm/jump_label.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/jump_label.h 
b/arch/x86/include/asm/jump_label.h
index adc54c12cbd1..6e558e4524dc 100644
--- a/arch/x86/include/asm/jump_label.h
+++ b/arch/x86/include/asm/jump_label.h
@@ -36,9 +36,9 @@ static __always_inline bool arch_static_branch(struct 
static_key *key, bool bran
".byte " __stringify(STATIC_KEY_INIT_NOP) "\n\t"
".pushsection __jump_table,  \"aw\" \n\t"
_ASM_ALIGN "\n\t"
-   _ASM_PTR "1b, %l[l_yes], %c0 + %c1 \n\t"
+   _ASM_PTR "1b, %l[l_yes], %P0 \n\t"
".popsection \n\t"
-   : :  "i" (key), "i" (branch) : : l_yes);
+   : :  "X" (&((char *)key)[branch]) : : l_yes);
 
return false;
 l_yes:
@@ -52,9 +52,9 @@ static __always_inline bool arch_static_branch_jump(struct 
static_key *key, bool
"2:\n\t"
".pushsection __jump_table,  \"aw\" \n\t"
_ASM_ALIGN "\n\t"
-   _ASM_PTR "1b, %l[l_yes], %c0 + %c1 \n\t"
+   _ASM_PTR "1b, %l[l_yes], %P0 \n\t"
".popsection \n\t"
-   : :  "i" (key), "i" (branch) : : l_yes);
+   : :  "X" (&((char *)key)[branch]) : : l_yes);
 
return false;
 l_yes:
-- 
2.14.2.920.gcf0c67979c-goog



x86: PIE support and option to extend KASLR randomization

2017-10-26 Thread Thomas Garnier via Virtualization
These patches make the changes necessary to build the kernel as a Position
Independent Executable (PIE) on x86_64. A PIE kernel can be relocated below
the top 2G of the virtual address space, which makes it possible to
optionally extend the KASLR randomization range from 1G to 3G.

Thanks a lot to Ard Biesheuvel & Kees Cook for their feedback on compiler
changes, PIE support and KASLR in general. Thanks to Roland McGrath for his
feedback on using -pie versus --emit-relocs and details on compiler code
generation.

The patches:
 - 1-3, 5-13, 17-18: Changes in assembly code to be PIE compliant.
 - 4: Add a new _ASM_GET_PTR macro to fetch a symbol address generically.
 - 14: Adapt percpu design to work correctly when PIE is enabled.
 - 15: Provide an option to default visibility to hidden except for key symbols.
   It removes errors between compilation units.
 - 16: Adapt relocation tool to handle PIE binary correctly.
 - 19: Add support for global cookie.
 - 20: Support ftrace with PIE (used on Ubuntu config).
 - 21: Fix incorrect address marker on dump_pagetables.
 - 22: Add option to move the module section just after the kernel.
 - 23: Adapt module loading to support PIE with dynamic GOT.
 - 24: Make the GOT read-only.
 - 25: Add the CONFIG_X86_PIE option (off by default).
 - 26: Adapt relocation tool to generate a 64-bit relocation table.
 - 27: Add the CONFIG_RANDOMIZE_BASE_LARGE option to increase relocation range
   from 1G to 3G (off by default).

Performance/Size impact:

Size of vmlinux (Default configuration):
 File size:
 - PIE disabled: +0.31%
 - PIE enabled: -3.210% (fewer relocations)
 .text section:
 - PIE disabled: +0.000644%
 - PIE enabled: +0.837%

Size of vmlinux (Ubuntu configuration):
 File size:
 - PIE disabled: -0.201%
 - PIE enabled: -0.082%
 .text section:
 - PIE disabled: same
 - PIE enabled: +1.319%

Size of vmlinux (Default configuration + ORC):
 File size:
 - PIE enabled: -3.167%
 .text section:
 - PIE enabled: +0.814%

Size of vmlinux (Ubuntu configuration + ORC):
 File size:
 - PIE enabled: -3.167%
 .text section:
 - PIE enabled: +1.26%

The size increase is mainly due to not having access to the 32-bit signed
relocation that can be used with mcmodel=kernel. A small part is due to reduced
optimization for PIE code. Bug [1] was opened against gcc to request better
code generation for kernel PIE.

Hackbench (50% and 1600% on thread/process for pipe/sockets):
 - PIE disabled: no significant change (avg +0.1% on latest test).
 - PIE enabled: between -0.50% to +0.86% in average (default and Ubuntu config).

slab_test (average of 10 runs):
 - PIE disabled: no significant change (-2% on latest run, likely noise).
 - PIE enabled: between -1% and +0.8% on latest runs.

Kernbench (average of 10 Half and Optimal runs):
 Elapsed Time:
 - PIE disabled: no significant change (avg -0.239%)
 - PIE enabled: average +0.07%
 System Time:
 - PIE disabled: no significant change (avg -0.277%)
 - PIE enabled: average +0.7%

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82303

diffstat:
 Documentation/x86/x86_64/mm.txt  |3 
 arch/x86/Kconfig |   37 
 arch/x86/Makefile|   14 +
 arch/x86/boot/boot.h |2 
 arch/x86/boot/compressed/Makefile|5 
 arch/x86/boot/compressed/misc.c  |   10 +
 arch/x86/crypto/aes-x86_64-asm_64.S  |   45 +++--
 arch/x86/crypto/aesni-intel_asm.S|   14 +
 arch/x86/crypto/aesni-intel_avx-x86_64.S |6 
 arch/x86/crypto/camellia-aesni-avx-asm_64.S  |   42 ++---
 arch/x86/crypto/camellia-aesni-avx2-asm_64.S |   44 ++---
 arch/x86/crypto/camellia-x86_64-asm_64.S |8 -
 arch/x86/crypto/cast5-avx-x86_64-asm_64.S|   50 +++---
 arch/x86/crypto/cast6-avx-x86_64-asm_64.S|   44 +++--
 arch/x86/crypto/des3_ede-asm_64.S|   96 
 arch/x86/crypto/ghash-clmulni-intel_asm.S|4 
 arch/x86/crypto/glue_helper-asm-avx.S|4 
 arch/x86/crypto/glue_helper-asm-avx2.S   |6 
 arch/x86/entry/entry_32.S|3 
 arch/x86/entry/entry_64.S|   29 ++-
 arch/x86/include/asm/asm.h   |   13 +
 arch/x86/include/asm/bug.h   |2 
 arch/x86/include/asm/ftrace.h|   23 ++-
 arch/x86/include/asm/jump_label.h|8 -
 arch/x86/include/asm/kvm_host.h  |6 
 arch/x86/include/asm/module.h|   14 +
 arch/x86/include/asm/page_64_types.h |9 +
 arch/x86/include/asm/paravirt_types.h|   12 +
 arch/x86/include/asm/percpu.h|   25 ++-
 arch/x86/include/asm/pgtable_64_types.h  |6 
 arch/x86/include/asm/pm-trace.h  |2 
 arch/x86/include/asm/processor.h |   12 +
 arch/x86/include/asm/sections.h  |4 
 arch/x86/include/asm/setup.h |2 
 arch/x86/include/asm/stackprotector.h|   19 

[PATCH 13/13] x86/paravirt: Convert natively patched pv ops to use paravirt alternatives

2017-10-26 Thread Josh Poimboeuf
Now that the paravirt alternatives infrastructure is in place, use it
for all natively patched pv ops.

This fixes KASAN warnings in the ORC unwinder like the following:

  BUG: KASAN: stack-out-of-bounds in deref_stack_reg+0x123/0x140

This also improves debuggability by making vmlinux more likely to match
reality.

Reported-by: Sasha Levin 
Signed-off-by: Josh Poimboeuf 
---
 arch/x86/include/asm/paravirt-asm.h | 23 +--
 arch/x86/include/asm/paravirt.h | 37 +
 2 files changed, 34 insertions(+), 26 deletions(-)

diff --git a/arch/x86/include/asm/paravirt-asm.h 
b/arch/x86/include/asm/paravirt-asm.h
index a8139ea27cc1..b051f9254ace 100644
--- a/arch/x86/include/asm/paravirt-asm.h
+++ b/arch/x86/include/asm/paravirt-asm.h
@@ -86,16 +86,18 @@
pv_cpu_ops, PV_CPU_iret, CLBR_NONE)
 
 #define DISABLE_INTERRUPTS(clobbers)   \
-   PV_SITE(PV_SAVE_REGS(clobbers | CLBR_CALLEE_SAVE);  \
-   call PV_INDIRECT(pv_irq_ops+PV_IRQ_irq_disable);\
-   PV_RESTORE_REGS(clobbers | CLBR_CALLEE_SAVE),   \
-   pv_irq_ops, PV_IRQ_irq_disable, clobbers)
+   PV_ALT_SITE(cli,\
+   PV_SAVE_REGS(clobbers | CLBR_CALLEE_SAVE);  \
+   call PV_INDIRECT(pv_irq_ops+PV_IRQ_irq_disable);\
+   PV_RESTORE_REGS(clobbers | CLBR_CALLEE_SAVE),   \
+   pv_irq_ops, PV_IRQ_irq_disable, clobbers)
 
 #define ENABLE_INTERRUPTS(clobbers)\
-   PV_SITE(PV_SAVE_REGS(clobbers | CLBR_CALLEE_SAVE);  \
-   call PV_INDIRECT(pv_irq_ops+PV_IRQ_irq_enable); \
-   PV_RESTORE_REGS(clobbers | CLBR_CALLEE_SAVE),   \
-   pv_irq_ops, PV_IRQ_irq_enable, clobbers)
+   PV_ALT_SITE(sti,\
+   PV_SAVE_REGS(clobbers | CLBR_CALLEE_SAVE);  \
+   call PV_INDIRECT(pv_irq_ops+PV_IRQ_irq_enable); \
+   PV_RESTORE_REGS(clobbers | CLBR_CALLEE_SAVE),   \
+   pv_irq_ops, PV_IRQ_irq_enable, clobbers)
 
 #ifdef CONFIG_X86_32
 
@@ -128,8 +130,9 @@
call PV_INDIRECT(pv_mmu_ops+PV_MMU_read_cr2)
 
 #define USERGS_SYSRET64
\
-   PV_SITE(jmp PV_INDIRECT(pv_cpu_ops+PV_CPU_usergs_sysret64), \
-   pv_cpu_ops, PV_CPU_usergs_sysret64, CLBR_NONE)
+   PV_ALT_SITE(swapgs; sysret, \
+   jmp PV_INDIRECT(pv_cpu_ops+PV_CPU_usergs_sysret64), \
+   pv_cpu_ops, PV_CPU_usergs_sysret64, CLBR_NONE)
 
 #endif /* !CONFIG_X86_32 */
 
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index bfd02c3335cb..4216a3b02832 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static inline void load_sp0(struct tss_struct *tss,
 struct thread_struct *thread)
@@ -50,9 +51,10 @@ static inline void write_cr0(unsigned long x)
PVOP_VCALL1(pv_cpu_ops.write_cr0, x);
 }
 
-static inline unsigned long read_cr2(void)
+static __always_inline unsigned long read_cr2(void)
 {
-   return PVOP_CALL0(unsigned long, pv_mmu_ops.read_cr2);
+   return PVOP_ALT_CALL0(unsigned long, NATIVE_READ_CR2,
+ pv_mmu_ops.read_cr2);
 }
 
 static inline void write_cr2(unsigned long x)
@@ -60,14 +62,15 @@ static inline void write_cr2(unsigned long x)
PVOP_VCALL1(pv_mmu_ops.write_cr2, x);
 }
 
-static inline unsigned long __read_cr3(void)
+static __always_inline unsigned long __read_cr3(void)
 {
-   return PVOP_CALL0(unsigned long, pv_mmu_ops.read_cr3);
+   return PVOP_ALT_CALL0(unsigned long, NATIVE_READ_CR3,
+ pv_mmu_ops.read_cr3);
 }
 
-static inline void write_cr3(unsigned long x)
+static __always_inline void write_cr3(unsigned long x)
 {
-   PVOP_VCALL1(pv_mmu_ops.write_cr3, x);
+   PVOP_ALT_VCALL1(NATIVE_WRITE_CR3, pv_mmu_ops.write_cr3, x);
 }
 
 static inline void __write_cr4(unsigned long x)
@@ -291,9 +294,10 @@ static inline void __flush_tlb_global(void)
 {
PVOP_VCALL0(pv_mmu_ops.flush_tlb_kernel);
 }
-static inline void __flush_tlb_single(unsigned long addr)
+static __always_inline void __flush_tlb_single(unsigned long addr)
 {
-   PVOP_VCALL1(pv_mmu_ops.flush_tlb_single, addr);
+   PVOP_ALT_VCALL1(NATIVE_FLUSH_TLB_SINGLE, pv_mmu_ops.flush_tlb_single,
+   addr);
 }
 
 static inline void flush_tlb_others(const struct cpumask *cpumask,
@@ -761,24 +765,25 @@ static __always_inline bool pv_vcpu_is_preempted(long cpu)
 #define 

[PATCH 12/13] objtool: Add support for new .pv_altinstructions section

2017-10-26 Thread Josh Poimboeuf
Signed-off-by: Josh Poimboeuf 
---
 tools/objtool/special.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/tools/objtool/special.c b/tools/objtool/special.c
index 84f001d52322..dc15a3564fc9 100644
--- a/tools/objtool/special.c
+++ b/tools/objtool/special.c
@@ -63,6 +63,16 @@ struct special_entry entries[] = {
.feature = ALT_FEATURE_OFFSET,
},
{
+   .sec = ".pv_altinstructions",
+   .group = true,
+   .size = ALT_ENTRY_SIZE,
+   .orig = ALT_ORIG_OFFSET,
+   .orig_len = ALT_ORIG_LEN_OFFSET,
+   .new = ALT_NEW_OFFSET,
+   .new_len = ALT_NEW_LEN_OFFSET,
+   .feature = ALT_FEATURE_OFFSET,
+   },
+   {
.sec = "__jump_table",
.jump_or_nop = true,
.size = JUMP_ENTRY_SIZE,
-- 
2.13.6



[PATCH 10/13] x86/alternative: Support indirect call replacement

2017-10-26 Thread Josh Poimboeuf
Add alternative patching support for replacing an instruction with an
indirect call.  This will be needed for the paravirt alternatives.
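
A self-contained sketch of the displacement fixup added here, mirroring
the logic in user-space C (apply_alternatives() does this directly on the
copied replacement bytes):

  #include <stdint.h>
  #include <string.h>

  /* "call *disp32(%rip)" is encoded as ff 15 <rel32>, with rel32 taken
   * relative to the end of the instruction.  When the 6-byte replacement
   * is copied from .altinstr_replacement to the original call site, the
   * displacement must be rebased by the distance between the two sites. */
  static void fixup_indirect_call(uint8_t *buf, size_t len,
                                  const uint8_t *replacement,
                                  const uint8_t *instr)
  {
          int32_t rel;

          if (len != 6 || buf[0] != 0xff || buf[1] != 0x15)
                  return;

          memcpy(&rel, buf + 2, sizeof(rel));
          rel += (int32_t)(replacement - instr);   /* rebase rel32 */
          memcpy(buf + 2, &rel, sizeof(rel));
  }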

Signed-off-by: Josh Poimboeuf 
---
 arch/x86/kernel/alternative.c | 22 +++---
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 3344d3382e91..81c577c7deba 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -410,20 +410,28 @@ void __init_or_module noinline apply_alternatives(struct 
alt_instr *start,
insnbuf_sz = a->replacementlen;
 
/*
-* 0xe8 is a relative jump; fix the offset.
-*
-* Instruction length is checked before the opcode to avoid
-* accessing uninitialized bytes for zero-length replacements.
+* Fix the address offsets for call and jump instructions which
+* use PC-relative addressing.
 */
if (a->replacementlen == 5 && *insnbuf == 0xe8) {
+   /* direct call */
*(s32 *)(insnbuf + 1) += replacement - instr;
-   DPRINTK("Fix CALL offset: 0x%x, CALL 0x%lx",
+   DPRINTK("Fix direct CALL offset: 0x%x, CALL 0x%lx",
*(s32 *)(insnbuf + 1),
(unsigned long)instr + *(s32 *)(insnbuf + 1) + 
5);
-   }
 
-   if (a->replacementlen && is_jmp(replacement[0]))
+   } else if (a->replacementlen == 6 && *insnbuf == 0xff &&
+  *(insnbuf+1) == 0x15) {
+   /* indirect call */
+   *(s32 *)(insnbuf + 2) += replacement - instr;
+   DPRINTK("Fix indirect CALL offset: 0x%x, CALL *0x%lx",
+   *(s32 *)(insnbuf + 2),
+   (unsigned long)instr + *(s32 *)(insnbuf + 2) + 
6);
+
+   } else if (a->replacementlen && is_jmp(replacement[0])) {
+   /* direct jump */
recompute_jump(a, instr, replacement, insnbuf);
+   }
 
if (a->instrlen > a->replacementlen) {
add_nops(insnbuf + a->replacementlen,
-- 
2.13.6



[PATCH 09/13] x86/asm: Convert ALTERNATIVE*() assembler macros to preprocessor macros

2017-10-26 Thread Josh Poimboeuf
The ALTERNATIVE() and ALTERNATIVE_2() macros are GNU assembler macros,
which makes them quite inflexible for future changes.  Convert them to
preprocessor macros.

Signed-off-by: Josh Poimboeuf 
---
 arch/x86/entry/entry_32.S| 12 +++---
 arch/x86/entry/entry_64.S| 10 ++---
 arch/x86/entry/entry_64_compat.S |  8 ++--
 arch/x86/entry/vdso/vdso32/system_call.S | 10 ++---
 arch/x86/include/asm/alternative-asm.h   | 68 +++-
 arch/x86/include/asm/smap.h  |  4 +-
 arch/x86/lib/copy_page_64.S  |  2 +-
 arch/x86/lib/memcpy_64.S |  4 +-
 arch/x86/lib/memmove_64.S|  3 +-
 arch/x86/lib/memset_64.S |  4 +-
 10 files changed, 59 insertions(+), 66 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 21d1197779a4..338dc838a9a8 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -443,8 +443,8 @@ ENTRY(entry_SYSENTER_32)
movl%esp, %eax
calldo_fast_syscall_32
/* XEN PV guests always use IRET path */
-   ALTERNATIVE "testl %eax, %eax; jz .Lsyscall_32_done", \
-   "jmp .Lsyscall_32_done", X86_FEATURE_XENPV
+   #define JMP_IF_IRET testl %eax, %eax; jz .Lsyscall_32_done
+   ALTERNATIVE(JMP_IF_IRET, jmp .Lsyscall_32_done, X86_FEATURE_XENPV)
 
 /* Opportunistic SYSEXIT */
TRACE_IRQS_ON   /* User mode traces as IRQs on. */
@@ -536,7 +536,7 @@ restore_all:
TRACE_IRQS_IRET
 .Lrestore_all_notrace:
 #ifdef CONFIG_X86_ESPFIX32
-   ALTERNATIVE "jmp .Lrestore_nocheck", "", X86_BUG_ESPFIX
+   ALTERNATIVE(jmp .Lrestore_nocheck, , X86_BUG_ESPFIX)
 
movlPT_EFLAGS(%esp), %eax   # mix EFLAGS, SS and CS
/*
@@ -692,9 +692,9 @@ ENTRY(simd_coprocessor_error)
pushl   $0
 #ifdef CONFIG_X86_INVD_BUG
/* AMD 486 bug: invd from userspace calls exception 19 instead of #GP */
-   ALTERNATIVE "pushl  $do_general_protection",\
-   "pushl  $do_simd_coprocessor_error",\
-   X86_FEATURE_XMM
+   ALTERNATIVE(pushl   $do_general_protection,
+   pushl   $do_simd_coprocessor_error,
+   X86_FEATURE_XMM)
 #else
pushl   $do_simd_coprocessor_error
 #endif
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index c7c85724d7e0..49733c72619a 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -925,7 +925,7 @@ ENTRY(native_load_gs_index)
SWAPGS
 .Lgs_change:
movl%edi, %gs
-2: ALTERNATIVE "", "mfence", X86_BUG_SWAPGS_FENCE
+2: ALTERNATIVE(, mfence, X86_BUG_SWAPGS_FENCE)
SWAPGS
popfq
FRAME_END
@@ -938,12 +938,8 @@ EXPORT_SYMBOL(native_load_gs_index)
/* running with kernelgs */
 bad_gs:
SWAPGS  /* switch back to user gs */
-.macro ZAP_GS
-   /* This can't be a string because the preprocessor needs to see it. */
-   movl $__USER_DS, %eax
-   movl %eax, %gs
-.endm
-   ALTERNATIVE "", "ZAP_GS", X86_BUG_NULL_SEG
+   #define ZAP_GS movl $__USER_DS, %eax; movl %eax, %gs
+   ALTERNATIVE(, ZAP_GS, X86_BUG_NULL_SEG)
xorl%eax, %eax
movl%eax, %gs
jmp 2b
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 4d9385529c39..16e82b5103b5 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -124,8 +124,8 @@ ENTRY(entry_SYSENTER_compat)
movq%rsp, %rdi
calldo_fast_syscall_32
/* XEN PV guests always use IRET path */
-   ALTERNATIVE "testl %eax, %eax; jz .Lsyscall_32_done", \
-   "jmp .Lsyscall_32_done", X86_FEATURE_XENPV
+   #define JMP_IF_IRET testl %eax, %eax; jz .Lsyscall_32_done
+   ALTERNATIVE(JMP_IF_IRET, jmp .Lsyscall_32_done, X86_FEATURE_XENPV)
jmp sysret32_from_system_call
 
 .Lsysenter_fix_flags:
@@ -224,8 +224,8 @@ GLOBAL(entry_SYSCALL_compat_after_hwframe)
movq%rsp, %rdi
calldo_fast_syscall_32
/* XEN PV guests always use IRET path */
-   ALTERNATIVE "testl %eax, %eax; jz .Lsyscall_32_done", \
-   "jmp .Lsyscall_32_done", X86_FEATURE_XENPV
+   ALTERNATIVE(JMP_IF_IRET,
+   jmp .Lsyscall_32_done, X86_FEATURE_XENPV)
 
/* Opportunistic SYSRET */
 sysret32_from_system_call:
diff --git a/arch/x86/entry/vdso/vdso32/system_call.S 
b/arch/x86/entry/vdso/vdso32/system_call.S
index ed4bc9731cbb..a0c5f9e8226c 100644
--- a/arch/x86/entry/vdso/vdso32/system_call.S
+++ b/arch/x86/entry/vdso/vdso32/system_call.S
@@ -48,15 +48,15 @@ __kernel_vsyscall:
CFI_ADJUST_CFA_OFFSET   4
CFI_REL_OFFSET  ebp, 0
 
-   #define SYSENTER_SEQUENCE   "movl %esp, %ebp; sysenter"
-   #define SYSCALL_SEQUENCE  

[PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure

2017-10-26 Thread Josh Poimboeuf
With CONFIG_PARAVIRT, the kernel .text is littered with a bunch of calls
to pv_irq_ops function pointers, like:

  callq  *0x81e3a400 (pv_irq_ops.save_fl)

In non-Xen paravirt environments -- including native, KVM, Hyper-V, and
VMware -- the above code gets patched by native_patch() to look like
this instead:

   pushfq
   pop%rax
   nopl   0x0(%rax,%rax,1)

So in most scenarios, there's a mismatch between what vmlinux shows and
the actual runtime code.  This mismatch hurts debuggability and makes
the assembly code harder to understand.
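
For readers less familiar with pv ops, a minimal user-space model of the
situation described above (names are illustrative, not the kernel's):

  struct irq_ops_example {
          unsigned long (*save_fl)(void);
  };

  static unsigned long native_save_fl_example(void)
  {
          unsigned long flags;

          asm volatile("pushfq\n\t"
                       "popq %0" : "=r" (flags) : : "memory");
          return flags;
  }

  static struct irq_ops_example irq_ops_example = {
          .save_fl = native_save_fl_example,
  };

  static unsigned long arch_local_save_flags_example(void)
  {
          /* what vmlinux shows today: an indirect call through the table,
           * even though most systems only ever run the native function */
          return irq_ops_example.save_fl();
  }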

It also causes the ORC unwinder to produce KASAN warnings like:

  BUG: KASAN: stack-out-of-bounds in deref_stack_reg+0x123/0x140

This warning is due to the fact that objtool doesn't know about
parainstructions, so it doesn't know about the "pushfq; pop %rax"
sequence above.

Prepare to fix both of these issues (debuggability and ORC KASAN
warnings) by adding a paravirt alternatives infrastructure to put the
native instructions in .text by default.  Then, when booting on a
hypervisor, replace the native instructions with pv ops calls.

The pv ops calls need to be available much earlier than when
alternatives are normally applied.  So put these alternatives in a
dedicated ".pv_alternatives" section.

So now these instructions may be patched twice:

- in apply_pv_alternatives(), to allow the kernel to boot in the
  virtualized environment;

- and again in apply_paravirt(), to enable performance improvements
  (e.g., replacing an indirect call with a direct call).

That's a bit more complex, but overall this approach should cause less
confusion than before because the vmlinux code is now much more likely
to represent the actual runtime state of the code in the most common
paravirt cases (everything except Xen and vSMP).

It could be simplified by redesigning the paravirt patching code such
that it uses alternatives for all of its patching.  Instead of using pv
ops to specify which functions they need, they would instead set CPU
feature bits, which would then be used by the alternatives to decide
what to replace the native code with.  Then each site would only be
patched once.

But that's going to be a bit more work.  At least this patch creates a
good foundation for eventually getting rid of .parainstructions and pv
ops completely.

Suggested-by: Andy Lutomirski 
Signed-off-by: Josh Poimboeuf 
---
 arch/x86/include/asm/alternative-asm.h |  9 +++-
 arch/x86/include/asm/alternative.h | 12 +++--
 arch/x86/include/asm/cpufeatures.h |  1 +
 arch/x86/include/asm/paravirt-asm.h| 10 
 arch/x86/include/asm/paravirt_types.h  | 84 ++
 arch/x86/kernel/alternative.c  | 13 ++
 arch/x86/kernel/cpu/hypervisor.c   |  2 +
 arch/x86/kernel/module.c   | 11 -
 arch/x86/kernel/vmlinux.lds.S  |  6 +++
 arch/x86/xen/enlighten_pv.c|  1 +
 10 files changed, 141 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/alternative-asm.h 
b/arch/x86/include/asm/alternative-asm.h
index 60073947350d..0ced2e3d0a30 100644
--- a/arch/x86/include/asm/alternative-asm.h
+++ b/arch/x86/include/asm/alternative-asm.h
@@ -39,14 +39,14 @@
  * @newinstr. ".skip" directive takes care of proper instruction padding
  * in case @newinstr is longer than @oldinstr.
  */
-#define ALTERNATIVE(oldinstr, newinstr, feature)   \
+#define __ALTERNATIVE(section, oldinstr, newinstr, feature)\
 140:;  \
oldinstr;   \
 141:;  \
.skip -(((144f-143f)-(141b-140b)) > 0) *\
((144f-143f)-(141b-140b)),0x90; \
 142:;  \
-   .pushsection .altinstructions, "a"; \
+   .pushsection section, "a";  \
altinstruction_entry 140b,143f,feature,142b-140b,144f-143f,142b-141b;\
.popsection;\
.pushsection .altinstr_replacement, "ax";   \
@@ -55,6 +55,11 @@
 144:;  \
.popsection
 
+#define ARGS(args...) args
+
+#define ALTERNATIVE(oldinstr, newinstr, feature)   \
+   __ALTERNATIVE(.altinstructions, ARGS(oldinstr), ARGS(newinstr), feature)
+
 #define old_len141b-140b
 #define new_len1   144f-143f
 #define new_len2   145f-144f
diff --git a/arch/x86/include/asm/alternative.h 
b/arch/x86/include/asm/alternative.h
index c096624137ae..8482f90d5078 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -61,6 +61,7 @@ 

[PATCH 08/13] x86/paravirt: Clean up paravirt_types.h

2017-10-26 Thread Josh Poimboeuf
Make paravirt_types.h more understandable:

- Use more consistent and logical naming
- Simplify interfaces
- Put related macros together
- Improve whitespace

Signed-off-by: Josh Poimboeuf 
---
 arch/x86/include/asm/paravirt_types.h | 104 ++
 1 file changed, 54 insertions(+), 50 deletions(-)

diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 01f9e10983c1..5656aea79412 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -331,33 +331,6 @@ extern struct pv_irq_ops pv_irq_ops;
 extern struct pv_mmu_ops pv_mmu_ops;
 extern struct pv_lock_ops pv_lock_ops;
 
-#define PARAVIRT_PATCH(x)  \
-   (offsetof(struct paravirt_patch_template, x) / sizeof(void *))
-
-#define paravirt_type(op)  \
-   [paravirt_typenum] "i" (PARAVIRT_PATCH(op)),\
-   [paravirt_opptr] "i" (&(op))
-#define paravirt_clobber(clobber)  \
-   [paravirt_clobber] "i" (clobber)
-
-/*
- * Generate some code, and mark it as patchable by the
- * apply_paravirt() alternate instruction patcher.
- */
-#define _paravirt_alt(insn_string, type, clobber)  \
-   "771:\n\t" insn_string "\n" "772:\n"\
-   ".pushsection .parainstructions,\"a\"\n"\
-   _ASM_ALIGN "\n" \
-   _ASM_PTR " 771b\n"  \
-   "  .byte " type "\n"\
-   "  .byte 772b-771b\n"   \
-   "  .short " clobber "\n"\
-   ".popsection\n"
-
-/* Generate patchable code, with the default asm parameters. */
-#define paravirt_alt(insn_string)  \
-   _paravirt_alt(insn_string, "%c[paravirt_typenum]", 
"%c[paravirt_clobber]")
-
 /* Simple instruction patching code. */
 #define NATIVE_LABEL(a,x,b) "\n" a #x "_" #b ":\n\t"
 
@@ -388,13 +361,46 @@ unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
 
 int paravirt_disable_iospace(void);
 
+
 /*
- * This generates an indirect call based on the operation type number.
- * The type number, computed in PARAVIRT_PATCH, is derived from the
- * offset into the paravirt_patch_template structure, and can therefore be
- * freely converted back into a structure offset.
+ * Generate some code, and mark it as patchable by apply_paravirt().
  */
-#define PARAVIRT_CALL  "call *%c[paravirt_opptr];"
+#define _PV_SITE(insn_string, type, clobber)   \
+   "771:\n\t" insn_string "\n" "772:\n"\
+   ".pushsection .parainstructions,\"a\"\n"\
+   _ASM_ALIGN "\n" \
+   _ASM_PTR " 771b\n"  \
+   "  .byte " type "\n"\
+   "  .byte 772b-771b\n"   \
+   "  .short " clobber "\n"\
+   ".popsection\n"
+
+#define PARAVIRT_PATCH(x)  \
+   (offsetof(struct paravirt_patch_template, x) / sizeof(void *))
+
+#define PV_STRINGIFY(constraint)   "%c[" __stringify(constraint) "]"
+
+#define PV_CALL_CONSTRAINT pv_op_ptr
+#define PV_TYPE_CONSTRAINT pv_typenum
+#define PV_CLBR_CONSTRAINT pv_clobber
+
+#define PV_CALL_CONSTRAINT_STR PV_STRINGIFY(PV_CALL_CONSTRAINT)
+#define PV_TYPE_CONSTRAINT_STR PV_STRINGIFY(PV_TYPE_CONSTRAINT)
+#define PV_CLBR_CONSTRAINT_STR PV_STRINGIFY(PV_CLBR_CONSTRAINT)
+
+#define PV_CALL_STR"call *" PV_CALL_CONSTRAINT_STR ";"
+
+#define PV_INPUT_CONSTRAINTS(op, clobber)  \
+   [PV_TYPE_CONSTRAINT] "i" (PARAVIRT_PATCH(op)),  \
+   [PV_CALL_CONSTRAINT] "i" (&(op)),   \
+   [PV_CLBR_CONSTRAINT] "i" (clobber)
+
+#define PV_SITE(insn_string)   \
+   _PV_SITE(insn_string, PV_TYPE_CONSTRAINT_STR, PV_CLBR_CONSTRAINT_STR)
+
+#define PV_ALT_SITE(oldinstr, newinstr)
\
+   _PV_ALT_SITE(oldinstr, newinstr, PV_TYPE_CONSTRAINT_STR,\
+PV_CLBR_CONSTRAINT_STR)
 
 /*
  * These macros are intended to wrap calls through one of the paravirt
@@ -525,25 +531,24 @@ int paravirt_disable_iospace(void);
 
 #define PVOP_CALL(rettype, op, clbr, call_clbr, extra_clbr,
\
  pre, post, ...)   \
-   ({  \
-   rettype __ret;  \
-   PVOP_CALL_ARGS; \
-   

[PATCH 07/13] x86/paravirt: Simplify ____PVOP_CALL()

2017-10-26 Thread Josh Poimboeuf
Remove the inline asm duplication in PVOP_CALL().

Also add 'IS_ENABLED(CONFIG_X86_32)' to the return variable logic,
making the code clearer and rendering the comment unnecessary.
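
A stand-alone sketch of why the IS_ENABLED() form works
(IS_ENABLED_EXAMPLE stands in for the kernel's IS_ENABLED(CONFIG_X86_32)):

  #define IS_ENABLED_EXAMPLE 0   /* stand-in: 1 on 32-bit, 0 on 64-bit */

  static unsigned long long fold_return(unsigned long eax,
                                        unsigned long edx,
                                        unsigned int retsize)
  {
          /* Both branches are parsed and type-checked, but the constant
           * condition lets the compiler drop the 32-bit-only path on
           * 64-bit builds, so there is no runtime cost and no #ifdef. */
          if (IS_ENABLED_EXAMPLE && retsize > sizeof(unsigned long))
                  return (((unsigned long long)edx) << 32) | eax;
          return eax;
  }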

Signed-off-by: Josh Poimboeuf 
---
 arch/x86/include/asm/paravirt_types.h | 36 +--
 1 file changed, 13 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index ab7aabe6b668..01f9e10983c1 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -529,29 +529,19 @@ int paravirt_disable_iospace(void);
rettype __ret;  \
PVOP_CALL_ARGS; \
PVOP_TEST_NULL(op); \
-   /* This is 32-bit specific, but is okay in 64-bit */\
-   /* since this condition will never hold */  \
-   if (sizeof(rettype) > sizeof(unsigned long)) {  \
-   asm volatile(pre\
-paravirt_alt(PARAVIRT_CALL)\
-post   \
-: call_clbr, ASM_CALL_CONSTRAINT   \
-: paravirt_type(op),   \
-  paravirt_clobber(clbr),  \
-  ##__VA_ARGS__\
-: "memory", "cc" extra_clbr);  \
-   __ret = (rettype)u64)__edx) << 32) | __eax); \
-   } else {\
-   asm volatile(pre\
-paravirt_alt(PARAVIRT_CALL)\
-post   \
-: call_clbr, ASM_CALL_CONSTRAINT   \
-: paravirt_type(op),   \
-  paravirt_clobber(clbr),  \
-  ##__VA_ARGS__\
-: "memory", "cc" extra_clbr);  \
-   __ret = (rettype)(__eax & PVOP_RETMASK(rettype));   
\
-   }   \
+   asm volatile(pre\
+paravirt_alt(PARAVIRT_CALL)\
+post   \
+: call_clbr, ASM_CALL_CONSTRAINT   \
+: paravirt_type(op),   \
+  paravirt_clobber(clbr),  \
+  ##__VA_ARGS__\
+: "memory", "cc" extra_clbr);  \
+   if (IS_ENABLED(CONFIG_X86_32) &&\
+   sizeof(rettype) > sizeof(unsigned long))\
+   __ret = (rettype)u64)__edx) << 32) | __eax);\
+   else\
+   __ret = (rettype)(__eax & PVOP_RETMASK(rettype));\
__ret;  \
})
 
-- 
2.13.6



[PATCH 06/13] x86/paravirt: Clean up paravirt-asm.h

2017-10-26 Thread Josh Poimboeuf
Some cleanup to make the code easier to read and understand:

- Use the common "PV_" prefix
- Simplify the PV_SITE macro interface
- Improve whitespace

Signed-off-by: Josh Poimboeuf 
---
 arch/x86/include/asm/paravirt-asm.h | 95 +++--
 1 file changed, 49 insertions(+), 46 deletions(-)

diff --git a/arch/x86/include/asm/paravirt-asm.h 
b/arch/x86/include/asm/paravirt-asm.h
index add8a190fdac..8bdd50ee4bf3 100644
--- a/arch/x86/include/asm/paravirt-asm.h
+++ b/arch/x86/include/asm/paravirt-asm.h
@@ -7,16 +7,18 @@
 #include 
 #include 
 
-#define _PVSITE(ptype, clobbers, ops, word, algn)  \
-771:;  \
-   ops;\
-772:;  \
-   .pushsection .parainstructions,"a"; \
-.align algn;   \
-word 771b; \
-.byte ptype;   \
-.byte 772b-771b;   \
-.short clobbers;   \
+#define PV_TYPE(ops, off) ((PARAVIRT_PATCH_##ops + (off)) / __ASM_SEL(4, 8))
+
+#define PV_SITE(insns, ops, off, clobbers) \
+771:;  \
+   insns;  \
+772:;  \
+   .pushsection .parainstructions, "a";\
+_ASM_ALIGN;\
+_ASM_PTR 771b; \
+.byte PV_TYPE(ops, off);   \
+.byte 772b-771b;   \
+.short clobbers;   \
.popsection
 
 
@@ -33,62 +35,65 @@
COND_PUSH(set, CLBR_RDX, rdx);  \
COND_PUSH(set, CLBR_RSI, rsi);  \
COND_PUSH(set, CLBR_RDI, rdi);  \
-   COND_PUSH(set, CLBR_R8, r8);\
-   COND_PUSH(set, CLBR_R9, r9);\
+   COND_PUSH(set, CLBR_R8,  r8);   \
+   COND_PUSH(set, CLBR_R9,  r9);   \
COND_PUSH(set, CLBR_R10, r10);  \
COND_PUSH(set, CLBR_R11, r11)
+
 #define PV_RESTORE_REGS(set)   \
COND_POP(set, CLBR_R11, r11);   \
COND_POP(set, CLBR_R10, r10);   \
-   COND_POP(set, CLBR_R9, r9); \
-   COND_POP(set, CLBR_R8, r8); \
+   COND_POP(set, CLBR_R9,  r9);\
+   COND_POP(set, CLBR_R8,  r8);\
COND_POP(set, CLBR_RDI, rdi);   \
COND_POP(set, CLBR_RSI, rsi);   \
COND_POP(set, CLBR_RDX, rdx);   \
COND_POP(set, CLBR_RCX, rcx);   \
COND_POP(set, CLBR_RAX, rax)
 
-#define PARA_PATCH(struct, off)((PARAVIRT_PATCH_##struct + (off)) / 8)
-#define PARA_SITE(ptype, clobbers, ops) _PVSITE(ptype, clobbers, ops, .quad, 8)
-#define PARA_INDIRECT(addr)*addr(%rip)
-#else
+#define PV_INDIRECT(addr)  *addr(%rip)
+
+#else /* !CONFIG_X86_64 */
+
 #define PV_SAVE_REGS(set)  \
COND_PUSH(set, CLBR_EAX, eax);  \
COND_PUSH(set, CLBR_EDI, edi);  \
COND_PUSH(set, CLBR_ECX, ecx);  \
COND_PUSH(set, CLBR_EDX, edx)
+
 #define PV_RESTORE_REGS(set)   \
COND_POP(set, CLBR_EDX, edx);   \
COND_POP(set, CLBR_ECX, ecx);   \
COND_POP(set, CLBR_EDI, edi);   \
COND_POP(set, CLBR_EAX, eax)
 
-#define PARA_PATCH(struct, off)((PARAVIRT_PATCH_##struct + (off)) / 4)
-#define PARA_SITE(ptype, clobbers, ops) _PVSITE(ptype, clobbers, ops, .long, 4)
-#define PARA_INDIRECT(addr)*%cs:addr
-#endif
+#define PV_INDIRECT(addr)  *%cs:addr
+
+#endif /* !CONFIG_X86_64 */
 
 #define INTERRUPT_RETURN   \
-   PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_iret), CLBR_NONE,   \
- jmp PARA_INDIRECT(pv_cpu_ops+PV_CPU_iret))
+   PV_SITE(jmp PV_INDIRECT(pv_cpu_ops+PV_CPU_iret),\
+   pv_cpu_ops, PV_CPU_iret, CLBR_NONE)
 
 #define DISABLE_INTERRUPTS(clobbers)   \
-   PARA_SITE(PARA_PATCH(pv_irq_ops, PV_IRQ_irq_disable), clobbers, \
- PV_SAVE_REGS(clobbers | CLBR_CALLEE_SAVE);\
- call PARA_INDIRECT(pv_irq_ops+PV_IRQ_irq_disable);\
- PV_RESTORE_REGS(clobbers | CLBR_CALLEE_SAVE);)
+   PV_SITE(PV_SAVE_REGS(clobbers | CLBR_CALLEE_SAVE);  \
+   call PV_INDIRECT(pv_irq_ops+PV_IRQ_irq_disable);\
+   PV_RESTORE_REGS(clobbers | CLBR_CALLEE_SAVE),   \
+   pv_irq_ops, 

[PATCH 05/13] x86/paravirt: Move paravirt asm macros to paravirt-asm.h

2017-10-26 Thread Josh Poimboeuf
The paravirt.h file is quite big and the asm interfaces for paravirt
don't need to be in the same file as the C interfaces.  Move the asm
interfaces to a dedicated header file.

Signed-off-by: Josh Poimboeuf 
---
 arch/x86/entry/entry_32.S   |   1 +
 arch/x86/entry/entry_64.S   |   2 +-
 arch/x86/entry/entry_64_compat.S|   1 +
 arch/x86/include/asm/paravirt-asm.h | 126 ++
 arch/x86/include/asm/paravirt.h | 132 +++-
 arch/x86/kernel/head_64.S   |   2 +-
 6 files changed, 138 insertions(+), 126 deletions(-)
 create mode 100644 arch/x86/include/asm/paravirt-asm.h

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 8a13d468635a..21d1197779a4 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -40,6 +40,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 49167258d587..c7c85724d7e0 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -30,7 +30,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index e26c25ca7756..4d9385529c39 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
diff --git a/arch/x86/include/asm/paravirt-asm.h 
b/arch/x86/include/asm/paravirt-asm.h
new file mode 100644
index ..add8a190fdac
--- /dev/null
+++ b/arch/x86/include/asm/paravirt-asm.h
@@ -0,0 +1,126 @@
+#ifndef _ASM_X86_PARAVIRT_ASM_H
+#define _ASM_X86_PARAVIRT_ASM_H
+
+#ifdef CONFIG_PARAVIRT
+#ifdef __ASSEMBLY__
+
+#include 
+#include 
+
+#define _PVSITE(ptype, clobbers, ops, word, algn)  \
+771:;  \
+   ops;\
+772:;  \
+   .pushsection .parainstructions,"a"; \
+.align algn;   \
+word 771b; \
+.byte ptype;   \
+.byte 772b-771b;   \
+.short clobbers;   \
+   .popsection
+
+
+#define COND_PUSH(set, mask, reg)  \
+   .if ((~(set)) & mask); push %reg; .endif
+#define COND_POP(set, mask, reg)   \
+   .if ((~(set)) & mask); pop %reg; .endif
+
+#ifdef CONFIG_X86_64
+
+#define PV_SAVE_REGS(set)  \
+   COND_PUSH(set, CLBR_RAX, rax);  \
+   COND_PUSH(set, CLBR_RCX, rcx);  \
+   COND_PUSH(set, CLBR_RDX, rdx);  \
+   COND_PUSH(set, CLBR_RSI, rsi);  \
+   COND_PUSH(set, CLBR_RDI, rdi);  \
+   COND_PUSH(set, CLBR_R8, r8);\
+   COND_PUSH(set, CLBR_R9, r9);\
+   COND_PUSH(set, CLBR_R10, r10);  \
+   COND_PUSH(set, CLBR_R11, r11)
+#define PV_RESTORE_REGS(set)   \
+   COND_POP(set, CLBR_R11, r11);   \
+   COND_POP(set, CLBR_R10, r10);   \
+   COND_POP(set, CLBR_R9, r9); \
+   COND_POP(set, CLBR_R8, r8); \
+   COND_POP(set, CLBR_RDI, rdi);   \
+   COND_POP(set, CLBR_RSI, rsi);   \
+   COND_POP(set, CLBR_RDX, rdx);   \
+   COND_POP(set, CLBR_RCX, rcx);   \
+   COND_POP(set, CLBR_RAX, rax)
+
+#define PARA_PATCH(struct, off)((PARAVIRT_PATCH_##struct + (off)) / 8)
+#define PARA_SITE(ptype, clobbers, ops) _PVSITE(ptype, clobbers, ops, .quad, 8)
+#define PARA_INDIRECT(addr)*addr(%rip)
+#else
+#define PV_SAVE_REGS(set)  \
+   COND_PUSH(set, CLBR_EAX, eax);  \
+   COND_PUSH(set, CLBR_EDI, edi);  \
+   COND_PUSH(set, CLBR_ECX, ecx);  \
+   COND_PUSH(set, CLBR_EDX, edx)
+#define PV_RESTORE_REGS(set)   \
+   COND_POP(set, CLBR_EDX, edx);   \
+   COND_POP(set, CLBR_ECX, ecx);   \
+   COND_POP(set, CLBR_EDI, edi);   \
+   COND_POP(set, CLBR_EAX, eax)
+
+#define PARA_PATCH(struct, off)((PARAVIRT_PATCH_##struct + (off)) / 4)
+#define PARA_SITE(ptype, clobbers, ops) _PVSITE(ptype, clobbers, ops, .long, 4)
+#define PARA_INDIRECT(addr)*%cs:addr
+#endif
+
+#define INTERRUPT_RETURN   \
+   PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_iret), CLBR_NONE,   \
+ jmp PARA_INDIRECT(pv_cpu_ops+PV_CPU_iret))
+
+#define DISABLE_INTERRUPTS(clobbers)   \
+   PARA_SITE(PARA_PATCH(pv_irq_ops, PV_IRQ_irq_disable), clobbers, \
+ PV_SAVE_REGS(clobbers | CLBR_CALLEE_SAVE);\
+ call 

[PATCH 04/13] x86/paravirt: Convert DEF_NATIVE macro to GCC extended asm syntax

2017-10-26 Thread Josh Poimboeuf
In a future patch, the NATIVE_* instruction string macros will be used
in GCC extended inline asm, which requires registers to have two '%'
instead of one in the asm template string.  Convert the DEF_NATIVE macro
to the GCC extended asm syntax so the NATIVE_* macros can be shared more
broadly.
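
A minimal sketch of why the doubling matters (the function below is only an
illustration, not part of the patch): in basic asm the template is emitted
verbatim, so a single '%' names a register; in extended asm, '%' introduces
an operand reference and a literal register must be written with '%%':

    static inline unsigned long read_cr2_example(void)
    {
    	unsigned long val;

    	/* Basic asm would be:  asm("mov %cr2, %rax");            */
    	/* Extended asm: '%0' is operand 0, '%%cr2' emits '%cr2'. */
    	asm volatile("mov %%cr2, %0" : "=r" (val));
    	return val;
    }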

Signed-off-by: Josh Poimboeuf 
---
 arch/x86/include/asm/paravirt_types.h | 10 +++---
 arch/x86/include/asm/special_insns.h  | 14 +++---
 2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index e99e5ac3e036..ab7aabe6b668 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -359,11 +359,15 @@ extern struct pv_lock_ops pv_lock_ops;
_paravirt_alt(insn_string, "%c[paravirt_typenum]", "%c[paravirt_clobber]")
 
 /* Simple instruction patching code. */
-#define NATIVE_LABEL(a,x,b) "\n\t.globl " a #x "_" #b "\n" a #x "_" #b ":\n\t"
+#define NATIVE_LABEL(a,x,b) "\n" a #x "_" #b ":\n\t"
 
 #define DEF_NATIVE(ops, name, code)\
-   __visible extern const char start_##ops##_##name[], end_##ops##_##name[];   \
-   asm(NATIVE_LABEL("start_", ops, name) code NATIVE_LABEL("end_", ops, name))
+static inline void __used __native_ ## name ## _insns(void) {  \
+   asm volatile(NATIVE_LABEL("start_", ops, name)  \
+code   \
+NATIVE_LABEL("end_", ops, name) : );   \
+} \
+__visible extern const char start_##ops##_##name[], end_##ops##_##name[];
 
 unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len);
 unsigned paravirt_patch_ident_64(void *insnbuf, unsigned len);
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 0549c5f2c1b3..4b89668f2862 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -7,14 +7,14 @@
 #include 
 
 #ifdef CONFIG_X86_64
-# define _REG_ARG1 "%rdi"
-# define NATIVE_IDENTITY_32"mov %edi, %eax"
+# define _REG_ARG1 "%%rdi"
+# define NATIVE_IDENTITY_32"mov %%edi, %%eax"
 # define NATIVE_USERGS_SYSRET64"swapgs; sysretq"
 #else
-# define _REG_ARG1 "%eax"
+# define _REG_ARG1 "%%eax"
 #endif
 
-#define _REG_RET   "%" _ASM_AX
+#define _REG_RET   "%%" _ASM_AX
 
 #define NATIVE_ZERO"xor " _REG_ARG1 ", " _REG_ARG1
 #define NATIVE_IDENTITY"mov " _REG_ARG1 ", " _REG_RET
@@ -22,9 +22,9 @@
 #define NATIVE_RESTORE_FL  "push " _REG_ARG1 "; popf"
 #define NATIVE_IRQ_DISABLE "cli"
 #define NATIVE_IRQ_ENABLE  "sti"
-#define NATIVE_READ_CR2"mov %cr2, " _REG_RET
-#define NATIVE_READ_CR3"mov %cr3, " _REG_RET
-#define NATIVE_WRITE_CR3   "mov " _REG_ARG1 ", %cr3"
+#define NATIVE_READ_CR2"mov %%cr2, " _REG_RET
+#define NATIVE_READ_CR3"mov %%cr3, " _REG_RET
+#define NATIVE_WRITE_CR3   "mov " _REG_ARG1 ", %%cr3"
 #define NATIVE_FLUSH_TLB_SINGLE"invlpg (" _REG_ARG1 ")"
 #define NATIVE_SWAPGS  "swapgs"
 #define NATIVE_IRET"iret"
-- 
2.13.6



Re: [virtio-dev] [RFC] virtio-iommu version 0.4

2017-10-26 Thread Auger Eric
Hi Jean,

On 04/08/2017 20:19, Jean-Philippe Brucker wrote:
> This is the continuation of my proposal for virtio-iommu, the para-
> virtualized IOMMU. Here is a summary of the changes since last time [1]:
> 
> * The virtio-iommu document now resembles an actual specification. It is
>   split into a formal description of the virtio device, and implementation
>   notes. Please find sources and binaries at [2].
> 
> * Added a probe request to describe to the guest different properties that
>   do not fit in firmware or in the virtio config space. This is a
>   necessary stepping stone for extending the virtio-iommu.
> 
> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
>   Bhushan.
> 
> You can find the Linux driver and kvmtool device at [4] and [5]. I
> plan to rework driver and kvmtool device slightly before sending the
> patches.
> 
> To understand the virtio-iommu, I advise to first read introduction and
> motivation, then skim through implementation notes and finally look at the
> device specification.
> 
> I wasn't sure how to organize the review. For those who prefer to comment
> inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
> to this thread. They are the biggest chunks of the document. But LaTeX
> isn't very pleasant to read, so you can simply send a list of comments in
> relation to section numbers and a few words of context, we'll manage.
> 
> ---
> Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare
> more easily differences since the RFC (see [6]), but haven't been made
> public so far. This is the first public posting since initial proposal
> [1], and the following describes all changes.
> 
> ## v0.1 ##
> 
> Content is the same as the RFC, but formatted to LaTeX. 'make' generates
> one PDF and one HTML document.
> 
> ## v0.2 ##
> 
> Add introductions, improve topology example and firmware description based
> on feedback and a number of useful discussions.
> 
> ## v0.3 ##
> 
> Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
> the device and driver behaviour. Unmap semantics are consolidated; they
> are now closer to VFIO Type1 v2 semantics.
> 
> ## v0.4 ##
> 
> Introduce PROBE requests. They provide per-endpoint information to the
> driver that couldn't be described otherwise.
> 
> For the moment, they allow to handle MSIs on x86 virtual platforms (see
> 3.2). To do that we communicate reserved IOVA regions, that will also be
> useful for describing regions that cannot be mapped for a given endpoint,
> for instance addresses that correspond to a PCI bridge window.
> 
> Introducing such a large framework for this tiny feature may seem
> overkill, but it is needed for future extensions of the virtio-iommu and I
> believe it really is worth the effort.
> 
> ## Future ##
> 
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.
> 
> * Then on top of that, a major improvement will describe hardware
>   acceleration features available to the guest. There is what I call "Page
>   Table Handover" (or simply, from the host POV, "Nested"), the ability
>   for the guest to manipulate its own page tables instead of sending
>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>   reporting, will also permit SVM virtualization on different platforms.
> 
> Thanks,
> Jean
> 
> [1] http://www.spinics.net/lists/kvm/msg147990.html
> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
> 
> http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
> I reiterate the disclaimers: don't use this document as a reference,
> it's a draft. It's also not an OASIS document yet. It may be riddled
> with mistakes. As this is a working draft, it is unstable and I do not
> guarantee backward compatibility of future versions.
> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg4.html
> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
> Warning: UAPI headers have changed! They didn't follow the spec,
> please update. (Use branch v0.1, that has the old headers, for the
> Qemu prototype [3])
When rebasing the v0.4 driver on master I observe a regression: commands
are no longer received properly by QEMU (typically an attach command
arrives with a type of 0). Bisecting the guest kernel shows that the
problem first appears at this commit:

commit 

[PATCH 01/13] x86/paravirt: remove wbinvd() paravirt interface

2017-10-26 Thread Josh Poimboeuf
Since lguest was removed, only the native version of wbinvd() is used.
The paravirt interface is no longer needed.

Signed-off-by: Josh Poimboeuf 
---
 arch/x86/include/asm/paravirt.h   | 5 -
 arch/x86/include/asm/paravirt_types.h | 1 -
 arch/x86/include/asm/special_insns.h  | 7 +--
 arch/x86/kernel/paravirt.c| 1 -
 arch/x86/kernel/paravirt_patch_64.c   | 2 --
 arch/x86/xen/enlighten_pv.c   | 2 --
 6 files changed, 1 insertion(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 12deec722cf0..2f51fbf175da 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -98,11 +98,6 @@ static inline void halt(void)
PVOP_VCALL0(pv_irq_ops.halt);
 }
 
-static inline void wbinvd(void)
-{
-   PVOP_VCALL0(pv_cpu_ops.wbinvd);
-}
-
 #define get_kernel_rpl()  (pv_info.kernel_rpl)
 
 static inline u64 paravirt_read_msr(unsigned msr)
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 280d94c36dad..0e112f279514 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -137,7 +137,6 @@ struct pv_cpu_ops {
 
void (*set_iopl_mask)(unsigned mask);
 
-   void (*wbinvd)(void);
void (*io_delay)(void);
 
/* cpuid emulation, mostly so that caps bits can be disabled */
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index a24dfcf79f4a..ac402c6fc24b 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -128,7 +128,7 @@ static inline void __write_pkru(u32 pkru)
 }
 #endif
 
-static inline void native_wbinvd(void)
+static inline void wbinvd(void)
 {
asm volatile("wbinvd": : :"memory");
 }
@@ -183,11 +183,6 @@ static inline void __write_cr4(unsigned long x)
native_write_cr4(x);
 }
 
-static inline void wbinvd(void)
-{
-   native_wbinvd();
-}
-
 #ifdef CONFIG_X86_64
 
 static inline unsigned long read_cr8(void)
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 19a3e8f961c7..3fead3a50723 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -332,7 +332,6 @@ __visible struct pv_cpu_ops pv_cpu_ops = {
.read_cr8 = native_read_cr8,
.write_cr8 = native_write_cr8,
 #endif
-   .wbinvd = native_wbinvd,
.read_msr = native_read_msr,
.write_msr = native_write_msr,
.read_msr_safe = native_read_msr_safe,
diff --git a/arch/x86/kernel/paravirt_patch_64.c b/arch/x86/kernel/paravirt_patch_64.c
index 11aaf1eaa0e4..0a1ba3f80cbf 100644
--- a/arch/x86/kernel/paravirt_patch_64.c
+++ b/arch/x86/kernel/paravirt_patch_64.c
@@ -10,7 +10,6 @@ DEF_NATIVE(pv_mmu_ops, read_cr2, "movq %cr2, %rax");
 DEF_NATIVE(pv_mmu_ops, read_cr3, "movq %cr3, %rax");
 DEF_NATIVE(pv_mmu_ops, write_cr3, "movq %rdi, %cr3");
 DEF_NATIVE(pv_mmu_ops, flush_tlb_single, "invlpg (%rdi)");
-DEF_NATIVE(pv_cpu_ops, wbinvd, "wbinvd");
 
 DEF_NATIVE(pv_cpu_ops, usergs_sysret64, "swapgs; sysretq");
 DEF_NATIVE(pv_cpu_ops, swapgs, "swapgs");
@@ -60,7 +59,6 @@ unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
PATCH_SITE(pv_mmu_ops, read_cr3);
PATCH_SITE(pv_mmu_ops, write_cr3);
PATCH_SITE(pv_mmu_ops, flush_tlb_single);
-   PATCH_SITE(pv_cpu_ops, wbinvd);
 #if defined(CONFIG_PARAVIRT_SPINLOCKS)
case PARAVIRT_PATCH(pv_lock_ops.queued_spin_unlock):
if (pv_is_native_spin_unlock()) {
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index 73f809a6ca87..c0cb5c2bfd92 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -1045,8 +1045,6 @@ static const struct pv_cpu_ops xen_cpu_ops __initconst = {
.write_cr8 = xen_write_cr8,
 #endif
 
-   .wbinvd = native_wbinvd,
-
.read_msr = xen_read_msr,
.write_msr = xen_write_msr,
 
-- 
2.13.6



[PATCH 00/13] x86/paravirt: Make pv ops code generation more closely match reality

2017-10-26 Thread Josh Poimboeuf
This changes the pv ops code generation to more closely match reality.

For example, instead of:

  callq  *0x81e3a400 (pv_irq_ops.save_fl)

vmlinux will now show:

  pushfq
  pop%rax
  nop
  nop
  nop
  nop
  nop

which is what the runtime version of the code will show in most cases.
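
For reference, a hedged sketch of the native sequence that the listing above
shows inlined at the call site (the function name is illustrative, not from
this series; the trailing nops are just padding left over at the patch site):

    static inline unsigned long save_fl_example(void)
    {
    	unsigned long flags;

    	asm volatile("pushf ; pop %0"
    		     : "=rm" (flags)
    		     : /* no inputs */
    		     : "memory");
    	return flags;
    }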

This idea was suggested by Andy Lutomirski.

The benefits are:

- For the most common runtime cases (everything except Xen and vSMP),
  vmlinux disassembly now matches what the actual runtime code looks
  like.  This improves debuggability and kernel developer sanity (a
  precious resource).

- It fixes a KASAN warning in the ORC unwinder due to objtool not
  understanding the .parainstructions stuff.

- It's hopefully a first step in simplifying paravirt patching by
  getting rid of .parainstructions, pv ops, and apply_paravirt()
  completely.  (I think Xen can be changed to set CPU feature bits to
  specify which ops it needs during early boot, then those ops can be
  patched in using early alternatives.)

For more details, see the commit log in patch 11/13.

Josh Poimboeuf (13):
  x86/paravirt: remove wbinvd() paravirt interface
  x86/paravirt: Fix output constraint macro names
  x86/paravirt: Convert native patch assembly code strings to macros
  x86/paravirt: Convert DEF_NATIVE macro to GCC extended asm syntax
  x86/paravirt: Move paravirt asm macros to paravirt-asm.h
  x86/paravirt: Clean up paravirt-asm.h
  x86/paravirt: Simplify PVOP_CALL()
  x86/paravirt: Clean up paravirt_types.h
  x86/asm: Convert ALTERNATIVE*() assembler macros to preprocessor
macros
  x86/alternative: Support indirect call replacement
  x86/paravirt: Add paravirt alternatives infrastructure
  objtool: Add support for new .pv_altinstructions section
  x86/paravirt: Convert natively patched pv ops to use paravirt
alternatives

 arch/x86/entry/entry_32.S|  13 +-
 arch/x86/entry/entry_64.S|  12 +-
 arch/x86/entry/entry_64_compat.S |   9 +-
 arch/x86/entry/vdso/vdso32/system_call.S |  10 +-
 arch/x86/include/asm/alternative-asm.h   |  71 -
 arch/x86/include/asm/alternative.h   |  12 +-
 arch/x86/include/asm/cpufeatures.h   |   1 +
 arch/x86/include/asm/paravirt-asm.h  | 142 ++
 arch/x86/include/asm/paravirt.h  | 174 --
 arch/x86/include/asm/paravirt_types.h| 243 ---
 arch/x86/include/asm/smap.h  |   4 +-
 arch/x86/include/asm/special_insns.h |  31 +++-
 arch/x86/kernel/alternative.c|  35 -
 arch/x86/kernel/cpu/hypervisor.c |   2 +
 arch/x86/kernel/head_64.S|   2 +-
 arch/x86/kernel/module.c |  11 +-
 arch/x86/kernel/paravirt.c   |   1 -
 arch/x86/kernel/paravirt_patch_32.c  |  21 +--
 arch/x86/kernel/paravirt_patch_64.c  |  31 ++--
 arch/x86/kernel/vmlinux.lds.S|   6 +
 arch/x86/lib/copy_page_64.S  |   2 +-
 arch/x86/lib/memcpy_64.S |   4 +-
 arch/x86/lib/memmove_64.S|   3 +-
 arch/x86/lib/memset_64.S |   4 +-
 arch/x86/xen/enlighten_pv.c  |   3 +-
 tools/objtool/special.c  |  10 ++
 26 files changed, 516 insertions(+), 341 deletions(-)
 create mode 100644 arch/x86/include/asm/paravirt-asm.h

-- 
2.13.6



  1   2   3   >