Re: [PATCH v2 1/2] KVM: Start populating /sys/hypervisor with KVM entries

2019-05-17 Thread Sironi, Filippo


> On 16. May 2019, at 15:50, Graf, Alexander  wrote:
> 
> On 14.05.19 08:16, Filippo Sironi wrote:
>> Start populating /sys/hypervisor with KVM entries when we're running on
>> KVM. This is to replicate functionality that's available when we're
>> running on Xen.
>> 
>> Start with /sys/hypervisor/uuid, which users prefer over
>> /sys/devices/virtual/dmi/id/product_uuid as a way to recognize a virtual
>> machine, since it's also available when running on Xen HVM and on Xen PV
>> and, on top of that doesn't require root privileges by default.
>> Let's create arch-specific hooks so that different architectures can
>> provide different implementations.
>> 
>> Signed-off-by: Filippo Sironi 
> 
> I think this needs something akin to
> 
>  https://www.kernel.org/doc/Documentation/ABI/stable/sysfs-hypervisor-xen
> 
> to document which files are available.
> 
>> ---
>> v2:
>> * move the retrieval of the VM UUID out of uuid_show and into
>>  kvm_para_get_uuid, which is a weak function that can be overwritten
>> 
>> drivers/Kconfig  |  2 ++
>> drivers/Makefile |  2 ++
>> drivers/kvm/Kconfig  | 14 ++
>> drivers/kvm/Makefile |  1 +
>> drivers/kvm/sys-hypervisor.c | 30 ++
>> 5 files changed, 49 insertions(+)
>> create mode 100644 drivers/kvm/Kconfig
>> create mode 100644 drivers/kvm/Makefile
>> create mode 100644 drivers/kvm/sys-hypervisor.c
>> 
> 
> [...]
> 
>> +
>> +__weak const char *kvm_para_get_uuid(void)
>> +{
>> +return NULL;
>> +}
>> +
>> +static ssize_t uuid_show(struct kobject *obj,
>> + struct kobj_attribute *attr,
>> + char *buf)
>> +{
>> +const char *uuid = kvm_para_get_uuid();
>> +return sprintf(buf, "%s\n", uuid);
> 
> The usual return value for the Xen /sys/hypervisor interface is
> "". Wouldn't it make sense to follow that pattern for the KVM
> one too? Currently, if we can not determine the UUID this will just
> return (null).
> 
> Otherwise, looks good to me. Are you aware of any other files we should
> provide? Also, is there any reason not to implement ARM as well while at it?
> 
> Alex

This originated from a customer request that was using /sys/hypervisor/uuid.
My guess is that we would want to expose "type" and "version" moving
forward and that's when we hypervisor hooks will be useful on top
of arch hooks.

On a different note, any idea how to check whether the OS is running
virtualized on KVM on ARM and ARM64?  kvm_para_available() isn't an
option and the same is true for S390 where kvm_para_available()
always returns true and it would even if a KVM enabled kernel would
be running on bare metal.

I think we will need another arch hook to call a function that says
whether the OS is running virtualized on KVM.

>> +}
>> +
>> +static struct kobj_attribute uuid = __ATTR_RO(uuid);
>> +
>> +static int __init uuid_init(void)
>> +{
>> +if (!kvm_para_available())
>> +return 0;
>> +return sysfs_create_file(hypervisor_kobj, );
>> +}
>> +
>> +device_initcall(uuid_init);




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrer: Christian Schlaeger, Ralf Herbrich
Ust-ID: DE 289 237 879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B




Re: [PATCH v2 2/2] KVM: x86: Implement the arch-specific hook to report the VM UUID

2019-05-16 Thread Sironi, Filippo


> On 16. May 2019, at 18:40, Boris Ostrovsky  wrote:
> 
> On 5/16/19 11:33 AM, Alexander Graf wrote:
>> On 16.05.19 08:25, Sironi, Filippo wrote:
>>>> On 16. May 2019, at 15:56, Graf, Alexander  wrote:
>>>> 
>>>> On 14.05.19 08:16, Filippo Sironi wrote:
>>>>> On x86, we report the UUID in DMI System Information (i.e., DMI Type 1)
>>>>> as VM UUID.
>>>>> 
>>>>> Signed-off-by: Filippo Sironi 
>>>>> ---
>>>>> arch/x86/kernel/kvm.c | 7 +++
>>>>> 1 file changed, 7 insertions(+)
>>>>> 
>>>>> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
>>>>> index 5c93a65ee1e5..441cab08a09d 100644
>>>>> --- a/arch/x86/kernel/kvm.c
>>>>> +++ b/arch/x86/kernel/kvm.c
>>>>> @@ -25,6 +25,7 @@
>>>>> #include 
>>>>> #include 
>>>>> #include 
>>>>> +#include 
>>>>> #include 
>>>>> #include 
>>>>> #include 
>>>>> @@ -694,6 +695,12 @@ bool kvm_para_available(void)
>>>>> }
>>>>> EXPORT_SYMBOL_GPL(kvm_para_available);
>>>>> 
>>>>> +const char *kvm_para_get_uuid(void)
>>>>> +{
>>>>> + return dmi_get_system_info(DMI_PRODUCT_UUID);
>>>> This adds a new dependency on CONFIG_DMI. Probably best to guard it with
>>>> an #if IS_ENABLED(CONFIG_DMI).
>>>> 
>>>> The concept seems sound though.
>>>> 
>>>> Alex
>>> include/linux/dmi.h contains a dummy implementation of
>>> dmi_get_system_info that returns NULL if CONFIG_DMI isn't defined.
>> 
>> Oh, I missed that bit. Awesome! Less work :).
>> 
>> 
>>> This is enough unless we decide to return "" like in Xen.
>>> If then, we can have the check in the generic code to turn NULL
>>> into "".
>> 
>> Yes. Waiting for someone from Xen to answer this :)
> 
> Not sure I am answering your question but on Xen we return UUID value
> zero if access permissions are not sufficient. Not .
> 
> http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/common/kernel.c;h=612575430f1ce7faf5bd66e7a99f1758c63fb3cb;hb=HEAD#l506
> 
> -boris

Then, I believe that returning ----
instead of NULL in the weak implementation of 1/2 and translating
NULL into ---- is the better approach.

I'll repost.

Filippo





Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrer: Christian Schlaeger, Ralf Herbrich
Ust-ID: DE 289 237 879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B




Re: [PATCH v2 2/2] KVM: x86: Implement the arch-specific hook to report the VM UUID

2019-05-16 Thread Sironi, Filippo


> On 16. May 2019, at 15:56, Graf, Alexander  wrote:
> 
> On 14.05.19 08:16, Filippo Sironi wrote:
>> On x86, we report the UUID in DMI System Information (i.e., DMI Type 1)
>> as VM UUID.
>> 
>> Signed-off-by: Filippo Sironi 
>> ---
>> arch/x86/kernel/kvm.c | 7 +++
>> 1 file changed, 7 insertions(+)
>> 
>> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
>> index 5c93a65ee1e5..441cab08a09d 100644
>> --- a/arch/x86/kernel/kvm.c
>> +++ b/arch/x86/kernel/kvm.c
>> @@ -25,6 +25,7 @@
>> #include 
>> #include 
>> #include 
>> +#include 
>> #include 
>> #include 
>> #include 
>> @@ -694,6 +695,12 @@ bool kvm_para_available(void)
>> }
>> EXPORT_SYMBOL_GPL(kvm_para_available);
>> 
>> +const char *kvm_para_get_uuid(void)
>> +{
>> +return dmi_get_system_info(DMI_PRODUCT_UUID);
> 
> This adds a new dependency on CONFIG_DMI. Probably best to guard it with
> an #if IS_ENABLED(CONFIG_DMI).
> 
> The concept seems sound though.
> 
> Alex

include/linux/dmi.h contains a dummy implementation of
dmi_get_system_info that returns NULL if CONFIG_DMI isn't defined.
This is enough unless we decide to return "" like in Xen.
If then, we can have the check in the generic code to turn NULL
into "".

Filippo


>> +}
>> +EXPORT_SYMBOL_GPL(kvm_para_get_uuid);
>> +
>> unsigned int kvm_arch_para_features(void)
>> {
>>  return cpuid_eax(kvm_cpuid_base() | KVM_CPUID_FEATURES);




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrer: Christian Schlaeger, Ralf Herbrich
Ust-ID: DE 289 237 879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B




Re: [Xen-devel] [PATCH v2 1/2] KVM: Start populating /sys/hypervisor with KVM entries

2019-05-16 Thread Sironi, Filippo


> On 16. May 2019, at 17:02, Boris Ostrovsky  wrote:
> 
> On 5/16/19 10:08 AM, Alexander Graf wrote:
>> 
>> My point is mostly that we should be as common
>> as possible when it comes to /sys/hypervisor, so that tools don't have
>> to care about the HV they're working against.
> 
> It might make sense to have a common sys-hypervisor.c file
> (drivers/hypervisor/sys-hypervisor.c or some such), with
> hypervisor-specific ops/callbacks/etc.
> 
> -boris


Yes, it definitely does. I would follow up with future patches to make it
happen.

Filippo





Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrer: Christian Schlaeger, Ralf Herbrich
Ust-ID: DE 289 237 879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B




Re: [PATCH v2 1/2] KVM: Start populating /sys/hypervisor with KVM entries

2019-05-14 Thread Sironi, Filippo



> On 14. May 2019, at 18:09, Sironi, Filippo  wrote:
> 
>> On 14. May 2019, at 17:26, Christian Borntraeger  
>> wrote:
>> 
>>> On 14.05.19 17:16, Filippo Sironi wrote:
>>> Start populating /sys/hypervisor with KVM entries when we're running on
>>> KVM. This is to replicate functionality that's available when we're
>>> running on Xen.
>>> 
>>> Start with /sys/hypervisor/uuid, which users prefer over
>>> /sys/devices/virtual/dmi/id/product_uuid as a way to recognize a virtual
>>> machine, since it's also available when running on Xen HVM and on Xen PV
>>> and, on top of that doesn't require root privileges by default.
>>> Let's create arch-specific hooks so that different architectures can
>>> provide different implementations.
>>> 
>>> Signed-off-by: Filippo Sironi 
>>> ---
>>> v2:
>>> * move the retrieval of the VM UUID out of uuid_show and into
>>> kvm_para_get_uuid, which is a weak function that can be overwritten
>>> 
>>> drivers/Kconfig  |  2 ++
>>> drivers/Makefile |  2 ++
>>> drivers/kvm/Kconfig  | 14 ++
>>> drivers/kvm/Makefile |  1 +
>>> drivers/kvm/sys-hypervisor.c | 30 ++
>>> 5 files changed, 49 insertions(+)
>>> create mode 100644 drivers/kvm/Kconfig
>>> create mode 100644 drivers/kvm/Makefile
>>> create mode 100644 drivers/kvm/sys-hypervisor.c
>>> 
>>> diff --git a/drivers/Kconfig b/drivers/Kconfig
>>> index 45f9decb9848..90eb835fe951 100644
>>> --- a/drivers/Kconfig
>>> +++ b/drivers/Kconfig
>>> @@ -146,6 +146,8 @@ source "drivers/hv/Kconfig"
>>> 
>>> source "drivers/xen/Kconfig"
>>> 
>>> +source "drivers/kvm/Kconfig"
>>> +
>>> source "drivers/staging/Kconfig"
>>> 
>>> source "drivers/platform/Kconfig"
>>> diff --git a/drivers/Makefile b/drivers/Makefile
>>> index c61cde554340..79cc92a3f6bf 100644
>>> --- a/drivers/Makefile
>>> +++ b/drivers/Makefile
>>> @@ -44,6 +44,8 @@ obj-y += soc/
>>> obj-$(CONFIG_VIRTIO)+= virtio/
>>> obj-$(CONFIG_XEN)   += xen/
>>> 
>>> +obj-$(CONFIG_KVM_GUEST)+= kvm/
>>> +
>>> # regulators early, since some subsystems rely on them to initialize
>>> obj-$(CONFIG_REGULATOR) += regulator/
>>> 
>>> diff --git a/drivers/kvm/Kconfig b/drivers/kvm/Kconfig
>>> new file mode 100644
>>> index ..3fc041df7c11
>>> --- /dev/null
>>> +++ b/drivers/kvm/Kconfig
>>> @@ -0,0 +1,14 @@
>>> +menu "KVM driver support"
>>> +depends on KVM_GUEST
>>> +
>>> +config KVM_SYS_HYPERVISOR
>>> +bool "Create KVM entries under /sys/hypervisor"
>>> +depends on SYSFS
>>> +select SYS_HYPERVISOR
>>> +default y
>>> +help
>>> +  Create KVM entries under /sys/hypervisor (e.g., uuid). When 
>>> running
>>> +  native or on another hypervisor, /sys/hypervisor may still be
>>> +  present, but it will have no KVM entries.
>>> +
>>> +endmenu
>>> diff --git a/drivers/kvm/Makefile b/drivers/kvm/Makefile
>>> new file mode 100644
>>> index ..73a43fc994b9
>>> --- /dev/null
>>> +++ b/drivers/kvm/Makefile
>>> @@ -0,0 +1 @@
>>> +obj-$(CONFIG_KVM_SYS_HYPERVISOR) += sys-hypervisor.o
>>> diff --git a/drivers/kvm/sys-hypervisor.c b/drivers/kvm/sys-hypervisor.c
>>> new file mode 100644
>>> index ..43b1d1a09807
>>> --- /dev/null
>>> +++ b/drivers/kvm/sys-hypervisor.c
>>> @@ -0,0 +1,30 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>> +
>>> +#include 
>>> +
>>> +#include 
>>> +#include 
>>> +
>>> +__weak const char *kvm_para_get_uuid(void)
>>> +{
>>> +   return NULL;
>>> +}
>>> +
>>> +static ssize_t uuid_show(struct kobject *obj,
>>> +struct kobj_attribute *attr,
>>> +char *buf)
>>> +{
>>> +   const char *uuid = kvm_para_get_uuid();
>> 
>> I would prefer to have kvm_para_get_uuid return a uuid_t
>> but char * will probably work out as well.
> 
> Let me give this a quick spin

Re: [PATCH v2 1/2] KVM: Start populating /sys/hypervisor with KVM entries

2019-05-14 Thread Sironi, Filippo



> On 14. May 2019, at 17:26, Christian Borntraeger  
> wrote:
> 
> 
> 
> On 14.05.19 17:16, Filippo Sironi wrote:
>> Start populating /sys/hypervisor with KVM entries when we're running on
>> KVM. This is to replicate functionality that's available when we're
>> running on Xen.
>> 
>> Start with /sys/hypervisor/uuid, which users prefer over
>> /sys/devices/virtual/dmi/id/product_uuid as a way to recognize a virtual
>> machine, since it's also available when running on Xen HVM and on Xen PV
>> and, on top of that doesn't require root privileges by default.
>> Let's create arch-specific hooks so that different architectures can
>> provide different implementations.
>> 
>> Signed-off-by: Filippo Sironi 
>> ---
>> v2:
>> * move the retrieval of the VM UUID out of uuid_show and into
>>  kvm_para_get_uuid, which is a weak function that can be overwritten
>> 
>> drivers/Kconfig  |  2 ++
>> drivers/Makefile |  2 ++
>> drivers/kvm/Kconfig  | 14 ++
>> drivers/kvm/Makefile |  1 +
>> drivers/kvm/sys-hypervisor.c | 30 ++
>> 5 files changed, 49 insertions(+)
>> create mode 100644 drivers/kvm/Kconfig
>> create mode 100644 drivers/kvm/Makefile
>> create mode 100644 drivers/kvm/sys-hypervisor.c
>> 
>> diff --git a/drivers/Kconfig b/drivers/Kconfig
>> index 45f9decb9848..90eb835fe951 100644
>> --- a/drivers/Kconfig
>> +++ b/drivers/Kconfig
>> @@ -146,6 +146,8 @@ source "drivers/hv/Kconfig"
>> 
>> source "drivers/xen/Kconfig"
>> 
>> +source "drivers/kvm/Kconfig"
>> +
>> source "drivers/staging/Kconfig"
>> 
>> source "drivers/platform/Kconfig"
>> diff --git a/drivers/Makefile b/drivers/Makefile
>> index c61cde554340..79cc92a3f6bf 100644
>> --- a/drivers/Makefile
>> +++ b/drivers/Makefile
>> @@ -44,6 +44,8 @@ obj-y  += soc/
>> obj-$(CONFIG_VIRTIO) += virtio/
>> obj-$(CONFIG_XEN)+= xen/
>> 
>> +obj-$(CONFIG_KVM_GUEST) += kvm/
>> +
>> # regulators early, since some subsystems rely on them to initialize
>> obj-$(CONFIG_REGULATOR)  += regulator/
>> 
>> diff --git a/drivers/kvm/Kconfig b/drivers/kvm/Kconfig
>> new file mode 100644
>> index ..3fc041df7c11
>> --- /dev/null
>> +++ b/drivers/kvm/Kconfig
>> @@ -0,0 +1,14 @@
>> +menu "KVM driver support"
>> +depends on KVM_GUEST
>> +
>> +config KVM_SYS_HYPERVISOR
>> +bool "Create KVM entries under /sys/hypervisor"
>> +depends on SYSFS
>> +select SYS_HYPERVISOR
>> +default y
>> +help
>> +  Create KVM entries under /sys/hypervisor (e.g., uuid). When 
>> running
>> +  native or on another hypervisor, /sys/hypervisor may still be
>> +  present, but it will have no KVM entries.
>> +
>> +endmenu
>> diff --git a/drivers/kvm/Makefile b/drivers/kvm/Makefile
>> new file mode 100644
>> index ..73a43fc994b9
>> --- /dev/null
>> +++ b/drivers/kvm/Makefile
>> @@ -0,0 +1 @@
>> +obj-$(CONFIG_KVM_SYS_HYPERVISOR) += sys-hypervisor.o
>> diff --git a/drivers/kvm/sys-hypervisor.c b/drivers/kvm/sys-hypervisor.c
>> new file mode 100644
>> index ..43b1d1a09807
>> --- /dev/null
>> +++ b/drivers/kvm/sys-hypervisor.c
>> @@ -0,0 +1,30 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +
>> +#include 
>> +
>> +#include 
>> +#include 
>> +
>> +__weak const char *kvm_para_get_uuid(void)
>> +{
>> +return NULL;
>> +}
>> +
>> +static ssize_t uuid_show(struct kobject *obj,
>> + struct kobj_attribute *attr,
>> + char *buf)
>> +{
>> +const char *uuid = kvm_para_get_uuid();
> 
> I would prefer to have kvm_para_get_uuid return a uuid_t
> but char * will probably work out as well.

Let me give this a quick spin.

>> +return sprintf(buf, "%s\n", uuid);
>> +}
>> +
>> +static struct kobj_attribute uuid = __ATTR_RO(uuid);
>> +
>> +static int __init uuid_init(void)
>> +{
>> +if (!kvm_para_available())
> 
> Isnt kvm_para_available a function that is defined in the context of the HOST
> and not of the guest?

No, kvm_para_available is defined in the guest context.
On x86, it checks for the presence of the KVM CPUID leafs.

>> +return 0;
>> +return sysfs_create_file(hypervisor_kobj, );
>> +}
>> +
>> +device_initcall(uuid_init);




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrer: Christian Schlaeger, Ralf Herbrich
Ust-ID: DE 289 237 879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B




Re: [PATCH] x86/microcode: Don't duplicate code to update ucode cpu info and cpu info

2018-07-31 Thread Sironi, Filippo


> On 31. Jul 2018, at 17:41, Greg KH  wrote:
> 
> On Tue, Jul 31, 2018 at 05:29:30PM +0200, Filippo Sironi wrote:
>> ... on late microcode loading when handling a CPU that's already been
>> updated and a CPU that's yet to be updated.
>> 
>> Signed-off-by: Filippo Sironi 
>> ---
>> arch/x86/kernel/cpu/microcode/amd.c   | 15 +--
>> arch/x86/kernel/cpu/microcode/intel.c | 10 ++
>> 2 files changed, 15 insertions(+), 10 deletions(-)
> 
> 
> 
> This is not the correct way to submit patches for inclusion in the
> stable kernel tree.  Please read:
>https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> for how to do this properly.
> 
> 

Greg,

If you're concerned that this isn't a fix, keep in mind that this is going
to be a prerequisite for https://lore.kernel.org/patchwork/patch/969070/ ,
which is a bug and was sent to stable.

Filippo


Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: [PATCH] x86/microcode: Don't duplicate code to update ucode cpu info and cpu info

2018-07-31 Thread Sironi, Filippo


> On 31. Jul 2018, at 17:41, Greg KH  wrote:
> 
> On Tue, Jul 31, 2018 at 05:29:30PM +0200, Filippo Sironi wrote:
>> ... on late microcode loading when handling a CPU that's already been
>> updated and a CPU that's yet to be updated.
>> 
>> Signed-off-by: Filippo Sironi 
>> ---
>> arch/x86/kernel/cpu/microcode/amd.c   | 15 +--
>> arch/x86/kernel/cpu/microcode/intel.c | 10 ++
>> 2 files changed, 15 insertions(+), 10 deletions(-)
> 
> 
> 
> This is not the correct way to submit patches for inclusion in the
> stable kernel tree.  Please read:
>https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> for how to do this properly.
> 
> 

Greg,

If you're concerned that this isn't a fix, keep in mind that this is going
to be a prerequisite for https://lore.kernel.org/patchwork/patch/969070/ ,
which is a bug and was sent to stable.

Filippo


Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: [PATCH v2] arch/x86: Fix boot_cpu_data.microcode version output

2018-07-31 Thread Sironi, Filippo


> On 31. Jul 2018, at 15:00, Borislav Petkov  wrote:
> 
> On Tue, Jul 31, 2018 at 11:46:09AM +0000, Sironi, Filippo wrote:
>> There may be a chance of skipping this code, I think.
>> 
>> If the microcode is loaded on the hyperthread sibling of the boot cpu
>> before being loaded on the boot cpu, the boot cpu will exit earlier
>> from apply_microcode_intel() - in if (rev >= mc->hdr.rev) { ... }.
>> 
>> (This seems to be possible in apply_microcode_amd() as well.)
>> 
>> In my tree with the aforementioned change - Intel only - I also had
>> the following patch:
> 
> Yap, good catch.
> 
> -- 
> Regards/Gruss,
>Boris.
> 
> ECO tip #101: Trim your mails when you reply.
> --
> 

Boris, Prarit,

I've sent the patch the I had in my tree and extended it to cover
apply_microcode_amd(). I don't have an AMD host available for testing,
can one of you give that a spin?

Thanks,
Filippo


Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: [PATCH v2] arch/x86: Fix boot_cpu_data.microcode version output

2018-07-31 Thread Sironi, Filippo


> On 31. Jul 2018, at 15:00, Borislav Petkov  wrote:
> 
> On Tue, Jul 31, 2018 at 11:46:09AM +0000, Sironi, Filippo wrote:
>> There may be a chance of skipping this code, I think.
>> 
>> If the microcode is loaded on the hyperthread sibling of the boot cpu
>> before being loaded on the boot cpu, the boot cpu will exit earlier
>> from apply_microcode_intel() - in if (rev >= mc->hdr.rev) { ... }.
>> 
>> (This seems to be possible in apply_microcode_amd() as well.)
>> 
>> In my tree with the aforementioned change - Intel only - I also had
>> the following patch:
> 
> Yap, good catch.
> 
> -- 
> Regards/Gruss,
>Boris.
> 
> ECO tip #101: Trim your mails when you reply.
> --
> 

Boris, Prarit,

I've sent the patch the I had in my tree and extended it to cover
apply_microcode_amd(). I don't have an AMD host available for testing,
can one of you give that a spin?

Thanks,
Filippo


Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: [PATCH v2] arch/x86: Fix boot_cpu_data.microcode version output

2018-07-31 Thread Sironi, Filippo


> On 31. Jul 2018, at 13:27, Prarit Bhargava  wrote:
> 
> I tested this on AMD Ryzen & Intel Broadwell system and dumped the
> boot_cpu_data before and after a microcode update.  On the Intel
> system I also did a fatal MCE using mce-inject to confirm the output
> from the mce handling code.
> 
> P.
> 
> ---8<---
> 
> On systems where a runtime microcode update has occurred the microcode
> version output in a MCE log record is wrong because
> boot_cpu_data.microcode is not updated during runtime.
> 
> Update boot_cpu_data.microcode when the BSP's microcode is updated.
> 
> Fixes: fa94d0c6e0f3 ("x86/MCE: Save microcode revision in machine check 
> records")
> Suggested-by: Borislav Petkov 
> Signed-off-by: Prarit Bhargava 
> Cc: sta...@vger.kernel.org
> Cc: sir...@amazon.de
> Cc: tony.l...@intel.com
> ---
> Changes in v2: Use mc_amd->hdr.patch_id on AMD
> 
> arch/x86/kernel/cpu/microcode/amd.c   | 4 
> arch/x86/kernel/cpu/microcode/intel.c | 4 
> 2 files changed, 8 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/microcode/amd.c 
> b/arch/x86/kernel/cpu/microcode/amd.c
> index 0624957aa068..63b072377ba4 100644
> --- a/arch/x86/kernel/cpu/microcode/amd.c
> +++ b/arch/x86/kernel/cpu/microcode/amd.c
> @@ -537,6 +537,10 @@ static enum ucode_state apply_microcode_amd(int cpu)
>   uci->cpu_sig.rev = mc_amd->hdr.patch_id;
>   c->microcode = mc_amd->hdr.patch_id;
> 
> + /* Update boot_cpu_data's revision too, if we're on the BSP: */
> + if (c->cpu_index == boot_cpu_data.cpu_index)
> + boot_cpu_data.microcode =  mc_amd->hdr.patch_id;
> +
>   return UCODE_UPDATED;
> }
> 
> diff --git a/arch/x86/kernel/cpu/microcode/intel.c 
> b/arch/x86/kernel/cpu/microcode/intel.c
> index 97ccf4c3b45b..256d336cbc04 100644
> --- a/arch/x86/kernel/cpu/microcode/intel.c
> +++ b/arch/x86/kernel/cpu/microcode/intel.c
> @@ -851,6 +851,10 @@ static enum ucode_state apply_microcode_intel(int cpu)
>   uci->cpu_sig.rev = rev;
>   c->microcode = rev;
> 
> + /* Update boot_cpu_data's revision too, if we're on the BSP: */
> + if (c->cpu_index == boot_cpu_data.cpu_index)
> + boot_cpu_data.microcode = rev;
> +
>   return UCODE_UPDATED;
> }

There may be a chance of skipping this code, I think.

If the microcode is loaded on the hyperthread sibling of the boot cpu
before being loaded on the boot cpu, the boot cpu will exit earlier
from apply_microcode_intel() - in if (rev >= mc->hdr.rev) { ... }.

(This seems to be possible in apply_microcode_amd() as well.)

In my tree with the aforementioned change - Intel only - I also had
the following patch:

diff --git a/arch/x86/kernel/cpu/microcode/intel.c 
b/arch/x86/kernel/cpu/microcode/intel.c
index 97ccf4c3b45b..4bc869e829eb 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -797,6 +797,7 @@ static enum ucode_state apply_microcode_intel(int cpu)
struct microcode_intel *mc;
static int prev_rev;
u32 rev;
+   enum ucode_state ret;
 
/* We should bind the task to the CPU */
if (WARN_ON(raw_smp_processor_id() != cpu))
@@ -817,9 +818,8 @@ static enum ucode_state apply_microcode_intel(int cpu)
 */
rev = intel_get_microcode_revision();
if (rev >= mc->hdr.rev) {
-   uci->cpu_sig.rev = rev;
-   c->microcode = rev;
-   return UCODE_OK;
+   ret = UCODE_OK;
+   goto out;
}
 
/*
@@ -848,10 +848,12 @@ static enum ucode_state apply_microcode_intel(int cpu)
prev_rev = rev;
}
 
+   ret = UCODE_UPDATED;
+out:
uci->cpu_sig.rev = rev;
c->microcode = rev;
 
-   return UCODE_UPDATED;
+   return ret;
 }
 
 static enum ucode_state generic_load_microcode(int cpu, void *data, size_t 
size,

which prevents the issue.

> -- 
> 2.17.0
> 
> 

Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: [PATCH v2] arch/x86: Fix boot_cpu_data.microcode version output

2018-07-31 Thread Sironi, Filippo


> On 31. Jul 2018, at 13:27, Prarit Bhargava  wrote:
> 
> I tested this on AMD Ryzen & Intel Broadwell system and dumped the
> boot_cpu_data before and after a microcode update.  On the Intel
> system I also did a fatal MCE using mce-inject to confirm the output
> from the mce handling code.
> 
> P.
> 
> ---8<---
> 
> On systems where a runtime microcode update has occurred the microcode
> version output in a MCE log record is wrong because
> boot_cpu_data.microcode is not updated during runtime.
> 
> Update boot_cpu_data.microcode when the BSP's microcode is updated.
> 
> Fixes: fa94d0c6e0f3 ("x86/MCE: Save microcode revision in machine check 
> records")
> Suggested-by: Borislav Petkov 
> Signed-off-by: Prarit Bhargava 
> Cc: sta...@vger.kernel.org
> Cc: sir...@amazon.de
> Cc: tony.l...@intel.com
> ---
> Changes in v2: Use mc_amd->hdr.patch_id on AMD
> 
> arch/x86/kernel/cpu/microcode/amd.c   | 4 
> arch/x86/kernel/cpu/microcode/intel.c | 4 
> 2 files changed, 8 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/microcode/amd.c 
> b/arch/x86/kernel/cpu/microcode/amd.c
> index 0624957aa068..63b072377ba4 100644
> --- a/arch/x86/kernel/cpu/microcode/amd.c
> +++ b/arch/x86/kernel/cpu/microcode/amd.c
> @@ -537,6 +537,10 @@ static enum ucode_state apply_microcode_amd(int cpu)
>   uci->cpu_sig.rev = mc_amd->hdr.patch_id;
>   c->microcode = mc_amd->hdr.patch_id;
> 
> + /* Update boot_cpu_data's revision too, if we're on the BSP: */
> + if (c->cpu_index == boot_cpu_data.cpu_index)
> + boot_cpu_data.microcode =  mc_amd->hdr.patch_id;
> +
>   return UCODE_UPDATED;
> }
> 
> diff --git a/arch/x86/kernel/cpu/microcode/intel.c 
> b/arch/x86/kernel/cpu/microcode/intel.c
> index 97ccf4c3b45b..256d336cbc04 100644
> --- a/arch/x86/kernel/cpu/microcode/intel.c
> +++ b/arch/x86/kernel/cpu/microcode/intel.c
> @@ -851,6 +851,10 @@ static enum ucode_state apply_microcode_intel(int cpu)
>   uci->cpu_sig.rev = rev;
>   c->microcode = rev;
> 
> + /* Update boot_cpu_data's revision too, if we're on the BSP: */
> + if (c->cpu_index == boot_cpu_data.cpu_index)
> + boot_cpu_data.microcode = rev;
> +
>   return UCODE_UPDATED;
> }

There may be a chance of skipping this code, I think.

If the microcode is loaded on the hyperthread sibling of the boot cpu
before being loaded on the boot cpu, the boot cpu will exit earlier
from apply_microcode_intel() - in if (rev >= mc->hdr.rev) { ... }.

(This seems to be possible in apply_microcode_amd() as well.)

In my tree with the aforementioned change - Intel only - I also had
the following patch:

diff --git a/arch/x86/kernel/cpu/microcode/intel.c 
b/arch/x86/kernel/cpu/microcode/intel.c
index 97ccf4c3b45b..4bc869e829eb 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -797,6 +797,7 @@ static enum ucode_state apply_microcode_intel(int cpu)
struct microcode_intel *mc;
static int prev_rev;
u32 rev;
+   enum ucode_state ret;
 
/* We should bind the task to the CPU */
if (WARN_ON(raw_smp_processor_id() != cpu))
@@ -817,9 +818,8 @@ static enum ucode_state apply_microcode_intel(int cpu)
 */
rev = intel_get_microcode_revision();
if (rev >= mc->hdr.rev) {
-   uci->cpu_sig.rev = rev;
-   c->microcode = rev;
-   return UCODE_OK;
+   ret = UCODE_OK;
+   goto out;
}
 
/*
@@ -848,10 +848,12 @@ static enum ucode_state apply_microcode_intel(int cpu)
prev_rev = rev;
}
 
+   ret = UCODE_UPDATED;
+out:
uci->cpu_sig.rev = rev;
c->microcode = rev;
 
-   return UCODE_UPDATED;
+   return ret;
 }
 
 static enum ucode_state generic_load_microcode(int cpu, void *data, size_t 
size,

which prevents the issue.

> -- 
> 2.17.0
> 
> 

Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: [PATCH] x86/MCE: Fix CPU microcode version output

2018-07-31 Thread Sironi, Filippo


> On 30. Jul 2018, at 18:16, Borislav Petkov  wrote:
> 
> On Mon, Jul 30, 2018 at 11:23:18AM -0400, Prarit Bhargava wrote:
>> Filippo & Borislav, did the patch get committed to a -next tree?
> 
> No, I'm still waiting for it - looks like Filippo is busy.
> 
> Care to send one instead as suggested here?
> 
> https://lore.kernel.org/lkml/20180601121939.ga23...@nazgul.tnic/
> 
> -- 
> Regards/Gruss,
>Boris.
> 
> ECO tip #101: Trim your mails when you reply.
> --

Sorry, I was out of office for a while.
I'll look into this today.

Filippo


Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: [PATCH] x86/MCE: Fix CPU microcode version output

2018-07-31 Thread Sironi, Filippo


> On 30. Jul 2018, at 18:16, Borislav Petkov  wrote:
> 
> On Mon, Jul 30, 2018 at 11:23:18AM -0400, Prarit Bhargava wrote:
>> Filippo & Borislav, did the patch get committed to a -next tree?
> 
> No, I'm still waiting for it - looks like Filippo is busy.
> 
> Care to send one instead as suggested here?
> 
> https://lore.kernel.org/lkml/20180601121939.ga23...@nazgul.tnic/
> 
> -- 
> Regards/Gruss,
>Boris.
> 
> ECO tip #101: Trim your mails when you reply.
> --

Sorry, I was out of office for a while.
I'll look into this today.

Filippo


Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: [PATCH] x86/MCE: Get microcode revision from cpu_info instead of boot_cpu_data

2018-06-01 Thread Sironi, Filippo
> On 1. Jun 2018, at 14:19, Borislav Petkov  wrote:
> 
> On Fri, Jun 01, 2018 at 01:30:26PM +0200, Filippo Sironi wrote:
>> Commit fa94d0c6e0f3 ("x86/MCE: Save microcode revision in machine check
>> records") extended MCE entries to report the microcode revision taken
>> from boot_cpu_data. Unfortunately, boot_cpu_data isn't updated on late
>> microcode loading,
> 
> Actually, I'd prefer if we fixed *that* instead by adding:
> 
>   /* Update boot_cpu_data's revision too, if we're on the BSP: */
>   if (c->cpu_index == boot_cpu_data.cpu_index)
>   boot_cpu_data.microcode = ;
> 
> to the end of ->apply_microcode() functions so that boot_cpu_data has
> the correct revision too, in case something else queries it.
> 
> Thx.
> 
> -- 
> Regards/Gruss,
>Boris.
> 
> ECO tip #101: Trim your mails when you reply.
> --

I've that patch in my tree already, I can post it.
I'm still curious on why you'd prefer to use boot_cpu_data for all
CPUs instead of using cpu_data(m->extcpu) though.

Regards,
Filippo


Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: [PATCH] x86/MCE: Get microcode revision from cpu_info instead of boot_cpu_data

2018-06-01 Thread Sironi, Filippo
> On 1. Jun 2018, at 14:19, Borislav Petkov  wrote:
> 
> On Fri, Jun 01, 2018 at 01:30:26PM +0200, Filippo Sironi wrote:
>> Commit fa94d0c6e0f3 ("x86/MCE: Save microcode revision in machine check
>> records") extended MCE entries to report the microcode revision taken
>> from boot_cpu_data. Unfortunately, boot_cpu_data isn't updated on late
>> microcode loading,
> 
> Actually, I'd prefer if we fixed *that* instead by adding:
> 
>   /* Update boot_cpu_data's revision too, if we're on the BSP: */
>   if (c->cpu_index == boot_cpu_data.cpu_index)
>   boot_cpu_data.microcode = ;
> 
> to the end of ->apply_microcode() functions so that boot_cpu_data has
> the correct revision too, in case something else queries it.
> 
> Thx.
> 
> -- 
> Regards/Gruss,
>Boris.
> 
> ECO tip #101: Trim your mails when you reply.
> --

I've that patch in my tree already, I can post it.
I'm still curious on why you'd prefer to use boot_cpu_data for all
CPUs instead of using cpu_data(m->extcpu) though.

Regards,
Filippo


Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: [Fwd: [RFC PATCH 2/4] KVM: x86: Reduce retpoline performance impact in slot_handle_level_range()]

2018-02-08 Thread Sironi, Filippo

> On 8. Feb 2018, at 13:17, David Woodhouse  wrote:
> 
> 
> From: David Woodhouse 
> Subject: [RFC PATCH 2/4] KVM: x86: Reduce retpoline performance impact in 
> slot_handle_level_range()
> Date: 7. February 2018 at 01:03:12 GMT+1
> To: t...@linutronix.de, torva...@linux-foundation.org, x...@kernel.org, 
> linux-kernel@vger.kernel.org, b...@alien8.de, pet...@infradead.org, 
> tim.c.c...@linux.intel.com, dave.han...@intel.com, arjan.van.de@intel.com
> 
> 
> With retpoline, tight loops of "call this function for every XXX" are
> very much pessimised by taking a prediction miss *every* time. This one
> showed up very high in our early testing.
> 
> By marking the iterator slot_handle_…() functions always_inline, we can
> ensure that the indirect function call can be optimised away into a
> direct call and it actually generates slightly smaller code because
> some of the other conditionals can get optimised away too.
> 
> Suggested-by: Linus Torvalds 
> Signed-off-by: David Woodhouse 
> ---
> arch/x86/kvm/mmu.c | 10 +-
> 1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 2b8eb4d..cc83bdc 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -5058,7 +5058,7 @@ void kvm_mmu_uninit_vm(struct kvm *kvm)
> typedef bool (*slot_level_handler) (struct kvm *kvm, struct kvm_rmap_head 
> *rmap_head);
> 
> /* The caller should hold mmu-lock before calling this function. */
> -static bool
> +static __always_inline bool
> slot_handle_level_range(struct kvm *kvm, struct kvm_memory_slot *memslot,
>   slot_level_handler fn, int start_level, int end_level,
>   gfn_t start_gfn, gfn_t end_gfn, bool lock_flush_tlb)
> @@ -5088,7 +5088,7 @@ slot_handle_level_range(struct kvm *kvm, struct 
> kvm_memory_slot *memslot,
>   return flush;
> }
> 
> -static bool
> +static __always_inline bool
> slot_handle_level(struct kvm *kvm, struct kvm_memory_slot *memslot,
> slot_level_handler fn, int start_level, int end_level,
> bool lock_flush_tlb)
> @@ -5099,7 +5099,7 @@ slot_handle_level(struct kvm *kvm, struct 
> kvm_memory_slot *memslot,
>   lock_flush_tlb);
> }
> 
> -static bool
> +static __always_inline bool
> slot_handle_all_level(struct kvm *kvm, struct kvm_memory_slot *memslot,
> slot_level_handler fn, bool lock_flush_tlb)
> {
> @@ -5107,7 +5107,7 @@ slot_handle_all_level(struct kvm *kvm, struct 
> kvm_memory_slot *memslot,
>PT_MAX_HUGEPAGE_LEVEL, lock_flush_tlb);
> }
> 
> -static bool
> +static __always_inline bool
> slot_handle_large_level(struct kvm *kvm, struct kvm_memory_slot *memslot,
>   slot_level_handler fn, bool lock_flush_tlb)
> {
> @@ -5115,7 +5115,7 @@ slot_handle_large_level(struct kvm *kvm, struct 
> kvm_memory_slot *memslot,
>PT_MAX_HUGEPAGE_LEVEL, lock_flush_tlb);
> }
> 
> -static bool
> +static __always_inline bool
> slot_handle_leaf(struct kvm *kvm, struct kvm_memory_slot *memslot,
>slot_level_handler fn, bool lock_flush_tlb)
> {
> -- 
> 2.7.4

+k...@vger.kernel.org

With this patch, launches of "large instances" are pretty close to what we see 
with
nospectre_v2 (within tens of milliseconds).

Reviewed-by: Filippo Sironi 
Tested-by: Filippo Sironi 

Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


Re: [Fwd: [RFC PATCH 2/4] KVM: x86: Reduce retpoline performance impact in slot_handle_level_range()]

2018-02-08 Thread Sironi, Filippo

> On 8. Feb 2018, at 13:17, David Woodhouse  wrote:
> 
> 
> From: David Woodhouse 
> Subject: [RFC PATCH 2/4] KVM: x86: Reduce retpoline performance impact in 
> slot_handle_level_range()
> Date: 7. February 2018 at 01:03:12 GMT+1
> To: t...@linutronix.de, torva...@linux-foundation.org, x...@kernel.org, 
> linux-kernel@vger.kernel.org, b...@alien8.de, pet...@infradead.org, 
> tim.c.c...@linux.intel.com, dave.han...@intel.com, arjan.van.de@intel.com
> 
> 
> With retpoline, tight loops of "call this function for every XXX" are
> very much pessimised by taking a prediction miss *every* time. This one
> showed up very high in our early testing.
> 
> By marking the iterator slot_handle_…() functions always_inline, we can
> ensure that the indirect function call can be optimised away into a
> direct call and it actually generates slightly smaller code because
> some of the other conditionals can get optimised away too.
> 
> Suggested-by: Linus Torvalds 
> Signed-off-by: David Woodhouse 
> ---
> arch/x86/kvm/mmu.c | 10 +-
> 1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 2b8eb4d..cc83bdc 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -5058,7 +5058,7 @@ void kvm_mmu_uninit_vm(struct kvm *kvm)
> typedef bool (*slot_level_handler) (struct kvm *kvm, struct kvm_rmap_head 
> *rmap_head);
> 
> /* The caller should hold mmu-lock before calling this function. */
> -static bool
> +static __always_inline bool
> slot_handle_level_range(struct kvm *kvm, struct kvm_memory_slot *memslot,
>   slot_level_handler fn, int start_level, int end_level,
>   gfn_t start_gfn, gfn_t end_gfn, bool lock_flush_tlb)
> @@ -5088,7 +5088,7 @@ slot_handle_level_range(struct kvm *kvm, struct 
> kvm_memory_slot *memslot,
>   return flush;
> }
> 
> -static bool
> +static __always_inline bool
> slot_handle_level(struct kvm *kvm, struct kvm_memory_slot *memslot,
> slot_level_handler fn, int start_level, int end_level,
> bool lock_flush_tlb)
> @@ -5099,7 +5099,7 @@ slot_handle_level(struct kvm *kvm, struct 
> kvm_memory_slot *memslot,
>   lock_flush_tlb);
> }
> 
> -static bool
> +static __always_inline bool
> slot_handle_all_level(struct kvm *kvm, struct kvm_memory_slot *memslot,
> slot_level_handler fn, bool lock_flush_tlb)
> {
> @@ -5107,7 +5107,7 @@ slot_handle_all_level(struct kvm *kvm, struct 
> kvm_memory_slot *memslot,
>PT_MAX_HUGEPAGE_LEVEL, lock_flush_tlb);
> }
> 
> -static bool
> +static __always_inline bool
> slot_handle_large_level(struct kvm *kvm, struct kvm_memory_slot *memslot,
>   slot_level_handler fn, bool lock_flush_tlb)
> {
> @@ -5115,7 +5115,7 @@ slot_handle_large_level(struct kvm *kvm, struct 
> kvm_memory_slot *memslot,
>PT_MAX_HUGEPAGE_LEVEL, lock_flush_tlb);
> }
> 
> -static bool
> +static __always_inline bool
> slot_handle_leaf(struct kvm *kvm, struct kvm_memory_slot *memslot,
>slot_level_handler fn, bool lock_flush_tlb)
> {
> -- 
> 2.7.4

+k...@vger.kernel.org

With this patch, launches of "large instances" are pretty close to what we see 
with
nospectre_v2 (within tens of milliseconds).

Reviewed-by: Filippo Sironi 
Tested-by: Filippo Sironi 

Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


Re: [PATCH] KVM: x86: Reduce retpoline performance impact in slot_handle_level_range()

2018-02-02 Thread Sironi, Filippo

> On 2. Feb 2018, at 15:59, David Woodhouse  wrote:
> 
> With retpoline, tight loops of "call this function for every XXX" are
> very much pessimised by taking a prediction miss *every* time.
> 
> This one showed up very high in our early testing, and it only has five
> things it'll ever call so make it take an 'op' enum instead of a
> function pointer and let's see how that works out...
> 
> Signed-off-by: David Woodhouse 
> ---
> Not sure I like this. Better suggestions welcomed...
> 
> In the general case, we have a few things we can do with the calls that
> retpoline turns into bottlenecks. This is one of them.
> 
> Another option, if there are just one or two "likely" functions, is
> something along the lines of
> 
> if (func == likelyfunc)
>likelyfunc()
> else
>(*func)(); // GCC does retpoline for this
> 
> For things like kvm_x86_ops we really could just turn *all* of those
> into direct calls at runtime, like pvops does.
> 
> There are some which land somewhere in the middle, like the default
> dma_ops. We probably want something like the 'likelyfunc' version
> above, except that we *also* want to flip the likelyfunc between the
> Intel and AMD IOMMU ops functions, at early boot. I'll see what I can
> come up with...
> 
> arch/x86/kvm/mmu.c | 72 --
> 1 file changed, 48 insertions(+), 24 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 2b8eb4d..44f9de7 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -5055,12 +5055,21 @@ void kvm_mmu_uninit_vm(struct kvm *kvm)
> }
> 
> /* The return value indicates if tlb flush on all vcpus is needed. */
> -typedef bool (*slot_level_handler) (struct kvm *kvm, struct kvm_rmap_head 
> *rmap_head);
> +enum slot_handler_op {
> + SLOT_RMAP_CLEAR_DIRTY,
> + SLOT_RMAP_SET_DIRTY,
> + SLOT_RMAP_WRITE_PROTECT,
> + SLOT_ZAP_RMAPP,
> + SLOT_ZAP_COLLAPSIBLE_SPTE,
> +};
> +
> +static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> +  struct kvm_rmap_head *rmap_head);
> 
> /* The caller should hold mmu-lock before calling this function. */
> static bool
> slot_handle_level_range(struct kvm *kvm, struct kvm_memory_slot *memslot,
> - slot_level_handler fn, int start_level, int end_level,
> + enum slot_handler_op op, int start_level, int end_level,
>   gfn_t start_gfn, gfn_t end_gfn, bool lock_flush_tlb)
> {
>   struct slot_rmap_walk_iterator iterator;
> @@ -5068,8 +5077,29 @@ slot_handle_level_range(struct kvm *kvm, struct 
> kvm_memory_slot *memslot,
> 
>   for_each_slot_rmap_range(memslot, start_level, end_level, start_gfn,
>   end_gfn, ) {
> - if (iterator.rmap)
> - flush |= fn(kvm, iterator.rmap);
> + if (iterator.rmap) {
> + switch (op) {
> + case SLOT_RMAP_CLEAR_DIRTY:
> + flush |= __rmap_clear_dirty(kvm, iterator.rmap);
> + break;
> +
> + case SLOT_RMAP_SET_DIRTY:
> + flush |= __rmap_set_dirty(kvm, iterator.rmap);
> + break;
> +
> + case SLOT_RMAP_WRITE_PROTECT:
> + flush |= __rmap_write_protect(kvm, 
> iterator.rmap, false);
> + break;
> +
> + case SLOT_ZAP_RMAPP:
> + flush |= kvm_zap_rmapp(kvm, iterator.rmap);
> + break;
> +
> + case SLOT_ZAP_COLLAPSIBLE_SPTE:
> + flush |= kvm_mmu_zap_collapsible_spte(kvm, 
> iterator.rmap);
> + break;
> + }
> + }
> 
>   if (need_resched() || spin_needbreak(>mmu_lock)) {
>   if (flush && lock_flush_tlb) {
> @@ -5090,10 +5120,10 @@ slot_handle_level_range(struct kvm *kvm, struct 
> kvm_memory_slot *memslot,
> 
> static bool
> slot_handle_level(struct kvm *kvm, struct kvm_memory_slot *memslot,
> -   slot_level_handler fn, int start_level, int end_level,
> +   enum slot_handler_op op, int start_level, int end_level,
> bool lock_flush_tlb)
> {
> - return slot_handle_level_range(kvm, memslot, fn, start_level,
> + return slot_handle_level_range(kvm, memslot, op, start_level,
>   end_level, memslot->base_gfn,
>   memslot->base_gfn + memslot->npages - 1,
>   lock_flush_tlb);
> @@ -5101,25 +5131,25 @@ slot_handle_level(struct kvm *kvm, struct 
> kvm_memory_slot *memslot,
> 
> static bool
> slot_handle_all_level(struct kvm *kvm, struct kvm_memory_slot *memslot,
> -   slot_level_handler fn, bool lock_flush_tlb)
> +   enum 

Re: [PATCH] KVM: x86: Reduce retpoline performance impact in slot_handle_level_range()

2018-02-02 Thread Sironi, Filippo

> On 2. Feb 2018, at 15:59, David Woodhouse  wrote:
> 
> With retpoline, tight loops of "call this function for every XXX" are
> very much pessimised by taking a prediction miss *every* time.
> 
> This one showed up very high in our early testing, and it only has five
> things it'll ever call so make it take an 'op' enum instead of a
> function pointer and let's see how that works out...
> 
> Signed-off-by: David Woodhouse 
> ---
> Not sure I like this. Better suggestions welcomed...
> 
> In the general case, we have a few things we can do with the calls that
> retpoline turns into bottlenecks. This is one of them.
> 
> Another option, if there are just one or two "likely" functions, is
> something along the lines of
> 
> if (func == likelyfunc)
>likelyfunc()
> else
>(*func)(); // GCC does retpoline for this
> 
> For things like kvm_x86_ops we really could just turn *all* of those
> into direct calls at runtime, like pvops does.
> 
> There are some which land somewhere in the middle, like the default
> dma_ops. We probably want something like the 'likelyfunc' version
> above, except that we *also* want to flip the likelyfunc between the
> Intel and AMD IOMMU ops functions, at early boot. I'll see what I can
> come up with...
> 
> arch/x86/kvm/mmu.c | 72 --
> 1 file changed, 48 insertions(+), 24 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 2b8eb4d..44f9de7 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -5055,12 +5055,21 @@ void kvm_mmu_uninit_vm(struct kvm *kvm)
> }
> 
> /* The return value indicates if tlb flush on all vcpus is needed. */
> -typedef bool (*slot_level_handler) (struct kvm *kvm, struct kvm_rmap_head 
> *rmap_head);
> +enum slot_handler_op {
> + SLOT_RMAP_CLEAR_DIRTY,
> + SLOT_RMAP_SET_DIRTY,
> + SLOT_RMAP_WRITE_PROTECT,
> + SLOT_ZAP_RMAPP,
> + SLOT_ZAP_COLLAPSIBLE_SPTE,
> +};
> +
> +static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> +  struct kvm_rmap_head *rmap_head);
> 
> /* The caller should hold mmu-lock before calling this function. */
> static bool
> slot_handle_level_range(struct kvm *kvm, struct kvm_memory_slot *memslot,
> - slot_level_handler fn, int start_level, int end_level,
> + enum slot_handler_op op, int start_level, int end_level,
>   gfn_t start_gfn, gfn_t end_gfn, bool lock_flush_tlb)
> {
>   struct slot_rmap_walk_iterator iterator;
> @@ -5068,8 +5077,29 @@ slot_handle_level_range(struct kvm *kvm, struct 
> kvm_memory_slot *memslot,
> 
>   for_each_slot_rmap_range(memslot, start_level, end_level, start_gfn,
>   end_gfn, ) {
> - if (iterator.rmap)
> - flush |= fn(kvm, iterator.rmap);
> + if (iterator.rmap) {
> + switch (op) {
> + case SLOT_RMAP_CLEAR_DIRTY:
> + flush |= __rmap_clear_dirty(kvm, iterator.rmap);
> + break;
> +
> + case SLOT_RMAP_SET_DIRTY:
> + flush |= __rmap_set_dirty(kvm, iterator.rmap);
> + break;
> +
> + case SLOT_RMAP_WRITE_PROTECT:
> + flush |= __rmap_write_protect(kvm, 
> iterator.rmap, false);
> + break;
> +
> + case SLOT_ZAP_RMAPP:
> + flush |= kvm_zap_rmapp(kvm, iterator.rmap);
> + break;
> +
> + case SLOT_ZAP_COLLAPSIBLE_SPTE:
> + flush |= kvm_mmu_zap_collapsible_spte(kvm, 
> iterator.rmap);
> + break;
> + }
> + }
> 
>   if (need_resched() || spin_needbreak(>mmu_lock)) {
>   if (flush && lock_flush_tlb) {
> @@ -5090,10 +5120,10 @@ slot_handle_level_range(struct kvm *kvm, struct 
> kvm_memory_slot *memslot,
> 
> static bool
> slot_handle_level(struct kvm *kvm, struct kvm_memory_slot *memslot,
> -   slot_level_handler fn, int start_level, int end_level,
> +   enum slot_handler_op op, int start_level, int end_level,
> bool lock_flush_tlb)
> {
> - return slot_handle_level_range(kvm, memslot, fn, start_level,
> + return slot_handle_level_range(kvm, memslot, op, start_level,
>   end_level, memslot->base_gfn,
>   memslot->base_gfn + memslot->npages - 1,
>   lock_flush_tlb);
> @@ -5101,25 +5131,25 @@ slot_handle_level(struct kvm *kvm, struct 
> kvm_memory_slot *memslot,
> 
> static bool
> slot_handle_all_level(struct kvm *kvm, struct kvm_memory_slot *memslot,
> -   slot_level_handler fn, bool lock_flush_tlb)
> +   enum slot_handler_op op, bool lock_flush_tlb)

Re: [PATCH] sched/fair: Prevent a division by 0 in scale_rt_capacity()

2017-12-20 Thread Sironi, Filippo

> On 19. Dec 2017, at 20:33, Peter Zijlstra  wrote:
> 
> On Sat, Dec 09, 2017 at 09:03:49AM +0100, Filippo Sironi wrote:
>> ... since total = sched_avg_period() + delta can yield 0x1,
>> which results in a division by 0, given that div_u64() takes a u32
>> divisor.  Use div64_u64() instead.
>> 
>> divide error:  [#1] SMP
>> CPU: 7 PID: 0 Comm: swapper/7 Not tainted 4.9.58 #1
>> Hardware name: ...
>> task: 8800a24e2800 task.stack: c974c000
>> RIP: 0010:[] [] 
>> update_group_capacity+0x16e/0x1c0
>> RSP: 0018:8800a74e3c18 EFLAGS: 00010246
>> RAX: 00445ced RBX: 0007 RCX: 024d
>> RDX:  RSI:  RDI: 000160c0
>> RBP: 8800a74e3c38 R08: 8800a17d5ac0 R09: 8800a74e
>> R10:  R11:  R12: 8800a297e400
>> R13: 8800a17d5ac0 R14:  R15: 8800a17d5ac0
>> FS: () GS:8800a74e() knlGS:
>> CS: 0010 DS:  ES:  CR0: 80050033
>> CR2: 006f3580 CR3: 01607000 CR4: 007426e0
>> DR0:  DR1:  DR2: 
>> DR3:  DR6: fffe0ff0 DR7: 0400
>> PKRU: 5554
>> Stack:
>> 8800a17d5180 8800a74e3e00 8800a17d5a01 8800a74e3c68
>> 8800a74e3d90 810d37e6 fff8 002300010c40
>> 0040 8800a17d5ad8  
>> Call Trace:
>>  [162553.008569] [] find_busiest_group+0xe6/0x950
>> [] load_balance+0x188/0xa70
>> [] ? update_rq_clock.part.88+0x13/0x30
>> [] rebalance_domains+0x210/0x290
>> [] run_rebalance_domains+0x1b0/0x1d0
>> [] __do_softirq+0x89/0x2b0
>> [] irq_exit+0xab/0xb0
>> [] smp_reschedule_interrupt+0x2e/0x30
>> [] reschedule_interrupt+0x84/0x90
>>  [162553.008603] [] ? cpuidle_enter_state+0x12f/0x2c0
>> [] cpuidle_enter+0x12/0x20
>> [] cpu_startup_entry+0x1a2/0x1f0
>> [] start_secondary+0x12d/0x140
>> Code: 0f 00 4c 8b 96 48 09 00 00 48 8b 86 40 09 00 00 48 8b b6 b0 08 00 00 
>> 48 d1 ea 4c 29 d6 41 ba 00 00 00 00 49 0f 48 f2 01 d6 31 d2 <48> f7 f6 ba 00 
>> 04 00 00 48 29 c2 48 3d ff 03 00 00 b8 01 00 00
>> RIP [] update_group_capacity+0x16e/0x1c0
>> RSP 
>> 
>> Cc: Ingo Molnar 
>> Cc: Peter Zijlstra 
>> Cc: linux-kernel@vger.kernel.org
>> Signed-off-by: Filippo Sironi 
>> ---
>> kernel/sched/fair.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 4037e19bbca2..04b6f847a241 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7517,7 +7517,7 @@ static unsigned long scale_rt_capacity(int cpu)
>> 
>>  total = sched_avg_period() + delta;
>> 
>> -used = div_u64(avg, total);
>> +used = div64_u64(avg, total);
>> 
> 
> so total should not get larger than 2*period IIRC, how did we get so
> large?
> 

>From the vmcores I got, it was always delta that made it become very
large.

Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: [PATCH] sched/fair: Prevent a division by 0 in scale_rt_capacity()

2017-12-20 Thread Sironi, Filippo

> On 19. Dec 2017, at 20:33, Peter Zijlstra  wrote:
> 
> On Sat, Dec 09, 2017 at 09:03:49AM +0100, Filippo Sironi wrote:
>> ... since total = sched_avg_period() + delta can yield 0x1,
>> which results in a division by 0, given that div_u64() takes a u32
>> divisor.  Use div64_u64() instead.
>> 
>> divide error:  [#1] SMP
>> CPU: 7 PID: 0 Comm: swapper/7 Not tainted 4.9.58 #1
>> Hardware name: ...
>> task: 8800a24e2800 task.stack: c974c000
>> RIP: 0010:[] [] 
>> update_group_capacity+0x16e/0x1c0
>> RSP: 0018:8800a74e3c18 EFLAGS: 00010246
>> RAX: 00445ced RBX: 0007 RCX: 024d
>> RDX:  RSI:  RDI: 000160c0
>> RBP: 8800a74e3c38 R08: 8800a17d5ac0 R09: 8800a74e
>> R10:  R11:  R12: 8800a297e400
>> R13: 8800a17d5ac0 R14:  R15: 8800a17d5ac0
>> FS: () GS:8800a74e() knlGS:
>> CS: 0010 DS:  ES:  CR0: 80050033
>> CR2: 006f3580 CR3: 01607000 CR4: 007426e0
>> DR0:  DR1:  DR2: 
>> DR3:  DR6: fffe0ff0 DR7: 0400
>> PKRU: 5554
>> Stack:
>> 8800a17d5180 8800a74e3e00 8800a17d5a01 8800a74e3c68
>> 8800a74e3d90 810d37e6 fff8 002300010c40
>> 0040 8800a17d5ad8  
>> Call Trace:
>>  [162553.008569] [] find_busiest_group+0xe6/0x950
>> [] load_balance+0x188/0xa70
>> [] ? update_rq_clock.part.88+0x13/0x30
>> [] rebalance_domains+0x210/0x290
>> [] run_rebalance_domains+0x1b0/0x1d0
>> [] __do_softirq+0x89/0x2b0
>> [] irq_exit+0xab/0xb0
>> [] smp_reschedule_interrupt+0x2e/0x30
>> [] reschedule_interrupt+0x84/0x90
>>  [162553.008603] [] ? cpuidle_enter_state+0x12f/0x2c0
>> [] cpuidle_enter+0x12/0x20
>> [] cpu_startup_entry+0x1a2/0x1f0
>> [] start_secondary+0x12d/0x140
>> Code: 0f 00 4c 8b 96 48 09 00 00 48 8b 86 40 09 00 00 48 8b b6 b0 08 00 00 
>> 48 d1 ea 4c 29 d6 41 ba 00 00 00 00 49 0f 48 f2 01 d6 31 d2 <48> f7 f6 ba 00 
>> 04 00 00 48 29 c2 48 3d ff 03 00 00 b8 01 00 00
>> RIP [] update_group_capacity+0x16e/0x1c0
>> RSP 
>> 
>> Cc: Ingo Molnar 
>> Cc: Peter Zijlstra 
>> Cc: linux-kernel@vger.kernel.org
>> Signed-off-by: Filippo Sironi 
>> ---
>> kernel/sched/fair.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 4037e19bbca2..04b6f847a241 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7517,7 +7517,7 @@ static unsigned long scale_rt_capacity(int cpu)
>> 
>>  total = sched_avg_period() + delta;
>> 
>> -used = div_u64(avg, total);
>> +used = div64_u64(avg, total);
>> 
> 
> so total should not get larger than 2*period IIRC, how did we get so
> large?
> 

>From the vmcores I got, it was always delta that made it become very
large.

Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: [PATCH 2/2] KVM: x86: Allow userspace to define what's the microcode version

2017-12-08 Thread Sironi, Filippo

> On 27. Nov 2017, at 14:09, Steve Rutherford  wrote:
> 
> On Mon, Nov 27, 2017 at 3:58 AM, Paolo Bonzini  wrote:
>> On 26/11/2017 17:41, Filippo Sironi wrote:
>>> ... that the guest should see.
>>> Guest operating systems may check the microcode version to decide whether
>>> to disable certain features that are known to be buggy up to certain
>>> microcode versions.  Address the issue by making the microcode version
>>> that the guest should see settable.
>>> The rationale for having userspace specifying the microcode version, rather
>>> than having the kernel picking it, is to ensure consistency for 
>>> live-migrated
>>> instances; we don't want them to see a microcode version increase without a
>>> reset.
>>> 
>>> Signed-off-by: Filippo Sironi 
>>> ---
>>> arch/x86/kvm/x86.c   | 23 +++
>>> include/uapi/linux/kvm.h |  3 +++
>>> 2 files changed, 26 insertions(+)
>>> 
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 925c3e29cad3..741588f27ebc 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -4033,6 +4033,29 @@ long kvm_arch_vm_ioctl(struct file *filp,
>>>  } u;
>>> 
>>>  switch (ioctl) {
>>> + case KVM_GET_MICROCODE_VERSION: {
>>> + r = -EFAULT;
>>> + if (copy_to_user(argp,
>>> +  >arch.microcode_version,
>>> +  sizeof(kvm->arch.microcode_version)))
>>> + goto out;
>>> + break;
>>> + }
>>> + case KVM_SET_MICROCODE_VERSION: {
>>> + u32 microcode_version;
>>> +
>>> + r = -EFAULT;
>>> + if (copy_from_user(_version,
>>> +argp,
>>> +sizeof(microcode_version)))
>>> + goto out;
>>> + r = -EINVAL;
>>> + if (!microcode_version)
>>> + goto out;
>>> + kvm->arch.microcode_version = microcode_version;
>>> + r = 0;
>>> + break;
>>> + }
>> 
>> Also, there's no need to define new ioctls, instead you can just place
>> it in the vcpu and use KVM_GET_MSR/KVM_SET_MSR.  I'd agree that's
>> slightly less polished, but it matches what we do already for e.g.
>> nested VMX model specific registers.  And it spares you for writing the
>> documentation that you didn't include in this patch. :)
>> 
>> Paolo
> 
> This feels good time to mention Peter Hornyack's old MSR KVM_EXIT
> patches. With something like them, there would be no need to push this
> into the kernel at all.

That's one of the solution we discussed internally (at Amazon) but we
didn't pursue yet given the need to release a quick fix for customers.
I was thinking about implementing a mechanism to selectively go back to
userspace to emulate MSRs; something that's not limited to KVM unhandled
MSRs but that instead could even override KVM's handling.

Filippo

Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: [PATCH 2/2] KVM: x86: Allow userspace to define what's the microcode version

2017-12-08 Thread Sironi, Filippo

> On 27. Nov 2017, at 14:09, Steve Rutherford  wrote:
> 
> On Mon, Nov 27, 2017 at 3:58 AM, Paolo Bonzini  wrote:
>> On 26/11/2017 17:41, Filippo Sironi wrote:
>>> ... that the guest should see.
>>> Guest operating systems may check the microcode version to decide whether
>>> to disable certain features that are known to be buggy up to certain
>>> microcode versions.  Address the issue by making the microcode version
>>> that the guest should see settable.
>>> The rationale for having userspace specifying the microcode version, rather
>>> than having the kernel picking it, is to ensure consistency for 
>>> live-migrated
>>> instances; we don't want them to see a microcode version increase without a
>>> reset.
>>> 
>>> Signed-off-by: Filippo Sironi 
>>> ---
>>> arch/x86/kvm/x86.c   | 23 +++
>>> include/uapi/linux/kvm.h |  3 +++
>>> 2 files changed, 26 insertions(+)
>>> 
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 925c3e29cad3..741588f27ebc 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -4033,6 +4033,29 @@ long kvm_arch_vm_ioctl(struct file *filp,
>>>  } u;
>>> 
>>>  switch (ioctl) {
>>> + case KVM_GET_MICROCODE_VERSION: {
>>> + r = -EFAULT;
>>> + if (copy_to_user(argp,
>>> +  >arch.microcode_version,
>>> +  sizeof(kvm->arch.microcode_version)))
>>> + goto out;
>>> + break;
>>> + }
>>> + case KVM_SET_MICROCODE_VERSION: {
>>> + u32 microcode_version;
>>> +
>>> + r = -EFAULT;
>>> + if (copy_from_user(_version,
>>> +argp,
>>> +sizeof(microcode_version)))
>>> + goto out;
>>> + r = -EINVAL;
>>> + if (!microcode_version)
>>> + goto out;
>>> + kvm->arch.microcode_version = microcode_version;
>>> + r = 0;
>>> + break;
>>> + }
>> 
>> Also, there's no need to define new ioctls, instead you can just place
>> it in the vcpu and use KVM_GET_MSR/KVM_SET_MSR.  I'd agree that's
>> slightly less polished, but it matches what we do already for e.g.
>> nested VMX model specific registers.  And it spares you for writing the
>> documentation that you didn't include in this patch. :)
>> 
>> Paolo
> 
> This feels good time to mention Peter Hornyack's old MSR KVM_EXIT
> patches. With something like them, there would be no need to push this
> into the kernel at all.

That's one of the solution we discussed internally (at Amazon) but we
didn't pursue yet given the need to release a quick fix for customers.
I was thinking about implementing a mechanism to selectively go back to
userspace to emulate MSRs; something that's not limited to KVM unhandled
MSRs but that instead could even override KVM's handling.

Filippo

Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: [PATCH 2/2] KVM: x86: Allow userspace to define what's the microcode version

2017-12-08 Thread Sironi, Filippo

> On 27. Nov 2017, at 03:58, Paolo Bonzini  wrote:
> 
> On 26/11/2017 17:41, Filippo Sironi wrote:
>> ... that the guest should see.
>> Guest operating systems may check the microcode version to decide whether
>> to disable certain features that are known to be buggy up to certain
>> microcode versions.  Address the issue by making the microcode version
>> that the guest should see settable.
>> The rationale for having userspace specifying the microcode version, rather
>> than having the kernel picking it, is to ensure consistency for live-migrated
>> instances; we don't want them to see a microcode version increase without a
>> reset.
>> 
>> Signed-off-by: Filippo Sironi 
>> ---
>> arch/x86/kvm/x86.c   | 23 +++
>> include/uapi/linux/kvm.h |  3 +++
>> 2 files changed, 26 insertions(+)
>> 
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 925c3e29cad3..741588f27ebc 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -4033,6 +4033,29 @@ long kvm_arch_vm_ioctl(struct file *filp,
>>  } u;
>> 
>>  switch (ioctl) {
>> +case KVM_GET_MICROCODE_VERSION: {
>> +r = -EFAULT;
>> +if (copy_to_user(argp,
>> + >arch.microcode_version,
>> + sizeof(kvm->arch.microcode_version)))
>> +goto out;
>> +break;
>> +}
>> +case KVM_SET_MICROCODE_VERSION: {
>> +u32 microcode_version;
>> +
>> +r = -EFAULT;
>> +if (copy_from_user(_version,
>> +   argp,
>> +   sizeof(microcode_version)))
>> +goto out;
>> +r = -EINVAL;
>> +if (!microcode_version)
>> +goto out;
>> +kvm->arch.microcode_version = microcode_version;
>> +r = 0;
>> +break;
>> +}
> 
> Also, there's no need to define new ioctls, instead you can just place
> it in the vcpu and use KVM_GET_MSR/KVM_SET_MSR.  I'd agree that's
> slightly less polished, but it matches what we do already for e.g.
> nested VMX model specific registers.  And it spares you for writing the
> documentation that you didn't include in this patch. :)
> 
> Paolo

I wanted to do the work once rather than doing it per vCPU but using
KVM_{GET|SET}_MSR and extending the list of MSRs that userspace can
save/restore is certainly doable.

I'll look into that and post a v2.

Filippo

Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: [PATCH 2/2] KVM: x86: Allow userspace to define what's the microcode version

2017-12-08 Thread Sironi, Filippo

> On 27. Nov 2017, at 03:58, Paolo Bonzini  wrote:
> 
> On 26/11/2017 17:41, Filippo Sironi wrote:
>> ... that the guest should see.
>> Guest operating systems may check the microcode version to decide whether
>> to disable certain features that are known to be buggy up to certain
>> microcode versions.  Address the issue by making the microcode version
>> that the guest should see settable.
>> The rationale for having userspace specifying the microcode version, rather
>> than having the kernel picking it, is to ensure consistency for live-migrated
>> instances; we don't want them to see a microcode version increase without a
>> reset.
>> 
>> Signed-off-by: Filippo Sironi 
>> ---
>> arch/x86/kvm/x86.c   | 23 +++
>> include/uapi/linux/kvm.h |  3 +++
>> 2 files changed, 26 insertions(+)
>> 
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 925c3e29cad3..741588f27ebc 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -4033,6 +4033,29 @@ long kvm_arch_vm_ioctl(struct file *filp,
>>  } u;
>> 
>>  switch (ioctl) {
>> +case KVM_GET_MICROCODE_VERSION: {
>> +r = -EFAULT;
>> +if (copy_to_user(argp,
>> + >arch.microcode_version,
>> + sizeof(kvm->arch.microcode_version)))
>> +goto out;
>> +break;
>> +}
>> +case KVM_SET_MICROCODE_VERSION: {
>> +u32 microcode_version;
>> +
>> +r = -EFAULT;
>> +if (copy_from_user(_version,
>> +   argp,
>> +   sizeof(microcode_version)))
>> +goto out;
>> +r = -EINVAL;
>> +if (!microcode_version)
>> +goto out;
>> +kvm->arch.microcode_version = microcode_version;
>> +r = 0;
>> +break;
>> +}
> 
> Also, there's no need to define new ioctls, instead you can just place
> it in the vcpu and use KVM_GET_MSR/KVM_SET_MSR.  I'd agree that's
> slightly less polished, but it matches what we do already for e.g.
> nested VMX model specific registers.  And it spares you for writing the
> documentation that you didn't include in this patch. :)
> 
> Paolo

I wanted to do the work once rather than doing it per vCPU but using
KVM_{GET|SET}_MSR and extending the list of MSRs that userspace can
save/restore is certainly doable.

I'll look into that and post a v2.

Filippo

Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: [PATCH 2/2] KVM: x86: Allow userspace to define what's the microcode version

2017-12-08 Thread Sironi, Filippo

> On 27. Nov 2017, at 02:40, Paolo Bonzini  wrote:
> 
> On 26/11/2017 17:41, Filippo Sironi wrote:
>> ... that the guest should see.
>> Guest operating systems may check the microcode version to decide whether
>> to disable certain features that are known to be buggy up to certain
>> microcode versions.  Address the issue by making the microcode version
>> that the guest should see settable.
> 
> What's the advantage of specifying the microcode version, rather than
> relying on userspace to drop the CPUID bit for the buggy feature?
> 
>   old guest(*) new guest
> 
>   hide in CPUID  good good
> 
>   use ucode rev  BAD  good
> 
> 
> (*) old guest = doesn't know that the feature is buggy until a given
> ucode revision
> 
> Thanks,
> 
> Paolo

On C5 and M5 instances, we're basically exposing the host CPUID with few
exceptions.  Linux (among the others) has checks to make sure that certain
features aren't enabled on a certain family/model/stepping if the microcode
version isn't greater than or equal to a known good version.
apic_check_deadline_errata() in arch/x86/kernel/apic/apic.c is the most
recent example in Linux that I know of (by now you've updated it ;) but
when we got the original bug report that triggered this patch, this wasn't
the case).

By exposing the real microcode version, we're preventing buggy guests that
don't check that they are running virtualized (i.e., they should trust the
hypervisor) from disabling features that are effectively not buggy.

Filippo

>> The rationale for having userspace specifying the microcode version, rather
>> than having the kernel picking it, is to ensure consistency for live-migrated
>> instances; we don't want them to see a microcode version increase without a
>> reset.

Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: [PATCH 2/2] KVM: x86: Allow userspace to define what's the microcode version

2017-12-08 Thread Sironi, Filippo

> On 27. Nov 2017, at 02:40, Paolo Bonzini  wrote:
> 
> On 26/11/2017 17:41, Filippo Sironi wrote:
>> ... that the guest should see.
>> Guest operating systems may check the microcode version to decide whether
>> to disable certain features that are known to be buggy up to certain
>> microcode versions.  Address the issue by making the microcode version
>> that the guest should see settable.
> 
> What's the advantage of specifying the microcode version, rather than
> relying on userspace to drop the CPUID bit for the buggy feature?
> 
>   old guest(*) new guest
> 
>   hide in CPUID  good good
> 
>   use ucode rev  BAD  good
> 
> 
> (*) old guest = doesn't know that the feature is buggy until a given
> ucode revision
> 
> Thanks,
> 
> Paolo

On C5 and M5 instances, we're basically exposing the host CPUID with few
exceptions.  Linux (among the others) has checks to make sure that certain
features aren't enabled on a certain family/model/stepping if the microcode
version isn't greater than or equal to a known good version.
apic_check_deadline_errata() in arch/x86/kernel/apic/apic.c is the most
recent example in Linux that I know of (by now you've updated it ;) but
when we got the original bug report that triggered this patch, this wasn't
the case).

By exposing the real microcode version, we're preventing buggy guests that
don't check that they are running virtualized (i.e., they should trust the
hypervisor) from disabling features that are effectively not buggy.

Filippo

>> The rationale for having userspace specifying the microcode version, rather
>> than having the kernel picking it, is to ensure consistency for live-migrated
>> instances; we don't want them to see a microcode version increase without a
>> reset.

Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: [PATCH 2/2] KVM: x86: Allow userspace to define what's the microcode version

2017-12-08 Thread Sironi, Filippo

> On 26. Nov 2017, at 17:02, Wanpeng Li  wrote:
> 
> 2017-11-27 0:41 GMT+08:00 Filippo Sironi :
>> ... that the guest should see.
>> Guest operating systems may check the microcode version to decide whether
>> to disable certain features that are known to be buggy up to certain
>> microcode versions.  Address the issue by making the microcode version
>> that the guest should see settable.
>> The rationale for having userspace specifying the microcode version, rather
>> than having the kernel picking it, is to ensure consistency for live-migrated
>> instances; we don't want them to see a microcode version increase without a
>> reset.
> 
> Is there a scenario which needs to refresh the microcode in the guest
> instead of on the host?
> 
> Regards,
> Wanpeng Li

Not that I can think of.
Today, we're picking up the host microcode version when launching an instance
and making sure that the same version is exposed for the life time of the
instance (i.e., across migrations that don't result in a reset).

Filippo

>> 
>> Signed-off-by: Filippo Sironi 
>> ---
>> arch/x86/kvm/x86.c   | 23 +++
>> include/uapi/linux/kvm.h |  3 +++
>> 2 files changed, 26 insertions(+)
>> 
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 925c3e29cad3..741588f27ebc 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -4033,6 +4033,29 @@ long kvm_arch_vm_ioctl(struct file *filp,
>>} u;
>> 
>>switch (ioctl) {
>> +   case KVM_GET_MICROCODE_VERSION: {
>> +   r = -EFAULT;
>> +   if (copy_to_user(argp,
>> +>arch.microcode_version,
>> +sizeof(kvm->arch.microcode_version)))
>> +   goto out;
>> +   break;
>> +   }
>> +   case KVM_SET_MICROCODE_VERSION: {
>> +   u32 microcode_version;
>> +
>> +   r = -EFAULT;
>> +   if (copy_from_user(_version,
>> +  argp,
>> +  sizeof(microcode_version)))
>> +   goto out;
>> +   r = -EINVAL;
>> +   if (!microcode_version)
>> +   goto out;
>> +   kvm->arch.microcode_version = microcode_version;
>> +   r = 0;
>> +   break;
>> +   }
>>case KVM_SET_TSS_ADDR:
>>r = kvm_vm_ioctl_set_tss_addr(kvm, arg);
>>break;
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 282d7613fce8..e11887758e29 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -1192,6 +1192,9 @@ struct kvm_s390_ucas_mapping {
>> #define KVM_S390_UCAS_UNMAP  _IOW(KVMIO, 0x51, struct 
>> kvm_s390_ucas_mapping)
>> #define KVM_S390_VCPU_FAULT _IOW(KVMIO, 0x52, unsigned long)
>> 
>> +#define KVM_GET_MICROCODE_VERSION _IOR(KVMIO, 0x5e, __u32)
>> +#define KVM_SET_MICROCODE_VERSION _IOW(KVMIO, 0x5f, __u32)
>> +
>> /* Device model IOC */
>> #define KVM_CREATE_IRQCHIP_IO(KVMIO,   0x60)
>> #define KVM_IRQ_LINE  _IOW(KVMIO,  0x61, struct kvm_irq_level)
>> --
>> 2.7.4
>> 
> 

Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: [PATCH 2/2] KVM: x86: Allow userspace to define what's the microcode version

2017-12-08 Thread Sironi, Filippo

> On 26. Nov 2017, at 17:02, Wanpeng Li  wrote:
> 
> 2017-11-27 0:41 GMT+08:00 Filippo Sironi :
>> ... that the guest should see.
>> Guest operating systems may check the microcode version to decide whether
>> to disable certain features that are known to be buggy up to certain
>> microcode versions.  Address the issue by making the microcode version
>> that the guest should see settable.
>> The rationale for having userspace specifying the microcode version, rather
>> than having the kernel picking it, is to ensure consistency for live-migrated
>> instances; we don't want them to see a microcode version increase without a
>> reset.
> 
> Is there a scenario which needs to refresh the microcode in the guest
> instead of on the host?
> 
> Regards,
> Wanpeng Li

Not that I can think of.
Today, we're picking up the host microcode version when launching an instance
and making sure that the same version is exposed for the life time of the
instance (i.e., across migrations that don't result in a reset).

Filippo

>> 
>> Signed-off-by: Filippo Sironi 
>> ---
>> arch/x86/kvm/x86.c   | 23 +++
>> include/uapi/linux/kvm.h |  3 +++
>> 2 files changed, 26 insertions(+)
>> 
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 925c3e29cad3..741588f27ebc 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -4033,6 +4033,29 @@ long kvm_arch_vm_ioctl(struct file *filp,
>>} u;
>> 
>>switch (ioctl) {
>> +   case KVM_GET_MICROCODE_VERSION: {
>> +   r = -EFAULT;
>> +   if (copy_to_user(argp,
>> +>arch.microcode_version,
>> +sizeof(kvm->arch.microcode_version)))
>> +   goto out;
>> +   break;
>> +   }
>> +   case KVM_SET_MICROCODE_VERSION: {
>> +   u32 microcode_version;
>> +
>> +   r = -EFAULT;
>> +   if (copy_from_user(_version,
>> +  argp,
>> +  sizeof(microcode_version)))
>> +   goto out;
>> +   r = -EINVAL;
>> +   if (!microcode_version)
>> +   goto out;
>> +   kvm->arch.microcode_version = microcode_version;
>> +   r = 0;
>> +   break;
>> +   }
>>case KVM_SET_TSS_ADDR:
>>r = kvm_vm_ioctl_set_tss_addr(kvm, arg);
>>break;
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 282d7613fce8..e11887758e29 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -1192,6 +1192,9 @@ struct kvm_s390_ucas_mapping {
>> #define KVM_S390_UCAS_UNMAP  _IOW(KVMIO, 0x51, struct 
>> kvm_s390_ucas_mapping)
>> #define KVM_S390_VCPU_FAULT _IOW(KVMIO, 0x52, unsigned long)
>> 
>> +#define KVM_GET_MICROCODE_VERSION _IOR(KVMIO, 0x5e, __u32)
>> +#define KVM_SET_MICROCODE_VERSION _IOW(KVMIO, 0x5f, __u32)
>> +
>> /* Device model IOC */
>> #define KVM_CREATE_IRQCHIP_IO(KVMIO,   0x60)
>> #define KVM_IRQ_LINE  _IOW(KVMIO,  0x61, struct kvm_irq_level)
>> --
>> 2.7.4
>> 
> 

Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: [PATCH 2/2] pci: Expose offset, stride, and VF device ID via sysfs

2017-10-04 Thread Sironi, Filippo

> On 3. Oct 2017, at 21:48, Bjorn Helgaas <helg...@kernel.org> wrote:
> 
> On Tue, Oct 03, 2017 at 02:31:14PM -0500, Bjorn Helgaas wrote:
>> On Fri, Sep 29, 2017 at 07:53:31AM +, Sironi, Filippo wrote:
>>> 
>>> Hi Bjorn,
>>> 
>>>> On 25. Sep 2017, at 20:55, Bjorn Helgaas <helg...@kernel.org> wrote:
>>>> 
>>>> Hi Filippo,
>>>> 
>>>> On Mon, Aug 28, 2017 at 03:38:50PM +0200, Filippo Sironi wrote:
>>>>> +static ssize_t sriov_vf_did_show(struct device *dev,
>>>>> +  struct device_attribute *attr,
>>>>> +  char *buf)
>>>>> +{
>>>>> + struct pci_dev *pdev = to_pci_dev(dev);
>>>>> +
>>>>> + return sprintf(buf, "%x\n", pdev->sriov->vf_did);
>>>>> +}
>>>> 
>>>> What does the vf_did part look like in sysfs?  Do we have a directory with
>>>> both "device" and "vf_did" in it?  If so, why do we have both and do we
>>>> need both?  Could we put the vf_did in the "device" file?
>>> 
>>> On my machine:
>>> 
>>> /sys/bus/pci/devices/:03:00.0# ls -l  # this is the PF
>>> total 0
>>> -rw-r--r-- 1 root root4096 Sep 28 19:41 broken_parity_status
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 class
>>> -rw-r--r-- 1 root root4096 Sep 28 19:41 config
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 consistent_dma_mask_bits
>>> -rw-r--r-- 1 root root4096 Sep 28 19:41 d3cold_allowed
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 device
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 dma_mask_bits
>>> lrwxrwxrwx 1 root root   0 Sep 28 19:41 driver -> 
>>> ../../../../bus/pci/drivers/igb
>>> -rw-r--r-- 1 root root4096 Sep 28 19:41 driver_override
>>> -rw-r--r-- 1 root root4096 Sep 28 19:41 enable
>>> lrwxrwxrwx 1 root root   0 Sep 28 19:41 firmware_node -> 
>>> ../../../LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:4b/device:4c
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 irq
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 local_cpulist
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 local_cpus
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 modalias
>>> -rw-r--r-- 1 root root4096 Sep 28 19:41 msi_bus
>>> drwxr-xr-x 2 root root   0 Sep 29 09:44 msi_irqs
>>> drwxr-xr-x 3 root root   0 Sep 28 19:41 net
>>> -rw-r--r-- 1 root root4096 Sep 28 19:41 numa_node
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 offset  # this is new
>>> drwxr-xr-x 2 root root   0 Sep 28 19:41 power
>>> drwxr-xr-x 3 root root   0 Sep 28 19:41 ptp
>>> --w--w 1 root root4096 Sep 28 19:41 remove
>>> --w--w 1 root root4096 Sep 28 19:41 rescan
>>> --w--- 1 root root4096 Sep 28 19:41 reset
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 resource
>>> -rw--- 1 root root  131072 Sep 28 19:41 resource0
>>> -rw--- 1 root root 4194304 Sep 28 19:41 resource1
>>> -rw--- 1 root root  32 Sep 28 19:41 resource2
>>> -rw--- 1 root root   16384 Sep 28 19:41 resource3
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 revision
>>> -rw-rw-r-- 1 root root4096 Sep 29 09:44 sriov_numvfs
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 sriov_totalvfs
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 stride  # this is new
>>> lrwxrwxrwx 1 root root   0 Sep 28 19:41 subsystem -> ../../../../bus/pci
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 subsystem_device
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 subsystem_vendor
>>> -rw-r--r-- 1 root root4096 Sep 28 19:41 uevent
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 vendor
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 vf_did  # this is new
>>> lrwxrwxrwx 1 root root   0 Sep 29 09:44 virtfn0 -> ../:03:10.0
>>> 
>>> nothing changes on for VFs.
>>> Then:
>>> 
>>> /sys/bus/pci/devices/:03:00.0# cat device 
>>> 0x10c9
>>> 
>>> /sys/bus/pci/devices/:03:00.0# cat vf_did
>>> 0x10ca
>>> 
>>> Putting the VF device ID in the PF device file would be a change of
>>> that we expose to userspace.  Something might break.
>> 
>> Oh, sorry, I misunderstood!  I was thinking that you were adding
>> "vf_did" to the VF directory, not the PF directory.
>> 
>> Then I guess my on

Re: [PATCH 2/2] pci: Expose offset, stride, and VF device ID via sysfs

2017-10-04 Thread Sironi, Filippo

> On 3. Oct 2017, at 21:48, Bjorn Helgaas  wrote:
> 
> On Tue, Oct 03, 2017 at 02:31:14PM -0500, Bjorn Helgaas wrote:
>> On Fri, Sep 29, 2017 at 07:53:31AM +, Sironi, Filippo wrote:
>>> 
>>> Hi Bjorn,
>>> 
>>>> On 25. Sep 2017, at 20:55, Bjorn Helgaas  wrote:
>>>> 
>>>> Hi Filippo,
>>>> 
>>>> On Mon, Aug 28, 2017 at 03:38:50PM +0200, Filippo Sironi wrote:
>>>>> +static ssize_t sriov_vf_did_show(struct device *dev,
>>>>> +  struct device_attribute *attr,
>>>>> +  char *buf)
>>>>> +{
>>>>> + struct pci_dev *pdev = to_pci_dev(dev);
>>>>> +
>>>>> + return sprintf(buf, "%x\n", pdev->sriov->vf_did);
>>>>> +}
>>>> 
>>>> What does the vf_did part look like in sysfs?  Do we have a directory with
>>>> both "device" and "vf_did" in it?  If so, why do we have both and do we
>>>> need both?  Could we put the vf_did in the "device" file?
>>> 
>>> On my machine:
>>> 
>>> /sys/bus/pci/devices/:03:00.0# ls -l  # this is the PF
>>> total 0
>>> -rw-r--r-- 1 root root4096 Sep 28 19:41 broken_parity_status
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 class
>>> -rw-r--r-- 1 root root4096 Sep 28 19:41 config
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 consistent_dma_mask_bits
>>> -rw-r--r-- 1 root root4096 Sep 28 19:41 d3cold_allowed
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 device
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 dma_mask_bits
>>> lrwxrwxrwx 1 root root   0 Sep 28 19:41 driver -> 
>>> ../../../../bus/pci/drivers/igb
>>> -rw-r--r-- 1 root root4096 Sep 28 19:41 driver_override
>>> -rw-r--r-- 1 root root4096 Sep 28 19:41 enable
>>> lrwxrwxrwx 1 root root   0 Sep 28 19:41 firmware_node -> 
>>> ../../../LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:4b/device:4c
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 irq
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 local_cpulist
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 local_cpus
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 modalias
>>> -rw-r--r-- 1 root root4096 Sep 28 19:41 msi_bus
>>> drwxr-xr-x 2 root root   0 Sep 29 09:44 msi_irqs
>>> drwxr-xr-x 3 root root   0 Sep 28 19:41 net
>>> -rw-r--r-- 1 root root4096 Sep 28 19:41 numa_node
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 offset  # this is new
>>> drwxr-xr-x 2 root root   0 Sep 28 19:41 power
>>> drwxr-xr-x 3 root root   0 Sep 28 19:41 ptp
>>> --w--w 1 root root4096 Sep 28 19:41 remove
>>> --w--w 1 root root4096 Sep 28 19:41 rescan
>>> --w--- 1 root root4096 Sep 28 19:41 reset
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 resource
>>> -rw--- 1 root root  131072 Sep 28 19:41 resource0
>>> -rw--- 1 root root 4194304 Sep 28 19:41 resource1
>>> -rw--- 1 root root  32 Sep 28 19:41 resource2
>>> -rw--- 1 root root   16384 Sep 28 19:41 resource3
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 revision
>>> -rw-rw-r-- 1 root root4096 Sep 29 09:44 sriov_numvfs
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 sriov_totalvfs
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 stride  # this is new
>>> lrwxrwxrwx 1 root root   0 Sep 28 19:41 subsystem -> ../../../../bus/pci
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 subsystem_device
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 subsystem_vendor
>>> -rw-r--r-- 1 root root4096 Sep 28 19:41 uevent
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 vendor
>>> -r--r--r-- 1 root root4096 Sep 28 19:41 vf_did  # this is new
>>> lrwxrwxrwx 1 root root   0 Sep 29 09:44 virtfn0 -> ../:03:10.0
>>> 
>>> nothing changes on for VFs.
>>> Then:
>>> 
>>> /sys/bus/pci/devices/:03:00.0# cat device 
>>> 0x10c9
>>> 
>>> /sys/bus/pci/devices/:03:00.0# cat vf_did
>>> 0x10ca
>>> 
>>> Putting the VF device ID in the PF device file would be a change of
>>> that we expose to userspace.  Something might break.
>> 
>> Oh, sorry, I misunderstood!  I was thinking that you were adding
>> "vf_did" to the VF directory, not the PF directory.
>> 
>> Then I guess my only issue is that "vf_did&q

Re: [PATCH 2/2] pci: Expose offset, stride, and VF device ID via sysfs

2017-09-29 Thread Sironi, Filippo

Hi Bjorn,

> On 25. Sep 2017, at 20:55, Bjorn Helgaas  wrote:
> 
> Hi Filippo,
> 
> On Mon, Aug 28, 2017 at 03:38:50PM +0200, Filippo Sironi wrote:
>> +static ssize_t sriov_vf_did_show(struct device *dev,
>> + struct device_attribute *attr,
>> + char *buf)
>> +{
>> +struct pci_dev *pdev = to_pci_dev(dev);
>> +
>> +return sprintf(buf, "%x\n", pdev->sriov->vf_did);
>> +}
> 
> What does the vf_did part look like in sysfs?  Do we have a directory with
> both "device" and "vf_did" in it?  If so, why do we have both and do we
> need both?  Could we put the vf_did in the "device" file?

On my machine:

/sys/bus/pci/devices/:03:00.0# ls -l  # this is the PF
total 0
-rw-r--r-- 1 root root4096 Sep 28 19:41 broken_parity_status
-r--r--r-- 1 root root4096 Sep 28 19:41 class
-rw-r--r-- 1 root root4096 Sep 28 19:41 config
-r--r--r-- 1 root root4096 Sep 28 19:41 consistent_dma_mask_bits
-rw-r--r-- 1 root root4096 Sep 28 19:41 d3cold_allowed
-r--r--r-- 1 root root4096 Sep 28 19:41 device
-r--r--r-- 1 root root4096 Sep 28 19:41 dma_mask_bits
lrwxrwxrwx 1 root root   0 Sep 28 19:41 driver -> 
../../../../bus/pci/drivers/igb
-rw-r--r-- 1 root root4096 Sep 28 19:41 driver_override
-rw-r--r-- 1 root root4096 Sep 28 19:41 enable
lrwxrwxrwx 1 root root   0 Sep 28 19:41 firmware_node -> 
../../../LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:4b/device:4c
-r--r--r-- 1 root root4096 Sep 28 19:41 irq
-r--r--r-- 1 root root4096 Sep 28 19:41 local_cpulist
-r--r--r-- 1 root root4096 Sep 28 19:41 local_cpus
-r--r--r-- 1 root root4096 Sep 28 19:41 modalias
-rw-r--r-- 1 root root4096 Sep 28 19:41 msi_bus
drwxr-xr-x 2 root root   0 Sep 29 09:44 msi_irqs
drwxr-xr-x 3 root root   0 Sep 28 19:41 net
-rw-r--r-- 1 root root4096 Sep 28 19:41 numa_node
-r--r--r-- 1 root root4096 Sep 28 19:41 offset  # this is new
drwxr-xr-x 2 root root   0 Sep 28 19:41 power
drwxr-xr-x 3 root root   0 Sep 28 19:41 ptp
--w--w 1 root root4096 Sep 28 19:41 remove
--w--w 1 root root4096 Sep 28 19:41 rescan
--w--- 1 root root4096 Sep 28 19:41 reset
-r--r--r-- 1 root root4096 Sep 28 19:41 resource
-rw--- 1 root root  131072 Sep 28 19:41 resource0
-rw--- 1 root root 4194304 Sep 28 19:41 resource1
-rw--- 1 root root  32 Sep 28 19:41 resource2
-rw--- 1 root root   16384 Sep 28 19:41 resource3
-r--r--r-- 1 root root4096 Sep 28 19:41 revision
-rw-rw-r-- 1 root root4096 Sep 29 09:44 sriov_numvfs
-r--r--r-- 1 root root4096 Sep 28 19:41 sriov_totalvfs
-r--r--r-- 1 root root4096 Sep 28 19:41 stride  # this is new
lrwxrwxrwx 1 root root   0 Sep 28 19:41 subsystem -> ../../../../bus/pci
-r--r--r-- 1 root root4096 Sep 28 19:41 subsystem_device
-r--r--r-- 1 root root4096 Sep 28 19:41 subsystem_vendor
-rw-r--r-- 1 root root4096 Sep 28 19:41 uevent
-r--r--r-- 1 root root4096 Sep 28 19:41 vendor
-r--r--r-- 1 root root4096 Sep 28 19:41 vf_did  # this is new
lrwxrwxrwx 1 root root   0 Sep 29 09:44 virtfn0 -> ../:03:10.0

nothing changes on for VFs.
Then:

/sys/bus/pci/devices/:03:00.0# cat device 
0x10c9

/sys/bus/pci/devices/:03:00.0# cat vf_did
0x10ca

Putting the VF device ID in the PF device file would be a change of
that we expose to userspace.  Something might break.

vf_did provides a easy way to retrieve the VF device ID without reading
the PF config (looking up the SR-IOV capability and reading it) or without
enabling SR-IOV to read for example virtfn0/device.

Similar considerations (ease of access) apply to offset and stride.


>> static ssize_t sriov_drivers_autoprobe_show(struct device *dev,
>>  struct device_attribute *attr,
>>  char *buf)
>> @@ -676,6 +703,9 @@ static struct device_attribute sriov_totalvfs_attr = 
>> __ATTR_RO(sriov_totalvfs);
>> static struct device_attribute sriov_numvfs_attr =
>>  __ATTR(sriov_numvfs, (S_IRUGO|S_IWUSR|S_IWGRP),
>> sriov_numvfs_show, sriov_numvfs_store);
>> +static struct device_attribute sriov_offset_attr = __ATTR_RO(sriov_offset);
>> +static struct device_attribute sriov_stride_attr = __ATTR_RO(sriov_stride);
>> +static struct device_attribute sriov_vf_did_attr = __ATTR_RO(sriov_vf_did);
>> static struct device_attribute sriov_drivers_autoprobe_attr =
>>  __ATTR(sriov_drivers_autoprobe, (S_IRUGO|S_IWUSR|S_IWGRP),
>> sriov_drivers_autoprobe_show, 
>> sriov_drivers_autoprobe_store);
>> @@ -1744,6 +1774,9 @@ static struct attribute_group pci_dev_hp_attr_group = {
>> static struct attribute *sriov_dev_attrs[] = {
>>  _totalvfs_attr.attr,
>>  _numvfs_attr.attr,
>> +_offset_attr.attr,
>> +_stride_attr.attr,
>> +_vf_did_attr.attr,
>>  _drivers_autoprobe_attr.attr,
>>  NULL,
>> };
>> -- 
>> 2.7.4
>> 

Re: [PATCH 2/2] pci: Expose offset, stride, and VF device ID via sysfs

2017-09-29 Thread Sironi, Filippo

Hi Bjorn,

> On 25. Sep 2017, at 20:55, Bjorn Helgaas  wrote:
> 
> Hi Filippo,
> 
> On Mon, Aug 28, 2017 at 03:38:50PM +0200, Filippo Sironi wrote:
>> +static ssize_t sriov_vf_did_show(struct device *dev,
>> + struct device_attribute *attr,
>> + char *buf)
>> +{
>> +struct pci_dev *pdev = to_pci_dev(dev);
>> +
>> +return sprintf(buf, "%x\n", pdev->sriov->vf_did);
>> +}
> 
> What does the vf_did part look like in sysfs?  Do we have a directory with
> both "device" and "vf_did" in it?  If so, why do we have both and do we
> need both?  Could we put the vf_did in the "device" file?

On my machine:

/sys/bus/pci/devices/:03:00.0# ls -l  # this is the PF
total 0
-rw-r--r-- 1 root root4096 Sep 28 19:41 broken_parity_status
-r--r--r-- 1 root root4096 Sep 28 19:41 class
-rw-r--r-- 1 root root4096 Sep 28 19:41 config
-r--r--r-- 1 root root4096 Sep 28 19:41 consistent_dma_mask_bits
-rw-r--r-- 1 root root4096 Sep 28 19:41 d3cold_allowed
-r--r--r-- 1 root root4096 Sep 28 19:41 device
-r--r--r-- 1 root root4096 Sep 28 19:41 dma_mask_bits
lrwxrwxrwx 1 root root   0 Sep 28 19:41 driver -> 
../../../../bus/pci/drivers/igb
-rw-r--r-- 1 root root4096 Sep 28 19:41 driver_override
-rw-r--r-- 1 root root4096 Sep 28 19:41 enable
lrwxrwxrwx 1 root root   0 Sep 28 19:41 firmware_node -> 
../../../LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:4b/device:4c
-r--r--r-- 1 root root4096 Sep 28 19:41 irq
-r--r--r-- 1 root root4096 Sep 28 19:41 local_cpulist
-r--r--r-- 1 root root4096 Sep 28 19:41 local_cpus
-r--r--r-- 1 root root4096 Sep 28 19:41 modalias
-rw-r--r-- 1 root root4096 Sep 28 19:41 msi_bus
drwxr-xr-x 2 root root   0 Sep 29 09:44 msi_irqs
drwxr-xr-x 3 root root   0 Sep 28 19:41 net
-rw-r--r-- 1 root root4096 Sep 28 19:41 numa_node
-r--r--r-- 1 root root4096 Sep 28 19:41 offset  # this is new
drwxr-xr-x 2 root root   0 Sep 28 19:41 power
drwxr-xr-x 3 root root   0 Sep 28 19:41 ptp
--w--w 1 root root4096 Sep 28 19:41 remove
--w--w 1 root root4096 Sep 28 19:41 rescan
--w--- 1 root root4096 Sep 28 19:41 reset
-r--r--r-- 1 root root4096 Sep 28 19:41 resource
-rw--- 1 root root  131072 Sep 28 19:41 resource0
-rw--- 1 root root 4194304 Sep 28 19:41 resource1
-rw--- 1 root root  32 Sep 28 19:41 resource2
-rw--- 1 root root   16384 Sep 28 19:41 resource3
-r--r--r-- 1 root root4096 Sep 28 19:41 revision
-rw-rw-r-- 1 root root4096 Sep 29 09:44 sriov_numvfs
-r--r--r-- 1 root root4096 Sep 28 19:41 sriov_totalvfs
-r--r--r-- 1 root root4096 Sep 28 19:41 stride  # this is new
lrwxrwxrwx 1 root root   0 Sep 28 19:41 subsystem -> ../../../../bus/pci
-r--r--r-- 1 root root4096 Sep 28 19:41 subsystem_device
-r--r--r-- 1 root root4096 Sep 28 19:41 subsystem_vendor
-rw-r--r-- 1 root root4096 Sep 28 19:41 uevent
-r--r--r-- 1 root root4096 Sep 28 19:41 vendor
-r--r--r-- 1 root root4096 Sep 28 19:41 vf_did  # this is new
lrwxrwxrwx 1 root root   0 Sep 29 09:44 virtfn0 -> ../:03:10.0

nothing changes on for VFs.
Then:

/sys/bus/pci/devices/:03:00.0# cat device 
0x10c9

/sys/bus/pci/devices/:03:00.0# cat vf_did
0x10ca

Putting the VF device ID in the PF device file would be a change of
that we expose to userspace.  Something might break.

vf_did provides a easy way to retrieve the VF device ID without reading
the PF config (looking up the SR-IOV capability and reading it) or without
enabling SR-IOV to read for example virtfn0/device.

Similar considerations (ease of access) apply to offset and stride.


>> static ssize_t sriov_drivers_autoprobe_show(struct device *dev,
>>  struct device_attribute *attr,
>>  char *buf)
>> @@ -676,6 +703,9 @@ static struct device_attribute sriov_totalvfs_attr = 
>> __ATTR_RO(sriov_totalvfs);
>> static struct device_attribute sriov_numvfs_attr =
>>  __ATTR(sriov_numvfs, (S_IRUGO|S_IWUSR|S_IWGRP),
>> sriov_numvfs_show, sriov_numvfs_store);
>> +static struct device_attribute sriov_offset_attr = __ATTR_RO(sriov_offset);
>> +static struct device_attribute sriov_stride_attr = __ATTR_RO(sriov_stride);
>> +static struct device_attribute sriov_vf_did_attr = __ATTR_RO(sriov_vf_did);
>> static struct device_attribute sriov_drivers_autoprobe_attr =
>>  __ATTR(sriov_drivers_autoprobe, (S_IRUGO|S_IWUSR|S_IWGRP),
>> sriov_drivers_autoprobe_show, 
>> sriov_drivers_autoprobe_store);
>> @@ -1744,6 +1774,9 @@ static struct attribute_group pci_dev_hp_attr_group = {
>> static struct attribute *sriov_dev_attrs[] = {
>>  _totalvfs_attr.attr,
>>  _numvfs_attr.attr,
>> +_offset_attr.attr,
>> +_stride_attr.attr,
>> +_vf_did_attr.attr,
>>  _drivers_autoprobe_attr.attr,
>>  NULL,
>> };
>> -- 
>> 2.7.4
>> 
> 

Amazon 

Re: [PATCH] intel-iommu: Don't be too aggressive when clearing one context entry

2017-08-31 Thread Sironi, Filippo
Hi Jacob,

> On 30. Aug 2017, at 17:50, Jacob Pan  wrote:
> 
> On Mon, 28 Aug 2017 16:16:29 +0200
> Filippo Sironi via iommu  wrote:
> 
>> Previously, we were invalidating context cache and IOTLB globally when
>> clearing one context entry.  This is a tad too aggressive.
>> Invalidate the context cache and IOTLB for the interested device only.
>> 
>> Signed-off-by: Filippo Sironi 
>> Cc: David Woodhouse 
>> Cc: David Woodhouse 
>> Cc: Joerg Roedel 
>> Cc: io...@lists.linux-foundation.org
>> Cc: linux-kernel@vger.kernel.org
>> ---
>> drivers/iommu/intel-iommu.c | 25 ++---
>> 1 file changed, 22 insertions(+), 3 deletions(-)
>> 
>> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
>> index 3e8636f1220e..4bf3e59b0929 100644
>> --- a/drivers/iommu/intel-iommu.c
>> +++ b/drivers/iommu/intel-iommu.c
>> @@ -2351,13 +2351,32 @@ static inline int domain_pfn_mapping(struct
>> dmar_domain *domain, unsigned long i 
>> static void domain_context_clear_one(struct intel_iommu *iommu, u8
>> bus, u8 devfn) {
>> +unsigned long flags;
>> +struct context_entry *context;
>> +u16 did_old;
>> +
>>  if (!iommu)
>>  return;
>> 
>> +spin_lock_irqsave(>lock, flags);
>> +context = iommu_context_addr(iommu, bus, devfn, 0);
>> +if (!context) {
>> +spin_unlock_irqrestore(>lock, flags);
>> +return;
>> +}
> perhaps check with device_context_mapped()?

Using device_context_mapped() wouldn't simplify the code since it
would just tell me that at the time of check there was a context.
I would still need to lock, get the context, check if the context
is valid, do the work, and unlock.

Modifying device_context_mapped() to return the context isn't
going to work either because it may go away in the meantime since
I wouldn't hold the lock.

>> +did_old = context_domain_id(context);
>> +spin_unlock_irqrestore(>lock, flags);
>>  clear_context_table(iommu, bus, devfn);
>> -iommu->flush.flush_context(iommu, 0, 0, 0,
>> -   DMA_CCMD_GLOBAL_INVL);
>> -iommu->flush.flush_iotlb(iommu, 0, 0, 0,
>> DMA_TLB_GLOBAL_FLUSH);
>> +iommu->flush.flush_context(iommu,
>> +   did_old,
>> +   (((u16)bus) << 8) | devfn,
>> +   DMA_CCMD_MASK_NOBIT,
>> +   DMA_CCMD_DEVICE_INVL);
>> +iommu->flush.flush_iotlb(iommu,
>> + did_old,
>> + 0,
>> + 0,
>> + DMA_TLB_DSI_FLUSH);
>> }
>> 
>> static inline void unlink_domain_info(struct device_domain_info
>> *info)
> 
> [Jacob Pan]

Regards,
Filippo


Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: [PATCH] intel-iommu: Don't be too aggressive when clearing one context entry

2017-08-31 Thread Sironi, Filippo
Hi Jacob,

> On 30. Aug 2017, at 17:50, Jacob Pan  wrote:
> 
> On Mon, 28 Aug 2017 16:16:29 +0200
> Filippo Sironi via iommu  wrote:
> 
>> Previously, we were invalidating context cache and IOTLB globally when
>> clearing one context entry.  This is a tad too aggressive.
>> Invalidate the context cache and IOTLB for the interested device only.
>> 
>> Signed-off-by: Filippo Sironi 
>> Cc: David Woodhouse 
>> Cc: David Woodhouse 
>> Cc: Joerg Roedel 
>> Cc: io...@lists.linux-foundation.org
>> Cc: linux-kernel@vger.kernel.org
>> ---
>> drivers/iommu/intel-iommu.c | 25 ++---
>> 1 file changed, 22 insertions(+), 3 deletions(-)
>> 
>> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
>> index 3e8636f1220e..4bf3e59b0929 100644
>> --- a/drivers/iommu/intel-iommu.c
>> +++ b/drivers/iommu/intel-iommu.c
>> @@ -2351,13 +2351,32 @@ static inline int domain_pfn_mapping(struct
>> dmar_domain *domain, unsigned long i 
>> static void domain_context_clear_one(struct intel_iommu *iommu, u8
>> bus, u8 devfn) {
>> +unsigned long flags;
>> +struct context_entry *context;
>> +u16 did_old;
>> +
>>  if (!iommu)
>>  return;
>> 
>> +spin_lock_irqsave(>lock, flags);
>> +context = iommu_context_addr(iommu, bus, devfn, 0);
>> +if (!context) {
>> +spin_unlock_irqrestore(>lock, flags);
>> +return;
>> +}
> perhaps check with device_context_mapped()?

Using device_context_mapped() wouldn't simplify the code since it
would just tell me that at the time of check there was a context.
I would still need to lock, get the context, check if the context
is valid, do the work, and unlock.

Modifying device_context_mapped() to return the context isn't
going to work either because it may go away in the meantime since
I wouldn't hold the lock.

>> +did_old = context_domain_id(context);
>> +spin_unlock_irqrestore(>lock, flags);
>>  clear_context_table(iommu, bus, devfn);
>> -iommu->flush.flush_context(iommu, 0, 0, 0,
>> -   DMA_CCMD_GLOBAL_INVL);
>> -iommu->flush.flush_iotlb(iommu, 0, 0, 0,
>> DMA_TLB_GLOBAL_FLUSH);
>> +iommu->flush.flush_context(iommu,
>> +   did_old,
>> +   (((u16)bus) << 8) | devfn,
>> +   DMA_CCMD_MASK_NOBIT,
>> +   DMA_CCMD_DEVICE_INVL);
>> +iommu->flush.flush_iotlb(iommu,
>> + did_old,
>> + 0,
>> + 0,
>> + DMA_TLB_DSI_FLUSH);
>> }
>> 
>> static inline void unlink_domain_info(struct device_domain_info
>> *info)
> 
> [Jacob Pan]

Regards,
Filippo


Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: [PATCH] intel-iommu: Don't be too aggressive when clearing one context entry

2017-08-31 Thread Sironi, Filippo
Hi Joerg,

> On 30. Aug 2017, at 15:31, Joerg Roedel  wrote:
> 
> Hi Filippo,
> 
> please change the subject to:
> 
>   iommu/vt-d: Don't be too aggressive when clearing one context entry
> 
> to follow the convention used in the iommu-tree. Another comment below.

Will do.

> On Mon, Aug 28, 2017 at 04:16:29PM +0200, Filippo Sironi wrote:
>> static void domain_context_clear_one(struct intel_iommu *iommu, u8 bus, u8 
>> devfn)
>> {
>> +unsigned long flags;
>> +struct context_entry *context;
>> +u16 did_old;
>> +
>>  if (!iommu)
>>  return;
>> 
>> +spin_lock_irqsave(>lock, flags);
>> +context = iommu_context_addr(iommu, bus, devfn, 0);
>> +if (!context) {
>> +spin_unlock_irqrestore(>lock, flags);
>> +return;
>> +}
>> +did_old = context_domain_id(context);
>> +spin_unlock_irqrestore(>lock, flags);
>>  clear_context_table(iommu, bus, devfn);
> 
> This function is the only caller of clear_context_table(), which does
> similar things (like fetching the context-entry) as you are adding
> above.
> 
> So you can either make clear_context_table() return the old domain-id
> so that you don't need to do it here, or you get rid of the function
> entirely and add the context_clear_entry() and __iommu_flush_cache()
> calls into this code-path.
> 
> Regards,
> 
>   Joerg

I went for merging domain_context_clear_one() with context_clear_one().

Regards,
Filippo


Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: [PATCH] intel-iommu: Don't be too aggressive when clearing one context entry

2017-08-31 Thread Sironi, Filippo
Hi Joerg,

> On 30. Aug 2017, at 15:31, Joerg Roedel  wrote:
> 
> Hi Filippo,
> 
> please change the subject to:
> 
>   iommu/vt-d: Don't be too aggressive when clearing one context entry
> 
> to follow the convention used in the iommu-tree. Another comment below.

Will do.

> On Mon, Aug 28, 2017 at 04:16:29PM +0200, Filippo Sironi wrote:
>> static void domain_context_clear_one(struct intel_iommu *iommu, u8 bus, u8 
>> devfn)
>> {
>> +unsigned long flags;
>> +struct context_entry *context;
>> +u16 did_old;
>> +
>>  if (!iommu)
>>  return;
>> 
>> +spin_lock_irqsave(>lock, flags);
>> +context = iommu_context_addr(iommu, bus, devfn, 0);
>> +if (!context) {
>> +spin_unlock_irqrestore(>lock, flags);
>> +return;
>> +}
>> +did_old = context_domain_id(context);
>> +spin_unlock_irqrestore(>lock, flags);
>>  clear_context_table(iommu, bus, devfn);
> 
> This function is the only caller of clear_context_table(), which does
> similar things (like fetching the context-entry) as you are adding
> above.
> 
> So you can either make clear_context_table() return the old domain-id
> so that you don't need to do it here, or you get rid of the function
> entirely and add the context_clear_entry() and __iommu_flush_cache()
> calls into this code-path.
> 
> Regards,
> 
>   Joerg

I went for merging domain_context_clear_one() with context_clear_one().

Regards,
Filippo


Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: [PATCH] x86, kvm: Handle PFNs outside of kernel reach when touching GPTEs

2017-04-12 Thread Sironi, Filippo
Thanks for taking the time and sorry for the delay.

> On 6. Apr 2017, at 16:22, Radim Krčmář  wrote:
> 
> 2017-04-05 15:07+0200, Filippo Sironi:
>> cmpxchg_gpte() calls get_user_pages_fast() to retrieve the number of
>> pages and the respective struct pages for mapping in the kernel virtual
>> address space.
>> This doesn't work if get_user_pages_fast() is invoked with a userspace
>> virtual address that's backed by PFNs outside of kernel reach (e.g.,
>> when limiting the kernel memory with mem= in the command line and using
>> /dev/mem to map memory).
>> 
>> If get_user_pages_fast() fails, look up the VMA that backs the userspace
>> virtual address, compute the PFN and the physical address, and map it in
>> the kernel virtual address space with memremap().
> 
> What is the reason for a configuration that voluntarily restricts access
> to memory that it needs?

By using /dev/mem to provide VM memory, one can avoid the overhead of 
allocating struct page(s) for the whole memory, which is wasteful when using a 
server entirely for hosting VMs.

>> Signed-off-by: Filippo Sironi 
>> Cc: Anthony Liguori 
>> Cc: k...@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> ---
>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
>> @@ -147,15 +147,36 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, 
>> struct kvm_mmu *mmu,
>>  struct page *page;
>> 
>>  npages = get_user_pages_fast((unsigned long)ptep_user, 1, 1, );
>> -/* Check if the user is doing something meaningless. */
>> -if (unlikely(npages != 1))
>> -return -EFAULT;
>> -
>> -table = kmap_atomic(page);
>> -ret = CMPXCHG([index], orig_pte, new_pte);
>> -kunmap_atomic(table);
>> -
>> -kvm_release_page_dirty(page);
>> +if (likely(npages == 1)) {
>> +table = kmap_atomic(page);
>> +ret = CMPXCHG([index], orig_pte, new_pte);
>> +kunmap_atomic(table);
>> +
>> +kvm_release_page_dirty(page);
>> +} else {
>> +struct vm_area_struct *vma;
>> +unsigned long vaddr = (unsigned long)ptep_user & PAGE_MASK;
>> +unsigned long pfn;
>> +unsigned long paddr;
>> +
>> +down_read(>mm->mmap_sem);
>> +vma = find_vma_intersection(current->mm, vaddr,
>> +vaddr + PAGE_SIZE);
> 
> Hm, with the argument order like this, we check that
> 
>  vaddr < vma->vm_end && vaddr + PAGE_SIZE > vma->vm_start
> 
> but shouldn't we actually check that the whole page is there, i.e.
> 
>  vaddr + PAGE_SIZE < vma->vm_end && vaddr > vma->vm_start
> 
> ?
> 
> Thanks.

Hm, right now we check for the following:

vaddr >= vma->vm_start && vaddr < vma->vm_end && vaddr + PAGE_SIZE > 
vma->vm_start

given that vaddr is PAGE_SIZE aligned, we're guaranteed that vaddr + PAGE_SIZE 
<= vma->vm_end.
This seems more complex than necessary. I believe that:

vma = find_vma(current->mm, vaddr);

should be enough.

>> +if (!vma || !(vma->vm_flags & VM_PFNMAP)) {
>> +up_read(>mm->mmap_sem);
>> +return -EFAULT;
>> +}
>> +pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
>> +paddr = pfn << PAGE_SHIFT;
>> +table = memremap(paddr, PAGE_SIZE, MEMREMAP_WB);
> 
> (I don't undestand why there isn't a wrapper for this ...
> Looks like we're doing something unexpected.)

Do you mean a wrapper for getting the pfn/paddr?

>> +if (!table) {
>> +up_read(>mm->mmap_sem);
>> +return -EFAULT;
>> +}
>> +ret = CMPXCHG([index], orig_pte, new_pte);
>> +memunmap(table);
>> +up_read(>mm->mmap_sem);
>> +}
>> 
>>  return (ret != orig_pte);
>> }
>> -- 
>> 2.7.4

I'll submit a v2 version soon.

Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


Re: [PATCH] x86, kvm: Handle PFNs outside of kernel reach when touching GPTEs

2017-04-12 Thread Sironi, Filippo
Thanks for taking the time and sorry for the delay.

> On 6. Apr 2017, at 16:22, Radim Krčmář  wrote:
> 
> 2017-04-05 15:07+0200, Filippo Sironi:
>> cmpxchg_gpte() calls get_user_pages_fast() to retrieve the number of
>> pages and the respective struct pages for mapping in the kernel virtual
>> address space.
>> This doesn't work if get_user_pages_fast() is invoked with a userspace
>> virtual address that's backed by PFNs outside of kernel reach (e.g.,
>> when limiting the kernel memory with mem= in the command line and using
>> /dev/mem to map memory).
>> 
>> If get_user_pages_fast() fails, look up the VMA that backs the userspace
>> virtual address, compute the PFN and the physical address, and map it in
>> the kernel virtual address space with memremap().
> 
> What is the reason for a configuration that voluntarily restricts access
> to memory that it needs?

By using /dev/mem to provide VM memory, one can avoid the overhead of 
allocating struct page(s) for the whole memory, which is wasteful when using a 
server entirely for hosting VMs.

>> Signed-off-by: Filippo Sironi 
>> Cc: Anthony Liguori 
>> Cc: k...@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> ---
>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
>> @@ -147,15 +147,36 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, 
>> struct kvm_mmu *mmu,
>>  struct page *page;
>> 
>>  npages = get_user_pages_fast((unsigned long)ptep_user, 1, 1, );
>> -/* Check if the user is doing something meaningless. */
>> -if (unlikely(npages != 1))
>> -return -EFAULT;
>> -
>> -table = kmap_atomic(page);
>> -ret = CMPXCHG([index], orig_pte, new_pte);
>> -kunmap_atomic(table);
>> -
>> -kvm_release_page_dirty(page);
>> +if (likely(npages == 1)) {
>> +table = kmap_atomic(page);
>> +ret = CMPXCHG([index], orig_pte, new_pte);
>> +kunmap_atomic(table);
>> +
>> +kvm_release_page_dirty(page);
>> +} else {
>> +struct vm_area_struct *vma;
>> +unsigned long vaddr = (unsigned long)ptep_user & PAGE_MASK;
>> +unsigned long pfn;
>> +unsigned long paddr;
>> +
>> +down_read(>mm->mmap_sem);
>> +vma = find_vma_intersection(current->mm, vaddr,
>> +vaddr + PAGE_SIZE);
> 
> Hm, with the argument order like this, we check that
> 
>  vaddr < vma->vm_end && vaddr + PAGE_SIZE > vma->vm_start
> 
> but shouldn't we actually check that the whole page is there, i.e.
> 
>  vaddr + PAGE_SIZE < vma->vm_end && vaddr > vma->vm_start
> 
> ?
> 
> Thanks.

Hm, right now we check for the following:

vaddr >= vma->vm_start && vaddr < vma->vm_end && vaddr + PAGE_SIZE > 
vma->vm_start

given that vaddr is PAGE_SIZE aligned, we're guaranteed that vaddr + PAGE_SIZE 
<= vma->vm_end.
This seems more complex than necessary. I believe that:

vma = find_vma(current->mm, vaddr);

should be enough.

>> +if (!vma || !(vma->vm_flags & VM_PFNMAP)) {
>> +up_read(>mm->mmap_sem);
>> +return -EFAULT;
>> +}
>> +pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
>> +paddr = pfn << PAGE_SHIFT;
>> +table = memremap(paddr, PAGE_SIZE, MEMREMAP_WB);
> 
> (I don't undestand why there isn't a wrapper for this ...
> Looks like we're doing something unexpected.)

Do you mean a wrapper for getting the pfn/paddr?

>> +if (!table) {
>> +up_read(>mm->mmap_sem);
>> +return -EFAULT;
>> +}
>> +ret = CMPXCHG([index], orig_pte, new_pte);
>> +memunmap(table);
>> +up_read(>mm->mmap_sem);
>> +}
>> 
>>  return (ret != orig_pte);
>> }
>> -- 
>> 2.7.4

I'll submit a v2 version soon.

Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B