Re: [PATCH v6] ptp: Add support for the AMZNC10C 'vmclock' device

2024-09-26 Thread David Woodhouse
On Thu, 2024-09-26 at 11:22 +0200, Paolo Abeni wrote:
> 
> Please have a better run at checkpatch before your next submission, 
> there are still a few ones - most relevant white-space damage.

Oops, apparently I missed one. Sorry about that.

I'll also add a MAINTAINERS entry, just to make checkpatch entirely
silent. I don't think it's particularly important, and things will
probably change as the virtio-rtc specification hopefully evolves in
future to provide this same structure, but it doesn't hurt for now.

> Anyway this is net-next material...
> 
> ## Form letter - net-next-closed
> 
> The merge window for v6.12 and therefore net-next is closed for new
> drivers, features, code refactoring and optimizations. We are currently
> accepting bug fixes only.
> 
> Please repost when net-next reopens after Sept 30th.
>
> RFC patches sent for review only are obviously welcome at any time.

Thanks. I've made those changes at
https://git.infradead.org/?p=users/dwmw2/linux.git;a=commitdiff;h=vmclock

Unless you care deeply or have other feedback, I won't bother spamming
the recipients again for the sake of a single space character; I'll
just wait until next week.



smime.p7s
Description: S/MIME cryptographic signature


[PATCH v6] ptp: Add support for the AMZNC10C 'vmclock' device

2024-09-20 Thread David Woodhouse
From: David Woodhouse 

The vmclock device addresses the problem of live migration with
precision clocks. The tolerances of a hardware counter (e.g. TSC) are
typically around ±50PPM. A guest will use NTP/PTP/PPS to discipline that
counter against an external source of 'real' time, and track the precise
frequency of the counter as it changes with environmental conditions.

When a guest is live migrated, anything it knows about the frequency of
the underlying counter becomes invalid. It may move from a host where
the counter running at -50PPM of its nominal frequency, to a host where
it runs at +50PPM. There will also be a step change in the value of the
counter, as the correctness of its absolute value at migration is
limited by the accuracy of the source and destination host's time
synchronization.

In its simplest form, the device merely advertises a 'disruption_marker'
which indicates that the guest should throw away any NTP synchronization
it thinks it has, and start again.

Because the shared memory region can be exposed all the way to userspace
through the /dev/vmclock0 node, applications can still use time from a
fast vDSO 'system call', and check the disruption marker to be sure that
their timestamp is indeed truthful.

The structure also allows for the precise time, as known by the host, to
be exposed directly to guests so that they don't have to wait for NTP to
resync from scratch. The PTP driver consumes this information if present.
Like the KVM PTP clock, this PTP driver can convert TSC-based cross
timestamps into KVM clock values. Unlike the KVM PTP clock, it does so
only when such is actually helpful.

The values and fields are based on the nascent virtio-rtc specification,
and the intent is that a version (hopefully precisely this version) of
this structure will be included as an optional part of that spec. In the
meantime, this driver supports the simple ACPI form of the device which
is being shipped in certain commercial hypervisors (and submitted for
inclusion in QEMU).

Signed-off-by: David Woodhouse 
---


QEMU implementation at
https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock

v6:
 • Checkpatch trivia.

v5:
 • Rewrite commit message based on the more informative QEMU one.
 • Fix missing le32_to_cpu() in VMCLOCK_FIELD_PRESENT() macro.
 • Remove obsolete comment about "if __int128 isn't available'.

v4:
 • Make it all explicitly little-endian.
 • Fix duplicate 'the the' in comment.

v3:
 • Fix stray backtick from space→tab conversion.
 • Switch to assigned AMZNC10C HID.

v2:
 • Match "AMZNVCLK" HID instead of CID (QEMU patch updated accordingly)
 • Be more flexible about struct size to allow expansion
 • Remove 'inline'
 • Comment read barriers, other cosmetics.

v1:
 • Change absolute error fields to nanoseconds
 • Update leap second definition to match virtio-rtc intentions in
   
https://lore.kernel.org/all/85c93b42-41a2-42c4-a168-55079bbff...@opensynergy.com

RFC v4:
 • Add esterror fields, MONOTONIC flag.
 • Reduce seq_count to 32 bits
 • Expand size to permit 64KiB pages
 • Align with virtio-rtc fields, values and leap handling
 • Drop gettime() method (since we have gettimex())
 • Add leap second smearing hint
 • Use a real _CRS on the ACPI device

RFC v3: (wrong patch sent)

RFC v2:
 • Add gettimex64() support
 • Convert TSC values to KVM clock when appropriate
 • Require int128 support
 • Add counter_period_shift
 • Add timeout when seq_count is invalid
 • Add flags field
 • Better comments in vmclock ABI structure
 • Explicitly forbid smearing (as clock rates would need to change)


 drivers/ptp/Kconfig  |  13 +
 drivers/ptp/Makefile |   1 +
 drivers/ptp/ptp_vmclock.c| 615 +++
 include/uapi/linux/vmclock-abi.h | 182 +
 4 files changed, 811 insertions(+)
 create mode 100644 drivers/ptp/ptp_vmclock.c
 create mode 100644 include/uapi/linux/vmclock-abi.h

diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index 604541dcb320..e98c9767e0ef 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -131,6 +131,19 @@ config PTP_1588_CLOCK_KVM
  To compile this driver as a module, choose M here: the module
  will be called ptp_kvm.
 
+config PTP_1588_CLOCK_VMCLOCK
+   tristate "Virtual machine PTP clock"
+   depends on X86_TSC || ARM_ARCH_TIMER
+   depends on PTP_1588_CLOCK && ACPI && ARCH_SUPPORTS_INT128
+   default y
+   help
+ This driver adds support for using a virtual precision clock
+ advertised by the hypervisor. This clock is only useful in virtual
+ machines where such a device is present.
+
+ To compile this driver as a module, choose M here: the module
+ will be called ptp_vmclock.
+
 config PTP_1588_CLOCK_IDT82P33
tristate "IDT 82P33xxx PTP clock"
depends on PTP

Re: [PATCH v5] ptp: Add support for the AMZNC10C 'vmclock' device

2024-09-18 Thread David Woodhouse
On Tue, 2024-09-03 at 15:56 +0200, Paolo Abeni wrote:
> I'm sorry for the late feedback.

No problem. I've addressed it all apart from the uint64_t part. I'll
concede the pre-C99 nonsense where it matches existing code, but we've
always also allowed code in the kernel to use the proper C language
too, so I would prefer to do so.

Will give it a quick retest and post v6.

Thanks.



smime.p7s
Description: S/MIME cryptographic signature


[PATCH v5] ptp: Add support for the AMZNC10C 'vmclock' device

2024-08-23 Thread David Woodhouse
From: David Woodhouse 

The vmclock device addresses the problem of live migration with
precision clocks. The tolerances of a hardware counter (e.g. TSC) are
typically around ±50PPM. A guest will use NTP/PTP/PPS to discipline that
counter against an external source of 'real' time, and track the precise
frequency of the counter as it changes with environmental conditions.

When a guest is live migrated, anything it knows about the frequency of
the underlying counter becomes invalid. It may move from a host where
the counter running at -50PPM of its nominal frequency, to a host where
it runs at +50PPM. There will also be a step change in the value of the
counter, as the correctness of its absolute value at migration is
limited by the accuracy of the source and destination host's time
synchronization.

In its simplest form, the device merely advertises a 'disruption_marker'
which indicates that the guest should throw away any NTP synchronization
it thinks it has, and start again.

Because the shared memory region can be exposed all the way to userspace 
through the /dev/vmclock0 node, applications can still use time from a
fast vDSO 'system call', and check the disruption marker to be sure that
their timestamp is indeed truthful.

The structure also allows for the precise time, as known by the host, to
be exposed directly to guests so that they don't have to wait for NTP to
resync from scratch. The PTP driver consumes this information if present.
Like the KVM PTP clock, this PTP driver can convert TSC-based cross
timestamps into KVM clock values. Unlike the KVM PTP clock, it does so
only when such is actually helpful.

The values and fields are based on the nascent virtio-rtc specification,
and the intent is that a version (hopefully precisely this version) of
this structure will be included as an optional part of that spec. In the
meantime, this driver supports the simple ACPI form of the device which
is being shipped in certain commercial hypervisors (and submitted for
inclusion in QEMU).

Signed-off-by: David Woodhouse 
---

QEMU implementation at
https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock


v5:
 • Rewrite commit message based on the more informative QEMU one.
 • Fix missing le32_to_cpu() in VMCLOCK_FIELD_PRESENT() macro.
 • Remove obsolete comment about "if __int128 isn't available'.

v4:
 • Make it all explicitly little-endian.
 • Fix duplicate 'the the' in comment.

v3:
 • Fix stray backtick from space→tab conversion.
 • Switch to assigned AMZNC10C HID.

v2:
 • Match "AMZNVCLK" HID instead of CID (QEMU patch updated accordingly)
 • Be more flexible about struct size to allow expansion
 • Remove 'inline'
 • Comment read barriers, other cosmetics.

v1:
 • Change absolute error fields to nanoseconds
 • Update leap second definition to match virtio-rtc intentions in
   
https://lore.kernel.org/all/85c93b42-41a2-42c4-a168-55079bbff...@opensynergy.com

RFC v4:
 • Add esterror fields, MONOTONIC flag.
 • Reduce seq_count to 32 bits
 • Expand size to permit 64KiB pages
 • Align with virtio-rtc fields, values and leap handling
 • Drop gettime() method (since we have gettimex())
 • Add leap second smearing hint
 • Use a real _CRS on the ACPI device

RFC v3: (wrong patch sent)

RFC v2:
 • Add gettimex64() support
 • Convert TSC values to KVM clock when appropriate
 • Require int128 support
 • Add counter_period_shift
 • Add timeout when seq_count is invalid
 • Add flags field
 • Better comments in vmclock ABI structure
 • Explicitly forbid smearing (as clock rates would need to change)

 drivers/ptp/Kconfig  |  13 +
 drivers/ptp/Makefile |   1 +
 drivers/ptp/ptp_vmclock.c| 607 +++
 include/uapi/linux/vmclock-abi.h | 182 +
 4 files changed, 803 insertions(+)
 create mode 100644 drivers/ptp/ptp_vmclock.c
 create mode 100644 include/uapi/linux/vmclock-abi.h

diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index 604541dcb320..e98c9767e0ef 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -131,6 +131,19 @@ config PTP_1588_CLOCK_KVM
  To compile this driver as a module, choose M here: the module
  will be called ptp_kvm.
 
+config PTP_1588_CLOCK_VMCLOCK
+   tristate "Virtual machine PTP clock"
+   depends on X86_TSC || ARM_ARCH_TIMER
+   depends on PTP_1588_CLOCK && ACPI && ARCH_SUPPORTS_INT128
+   default y
+   help
+ This driver adds support for using a virtual precision clock
+ advertised by the hypervisor. This clock is only useful in virtual
+ machines where such a device is present.
+
+ To compile this driver as a module, choose M here: the module
+ will be called ptp_vmclock.
+
 config PTP_1588_CLOCK_IDT82P33
tristate "IDT 82P33xxx PTP clock"
depends on PTP_1588_CLOCK && I2C

Re: [PATCH v4] ptp: Add vDSO-style vmclock support

2024-08-22 Thread David Woodhouse
On Thu, 2024-08-22 at 12:49 +0100, Simon Horman wrote:
> Hi David,
> 
> Sorry to be always the one with the nit-pick.
> Sparse complains about the line above, I believe because the
> type of st->clk->size is __le32.
> 
> .../ptp_vmclock.c:562:13: warning: restricted __le32 degrades to integer

Oops, thanks for catching that!

Fixed in
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/vmclock

I'll wait a little while before sending a new version.


smime.p7s
Description: S/MIME cryptographic signature


[PATCH v4] ptp: Add vDSO-style vmclock support

2024-08-21 Thread David Woodhouse
From: David Woodhouse 

The vmclock "device" provides a shared memory region with precision clock
information. By using shared memory, it is safe across Live Migration.

Like the KVM PTP clock, this can convert TSC-based cross timestamps into
KVM clock values. Unlike the KVM PTP clock, it does so only when such is
actually helpful.

The memory region of the device is also exposed to userspace so it can be
read or memory mapped by application which need reliable notification of
clock disruptions.

Signed-off-by: David Woodhouse 
---


QEMU implementation at
https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock

The important part here is the pvclock_abi structure. The ACPI device
is just the simplest of many possible methods by which it could be made
discoverable by a guest. The fields and values in the structure are
aligned as much as possible with the nascent virtio-rtc specification,
with the intent that a version of the same structure can be
incorporated into that standard.

v4:
 • Make it all explicitly little-endian.
 • Fix duplicate 'the the' in comment.

v3:
 • Fix stray backtick from space→tab conversion
 • Switch to assigned AMZNC10C HID.

v2:
 • Match "AMZNVCLK" HID instead of CID (QEMU patch updated accordingly)
 • Be more flexible about struct size to allow expansion
 • Remove 'inline'
 • Comment read barriers, other cosmetics.

v1:
 • Change absolute error fields to nanoseconds
 • Update leap second definition to match virtio-rtc intentions in
  
https://lore.kernel.org/all/85c93b42-41a2-42c4-a168-55079bbff...@opensynergy.com

RFC v4:
 • Add esterror fields, MONOTONIC flag.
 • Reduce seq_count to 32 bits
 • Expand size to permit 64KiB pages
 • Align with virtio-rtc fields, values and leap handling
 • Drop gettime() method (since we have gettimex())
 • Add leap second smearing hint
 • Use a real _CRS on the ACPI device

RFC v3: (wrong patch sent)

RFC v2:
 • Add gettimex64() support
 • Convert TSC values to KVM clock when appropriate
 • Require int128 support
 • Add counter_period_shift
 • Add timeout when seq_count is invalid
 • Add flags field
 • Better comments in vmclock ABI structure
 • Explicitly forbid smearing (as clock rates would need to change)


 drivers/ptp/Kconfig  |  13 +
 drivers/ptp/Makefile |   1 +
 drivers/ptp/ptp_vmclock.c| 610 +++
 include/uapi/linux/vmclock-abi.h | 182 +
 4 files changed, 806 insertions(+)
 create mode 100644 drivers/ptp/ptp_vmclock.c
 create mode 100644 include/uapi/linux/vmclock-abi.h

diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index 604541dcb320..e98c9767e0ef 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -131,6 +131,19 @@ config PTP_1588_CLOCK_KVM
  To compile this driver as a module, choose M here: the module
  will be called ptp_kvm.
 
+config PTP_1588_CLOCK_VMCLOCK
+   tristate "Virtual machine PTP clock"
+   depends on X86_TSC || ARM_ARCH_TIMER
+   depends on PTP_1588_CLOCK && ACPI && ARCH_SUPPORTS_INT128
+   default y
+   help
+ This driver adds support for using a virtual precision clock
+ advertised by the hypervisor. This clock is only useful in virtual
+ machines where such a device is present.
+
+ To compile this driver as a module, choose M here: the module
+ will be called ptp_vmclock.
+
 config PTP_1588_CLOCK_IDT82P33
tristate "IDT 82P33xxx PTP clock"
depends on PTP_1588_CLOCK && I2C
diff --git a/drivers/ptp/Makefile b/drivers/ptp/Makefile
index 68bf02078053..01b5cd91eb61 100644
--- a/drivers/ptp/Makefile
+++ b/drivers/ptp/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_PTP_1588_CLOCK_DTE)  += ptp_dte.o
 obj-$(CONFIG_PTP_1588_CLOCK_INES)  += ptp_ines.o
 obj-$(CONFIG_PTP_1588_CLOCK_PCH)   += ptp_pch.o
 obj-$(CONFIG_PTP_1588_CLOCK_KVM)   += ptp_kvm.o
+obj-$(CONFIG_PTP_1588_CLOCK_VMCLOCK)   += ptp_vmclock.o
 obj-$(CONFIG_PTP_1588_CLOCK_QORIQ) += ptp-qoriq.o
 ptp-qoriq-y+= ptp_qoriq.o
 ptp-qoriq-$(CONFIG_DEBUG_FS)   += ptp_qoriq_debugfs.o
diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
new file mode 100644
index ..b1a2546b5173
--- /dev/null
+++ b/drivers/ptp/ptp_vmclock.c
@@ -0,0 +1,610 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Virtual PTP 1588 clock for use with LM-safe VMclock device.
+ *
+ * Copyright © 2024 Amazon.com, Inc. or its affiliates.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#include 
+
+#ifdef CONFIG_X86
+#include 
+#include 
+#endif
+
+#ifdef CONFIG_KVM_GUEST
+#define SUPPORT_KVMCLOCK
+#endif
+
+static DEFINE_IDA(vmclock_ida);
+
+ACPI_MODULE_NAME("vmclock");
+
+struct vmclock_state {
+   struct resource res;
+  

Re: [PATCH v3] ptp: Add vDSO-style vmclock support

2024-07-29 Thread David Woodhouse
On Mon, 2024-07-29 at 11:33 -0400, Michael S. Tsirkin wrote:
> you said you will use __le here?

I hadn't intended to bother until we add the virtio discovery and
negotiation paths; it would basically be cosmetic for now.

Although I *will* do so with the QEMU side before posting the latest
version of that patch.


smime.p7s
Description: S/MIME cryptographic signature


[PATCH v3] ptp: Add vDSO-style vmclock support

2024-07-29 Thread David Woodhouse
From: David Woodhouse 

The vmclock "device" provides a shared memory region with precision clock
information. By using shared memory, it is safe across Live Migration.

Like the KVM PTP clock, this can convert TSC-based cross timestamps into
KVM clock values. Unlike the KVM PTP clock, it does so only when such is
actually helpful.

The memory region of the device is also exposed to userspace so it can be
read or memory mapped by application which need reliable notification of
clock disruptions.

Signed-off-by: David Woodhouse 
---

QEMU implementation at
https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock

The important part here is the pvclock_abi structure. The ACPI device
is just the simplest of many possible methods by which it could be made
discoverable by a guest. The fields and values in the structure are
aligned as much as possible with the nascent virtio-rtc specification,
with the intent that a version of the same structure can be
incorporated into that standard.

v3:
 • Fix stray backtick from space→tab conversion
 • Switch to assigned AMZNC10C HID.

v2:
 • Match "AMZNVCLK" HID instead of CID (QEMU patch updated accordingly)
 • Be more flexible about struct size to allow expansion
 • Remove 'inline'
 • Comment read barriers, other cosmetics.

v1:
 • Change absolute error fields to nanoseconds
 • Update leap second definition to match virtio-rtc intentions in
   
https://lore.kernel.org/all/85c93b42-41a2-42c4-a168-55079bbff...@opensynergy.com

RFC v4:
 • Add esterror fields, MONOTONIC flag.
 • Reduce seq_count to 32 bits
 • Expand size to permit 64KiB pages
 • Align with virtio-rtc fields, values and leap handling
 • Drop gettime() method (since we have gettimex())
 • Add leap second smearing hint
 • Use a real _CRS on the ACPI device

RFC v3: (wrong patch sent)

RFC v2:
 • Add gettimex64() support
 • Convert TSC values to KVM clock when appropriate
 • Require int128 support
 • Add counter_period_shift
 • Add timeout when seq_count is invalid
 • Add flags field
 • Better comments in vmclock ABI structure
 • Explicitly forbid smearing (as clock rates would need to change)



 drivers/ptp/Kconfig  |  13 +
 drivers/ptp/Makefile |   1 +
 drivers/ptp/ptp_vmclock.c| 610 +++
 include/uapi/linux/vmclock-abi.h | 187 ++
 4 files changed, 811 insertions(+)
 create mode 100644 drivers/ptp/ptp_vmclock.c
 create mode 100644 include/uapi/linux/vmclock-abi.h

diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index 604541dcb320..e98c9767e0ef 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -131,6 +131,19 @@ config PTP_1588_CLOCK_KVM
  To compile this driver as a module, choose M here: the module
  will be called ptp_kvm.
 
+config PTP_1588_CLOCK_VMCLOCK
+   tristate "Virtual machine PTP clock"
+   depends on X86_TSC || ARM_ARCH_TIMER
+   depends on PTP_1588_CLOCK && ACPI && ARCH_SUPPORTS_INT128
+   default y
+   help
+ This driver adds support for using a virtual precision clock
+ advertised by the hypervisor. This clock is only useful in virtual
+ machines where such a device is present.
+
+ To compile this driver as a module, choose M here: the module
+ will be called ptp_vmclock.
+
 config PTP_1588_CLOCK_IDT82P33
tristate "IDT 82P33xxx PTP clock"
depends on PTP_1588_CLOCK && I2C
diff --git a/drivers/ptp/Makefile b/drivers/ptp/Makefile
index 68bf02078053..01b5cd91eb61 100644
--- a/drivers/ptp/Makefile
+++ b/drivers/ptp/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_PTP_1588_CLOCK_DTE)  += ptp_dte.o
 obj-$(CONFIG_PTP_1588_CLOCK_INES)  += ptp_ines.o
 obj-$(CONFIG_PTP_1588_CLOCK_PCH)   += ptp_pch.o
 obj-$(CONFIG_PTP_1588_CLOCK_KVM)   += ptp_kvm.o
+obj-$(CONFIG_PTP_1588_CLOCK_VMCLOCK)   += ptp_vmclock.o
 obj-$(CONFIG_PTP_1588_CLOCK_QORIQ) += ptp-qoriq.o
 ptp-qoriq-y+= ptp_qoriq.o
 ptp-qoriq-$(CONFIG_DEBUG_FS)   += ptp_qoriq_debugfs.o
diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
new file mode 100644
index ..3cd06b1bae39
--- /dev/null
+++ b/drivers/ptp/ptp_vmclock.c
@@ -0,0 +1,610 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Virtual PTP 1588 clock for use with LM-safe VMclock device.
+ *
+ * Copyright © 2024 Amazon.com, Inc. or its affiliates.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#include 
+
+#ifdef CONFIG_X86
+#include 
+#include 
+#endif
+
+#ifdef CONFIG_KVM_GUEST
+#define SUPPORT_KVMCLOCK
+#endif
+
+static DEFINE_IDA(vmclock_ida);
+
+ACPI_MODULE_NAME("vmclock");
+
+struct vmclock_state {
+   struct resource res;
+   struct vmclock_abi *clk;
+   struct miscdevice miscdev;
+   struct ptp_clock_info ptp

Re: [PATCH] ptp: Add vDSO-style vmclock support

2024-07-28 Thread David Woodhouse
On 28 July 2024 16:23:49 BST, "Michael S. Tsirkin"  wrote:
>On Sun, Jul 28, 2024 at 02:07:01PM +0100, David Woodhouse wrote:
>> On 28 July 2024 11:37:04 BST, "Michael S. Tsirkin"  wrote:
>> >Glad you asked :)
>> 
>> Heh, I'm not sure I'm so glad. Did I mention I hate ACPI? Perhaps it's still 
>> not too late for me just to define a DT binding and use PRP0001 for it :)
>> 
>> >Long story short, QEMUVGID is indeed out of spec, but it works
>> >both because of guest compatibility with ACPI 1.0, and because no one
>> >much uses it.

But why *did* it change from QEMU0003 which was in some of the patches that got 
posted?

>> I think it's reasonable enough to follow that example and use AMZNVCLK (or 
>> QEMUVCLK, but there seems little point in both) then?
>
>I'd stick to spec. If you like puns, QEMUC10C maybe?

Meh, might as well be sensible. I'll chase up which of my colleagues curates 
the AMZN space (which will no doubt be me by the end of that thread), and issue 
the next one from that.




Re: [PATCH] ptp: Add vDSO-style vmclock support

2024-07-28 Thread David Woodhouse
On 28 July 2024 11:37:04 BST, "Michael S. Tsirkin"  wrote:
>Glad you asked :)

Heh, I'm not sure I'm so glad. Did I mention I hate ACPI? Perhaps it's still 
not too late for me just to define a DT binding and use PRP0001 for it :)

>Long story short, QEMUVGID is indeed out of spec, but it works
>both because of guest compatibility with ACPI 1.0, and because no one
>much uses it.


I think it's reasonable enough to follow that example and use AMZNVCLK (or 
QEMUVCLK, but there seems little point in both) then?




Re: [PATCH] ptp: Add vDSO-style vmclock support

2024-07-26 Thread David Woodhouse
On 26 July 2024 17:49:58 BST, Jonathan Cameron  
wrote:
>On Thu, 25 Jul 2024 14:50:50 +0100
>David Woodhouse  wrote:
>
>> On Thu, 2024-07-25 at 08:33 -0400, Michael S. Tsirkin wrote:
>> > On Thu, Jul 25, 2024 at 01:31:19PM +0100, David Woodhouse wrote:  
>> > > On Thu, 2024-07-25 at 08:29 -0400, Michael S. Tsirkin wrote:  
>> > > > On Thu, Jul 25, 2024 at 01:27:49PM +0100, David Woodhouse wrote:  
>> > > > > On Thu, 2024-07-25 at 08:17 -0400, Michael S. Tsirkin wrote:  
>> > > > > > On Thu, Jul 25, 2024 at 10:56:05AM +0100, David Woodhouse wrote:  
>> > > > > > > > Do you want to just help complete virtio-rtc then? Would be 
>> > > > > > > > easier than
>> > > > > > > > trying to keep two specs in sync.  
>> > > > > > > 
>> > > > > > > The ACPI version is much more lightweight and doesn't take up a
>> > > > > > > valuable PCI slot#. (I know, you can do virtio without PCI but 
>> > > > > > > that's
>> > > > > > > complex in other ways).
>> > > > > > >   
>> > > > > > 
>> > > > > > Hmm, should we support virtio over ACPI? Just asking.  
>> > > > > 
>> > > > > Given that we support virtio DT bindings, and the ACPI "PRP0001" 
>> > > > > device
>> > > > > exists with a DSM method which literally returns DT properties,
>> > > > > including such properties as "compatible=virtio,mmio" ... do we
>> > > > > already?
>> > > > > 
>> > > > >   
>> > > > 
>> > > > In a sense, but you are saying that is too complex?
>> > > > Can you elaborate?  
>> > > 
>> > > No, I think it's fine. I encourage the use of the PRP0001 device to
>> > > expose DT devices through ACPI. I was just reminding you of its
>> > > existence.  
>> > 
>> > Confused. You said "I know, you can do virtio without PCI but that's
>> > complex in other ways" as the explanation why you are doing a custom
>> > protocol.  
>> 
>> Ah, apologies, I wasn't thinking that far back in the conversation.
>> 
>> If we wanted to support virtio over ACPI, I think PRP0001 can be made
>> to work and isn't too complex (even though it probably doesn't yet work
>> out of the box).
>> 
>> But for the VMCLOCK thing, yes, the simple ACPI device is a lot simpler
>> than virtio-rtc and much more attractive.
>> 
>> Even if the virtio-rtc specification were official today, and I was
>> able to expose it via PCI, I probably wouldn't do it that way. There's
>> just far more in virtio-rtc than we need; the simple shared memory
>> region is perfectly sufficient for most needs, and especially ours.
>> 
>> I have reworked
>> https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/vmclock
>> to take your other feedback into account.
>> 
>> It's now more flexible about the size handling, and explicitly checking
>> that specific fields are present before using them. 
>> 
>> I think I'm going to add a method on the ACPI device to enable the
>> precise clock information. I haven't done that in the driver yet; it
>> still just consumes the precise clock information if it happens to be
>> present already. The enable method can be added in a compatible fashion
>> (the failure mode is that guests which don't invoke this method when
>> the hypervisor needs them to will see only the disruption signal and
>> not precise time).
>> 
>> For the HID I'm going to use AMZNVCLK. I had used QEMUVCLK in the QEMU
>> patches, but I'll change that to use AMZNVCLK too when I repost the
>> QEMU patch.
>
>That doesn't fit with ACPI _HID definitions.
>Second set 4 characters need to be hex digits as this is an
>ACPI style ID (which I assume this is given AMZN is a valid
>vendor ID.  6.1.5 in ACPI v6.5
>
>Maybe I'm missing something...
>
>J
>
>



Hm, is the same not true for QEMUVGID and AMZNVGID, which I was using as an 
example?

QEMU seemed to get to 0002, and AFAICT the VMGENID patches were initially 
posted using QEMU0003, but what's actually in QEMU now is QEMUVGID. So I 
presumed that was now the preferred option.



Re: [PATCH v2] ptp: Add vDSO-style vmclock support

2024-07-26 Thread David Woodhouse
On 26 July 2024 15:57:34 BST, Jakub Kicinski  wrote:
>On Fri, 26 Jul 2024 13:28:17 +0100 David Woodhouse wrote:
>> +`   status = acpi_walk_resources(adev->handle, METHOD_NAME__CRS,
>
>   ^ watch out for ticks!

Oops, that last minute space->tab fix after I'd already left home for the 
weekend was clearly not as cosmetic as I'd intended. Will fix; thanks!



Re: [PATCH v2] ptp: Add vDSO-style vmclock support

2024-07-26 Thread David Woodhouse
On Fri, 2024-07-26 at 09:14 -0400, Michael S. Tsirkin wrote:
> For purposes of virtio, should we label all the fields here
> __le?

Yes. Peter and I discussed that, and it's mostly just a cosmetic change
at this point. The simple ACPI thing only exists on LE platforms for
*now* anyway.

We also had a discussion about the use of signed integers. Our fallback
was "declare it to be an uint16_t in the protocol but that it is to be
interpreted such that if it's greater than 0x8000 you subtract 0x8000
from the result". Or whatever actually added up correctly for twos-
complement signed integers in reality.


smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH] ptp: Add vDSO-style vmclock support

2024-07-26 Thread David Woodhouse
On Fri, 2024-07-26 at 09:04 -0400, Michael S. Tsirkin wrote:
> On Fri, Jul 26, 2024 at 02:00:25PM +0100, David Woodhouse wrote:
> > On Fri, 2024-07-26 at 08:52 -0400, Michael S. Tsirkin wrote:
> > > On Fri, Jul 26, 2024 at 09:35:51AM +0100, David Woodhouse wrote:
> > > > But for this use case, we only need a memory region that the hypervisor
> > > > can update. We don't need any of that complexity of gratuitously
> > > > interrupting all the vCPUs just to ensure that none of them can be
> > > > running userspace while one of them does an update for itself,
> > > > potentially translating from one ABI to another. The hypervisor can
> > > > just update the user-visible memory in place.
> > > 
> > > Looks like then your userspace is hypervisor specific, and that's a
> > > problem because it's a one way street - there is no way for hypervisor
> > > to know what does userspace need, so no way for hypervisor to know which
> > > information to provide. No real way to fix bugs.
> > 
> > It's not hypervisor specific, but you're right that as it stands there
> > is no negotiation of what userspace wants. So the hypervisor provides
> > what it feels it can provide without significant overhead (which may or
> > may not include the precise timekeeping, as discussed, but should
> > always include the disruption signal which is the most important
> > thing).
> > 
> > The guest *does* know what the hypervisor provides. And when we get to
> > do this in virtio, we get all the goodness of negotiation as well. The
> > existence of the simple ACPI model doesn't hurt that at all.
> 
> Maybe it doesn't, at that. E.g. virtio does a copy, acpi doesn't?
> I'll ponder compatibility over the weekend.

For clarity, I think I've ditched the idea of a poor-man's negotiation
through invoking an ACPI method to enable the timekeeping info.

I think we're better off waiting for virtio, to enable that kind of
thing.

The guest gets what the hypervisor is prepared to offer "for free".

That isn't a one-way door; we *can* add an optional ACPI method later
if we really want to. But I'm generally OK with "use virtio if you want
that".



smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH] ptp: Add vDSO-style vmclock support

2024-07-26 Thread David Woodhouse
On Fri, 2024-07-26 at 08:52 -0400, Michael S. Tsirkin wrote:
> On Fri, Jul 26, 2024 at 09:35:51AM +0100, David Woodhouse wrote:
> > But for this use case, we only need a memory region that the hypervisor
> > can update. We don't need any of that complexity of gratuitously
> > interrupting all the vCPUs just to ensure that none of them can be
> > running userspace while one of them does an update for itself,
> > potentially translating from one ABI to another. The hypervisor can
> > just update the user-visible memory in place.
> 
> Looks like then your userspace is hypervisor specific, and that's a
> problem because it's a one way street - there is no way for hypervisor
> to know what does userspace need, so no way for hypervisor to know which
> information to provide. No real way to fix bugs.

It's not hypervisor specific, but you're right that as it stands there
is no negotiation of what userspace wants. So the hypervisor provides
what it feels it can provide without significant overhead (which may or
may not include the precise timekeeping, as discussed, but should
always include the disruption signal which is the most important
thing).

The guest *does* know what the hypervisor provides. And when we get to
do this in virtio, we get all the goodness of negotiation as well. The
existence of the simple ACPI model doesn't hurt that at all.


smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH] ptp: Add vDSO-style vmclock support

2024-07-26 Thread David Woodhouse
On Fri, 2024-07-26 at 08:47 -0400, Michael S. Tsirkin wrote:
> On Fri, Jul 26, 2024 at 09:06:29AM +0100, David Woodhouse wrote:
> > That's great. You don't even need it to be per-vCPU if you let the
> > hypervisor write directly to the single physical location that's mapped
> > to userspace. It can do that before it even starts *running* the vCPUs
> > after migration. It's a whole lot simpler. 
> 
> It *seems* simpler, until you realize that there is no way
> to change anything in the interface, there is no negotiation
> between hypervisor and userspace. If I learned anything at all
> in tens of years working on software, it's that it is
> never done. So let's have userspace talk to the kernel
> and have kernel talk to the devices, please. There's
> no compelling reason to have this bypass here.

Thanks for the useful feedback. As you see, I've incorporated most of
it into the v2 post a few minutes ago.

On this particular topic we disagree. I absolutely don't want to take
dependencies on kernel code, on virtio, or on the cross-platform
assumption (even if it's true) that a device can raise an interrupt and
guarantee that no userspace code will run before that interrupt is
handled.

Using virtio does allow for some negotiation — for handling differences
in page sizes, and enabling the timekeeping only on demand. That's
great, but the structure is still an ABI in its own right, and we know
how to do those.

We'll ship the ACPI version, and I look forward to incorporating it
into virtio-rtc too.


smime.p7s
Description: S/MIME cryptographic signature


[PATCH v2] ptp: Add vDSO-style vmclock support

2024-07-26 Thread David Woodhouse
From: David Woodhouse 

The vmclock "device" provides a shared memory region with precision clock
information. By using shared memory, it is safe across Live Migration.

Like the KVM PTP clock, this can convert TSC-based cross timestamps into
KVM clock values. Unlike the KVM PTP clock, it does so only when such is
actually helpful.

The memory region of the device is also exposed to userspace so it can be
read or memory mapped by application which need reliable notification of
clock disruptions.

Signed-off-by: David Woodhouse 
---

QEMU implementation at
https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock

The important part here is the pvclock_abi structure. The ACPI device
is just the simplest of many possible methods by which it could be made
discoverable by a guest. The fields and values in the structure are
aligned as much as possible with the nascent virtio-rtc specification,
with the intent that a version of the same structure can be
incorporated into that standard.

v2:
 • Match "AMZNVCLK" HID instead of CID (QEMU patch updated accordingly)
 • Be more flexible about struct size to allow expansion
 • Remove 'inline'
 • Comment read barriers, other cosmetics.

v1:
 • Change absolute error fields to nanoseconds
 • Update leap second definition to match virtio-rtc intentions in

https://lore.kernel.org/all/85c93b42-41a2-42c4-a168-55079bbff...@opensynergy.com

RFC v4:
 • Add esterror fields, MONOTONIC flag.
 • Reduce seq_count to 32 bits
 • Expand size to permit 64KiB pages
 • Align with virtio-rtc fields, values and leap handling
 • Drop gettime() method (since we have gettimex())
 • Add leap second smearing hint
 • Use a real _CRS on the ACPI device

RFC v3: (wrong patch sent)

RFC v2:
 • Add gettimex64() support
 • Convert TSC values to KVM clock when appropriate
 • Require int128 support
 • Add counter_period_shift
 • Add timeout when seq_count is invalid
 • Add flags field
 • Better comments in vmclock ABI structure
 • Explicitly forbid smearing (as clock rates would need to change)



 drivers/ptp/Kconfig  |  13 +
 drivers/ptp/Makefile |   1 +
 drivers/ptp/ptp_vmclock.c| 610 +++
 include/uapi/linux/vmclock-abi.h | 187 ++
 4 files changed, 811 insertions(+)
 create mode 100644 drivers/ptp/ptp_vmclock.c
 create mode 100644 include/uapi/linux/vmclock-abi.h

diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index 604541dcb320..e98c9767e0ef 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -131,6 +131,19 @@ config PTP_1588_CLOCK_KVM
  To compile this driver as a module, choose M here: the module
  will be called ptp_kvm.
 
+config PTP_1588_CLOCK_VMCLOCK
+   tristate "Virtual machine PTP clock"
+   depends on X86_TSC || ARM_ARCH_TIMER
+   depends on PTP_1588_CLOCK && ACPI && ARCH_SUPPORTS_INT128
+   default y
+   help
+ This driver adds support for using a virtual precision clock
+ advertised by the hypervisor. This clock is only useful in virtual
+ machines where such a device is present.
+
+ To compile this driver as a module, choose M here: the module
+ will be called ptp_vmclock.
+
 config PTP_1588_CLOCK_IDT82P33
tristate "IDT 82P33xxx PTP clock"
depends on PTP_1588_CLOCK && I2C
diff --git a/drivers/ptp/Makefile b/drivers/ptp/Makefile
index 68bf02078053..01b5cd91eb61 100644
--- a/drivers/ptp/Makefile
+++ b/drivers/ptp/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_PTP_1588_CLOCK_DTE)  += ptp_dte.o
 obj-$(CONFIG_PTP_1588_CLOCK_INES)  += ptp_ines.o
 obj-$(CONFIG_PTP_1588_CLOCK_PCH)   += ptp_pch.o
 obj-$(CONFIG_PTP_1588_CLOCK_KVM)   += ptp_kvm.o
+obj-$(CONFIG_PTP_1588_CLOCK_VMCLOCK)   += ptp_vmclock.o
 obj-$(CONFIG_PTP_1588_CLOCK_QORIQ) += ptp-qoriq.o
 ptp-qoriq-y+= ptp_qoriq.o
 ptp-qoriq-$(CONFIG_DEBUG_FS)   += ptp_qoriq_debugfs.o
diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
new file mode 100644
index ..9c0058ef5552
--- /dev/null
+++ b/drivers/ptp/ptp_vmclock.c
@@ -0,0 +1,610 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Virtual PTP 1588 clock for use with LM-safe VMclock device.
+ *
+ * Copyright © 2024 Amazon.com, Inc. or its affiliates.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#include 
+
+#ifdef CONFIG_X86
+#include 
+#include 
+#endif
+
+#ifdef CONFIG_KVM_GUEST
+#define SUPPORT_KVMCLOCK
+#endif
+
+static DEFINE_IDA(vmclock_ida);
+
+ACPI_MODULE_NAME("vmclock");
+
+struct vmclock_state {
+   struct resource res;
+   struct vmclock_abi *clk;
+   struct miscdevice miscdev;
+   struct ptp_clock_info ptp_clock_info;
+   struct ptp_clock *ptp_clock;
+   enum clocksource_ids cs_id, sys_cs_id

Re: [PATCH] ptp: Add vDSO-style vmclock support

2024-07-26 Thread David Woodhouse
On Fri, 2024-07-26 at 02:06 -0400, Michael S. Tsirkin wrote:
> On Thu, Jul 25, 2024 at 11:20:56PM +0100, David Woodhouse wrote:
> > We're rolling out the AMZNVCLK device for internal use cases, and plan
> > to add it in public instances some time later.
> 
> Let's be real. If amazon does something in its own hypervisor, and the
> only way to use that is to expose the interface to userspace, there is
> very little the linux community can do.  Moreover, userspace will be
> written to this ABI, and be locked in to the specific hypervisor. It
> might be a win for amazon short term but long term you will want to
> extend things and it will be a mess.
> 
> So I feel you have chosen ACPI badly.  It just does not have the APIs
> that you need. Virtio does, and would not create a userpspace lock-in
> to a specific hypervisor. It's not really virtio specific either,
> you can write a bare pci device with a BAR and a bunch of msix
> vectors and it will get you the same effect.

I *am* as bad as the next person for taking the "I have a hammer,
therefore everything is a nail" approach. For you that hammer is
virtio, and I respect that. But mine isn't ACPI — quite the opposite,
it's DT.

I *hate* ACPI. I hate everything about it. I hate that Arm started
using it for Arm64 instead of going with Device Tree.

That's why we have the DSM method for obtaining properties, and the
PRP0001 ACPI HID which means "look for the compatible property and
treat it like a DT node". So people can make DT bindings and hey, if
you're on a system which is afflicted with ACPI, you can still use
them. Which I'm still proselytising today, as you saw.

But for this use case, we only need a memory region that the hypervisor
can update. We don't need any of that complexity of gratuitously
interrupting all the vCPUs just to ensure that none of them can be
running userspace while one of them does an update for itself,
potentially translating from one ABI to another. The hypervisor can
just update the user-visible memory in place.

In this case, exposing a simple MMIO memory region in _CRS of an ACPI
device was the simplest and most compatible solution. 

Yes, we can add a virtio transport for that where the hypervisor is
invited to DMA into (unencrypted) guest memory, and it solves the
PAGE_SIZE problem of the trivial ACPI method. But there's still a place
in this world for the ACPI method, and it doesn't *hurt* virtio.

The important part is the vmclock_abi structure; the transport is just
fluff. And I do not agree that this is a lock-in to a specific
hypervisor. I've literally rewritten the fields in the structure to
align to what virtio-rtc does and accommodate Peter's feedback (to the
dismay of my internal team who just wanted to stick with the initial
straw man struct and didn't want to keep up, and haven't even engaged
with the public threads which have been ongoing since March¹, even when
I've beaten them with a big stick). I've added a QEMU implementation
too. We absolutely *don't* want this to be hypervisor-specific.


¹ 
https://lore.kernel.org/all/0e21e3e2be26acd70b5575b9932b3a911c9fe721.ca...@infradead.org


smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH] ptp: Add vDSO-style vmclock support

2024-07-26 Thread David Woodhouse
On Fri, 2024-07-26 at 01:55 -0400, Michael S. Tsirkin wrote:
> On Fri, Jul 26, 2024 at 01:09:24AM -0400, Michael S. Tsirkin wrote:
> > On Thu, Jul 25, 2024 at 10:29:18PM +0100, David Woodhouse wrote:
> > > > > > Then can't we fix it by interrupting all CPUs right after LM?
> > > > > > 
> > > > > > To me that seems like a cleaner approach - we then compartmentalize
> > > > > > the ABI issue - kernel has its own ABI against userspace,
> > > > > > devices have their own ABI against kernel.
> > > > > > It'd mean we need a way to detect that interrupt was sent,
> > > > > > maybe yet another counter inside that structure.
> > > > > > 
> > > > > > WDYT?
> > > > > > 
> > > > > > By the way the same idea would work for snapshots -
> > > > > > some people wanted to expose that info to userspace, too.
> > > 
> > > Those people included me. I wanted to interrupt all the vCPUs, even the
> > > ones which were in userspace at the moment of migration, and have the
> > > kernel deal with passing it on to userspace via a different ABI.
> > > 
> > > It ends up being complex and intricate, and requiring a lot of new
> > > kernel and userspace support. I gave up on it in the end for snapshots,
> > > and didn't go there again for this.
> > 
> > Maybe become you insist on using ACPI?
> > I see a fairly simple way to do it. For example, with virtio:
> > 
> > one vq per CPU, with a single outstanding buffer,
> > callback copies from the buffer into the userspace
> > visible memory.
> > 
> > Want me to show you the code?
> 
> Couldn't resist, so I wrote a bit of this code.
> Fundamentally, we keep a copy of the hypervisor abi
> in the device:
> 
> struct virtclk_info *vci {
> struct vmclock_abi abi;
> };
> 
> each vq will has its own copy:
> 
> struct virtqueue_info {
> struct scatterlist sg[];
> struct vmclock_abi abi;
> }
> 
> we add it during probe:
>     sg_init_one(vqi->sg, &vqi->abi, sizeof(vqi->abi));
> virtqueue_add_inbuf(vq,
>     vqi->sg, 1,
>     &vq->vabi,
>     GFP_ATOMIC);
> 
> 
> 
> We set the affinity for each vq:
> 
>    for (i = 0; i < num_online_cpus(); i++)
>    virtqueue_set_affinity(vi->vq[i], i);
> 
> (virtio net does it, and it handles cpu hotplug as well)
> 
> each vq callback would do:
> 
> static void vmclock_cb(struct virtqueue *vq)
> {
>     struct virtclk_info *vci = vq->vdev->priv;
>     struct virtqueue_info *vqi = vq->priv;
> void *buf;
>     unsigned int len;
> 
> buf = virtqueue_get_buf(vq, &len);
> if (!buf)
> return;
> 
> BUG_ON(buf != &vq->abi);
> 
> spin_lock(vci->lock);
> if (memcmp(&vci->abi, &vqi->abi, sizeof(vqi->abi))) {
> memcpy(&vci->abi, &vqi->abi, sizeof(vqi->abi));
> }
> 
> /* Update the userspace visible structure now */
> .
> 
> /* Re-add the buffer */
> virtqueue_add_inbuf(vq,
>     vqi->sg, 1,
>     &vqi->abi,
>     GFP_ATOMIC);
> 
> spin_unlock(vi->lock);
> }
> 
> That's it!
> Where's the problem here?

That's great. You don't even need it to be per-vCPU if you let the
hypervisor write directly to the single physical location that's mapped
to userspace. It can do that before it even starts *running* the vCPUs
after migration. It's a whole lot simpler. 

Even if we were to insist one having two *different* ABIs, one for
hypervisor←→guest kernel, and another for kernel←→userspace, then you
still only need the hypervisor to write one copy. It's just that you do
need to interrupt to all the other vCPUs to ensure they aren't running
userspace before the data are updated. So yes, we can do the above and
(ab)use gratuitous data transfers to trigger those IRQs, and then the
handler has some kind of locking to ensure that we can't reenter
userspace until the user-visible structure is updated? 

Obviously the "interrupt all CPUs" doesn't work well for hardware
implementations but those won't have to deal with live migration, so
that's OK. A hypothetical hardware implementation would map something
like the ART to reference time. Obviously there's 

Re: [PATCH] ptp: Add vDSO-style vmclock support

2024-07-25 Thread David Woodhouse
On Thu, 2024-07-25 at 17:47 -0400, Michael S. Tsirkin wrote:
> On Thu, Jul 25, 2024 at 10:29:18PM +0100, David Woodhouse wrote:
> > Those people included me. I wanted to interrupt all the vCPUs, even the
> > ones which were in userspace at the moment of migration, and have the
> > kernel deal with passing it on to userspace via a different ABI.
> > 
> > It ends up being complex and intricate, and requiring a lot of new
> > kernel and userspace support. I gave up on it in the end for snapshots,
> > and didn't go there again for this.
> 
> ok I believe you, I am just curious how come you need userspace
> support - what I imagine would live completely in kernel ...

Userspace doesn't even make a system call for gettimeofday() any more;
the relevant information is exposed to userspace through the vDSO.

If userspace needs to know that the time has been disrupted by LM, then
fundamentally that either needs to be exposed directly to it as well,
or userspace needs to go back to making actual system calls to get the
time (which is slow, and not acceptable for the same use cases which
care about it being accurate).

So how do we make it available in a form that's mappable directly to
userspace? 

Well, we could have a hypervisor enlightenment, where the guest kernel
uses an MSR or hypercall to tell the hypervisor "please write the
information to  GPA", and provides an address within the vDSO
information page. Which isn't nice for Confidential Compute, and is
hard to allow for expansion in the size of the structure. And is much
more complex to support consistently across different hypervisors and
different architectures.

We *could* attempt to contrive a system where we indeed interrupt *all*
vCPUs and the kernel then updates something in the vDSO page before
running userspace again. That could work in theory and *might* be a bit
simpler than what we were trying to do for VMGENID/snapshots, but it's
still complex and would take an eternity to deploy to actual users, and
would probably never work for non-Linux. And imposes an even higher
cost on the guest kernel when LM occurs.

Or there's this method, where the hypervisor puts it in a shared memory
region which is just a PCI BAR or an ACPI _CRS or attached to virtio
(we really don't care how it's discovered). There's a nit that it now
has to be page sized, and a guest which has larger pages than the
hypervisor expects is going to have to use a small PTE to map it (or
not support that mode). But I think that's reasonable.

Having gone around in circles a few times, I'm fairly sure that
exposing a memory region which the hypervisor updates directly is the
simplest and cleanest way of doing it and getting it in the hands of
users.

We're rolling out the AMZNVCLK device for internal use cases, and plan
to add it in public instances some time later. This is the guest driver
which consumes that, and I've separately posted the QEMU patch to
provide the same device. Because I absolutely do want this to be
standardised across hypervisors, for the reasons you point out. You're
preaching to the choir there; I even got Microsoft to implement the
same 15-bit MSI extensions that we added to KVM :)

Supporting the disruption signal is the critical part, which allows
applications to abort operations until their clock is good again.
Providing the actual clock information on the new host, so that
applications can keep running immediately, is what I'll be working on
next.

I'd love virtio-rtc to adopt this structure too, and I've done my best
to ensure that that's feasible, but I can't take a dependency on that
and wait for it (and as discussed, wouldn't use the virtio form in my
environment anyway).

> mutt sucks less ;)

So does 'nc' but Evolution can talk to the corporate Exchange calendar
and email. And I'm used to it and can mostly cope with its quirks :)


smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH] ptp: Add vDSO-style vmclock support

2024-07-25 Thread David Woodhouse
On Thu, 2024-07-25 at 17:04 -0400, Michael S. Tsirkin wrote:
> On Thu, Jul 25, 2024 at 10:00:24PM +0100, David Woodhouse wrote:
> > On Thu, 2024-07-25 at 16:50 -0400, Michael S. Tsirkin wrote:
> > > On Thu, Jul 25, 2024 at 08:35:40PM +0100, David Woodhouse wrote:
> > > > On Thu, 2024-07-25 at 12:38 -0400, Michael S. Tsirkin wrote:
> > > > > On Thu, Jul 25, 2024 at 04:18:43PM +0100, David Woodhouse wrote:
> > > > > > The use case isn't necessarily for all users of gettimeofday(), of
> > > > > > course; this is for those applications which *need* precision time.
> > > > > > Like distributed databases which rely on timestamps for coherency, 
> > > > > > and
> > > > > > users who get fined millions of dollars when LM messes up their 
> > > > > > clocks
> > > > > > and they put wrong timestamps on financial transactions.
> > > > > 
> > > > > I would however worry that with all this pass through,
> > > > > applications have to be coded to each hypervisor or even
> > > > > version of the hypervisor.
> > > > 
> > > > Yes, that would be a problem. Which is why I feel it's so important to
> > > > harmonise the contents of the shared memory, and I'm implementing it
> > > > both QEMU and $DAYJOB, as well as aligning with virtio-rtc.
> > > 
> > > 
> > > Writing an actual spec for this would be another thing that might help.

Potentially, although working over it with our internal clock team and
with Peter on virtio-rtc has put us in good shape. I'm confident now
that we have something that's viable and extensible enough.

> > > 
> > > > > virtio has been developed with the painful experience that we keep
> > > > > making mistakes, or coming up with new needed features,
> > > > > and that maintaining forward and backward compatibility
> > > > > becomes a whole lot harder than it seems in the beginning.
> > > > 
> > > > Yes. But as you note, this shared memory structure is a userspace ABI
> > > > all of its own, so we get to make a completely *different* kind of
> > > > mistake :)
> > > > 
> > > 
> > > 
> > > So, something I still don't completely understand.
> > > Can't the VDSO thing be written to by kernel?
> > > Let's say on LM, an interrupt triggers and kernel copies
> > > data from a specific device to the VDSO.
> > > 
> > > Is that problematic somehow? I imagine there is a race where
> > > userspace reads vdso after lm but before kernel updated
> > > vdso - is that the concern?

Yes.

> > > Then can't we fix it by interrupting all CPUs right after LM?
> > > 
> > > To me that seems like a cleaner approach - we then compartmentalize
> > > the ABI issue - kernel has its own ABI against userspace,
> > > devices have their own ABI against kernel.
> > > It'd mean we need a way to detect that interrupt was sent,
> > > maybe yet another counter inside that structure.
> > > 
> > > WDYT?
> > > 
> > > By the way the same idea would work for snapshots -
> > > some people wanted to expose that info to userspace, too.

Those people included me. I wanted to interrupt all the vCPUs, even the
ones which were in userspace at the moment of migration, and have the
kernel deal with passing it on to userspace via a different ABI.

It ends up being complex and intricate, and requiring a lot of new
kernel and userspace support. I gave up on it in the end for snapshots,
and didn't go there again for this.

By contrast, a driver which merely exposes a page of MMIO space
identified by an ACPI device (without even the in-kernel PTP support)
could probably be fewer than a hundred lines of code. In an externally-
buildable module that goes back as far as RHEL8 or even further,
allowing users to just build and use it from their application.

> was there supposed to be text here, or did you just like this
> so much you decided to repost my mail ;) 

Hm, weirdness. I've known Evolution get into a state where it sends
completely *empty* messages, but I've never seen it eat only my own
part before. I had definitely typed responses (along the lines of the
above) last time.


smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH] ptp: Add vDSO-style vmclock support

2024-07-25 Thread David Woodhouse
On Thu, 2024-07-25 at 16:50 -0400, Michael S. Tsirkin wrote:
> On Thu, Jul 25, 2024 at 08:35:40PM +0100, David Woodhouse wrote:
> > On Thu, 2024-07-25 at 12:38 -0400, Michael S. Tsirkin wrote:
> > > On Thu, Jul 25, 2024 at 04:18:43PM +0100, David Woodhouse wrote:
> > > > The use case isn't necessarily for all users of gettimeofday(), of
> > > > course; this is for those applications which *need* precision time.
> > > > Like distributed databases which rely on timestamps for coherency, and
> > > > users who get fined millions of dollars when LM messes up their clocks
> > > > and they put wrong timestamps on financial transactions.
> > > 
> > > I would however worry that with all this pass through,
> > > applications have to be coded to each hypervisor or even
> > > version of the hypervisor.
> > 
> > Yes, that would be a problem. Which is why I feel it's so important to
> > harmonise the contents of the shared memory, and I'm implementing it
> > both QEMU and $DAYJOB, as well as aligning with virtio-rtc.
> 
> 
> Writing an actual spec for this would be another thing that might help.
> 

> > I don't think the structure should be changing between hypervisors (and
> > especially versions). We *will* see a progression from simply providing
> > the disruption signal, to providing the full clock information so that
> > guests don't have to abort transactions while they resync their clock.
> > But that's perfectly fine.
> > 
> > And it's also entirely agnostic to the mechanism by which the memory
> > region is *discovered*. It doesn't matter if it's ACPI, DT, a
> > hypervisor enlightenment, a BAR of a simple PCI device, virtio, or
> > anything else.
> > 
> > ACPI is one of the *simplest* options for a hypervisor and guest to
> > implement, and doesn't prevent us from using the same structure in
> > virtio-rtc. I'm happy enough using ACPI and letting virtio-rtc come
> > along later.
> > 
> > > virtio has been developed with the painful experience that we keep
> > > making mistakes, or coming up with new needed features,
> > > and that maintaining forward and backward compatibility
> > > becomes a whole lot harder than it seems in the beginning.
> > 
> > Yes. But as you note, this shared memory structure is a userspace ABI
> > all of its own, so we get to make a completely *different* kind of
> > mistake :)
> > 
> 
> 
> So, something I still don't completely understand.
> Can't the VDSO thing be written to by kernel?
> Let's say on LM, an interrupt triggers and kernel copies
> data from a specific device to the VDSO.
> 
> Is that problematic somehow? I imagine there is a race where
> userspace reads vdso after lm but before kernel updated
> vdso - is that the concern?
> 
> Then can't we fix it by interrupting all CPUs right after LM?
> 
> To me that seems like a cleaner approach - we then compartmentalize
> the ABI issue - kernel has its own ABI against userspace,
> devices have their own ABI against kernel.
> It'd mean we need a way to detect that interrupt was sent,
> maybe yet another counter inside that structure.
> 
> WDYT?
> 
> By the way the same idea would work for snapshots -
> some people wanted to expose that info to userspace, too.
> 



smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH] ptp: Add vDSO-style vmclock support

2024-07-25 Thread David Woodhouse
On Thu, 2024-07-25 at 12:38 -0400, Michael S. Tsirkin wrote:
> On Thu, Jul 25, 2024 at 04:18:43PM +0100, David Woodhouse wrote:
> > The use case isn't necessarily for all users of gettimeofday(), of
> > course; this is for those applications which *need* precision time.
> > Like distributed databases which rely on timestamps for coherency, and
> > users who get fined millions of dollars when LM messes up their clocks
> > and they put wrong timestamps on financial transactions.
> 
> I would however worry that with all this pass through,
> applications have to be coded to each hypervisor or even
> version of the hypervisor.

Yes, that would be a problem. Which is why I feel it's so important to
harmonise the contents of the shared memory, and I'm implementing it
both QEMU and $DAYJOB, as well as aligning with virtio-rtc.

I don't think the structure should be changing between hypervisors (and
especially versions). We *will* see a progression from simply providing
the disruption signal, to providing the full clock information so that
guests don't have to abort transactions while they resync their clock.
But that's perfectly fine.

And it's also entirely agnostic to the mechanism by which the memory
region is *discovered*. It doesn't matter if it's ACPI, DT, a
hypervisor enlightenment, a BAR of a simple PCI device, virtio, or
anything else.

ACPI is one of the *simplest* options for a hypervisor and guest to
implement, and doesn't prevent us from using the same structure in
virtio-rtc. I'm happy enough using ACPI and letting virtio-rtc come
along later.

> virtio has been developed with the painful experience that we keep
> making mistakes, or coming up with new needed features,
> and that maintaining forward and backward compatibility
> becomes a whole lot harder than it seems in the beginning.

Yes. But as you note, this shared memory structure is a userspace ABI
all of its own, so we get to make a completely *different* kind of
mistake :)



smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH] ptp: Add vDSO-style vmclock support

2024-07-25 Thread David Woodhouse
On Thu, 2024-07-25 at 10:11 -0400, Michael S. Tsirkin wrote:
> On Thu, Jul 25, 2024 at 02:50:50PM +0100, David Woodhouse wrote:
> > Even if the virtio-rtc specification were official today, and I was
> > able to expose it via PCI, I probably wouldn't do it that way. There's
> > just far more in virtio-rtc than we need; the simple shared memory
> > region is perfectly sufficient for most needs, and especially ours.
> 
> I can't stop amazon from shipping whatever in its hypervisor,
> I'd just like to understand this better, if there is a use-case
> not addressed here then we can change virtio to address it.
> 
> The rtc driver patch posted is 900 lines, yours is 700 lines, does not
> look like a big difference.  As for using a memory region, this is
> valid, but maybe rtc should be changed to do exactly that?

I'm certainly aiming for virtio-rtc to include that as an *option*,
because I think I don't think it makes sense for an RTC specification
aimed at virtual machines *not* to deal with the live migration
problem.

AFAICT the only ways to deal with the LM problem are either to make a
hypercall/virtio transaction for *every* clock read which needs to be
accurate, or expose a memory region for the guest to do it "vDSO-
style".

And similarly, unless we want guest userspace to have to make a
*system* call every time, that memory region needs to be mappable all
the way to userspace.

The use case isn't necessarily for all users of gettimeofday(), of
course; this is for those applications which *need* precision time.
Like distributed databases which rely on timestamps for coherency, and
users who get fined millions of dollars when LM messes up their clocks
and they put wrong timestamps on financial transactions.

> E.g. we can easily add a capability describing such a region.
> or put it in device config space.

I think it has to be memory, not config space. But yes.

The intent is that my driver would be usable with the shared memory
region from a virtio-rtc device too. It'd need a tiny amount of
refactoring of the discovery code in vmclock_probe(), which I haven't
done yet as it would be premature optimisation. 

> I mean yes, we can build a new transport for each specific need but in
> the end we'll get a ton of interfaces with unclear compatibility
> requirements.  If effort is instead spent improving common interfaces,
> we get consistency and everyone benefits. That's why I'm trying to
> understand the need here.

It's simplicity. Because this isn't even a "transport". It's just a
simple breadcrumb given to the guest to tell it where the information
is.

In the fullness of time assuming this becomes part of virtio-rtc too,
the fact that it can *also* be discovered by ACPI is just a tiny
detail. And it allows hypervisors to implement it a *whole* lot more
simply.

The addition of an ACPI method to enable the timekeeping does make it a
tiny bit more than a 'breadcrump', I concede — but that's still
basically trivial to implement. A whole lot simpler than a full virtio
device.


smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH] ptp: Add vDSO-style vmclock support

2024-07-25 Thread David Woodhouse
On Thu, 2024-07-25 at 08:33 -0400, Michael S. Tsirkin wrote:
> On Thu, Jul 25, 2024 at 01:31:19PM +0100, David Woodhouse wrote:
> > On Thu, 2024-07-25 at 08:29 -0400, Michael S. Tsirkin wrote:
> > > On Thu, Jul 25, 2024 at 01:27:49PM +0100, David Woodhouse wrote:
> > > > On Thu, 2024-07-25 at 08:17 -0400, Michael S. Tsirkin wrote:
> > > > > On Thu, Jul 25, 2024 at 10:56:05AM +0100, David Woodhouse wrote:
> > > > > > > Do you want to just help complete virtio-rtc then? Would be 
> > > > > > > easier than
> > > > > > > trying to keep two specs in sync.
> > > > > > 
> > > > > > The ACPI version is much more lightweight and doesn't take up a
> > > > > > valuable PCI slot#. (I know, you can do virtio without PCI but 
> > > > > > that's
> > > > > > complex in other ways).
> > > > > > 
> > > > > 
> > > > > Hmm, should we support virtio over ACPI? Just asking.
> > > > 
> > > > Given that we support virtio DT bindings, and the ACPI "PRP0001" device
> > > > exists with a DSM method which literally returns DT properties,
> > > > including such properties as "compatible=virtio,mmio" ... do we
> > > > already?
> > > > 
> > > > 
> > > 
> > > In a sense, but you are saying that is too complex?
> > > Can you elaborate?
> > 
> > No, I think it's fine. I encourage the use of the PRP0001 device to
> > expose DT devices through ACPI. I was just reminding you of its
> > existence.
> 
> Confused. You said "I know, you can do virtio without PCI but that's
> complex in other ways" as the explanation why you are doing a custom
> protocol.

Ah, apologies, I wasn't thinking that far back in the conversation.

If we wanted to support virtio over ACPI, I think PRP0001 can be made
to work and isn't too complex (even though it probably doesn't yet work
out of the box).

But for the VMCLOCK thing, yes, the simple ACPI device is a lot simpler
than virtio-rtc and much more attractive.

Even if the virtio-rtc specification were official today, and I was
able to expose it via PCI, I probably wouldn't do it that way. There's
just far more in virtio-rtc than we need; the simple shared memory
region is perfectly sufficient for most needs, and especially ours.

I have reworked
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/vmclock
to take your other feedback into account.

It's now more flexible about the size handling, and explicitly checking
that specific fields are present before using them. 

I think I'm going to add a method on the ACPI device to enable the
precise clock information. I haven't done that in the driver yet; it
still just consumes the precise clock information if it happens to be
present already. The enable method can be added in a compatible fashion
(the failure mode is that guests which don't invoke this method when
the hypervisor needs them to will see only the disruption signal and
not precise time).

For the HID I'm going to use AMZNVCLK. I had used QEMUVCLK in the QEMU
patches, but I'll change that to use AMZNVCLK too when I repost the
QEMU patch.


smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH] ptp: Add vDSO-style vmclock support

2024-07-25 Thread David Woodhouse
On Thu, 2024-07-25 at 08:29 -0400, Michael S. Tsirkin wrote:
> On Thu, Jul 25, 2024 at 01:27:49PM +0100, David Woodhouse wrote:
> > On Thu, 2024-07-25 at 08:17 -0400, Michael S. Tsirkin wrote:
> > > On Thu, Jul 25, 2024 at 10:56:05AM +0100, David Woodhouse wrote:
> > > > > Do you want to just help complete virtio-rtc then? Would be easier 
> > > > > than
> > > > > trying to keep two specs in sync.
> > > > 
> > > > The ACPI version is much more lightweight and doesn't take up a
> > > > valuable PCI slot#. (I know, you can do virtio without PCI but that's
> > > > complex in other ways).
> > > > 
> > > 
> > > Hmm, should we support virtio over ACPI? Just asking.
> > 
> > Given that we support virtio DT bindings, and the ACPI "PRP0001" device
> > exists with a DSM method which literally returns DT properties,
> > including such properties as "compatible=virtio,mmio" ... do we
> > already?
> > 
> > 
> 
> In a sense, but you are saying that is too complex?
> Can you elaborate?

No, I think it's fine. I encourage the use of the PRP0001 device to
expose DT devices through ACPI. I was just reminding you of its
existence.




smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH] ptp: Add vDSO-style vmclock support

2024-07-25 Thread David Woodhouse
On Thu, 2024-07-25 at 08:17 -0400, Michael S. Tsirkin wrote:
> On Thu, Jul 25, 2024 at 10:56:05AM +0100, David Woodhouse wrote:
> > > Do you want to just help complete virtio-rtc then? Would be easier than
> > > trying to keep two specs in sync.
> > 
> > The ACPI version is much more lightweight and doesn't take up a
> > valuable PCI slot#. (I know, you can do virtio without PCI but that's
> > complex in other ways).
> > 
> 
> Hmm, should we support virtio over ACPI? Just asking.

Given that we support virtio DT bindings, and the ACPI "PRP0001" device
exists with a DSM method which literally returns DT properties,
including such properties as "compatible=virtio,mmio" ... do we
already?




smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH] ptp: Add vDSO-style vmclock support

2024-07-25 Thread David Woodhouse
On Thu, 2024-07-25 at 12:31 +0100, Daniel P. Berrangé wrote:
> On Thu, Jul 25, 2024 at 10:56:05AM +0100, David Woodhouse wrote:
> > Hi Michael, thanks for the review!
> > 
> > On Thu, 2024-07-25 at 01:48 -0400, Michael S. Tsirkin wrote:
> > > Do you want to just help complete virtio-rtc then? Would be easier than
> > > trying to keep two specs in sync.
> > 
> > The ACPI version is much more lightweight and doesn't take up a
> > valuable PCI slot#. (I know, you can do virtio without PCI but that's
> > complex in other ways).
> 
> In general it shouldn't have to take up a PCI slot, that's just
> a common default policy. virtio-devices only need a dedicated
> slot if there's a need to do hotplug/unplug of them. There is a
> set of core devices for which hotplug doesn't make sense, which
> could all be put as functions in the same slot. ie virtio-rng,
> virtio-balloon and virtio-rtc, could all live in one slot.

But if you don't have any virtio devices already, you still need one
slot to put them in.


smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH] ptp: Add vDSO-style vmclock support

2024-07-25 Thread David Woodhouse
On Thu, 2024-07-25 at 13:20 +0200, Paolo Abeni wrote:
> 
> 
> Just a bunch of 'nits below

Thank you. Fixed in
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/vmclock

I'll post a new version once I've finished resolving mst's and any
other feedback.



smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH] ptp: Add vDSO-style vmclock support

2024-07-25 Thread David Woodhouse
On Thu, 2024-07-25 at 01:54 -0400, Michael S. Tsirkin wrote:
> one other thing worth mentioning is that this design can't work
> with confidential computing setups. By comparison, mapping e.g. a
> range in a PCI BAR would work for these setups.

Why so? This is just like mapping a PCI BAR, isn't it? It's cacheable
MMIO space, *not* part of the encrypted guest RAM ranges. It just
happens to be discovered through the _CRS of an ACPI device, not the
BAR of a PCI device.

> Is there a reason this functionality is not interesting for
> confidential VMs?

It is. In fact, that was one of the reasons for doing it as mappable
MMIO space, instead of having the guest allocate a portion of its own
RAM and invoke a hypervisor enlightenment to populate it. (Although the
latter *can* work with CC too, as demonstrated by e.g. ptp_kvm).



smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH] ptp: Add vDSO-style vmclock support

2024-07-25 Thread David Woodhouse
Hi Michael, thanks for the review!

On Thu, 2024-07-25 at 01:48 -0400, Michael S. Tsirkin wrote:
> On Wed, Jul 24, 2024 at 06:16:37PM +0100, David Woodhouse wrote:
> > From: David Woodhouse 
> > 
> > The vmclock "device" provides a shared memory region with precision clock
> > information. By using shared memory, it is safe across Live Migration.
> > 
> > Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> > KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> > actually helpful.
> > 
> > The memory region of the device is also exposed to userspace so it can be
> > read or memory mapped by application which need reliable notification of
> > clock disruptions.
> > 
> > Signed-off-by: David Woodhouse 
> > ---
> > QEMU implementation at
> > https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock
> > 
> > Although the ACPI device implemented in QEMU (and some other
> > hypervisor) stands alone, most of the fields and values herein are
> > aligned as much as possible with the nascent virtio-rtc specification,
> > with the intent that a version of the same structure can be
> > incorporated into that standard.
> 
> Do you want to just help complete virtio-rtc then? Would be easier than
> trying to keep two specs in sync.

The ACPI version is much more lightweight and doesn't take up a
valuable PCI slot#. (I know, you can do virtio without PCI but that's
complex in other ways).

So I do see these existing in parallel, and I'm happy for the VMCLOCK
device to defer to the shared memory structure in the virtio-rtc
specification once that's final.

I've done my best to ensure we have a fair chance of it being adopted
as-is, and it's versioned so if virtio-rtc gets published with v2 of
the structure, that's also OK.
> 

> > +
> > +   if (st->clk->magic != VMCLOCK_MAGIC ||
> > +   st->clk->size < sizeof(*st->clk) ||
> 
> 
> This is a problem and what I mention below.
> As structure will evolve, sizeof(*st->clk) will grow,
> then it will fail on old hosts.

Potentially, yes. At the time a guest starts to support a larger
version of the struct, it would need to check for the size of the v1
structure.

We certainly *allow* for the structure to evolve, but as you say this
is ABI. So we've tried to plan ahead and avoid having to change it
unless something *really* surprising happens in the field of leap
seconds.


> > +
> > +   /* If the structure is big enough, it can be mapped to userspace */
> > +   if (st->clk->size >= PAGE_SIZE) {
> 
> This is also part of ABI. And a problematic one since linux can
> increase PAGE_SIZE at any time for its own reasons.

Yes. For it to be mapped to userspace, the hypervisor has to put it
into a memory region which is large enough for the page granularity
that the guest happens to use.

The hypervisor implementations so far (for x86 in QEMU since there
seems to be no ACPI support for Arm?, and for x86+Arm in my other pet
hypervisor) expose a 4KiB region.

So if the guest uses larger pages (and can't cope with using 4KiB PTEs
for MMIO mappings), then it can't use this feature unless the
hypervisor gives it a larger region. I think that's OK.

> > +   st->miscdev.minor = MISC_DYNAMIC_MINOR;
> > +   st->miscdev.fops = &vmclock_miscdev_fops;
> > +   st->miscdev.name = st->name;
> > +
> > +   ret = misc_register(&st->miscdev);
> > +   if (ret)
> > +   goto out;
> > +   }
> 
> Hmm so you want to map a page to userspace, but who will
> keep the clock in that memory page up to date?
> Making hypervisor do it will mean stealing a bunch
> of hypervisor time and we don't *know* userspace is
> going to use it.

I think it should scale OK. The bulk of the actual *calculations* are
done only once by the hypervisor on an NTP skew adjustment anyway,
rather than per-VM. That gives the period of the actual host counter,
and the reference time.

Where TSC scaling *isn't* occurring, all that the VMM needs to do is
copy that information, applying the appropriate TSC offset. I think it
should be mostly in the noise compared with other VMM overhead?

Where TSC scaling is in use, it's a little more complex. The counter
value at the reference time needs to be converted using the standard
mul-and-shift TSC scaling method. But the counter *period* needs to
undergo the reverse conversion. That would be a bit more work if done
naïvely with a right shift and then a *division*, but an appropriate
efficient conversion to match the frequency scaling

[PATCH] ptp: Add vDSO-style vmclock support

2024-07-24 Thread David Woodhouse
From: David Woodhouse 

The vmclock "device" provides a shared memory region with precision clock
information. By using shared memory, it is safe across Live Migration.

Like the KVM PTP clock, this can convert TSC-based cross timestamps into
KVM clock values. Unlike the KVM PTP clock, it does so only when such is
actually helpful.

The memory region of the device is also exposed to userspace so it can be
read or memory mapped by application which need reliable notification of
clock disruptions.

Signed-off-by: David Woodhouse 
---
QEMU implementation at
https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock

Although the ACPI device implemented in QEMU (and some other
hypervisor) stands alone, most of the fields and values herein are
aligned as much as possible with the nascent virtio-rtc specification,
with the intent that a version of the same structure can be
incorporated into that standard.

v1:
 • Change absolute error fields to nanoseconds
 • Update leap second definition to match virtio-rtc intentions in 

https://lore.kernel.org/all/85c93b42-41a2-42c4-a168-55079bbff...@opensynergy.com

RFC v4:
 • Add esterror fields, MONOTONIC flag.
 • Reduce seq_count to 32 bits
 • Expand size to permit 64KiB pages
 • Align with virtio-rtc fields, values and leap handling
 • Drop gettime() method (since we have gettimex())
 • Add leap second smearing hint
 • Use a real _CRS on the ACPI device

RFC v3: (wrong patch sent)

RFC v2:
 • Add gettimex64() support
 • Convert TSC values to KVM clock when appropriate
 • Require int128 support
 • Add counter_period_shift
 • Add timeout when seq_count is invalid
 • Add flags field
 • Better comments in vmclock ABI structure
 • Explicitly forbid smearing (as clock rates would need to change)


 drivers/ptp/Kconfig  |  13 +
 drivers/ptp/Makefile |   1 +
 drivers/ptp/ptp_vmclock.c| 567 +++
 include/uapi/linux/vmclock-abi.h | 187 ++
 4 files changed, 768 insertions(+)
 create mode 100644 drivers/ptp/ptp_vmclock.c
 create mode 100644 include/uapi/linux/vmclock-abi.h

diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index 604541dcb320..e98c9767e0ef 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -131,6 +131,19 @@ config PTP_1588_CLOCK_KVM
  To compile this driver as a module, choose M here: the module
  will be called ptp_kvm.
 
+config PTP_1588_CLOCK_VMCLOCK
+   tristate "Virtual machine PTP clock"
+   depends on X86_TSC || ARM_ARCH_TIMER
+   depends on PTP_1588_CLOCK && ACPI && ARCH_SUPPORTS_INT128
+   default y
+   help
+ This driver adds support for using a virtual precision clock
+ advertised by the hypervisor. This clock is only useful in virtual
+ machines where such a device is present.
+
+ To compile this driver as a module, choose M here: the module
+ will be called ptp_vmclock.
+
 config PTP_1588_CLOCK_IDT82P33
tristate "IDT 82P33xxx PTP clock"
depends on PTP_1588_CLOCK && I2C
diff --git a/drivers/ptp/Makefile b/drivers/ptp/Makefile
index 68bf02078053..01b5cd91eb61 100644
--- a/drivers/ptp/Makefile
+++ b/drivers/ptp/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_PTP_1588_CLOCK_DTE)  += ptp_dte.o
 obj-$(CONFIG_PTP_1588_CLOCK_INES)  += ptp_ines.o
 obj-$(CONFIG_PTP_1588_CLOCK_PCH)   += ptp_pch.o
 obj-$(CONFIG_PTP_1588_CLOCK_KVM)   += ptp_kvm.o
+obj-$(CONFIG_PTP_1588_CLOCK_VMCLOCK)   += ptp_vmclock.o
 obj-$(CONFIG_PTP_1588_CLOCK_QORIQ) += ptp-qoriq.o
 ptp-qoriq-y+= ptp_qoriq.o
 ptp-qoriq-$(CONFIG_DEBUG_FS)   += ptp_qoriq_debugfs.o
diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
new file mode 100644
index ..9c508c21c062
--- /dev/null
+++ b/drivers/ptp/ptp_vmclock.c
@@ -0,0 +1,567 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Virtual PTP 1588 clock for use with LM-safe VMclock device.
+ *
+ * Copyright © 2024 Amazon.com, Inc. or its affiliates.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#include 
+
+#ifdef CONFIG_X86
+#include 
+#include 
+#endif
+
+#ifdef CONFIG_KVM_GUEST
+#define SUPPORT_KVMCLOCK
+#endif
+
+static DEFINE_IDA(vmclock_ida);
+
+ACPI_MODULE_NAME("vmclock");
+
+struct vmclock_state {
+   struct resource res;
+   struct vmclock_abi *clk;
+   struct miscdevice miscdev;
+   struct ptp_clock_info ptp_clock_info;
+   struct ptp_clock *ptp_clock;
+   enum clocksource_ids cs_id, sys_cs_id;
+   int index;
+   char *name;
+};
+
+#define VMCLOCK_MAX_WAIT ms_to_ktime(100)
+
+/*
+ * Multiply a 64-bit count by a 64-bit tick 'period' in units of seconds >> 64
+ * and add the fractional second part of the reference time.
+ *
+ * The result is a 128-bit value, 

Re: [RFC PATCH v4] ptp: Add vDSO-style vmclock support

2024-07-17 Thread David Woodhouse
On Tue, 2024-07-16 at 15:20 +0200, Peter Hilber wrote:
> On 16.07.24 14:32, David Woodhouse wrote:
> > On 16 July 2024 12:54:52 BST, Peter Hilber  
> > wrote:
> > > On 11.07.24 09:50, David Woodhouse wrote:
> > > > On Thu, 2024-07-11 at 09:25 +0200, Peter Hilber wrote:
> > > > > 
> > > > > IMHO this phrasing is better, since it directly refers to the state 
> > > > > of the
> > > > > structure.
> > > > 
> > > > Thanks. I'll update it.
> > > > 
> > > > > AFAIU if there would be abnormal delays in store buffers, causing some
> > > > > driver to still see the old clock for some time, the monotonicity 
> > > > > could be
> > > > > violated:
> > > > > 
> > > > > 1. device writes new, much slower clock to store buffer
> > > > > 2. some time passes
> > > > > 3. driver reads old, much faster clock
> > > > > 4. device writes store buffer to cache
> > > > > 5. driver reads new, much slower clock
> > > > > 
> > > > > But I hope such delays do not occur.
> > > > 
> > > > For the case of the hypervisor←→guest interface this should be handled
> > > > by the use of memory barriers and the seqcount lock.
> > > > 
> > > > The guest driver reads the seqcount, performs a read memory barrier,
> > > > then reads the contents of the structure. Then performs *another* read
> > > > memory barrier, and checks the seqcount hasn't changed:
> > > > https://git.infradead.org/?p=users/dwmw2/linux.git;a=blob;f=drivers/ptp/ptp_vmclock.c;hb=vmclock#l351
> > > > 
> > > > The converse happens with write barriers on the hypervisor side:
> > > > https://git.infradead.org/?p=users/dwmw2/qemu.git;a=blob;f=hw/acpi/vmclock.c;hb=vmclock#l68
> > > 
> > > My point is that, looking at the above steps 1. - 5.:
> > > 
> > > 3. read HW counter, smp_rmb, read seqcount
> > > 4. store seqcount, smp_wmb, stores, smp_wmb, store seqcount become 
> > > effective
> > > 5. read seqcount, smp_rmb, read HW counter
> > > 
> > > AFAIU this would still be a theoretical problem suggesting the use of
> > > stronger barriers.
> > 
> > This seems like a bug on the guest side. The HW counter needs to be read 
> > *within* the (paired, matching) seqcount reads, not before or after.
> > 
> > 
> 
> There would be paired reads:
> 
> 1. device writes new, much slower clock to store buffer
> 2. some time passes
> 3. read seqcount, smp_rmb, ..., read HW counter, smp_rmb, read seqcount
> 4. store seqcount, smp_wmb, stores, smp_wmb, store seqcount all become
>    effective only now
> 5. read seqcount, smp_rmb, read HW counter, ..., smp_rmb, read seqcount
> 
> I just omitted the parts which do not necessarily need to happen close to
> 4. for the monotonicity to be violated. My point is that 1. could become
> visible to other cores long after it happened on the local core (during
> 4.).

Oh, I see. That would be a bug on the device side then. And as you say,
it could be fixed by using the appropriate barriers. Or my alternative
of just documenting "Don't Do That Then".



smime.p7s
Description: S/MIME cryptographic signature


Re: [RFC PATCH v4] ptp: Add vDSO-style vmclock support

2024-07-17 Thread David Woodhouse
On Tue, 2024-07-16 at 13:54 +0200, Peter Hilber wrote:
> On 08.07.24 11:27, David Woodhouse wrote:
> > +
> > +   /*
> > +    * Time according to time_type field above.
> > +    */
> > +   uint64_t time_sec;  /* Seconds since time_type epoch */
> > +   uint64_t time_frac_sec; /* (seconds >> 64) */
> > +   uint64_t time_esterror_picosec; /* (± picoseconds) */
> > +   uint64_t time_maxerror_picosec; /* (± picoseconds) */
> 
> Is this unsigned or signed?

The field itself is unsigned, as it provides the absolute value of the
error (which can be in either direction). Probably better just to drop
the ± from the comment.

Julien is now back from vacation and I'm expecting to see his opinion
on whether we can change that to nanoseconds for consistency.


smime.p7s
Description: S/MIME cryptographic signature


Re: [RFC PATCH v4] ptp: Add vDSO-style vmclock support

2024-07-16 Thread David Woodhouse
On 16 July 2024 12:54:52 BST, Peter Hilber  wrote:
>On 11.07.24 09:50, David Woodhouse wrote:
>> On Thu, 2024-07-11 at 09:25 +0200, Peter Hilber wrote:
>>>
>>> IMHO this phrasing is better, since it directly refers to the state of the
>>> structure.
>> 
>> Thanks. I'll update it.
>> 
>>> AFAIU if there would be abnormal delays in store buffers, causing some
>>> driver to still see the old clock for some time, the monotonicity could be
>>> violated:
>>>
>>> 1. device writes new, much slower clock to store buffer
>>> 2. some time passes
>>> 3. driver reads old, much faster clock
>>> 4. device writes store buffer to cache
>>> 5. driver reads new, much slower clock
>>>
>>> But I hope such delays do not occur.
>> 
>> For the case of the hypervisor←→guest interface this should be handled
>> by the use of memory barriers and the seqcount lock.
>> 
>> The guest driver reads the seqcount, performs a read memory barrier,
>> then reads the contents of the structure. Then performs *another* read
>> memory barrier, and checks the seqcount hasn't changed:
>> https://git.infradead.org/?p=users/dwmw2/linux.git;a=blob;f=drivers/ptp/ptp_vmclock.c;hb=vmclock#l351
>> 
>> The converse happens with write barriers on the hypervisor side:
>> https://git.infradead.org/?p=users/dwmw2/qemu.git;a=blob;f=hw/acpi/vmclock.c;hb=vmclock#l68
>
>My point is that, looking at the above steps 1. - 5.:
>
>3. read HW counter, smp_rmb, read seqcount
>4. store seqcount, smp_wmb, stores, smp_wmb, store seqcount become effective
>5. read seqcount, smp_rmb, read HW counter
>
>AFAIU this would still be a theoretical problem suggesting the use of
>stronger barriers.

This seems like a bug on the guest side. The HW counter needs to be read 
*within* the (paired, matching) seqcount reads, not before or after.





Re: [RFC PATCH v4] ptp: Add vDSO-style vmclock support

2024-07-11 Thread David Woodhouse
On Thu, 2024-07-11 at 09:25 +0200, Peter Hilber wrote:
> 
> IMHO this phrasing is better, since it directly refers to the state of the
> structure.

Thanks. I'll update it.

> AFAIU if there would be abnormal delays in store buffers, causing some
> driver to still see the old clock for some time, the monotonicity could be
> violated:
> 
> 1. device writes new, much slower clock to store buffer
> 2. some time passes
> 3. driver reads old, much faster clock
> 4. device writes store buffer to cache
> 5. driver reads new, much slower clock
> 
> But I hope such delays do not occur.

For the case of the hypervisor←→guest interface this should be handled
by the use of memory barriers and the seqcount lock.

The guest driver reads the seqcount, performs a read memory barrier,
then reads the contents of the structure. Then performs *another* read
memory barrier, and checks the seqcount hasn't changed:
https://git.infradead.org/?p=users/dwmw2/linux.git;a=blob;f=drivers/ptp/ptp_vmclock.c;hb=vmclock#l351

The converse happens with write barriers on the hypervisor side:
https://git.infradead.org/?p=users/dwmw2/qemu.git;a=blob;f=hw/acpi/vmclock.c;hb=vmclock#l68

Do we need to think harder about the ordering across a real PCI bus? It
isn't entirely unreasonable for this to be implemented in hardware if
we eventually add a counter_id value for a bus-visible counter like the
Intel Always Running Timer (ART). I'm also OK with saying that device
implementations may only provide the shared memory structure if they
can ensure memory ordering.


smime.p7s
Description: S/MIME cryptographic signature


Re: [RFC PATCH v4] ptp: Add vDSO-style vmclock support

2024-07-10 Thread David Woodhouse
On Wed, 2024-07-10 at 15:07 +0200, Peter Hilber wrote:
> On 08.07.24 11:27, David Woodhouse wrote:
> > From: David Woodhouse 
> > 
> > The vmclock "device" provides a shared memory region with precision clock
> > information. By using shared memory, it is safe across Live Migration.
> > 
> > Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> > KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> > actually helpful.
> > 
> > The memory region of the device is also exposed to userspace so it can be
> > read or memory mapped by application which need reliable notification of
> > clock disruptions.
> > 
> > Signed-off-by: David Woodhouse 
> 
> [...]
> 
> > +
> > +struct vmclock_abi {
> > +   /* CONSTANT FIELDS */
> > +   uint32_t magic;
> > +#define VMCLOCK_MAGIC  0x4b4c4356 /* "VCLK" */
> > +   uint32_t size;  /* Size of region containing this structure 
> > */
> > +   uint16_t version;   /* 1 */
> > +   uint8_t counter_id; /* Matches VIRTIO_RTC_COUNTER_xxx except 
> > INVALID */
> > +#define VMCLOCK_COUNTER_ARM_VCNT   0
> > +#define VMCLOCK_COUNTER_X86_TSC1
> > +#define VMCLOCK_COUNTER_INVALID0xff
> > +   uint8_t time_type; /* Matches VIRTIO_RTC_TYPE_xxx */
> > +#define VMCLOCK_TIME_UTC   0   /* Since 1970-01-01 
> > 00:00:00z */
> > +#define VMCLOCK_TIME_TAI   1   /* Since 1970-01-01 
> > 00:00:00z */
> > +#define VMCLOCK_TIME_MONOTONIC 2   /* Since undefined 
> > epoch */
> > +#define VMCLOCK_TIME_INVALID_SMEARED   3   /* Not supported */
> > +#define VMCLOCK_TIME_INVALID_MAYBE_SMEARED 4   /* Not supported */
> > +
> > +   /* NON-CONSTANT FIELDS PROTECTED BY SEQCOUNT LOCK */
> > +   uint32_t seq_count; /* Low bit means an update is in progress */
> > +   /*
> > +    * This field changes to another non-repeating value when the CPU
> > +    * counter is disrupted, for example on live migration. This lets
> > +    * the guest know that it should discard any calibration it has
> > +    * performed of the counter against external sources (NTP/PTP/etc.).
> > +    */
> > +   uint64_t disruption_marker;
> > +   uint64_t flags;
> > +   /* Indicates that the tai_offset_sec field is valid */
> > +#define VMCLOCK_FLAG_TAI_OFFSET_VALID  (1 << 0)
> > +   /*
> > +    * Optionally used to notify guests of pending maintenance events.
> > +    * A guest which provides latency-sensitive services may wish to
> > +    * remove itself from service if an event is coming up. Two flags
> > +    * indicate the approximate imminence of the event.
> > +    */
> > +#define VMCLOCK_FLAG_DISRUPTION_SOON   (1 << 1) /* About a day */
> > +#define VMCLOCK_FLAG_DISRUPTION_IMMINENT   (1 << 2) /* About an hour */
> > +#define VMCLOCK_FLAG_PERIOD_ESTERROR_VALID (1 << 3)
> > +#define VMCLOCK_FLAG_PERIOD_MAXERROR_VALID (1 << 4)
> > +#define VMCLOCK_FLAG_TIME_ESTERROR_VALID   (1 << 5)
> > +#define VMCLOCK_FLAG_TIME_MAXERROR_VALID   (1 << 6)
> > +   /*
> > +    * Even regardless of leap seconds, the time presented through this
> > +    * mechanism may not be strictly monotonic. If the counter slows 
> > down
> > +    * and the host adapts to this discovery, the time calculated from
> > +    * the value of the counter immediately after an update to this
> > +    * structure, may appear to be *earlier* than a calculation just
> > +    * before the update (while the counter was believed to be running
> > +    * faster than it now is). A guest operating system will typically
> > +    * *skew* its own system clock back towards the reference clock
> > +    * exposed here, rather than following this clock directly. If,
> > +    * however, this structure is being populated from such a system
> > +    * clock which is already handled in such a fashion and the results
> > +    * *are* guaranteed to be monotonic, such monotonicity can be
> > +    * advertised by setting this bit.
> > +    */
> 
> I wonder if this might be difficult to define in a standard.

I'm sure we could do better than my attempt above, but surely it isn't
*so* hard to define monotonicity?

> Is there a need to define device and driver behavior in more detail? What
> wou

[RFC PATCH v4] ptp: Add vDSO-style vmclock support

2024-07-08 Thread David Woodhouse
From: David Woodhouse 

The vmclock "device" provides a shared memory region with precision clock
information. By using shared memory, it is safe across Live Migration.

Like the KVM PTP clock, this can convert TSC-based cross timestamps into
KVM clock values. Unlike the KVM PTP clock, it does so only when such is
actually helpful.

The memory region of the device is also exposed to userspace so it can be
read or memory mapped by application which need reliable notification of
clock disruptions.

Signed-off-by: David Woodhouse 
---

QEMU implementation at
https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock

Remaining questions/TODO for virtio-rtc adoption:
 • Use of signed integer for tai_offset field
 • Explicit little-endianness
 • Is picoseconds the right unit for absolute error (I was going to make
   this (seconds>>64) but that actually reduces the *range* that can be
   expressed).
 • Are the clock_status values sensible?

v4:
 • Add esterror fields, MONOTONIC flag.
 • Reduce seq_count to 32 bits
 • Expand size to permit 64KiB pages
 • Align with virtio-rtc fields, values and leap handling
 • Drop gettime() method (since we have gettimex())
 • Add leap second smearing hint
 • Use a real _CRS on the ACPI device

v3: (wrong patch sent)

v2:
 • Add gettimex64() support
 • Convert TSC values to KVM clock when appropriate
 • Require int128 support
 • Add counter_period_shift
 • Add timeout when seq_count is invalid
 • Add flags field
 • Better comments in vmclock ABI structure
 • Explicitly forbid smearing (as clock rates would need to change)


 drivers/ptp/Kconfig  |  13 +
 drivers/ptp/Makefile |   1 +
 drivers/ptp/ptp_vmclock.c| 567 +++
 include/uapi/linux/vmclock-abi.h | 187 ++
 4 files changed, 768 insertions(+)
 create mode 100644 drivers/ptp/ptp_vmclock.c
 create mode 100644 include/uapi/linux/vmclock-abi.h

diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index 604541dcb320..e98c9767e0ef 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -131,6 +131,19 @@ config PTP_1588_CLOCK_KVM
  To compile this driver as a module, choose M here: the module
  will be called ptp_kvm.
 
+config PTP_1588_CLOCK_VMCLOCK
+   tristate "Virtual machine PTP clock"
+   depends on X86_TSC || ARM_ARCH_TIMER
+   depends on PTP_1588_CLOCK && ACPI && ARCH_SUPPORTS_INT128
+   default y
+   help
+ This driver adds support for using a virtual precision clock
+ advertised by the hypervisor. This clock is only useful in virtual
+ machines where such a device is present.
+
+ To compile this driver as a module, choose M here: the module
+ will be called ptp_vmclock.
+
 config PTP_1588_CLOCK_IDT82P33
tristate "IDT 82P33xxx PTP clock"
depends on PTP_1588_CLOCK && I2C
diff --git a/drivers/ptp/Makefile b/drivers/ptp/Makefile
index 68bf02078053..01b5cd91eb61 100644
--- a/drivers/ptp/Makefile
+++ b/drivers/ptp/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_PTP_1588_CLOCK_DTE)  += ptp_dte.o
 obj-$(CONFIG_PTP_1588_CLOCK_INES)  += ptp_ines.o
 obj-$(CONFIG_PTP_1588_CLOCK_PCH)   += ptp_pch.o
 obj-$(CONFIG_PTP_1588_CLOCK_KVM)   += ptp_kvm.o
+obj-$(CONFIG_PTP_1588_CLOCK_VMCLOCK)   += ptp_vmclock.o
 obj-$(CONFIG_PTP_1588_CLOCK_QORIQ) += ptp-qoriq.o
 ptp-qoriq-y+= ptp_qoriq.o
 ptp-qoriq-$(CONFIG_DEBUG_FS)   += ptp_qoriq_debugfs.o
diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
new file mode 100644
index ..30f15d7753bb
--- /dev/null
+++ b/drivers/ptp/ptp_vmclock.c
@@ -0,0 +1,567 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Virtual PTP 1588 clock for use with LM-safe VMclock device.
+ *
+ * Copyright © 2024 Amazon.com, Inc. or its affiliates.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#include 
+
+#ifdef CONFIG_X86
+#include 
+#include 
+#endif
+
+#ifdef CONFIG_KVM_GUEST
+#define SUPPORT_KVMCLOCK
+#endif
+
+static DEFINE_IDA(vmclock_ida);
+
+ACPI_MODULE_NAME("vmclock");
+
+struct vmclock_state {
+   struct resource res;
+   struct vmclock_abi *clk;
+   struct miscdevice miscdev;
+   struct ptp_clock_info ptp_clock_info;
+   struct ptp_clock *ptp_clock;
+   enum clocksource_ids cs_id, sys_cs_id;
+   int index;
+   char *name;
+};
+
+#define VMCLOCK_MAX_WAIT ms_to_ktime(100)
+
+/*
+ * Multiply a 64-bit count by a 64-bit tick 'period' in units of seconds >> 64
+ * and add the fractional second part of the reference time.
+ *
+ * The result is a 128-bit value, the top 64 bits of which are seconds, and
+ * the low 64 bits are (seconds >> 64).
+ *
+ * If __int128 isn't available, perform the calculation 32 bits at a time to
+ 

Re: [RFC PATCH v3] ptp: Add vDSO-style vmclock support

2024-07-08 Thread David Woodhouse
On Mon, 2024-07-08 at 10:17 +0100, Simon Horman wrote:
> Hi David,
> 
> As per my comment on v2, although it is harmless in this case,
> it would be nicer to use strscpy() here and avoid fortification
> warnings.
> 

Oh, pants! I have fixed that; must have sent the wrong version.
V4 coming up shortly. With git-send-email this time.

Thanks.


smime.p7s
Description: S/MIME cryptographic signature


[RFC PATCH v3] ptp: Add vDSO-style vmclock support

2024-07-06 Thread David Woodhouse
From: David Woodhouse 

The vmclock "device" provides a shared memory region with precision clock
information. By using shared memory, it is safe across Live Migration.

Like the KVM PTP clock, this can convert TSC-based cross timestamps into
KVM clock values. Unlike the KVM PTP clock, it does so only when such is
actually helpful.

The memory region of the device is also exposed to userspace so it can be
read or memory mapped by application which need reliable notification of
clock disruptions.

The shared memory structure is intended to be adopted into the nascent
virtio-rtc specification (since one might consider a virtio-rtc
specification that doesn't fix the live migration problem to be not fit
for purpose). It can also be presented via a simple ACPI device.

Signed-off-by: David Woodhouse 
---
QEMU implementation at 
https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock

Remaining questions/TODO for virtio-rtc adoption:
 • Use of signed integer for tai_offset field
 • Explicit little-endianness
 • Is picoseconds the right unit for absolute error (I was going to make
   this (seconds>>64) but that actually reduces the *range* that can be
   expressed).
 • Are the clock_status values sensible?

v3:
 • Add esterror fields
 • Reduce seq_count to 32 bits
 • Expand size to permit 64KiB pages
 • Align with virtio-rtc fields, values and leap handling
 • Drop gettime() method (since we have gettimex())
 • Add leap second smearing hint
 • Use a real _CRS on the ACPI device

v2: 
 • Add gettimex64() support
 • Convert TSC values to KVM clock when appropriate
 • Require int128 support
 • Add counter_period_shift
 • Add timeout when seq_count is invalid
 • Add flags field
 • Better comments in vmclock ABI structure
 • Explicitly forbid smearing (as clock rates would need to change)

 drivers/ptp/Kconfig  |  13 +
 drivers/ptp/Makefile |   1 +
 drivers/ptp/ptp_vmclock.c| 516 +++
 include/uapi/linux/vmclock.h | 138 ++
 4 files changed, 668 insertions(+)
 create mode 100644 drivers/ptp/ptp_vmclock.c
 create mode 100644 include/uapi/linux/vmclock.h

diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index 604541dcb320..e98c9767e0ef 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -131,6 +131,19 @@ config PTP_1588_CLOCK_KVM
  To compile this driver as a module, choose M here: the module
  will be called ptp_kvm.
 
+config PTP_1588_CLOCK_VMCLOCK
+   tristate "Virtual machine PTP clock"
+   depends on X86_TSC || ARM_ARCH_TIMER
+   depends on PTP_1588_CLOCK && ACPI && ARCH_SUPPORTS_INT128
+   default y
+   help
+ This driver adds support for using a virtual precision clock
+ advertised by the hypervisor. This clock is only useful in virtual
+ machines where such a device is present.
+
+ To compile this driver as a module, choose M here: the module
+ will be called ptp_vmclock.
+
 config PTP_1588_CLOCK_IDT82P33
tristate "IDT 82P33xxx PTP clock"
depends on PTP_1588_CLOCK && I2C
diff --git a/drivers/ptp/Makefile b/drivers/ptp/Makefile
index 68bf02078053..01b5cd91eb61 100644
--- a/drivers/ptp/Makefile
+++ b/drivers/ptp/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_PTP_1588_CLOCK_DTE)  += ptp_dte.o
 obj-$(CONFIG_PTP_1588_CLOCK_INES)  += ptp_ines.o
 obj-$(CONFIG_PTP_1588_CLOCK_PCH)   += ptp_pch.o
 obj-$(CONFIG_PTP_1588_CLOCK_KVM)   += ptp_kvm.o
+obj-$(CONFIG_PTP_1588_CLOCK_VMCLOCK)   += ptp_vmclock.o
 obj-$(CONFIG_PTP_1588_CLOCK_QORIQ) += ptp-qoriq.o
 ptp-qoriq-y+= ptp_qoriq.o
 ptp-qoriq-$(CONFIG_DEBUG_FS)   += ptp_qoriq_debugfs.o
diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
new file mode 100644
index ..e19c2eed8009
--- /dev/null
+++ b/drivers/ptp/ptp_vmclock.c
@@ -0,0 +1,516 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Virtual PTP 1588 clock for use with LM-safe VMclock device.
+ *
+ * Copyright © 2024 Amazon.com, Inc. or its affiliates.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#ifdef CONFIG_X86
+#include 
+#include 
+#endif
+
+static DEFINE_IDA(vmclock_ida);
+
+ACPI_MODULE_NAME("vmclock");
+
+struct vmclock_state {
+   phys_addr_t phys_addr;
+   struct vmclock_abi *clk;
+   struct miscdevice miscdev;
+   struct ptp_clock_info ptp_clock_info;
+   struct ptp_clock *ptp_clock;
+   enum clocksource_ids cs_id, sys_cs_id;
+   int index;
+   char *name;
+};
+
+#define VMCLOCK_MAX_WAIT ms_to_ktime(100)
+
+/*
+ * Multiply a 64-bit count by a 64-bit tick 'period' in units of seconds >> 64
+ * and add the fractional second part of the reference time.
+ *
+ * The result is a 128-bit value, the top 64 bits of which are seconds, and
+ * the low 6

Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

2024-07-05 Thread David Woodhouse
On Fri, 2024-07-05 at 10:12 +0200, Peter Hilber wrote:
> On 03.07.24 12:40, David Woodhouse wrote:
> 
> [...]
> 
> > 
> > 
> > This is what I currently have for 'struct vmclock_abi' that I'd like to
> > persuade you to adopt. I need to tweak it some more, for at least the
> > following reasons, as well as any more you can see:
> > 
> >  • size isn't big enough for 64KiB pages
> >  • Should be explicitly little-endian
> >  • Does it need esterror as well as maxerror?
> 
> I have no opinion about this. I can drop esterror if unwanted.

I also don't care. I'm just observing the inconsistency.

> >  • Why is maxerror in picoseconds? It's the only use of that unit

Between us we now have picoseconds, nanoseconds, (seconds >> 64) and
(seconds >> 64+n).

The power-of-two fractions seem to make a lot of sense for the counter
period, because they mean we don't have to perform divisions.

Does it makes sense to harmonise on (seconds >> 64) for all of the
fractional seconds? Again I don't have a strong opinion; I only want us
to have a *reason* for any differences that exist.

> >  • Where do the clock_status values come from? Do they make sense?
> >  • Are signed integers OK? (I think so!).
> 
> Signed integers would need to be introduced to Virtio, which so far only
> uses explicitly unsigned types: u8, le16 etc.

Perhaps. Although it would also be possible (if not ideal) to define
that e.g. the tai_offset field is a 16-bit "unsigned" integer according
to virtio, but to be interpreted as follows:

If the number is <= 32767 then the TAI offset is that value, but if the
number is >= 32768 then the TAI offset is that value minus 65536.

Perhaps not pretty, but there isn't a *fundamental* dependency on
virtio supporting signed integers as a primary type.



smime.p7s
Description: S/MIME cryptographic signature


Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

2024-07-03 Thread David Woodhouse
On Wed, 2024-07-03 at 11:56 +0200, Peter Hilber wrote:
> On 02.07.24 20:40, David Woodhouse wrote:
> > On 2 July 2024 19:12:00 BST, Peter Hilber  
> > wrote:
> > > On 02.07.24 18:39, David Woodhouse wrote:
> > > > To clarify then, the main types are
> > > > 
> > > >  VIRTIO_RTC_CLOCK_UTC == 0
> > > >  VIRTIO_RTC_CLOCK_TAI == 1
> > > >  VIRTIO_RTC_CLOCK_MONOTONIC == 2
> > > >  VIRTIO_RTC_CLOCK_SMEARED_UTC == 3
> > > > 
> > > > And the subtypes are *only* for the case of
> > > > VIRTIO_RTC_CLOCK_SMEARED_UTC. They include
> > > > 
> > > >  VIRTIO_RTC_SUBTYPE_STRICT
> > > >  VIRTIO_RTC_SUBTYPE_UNDEFINED /* or whatever you want to call it */
> > > >  VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 
> > > >  VIRTIO_RTC_SUBTYPE_UTC_SLS /* if it's worth doing this one */
> > > > 
> > > > Is that what we just agreed on?
> > > > 
> > > > 
> > > 
> > > This is a misunderstanding. My idea was that the main types are
> > > 
> > > >  VIRTIO_RTC_CLOCK_UTC == 0
> > > >  VIRTIO_RTC_CLOCK_TAI == 1
> > > >  VIRTIO_RTC_CLOCK_MONOTONIC == 2
> > > >  VIRTIO_RTC_CLOCK_SMEARED_UTC == 3
> > > 
> > > VIRTIO_RTC_CLOCK_MAYBE_SMEARED_UTC == 4
> > > 
> > > The subtypes would be (1st for clocks other than
> > > VIRTIO_RTC_CLOCK_SMEARED_UTC, 2nd to last for
> > > VIRTIO_RTC_CLOCK_SMEARED_UTC):
> > > 
> > > #define VIRTIO_RTC_SUBTYPE_STRICT 0
> > > #define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 1
> > > #define VIRTIO_RTC_SUBTYPE_SMEAR_UTC_SLS 2
> > > 
> > 
> > Thanks. I really do think that from the guest point of view there's
> > really no distinction between "maybe smeared" and "undefined
> > smearing", and have a preference for using the latter form, which
> > is the key difference there?
> > 
> > Again though, not a hill for me to die on.
> 
> I have no issue with staying with "undefined smearing", so would you agree
> to something like
> 
> VIRTIO_RTC_CLOCK_SMEAR_UNDEFINED_UTC == 4
> 
> (or another name if you prefer)?

Well, the point of contention was really whether that was a *type* or a
*subtype*.

Either way, it's a "precision clock" telling its consumer that the
device *itself* doesn't really know what time is being exposed. Which
seems like a bizarre thing to support.

But I think I've constructed an argument which persuades me to your
point of view that *if* we permit it, it should be a primary type...

A clock can *either* be UTC, *or* it can be monotonic. The whole point
of smearing is to produce a monotonic clock, of course.

VIRTIO_RTC_CLOCK_UTC is UTC. It is not monotonic.

VIRTIO_RTC_CLOCK_SMEARED is, presumably, monotonic (and I think we
should explicitly require that to be true in virtio-rtc).


But VIRTIO_RTC_CLOCK_MAYBE_SMEARED is the worst of both worlds. It is
neither known to be correct UTC, *nor* is it known to be monotonic. So
(again, if we permit it at all) I think it probably does make sense for
that to be a primary type.


This is what I currently have for 'struct vmclock_abi' that I'd like to
persuade you to adopt. I need to tweak it some more, for at least the
following reasons, as well as any more you can see:

 • size isn't big enough for 64KiB pages
 • Should be explicitly little-endian
 • Does it need esterror as well as maxerror?
 • Why is maxerror in picoseconds? It's the only use of that unit
 • Where do the clock_status values come from? Do they make sense?
 • Are signed integers OK? (I think so!).

 
/*
 * This structure provides a vDSO-style clock to VM guests, exposing the
 * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch
 * counter, etc.) and real time. It is designed to address the problem of
 * live migration, which other clock enlightenments do not.
 *
 * When a guest is live migrated, this affects the clock in two ways.
 *
 * First, even between identical hosts the actual frequency of the underlying
 * counter will change within the tolerances of its specification (typically
 * ±50PPM, or 4 seconds a day). This frequency also varies over time on the
 * same host, but can be tracked by NTP as it generally varies slowly. With
 * live migration there is a step change in the frequency, with no warning.
 *
 * Second, there may be a step change in the value of the counter itself, as
 * its accuracy is limited by the precision of the NTP synchronization on the
 * source and destination hosts.
 *
 * So any calibration (NTP, PTP, etc.) which the guest has done on the source
 * host before migration is invalid, and needs

Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

2024-07-02 Thread David Woodhouse
On 2 July 2024 19:12:00 BST, Peter Hilber  wrote:
>On 02.07.24 18:39, David Woodhouse wrote:
>> To clarify then, the main types are
>> 
>>  VIRTIO_RTC_CLOCK_UTC == 0
>>  VIRTIO_RTC_CLOCK_TAI == 1
>>  VIRTIO_RTC_CLOCK_MONOTONIC == 2
>>  VIRTIO_RTC_CLOCK_SMEARED_UTC == 3
>> 
>> And the subtypes are *only* for the case of
>> VIRTIO_RTC_CLOCK_SMEARED_UTC. They include
>> 
>>  VIRTIO_RTC_SUBTYPE_STRICT
>>  VIRTIO_RTC_SUBTYPE_UNDEFINED /* or whatever you want to call it */
>>  VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 
>>  VIRTIO_RTC_SUBTYPE_UTC_SLS /* if it's worth doing this one */
>> 
>> Is that what we just agreed on?
>> 
>> 
>
>This is a misunderstanding. My idea was that the main types are
>
>>  VIRTIO_RTC_CLOCK_UTC == 0
>>  VIRTIO_RTC_CLOCK_TAI == 1
>>  VIRTIO_RTC_CLOCK_MONOTONIC == 2
>>  VIRTIO_RTC_CLOCK_SMEARED_UTC == 3
>
>VIRTIO_RTC_CLOCK_MAYBE_SMEARED_UTC == 4
>
>The subtypes would be (1st for clocks other than
>VIRTIO_RTC_CLOCK_SMEARED_UTC, 2nd to last for
>VIRTIO_RTC_CLOCK_SMEARED_UTC):
>
>#define VIRTIO_RTC_SUBTYPE_STRICT 0
>#define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 1
>#define VIRTIO_RTC_SUBTYPE_SMEAR_UTC_SLS 2
>

Thanks. I really do think that from the guest point of view there's really no 
distinction between "maybe smeared" and "undefined smearing", and have a 
preference for using the latter form, which is the key difference there?

Again though, not a hill for me to die on.



Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

2024-07-02 Thread David Woodhouse
On Tue, 2024-07-02 at 17:03 +0200, Peter Hilber wrote:
> > On 01.07.24 10:57, David Woodhouse wrote:
> > > > If my proposed memory structure is subsumed into the virtio-rtc
> > > > proposal we'd literally use the same names, but for the time being I'll
> > > > update mine to:
> > 
> > Do you intend vmclock and virtio-rtc to be ABI compatible?

I intend you to incorporate a shared memory structure like the vmclock
one into the virtio-rtc standard :)

Because precision clocks in a VM are fundamentally nonsensical without
a way for the guest to know when live migration has screwed them up.

So yes, so facilitate that, I'm trying to align things so that the
numeric values of fields like the time_type and smearing hint are
*precisely* the same as the VIRTIO_RTC_xxx values.

Time pressure *may* mean I have to ship an ACPI-based device with a
preliminary version of the structure before I've finished persuading
you, and before we've completely finalized the structure. I am hoping
to avoid that.

(In fact, my time pressure only applies to a version of the structure
which carries the disruption_marker field; the actual clock calibration
information doesn't have to be present in the interim implementation.)


> >  FYI, I see a
> > potential problem in that Virtio does avoid the use of signed integers so
> > far. I did not check carefully if there might be other problems, yet.

Hm, you use an unsigned integer to convey the tai_offset. I suppose at
+37 and with a plan to stop doing leap seconds in the next decade,
we're unlikely to get back below zero?

The other signed integer I had was for the leap second direction, but I
think I'm happy to drop that and just adopt your uint8_t leap field
with VIRTIO_RTC_LEAP_{PRE_POS,PRE_NEG,etc.}.





> > > > 
> > > > /*
> > > >  * What time is exposed in the time_sec/time_frac_sec fields?
> > > >  */
> > > > uint8_t time_type;
> > > > #define VMCLOCK_TIME_UTC0   /* Since 1970-01-01 
> > > > 00:00:00z */
> > > > #define VMCLOCK_TIME_TAI1   /* Since 1970-01-01 
> > > > 00:00:00z */
> > > > #define VMCLOCK_TIME_MONOTONIC  2   /* Since undefined 
> > > > epoch */
> > > > #define VMCLOCK_TIME_INVALID3   /* virtio-rtc uses this 
> > > > for smeared UTC */
> > > > 
> > > > 
> > > > I can then use your smearing subtype values as the 'hint' field in the
> > > > shared memory structure. You currently have:
> > > > 
> > > > +\begin{lstlisting}
> > > > +#define VIRTIO_RTC_SUBTYPE_STRICT 0
> > > > +#define VIRTIO_RTC_SUBTYPE_SMEAR 1
> > > > +#define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 2
> > > > +#define VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED 3
> > > > +\end{lstlisting}
> > > > 
> > 
> > I agree with the above part of your proposal.
> > 
> > > > I can certainly ensure that 'noon linear' has the same value. I don't
> > > > think you need both 'SMEAR' and 'LEAP_UNSPECIFIED' though:
> > > > 
> > > > 
> > > > +\item VIRTIO_RTC_SUBTYPE_SMEAR deviates from the UTC standard by
> > > > +   smearing time in the vicinity of the leap second, in a not
> > > > +   precisely defined manner. This avoids clock steps due to UTC
> > > > +   leap seconds.
> > > > 
> > > > ...
> > > > 
> > > > +\item VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED may deviate from the UTC
> > > > +   standard w.r.t.\ leap second introduction in an unspecified
> > > > way
> > > > +   (leap seconds may, or may not, be smeared).
> > > > 
> > > > To the client, both of those just mean "for a day or so around a leap
> > > > second event, you can't trust this device to know what the time is".
> > > > There isn't any point in separating "does lie to you" from "might lie
> > > > to you", surely? The guest can't do anything useful with that
> > > > distinction. Let's drop SMEAR and keep only LEAP_UNSPECIFIED?
> > 
> > As for VIRTIO_RTC_SUBTYPE_SMEAR, I think this could be dropped indeed
> > (resp., UTC_SLS may be added).
> > 
> > But VIRTIO_RTC_CLOCK_SMEARED_UTC is an assurance that there will be no
> > steps (in particular, steps backwards, which some clients might not like)
> > due to leap seconds, while LEAP_UNSPECIFIED

Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

2024-07-01 Thread David Woodhouse
On Fri, 2024-06-28 at 22:27 +0100, David Woodhouse wrote:
> On 28 June 2024 17:38:15 BST, Peter Hilber  
> wrote:
> > On 28.06.24 14:15, David Woodhouse wrote:
> > > On Fri, 2024-06-28 at 13:33 +0200, Peter Hilber wrote:
> > > > On 27.06.24 16:52, David Woodhouse wrote:
> > > > > I already added a flags field, so this might look something like:
> > > > > 
> > > > >     /*
> > > > >  * Smearing flags. The UTC clock exposed through this 
> > > > > structure
> > > > >  * is only ever true UTC, but a guest operating system may
> > > > >  * choose to offer a monotonic smeared clock to its users. 
> > > > > This
> > > > >  * merely offers a hint about what kind of smearing to 
> > > > > perform,
> > > > >  * for consistency with systems in the nearby environment.
> > > > >  */
> > > > > #define VMCLOCK_FLAGS_SMEAR_UTC_SLS (1<<5) /* 
> > > > > draft-kuhn-leapsecond-00.txt */
> > > > > 
> > > > > (UTC-SLS is probably a bad example but are there formal definitions 
> > > > > for
> > > > > anything else?)
> > > > 
> > > > I think it could also be more generic, like flags for linear smearing,
> > > > cosine smearing(?), and smear_start_sec and smear_end_sec fields 
> > > > (relative
> > > > to the leap second start). That could also represent UTC-SLS, and
> > > > noon-to-noon, and it would be well-defined.
> > > > 
> > > > This should reduce the likelihood that the guest doesn't know the 
> > > > smearing
> > > > variant.
> > > 
> > > I'm wary of making it too generic. That would seem to encourage a
> > > *proliferation* of false "UTC-like" clocks.
> > > 
> > > It's bad enough that we do smearing at all, let alone that we don't
> > > have a single definition of how to do it.
> > > 
> > > I made the smearing hint a full uint8_t instead of using bits in flags,
> > > in the end. That gives us a full 255 ways of lying to users about what
> > > the time is, so we're unlikely to run out. And it's easy enough to add
> > > a new VMCLOCK_SMEARING_XXX type to the 'registry' for any new methods
> > > that get invented.
> > > 
> > > 
> > 
> > My concern is that the registry update may come after a driver has already
> > been implemented, so that it may be hard to ensure that the smearing which
> > has been chosen is actually implemented.
> 
> Well yes, but why in the name of all that is holy would anyone want
> to invent *new* ways to lie to users about the time? If we capture
> the existing ones as we write this, surely it's a good thing that
> there's a barrier to entry for adding more?

Ultimately though, this isn't the hill for me to die on. I'm pushing on
that topic because I want to avoid the proliferation of *ambiguity*. If
we have a precision clock, we should *know* what the time is.

So how about this proposal. I line up the fields in the proposed shared
memory structure to match your virtio-rtc proposal, using 'subtype' as
you proposed. But, instead of the 'subtype' being valid only for
VIRTIO_RTC_CLOCK_UTC, we define a new top-level type for *smeared* UTC.

So, you have:

+\begin{lstlisting}
+#define VIRTIO_RTC_CLOCK_UTC 0
+#define VIRTIO_RTC_CLOCK_TAI 1
+#define VIRTIO_RTC_CLOCK_MONO 2
+\end{lstlisting}

I propose that you add

#define VIRTIO_RTC_CLOCK_SMEARED_UTC 3

If my proposed memory structure is subsumed into the virtio-rtc
proposal we'd literally use the same names, but for the time being I'll
update mine to:

/*
 * What time is exposed in the time_sec/time_frac_sec fields?
 */
uint8_t time_type;
#define VMCLOCK_TIME_UTC0   /* Since 1970-01-01 00:00:00z */
#define VMCLOCK_TIME_TAI1   /* Since 1970-01-01 00:00:00z */
#define VMCLOCK_TIME_MONOTONIC  2   /* Since undefined epoch */
#define VMCLOCK_TIME_INVALID3   /* virtio-rtc uses this for 
smeared UTC */


I can then use your smearing subtype values as the 'hint' field in the
shared memory structure. You currently have:

+\begin{lstlisting}
+#define VIRTIO_RTC_SUBTYPE_STRICT 0
+#define VIRTIO_RTC_SUBTYPE_SMEAR 1
+#define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 2
+#define VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED 3
+\end{lstlisting}

I can certainly ensure that 'noon linear' has the same value. I don't
think you

Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

2024-06-28 Thread David Woodhouse
On 28 June 2024 17:38:15 BST, Peter Hilber  wrote:
>On 28.06.24 14:15, David Woodhouse wrote:
>> On Fri, 2024-06-28 at 13:33 +0200, Peter Hilber wrote:
>>> On 27.06.24 16:52, David Woodhouse wrote:
>>>> I already added a flags field, so this might look something like:
>>>>
>>>>     /*
>>>>  * Smearing flags. The UTC clock exposed through this structure
>>>>  * is only ever true UTC, but a guest operating system may
>>>>  * choose to offer a monotonic smeared clock to its users. This
>>>>  * merely offers a hint about what kind of smearing to perform,
>>>>  * for consistency with systems in the nearby environment.
>>>>  */
>>>> #define VMCLOCK_FLAGS_SMEAR_UTC_SLS (1<<5) /* draft-kuhn-leapsecond-00.txt 
>>>> */
>>>>
>>>> (UTC-SLS is probably a bad example but are there formal definitions for
>>>> anything else?)
>>>
>>> I think it could also be more generic, like flags for linear smearing,
>>> cosine smearing(?), and smear_start_sec and smear_end_sec fields (relative
>>> to the leap second start). That could also represent UTC-SLS, and
>>> noon-to-noon, and it would be well-defined.
>>>
>>> This should reduce the likelihood that the guest doesn't know the smearing
>>> variant.
>> 
>> I'm wary of making it too generic. That would seem to encourage a
>> *proliferation* of false "UTC-like" clocks.
>> 
>> It's bad enough that we do smearing at all, let alone that we don't
>> have a single definition of how to do it.
>> 
>> I made the smearing hint a full uint8_t instead of using bits in flags,
>> in the end. That gives us a full 255 ways of lying to users about what
>> the time is, so we're unlikely to run out. And it's easy enough to add
>> a new VMCLOCK_SMEARING_XXX type to the 'registry' for any new methods
>> that get invented.
>> 
>> 
>
>My concern is that the registry update may come after a driver has already
>been implemented, so that it may be hard to ensure that the smearing which
>has been chosen is actually implemented.

Well yes, but why in the name of all that is holy would anyone want to invent 
*new* ways to lie to users about the time? If we capture the existing ones as 
we write this, surely it's a good thing that there's a barrier to entry for 
adding more?


>But the error bounds could be large or missing. I am trying to address use
>cases where the host steps or slews the clock as well.

The host is absolutely intended to be skewing the clock to keep it accurate as 
the frequency of the underlying oscillator changes, and the seq_count field 
will change every time the host does so.

Do we need to handle steps differently? Or just let the guest deal with it?

If an NTP server suddenly steps the time it reports, what does the client do?




Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

2024-06-28 Thread David Woodhouse
On Fri, 2024-06-28 at 13:33 +0200, Peter Hilber wrote:
> On 27.06.24 16:52, David Woodhouse wrote:
> > I already added a flags field, so this might look something like:
> > 
> >     /*
> >  * Smearing flags. The UTC clock exposed through this structure
> >  * is only ever true UTC, but a guest operating system may
> >  * choose to offer a monotonic smeared clock to its users. This
> >  * merely offers a hint about what kind of smearing to perform,
> >  * for consistency with systems in the nearby environment.
> >  */
> > #define VMCLOCK_FLAGS_SMEAR_UTC_SLS (1<<5) /* draft-kuhn-leapsecond-00.txt 
> > */
> > 
> > (UTC-SLS is probably a bad example but are there formal definitions for
> > anything else?)
> 
> I think it could also be more generic, like flags for linear smearing,
> cosine smearing(?), and smear_start_sec and smear_end_sec fields (relative
> to the leap second start). That could also represent UTC-SLS, and
> noon-to-noon, and it would be well-defined.
> 
> This should reduce the likelihood that the guest doesn't know the smearing
> variant.

I'm wary of making it too generic. That would seem to encourage a
*proliferation* of false "UTC-like" clocks.

It's bad enough that we do smearing at all, let alone that we don't
have a single definition of how to do it.

I made the smearing hint a full uint8_t instead of using bits in flags,
in the end. That gives us a full 255 ways of lying to users about what
the time is, so we're unlikely to run out. And it's easy enough to add
a new VMCLOCK_SMEARING_XXX type to the 'registry' for any new methods
that get invented.


> > > > +   /*
> > > > +    * This field changes to another non-repeating value when the 
> > > > CPU
> > > > +    * counter is disrupted, for example on live migration.
> > > > +    */
> > > > +   uint64_t disruption_marker;
> > > 
> > > The field could also change when the clock is stepped (leap seconds
> > > excepted), or when the clock frequency is slewed.
> > 
> > I'm not sure. The concept of the disruption marker is that it tells the
> > guest to throw away any calibration of the counter that the guest has
> > done for *itself* (with NTP, other PTP devices, etc.).
> > 
> > One mode for this device would be not to populate the clock fields at
> > all, but *only* to signal disruption when it occurs. So the guest can
> > abort transactions until it's resynced its clocks (to avoid incurring
> > fines if breaking databases, etc.).
> > 
> > Exposing the host timekeeping through the structure means that the
> > migrated guest can keep working because it can trust the timekeeping
> > performed by the (new) host and exposed to it.
> > 
> > If the counter is actually varying in frequency over time, and the host
> > is slewing the clock frequency that it reports, that *isn't* a step
> > change and doesn't mean that the guest should throw away any
> > calibration that it's been doing for itself. One hopes that the guest
> > would have detected the *same* frequency change, and be adapting for
> > itself. So I don't think that should indicate a disruption.
> > 
> > I think the same is even true if the clock is stepped by the host. The
> > actual *counter* hasn't changed, so the guest is better off ignoring
> > the vacillating host and continuing to derive its idea of time from the
> > hardware counter itself, as calibrated against some external NTP/PTP
> > sources. Surely we actively *don't* to tell the guest to throw its own
> > calibrations away, in this case?
> 
> In case the guest is also considering other time sources, it might indeed
> not be a good idea to mix host clock changes into the hardware counter
> disruption marker.
> 
> But if the vmclock is the authoritative source of time, it can still be
> helpful to know about such changes, maybe through another marker.

Could that be the existing seq_count field?

Skewing the counter_period_frac_sec as the underlying oscillator speeds
up and slows down is perfectly normal and expected, and we already
expect the seq_count to change when that happens.

Maybe step changes are different, but arguably if the time advertised
by the host steps *outside* the error bounds previously advertised,
that's just broken?

Depending on how the clock information is fed, a change in seq_count
may even result in non-monotonicity. If the underlying oscillator has
sped up and the structure is updated accordingly, the time calculated
the moment *before* that update may appear later than the time
calculated immediately after it.

It's up to the guest operating system to feed that information into its
own timekeeping system and skew towards correctness instead of stepping
the time it reports to its users.



smime.p7s
Description: S/MIME cryptographic signature


Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

2024-06-28 Thread David Woodhouse
On Fri, 2024-06-28 at 13:33 +0200, Peter Hilber wrote:
> 
> > 
> > /*
> >  * What time is exposed in the time_sec/time_frac_sec fields?
> >  */
> > uint8_t time_type;
> > #define VMCLOCK_TIME_UNKNOWN0   /* Invalid / no time 
> > exposed */
> > #define VMCLOCK_TIME_UTC1   /* Since 1970-01-01 
> > 00:00:00z */
> > #define VMCLOCK_TIME_TAI2   /* Since 1970-01-01 
> > 00:00:00z */
> > #define VMCLOCK_TIME_MONOTONIC  3   /* Since undefined epoch */
> > 
> > /* Bit shift for counter_period_frac_sec and its error rate */
> > uint8_t counter_period_shift;
> > 
> > /*
> >  * Unlike in NTP, this can indicate a leap second in the past. This
> >  * is needed to allow guests to derive an imprecise clock with
> >  * smeared leap seconds for themselves, as some modes of smearing
> >  * need the adjustments to continue even after the moment at which
> >  * the leap second should have occurred.
> >  */
> > int8_t leapsecond_direction;
> > uint64_t leapsecond_tai_sec; /* Since 1970-01-01 00:00:00z */
> > 
> > /*
> >  * Paired values of counter and UTC at a given point in time.
> >  */
> > uint64_t counter_value;
> > uint64_t time_sec; /* Since 1970-01-01 00:00:00z */
> 
> Nitpick: The comment is not valid any more for TIME_MONOTONIC.

Ah yes, I "moved" that comment up to the UTC/TAI time_type values, but
neglected to actually delete it from here. Fixed; thanks.


smime.p7s
Description: S/MIME cryptographic signature


Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

2024-06-27 Thread David Woodhouse

I've updated the tree at
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/vmclock
(but not yet the qemu one).

I think I've taken into account all your comments apart from the one
about non-64-bit counters wrapping. I reduced the seq_count to 32 bit
to make room for a 32-bit flags field, added the time type
(UTC/TAI/MONOTONIC) and a smearing hint, with some straw man
definitions for smearing algorithms for which I could actually find
definitions.

The structure now looks like this:


struct vmclock_abi {
uint32_t magic;
#define VMCLOCK_MAGIC   0x4b4c4356 /* "VCLK" */
uint16_t size;  /* Size of page containing this structure */
uint16_t version;   /* 1 */

/* Sequence lock. Low bit means an update is in progress. */
uint32_t seq_count;

uint32_t flags;
/* Indicates that the tai_offset_sec field is valid */
#define VMCLOCK_FLAG_TAI_OFFSET_VALID   (1 << 0)
/*
 * Optionally used to notify guests of pending maintenance events.
 * A guest may wish to remove itself from service if an event is
 * coming up. Two flags indicate the rough imminence of the event.
 */
#define VMCLOCK_FLAG_DISRUPTION_SOON(1 << 1) /* About a day */
#define VMCLOCK_FLAG_DISRUPTION_IMMINENT(1 << 2) /* About an hour */
/* Indicates that the utc_time_maxerror_picosec field is valid */
#define VMCLOCK_FLAG_UTC_MAXERROR_VALID (1 << 3)
/* Indicates counter_period_error_rate_frac_sec is valid */
#define VMCLOCK_FLAG_PERIOD_ERROR_VALID (1 << 4)

/*
 * This field changes to another non-repeating value when the CPU
 * counter is disrupted, for example on live migration. This lets
 * the guest know that it should discard any calibration it has
 * performed of the counter against external sources (NTP/PTP/etc.).
 */
uint64_t disruption_marker;

uint8_t clock_status;
#define VMCLOCK_STATUS_UNKNOWN  0
#define VMCLOCK_STATUS_INITIALIZING 1
#define VMCLOCK_STATUS_SYNCHRONIZED 2
#define VMCLOCK_STATUS_FREERUNNING  3
#define VMCLOCK_STATUS_UNRELIABLE   4

uint8_t counter_id;
#define VMCLOCK_COUNTER_INVALID 0
#define VMCLOCK_COUNTER_X86_TSC 1
#define VMCLOCK_COUNTER_ARM_VCNT2
#define VMCLOCK_COUNTER_X86_ART 3

/*
 * By providing the offset from UTC to TAI, the guest can know both
 * UTC and TAI reliably, whichever is indicated in the time_type
 * field. Valid if VMCLOCK_FLAG_TAI_OFFSET_VALID is set in flags.
 */
int16_t tai_offset_sec;

/*
 * The time exposed through this device is never smeaared; if it
 * claims to be VMCLOCK_TIME_UTC then it MUST be UTC. This field
 * provides a hint to the guest operating system, such that *if*
 * the guest OS wants to provide its users with an alternative
 * clock which does not follow the POSIX CLOCK_REALTIME standard,
 * it may do so in a fashion consistent with the other systems
 * in the nearby environment.
 */
uint8_t leap_second_smearing_hint;
/* Provide true UTC to users, unsmeared. */;
#define VMCLOCK_SMEARING_NONE   0
/*
 * 
https://aws.amazon.com/blogs/aws/look-before-you-leap-the-coming-leap-second-and-aws/
 * From noon on the day before to noon on the day after, smear the
 * clock by a linear 1/86400s per second.
*/
#define VMCLOCK_SMEARING_LINEAR_86400   1
/*
 * draft-kuhn-leapsecond-00
 * For the 1000s leading up to the leap second, smear the clock by
 * clock by a linear 1ms per second.
 */
#define VMCLOCK_SMEARING_UTC_SLS2

/*
 * What time is exposed in the time_sec/time_frac_sec fields?
 */
uint8_t time_type;
#define VMCLOCK_TIME_UNKNOWN0   /* Invalid / no time exposed */
#define VMCLOCK_TIME_UTC1   /* Since 1970-01-01 00:00:00z */
#define VMCLOCK_TIME_TAI2   /* Since 1970-01-01 00:00:00z */
#define VMCLOCK_TIME_MONOTONIC  3   /* Since undefined epoch */

/* Bit shift for counter_period_frac_sec and its error rate */
uint8_t counter_period_shift;

/*
 * Unlike in NTP, this can indicate a leap second in the past. This
 * is needed to allow guests to derive an imprecise clock with
 * smeared leap seconds for themselves, as some modes of smearing
 * need the adjustments to continue even after the moment at which
 * the leap second should have occurred.
 */
int8_t leapsecond_direction;
uint64_t leapsecond_tai_sec; /* Since 1970-01-01 00:00:00z */

/*
 * Paired values of counter and UTC at a given point in time.
 */
uint64_t counte

Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

2024-06-27 Thread David Woodhouse
On Thu, 2024-06-27 at 15:50 +0200, Peter Hilber wrote:
> On 25.06.24 21:01, David Woodhouse wrote:
> > From: David Woodhouse 
> > 
> > The vmclock "device" provides a shared memory region with precision clock
> > information. By using shared memory, it is safe across Live Migration.
> > 
> > Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> > KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> > actually helpful.
> > 
> > The memory region of the device is also exposed to userspace so it can be
> > read or memory mapped by application which need reliable notification of
> > clock disruptions.
> > 
> > Signed-off-by: David Woodhouse 
> > ---
> > 
> > v2: 
> >  • Add gettimex64() support
> >  • Convert TSC values to KVM clock when appropriate
> >  • Require int128 support
> >  • Add counter_period_shift 
> >  • Add timeout when seq_count is invalid
> >  • Add flags field
> >  • Better comments in vmclock ABI structure
> >  • Explicitly forbid smearing (as clock rates would need to change)
> 
> Leap second smearing information could still be conveyed through the
> vmclock_abi. AFAIU, to cover the popular smearing variants, it should be
> enough to indicate whether the driver should apply linear or cosine
> smearing, and the start time and end time.

Yes. The clock information actually conveyed through the {counter,
time, rate} tuple should never be smeared, and should only ever be UTC.

But we could provide a hint to the guest operating system about what
type of smearing to perform, *if* it chooses to offer a clock other
than the standard CLOCK_REALTIME to its users.

I already added a flags field, so this might look something like:

/*
 * Smearing flags. The UTC clock exposed through this structure
 * is only ever true UTC, but a guest operating system may
 * choose to offer a monotonic smeared clock to its users. This
 * merely offers a hint about what kind of smearing to perform,
 * for consistency with systems in the nearby environment.
 */
#define VMCLOCK_FLAGS_SMEAR_UTC_SLS (1<<5) /* draft-kuhn-leapsecond-00.txt */


(UTC-SLS is probably a bad example but are there formal definitions for
anything else?)


> > But we 
> >  drivers/ptp/Kconfig  |  13 +
> >  drivers/ptp/Makefile |   1 +
> >  drivers/ptp/ptp_vmclock.c    | 516 +++
> >  include/uapi/linux/vmclock.h | 138 ++
> >  4 files changed, 668 insertions(+)
> >  create mode 100644 drivers/ptp/ptp_vmclock.c
> >  create mode 100644 include/uapi/linux/vmclock.h
> > 
> 
> [...]
> 
> > +
> > +/*
> > + * Multiply a 64-bit count by a 64-bit tick 'period' in units of seconds 
> > >> 64
> > + * and add the fractional second part of the reference time.
> > + *
> > + * The result is a 128-bit value, the top 64 bits of which are seconds, and
> > + * the low 64 bits are (seconds >> 64).
> > + *
> > + * If __int128 isn't available, perform the calculation 32 bits at a time 
> > to
> > + * avoid overflow.
> > + */
> > +static inline uint64_t mul_u64_u64_shr_add_u64(uint64_t *res_hi, uint64_t 
> > delta,
> > +  uint64_t period, uint8_t 
> > shift,
> > +  uint64_t frac_sec)
> > +{
> > +   unsigned __int128 res = (unsigned __int128)delta * period;
> > +
> > +   res >>= shift;
> > +   res += frac_sec;
> > +   *res_hi = res >> 64;
> > +   return (uint64_t)res;
> > +}
> > +
> > +static int vmclock_get_crosststamp(struct vmclock_state *st,
> > +  struct ptp_system_timestamp *sts,
> > +  struct system_counterval_t 
> > *system_counter,
> > +  struct timespec64 *tspec)
> > +{
> > +   ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT);
> > +   struct system_time_snapshot systime_snapshot;
> > +   uint64_t cycle, delta, seq, frac_sec;
> > +
> > +#ifdef CONFIG_X86
> > +   /*
> > +    * We'd expect the hypervisor to know this and to report the clock
> > +    * status as VMCLOCK_STATUS_UNRELIABLE. But be paranoid.
> > +    */
> > +   if (check_tsc_unstable())
> > +   return -EINVAL;
> > +#endif
> > +
> > +   while (1) {
> > +   seq = st->clk->seq_count & ~1ULL;
> > + 

Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

2024-06-26 Thread David Woodhouse
On Tue, 2024-06-25 at 15:22 -0700, John Stultz wrote:
> On Tue, Jun 25, 2024 at 2:48 PM David Woodhouse  wrote:
> > On Tue, 2024-06-25 at 23:34 +0200, Thomas Gleixner wrote:
> > > On Tue, Jun 25 2024 at 20:01, David Woodhouse wrote:
> > > > From: David Woodhouse 
> > > > 
> > > > The vmclock "device" provides a shared memory region with precision 
> > > > clock
> > > > information. By using shared memory, it is safe across Live Migration.
> > > > 
> > > > Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> > > > KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> > > > actually helpful.
> > > > 
> > > > The memory region of the device is also exposed to userspace so it can 
> > > > be
> > > > read or memory mapped by application which need reliable notification of
> > > > clock disruptions.
> > > 
> > > There is effort underway to expose PTP clocks to user space via VDSO.
> > 
> > Ooh, interesting. Got a reference to that please?
> > 
> > >  Can we please not expose an ad hoc interface for that?
> > 
> > Absolutely. I'm explicitly trying to intercept the virtio-rtc
> > specification here, to *avoid* having to do anything ad hoc.
> > 
> > Note that this is a "vDSO-style" interface from hypervisor to guest via
> > a shared memory region, not necessarily an actual vDSO.
> > 
> > But yes, it *is* intended to be exposed to userspace, so that userspace
> > can know the *accurate* time without a system call, and know that it
> > hasn't been perturbed by live migration.
> 
> Yea, I was going to raise a concern that just defining an mmaped
> structure means it has to trust the guest logic is as expected. It's
> good that it's versioned! :)

Right. Although it's basically a pvclock, and we've had those for ages.

The main difference here is that we add an indicator that tells the
guest that it's been live migrated, so any additional NTP/PTP
refinement that the *guest* has done of its oscillator, should now be
discarded.

It's also designed to be useful in "disruption-only" mode, where the
pvclock information isn't actually populated, so *all* it does is tell
guests that their clock is now hosed due to live migration.

That part is why it needs to be mappable directly to userspace, so that
userspace can not only get a timestamp but *also* know that it's
actually valid. All without a system call.

The critical use cases are financial systems where they incur massive
fines if they submit mis-timestamped transactions, and distributed
databases which rely on accurate timestamps (and error bounds) for
eventual coherence. Live migration can screw those completely.

I'm open to changing fairly much anything about the proposal as long as
we can address those use cases (which the existing virtio-rtc and other
KVM enlightenments do not).

> I'd fret a bit about exposing this to userland. It feels very similar
> to the old powerpc systemcfg implementation that similarly mapped just
> kernel data out to userland and was difficult to maintain as changes
> were made. Would including a code page like a proper vdso make sense
> to make this more flexible of an UABI to maintain?

I think the structure itself should be stable once we've bikeshedded it
a bit. But there is certainly some potential for vDSO functions which
help us expose it to the user...

This structure exposes a 'disruption count' which is updated every time
the TSC/counter is messed with by live migration. But what is userspace
actually going to *compare* it with?

It basically needs to compare it with the disruption count when the
clock was last synchronized, so maybe the kernel could export *that* to
vDSO too, then expose a simple vDSO function which reports whether the
clock is valid?

The 'invalid' code path could turn into an actual system call which
makes the kernel (check for itself and) call ntp_clear() when the
disruption occurs. Or maybe not just ntp_clear() but actually consume
the pvclock rate information directly and apply the *new* calibration?

That kind of thing would be great, and I've definitely tried to design
the structure so that it *can* be made a first-class citizen within the
kernel's timekeeping code and used like that.

But I was going to start with a more modest proposal that it's "just a
device", and applications which care about reliable time after LM would
have to /dev/vmclock0 and mmap it and check for themselves. (Which
would be assisted by things like the ClockBound library).




smime.p7s
Description: S/MIME cryptographic signature


Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

2024-06-25 Thread David Woodhouse
On Tue, 2024-06-25 at 23:34 +0200, Thomas Gleixner wrote:
> On Tue, Jun 25 2024 at 20:01, David Woodhouse wrote:
> > From: David Woodhouse 
> > 
> > The vmclock "device" provides a shared memory region with precision clock
> > information. By using shared memory, it is safe across Live Migration.
> > 
> > Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> > KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> > actually helpful.
> > 
> > The memory region of the device is also exposed to userspace so it can be
> > read or memory mapped by application which need reliable notification of
> > clock disruptions.
> 
> There is effort underway to expose PTP clocks to user space via VDSO.

Ooh, interesting. Got a reference to that please?

>  Can we please not expose an ad hoc interface for that?

Absolutely. I'm explicitly trying to intercept the virtio-rtc
specification here, to *avoid* having to do anything ad hoc.

Note that this is a "vDSO-style" interface from hypervisor to guest via
a shared memory region, not necessarily an actual vDSO.

But yes, it *is* intended to be exposed to userspace, so that userspace
can know the *accurate* time without a system call, and know that it
hasn't been perturbed by live migration.

> As you might have heard the sad news, I'm not feeling up to the task to
> dig deeper into this right now. Give me a couple of days to look at this
> with working brain.

I have not heard any news, although now I'm making inferences.

Wishing you the best!


smime.p7s
Description: S/MIME cryptographic signature


[RFC PATCH v2] ptp: Add vDSO-style vmclock support

2024-06-25 Thread David Woodhouse
From: David Woodhouse 

The vmclock "device" provides a shared memory region with precision clock
information. By using shared memory, it is safe across Live Migration.

Like the KVM PTP clock, this can convert TSC-based cross timestamps into
KVM clock values. Unlike the KVM PTP clock, it does so only when such is
actually helpful.

The memory region of the device is also exposed to userspace so it can be
read or memory mapped by application which need reliable notification of
clock disruptions.

Signed-off-by: David Woodhouse 
---

v2: 
 • Add gettimex64() support
 • Convert TSC values to KVM clock when appropriate
 • Require int128 support
 • Add counter_period_shift 
 • Add timeout when seq_count is invalid
 • Add flags field
 • Better comments in vmclock ABI structure
 • Explicitly forbid smearing (as clock rates would need to change)

 drivers/ptp/Kconfig  |  13 +
 drivers/ptp/Makefile |   1 +
 drivers/ptp/ptp_vmclock.c| 516 +++
 include/uapi/linux/vmclock.h | 138 ++
 4 files changed, 668 insertions(+)
 create mode 100644 drivers/ptp/ptp_vmclock.c
 create mode 100644 include/uapi/linux/vmclock.h

diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index 604541dcb320..e98c9767e0ef 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -131,6 +131,19 @@ config PTP_1588_CLOCK_KVM
  To compile this driver as a module, choose M here: the module
  will be called ptp_kvm.
 
+config PTP_1588_CLOCK_VMCLOCK
+   tristate "Virtual machine PTP clock"
+   depends on X86_TSC || ARM_ARCH_TIMER
+   depends on PTP_1588_CLOCK && ACPI && ARCH_SUPPORTS_INT128
+   default y
+   help
+ This driver adds support for using a virtual precision clock
+ advertised by the hypervisor. This clock is only useful in virtual
+ machines where such a device is present.
+
+ To compile this driver as a module, choose M here: the module
+ will be called ptp_vmclock.
+
 config PTP_1588_CLOCK_IDT82P33
tristate "IDT 82P33xxx PTP clock"
depends on PTP_1588_CLOCK && I2C
diff --git a/drivers/ptp/Makefile b/drivers/ptp/Makefile
index 68bf02078053..01b5cd91eb61 100644
--- a/drivers/ptp/Makefile
+++ b/drivers/ptp/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_PTP_1588_CLOCK_DTE)  += ptp_dte.o
 obj-$(CONFIG_PTP_1588_CLOCK_INES)  += ptp_ines.o
 obj-$(CONFIG_PTP_1588_CLOCK_PCH)   += ptp_pch.o
 obj-$(CONFIG_PTP_1588_CLOCK_KVM)   += ptp_kvm.o
+obj-$(CONFIG_PTP_1588_CLOCK_VMCLOCK)   += ptp_vmclock.o
 obj-$(CONFIG_PTP_1588_CLOCK_QORIQ) += ptp-qoriq.o
 ptp-qoriq-y+= ptp_qoriq.o
 ptp-qoriq-$(CONFIG_DEBUG_FS)   += ptp_qoriq_debugfs.o
diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
new file mode 100644
index ..e19c2eed8009
--- /dev/null
+++ b/drivers/ptp/ptp_vmclock.c
@@ -0,0 +1,516 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Virtual PTP 1588 clock for use with LM-safe VMclock device.
+ *
+ * Copyright © 2024 Amazon.com, Inc. or its affiliates.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#ifdef CONFIG_X86
+#include 
+#include 
+#endif
+
+static DEFINE_IDA(vmclock_ida);
+
+ACPI_MODULE_NAME("vmclock");
+
+struct vmclock_state {
+   phys_addr_t phys_addr;
+   struct vmclock_abi *clk;
+   struct miscdevice miscdev;
+   struct ptp_clock_info ptp_clock_info;
+   struct ptp_clock *ptp_clock;
+   enum clocksource_ids cs_id, sys_cs_id;
+   int index;
+   char *name;
+};
+
+#define VMCLOCK_MAX_WAIT ms_to_ktime(100)
+
+/*
+ * Multiply a 64-bit count by a 64-bit tick 'period' in units of seconds >> 64
+ * and add the fractional second part of the reference time.
+ *
+ * The result is a 128-bit value, the top 64 bits of which are seconds, and
+ * the low 64 bits are (seconds >> 64).
+ *
+ * If __int128 isn't available, perform the calculation 32 bits at a time to
+ * avoid overflow.
+ */
+static inline uint64_t mul_u64_u64_shr_add_u64(uint64_t *res_hi, uint64_t 
delta,
+  uint64_t period, uint8_t shift,
+  uint64_t frac_sec)
+{
+   unsigned __int128 res = (unsigned __int128)delta * period;
+
+   res >>= shift;
+   res += frac_sec;
+   *res_hi = res >> 64;
+   return (uint64_t)res;
+}
+
+static int vmclock_get_crosststamp(struct vmclock_state *st,
+  struct ptp_system_timestamp *sts,
+  struct system_counterval_t *system_counter,
+  struct timespec64 *tspec)
+{
+   ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT);
+   struct system_time_snapshot systime_snapshot;
+   uint64_t cyc

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-06-21 Thread David Woodhouse
On Thu, 2024-06-20 at 14:37 +0200, Peter Hilber wrote:
> Should implement .gettimex64 instead.

Thanks. This look sane?

As noted in the code comment, in the *ideal* case we just build all
three pre/post/device timestamps from the very same counter read. So
sts->pre_ts == sts->post_ts.

In the less ideal case (which will happen on x86 when kvmclock is being
used for the system time), we use the time from ktime_get_snapshot() as
the pre_ts and take a new snapshot immediately after the get_cycles().


diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
index e8c65405a8f3..07a81a94d29a 100644
--- a/drivers/ptp/ptp_vmclock.c
+++ b/drivers/ptp/ptp_vmclock.c
@@ -96,9 +96,11 @@ static inline uint64_t mul_u64_u64_add_u64(uint64_t *res_hi, 
uint64_t delta,
 }
 
 static int vmclock_get_crosststamp(struct vmclock_state *st,
+  struct ptp_system_timestamp *sts,
   struct system_counterval_t *system_counter,
   struct timespec64 *tspec)
 {
+   struct system_time_snapshot systime_snapshot;
uint64_t cycle, delta, seq, frac_sec;
int ret = 0;
 
@@ -119,7 +121,17 @@ static int vmclock_get_crosststamp(struct vmclock_state 
*st,
continue;
}
 
-   cycle = get_cycles();
+   if (sts) {
+   ktime_get_snapshot(&systime_snapshot);
+
+   if (systime_snapshot.cs_id == st->cs_id) {
+   cycle = systime_snapshot.cycles;
+   } else {
+   cycle = get_cycles();
+   ptp_read_system_postts(sts);
+   }
+   } else
+   cycle = get_cycles();
 
delta = cycle - st->clk->counter_value;
 
@@ -139,6 +151,21 @@ static int vmclock_get_crosststamp(struct vmclock_state 
*st,
if (ret)
return ret;
 
+   /*
+* When invoked for gettimex64, fill in the pre/post system times.
+* The ideal case is when system time is based on the the same
+* counter as st->cs_id, in which case all three pre/post/device
+* times are derived from the *same* counter value. If cs_id does
+* not match, then the value from ktime_get_snapshot() is used as
+* pre_ts, and ptp_read_system_postts() was already called above
+* for the post_ts. Those are either side of the get_cycles() call.
+*/
+   if (sts) {
+   sts->pre_ts = ktime_to_timespec64(systime_snapshot.real);
+   if (systime_snapshot.cs_id == st->cs_id)
+   sts->post_ts = sts->pre_ts;
+   }
+
if (system_counter) {
system_counter->cycles = cycle;
system_counter->cs_id = st->cs_id;
@@ -155,7 +182,7 @@ static int ptp_vmclock_get_time_fn(ktime_t *device_time,
struct timespec64 tspec;
int ret;
 
-   ret = vmclock_get_crosststamp(st, system_counter, &tspec);
+   ret = vmclock_get_crosststamp(st, NULL, system_counter, &tspec);
if (!ret)
*device_time = timespec64_to_ktime(tspec);
 
@@ -198,7 +225,16 @@ static int ptp_vmclock_gettime(struct ptp_clock_info *ptp, 
struct timespec64 *ts
struct vmclock_state *st = container_of(ptp, struct vmclock_state,
ptp_clock_info);
 
-   return vmclock_get_crosststamp(st, NULL, ts);
+   return vmclock_get_crosststamp(st, NULL, NULL, ts);
+}
+
+static int ptp_vmclock_gettimex(struct ptp_clock_info *ptp, struct timespec64 
*ts,
+   struct ptp_system_timestamp *sts)
+{
+   struct vmclock_state *st = container_of(ptp, struct vmclock_state,
+   ptp_clock_info);
+
+   return vmclock_get_crosststamp(st, sts, NULL, ts);
 }
 
 static int ptp_vmclock_enable(struct ptp_clock_info *ptp,
@@ -216,6 +252,7 @@ static const struct ptp_clock_info ptp_vmclock_info = {
.adjfine= ptp_vmclock_adjfine,
.adjtime= ptp_vmclock_adjtime,
.gettime64  = ptp_vmclock_gettime,
+   .gettimex64 = ptp_vmclock_gettimex,
.settime64  = ptp_vmclock_settime,
.enable = ptp_vmclock_enable,
.getcrosststamp = ptp_vmclock_getcrosststamp,




smime.p7s
Description: S/MIME cryptographic signature


Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-06-21 Thread David Woodhouse
On Thu, 2024-06-20 at 17:19 +0100, David Woodhouse wrote:
> 
> > 
> > > +
> > > +   /* Counter frequency, and error margin. Units of (second >> 64) */
> > > +   uint64_t counter_period_frac_sec;
> > 
> > AFAIU this might limit the precision in case of high counter frequencies.
> > Could the unit be aligned to the expected frequency band of counters?
> 
> This field indicates the period of a single tick, in units of 1>>64 of
> a second. That's about 5.4e-20 seconds, or 54 zeptoseconds? 
> 
> Can you walk me through a calculation where you believe that level of
> precision is insufficient?
> 
> I guess the precision matters if the structure isn't updated for a long
> period of time, and the delta between the current counter and the
> snapshot is high? That's a *lot* of 54 zeptosecondses? But you really
> would need a *lot* of them before you care? And if nobody's been
> calibrating your counter for that long, surely you have bigger worries?
> 
> Am I missing something there?

Hm, that was a bit rushed at the end of the day; let's take a better look...

Let's take a hypothetical example of a 100GHz counter. That's two
orders of magnitude more than today's Arm arch counter.

The period of such a counter would be 10 picoseconds. 

(Let's ignore the question of how far light actually travels in that
time and how *realistic* that example is, for the moment.)

It turns out that at that rate, there *are* a lot of 54 zeptosecondses
of precision loss in the day. It could be half a millisecond a day, or
20µs an hour.

That particular example of 10 picoseconds is 184467440.7370955
(seconds>>64) which could be truncated to 184467440 — losing about 4PPB
(a third of a millisecond a day; 14µs an hour).

So yeah, I suppose a 'shift' field could make sense. It's easy enough
to consume on the guest side as it doesn't really perturb the 128-bit
multiplication very much; especially if we don't let it be negative.

And implementations *can* just set it to zero. It hurts nobody.

Or were you thinking of just using a fixed shift like (seconds>>80)
instead?


smime.p7s
Description: S/MIME cryptographic signature


Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-06-20 Thread David Woodhouse
On Thu, 2024-06-20 at 14:37 +0200, Peter Hilber wrote:
> Changing virtio-dev address to the new one. The discussion might also be
> relevant for virtio-comment, but it is discouraged to forward it to both.

I will happily take it to whichever forum you think is most
appropriate. (And you have my permission to direct replies to whichever
you choose.)

> On 15.06.24 10:40, David Woodhouse wrote:
> > As discussed before, I don't think it makes sense to design a new high-
> > precision virtual clock which only gets it right *most* of the time. We
> > absolutely need to address the issue of live migration.
 ...
> > Here's a first attempt at defining such a memory structure. For now
> > I've done it as a "vmclock" ACPI device based loosely on vmgenid, but I
> > think it makes most sense for this to be part of the virtio_rtc spec.
> > Ultimately it doesn't matter *how* the guest finds the memory region.
> 
> This looks sensible to me. I also think it would be possible to adapt this for
> the virtio-rtc spec. The proposal also supports some other use cases which are
> not in the virtio-rtc RFC spec v4 [2], notably vDSO-style clock reading, and
> others such as indication of a past leap second.

Right. The vDSO-style clock reading is key to solving the live
migration problem.

The other key thing this adds is the error bounds, which some
applications care deeply about. I've been working with the team that
owns ClockBound on that part: https://github.com/aws/clock-bound

> Compared to the virtio-rtc RFC spec v4 [2], this proposal does not seem to
> support leap second smearing. 

That's kind of intentional. Leap second smearing should be considered
the *antithesis* of precise time. Anyone who wants a monotonic realtime
clock should be using the POSIX CLOCK_TAI.

Part of my motivation for fixing the LM problem is because some
financial institutions can incur significant penalties for putting
inaccurate timestamps on transactions — even the disruption caused by
live migration is enough to trigger that. So deliberately lying to them
about what the UTC time is, by up to a second in either direction, is
not necessarily in their best interest.

As you noted, this proposal does expose leap seconds in the recent
past, which is all that's needed to allow a guest to generate a smeared
clock *from* the accurate clock that is provided through this
mechanism.

(Knowledge of past leap seconds is needed because in some modes,
smearing adjustments continue for some hours *afte* the leap second
should have occurred. So the NTP style of leap indicator isn't
sufficient).

> Also, it would be helpful to allow indicating
> when some of the fields are not valid (such as leapsecond_counter_value,
> leapsecond_tai_time, tai_offset_sec, utc_time_maxerror_picosec, ...).

Right. For some of those the answer can just be 'zero means invalid',
for the max error, perhaps MAX_UINT64. But we should definitely make
that explicit.

I'm also not entirely sure I understood Julien's insistence that we
include the leapsecond_counter_value as *well* as the
leapsecond_tai_time. It seems to me that the implementation would have
to recalculate that every time the frequency is adjusted. 

For some of those fields, I was expecting a certain amount of
bikeshedding to occur and figured it was better to post an early straw
man and solicit feedback.

> Do you have plans to contribute this to the Virtio specification and Linux
> driver implementation?

Yes, absolutely. For now I've implemented it in the Linux guest¹ and in
QEMU² as an ACPI device modelled on vmgenid, but I'd love *not* to have
to do that, and just to do it based on virtio instead.

¹ https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/vmclock
² https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock


> > +static const struct ptp_clock_info ptp_vmclock_info = {
> > +   .owner  = THIS_MODULE,
> > +   .max_adj= 0,
> > +   .n_ext_ts   = 0,
> > +   .n_pins = 0,
> > +   .pps= 0,
> > +   .adjfine= ptp_vmclock_adjfine,
> > +   .adjtime= ptp_vmclock_adjtime,
> > +   .gettime64  = ptp_vmclock_gettime,
> 
> Should implement .gettimex64 instead.

Ack, thanks. I'll go play with that.

> 
> > +
> > +   /* Counter frequency, and error margin. Units of (second >> 64) */
> > +   uint64_t counter_period_frac_sec;
> 
> AFAIU this might limit the precision in case of high counter frequencies.
> Could the unit be aligned to the expected frequency band of counters?

This field indicates the period of a single tick, in units of 1>>64 of
a second. That's about 5.4e-20 seconds, or 54 zeptoseconds? 

Can you walk me

Re: [RFC PATCH v3 5/7] virtio_rtc: Add PTP clocks

2024-06-20 Thread David Woodhouse
On Thu, 2024-06-20 at 14:01 +0200, Peter Hilber wrote:
> On 15.06.24 10:01, David Woodhouse wrote:
> > On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote:
> > > 
> > > +   ret = viortc_hw_xtstamp_params(&hw_counter, &cs_id);
> > > +   if (ret)
> > > +   return ret;
> > > +
> > > +   ktime_get_snapshot(&history_begin);
> > > +   if (history_begin.cs_id != cs_id)
> > > +   return -EOPNOTSUPP;
> > 
> > I think you have to call ktime_get_snapshot() anyway to get a snapshot
> > from before your crosststamp? But I still don't much like the fact that
> > you need to use it to work out which cs_id is being used.
> 
> The actual cs_id check is in get_device_system_crosststamp(), where it was
> added recently [1]. So this additional check is just verifying that the
> history_begin is usable.
>
> > Shouldn't get_device_system_crosststamp() pass that to its get_time_fn
> > as a hint?
> 
> This is unneeded in this case, since get_device_system_crosststamp() does
> the check already (but the driver is free to pass it through the
> get_time_fn parameter ctx).

The *check* is a different thing.

As things stand, the device has to *choose* a cs_id to use, and takes a
gamble on that check in get_device_system_crosststamp() throwing the
crosststamp away with -ENODEV because the device picked the wrong
cs_id.

That's why I'm saying it would be nicer if the core code *told* the
device what cs_id to use. Rather than just throwing it away if the
device guesses wrong.

(Yes, it would have to be considered a hint, because it could
theoretically have *changed* by the time the result is obtained, just
as with your code above.)

> > 
> > On x86, you are likely to find that history_begin.cs_id is the KVM
> > clock, so this will return -EOPNOTSUPP and userspace will have to fall
> > back to PTP_SYS_OFFSET. I note the KVM PTP clock actually *converts* a
> > TSC-based crosststamp to kvmclock µs for itself, so that it can report 
> > *cs_id = CSID_X86_KVM_CLK. Not sure how I feel about that though. I'm
> > inclined to suggest that it shouldn't, as anyone who wants accurate
> > timekeeping shouldn't be using the KVM clock anyway.
> > 
> > But we should at least be relatively consistent about it.
> 
> ATM, the driver does indeed not have TSC support (for cross-timestamping)
> enabled at all, so would always use fallback. If *not* using the KVM clock,
> I think TSC can just be enabled by adding architecture-specific code
> similar to virtio_rtc_arm.c.
> 
> I am not familiar with the KVM clock, but maybe it would be sufficient to
> allow CSID_X86_KVM_CLK as well?

Sure, that's what the ptp_kvm clock does. It actually obtains a TSC
reading from the "hardware", and then manually (and unconditionally)
converts that to a kvmclock value so that it can return a clock pairing
based on CSID_X86_KVM_CLK.

Which works until the user configures the clocksource to be the TSC
instead of kvmclock, and then hits that -ENODEV check and has to do the
fallback.

We should just tell the device which cs_id to use.




smime.p7s
Description: S/MIME cryptographic signature


Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-06-15 Thread David Woodhouse
On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote:
> RFC v3 updates
> --
> 
> This series implements a driver for a virtio-rtc device conforming to spec
> RFC v3 [1]. It now includes an RTC class driver with alarm, in addition to
> the PTP clock driver already present before.
> 
> This patch series depends on the patch series "treewide: Use clocksource id
> for get_device_system_crosststamp()" [3]. Pull [4] to get the combined
> series on top of mainline.
> 
> Overview
> 
> 
> This patch series adds the virtio_rtc module, and related bugfixes. The
> virtio_rtc module implements a driver compatible with the proposed Virtio
> RTC device specification [1]. The Virtio RTC (Real Time Clock) device
> provides information about current time. The device can provide different
> clocks, e.g. for the UTC or TAI time standards, or for physical time
> elapsed since some past epoch. The driver can read the clocks with simple
> or more accurate methods. Optionally, the driver can set an alarm.
> 
> The series first fixes some bugs in the get_device_system_crosststamp()
> interpolation code, which is required for reliable virtio_rtc operation.
> Then, add the virtio_rtc implementation.
> 
> For the Virtio RTC device, there is currently a proprietary implementation,
> which has been used for provisional testing.

As discussed before, I don't think it makes sense to design a new high-
precision virtual clock which only gets it right *most* of the time. We
absolutely need to address the issue of live migration.

When live migration occurs, the guest's time precision suffers in two
ways.

First, even when migrating to a supposedly identical host the precise
rate of the underlying counter (TSC, arch counter, etc.) can change
within the tolerances (e.g. ±50PPM) of the hardware. Unlike the natural
changes that NTP normally manages to track, this is a *step* change,
potentially from -50PPM to +50PPM from one host to the next.

Second, the accuracy of the counter as preserved across migration is
limited by the accuracy of each host's NTP synchronization. So there is
also a step change in the value of the counter itself.

At the moment of migration, the guest's timekeeping should be
considered invalid. Any previous cross-timestamps are meaningless, and
blindly using the previously-known relationship between the counter and
real time can lead to problems such as corruption in distributed
databases, fines for mis-timestamped transactions, and other general
unhappiness.

We obviously can't get a new timestamp from the virtio_rtc device every
time an application wants to know the time reliably. We don't even want
*system* calls for that, which is why we have it in a vDSO.

We can take the same approach to warning the guest about clock
disruption due to live migration. A shared memory region from the
virtual clock device even be mapped all the way to userspace, for those
applications which need precise and *reliable* time to check a
'disruption' marker in it, and do whatever is appropriate (e.g. abort
transactions and wait for time to resync) when it happens.

We can do better than just letting the guest know about disruption, of
course. We can put the actual counter→realtime relationship into the
memory region too. As well as being able to provide a PTP driver with
this, the applications which care about *reliable* timestamps can mmap
the page directly and use it, vDSO-style, to have accurate timestamps
even from the first cycle after migration.

When disruption is signalled, the guest needs to throw away any
*additional* refinement that it's done with NTP/PTP/PPS/etc. and revert
to knowing nothing more than what the hypervisor advertises here.

Here's a first attempt at defining such a memory structure. For now
I've done it as a "vmclock" ACPI device based loosely on vmgenid, but I
think it makes most sense for this to be part of the virtio_rtc spec.
Ultimately it doesn't matter *how* the guest finds the memory region.

Very preliminary QEMU hacks at 
https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock
(I still need to do the KVM side helper for actually filling in the
host clock information, converted to the *guest* TSC)

This is the guest side. H aving heckled your assumption that we can use
the virt counter on Arm, I concede I'm doing the same thing for now.
The structure itself is at the end, or see
https://git.infradead.org/?p=users/dwmw2/linux.git;a=blob;f=include/uapi/linux/vmclock.h;hb=vmclock

From 9e1c3b823d497efa4e0acb21b226a72e4d6e8a53 Mon Sep 17 00:00:00 2001
From: David Woodhouse 
Date: Mon, 10 Jun 2024 15:10:11 +0100
Subject: [PATCH] ptp: Add vDSO-style vmclock support

The vmclock "device" provides a shared memory region with precision clock
information. By using shared memory, it i

Re: [RFC PATCH v3 5/7] virtio_rtc: Add PTP clocks

2024-06-15 Thread David Woodhouse
On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote:
> 
> +   ret = viortc_hw_xtstamp_params(&hw_counter, &cs_id);
> +   if (ret)
> +   return ret;
> +
> +   ktime_get_snapshot(&history_begin);
> +   if (history_begin.cs_id != cs_id)
> +   return -EOPNOTSUPP;

I think you have to call ktime_get_snapshot() anyway to get a snapshot
from before your crosststamp? But I still don't much like the fact that
you need to use it to work out which cs_id is being used.

Shouldn't get_device_system_crosststamp() pass that to its get_time_fn
as a hint?

On x86, you are likely to find that history_begin.cs_id is the KVM
clock, so this will return -EOPNOTSUPP and userspace will have to fall
back to PTP_SYS_OFFSET. I note the KVM PTP clock actually *converts* a
TSC-based crosststamp to kvmclock µs for itself, so that it can report 
*cs_id = CSID_X86_KVM_CLK. Not sure how I feel about that though. I'm
inclined to suggest that it shouldn't, as anyone who wants accurate
timekeeping shouldn't be using the KVM clock anyway.

But we should at least be relatively consistent about it.


smime.p7s
Description: S/MIME cryptographic signature


Re: [RFC PATCH v3 6/7] virtio_rtc: Add Arm Generic Timer cross-timestamping

2024-06-15 Thread David Woodhouse
On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote:
> 
> +int viortc_hw_xtstamp_params(u16 *hw_counter, enum clocksource_ids *cs_id)
> +{
> +   *hw_counter = VIRTIO_RTC_COUNTER_ARM_VIRT;

Hm, but what if it isn't? I think you need to put this in
drivers/clocksource/arm_arch_timer.c where it can do something like
kvm_arch_ptp_get_crosststamp() does to decide:

if (arch_timer_uses_ppi == ARCH_TIMER_VIRT_PPI)
ptp_counter = KVM_PTP_VIRT_COUNTER;
else
ptp_counter = KVM_PTP_PHYS_COUNTER;


> +   *cs_id = CSID_ARM_ARCH_COUNTER;
> +
> +   return 0;
> +}



smime.p7s
Description: S/MIME cryptographic signature


Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-20 Thread David Woodhouse
On Tue, 2024-03-19 at 14:47 +0100, Peter Hilber wrote:
> While the virtio-comment list is not available, now also CC'ing Parav,
> which may be interested in this virtio-rtc spec related discussion thread.
> 
> On 14.03.24 15:19, David Woodhouse wrote:
> > On 14 March 2024 11:13:37 CET, Peter Hilber  
> > wrote:
> > > > To a certain extent, as long as the virtio-rtc device is designed to 
> > > > expose time precisely and unambiguously, it's less important if the 
> > > > Linux kernel *today* can use that. Although of course we should strive 
> > > > for that. Let's be...well, *unambiguous*, I suppose... that we've 
> > > > changed topics to discuss that though.
> > > > 
> > > 
> > > As Virtio is extensible (unlike hardware), my approach is to mostly 
> > > specify
> > > only what also has a PoC user and a use case.
> > 
> > If we get memory-mapped (X, Y, Z, ±x, ±y) I'll have a user and a use case 
> > on day one. Otherwise, as I said in my first response, I can go do that as 
> > a separate device and decide that virtio_rtc doesn't meet our needs 
> > (especially for maintaining accuracy over LM).
> 
> We plan to add 
> 
> - leap second indication,
> 
> - UTC-to-TAI offset,
> 
> - clock smearing indication (including the noon-to-noon linear smearing
>   variant which seems to be somewhat popular), and
>
> - clock accuracy indication
> 
> to the initial spec and to the PoC implementation.

Sounds good, thanks! I look forward to seeing the new revision. I'm
hoping Julien can give feedback on the clock accuracy parts.

> However, due to resource restrictions, we cannot ourselves add the
> memory-mapped clock to the initial spec.
>
> Everyone is very welcome to contribute the memory-mapped clock to the spec,
> and I think it might then still make it to the initial version.

Makes sense. That is my primary target, so I'm *hoping* we can converge
and get that into your initial spec, otherwise for expediency I'm going
to have to define an ACPI or DT or PCI device of our own and expose the
memory region through that instead.

(Even if I have to do that in the short term to stop the bleeding with
customers' clocks and live migration, I'd still aspire to migrate to a
virtio_rtc version of it in future)



smime.p7s
Description: S/MIME cryptographic signature


Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-14 Thread David Woodhouse
On 14 March 2024 11:13:37 CET, Peter Hilber  
wrote:
>> To a certain extent, as long as the virtio-rtc device is designed to expose 
>> time precisely and unambiguously, it's less important if the Linux kernel 
>> *today* can use that. Although of course we should strive for that. Let's 
>> be...well, *unambiguous*, I suppose... that we've changed topics to discuss 
>> that though.
>> 
>
>As Virtio is extensible (unlike hardware), my approach is to mostly specify
>only what also has a PoC user and a use case.

If we get memory-mapped (X, Y, Z, ±x, ±y) I'll have a user and a use case on 
day one. Otherwise, as I said in my first response, I can go do that as a 
separate device and decide that virtio_rtc doesn't meet our needs (especially 
for maintaining accuracy over LM).

My main concern for virto_rtc is that we avoid *ambiguity*. Yes, I get that 
it's extensible but we don't want a v1.0 of the spec, implemented by various 
hypervisors, which still leaves guests not knowing what the actual time is. 
That would not be good. And even UTC without a leap second indicator has that 
problem.



Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-13 Thread David Woodhouse
On 13 March 2024 17:50:48 GMT, Peter Hilber  
wrote:
>On 13.03.24 13:45, David Woodhouse wrote:
>> Surely the whole point of this effort is to provide guests with precise
>> and *unambiguous* knowledge of what the time is? 
>
>I would say, a fundamental point of this effort is to enable such
>implementations, and to detect if a device is promising to support this.
>
>Where we might differ is as to whether the Virtio clock *for every
>implementation* has to be *continuously* accurate w.r.t. a time standard,
>or whether *for some implementations* it could be enough that all guests in
>the local system have the same, precise local notion of time, which might
>be off from the actual time standard.

That makes sense, but remember I don't just want {X, Y, Z} but *also* the error 
bounds of ±deltaY and ±deltaZ too.

So your example just boils down to "I'm calling it UTC, and it's really 
precise, but we make no promises about its *accuracy*". And that's fine.

>Also, cf. ptp_kvm, which AFAIU doesn't address leap seconds at all...

KVM is not an exemplar of good time practices. 
Not in *any* respect :)

>With your described use case the UTC_SMEARED clock should of course not be
>used. The UTC_SMEARED clock would get a distinct name through udev, like
>/dev/ptp_virtio_utc_smeared, so the incompatibility could at least be
>detected.

As long as it's clear to all concerned that this is fundamentally not usable as 
an accurate time source, and is only for the local-sync case you described, 
sure.

>> Using UTC is bad enough, because for a UTC timestamp in the middle of a
>> leap second the guest can't know know *which* occurrence of that leap
>> second it is, so it might be wrong by a second. To resolve that
>> ambiguity needs a leap indicator and/or tai_offset field.
>
>I agree that virtio-rtc should communicate this. The question is, what
>exactly, and for which clock read request?

Are we now conflating software architecture (and Linux in particular) with 
"hardware" design?

To a certain extent, as long as the virtio-rtc device is designed to expose 
time precisely and unambiguously, it's less important if the Linux kernel 
*today* can use that. Although of course we should strive for that. Let's 
be...well, *unambiguous*, I suppose... that we've changed topics to discuss 
that though.

>As for PTP clocks:
>
>- It doesn't fit into the ioctl PTP_SYS_OFFSET_PRECISE2.
>
>- The clock_adjtime(2) tai_offset and return value could be set (if
>  upstream will accept this). Would this help? As discussed, user space
>  would need to interpret this (and currently no dynamic POSIX clock sets
>  this).

Hm, maybe?


>>> I think I can add a SHOULD requirement which vaguely refers to vCPU 0, or
>>> boot vCPU. But the Virtio device is not necessarily hosted by a hypervisor,
>>> so the device might not even know which vCPUs there are. E.g. there is even
>>> interest to make virtio-rtc work as part of the virtio-net device (which
>>> might be implemented in hardware).
>> 
>> Sure, but those implementations aren't going to offer the TSC pairing
>> at all, are they?
>> 
>
>They could offer an Intel ART pairing (some physical PTP NICs are already
>doing this, look for the convert_art_to_tsc() users).

Right, but isn't that software's problem? The time pairing is defined against 
the ART in that case.



Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-13 Thread David Woodhouse
On Wed, 2024-03-13 at 13:58 +0100, Alexandre Belloni wrote:
> The TSC or whatever CPU counter/clock that is used to keep the system
> time is not an RTC, I don't get why it has to be exposed as such to the
> guests. PTP is fine and precise, RTC is not.

Ah, I see. But the point of the virtio_rtc is not really to expose that
CPU counter. The point is to report the wallclock time, just like an
actual RTC. The real difference is the *precision*.

The virtio_rtc device has a facility to *also* expose the counter,
because that's what we actually need to gain that precision...

Applications don't read the RTC every time they want to know what the
time is. These days, they don't even make a system call; it's done
entirely in userspace mode. The kernel exposes some shared memory,
essentially saying "the counter was X at time Y, and runs at Z Hz".
Then applications just read the CPU counter and do some arithmetic.

As we require more and more precision in the calibration, it becomes
important to get *paired* readings of the CPU counter and the wallclock
time at precisely the same moment. If the guest has to read one and
then the other, potentially taking interrupts, getting preempted and
suffering steal/SMI time in the middle, that introduces an error which
is increasingly significant as we increasingly care about precision.

Peter's proposal exposes the pairs of {X,Y} and leaves *all* the guest
kernels having to repeat readings over time and perform the calibration
as the underlying hardware oscillator frequency (Z) drifts with
temperature. I'm trying to get him to let the hypervisor expose the
calibrated frequency Z too. Along with *error* bounds for ±δX and ±δZ.
Which aside from reducing the duplication of effort, will *also* fix
the problem of live migration where *all* those things suffer a step
change and leave the guest with an inaccurate clock but not knowing it.

But that part isn't relevant to the RTC, as you say. RTC doesn't care
about that level of precision; it's just what the system uses to know
roughly what year it is, when it powers up. (And isn't even really
trusted for that, which is a large part of why I added the
X509_V_FLAG_NO_CHECK_TIME flag to OpenSSL, so Secure Boot doesn't break
when the RTC is catastrophically wrong :)

If you're asking why patch 7/7 in Peter's series exists to expose the
virtio clock through RTC, and you're not particularly interested in the
first six, I suppose that's a fair question. As is the question of "why
is it called virtio_rtc not virtio_ptp?". 

But let me turn it around: if the kernel has access to this virtio
device and *not* any other RTC, why *wouldn't* the kernel use the time
from it? The fact that it can optionally *also* provide paired readings
with the CPU counter doesn't actually *hurt* for the RTC use case, does
it?





smime.p7s
Description: S/MIME cryptographic signature


Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-13 Thread David Woodhouse
On Wed, 2024-03-13 at 10:45 +0100, Peter Hilber wrote:
> On 12.03.24 18:15, David Woodhouse wrote:
> > On Mon, 2024-03-11 at 19:24 +0100, Peter Hilber wrote:
> > > On 08.03.24 13:33, David Woodhouse wrote:
> > > > On Fri, 2024-03-08 at 11:32 +0100, Peter Hilber wrote:
> > > > > On 07.03.24 15:02, David Woodhouse wrote:
> > > > > > Hm, should we allow UTC? If you tell me the time in UTC, then
> > > > > > (sometimes) I still don't actually know what the time is, because 
> > > > > > some
> > > > > > UTC seconds occur twice. UTC only makes sense if you provide the TAI
> > > > > > offset, surely? Should the virtio_rtc specification make it 
> > > > > > mandatory
> > > > > > to provide such?
> > > > > > 
> > > > > > Otherwise you're just designing it to allow crappy hypervisors to
> > > > > > expose incomplete information.
> > > > > > 
> > > > > 
> > > > > Hi David,
> > > > > 
> > > > > (adding virtio-comm...@lists.oasis-open.org for spec discussion),
> > > > > 
> > > > > thank you for your insightful comments. I think I take a broadly 
> > > > > similar
> > > > > view. The reason why the current spec and driver is like this is that 
> > > > > I
> > > > > took a pragmatic approach at first and only included features which 
> > > > > work
> > > > > out-of-the-box for the current Linux ecosystem.
> > > > > 
> > > > > The current virtio_rtc features work similar to ptp_kvm, and therefore
> > > > > can work out-of-the-box with time sync daemons such as chrony.
> > > > > 
> > > > > As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock
> > > > > as well, I am afraid that
> > > > > 
> > > > > - in some (embedded) scenarios, the TAI clock may not be available
> > > > > 
> > > > > - crappy hypervisors will pass off the UTC clock as the TAI clock.
> > > > > 
> > > > > For the same reasons, I am also not sure about adding a *mandatory* 
> > > > > TAI
> > > > > offset to each readout. I don't know user-space software which would
> > > > > leverage this already (at least not through the PTP clock interface).
> > > > > And why would such software not go straight for the TAI clock instead?
> > > > > 
> > > > > How about adding a requirement to the spec that the virtio-rtc device
> > > > > SHOULD expose the TAI clock whenever it is available - would this
> > > > > address your concerns?
> > > > 
> > > > I think that would be too easy for implementors to miss, or decide not
> > > > to obey. Or to get *wrong*, by exposing a TAI clock but actually
> > > > putting UTC in it.
> > > > 
> > > > I think I prefer to mandate the tai_offset field with the UTC clock.
> > > > Crappy implementations will just set it to zero, but at least that
> > > > gives a clear signal to the guests that it's *their* problem to
> > > > resolve.
> > > 
> > > To me there are some open questions regarding how this would work. Is 
> > > there
> > > a use case for this with the v3 clock reading methods, or would it be
> > > enough to address this with the Virtio timekeeper?
> > > 
> > > Looking at clock_adjtime(2), the tai_offset could be exposed, but probably
> > > best alongside some additional information about leap seconds. I am not
> > > aware about any user-space user. In addition, leap second smearing should
> > > also be addressed.
> > > 
> > 
> > Is there even a standard yet for leap-smearing? Will it be linear over
> > 1000 seconds like UTC-SLS? Or semi-raised-cosine over 24 hours, which I
> > think is what Google does? Meta does something different again, don't
> > they?
> > 
> > Exposing UTC as the only clock reference is bad enough; when leap
> > seconds happen there's a whole second during which you don't *know*
> > which second it is. It seems odd to me, for a precision clock to be
> > deliberately ambiguous about what the time is!
> 
> Just to be clear, the device can perfectly expose only a TAI reference
> clock (or both UTC and TAI), the spec is just completely open about this,
> as it tries to work for dive

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-13 Thread David Woodhouse
On Wed, 2024-03-13 at 12:18 +0100, Alexandre Belloni wrote:
> 
> I still don't know anything about virtio but under Linux, an RTC is
> always UTC (or localtime when dual booting but let's not care) and never
> accounts for leap seconds. Having an RTC and RTC driver behaving
> differently would be super inconvenient. Why don't you leave this to
> userspace?

Well yes, we don't need to expose *anything* from the hypervisor and we
can leave it all to guest userspace. We can run NTP on every single one
of *hundreds* of guests, leaving them all to duplicate the work of
calibrating the *same* underlying oscillator.

I thought we were trying to avoid that, by having the hypervisor tell
them what the time was. If we're going to do that, we need it to be
sufficiently precise (and some clients want to *know* the precision),
and above all we need it to be *unambiguous*.

If the hypervisor says that the time is 3692217600.001, then the guest
doesn't actually know *which* 3692217600.001 it is, and thus it still
doesn't know the time to an accuracy better than 1 second.

And if we start allowing the hypervisor to smear clocks in some other
underspecified ways, then we end up with errors of up to 1 second in
the clock for long periods of time *around* the leap second.

We need to avoid that ambiguity.

> I guess I'm still questioning whether this is the correct interface to
> expose the host system time instead of an actual RTC.

If an RTC device is able to report '23:59:60' as the time of day, I
suppose that *could* resolve the ambiguity. But talking to a device is
slow; we want guests to be able to know the time — accurately — with a
simple counter/tsc read and some arithmetic. Which means *paired* reads
of 'RTC' and the counter, and a precise indication of the counter
frequency.



smime.p7s
Description: S/MIME cryptographic signature


Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-12 Thread David Woodhouse
On Mon, 2024-03-11 at 19:24 +0100, Peter Hilber wrote:
> On 08.03.24 13:33, David Woodhouse wrote:
> > On Fri, 2024-03-08 at 11:32 +0100, Peter Hilber wrote:
> > > On 07.03.24 15:02, David Woodhouse wrote:
> > > > Hm, should we allow UTC? If you tell me the time in UTC, then
> > > > (sometimes) I still don't actually know what the time is, because some
> > > > UTC seconds occur twice. UTC only makes sense if you provide the TAI
> > > > offset, surely? Should the virtio_rtc specification make it mandatory
> > > > to provide such?
> > > > 
> > > > Otherwise you're just designing it to allow crappy hypervisors to
> > > > expose incomplete information.
> > > > 
> > > 
> > > Hi David,
> > > 
> > > (adding virtio-comm...@lists.oasis-open.org for spec discussion),
> > > 
> > > thank you for your insightful comments. I think I take a broadly similar
> > > view. The reason why the current spec and driver is like this is that I
> > > took a pragmatic approach at first and only included features which work
> > > out-of-the-box for the current Linux ecosystem.
> > > 
> > > The current virtio_rtc features work similar to ptp_kvm, and therefore
> > > can work out-of-the-box with time sync daemons such as chrony.
> > > 
> > > As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock
> > > as well, I am afraid that
> > > 
> > > - in some (embedded) scenarios, the TAI clock may not be available
> > > 
> > > - crappy hypervisors will pass off the UTC clock as the TAI clock.
> > > 
> > > For the same reasons, I am also not sure about adding a *mandatory* TAI
> > > offset to each readout. I don't know user-space software which would
> > > leverage this already (at least not through the PTP clock interface).
> > > And why would such software not go straight for the TAI clock instead?
> > > 
> > > How about adding a requirement to the spec that the virtio-rtc device
> > > SHOULD expose the TAI clock whenever it is available - would this
> > > address your concerns?
> > 
> > I think that would be too easy for implementors to miss, or decide not
> > to obey. Or to get *wrong*, by exposing a TAI clock but actually
> > putting UTC in it.
> > 
> > I think I prefer to mandate the tai_offset field with the UTC clock.
> > Crappy implementations will just set it to zero, but at least that
> > gives a clear signal to the guests that it's *their* problem to
> > resolve.
> 
> To me there are some open questions regarding how this would work. Is there
> a use case for this with the v3 clock reading methods, or would it be
> enough to address this with the Virtio timekeeper?
> 
> Looking at clock_adjtime(2), the tai_offset could be exposed, but probably
> best alongside some additional information about leap seconds. I am not
> aware about any user-space user. In addition, leap second smearing should
> also be addressed.
> 

Is there even a standard yet for leap-smearing? Will it be linear over
1000 seconds like UTC-SLS? Or semi-raised-cosine over 24 hours, which I
think is what Google does? Meta does something different again, don't
they?

Exposing UTC as the only clock reference is bad enough; when leap
seconds happen there's a whole second during which you don't *know*
which second it is. It seems odd to me, for a precision clock to be
deliberately ambiguous about what the time is!

But if the virtio-rtc clock is defined as UTC and then expose something
*different* in it, that's even worse. You potentially end up providing
inaccurate time for a whole *day* leading up to the leap second.

I think you're right that leap second smearing should be addressed. At
the very least, by making it clear that the virtio-rtc clock which
advertises UTC shall be used *only* for UTC, never UTC-SLS or any other
yet-to-be-defined variant.

Please make it explicit that any hypervisor which wants to advertise a
smeared clock shall define a new type which specifies the precise
smearing algorithm and cannot be conflated with the one you're defining
here.

> > One other thing to note is I think we're being very naïve about the TSC
> > on x86 hosts. Theoretically, the TSC for every vCPU might run at a
> > different frequency, and even if they run at the same frequency they
> > might be offset from each other. I'm happy to be naïve but I think we
> > should be *explicitly* so, and just say for example that it's defined
> > against vCPU0 so if other vCPUs are different then all bets are off.
> 
> ATM Virtio has no notion of vCPUs, or vCPU topology. So I wonder if you
> have an opinion on how to represent this in a platform-independent way.

Well, it doesn't have a notion of TSCs either; you include that by
implicit reference don't you?



smime.p7s
Description: S/MIME cryptographic signature


Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-08 Thread David Woodhouse
On Fri, 2024-03-08 at 11:32 +0100, Peter Hilber wrote:
> On 07.03.24 15:02, David Woodhouse wrote:
> > On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote:
> > > RFC v3 updates
> > > --
> > > 
> > > This series implements a driver for a virtio-rtc device conforming to spec
> > > RFC v3 [1]. It now includes an RTC class driver with alarm, in addition to
> > > the PTP clock driver already present before.
> > > 
> > > This patch series depends on the patch series "treewide: Use clocksource 
> > > id
> > > for get_device_system_crosststamp()" [3]. Pull [4] to get the combined
> > > series on top of mainline.
> > > 
> > > Overview
> > > 
> > > 
> > > This patch series adds the virtio_rtc module, and related bugfixes. The
> > > virtio_rtc module implements a driver compatible with the proposed Virtio
> > > RTC device specification [1]. The Virtio RTC (Real Time Clock) device
> > > provides information about current time. The device can provide different
> > > clocks, e.g. for the UTC or TAI time standards, or for physical time
> > > elapsed since some past epoch. 
> > 
> > Hm, should we allow UTC? If you tell me the time in UTC, then
> > (sometimes) I still don't actually know what the time is, because some
> > UTC seconds occur twice. UTC only makes sense if you provide the TAI
> > offset, surely? Should the virtio_rtc specification make it mandatory
> > to provide such?
> > 
> > Otherwise you're just designing it to allow crappy hypervisors to
> > expose incomplete information.
> > 
> 
> Hi David,
> 
> (adding virtio-comm...@lists.oasis-open.org for spec discussion),
> 
> thank you for your insightful comments. I think I take a broadly similar
> view. The reason why the current spec and driver is like this is that I
> took a pragmatic approach at first and only included features which work
> out-of-the-box for the current Linux ecosystem.
> 
> The current virtio_rtc features work similar to ptp_kvm, and therefore can
> work out-of-the-box with time sync daemons such as chrony.
> 
> As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock as
> well, I am afraid that
> 
> - in some (embedded) scenarios, the TAI clock may not be available
> 
> - crappy hypervisors will pass off the UTC clock as the TAI clock.
> 
> For the same reasons, I am also not sure about adding a *mandatory* TAI
> offset to each readout. I don't know user-space software which would
> leverage this already (at least not through the PTP clock interface). And
> why would such software not go straight for the TAI clock instead?
> 
> How about adding a requirement to the spec that the virtio-rtc device
> SHOULD expose the TAI clock whenever it is available - would this address
> your concerns?

I think that would be too easy for implementors to miss, or decide not
to obey. Or to get *wrong*, by exposing a TAI clock but actually
putting UTC in it.

I think I prefer to mandate the tai_offset field with the UTC clock.
Crappy implementations will just set it to zero, but at least that
gives a clear signal to the guests that it's *their* problem to
resolve.




> > > PTP clock interface
> > > ---
> > > 
> > > virtio_rtc exposes clocks as PTP clocks to userspace, similar to ptp_kvm.
> > > If both the Virtio RTC device and this driver have special support for the
> > > current clocksource, time synchronization programs can use
> > > cross-timestamping using ioctl PTP_SYS_OFFSET_PRECISE2 aka
> > > PTP_SYS_OFFSET_PRECISE. Similar to ptp_kvm, system time synchronization
> > > with single-digit ns precision is possible with a quiescent reference 
> > > clock
> > > (from the Virtio RTC device). This works even when the Virtio device
> > > response is slow compared to ptp_kvm hypercalls.
> > 
> > Is PTP the right mechanism for this? As I understand it, PTP is a way
> > to precisely synchronize one clock with another. But in the case of
> > virt guests synchronizing against the host, it isn't really *another*
> > clock. It really is the *same* underlying clock. As the host clock
> > varies with temperature, for example, so does the guest clock. The only
> > difference is an offset and (on x86 perhaps) a mathematical scaling of
> > the frequency.
> > 
> > I was looking at this another way, when I came across this virtio-rtc
> > work.
> > 
> > My idea was just for the hypervisor to expose its own timekeeping
> > information — the counter/TSC value and

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-07 Thread David Woodhouse
On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote:
> RFC v3 updates
> --
> 
> This series implements a driver for a virtio-rtc device conforming to spec
> RFC v3 [1]. It now includes an RTC class driver with alarm, in addition to
> the PTP clock driver already present before.
> 
> This patch series depends on the patch series "treewide: Use clocksource id
> for get_device_system_crosststamp()" [3]. Pull [4] to get the combined
> series on top of mainline.
> 
> Overview
> 
> 
> This patch series adds the virtio_rtc module, and related bugfixes. The
> virtio_rtc module implements a driver compatible with the proposed Virtio
> RTC device specification [1]. The Virtio RTC (Real Time Clock) device
> provides information about current time. The device can provide different
> clocks, e.g. for the UTC or TAI time standards, or for physical time
> elapsed since some past epoch. 

Hm, should we allow UTC? If you tell me the time in UTC, then
(sometimes) I still don't actually know what the time is, because some
UTC seconds occur twice. UTC only makes sense if you provide the TAI
offset, surely? Should the virtio_rtc specification make it mandatory
to provide such?

Otherwise you're just designing it to allow crappy hypervisors to
expose incomplete information.

> PTP clock interface
> ---
> 
> virtio_rtc exposes clocks as PTP clocks to userspace, similar to ptp_kvm.
> If both the Virtio RTC device and this driver have special support for the
> current clocksource, time synchronization programs can use
> cross-timestamping using ioctl PTP_SYS_OFFSET_PRECISE2 aka
> PTP_SYS_OFFSET_PRECISE. Similar to ptp_kvm, system time synchronization
> with single-digit ns precision is possible with a quiescent reference clock
> (from the Virtio RTC device). This works even when the Virtio device
> response is slow compared to ptp_kvm hypercalls.

Is PTP the right mechanism for this? As I understand it, PTP is a way
to precisely synchronize one clock with another. But in the case of
virt guests synchronizing against the host, it isn't really *another*
clock. It really is the *same* underlying clock. As the host clock
varies with temperature, for example, so does the guest clock. The only
difference is an offset and (on x86 perhaps) a mathematical scaling of
the frequency.

I was looking at this another way, when I came across this virtio-rtc
work.

My idea was just for the hypervisor to expose its own timekeeping
information — the counter/TSC value and TAI time at a given moment,
frequency of the counter, and the precision of both that frequency
(±PPM) and the TAI timestamp (±µs).

By putting that in a host/guest shared data structure with a seqcount
for lockless updates, we can update it as time synchronization on the
host is refined, and we can even cleanly handle live migration where
the guest ends up on a completely different host. It allows for use
cases which *really* care (e.g. timestamping financial transactions) to
ensure that there is never even a moment of getting *wrong* timestamps
if they haven't yet resynced after a migration.

Now I'm trying to work out if I should attempt to reconcile with your
existing virtio-rtc work, or just decide that virtio-rtc isn't trying
to solve the actual problem that we have, and go ahead with something
different... ?



smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH 5/5] modules: only allow symbol_get of EXPORT_SYMBOL_GPL modules

2023-10-17 Thread David Woodhouse
On Tue, 2023-08-01 at 19:35 +0200, Christoph Hellwig wrote:
> It has recently come to my attention that nvidia is circumventing the
> protection added in 262e6ae7081d ("modules: inherit
> TAINT_PROPRIETARY_MODULE") by importing exports from their proprietary
> modules into an allegedly GPL licensed module and then rexporting them.
> 
> Given that symbol_get was only ever intended for tightly cooperating
> modules using very internal symbols it is logical to restrict it to
> being used on EXPORT_SYMBOL_GPL and prevent nvidia from costly DMCA
> Circumvention of Access Controls law suites.

I'm all for insisting that everything be exported with
EXPORT_SYMBOL_GPL and nothing at all ever be exported with just
EXPORT_SYMBOL.

But if we're going to tolerate the core kernel still exporting some
stuff with EXPORT_SYMBOL, why isn't OK for a GPL-licensed module do to
the same? Even an *in-tree* GPL-licensed module now can't export
functionality with EXPORT_SYMBOL and have it used with symbol_get().

We're forced to *either* allow direct linking by non-GPL modules, or
allow symbol_get(), but not both?

> Fixes: 262e6ae7081d ("modules: inherit TAINT_PROPRIETARY_MODULE")

Hm, the condition we really need to fix *that* is "symbol_get() will
only import symbols from GPL-licensed modules", isn't it?

As long as that property is correctly transitive, why does the symbol
itself have to be EXPORT_SYMBOL_GPL instead of EXPORT_SYMBOL? Am I
missing another potential loophole?

I suppose there's now scope for a different type of shim which
*directly* imports an EXPORT_SYMBOL function in order to export it
again as EXPORT_SYMBOL_GPL and thus allow the GPL export to be found
with symbol_get()?

That's the *converse* of the problematic shim that was being used
before, and from a licensing point of view it seems fine... it's just
working around the unintended side-effects of this patch?


smime.p7s
Description: S/MIME cryptographic signature


Re: [EXTERNAL] [PATCH 2/2] KVM: x86: disable interrupts while pvclock_gtod_sync_lock is taken

2021-04-01 Thread David Woodhouse
On Tue, 2021-03-30 at 12:59 -0400, Paolo Bonzini wrote:
> @@ -2686,13 +2688,13 @@ static int kvm_guest_time_update(struct
> kvm_vcpu *v)
>  * If the host uses TSC clock, then passthrough TSC as stable
>  * to the guest.
>  */
> -   spin_lock(&ka->pvclock_gtod_sync_lock);
> +   spin_lock_irqsave(&ka->pvclock_gtod_sync_lock, flags);
> use_master_clock = ka->use_master_clock;
> if (use_master_clock) {
> host_tsc = ka->master_cycle_now;
> kernel_ns = ka->master_kernel_ns;
> }
> -   spin_unlock(&ka->pvclock_gtod_sync_lock);
> +   spin_unlock_irqrestore(&ka->pvclock_gtod_sync_lock, flags);
> 
> /* Keep irq disabled to prevent changes to the clock */
> local_irq_save(flags);

That seems a little gratuitous at the end; restoring the flags as part
of the spin_unlock_irqrestore() and then immediately calling
local_irq_save().

Is something going to complain if we just use spin_unlock() there and
then later restore the flags with the existing local_irq_restore()?

Or should we move the local_irq_save() up above the existing
spin_lock() and leave the spin lock/unlock as they are?


smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH 2/3] iommu/amd: Don't call early_amd_iommu_init() when AMD IOMMU is disabled

2021-03-17 Thread David Woodhouse



On 17 March 2021 13:32:35 GMT, Joerg Roedel  wrote:
>On Wed, Mar 17, 2021 at 11:47:11AM +0000, David Woodhouse wrote:
>> If you've already moved the Stoney Ridge check out of the way,
>there's
>> no real reason why you can't just set
>init_state=IOMMU_CMDLINE_DISABLED
>> directly from parse_amd_iommu_options(), is there? Then you don't
>need
>> the condition here at all?
>
>True, there is even more room for optimization. The amd_iommu_disabled
>variable can go away entirely, including its checks in
>early_amd_iommu_init(). I will do a patch-set on-top of this for v5.13
>which does more cleanups.

If we can get to the point where we don't even need to check 
amd_iommu_irq_remap in the ...select() function because the IRQ domain is never 
even registered in the case where the flag ends up false, all the better :)

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [PATCH 2/3] iommu/amd: Don't call early_amd_iommu_init() when AMD IOMMU is disabled

2021-03-17 Thread David Woodhouse
On Wed, 2021-03-17 at 10:10 +0100, Joerg Roedel wrote:
> diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
> index 3280e6f5b720..61dae1800b7f 100644
> --- a/drivers/iommu/amd/init.c
> +++ b/drivers/iommu/amd/init.c
> @@ -2919,12 +2919,12 @@ static int __init state_next(void)
> }
> break;
> case IOMMU_IVRS_DETECTED:
> -   ret = early_amd_iommu_init();
> -   init_state = ret ? IOMMU_INIT_ERROR : IOMMU_ACPI_FINISHED;
> -   if (init_state == IOMMU_ACPI_FINISHED && amd_iommu_disabled) {
> -   pr_info("AMD IOMMU disabled\n");
> +   if (amd_iommu_disabled) {
> init_state = IOMMU_CMDLINE_DISABLED;
> ret = -EINVAL;
> +   } else {
> +   ret = early_amd_iommu_init();
> +   init_state = ret ? IOMMU_INIT_ERROR : 
> IOMMU_ACPI_FINISHED;
> }
> break;
> case IOMMU_ACPI_FINISHED:
> -- 

If you've already moved the Stoney Ridge check out of the way, there's
no real reason why you can't just set init_state=IOMMU_CMDLINE_DISABLED
directly from parse_amd_iommu_options(), is there? Then you don't need
the condition here at all?


smime.p7s
Description: S/MIME cryptographic signature


[PATCH] iommu/amd: Don't initialise remapping irqdomain if IOMMU is disabled

2021-03-15 Thread David Woodhouse
From: David Woodhouse 

When the IOMMU is disabled, the driver still enumerates and initialises
the hardware in order to turn it off. Because IRQ remapping setup is
done early, the irqdomain is set up opportunistically.

In commit b34f10c2dc59 ("iommu/amd: Stop irq_remapping_select() matching
when remapping is disabled") I already make the irq_remapping_select()
function check the amd_iommu_irq_setup flag because that might get
cleared only after the irqdomain setup is done, when the IVRS is parsed.

However, in the case where 'amd_iommu=off' is passed on the command line,
the IRQ remapping setup isn't done but the amd_iommu_irq_setup flag is
still set by the early IRQ remap init code. Stop it doing that, by
bailing out of amd_iommu_prepare() early when it's disabled.

This avoids the crash in irq_remapping_select() as it dereferences the
NULL amd_iommu_rlookup_table[]:

[0.243659] Switched APIC routing to physical x2apic.
[0.262206] BUG: kernel NULL pointer dereference, address: 0500
[0.262927] #PF: supervisor read access in kernel mode
[0.263390] #PF: error_code(0x) - not-present page
[0.263844] PGD 0 P4D 0
[0.264135] Oops:  [#1] SMP PTI
[0.264460] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.12.0-rc3 #831
[0.265069] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.14.0-1.fc33 04/01/2014
[0.265825] RIP: 0010:irq_remapping_select+0x57/0xb0
[0.266327] Code: 4b 0c 48 3d 30 e0 a7 9e 75 0d eb 35 48 8b 00 48 3d 30 e0 
a7 9e 74 2a 0f b6 50 10 39 d1 75 ed 0f b7 40 12 48 8b 15 69 e3 d2 01 <48> 8b 14 
c2 48 85 d2 74 0e b8 01 00 00 00 48 3b aa 90 04 00 00 74
[0.268412] RSP: :9e803db0 EFLAGS: 00010246
[0.268919] RAX: 00a0 RBX: 9e803df8 RCX: 
[0.269550] RDX:  RSI:  RDI: 98120112fe79
[0.270245] RBP: 9812011c8218 R08: 0001 R09: 000a
[0.270922] R10: 000a R11: f000 R12: 9812011c8218
[0.271549] R13: 98120181ed88 R14:  R15: 
[0.272221] FS:  () GS:98127dc0() 
knlGS:
[0.272997] CS:  0010 DS:  ES:  CR0: 80050033
[0.273508] CR2: 0500 CR3: 3081 CR4: 06b0
[0.274178] Call Trace:
[0.274416]  irq_find_matching_fwspec+0x41/0xc0
[0.274812]  mp_irqdomain_create+0x65/0x150
[0.275251]  setup_IO_APIC+0x70/0x811

Fixes: a1a785b57242 ("iommu/amd: Implement select() method on remapping 
irqdomain")
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212017
Signed-off-by: David Woodhouse 
---
 drivers/iommu/amd/init.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index 9126efcbaf2c..07edd837b076 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -2998,6 +2998,9 @@ int __init amd_iommu_prepare(void)
 {
int ret;
 
+   if (amd_iommu_disabled)
+   return -ENODEV;
+
amd_iommu_irq_remap = true;
 
ret = iommu_go_to_state(IOMMU_ACPI_FINISHED);
-- 
2.29.2



Re: [EXTERNAL] [PATCH] KVM: x86: allow compiling out the Xen hypercall interface

2021-02-26 Thread David Woodhouse
On Fri, 2021-02-26 at 05:14 -0500, Paolo Bonzini wrote:
> The Xen hypercall interface adds to the attack surface of the hypervisor
> and will be used quite rarely.  Allow compiling it out.
> 
> Suggested-by: Christoph Hellwig 
> Cc: David Woodhouse 
> Signed-off-by: Paolo Bonzini 
> ---
>  arch/x86/kvm/Kconfig  |  9 
>  arch/x86/kvm/Makefile |  3 ++-
>  arch/x86/kvm/x86.c|  2 ++
>  arch/x86/kvm/xen.h| 49 ++-
>  4 files changed, 61 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 7ac592664c52..5a0d704581bd 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -103,6 +103,15 @@ config KVM_AMD_SEV
>   Provides support for launching Encrypted VMs (SEV) and Encrypted VMs
>   with Encrypted State (SEV-ES) on AMD processors.
> 
> +config KVM_XEN
> +   bool "Support for Xen hypercall interface"
> +   depends on KVM && IA32_FEAT_CTL
> +   help
> + Provides KVM support for the Xen shared information page and
> + for passing Xen hypercalls to userspace.

I'd be a bit less specific there. Just "support for hosting Xen HVM
guests" perhaps? It already includes basic event channel support, I'm
posted an RFC for the runstate stuff, and more event channel
acceleration is next...


> + If in doubt, say "N".
> +
>  config KVM_MMU_AUDIT
> bool "Audit KVM MMU"
> depends on KVM && TRACEPOINTS
> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index aeab168c5711..1b4766fe1de2 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -14,11 +14,12 @@ kvm-y   += $(KVM)/kvm_main.o 
> $(KVM)/coalesced_mmio.o \
> $(KVM)/dirty_ring.o
>  kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
> 
> -kvm-y  += x86.o emulate.o i8259.o irq.o lapic.o xen.o \
> +kvm-y  += x86.o emulate.o i8259.o irq.o lapic.o \
>i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
>hyperv.o debugfs.o mmu/mmu.o mmu/page_track.o \
>mmu/spte.o
>  kvm-$(CONFIG_X86_64) += mmu/tdp_iter.o mmu/tdp_mmu.o
> +kvm-$(CONFIG_KVM_XEN)  += xen.o
> 
>  kvm-intel-y+= vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o 
> vmx/vmcs12.o \
>vmx/evmcs.o vmx/nested.o vmx/posted_intr.o
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index bfc928495bd4..9727295b7817 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8039,8 +8039,10 @@ void kvm_arch_exit(void)
> kvm_mmu_module_exit();
> free_percpu(user_return_msrs);
> kmem_cache_destroy(x86_fpu_cache);
> +#ifdef CONFIG_KVM_XEN
> static_key_deferred_flush(&kvm_xen_enabled);
> WARN_ON(static_branch_unlikely(&kvm_xen_enabled.key));
> +#endif

We could always just drop that. It's just paranoia.

>  }
> 
>  static int __kvm_vcpu_halt(struct kvm_vcpu *vcpu, int state, int reason)
> diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h
> index b66a921776f4..6abdbebb029b 100644
> --- a/arch/x86/kvm/xen.h
> +++ b/arch/x86/kvm/xen.h
> @@ -9,6 +9,7 @@
>  #ifndef __ARCH_X86_KVM_XEN_H__
>  #define __ARCH_X86_KVM_XEN_H__
> 
> +#ifdef CONFIG_KVM_XEN
>  #include 
> 
>  extern struct static_key_false_deferred kvm_xen_enabled;
> @@ -18,7 +19,6 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct 
> kvm_xen_vcpu_attr *data)
>  int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr 
> *data);
>  int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data);
>  int kvm_xen_hvm_get_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data);
> -int kvm_xen_hypercall(struct kvm_vcpu *vcpu);
>  int kvm_xen_write_hypercall_page(struct kvm_vcpu *vcpu, u64 data);
>  int kvm_xen_hvm_config(struct kvm *kvm, struct kvm_xen_hvm_config *xhc);
>  void kvm_xen_destroy_vm(struct kvm *kvm);
> @@ -38,6 +38,53 @@ static inline int kvm_xen_has_interrupt(struct kvm_vcpu 
> *vcpu)
> 
> return 0;
>  }
> +#else
> +static inline int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct 
> kvm_xen_vcpu_attr *data)
> +{
> +   return -EINVAL;
> +}
> +
> +static inline int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct 
> kvm_xen_vcpu_attr *data)
> +{
> +   return -EINVAL;
> +}
> +
> +static inline int kvm_xen_hvm_set_attr(struct kvm *kvm, struct 
> kvm_xen_hvm_attr *data)
> +{
> +   return -EINVAL;
> +}
> +
> +static inline int kvm_xen_hvm_ge

Re: [PATCH] use x86 cpu park to speedup smp_init in kexec situation

2021-02-16 Thread David Woodhouse
On Tue, 2021-02-16 at 15:10 +, David Woodhouse wrote:
> Actually it breaks before that, in rcu_cpu_starting(). A spinlock
> around that, an atomic_t to let the APs do their TSC sync one at a time
> (both in the above tree now), and I have a 75% saving on CPU bringup
> time for my 28-thread Haswell:

Here's a dual-socket Skylake box (EC2 c5.metal)... before:

[1.449015] smp: Bringing up secondary CPUs ...
[1.449358] x86: Booting SMP configuration:
[1.449578]  node  #0, CPUs:#1  #2  #3  #4  #5  #6  #7  #8  #9 
#10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23
[1.514986]  node  #1, CPUs:   #24 #25 #26 #27 #28 #29 #30 #31 #32 #33 
#34 #35 #36 #37 #38 #39 #40 #41 #42 #43 #44 #45 #46 #47
[1.644978]  node  #0, CPUs:   #48 #49 #50 #51 #52 #53 #54 #55 #56 #57 
#58 #59 #60 #61 #62 #63 #64 #65 #66 #67 #68 #69 #70 #71
[1.711970]  node  #1, CPUs:   #72 #73 #74 #75 #76 #77 #78 #79 #80 #81 
#82 #83 #84 #85 #86 #87 #88 #89 #90 #91 #92 #93 #94 #95
[1.781438] Brought CPUs online in 1063391744 cycles
[1.782289] smp: Brought up 2 nodes, 96 CPUs
[1.782515] smpboot: Max logical packages: 2
[1.782738] smpboot: Total of 96 processors activated (576364.89 BogoMIPS)

... and after:

[1.445661] smp: Bringing up secondary CPUs ...
[1.446004] x86: Booting SMP configuration:
[1.446047]  node  #0, CPUs:#1  #2  #3  #4  #5  #6  #7  #8  #9 
#10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23
[1.451048]  node  #1, CPUs:   #24 #25 #26 #27 #28 #29 #30 #31 #32 #33 
#34 #35 #36 #37 #38 #39 #40 #41 #42 #43 #44 #45 #46 #47
[1.455046]  node  #0, CPUs:   #48 #49 #50 #51 #52 #53 #54 #55 #56 #57 
#58 #59 #60 #61 #62 #63 #64 #65 #66 #67 #68 #69 #70 #71
[1.462050]  node  #1, CPUs:   #72 #73 #74 #75 #76 #77 #78 #79 #80 #81 
#82 #83 #84 #85 #86 #87 #88 #89 #90 #91 #92 #93 #94 #95
[1.465983] Brought 96 CPUs to x86/cpu:kick in 68802688 cycles
[0.094170] smpboot: CPU 26 Converting physical 1 to logical package 2
[1.468078] Brought 96 CPUs to x86/cpu:wait-init in 5178300 cycles
[1.476112] Brought CPUs online in 23479546 cycles
[1.476298] smp: Brought up 2 nodes, 96 CPUs
[1.476520] smpboot: Max logical packages: 2
[1.476743] smpboot: Total of 96 processors activated (576000.00 BogoMIPS)

So the CPU bringup went from 334ms to 31ms on that one.

I might try that without spewing over 1KiB to the serial port at 115200
baud during the proceedings, and see if it makes a bigger difference :)




smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH] use x86 cpu park to speedup smp_init in kexec situation

2021-02-16 Thread David Woodhouse
On Tue, 2021-02-16 at 13:53 +, David Woodhouse wrote:
> I threw it into my tree at
> https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/parallel
>
> It seems to work fairly nicely. The parallel SIPI seems to win be about
> a third of the bringup time on my 28-thread Haswell box. This is at the
> penultimate commit of the above branch:
> 
> [0.307590] smp: Bringing up secondary CPUs ...
> [0.307826] x86: Booting SMP configuration:
> [0.307830]  node  #0, CPUs:#1  #2  #3  #4  #5  #6  #7  #8  #9 
> #10 #11 #12 #13 #14
> [0.376677] MDS CPU bug present and SMT on, data leak possible. See 
> https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more 
> details.
> [0.377177]  #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27
> [0.402323] Brought CPUs online in 246691584 cycles
> [0.402323] smp: Brought up 1 node, 28 CPUs
> 
> ... and this is the tip of the branch:
> 
> [0.308332] smp: Bringing up secondary CPUs ... 
> [0.308569] x86: Booting SMP configuration:
> [0.308572]  node  #0, CPUs:#1  #2  #3  #4  #5  #6  #7  #8  #9 
> #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27
> [0.321120] Brought 28 CPUs to x86/cpu:kick in 34828752 cycles
> [0.33] MDS CPU bug present and SMT on, data leak possible. See 
> https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more 
> details.
> [0.368749] Brought CPUs online in 124913032 cycles
> [0.368749] smp: Brought up 1 node, 28 CPUs
> [0.368749] smpboot: Max logical packages: 1
> [0.368749] smpboot: Total of 28 processors activated (145259.85 BogoMIPS)
> 
> There's more to be gained here if we can fix up the next stage. Right
> now if I set every CPU's bit in cpu_initialized_mask to allow them to
> proceed from wait_for_master_cpu() through to the end of cpu_init() and
> onwards through start_secondary(), they all end up hitting
> check_tsc_sync_target() in parallel and it goes horridly wrong.

Actually it breaks before that, in rcu_cpu_starting(). A spinlock
around that, an atomic_t to let the APs do their TSC sync one at a time
(both in the above tree now), and I have a 75% saving on CPU bringup
time for my 28-thread Haswell:

[0.307341] smp: Bringing up secondary CPUs ...
[0.307576] x86: Booting SMP configuration:
[0.307579]  node  #0, CPUs:#1  #2  #3  #4  #5  #6  #7  #8  #9 
#10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27
[0.320100] Brought 28 CPUs to x86/cpu:kick in 34645984 cycles
[0.325032] Brought 28 CPUs to x86/cpu:wait-init in 12865752 cycles
[0.326902] MDS CPU bug present and SMT on, data leak possible. See 
https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more 
details.
[0.328739] Brought CPUs online in 11702224 cycles
[0.328739] smp: Brought up 1 node, 28 CPUs
[0.328739] smpboot: Max logical packages: 1
[0.328739] smpboot: Total of 28 processors activated (145261.81 BogoMIPS)



smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH] use x86 cpu park to speedup smp_init in kexec situation

2021-02-16 Thread David Woodhouse
On Fri, 2021-02-12 at 18:30 +0100, Thomas Gleixner wrote:
> On Fri, Feb 12 2021 at 01:29, Thomas Gleixner wrote:
> > On Thu, Feb 11 2021 at 22:58, David Woodhouse wrote:
> > I have no problem with making that jump based. It does not matter at
> > all. But you can't use the idle task stack before the CR3 switch in
> > secondary_startup_64 - unless I'm missing something.
> > 
> > And there's more than 'call verify_cpu' on the way which uses stack. Let
> > me stare more tomorrow with brain awake.
> 
> The below boots on top of mainline. It probably breaks i386 and XEN and
> whatever :)

Nice. As discussed on IRC, you'll need more than 0x for APIC IDs
but I think you can probably get away with 0xF for physical APIC ID
since nothing more than that can fit in 32 bits of *logical* X2APIC ID.

> I didn't come around to play with your patches yet and won't be able to
> do so before next week.

I threw it into my tree at
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/parallel

It seems to work fairly nicely. The parallel SIPI seems to win be about
a third of the bringup time on my 28-thread Haswell box. This is at the
penultimate commit of the above branch:

[0.307590] smp: Bringing up secondary CPUs ...
[0.307826] x86: Booting SMP configuration:
[0.307830]  node  #0, CPUs:#1  #2  #3  #4  #5  #6  #7  #8  #9 
#10 #11 #12 #13 #14
[0.376677] MDS CPU bug present and SMT on, data leak possible. See 
https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more 
details.
[0.377177]  #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27
[0.402323] Brought CPUs online in 246691584 cycles
[0.402323] smp: Brought up 1 node, 28 CPUs

... and this is the tip of the branch:

[0.308332] smp: Bringing up secondary CPUs ... 
[0.308569] x86: Booting SMP configuration:
[0.308572]  node  #0, CPUs:#1  #2  #3  #4  #5  #6  #7  #8  #9 
#10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27
[0.321120] Brought 28 CPUs to x86/cpu:kick in 34828752 cycles
[0.33] MDS CPU bug present and SMT on, data leak possible. See 
https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more 
details.
[0.368749] Brought CPUs online in 124913032 cycles
[0.368749] smp: Brought up 1 node, 28 CPUs
[0.368749] smpboot: Max logical packages: 1
[0.368749] smpboot: Total of 28 processors activated (145259.85 BogoMIPS)

There's more to be gained here if we can fix up the next stage. Right
now if I set every CPU's bit in cpu_initialized_mask to allow them to
proceed from wait_for_master_cpu() through to the end of cpu_init() and
onwards through start_secondary(), they all end up hitting
check_tsc_sync_target() in parallel and it goes horridly wrong.

A simple answer might be to have another synchronisation point right
before the TSC sync, to force them to do it one at a time but *without*
having to wait for the rest of the AP bringup in series.

Other answers include inventing a magical parallel TSC sync algorithm
that isn't 1-to-1 between the BSP and each AP. Or observing that we're
running in KVM or after kexec, and we *know* they're in sync anyway and
bypassing the whole bloody nonsense.

Recall, my initial debug patch timing the four stages of the bringup,
which I've left in my branch above for visibility, was showing this
kind of output with the original serial bringup...

> [0.285312] CPU#10 up in 192950,952898,  60014786,28 (  
> 61160662)
> [0.288311] CPU#11 up in 181092,962704,  60010432,30 (  
> 61154258)
> [0.291312] CPU#12 up in 386080,970416,  60013956,28 (  
> 61370480)
> [0.294311] CPU#13 up in 372782,964506,  60010564,28 (  
> 61347880)
> [0.297312] CPU#14 up in 389602,976280,  60013046,28 (  
> 61378956)
> [0.300312] CPU#15 up in 213132,968148,  60012138,28 (  
> 61193446)

So that's almost a million cycles per CPU of do_wait_cpu_callin()
before we get to the TSC sync (which is insanely expensive on KVM).
It's something like three times the time taken by the SIPI+wait in the
first phase, which is what we've parallelised so far (at a gain of
about 33%).



> ---
>  arch/x86/include/asm/realmode.h  |3 +
>  arch/x86/include/asm/smp.h   |9 +++-
>  arch/x86/kernel/acpi/sleep.c |1 
>  arch/x86/kernel/apic/apic.c  |2 
>  arch/x86/kernel/head_64.S|   71 
> +++
>  arch/x86/kernel/smpboot.c|   19 -
>  arch/x86/realmode/init.c |3 +
>  arch/x86/realmode/rm/trampoline_64.S |   14 ++
>  kernel/smpboot.c |2 
>  9 files ch

Re: [EXTERNAL] [PATCH 4/5] KVM: sefltests: Don't bother mapping GVA for Xen shinfo test

2021-02-10 Thread David Woodhouse
On Wed, 2021-02-10 at 10:26 -0800, Sean Christopherson wrote:
> Don't bother mapping the Xen shinfo pages into the guest, they don't need
> to be accessed using the GVAs and passing a define with "GPA" in the name
> to addr_gva2hpa() is confusing.
> 
> Cc: David Woodhouse 
> Signed-off-by: Sean Christopherson 

Makes sense. We did actually use the mapping in an earlier iteration of
the test when it was still testing the runstate info, but I ripped that
out for the final merge because it needs more thought.

Reviewed-by: David Woodhouse 

> ---
>  tools/testing/selftests/kvm/x86_64/xen_shinfo_test.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/tools/testing/selftests/kvm/x86_64/xen_shinfo_test.c 
> b/tools/testing/selftests/kvm/x86_64/xen_shinfo_test.c
> index b2a3be9eba8e..9246ea310587 100644
> --- a/tools/testing/selftests/kvm/x86_64/xen_shinfo_test.c
> +++ b/tools/testing/selftests/kvm/x86_64/xen_shinfo_test.c
> @@ -80,7 +80,6 @@ int main(int argc, char *argv[])
> /* Map a region for the shared_info page */
> vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
> SHINFO_REGION_GPA, SHINFO_REGION_SLOT, 2, 
> 0);
> -   virt_map(vm, SHINFO_REGION_GPA, SHINFO_REGION_GPA, 2, 0);
> 
> struct kvm_xen_hvm_config hvmc = {
> .flags = KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL,
> @@ -147,9 +146,9 @@ int main(int argc, char *argv[])
> struct pvclock_wall_clock *wc;
> struct pvclock_vcpu_time_info *ti, *ti2;
> 
> -   wc = addr_gva2hva(vm, SHINFO_REGION_GPA + 0xc00);
> -   ti = addr_gva2hva(vm, SHINFO_REGION_GPA + 0x40 + 0x20);
> -   ti2 = addr_gva2hva(vm, PVTIME_ADDR);
> +   wc = addr_gpa2hva(vm, SHINFO_REGION_GPA + 0xc00);
> +   ti = addr_gpa2hva(vm, SHINFO_REGION_GPA + 0x40 + 0x20);
> +   ti2 = addr_gpa2hva(vm, PVTIME_ADDR);
> 
> vm_ts.tv_sec = wc->sec;
> vm_ts.tv_nsec = wc->nsec;



smime.p7s
Description: S/MIME cryptographic signature


Re: [EXTERNAL] [PATCH 2/5] KVM: selftests: Fix size of memslots created by Xen tests

2021-02-10 Thread David Woodhouse
On Wed, 2021-02-10 at 10:26 -0800, Sean Christopherson wrote:
> For better or worse, the memslot APIs take the number of pages, not the
> size in bytes.  The Xen tests need 2 pages, not 8192 pages.
> 
> Fixes: 8d4e7e80838f ("KVM: x86: declare Xen HVM shared info capability and 
> add test case")
> Cc: David Woodhouse 
> Signed-off-by: Sean Christopherson 

Reviewed-by: David Woodhouse 



smime.p7s
Description: S/MIME cryptographic signature


Re: [EXTERNAL] [PATCH 1/5] KVM: selftests: Ignore recently added Xen tests' build output

2021-02-10 Thread David Woodhouse
On Wed, 2021-02-10 at 10:26 -0800, Sean Christopherson wrote:
> Add the new Xen test binaries to KVM selftest's .gitnore.
> 
> Signed-off-by: Sean Christopherson 
> ---

Reviewed-by: David Woodhouse 




smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH] sign-file: add openssl engine support

2021-02-10 Thread David Woodhouse



On 10 February 2021 07:45:54 GMT, Yang Song  wrote:
>Use a customized signature service supported by openssl engine
>to sign the kernel module.
>Add command line parameters that support engine for sign-file
>to use the customized openssl engine service to sign kernel modules.
>
>Signed-off-by: Yang Song 

Aren't engines already obsolete in the latest versions of OpenSSL, as well as 
being an implementation detail of one particular crypto library? They aren't 
really a concept we should be exposing in *our* user interface.

Better to make sign-file automatically recognise RFC7512 PKCS#11 URIs and 
handle them by automatically loading the PKCS#11 engine.

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [EXTERNAL] linux-next: Signed-off-by missing for commit in the kvm tree

2021-02-05 Thread David Woodhouse
On Fri, 2021-02-05 at 14:36 +, Joao Martins wrote:
> On 2/4/21 8:18 PM, Stephen Rothwell wrote:
> > Hi all,
> > 
> > Commit
> > 
> >   79033bebf6fa ("KVM: x86/xen: Fix coexistence of Xen and Hyper-V
> > hypercalls")
> > 
> > is missing a Signed-off-by from its author.
> > 
> 
> Except that David is the author of this particular patch, not me.

Hm, I may have been overly zealous in ensuring that you retained the
alleged authorship of most of this work... :)



smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH v3 00/17] Introducing Linux root partition support for Microsoft Hypervisor

2021-02-02 Thread David Woodhouse
On Tue, 2020-12-15 at 16:42 +, Wei Liu wrote:
> On Tue, Dec 15, 2020 at 04:25:03PM +0100, Enrico Weigelt, metux IT consult 
> wrote:
> > On 03.12.20 00:22, Wei Liu wrote:
> > 
> > Hi,
> > 
> > > I don't follow. Do you mean reusing /dev/kvm but with a different set of
> > > APIs underneath? I don't think that will work.
> > 
> > My idea was using the same uapi for both hypervisors, so that we can use
> > the same userlands for both.
> > 
> > Are the semantis so different that we can't provide the same API ?
> 
> We can provide some similar APIs for ease of porting, but can't provide
> 1:1 mappings. By definition KVM and MSHV are two different things. There
> is no goal to make one ABI / API compatible with the other.

I'm not sure I understand.

KVM is the Linux userspace API for virtualisation. It is designed to be
versatile enough that it can support multiple implementations across
multiple architectures, including both AMD SVM and Intel VMX on x86.

Are you saying that KVM has *failed* to be versatile enough that this
can be "just another implementation"? What are the problems? Is it
unfixable? 


smime.p7s
Description: S/MIME cryptographic signature


[PATCH 6/6] pre states for x86

2021-02-01 Thread David Woodhouse
Utterly broken because the bringup uses global initial_stack and initial_gs
variables, and the TSC sync is similarly hosed (I should probably do one
at a time).

The bringup is going to be the most fun to fix, because the AP coming up
doesn't actually have a lot that it *can* use to disambiguate itself from
the others. Starting them at different locations is mostly a non-starter
as we can only specify a page address under 1MiB and there are only 256
of those even if we could allocate them *all* to start different CPUs.
I think we need to get them to find their own logical CPU# from their
APICID, perhaps by trawling the per_cpu x86_cpu_to_apicid or some other
means, then find their initial stack / %gs from that.

We also need to work out if we can eliminate the real-mode stack for the
trampoline, or do some locking if we really must share it perhaps.
---
 arch/x86/kernel/smpboot.c | 24 
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 649b8236309b..03f63027fdad 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -57,6 +57,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1217,14 +1218,6 @@ int native_cpu_up(unsigned int cpu, struct task_struct 
*tidle)
 {
int ret;
 
-   ret = do_cpu_up(cpu, tidle);
-   if (ret)
-   return ret;
-
-   ret = do_wait_cpu_initialized(cpu);
-   if (ret)
-   return ret;
-
ret = do_wait_cpu_callin(cpu);
if (ret)
return ret;
@@ -1241,6 +1234,16 @@ int native_cpu_up(unsigned int cpu, struct task_struct 
*tidle)
return ret;
 }
 
+int native_cpu_kick(unsigned int cpu)
+{
+   return do_cpu_up(cpu, idle_thread_get(cpu));
+}
+
+int native_cpu_wait_init(unsigned int cpu)
+{
+   return do_wait_cpu_initialized(cpu);
+}
+
 /**
  * arch_disable_smp_support() - disables SMP support for x86 at runtime
  */
@@ -1412,6 +1415,11 @@ void __init native_smp_prepare_cpus(unsigned int 
max_cpus)
smp_quirk_init_udelay();
 
speculative_store_bypass_ht_init();
+
+   cpuhp_setup_state_nocalls(CPUHP_BP_PARALLEL_DYN, "x86/cpu:kick",
+ native_cpu_kick, NULL);
+   cpuhp_setup_state_nocalls(CPUHP_BP_PARALLEL_DYN, "x86/cpu:wait-init",
+ native_cpu_wait_init, NULL);
 }
 
 void arch_thaw_secondary_cpus_begin(void)
-- 
2.29.2



[PATCH 4/6] x86/smpboot: Split up native_cpu_up into separate phases

2021-02-01 Thread David Woodhouse
There are four logical parts to what native_cpu_up() does.

First it actually wakes AP.

Second, it waits for the AP to make it as far as wait_for_master_cpu()
which sets that CPU's bit in cpu_initialized_mask, and sets the bit in
cpu_callout_mask to let the AP proceed through cpu_init().

Then, it waits for the AP to finish cpu_init() and get as far as the
smp_callin() call, which sets that CPU's bit in cpu_callin_mask.

Finally, it does the TSC synchronization and waits for the AP to actually
mark itself online in cpu_online_mask.

This commit should have no behavioural change, but merely splits those
phases out into separate functions so that future commits and make them
happen in parallel for all APs.

Signed-off-by: David Woodhouse 
---
 arch/x86/kernel/smpboot.c | 136 +++---
 1 file changed, 82 insertions(+), 54 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index bec0059d3d3d..649b8236309b 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1039,9 +1039,7 @@ static int do_boot_cpu(int apicid, int cpu, struct 
task_struct *idle,
 {
/* start_ip had better be page-aligned! */
unsigned long start_ip = real_mode_header->trampoline_start;
-
unsigned long boot_error = 0;
-   unsigned long timeout;
 
idle->thread.sp = (unsigned long)task_pt_regs(idle);
early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(cpu);
@@ -1094,55 +1092,70 @@ static int do_boot_cpu(int apicid, int cpu, struct 
task_struct *idle,
boot_error = wakeup_cpu_via_init_nmi(cpu, start_ip, apicid,
 cpu0_nmi_registered);
 
-   if (!boot_error) {
-   /*
-* Wait 10s total for first sign of life from AP
-*/
-   boot_error = -1;
-   timeout = jiffies + 10*HZ;
-   while (time_before(jiffies, timeout)) {
-   if (cpumask_test_cpu(cpu, cpu_initialized_mask)) {
-   /*
-* Tell AP to proceed with initialization
-*/
-   cpumask_set_cpu(cpu, cpu_callout_mask);
-   boot_error = 0;
-   break;
-   }
-   schedule();
-   }
-   }
+   return boot_error;
+}
 
-   if (!boot_error) {
-   /*
-* Wait till AP completes initial initialization
-*/
-   while (!cpumask_test_cpu(cpu, cpu_callin_mask)) {
-   /*
-* Allow other tasks to run while we wait for the
-* AP to come online. This also gives a chance
-* for the MTRR work(triggered by the AP coming online)
-* to be completed in the stop machine context.
-*/
-   schedule();
-   }
+static int do_wait_cpu_cpumask(unsigned int cpu, const struct cpumask *mask)
+{
+   unsigned long timeout;
+
+   /*
+* Wait up to 10s for the CPU to report in.
+*/
+   timeout = jiffies + 10*HZ;
+   while (time_before(jiffies, timeout)) {
+   if (cpumask_test_cpu(cpu, mask))
+   return 0;
+
+   schedule();
}
+   return -1;
+}
 
-   if (x86_platform.legacy.warm_reset) {
-   /*
-* Cleanup possible dangling ends...
-*/
-   smpboot_restore_warm_reset_vector();
+static int do_wait_cpu_initialized(unsigned int cpu)
+{
+   /*
+* Wait for first sign of life from AP.
+*/
+   if (do_wait_cpu_cpumask(cpu, cpu_initialized_mask))
+   return -1;
+
+   cpumask_set_cpu(cpu, cpu_callout_mask);
+   return 0;
+}
+
+static int do_wait_cpu_callin(unsigned int cpu)
+{
+   /*
+* Wait till AP completes initial initialization.
+*/
+   return do_wait_cpu_cpumask(cpu, cpu_callin_mask);
+}
+
+static int do_wait_cpu_online(unsigned int cpu)
+{
+   unsigned long flags;
+
+   /*
+* Check TSC synchronization with the AP (keep irqs disabled
+* while doing so):
+*/
+   local_irq_save(flags);
+   check_tsc_sync_source(cpu);
+   local_irq_restore(flags);
+
+   while (!cpu_online(cpu)) {
+   cpu_relax();
+   touch_nmi_watchdog();
}
 
-   return boot_error;
+   return 0;
 }
 
-int native_cpu_up(unsigned int cpu, struct task_struct *tidle)
+int do_cpu_up(unsigned int cpu, struct task_struct *tidle)
 {
int apicid = apic->cpu_present_to_apicid(cpu);
int cpu0_nmi_registered = 0;
-   unsigned long flags;
int err, ret = 0;
 
lockdep_assert_irqs_enabled();
@@ -118

[PATCH 2/6] cpu/hotplug: Add dynamic states before CPUHP_BRINGUP_CPU for parallel bringup

2021-02-01 Thread David Woodhouse
If the platform registers these states, bring all CPUs to each registered
state in parallel, before the final bringup to CPUHP_BRINGUP_CPU.

Signed-off-by: David Woodhouse 
---
 include/linux/cpuhotplug.h |  2 ++
 kernel/cpu.c   | 27 +--
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 0042ef362511..ae07cd1dafe3 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -92,6 +92,8 @@ enum cpuhp_state {
CPUHP_MIPS_SOC_PREPARE,
CPUHP_BP_PREPARE_DYN,
CPUHP_BP_PREPARE_DYN_END= CPUHP_BP_PREPARE_DYN + 20,
+   CPUHP_BP_PARALLEL_DYN,
+   CPUHP_BP_PARALLEL_DYN_END   = CPUHP_BP_PARALLEL_DYN + 4,
CPUHP_BRINGUP_CPU,
CPUHP_AP_IDLE_DEAD,
CPUHP_AP_OFFLINE,
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 4e11e91010e1..11f9504b4845 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1335,6 +1335,24 @@ int bringup_hibernate_cpu(unsigned int sleep_cpu)
 void bringup_nonboot_cpus(unsigned int setup_max_cpus)
 {
unsigned int cpu;
+   int n = setup_max_cpus - num_online_cpus();
+
+   /* ∀ parallel pre-bringup state, bring N CPUs to it */
+   if (n > 0) {
+   enum cpuhp_state st = CPUHP_BP_PARALLEL_DYN;
+
+   while (st <= CPUHP_BP_PARALLEL_DYN_END &&
+  cpuhp_hp_states[st].name) {
+   int i = n;
+
+   for_each_present_cpu(cpu) {
+   cpu_up(cpu, st);
+   if (!--i)
+   break;
+   }
+   st++;
+   }
+   }
 
for_each_present_cpu(cpu) {
if (num_online_cpus() >= setup_max_cpus)
@@ -1702,6 +1720,10 @@ static int cpuhp_reserve_state(enum cpuhp_state state)
step = cpuhp_hp_states + CPUHP_BP_PREPARE_DYN;
end = CPUHP_BP_PREPARE_DYN_END;
break;
+   case CPUHP_BP_PARALLEL_DYN:
+   step = cpuhp_hp_states + CPUHP_BP_PARALLEL_DYN;
+   end = CPUHP_BP_PARALLEL_DYN_END;
+   break;
default:
return -EINVAL;
}
@@ -1726,14 +1748,15 @@ static int cpuhp_store_callbacks(enum cpuhp_state 
state, const char *name,
/*
 * If name is NULL, then the state gets removed.
 *
-* CPUHP_AP_ONLINE_DYN and CPUHP_BP_PREPARE_DYN are handed out on
+* CPUHP_AP_ONLINE_DYN and CPUHP_BP_x_DYN are handed out on
 * the first allocation from these dynamic ranges, so the removal
 * would trigger a new allocation and clear the wrong (already
 * empty) state, leaving the callbacks of the to be cleared state
 * dangling, which causes wreckage on the next hotplug operation.
 */
if (name && (state == CPUHP_AP_ONLINE_DYN ||
-state == CPUHP_BP_PREPARE_DYN)) {
+state == CPUHP_BP_PREPARE_DYN ||
+state == CPUHP_BP_PARALLEL_DYN)) {
ret = cpuhp_reserve_state(state);
if (ret < 0)
return ret;
-- 
2.29.2



[PATCH 5/6] cpu/hotplug: Move idle_thread_get() to

2021-02-01 Thread David Woodhouse
Instead of relying purely on the special-case wrapper in bringup_cpu()
to pass the idle thread to __cpu_up(), expose idle_thread_get() so that
the architecture code can obtain it directly when necessary.

This will be useful when the existing __cpu_up() is split into multiple
phases, only *one* of which will actually need the idle thread.

If the architecture code is to register its new pre-bringup states with
the cpuhp core, having a special-case wrapper to pass extra arguments is
non-trivial and it's easier just to let the arch register its function
pointer to be invoked with the standard API.

Signed-off-by: David Woodhouse 
---
 include/linux/smpboot.h | 7 +++
 kernel/smpboot.h| 2 --
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/smpboot.h b/include/linux/smpboot.h
index 9d1bc65d226c..3862addcaa34 100644
--- a/include/linux/smpboot.h
+++ b/include/linux/smpboot.h
@@ -5,6 +5,13 @@
 #include 
 
 struct task_struct;
+
+#ifdef CONFIG_GENERIC_SMP_IDLE_THREAD
+struct task_struct *idle_thread_get(unsigned int cpu);
+#else
+static inline struct task_struct *idle_thread_get(unsigned int cpu) { return 
NULL; }
+#endif
+
 /* Cookie handed to the thread_fn*/
 struct smpboot_thread_data;
 
diff --git a/kernel/smpboot.h b/kernel/smpboot.h
index 34dd3d7ba40b..60c609318ad6 100644
--- a/kernel/smpboot.h
+++ b/kernel/smpboot.h
@@ -5,11 +5,9 @@
 struct task_struct;
 
 #ifdef CONFIG_GENERIC_SMP_IDLE_THREAD
-struct task_struct *idle_thread_get(unsigned int cpu);
 void idle_thread_set_boot_cpu(void);
 void idle_threads_init(void);
 #else
-static inline struct task_struct *idle_thread_get(unsigned int cpu) { return 
NULL; }
 static inline void idle_thread_set_boot_cpu(void) { }
 static inline void idle_threads_init(void) { }
 #endif
-- 
2.29.2



[PATCH 3/6] x86/smpboot: Reference count on smpboot_setup_warm_reset_vector()

2021-02-01 Thread David Woodhouse
If we want to do parallel CPU bringup, we're going to need to set this up
and leave it until all CPUs are done. Might as well use the RTC spinlock
to protect the refcount, as we need to take it anyway.

Signed-off-by: David Woodhouse 
---
 arch/x86/kernel/smpboot.c | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 117e24fbfd8a..bec0059d3d3d 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -125,17 +125,22 @@ int arch_update_cpu_topology(void)
return retval;
 }
 
+
+static unsigned int smpboot_warm_reset_vector_count;
+
 static inline void smpboot_setup_warm_reset_vector(unsigned long start_eip)
 {
unsigned long flags;
 
spin_lock_irqsave(&rtc_lock, flags);
-   CMOS_WRITE(0xa, 0xf);
+   if (!smpboot_warm_reset_vector_count++) {
+   CMOS_WRITE(0xa, 0xf);
+   *((volatile unsigned short 
*)phys_to_virt(TRAMPOLINE_PHYS_HIGH)) =
+   start_eip >> 4;
+   *((volatile unsigned short *)phys_to_virt(TRAMPOLINE_PHYS_LOW)) 
=
+   start_eip & 0xf;
+   }
spin_unlock_irqrestore(&rtc_lock, flags);
-   *((volatile unsigned short *)phys_to_virt(TRAMPOLINE_PHYS_HIGH)) =
-   start_eip >> 4;
-   *((volatile unsigned short *)phys_to_virt(TRAMPOLINE_PHYS_LOW)) =
-   start_eip & 0xf;
 }
 
 static inline void smpboot_restore_warm_reset_vector(void)
@@ -147,10 +152,12 @@ static inline void smpboot_restore_warm_reset_vector(void)
 * to default values.
 */
spin_lock_irqsave(&rtc_lock, flags);
-   CMOS_WRITE(0, 0xf);
-   spin_unlock_irqrestore(&rtc_lock, flags);
+   if (!--smpboot_warm_reset_vector_count) {
+   CMOS_WRITE(0, 0xf);
 
-   *((volatile u32 *)phys_to_virt(TRAMPOLINE_PHYS_LOW)) = 0;
+   *((volatile u32 *)phys_to_virt(TRAMPOLINE_PHYS_LOW)) = 0;
+   }
+   spin_unlock_irqrestore(&rtc_lock, flags);
 }
 
 static void init_freq_invariance(bool secondary, bool cppc_ready);
-- 
2.29.2



[PATCH 1/6] x86/apic/x2apic: Fix parallel handling of cluster_mask

2021-02-01 Thread David Woodhouse
For each CPU being brought up, the alloc_clustermask() function
allocates a new struct cluster_mask just in case it's needed. Then the
target CPU actually runs, and in init_x2apic_ldr() it either uses a
cluster_mask from a previous CPU in the same cluster, or consumes the
"spare" one and sets the global pointer to NULL.

That isn't going to parallelise stunningly well.

Ditch the global variable, let alloc_clustermask() install the struct
*directly* in the per_cpu data for the CPU being brought up. As an
optimisation, actually make it do so for *all* present CPUs in the same
cluster, which means only one iteration over for_each_present_cpu()
instead of doing so repeatedly, once for each CPU.

This was a harmless "bug" while CPU bringup wasn't actually happening in
parallel. It's about to become less harmless...

Fixes: 023a611748fd5 ("x86/apic/x2apic: Simplify cluster management")
Signed-off-by: David Woodhouse 
---
 arch/x86/kernel/apic/x2apic_cluster.c | 82 ---
 1 file changed, 49 insertions(+), 33 deletions(-)

diff --git a/arch/x86/kernel/apic/x2apic_cluster.c 
b/arch/x86/kernel/apic/x2apic_cluster.c
index df6adc5674c9..2afa4609496f 100644
--- a/arch/x86/kernel/apic/x2apic_cluster.c
+++ b/arch/x86/kernel/apic/x2apic_cluster.c
@@ -18,7 +18,6 @@ struct cluster_mask {
 static DEFINE_PER_CPU(u32, x86_cpu_to_logical_apicid);
 static DEFINE_PER_CPU(cpumask_var_t, ipi_mask);
 static DEFINE_PER_CPU(struct cluster_mask *, cluster_masks);
-static struct cluster_mask *cluster_hotplug_mask;
 
 static int x2apic_acpi_madt_oem_check(char *oem_id, char *oem_table_id)
 {
@@ -98,54 +97,71 @@ static u32 x2apic_calc_apicid(unsigned int cpu)
 static void init_x2apic_ldr(void)
 {
struct cluster_mask *cmsk = this_cpu_read(cluster_masks);
-   u32 cluster, apicid = apic_read(APIC_LDR);
-   unsigned int cpu;
 
-   this_cpu_write(x86_cpu_to_logical_apicid, apicid);
+   BUG_ON(!cmsk);
 
-   if (cmsk)
-   goto update;
-
-   cluster = apicid >> 16;
-   for_each_online_cpu(cpu) {
-   cmsk = per_cpu(cluster_masks, cpu);
-   /* Matching cluster found. Link and update it. */
-   if (cmsk && cmsk->clusterid == cluster)
-   goto update;
-   }
-   cmsk = cluster_hotplug_mask;
-   cmsk->clusterid = cluster;
-   cluster_hotplug_mask = NULL;
-update:
-   this_cpu_write(cluster_masks, cmsk);
cpumask_set_cpu(smp_processor_id(), &cmsk->mask);
 }
 
-static int alloc_clustermask(unsigned int cpu, int node)
+static int alloc_clustermask(unsigned int cpu, u32 cluster, int node)
 {
+   struct cluster_mask *cmsk = NULL;
+   unsigned int cpu_i;
+   u32 apicid;
+
if (per_cpu(cluster_masks, cpu))
return 0;
-   /*
-* If a hotplug spare mask exists, check whether it's on the right
-* node. If not, free it and allocate a new one.
+
+   /* For the hotplug case, don't always allocate a new one */
+   for_each_present_cpu(cpu_i) {
+   apicid = apic->cpu_present_to_apicid(cpu_i);
+   if (apicid != BAD_APICID && apicid >> 4 == cluster) {
+   cmsk = per_cpu(cluster_masks, cpu_i);
+   if (cmsk)
+   break;
+   }
+   }
+   if (!cmsk) {
+   cmsk = kzalloc_node(sizeof(*cmsk), GFP_KERNEL, node);
+   }
+   if (!cmsk)
+   return -ENOMEM;
+
+   cmsk->node = node;
+   cmsk->clusterid = cluster;
+
+   per_cpu(cluster_masks, cpu) = cmsk;
+
+/*
+* As an optimisation during boot, set the cluster_mask for *all*
+* present CPUs at once, to prevent *each* of them having to iterate
+* over the others to find the existing cluster_mask.
 */
-   if (cluster_hotplug_mask) {
-   if (cluster_hotplug_mask->node == node)
-   return 0;
-   kfree(cluster_hotplug_mask);
+   if (system_state < SYSTEM_RUNNING) {
+   for_each_present_cpu(cpu) {
+   u32 apicid = apic->cpu_present_to_apicid(cpu);
+   if (apicid != BAD_APICID && apicid >> 4 == cluster) {
+   struct cluster_mask **cpu_cmsk = 
&per_cpu(cluster_masks, cpu);
+   if (*cpu_cmsk)
+   BUG_ON(*cpu_cmsk != cmsk);
+   else
+   *cpu_cmsk = cmsk;
+   }
+   }
}
 
-   cluster_hotplug_mask = kzalloc_node(sizeof(*cluster_hotplug_mask),
-   GFP_KERNEL, node);
-   if (!cluster_hotplug_mask)
-   return -ENOMEM;
-   cluster_hotplug_mask->node

Re: [PATCH] use x86 cpu park to speedup smp_init in kexec situation

2021-02-01 Thread David Woodhouse
On Thu, 2021-01-21 at 15:55 +0100, Thomas Gleixner wrote:
> > Here's the hack we're testing with, for reference. It's kind of ugly
> > but you can see where it's going. Note that the CMOS mangling for the
> > warm reset vector is going to need to be lifted out of the per-cpu
> > loop, and done *once* at startup and torn down once in smp_cpus_done.
> > Except that it also needs to be done before/after a hotplug cpu up;
> > we'll have to come back to that but we've just shifted it to
> > native_smp_cpus_done() for testing for now.
> 
> Right. It's at least a start.

Here's what we have now.

I've refcounted the warm reset vector thing which should fix the
hotplug case, although I need to check it gets torn down in the error
cases correctly.

With the X2APIC global variable thing fixed, the new states can be
immediately before CPUHP_BRINGUP_CPU as we originally wanted. I've
fixed up the bringup_nonboot_cpus() loop to bring an appropriate number
of CPUs to those "CPUHP_BP_PARALLEL_DYN" dynamic parallel pre-bringup
states in parallel.

We spent a while vacillating about how to add the new states, because
of the existing special-case hackery in bringup_cpu() for the
CPUHP_BRINGUP_CPU case.

The irq_lock_sparse() and the idle_thread_get() from there might
actually be needed in *earlier* states for platforms which do parallel
bringup so do we add similar wrappers in kernel/cpu.c for *all* of
the pre-bringup states, having hard-coded them? Then let the arch
provide a config symbol for whether it really wants them or not? That
seemed kind of horrid, so I went for the simple option of just letting
the arch register the CPUHP_BP_PARALLEL_DYN states the normal way with
its own functions to be called directly, and the loop in
bringup_nonboot_cpus() can then operate directly on whether they exist
in the state table or not, for which there is precedent already.

That means I needed to export idle_thread_get() for the pre-bringup
state functions to use too. I'll also want to add the irq_lock_sparse()
into my final patch but frankly, that's the least of my worries about
that patch right now.

It's also fairly much a no-brainer to splitting up the x86
native_cpu_up() into the four separate phases that I had got separate
timings for previously. We can do that just as a "cleanup" with no
functional change.

So I'm relatively happy at least that far, as preparatory work...

David Woodhouse (6):
  x86/apic/x2apic: Fix parallel handling of cluster_mask
  cpu/hotplug: Add dynamic states before CPUHP_BRINGUP_CPU for parallel 
bringup
  x86/smpboot: Reference count on smpboot_setup_warm_reset_vector()
  x86/smpboot: Split up native_cpu_up into separate phases
  cpu/hotplug: Move idle_thread_get() to 

 arch/x86/kernel/apic/x2apic_cluster.c |  82 +
 arch/x86/kernel/smpboot.c | 159 
++--
 include/linux/cpuhotplug.h|   2 +
 include/linux/smpboot.h   |   7 +++
 kernel/cpu.c  |  27 +-
 kernel/smpboot.h  |   2 -
 6 files changed, 180 insertions(+), 99 deletions(-)

That's the generic part mostly done, and the fun part is where we turn
back to x86 and actually try to split out those four phases of
native_cpu_up() to happen in parallel.

We store initial_stack and initial_gs for "the" AP that is coming up,
in global variables. It turns out that the APs don't like all sharing
the same stack as they come up in parallel, and weird behaviour ensues.

I think the only thing the AP has that can disambiguate it from other
APs is its CPUID, which it can get in its full 32-bit glory from
CPUID.0BH:EDX (and I think we can say we'll do parallel bringup *only*
of that leaf exists on the boot CPU).

So the trampoline code would need to find the logical CPU# and thus the
idle thread stack and per-cpu data with a lookup based on its APICID.
Perhaps just by trawling the various per-cpu data until it finds one
with the right apicid, much like default_cpu_present_to_apicid() does.

Oh, and ideally it needs to do this without using a real-mode stack,
because they won't like sharing that *either*.

(Actually they don't seem to mind in practice right now because the
only thing they all use it for is a 'call verify_cpu' and they all
place the *same* return address at the same place on the stack, but it
would be horrid to rely on that on *purpose* :)

So we'll continue to work on that in order to enable the parallel
bringup on x86, unless anyone has any cleverer ideas.

After that we'll get to the TSC sync, which is also not working in
parallel.


smime.p7s
Description: S/MIME cryptographic signature


[PATCH] x86/apic/x2apic: Fix parallel handling of cluster_mask

2021-01-21 Thread David Woodhouse
From: David Woodhouse 

For each CPU being brought up, the alloc_clustermask() function
allocates a new struct cluster_mask just in case it's needed. Then the
target CPU actually runs, and in init_x2apic_ldr() it either uses a
cluster_mask from a previous CPU in the same cluster, or consumes the
"spare" one and sets the global pointer to NULL.

That isn't going to parallelise stunningly well.

Ditch the global variable, let alloc_clustermask() install the struct
*directly* in the per_cpu data for the CPU being brought up. As an
optimisation, actually make it do so for *all* present CPUs in the same
cluster, which means only one iteration over for_each_present_cpu()
instead of doing so repeatedly, once for each CPU.

This was a harmless "bug" while CPU bringup wasn't actually happening in
parallel. It's about to become less harmless...

Fixes: 023a611748fd5 ("x86/apic/x2apic: Simplify cluster management")
Signed-off-by: David Woodhouse 
---
 arch/x86/kernel/apic/x2apic_cluster.c | 82 ---
 1 file changed, 49 insertions(+), 33 deletions(-)

diff --git a/arch/x86/kernel/apic/x2apic_cluster.c 
b/arch/x86/kernel/apic/x2apic_cluster.c
index b0889c48a2ac..ee5a9a438780 100644
--- a/arch/x86/kernel/apic/x2apic_cluster.c
+++ b/arch/x86/kernel/apic/x2apic_cluster.c
@@ -18,7 +18,6 @@ struct cluster_mask {
 static DEFINE_PER_CPU(u32, x86_cpu_to_logical_apicid);
 static DEFINE_PER_CPU(cpumask_var_t, ipi_mask);
 static DEFINE_PER_CPU(struct cluster_mask *, cluster_masks);
-static struct cluster_mask *cluster_hotplug_mask;
 
 static int x2apic_acpi_madt_oem_check(char *oem_id, char *oem_table_id)
 {
@@ -98,54 +97,71 @@ static u32 x2apic_calc_apicid(unsigned int cpu)
 static void init_x2apic_ldr(void)
 {
struct cluster_mask *cmsk = this_cpu_read(cluster_masks);
-   u32 cluster, apicid = apic_read(APIC_LDR);
-   unsigned int cpu;
 
-   this_cpu_write(x86_cpu_to_logical_apicid, apicid);
+   BUG_ON(!cmsk);
 
-   if (cmsk)
-   goto update;
-
-   cluster = apicid >> 16;
-   for_each_online_cpu(cpu) {
-   cmsk = per_cpu(cluster_masks, cpu);
-   /* Matching cluster found. Link and update it. */
-   if (cmsk && cmsk->clusterid == cluster)
-   goto update;
-   }
-   cmsk = cluster_hotplug_mask;
-   cmsk->clusterid = cluster;
-   cluster_hotplug_mask = NULL;
-update:
-   this_cpu_write(cluster_masks, cmsk);
cpumask_set_cpu(smp_processor_id(), &cmsk->mask);
 }
 
-static int alloc_clustermask(unsigned int cpu, int node)
+static int alloc_clustermask(unsigned int cpu, u32 cluster, int node)
 {
+   struct cluster_mask *cmsk = NULL;
+   unsigned int cpu_i;
+   u32 apicid;
+
if (per_cpu(cluster_masks, cpu))
return 0;
-   /*
-* If a hotplug spare mask exists, check whether it's on the right
-* node. If not, free it and allocate a new one.
+
+   /* For the hotplug case, don't always allocate a new one */
+   for_each_present_cpu(cpu_i) {
+   apicid = apic->cpu_present_to_apicid(cpu_i);
+   if (apicid != BAD_APICID && apicid >> 4 == cluster) {
+   cmsk = per_cpu(cluster_masks, cpu_i);
+   if (cmsk)
+   break;
+   }
+   }
+   if (!cmsk) {
+   cmsk = kzalloc_node(sizeof(*cmsk), GFP_KERNEL, node);
+   }
+   if (!cmsk)
+   return -ENOMEM;
+
+   cmsk->node = node;
+   cmsk->clusterid = cluster;
+
+   per_cpu(cluster_masks, cpu) = cmsk;
+
+/*
+* As an optimisation during boot, set the cluster_mask for *all*
+* present CPUs at once, to prevent *each* of them having to iterate
+* over the others to find the existing cluster_mask.
 */
-   if (cluster_hotplug_mask) {
-   if (cluster_hotplug_mask->node == node)
-   return 0;
-   kfree(cluster_hotplug_mask);
+   if (system_state < SYSTEM_RUNNING) {
+   for_each_present_cpu(cpu) {
+   u32 apicid = apic->cpu_present_to_apicid(cpu);
+   if (apicid != BAD_APICID && apicid >> 4 == cluster) {
+   struct cluster_mask **cpu_cmsk = 
&per_cpu(cluster_masks, cpu);
+   if (*cpu_cmsk)
+   BUG_ON(*cpu_cmsk != cmsk);
+   else
+   *cpu_cmsk = cmsk;
+   }
+   }
}
 
-   cluster_hotplug_mask = kzalloc_node(sizeof(*cluster_hotplug_mask),
-   GFP_KERNEL, node);
-   if (!cluster_hotplug_mask)
-   return -ENOMEM;
-   cluste

Re: [PATCH] use x86 cpu park to speedup smp_init in kexec situation

2021-01-21 Thread David Woodhouse
On Thu, 2021-01-21 at 15:42 +, David Woodhouse wrote:
> [2.289283] BUG: kernel NULL pointer dereference, address: 
> [2.289283] #PF: supervisor write access in kernel mode
> [2.289283] #PF: error_code(0x0002) - not-present page
> [2.289283] PGD 0 P4D 0 
> [2.289283] Oops: 0002 [#1] SMP PTI
> [2.289283] CPU: 32 PID: 0 Comm: swapper/32 Not tainted 5.10.0+ #745
> [2.289283] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> 1.14.0-1.fc33 04/01/2014
> [2.289283] RIP: 0010:init_x2apic_ldr+0xa0/0xb0


OK... in alloc_clustermask() for each CPU we were preallocating a
cluster_mask and storing it in the global cluster_hotplug_mask.

Then later for each CPU we were taking the preallocated cluster_mask
and setting cluster_hotplug_mask to NULL.

That doesn't parallelise well :)

So... ditch the global variable, let alloc_clustermask() install the
appropriate cluster_mask *directly* into the target CPU's per_cpu data
before it's running. And since we have to calculate the logical APIC ID
for the cluster ID, we might as well set x86_cpu_to_logical_apicid at
the same time.

Now all that init_x2apic_ldr() actually *does* on the target CPU is set
that CPU's bit in the pre-existing cluster_mask.

To reduce the number of loops over all (present or online) CPUs, I've
made it set the per_cpu cluster_mask for *all* CPUs in the cluster in
one pass at boot time. I think the case for later hotplug is also sane;
will have to test that.

But it passes that qemu boot test it was failing earlier, at least...
 
diff --git a/arch/x86/kernel/apic/x2apic_cluster.c 
b/arch/x86/kernel/apic/x2apic_cluster.c
index b0889c48a2ac..74bb4cae8b5b 100644
--- a/arch/x86/kernel/apic/x2apic_cluster.c
+++ b/arch/x86/kernel/apic/x2apic_cluster.c
@@ -18,7 +18,6 @@ struct cluster_mask {
 static DEFINE_PER_CPU(u32, x86_cpu_to_logical_apicid);
 static DEFINE_PER_CPU(cpumask_var_t, ipi_mask);
 static DEFINE_PER_CPU(struct cluster_mask *, cluster_masks);
-static struct cluster_mask *cluster_hotplug_mask;
 
 static int x2apic_acpi_madt_oem_check(char *oem_id, char *oem_table_id)
 {
@@ -98,54 +97,61 @@ static u32 x2apic_calc_apicid(unsigned int cpu)
 static void init_x2apic_ldr(void)
 {
struct cluster_mask *cmsk = this_cpu_read(cluster_masks);
-   u32 cluster, apicid = apic_read(APIC_LDR);
-   unsigned int cpu;
 
-   this_cpu_write(x86_cpu_to_logical_apicid, apicid);
+   BUG_ON(!cmsk);
 
-   if (cmsk)
-   goto update;
-
-   cluster = apicid >> 16;
-   for_each_online_cpu(cpu) {
-   cmsk = per_cpu(cluster_masks, cpu);
-   /* Matching cluster found. Link and update it. */
-   if (cmsk && cmsk->clusterid == cluster)
-   goto update;
-   }
-   cmsk = cluster_hotplug_mask;
-   cmsk->clusterid = cluster;
-   cluster_hotplug_mask = NULL;
-update:
-   this_cpu_write(cluster_masks, cmsk);
cpumask_set_cpu(smp_processor_id(), &cmsk->mask);
 }
 
-static int alloc_clustermask(unsigned int cpu, int node)
+static int alloc_clustermask(unsigned int cpu, u32 cluster, int node)
 {
+   struct cluster_mask *cmsk = NULL;
+   u32 apicid;
+
if (per_cpu(cluster_masks, cpu))
return 0;
-   /*
-* If a hotplug spare mask exists, check whether it's on the right
-* node. If not, free it and allocate a new one.
+
+   /* For the hotplug case, don't always allocate a new one */
+   for_each_online_cpu(cpu) {
+   apicid = apic->cpu_present_to_apicid(cpu);
+   if (apicid != BAD_APICID && apicid >> 4 == cluster) {
+   cmsk = per_cpu(cluster_masks, cpu);
+   if (cmsk)
+   break;
+   }
+   }
+   if (!cmsk)
+   cmsk = kzalloc_node(sizeof(*cmsk), GFP_KERNEL, node);
+   if (!cmsk)
+   return -ENOMEM;
+
+   cmsk->node = node;
+   cmsk->clusterid = cluster;
+
+/*
+* As an optimisation during boot, set the cluster_mask for *all*
+* present CPUs at once, which will include 'cpu'.
 */
-   if (cluster_hotplug_mask) {
-   if (cluster_hotplug_mask->node == node)
-   return 0;
-   kfree(cluster_hotplug_mask);
+   if (system_state < SYSTEM_RUNNING) {
+   for_each_present_cpu(cpu) {
+   u32 apicid = apic->cpu_present_to_apicid(cpu);
+   if (apicid != BAD_APICID && apicid >> 4 == cluster)
+   per_cpu(cluster_masks, cpu) = cmsk;
+   }
}
 
-   cluster_hotplug_mask = kzalloc_node(sizeof(*cluster_hotplug_mask),
-   GFP_KERNEL, node);
-   if (!clust

Re: [PATCH] use x86 cpu park to speedup smp_init in kexec situation

2021-01-21 Thread David Woodhouse
On Thu, 2021-01-21 at 15:55 +0100, Thomas Gleixner wrote:
> > Testing on real hardware has been more interesting and less useful so
> > far. We started with the CPUHP_BRINGUP_KICK_CPU state being
> > *immediately* before CPUHP_BRINGUP_CPU. On my 28-thread Haswell box,
> > that didn't come up at all even without actually *doing* anything in
> > the pre-bringup phase. Merely bringing all the AP threads up through
> > the various CPUHP_PREPARE_foo stages before actually bringing them
> > online, was enough to break it. I have no serial port on this box so we
> > haven't get worked out why; I've resorted to putting the
> > CPUHP_BRINGUP_KICK_CPU state before CPUHP_WORKQUEUE_PREP instead.
> 
> Hrm.

Aha, I managed to reproduce in qemu. It's CPUHP_X2APIC_PREPARE, which
is only used in x2apic *cluster* mode not physical mode. So I actually
need to give the guest an IOMMU with IRQ remapping before I see it.


$ git diff
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index bc56287a1ed1..f503e66b4718 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -92,6 +92,7 @@ enum cpuhp_state {
CPUHP_MIPS_SOC_PREPARE,
CPUHP_BP_PREPARE_DYN,
CPUHP_BP_PREPARE_DYN_END= CPUHP_BP_PREPARE_DYN + 20,
+   CPUHP_BRINGUP_WAKE_CPU,
CPUHP_BRINGUP_CPU,
CPUHP_AP_IDLE_DEAD,
CPUHP_AP_OFFLINE,
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 2b8d7a5db383..6c6f2986bfdb 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1336,6 +1336,12 @@ void bringup_nonboot_cpus(unsigned int setup_max_cpus)
 {
unsigned int cpu;
 
+   for_each_present_cpu(cpu) {
+   if (num_online_cpus() >= setup_max_cpus)
+   break;
+   if (!cpu_online(cpu))
+   cpu_up(cpu, CPUHP_BRINGUP_WAKE_CPU);
+   }
for_each_present_cpu(cpu) {
if (num_online_cpus() >= setup_max_cpus)
break;
$ qemu-system-x86_64 -kernel arch/x86/boot/bzImage -append "console=ttyS0  
trace_event=cpuhp tp_printk" -display none -serial mon:stdio  -m 2G -M 
q35,accel=kvm,kernel-irqchip=split -device intel-iommu,intremap=on -smp 40
...
[0.349968] smp: Bringing up secondary CPUs ...
[0.350281] cpuhp_enter: cpu: 0001 target:  42 step:   1 
(smpboot_create_threads)
[0.351421] cpuhp_exit:  cpu: 0001  state:   1 step:   1 ret: 0
[0.352074] cpuhp_enter: cpu: 0001 target:  42 step:   2 
(perf_event_init_cpu)
[0.352276] cpuhp_exit:  cpu: 0001  state:   2 step:   2 ret: 0
[0.353273] cpuhp_enter: cpu: 0001 target:  42 step:  37 
(workqueue_prepare_cpu)
[0.354377] cpuhp_exit:  cpu: 0001  state:  37 step:  37 ret: 0
[0.355273] cpuhp_enter: cpu: 0001 target:  42 step:  39 
(hrtimers_prepare_cpu)
[0.356271] cpuhp_exit:  cpu: 0001  state:  39 step:  39 ret: 0
[0.356937] cpuhp_enter: cpu: 0001 target:  42 step:  41 (x2apic_prepare_cpu)
[0.357277] cpuhp_exit:  cpu: 0001  state:  41 step:  41 ret: 0
[0.358278] cpuhp_enter: cpu: 0002 target:  42 step:   1 
(smpboot_create_threads)
...
[0.614278] cpuhp_enter: cpu: 0032 target:  42 step:   1 
(smpboot_create_threads)
[0.615610] cpuhp_exit:  cpu: 0032  state:   1 step:   1 ret: 0
[0.616274] cpuhp_enter: cpu: 0032 target:  42 step:   2 
(perf_event_init_cpu)
[0.617271] cpuhp_exit:  cpu: 0032  state:   2 step:   2 ret: 0
[0.618272] cpuhp_enter: cpu: 0032 target:  42 step:  37 
(workqueue_prepare_cpu)
[0.619388] cpuhp_exit:  cpu: 0032  state:  37 step:  37 ret: 0
[0.620273] cpuhp_enter: cpu: 0032 target:  42 step:  39 
(hrtimers_prepare_cpu)
[0.621270] cpuhp_exit:  cpu: 0032  state:  39 step:  39 ret: 0
[0.622009] cpuhp_enter: cpu: 0032 target:  42 step:  41 (x2apic_prepare_cpu)
[0.622275] cpuhp_exit:  cpu: 0032  state:  41 step:  41 ret: 0
...
[0.684272] cpuhp_enter: cpu: 0039 target:  42 step:  41 (x2apic_prepare_cpu)
[0.685277] cpuhp_exit:  cpu: 0039  state:  41 step:  41 ret: 0
[0.685979] cpuhp_enter: cpu: 0001 target: 217 step:  43 (smpcfd_prepare_cpu)
[0.686283] cpuhp_exit:  cpu: 0001  state:  43 step:  43 ret: 0
[0.687274] cpuhp_enter: cpu: 0001 target: 217 step:  44 (relay_prepare_cpu)
[0.688274] cpuhp_exit:  cpu: 0001  state:  44 step:  44 ret: 0
[0.689274] cpuhp_enter: cpu: 0001 target: 217 step:  47 
(rcutree_prepare_cpu)
[0.690271] cpuhp_exit:  cpu: 0001  state:  47 step:  47 ret: 0
[0.690982] cpuhp_multi_enter: cpu: 0001 target: 217 step:  59 
(trace_rb_cpu_prepare)
[0.691281] cpuhp_exit:  cpu: 0001  state:  59 step:  59 ret: 0
[0.692272] cpuhp_multi_enter: cpu: 0001 target: 217 step:  59 
(trace_rb_cpu_prepare)
[0.694640] cpuhp_exit:  cpu: 0001  state:  59 step:  59 ret: 0
[0.695272] cpuhp_multi_enter: cpu: 0001 target: 217 step:  59 
(trace_rb_cpu_prepare)
[0.696280] cpuhp_exit:  cpu: 0001  state:  59 step:  59 ret: 0
[0.697279] cpuhp_enter: cpu: 0001 target: 217 step:  65 (timers_prepare_cpu)

Re: [PATCH] use x86 cpu park to speedup smp_init in kexec situation

2021-01-19 Thread David Woodhouse
On Tue, 2020-12-15 at 22:20 +0100, Thomas Gleixner wrote:
> Since the rewrite of the CPU hotplug infrastructure to a state machine
> it's pretty obvious that the bringup of APs can changed from the fully
> serialized:
> 
>  for_each_present_cpu(cpu) {
> if (!cpu_online(cpu))
>cpu_up(cpu, CPUHP_ONLINE);
>  }
> 
> to
> 
>  for_each_present_cpu(cpu) {
> if (!cpu_online(cpu))
>cpu_up(cpu, CPUHP_KICK_CPU);
>  }
> 
>  for_each_present_cpu(cpu) {
> if (!cpu_active(cpu))
>cpu_up(cpu, CPUHP_ONLINE);
>  }
> 
> The CPUHP_KICK_CPU state does not exist today, but it's just the logical
> consequence of the state machine. It's basically splitting __cpu_up()
> into:
> 
> __cpu_kick()
> {
> prepare();
> arch_kick_remote_cpu(); -> Send IPI/NMI, Firmware call .
> }
> 
> __cpu_wait_online()
> {
> wait_until_cpu_online();
> do_further_stuff();
> }
> 
> There is some more to it than just blindly splitting it up at the
> architecture level.

We've been playing with this a little. There's a proof-of-concept hack
below; don't look too hard because it's only really for figuring out
the timing etc.

Basically we ripped out the 'wait' parts of the x86 do_boot_cpu() into
a separate function do_wait_cpu(). There are four phases to the wait.

 • Wait for the AP to turn up in cpu_initialized_mask, set its bit in
   cpu_callout_mask to allow it to run the AP thread.
 • Wait for it to finish init and show up in cpu_callin_mask.
 • check_tsc_sync_source()
 • Wait for cpu_online(cpu)

There's an EARLY_INIT macro which controls whether the early bringup
call actually *does* anything, or whether it's left until bringup_cpu()
as the current code does. It allows a simple comparison of the two.

First we tested under qemu (on a Skylake EC2 c5.metal instance). The
do_boot_cpu() actually sending the IPIs took ~300k cycles itself.
Without EARLY_INIT we see timing for the four wait phases along the
lines of:

[0.285312] CPU#10 up in 192950,952898,  60014786,28 (  
61160662)
[0.288311] CPU#11 up in 181092,962704,  60010432,30 (  
61154258)
[0.291312] CPU#12 up in 386080,970416,  60013956,28 (  
61370480)
[0.294311] CPU#13 up in 372782,964506,  60010564,28 (  
61347880)
[0.297312] CPU#14 up in 389602,976280,  60013046,28 (  
61378956)
[0.300312] CPU#15 up in 213132,968148,  60012138,28 (  
61193446)

If we define EARLY_INIT then that first phase of waiting for the CPU
add itself is fairly much instantaneous, which is precisely what we
were hoping for. We also seem to save about 300k cycles on the AP
bringup too. It's just that it *all* pales into insignificance with
whatever it's doing to synchronise the TSC for 60M cycles.

[0.338829] CPU#10 up in600,689054,  60025522,28 (  
60715204)
[0.341829] CPU#11 up in610,635346,  60019390,28 (  
60655374)
[0.343829] CPU#12 up in632,619352,  60020728,28 (  
60640740)
[0.346829] CPU#13 up in602,514234,  60025402,26 (  
60540264)
[0.348830] CPU#14 up in608,621058,  60025952,26 (  
60647644)
[0.351829] CPU#15 up in600,624690,  60021526,   410 (  
60647226)

Testing on real hardware has been more interesting and less useful so
far. We started with the CPUHP_BRINGUP_KICK_CPU state being
*immediately* before CPUHP_BRINGUP_CPU. On my 28-thread Haswell box,
that didn't come up at all even without actually *doing* anything in
the pre-bringup phase. Merely bringing all the AP threads up through
the various CPUHP_PREPARE_foo stages before actually bringing them
online, was enough to break it. I have no serial port on this box so we
haven't get worked out why; I've resorted to putting the
CPUHP_BRINGUP_KICK_CPU state before CPUHP_WORKQUEUE_PREP instead.

That lets it boot without the EARLY_INIT at least (so it's basically a
no-op), and I get these timings. Looks like there's 3-4M cycles to be
had by the parallel SIPI/INIT, but the *first* thread of each core is
also taking another 8M cycles and it might be worth doing *those* in
parallel too. And Thomas I think that waiting for the AP to bring
itself up is the part you meant was pointlessly differently
reimplemented across architectures? So the way forward there might be
to offer a generic CPUHP state for that, for architectures to plug into
and ditch their own tracking.

[0.311581] CPU#1 up in4057008,   8820492,  1828,   808 (  
12880136)
[0.316802] CPU#2 up in3885080,   8738092,  1792,   904 (  
12625868)
[0.321674] CPU#3 up in3468276,   8244880,  1724,   860 (  
11715740)
[0.326609] CPU#4 up in3565216,   8357876,  1808,   984 (  
11925884)
[0.331565] CPU#5 up in3566916,   8367340,  1836,   708 (  
11936800)
[0.336446] CPU#6

Re: [PATCH] use x86 cpu park to speedup smp_init in kexec situation

2021-01-07 Thread David Woodhouse
On Wed, 2020-12-16 at 16:31 +0100, Thomas Gleixner wrote:
> But obviously the C-state in which the APs are waiting is not really
> relevant, as you demonstrated that the cost is due to INIT/SIPI even
> with spinwait, which is what I suspected.
> 
> OTOH, the advantage of INIT/SIPI is that the AP comes up in a well known
> state.

And once we parallelise the bringup we basically only incur the latency
of *one* INIT/SIPI instead of multiplying it by the number of CPUs, so
it isn't clear that there's any *disadvantage* to it. It's certainly a
lot simpler.

I think we should definitely start by implementing the parallel bringup
as you described it, and then see if there's still a problem left to be
solved.

We were working on a SIPI-avoiding patch set which is similar to the
above, which Johanna had just about got working the night before this
one was posted. But it looks like we should go back to the drawing
board anyway instead of bothering to compare the details of the two.


smime.p7s
Description: S/MIME cryptographic signature


Re: 5.11-rc1 TTM list corruption

2021-01-06 Thread David Woodhouse
On Tue, 2021-01-05 at 16:40 +0100, Christian König wrote:
> Am 05.01.21 um 13:20 schrieb Huang Rui:
> > On Tue, Jan 05, 2021 at 07:43:51PM +0800, Borislav Petkov wrote:
> > > On Tue, Jan 05, 2021 at 07:08:52PM +0800, Huang Rui wrote:
> > > > Ah, this asic is a bit old and still use radeon driver. So we didn't
> > > > reproduce it on amdgpu driver. I don't have such the old asic in my 
> > > > hand.
> > > > May we know whether this issue can be duplicated after SI which is used
> > > > amdgpu module (not sure whether you have recent APU or GPU)?
> > > 
> > > The latest I have (I think it is the latest) is:
> > > 
> > > [1.826102] [drm] initializing kernel modesetting (RENOIR 
> > > 0x1002:0x1636 0x17AA:0x5099 0xD1).
> > > 
> > > and so far that hasn't triggered it. Which makes sense because that
> > > thing uses amdgpu:
> > > 
> > > [1.810260] [drm] amdgpu kernel modesetting enabled.
> > 
> > Yes! Renoir is late enough for amdgpu kernel module. :-)
> > Please let us know if you still encounter the issue.
> 
> Thanks for the hints guys. You need a rather specific configuration, but 
> I can reproduce this now.
> 
> Let's see what the problem is here.

FWIW I'm seeing it here on my workstation too.

[3.952102] [drm] radeon kernel modesetting enabled.
[3.952885] checking generic (9000 30) vs hw (9000 1000)
[3.952898] fb0: switching to radeondrmfb from EFI VGA
[3.953665] Console: switching to colour dummy device 80x25
[3.953696] radeon :03:00.0: vgaarb: deactivate vga console
[3.953898] [drm] initializing kernel modesetting (CYPRESS 0x1002:0x6898 
0x1462:0x8032 0x00).
[3.953940] resource sanity check: requesting [mem 0x000c-0x000d], 
which spans more than PCI Bus :00 [mem 0x000c4000-0x000cbfff window]
[3.953945] caller pci_map_rom+0x6c/0x1b0 mapping multiple BARs
[3.953972] ATOM BIOS: 113
[3.954028] radeon :03:00.0: VRAM: 1024M 0x - 
0x3FFF (1024M used)
[3.954032] radeon :03:00.0: GTT: 1024M 0x4000 - 
0x7FFF
[3.954037] [drm] Detected VRAM RAM=1024M, BAR=256M
[3.954039] [drm] RAM width 256bits DDR
[3.954087] [TTM] Zone  kernel: Available graphics memory: 16389788 KiB
[3.954090] [TTM] Zone   dma32: Available graphics memory: 2097152 KiB
[3.954105] [drm] radeon: 1024M of VRAM memory ready
[3.954107] [drm] radeon: 1024M of GTT memory ready.
[3.954114] [drm] Loading CYPRESS Microcode
[3.954168] [drm] Internal thermal controller with fan control
[3.954531] usb 3-1.1.1: New USB device found, idVendor=10d5, 
idProduct=1234, bcdDevice= 9.02
[3.954539] usb 3-1.1.1: New USB device strings: Mfr=0, Product=0, 
SerialNumber=0
[3.958098] hub 3-1.1.1:1.0: USB hub found
[3.959704] hub 3-1.1.1:1.0: 4 ports detected
[3.975098] [drm] radeon: dpm initialized
[3.975159] [drm] GART: num cpu pages 262144, num gpu pages 262144
[3.976074] [drm] enabling PCIE gen 2 link speeds, disable with 
radeon.pcie_gen2=0
[3.979669] igb :01:00.0 eno0: renamed from eth0
[3.993789] [drm] PCIE GART of 1024M enabled (table at 0x0014C000).
[3.993912] radeon :03:00.0: WB enabled
[3.993915] radeon :03:00.0: fence driver on ring 0 use gpu addr 
0x4c00
[3.993918] radeon :03:00.0: fence driver on ring 3 use gpu addr 
0x4c0c
[3.994359] radeon :03:00.0: fence driver on ring 5 use gpu addr 
0x0005c418
[3.994531] radeon :03:00.0: radeon: MSI limited to 32-bit
[3.994563] radeon :03:00.0: radeon: using MSI.
[3.994581] [drm] radeon: irq initialized.
[4.011086] [drm] ring test on 0 succeeded in 1 usecs
[4.011094] [drm] ring test on 3 succeeded in 2 usecs
[4.030666] EXT4-fs (md127): mounted filesystem with ordered data mode. 
Opts: (null)
[4.188159] [drm] ring test on 5 succeeded in 1 usecs
[4.188165] [drm] UVD initialized successfully.
[4.188326] [drm] ib test on ring 0 succeeded in 0 usecs
[4.188371] [drm] ib test on ring 3 succeeded in 0 usecs
...
[4.839982] [drm] ib test on ring 5 succeeded
[4.841079] [drm] Radeon Display Connectors
[4.841087] [drm] Connector 0:
[4.841090] [drm]   DP-1
[4.841094] [drm]   HPD4
[4.841097] [drm]   DDC: 0x6430 0x6430 0x6434 0x6434 0x6438 0x6438 0x643c 
0x643c
[4.841104] [drm]   Encoders:
[4.841107] [drm] DFP1: INTERNAL_UNIPHY2
[4.84] [drm] Connector 1:
[4.841114] [drm]   HDMI-A-1
[4.841118] [drm]   HPD5
[4.841120] [drm]   DDC: 0x6460 0x6460 0x6464 0x6464 0x6468 0x6468 0x646c 
0x646c
[4.841127] [drm]   Encoders:
[4.841130] [drm] DFP2: INTERNAL_UNIPHY2
[4.841133] [drm] Connector 2:
[4.841136] [drm]   DVI-I-1
[4.841139] [drm]   HPD1
[4.841142] [drm]   DDC: 0x6450 0x6450 0x6454 0x6454 0x6458 0x6458 0x645c 
0x645c
[4.841149] [drm]   Encoders:
[4.841151] [drm] DFP3: INTERNAL_UNIPHY1
[4.841155] [drm] CRT2: INT

Re: [PATCH RFC 10/39] KVM: x86/xen: support upcall vector

2021-01-05 Thread David Woodhouse
On Tue, 2021-01-05 at 12:11 +, Joao Martins wrote:
> On 1/1/21 2:33 PM, David Woodhouse wrote:
> > What it does actually mean in the short term is that as I update your
> > KVM_IRQ_ROUTING_XEN_EVTCHN support, I probably *won't* bother to add a
> > 'priority' field to struct kvm_irq_routing_xen_evtchn to make it
> > extensible to FIFO event channels. We can always add that later.
> > 
> > Does that seem reasonable?
> > 
> 
> Yes, makes sense IMHO. Guests need to anyway fallback to 2L if the
> evtchnop_init_control hypercall fails, and the way we are handling events,
> doesn't warrant FIFO event channel support as mandatory.

Right.

I think it's perfectly OK to declare that we aren't going to support
FIFO event channel acceleration in Linux/KVM, and anyone who really
wants to support them would have to do it entirely in userspace.

The only reason a VMM would really need to support FIFO event channels
is if we're insane enough to want to do live migration from actual
Xen, of guests which were previously using them.

While I make no comment about my mental state, I can happily observe
that I don't have guests which currently use FIFO support.



smime.p7s
Description: S/MIME cryptographic signature


[PATCH] iommu/amd: Stop irq_remapping_select() matching when remapping is disabled

2021-01-04 Thread David Woodhouse
From: David Woodhouse 

The AMD IOMMU initialisation registers the IRQ remapping domain for
each IOMMU before doing the final sanity check that every I/OAPIC is
covered.

This means that the AMD irq_remapping_select() function gets invoked
even when IRQ remapping has been disabled, eventually leading to a NULL
pointer dereference in alloc_irq_table().

Unfortunately, the IVRS isn't fully parsed early enough that the sanity
check can be done in time to registering the IRQ domain altogether.
Doing that would be nice, but is a larger and more error-prone task. The
simple fix is just for irq_remapping_select() to refuse to report a
match when IRQ remapping has disabled.

Link: https://lore.kernel.org/lkml/ed4be9b4-24ac-7128-c522-7ef359e81...@gmx.at
Fixes: a1a785b57242 ("iommu/amd: Implement select() method on remapping 
irqdomain")
Reported-by: Johnathan Smithinovic 
Signed-off-by: David Woodhouse 
---
 drivers/iommu/amd/iommu.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 7e2c445a1fae..f0adbc48fd17 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3854,6 +3854,9 @@ static int irq_remapping_select(struct irq_domain *d, 
struct irq_fwspec *fwspec,
struct amd_iommu *iommu;
int devid = -1;
 
+   if (!amd_iommu_irq_remap)
+   return 0;
+
if (x86_fwspec_is_ioapic(fwspec))
devid = get_ioapic_devid(fwspec->param[0]);
else if (x86_fwspec_is_hpet(fwspec))
-- 
2.29.2



smime.p7s
Description: S/MIME cryptographic signature


[PATCH] iommu/amd: Set iommu->int_enabled consistently when interrupts are set up

2021-01-04 Thread David Woodhouse
From: David Woodhouse 

When I made the INTCAPXT support stop gratuitously pretending to be MSI,
I missed the fact that iommu_setup_msi() also sets the ->int_enabled
flag. I missed this in the iommu_setup_intcapxt() code path, which means
that a resume from suspend will try to allocate the IRQ domains again,
accidentally re-enabling interrupts as it does, resulting in much sadness.

Lift out the bit which sets iommu->int_enabled into the iommu_init_irq()
function which is also where it gets checked.

Link: https://lore.kernel.org/r/20210104132250.ge32...@zn.tnic/
Fixes: d1adcfbb520c ("iommu/amd: Fix IOMMU interrupt generation in X2APIC mode")
Reported-by: Borislav Petkov 
Signed-off-by: David Woodhouse 
---
There's a possibility we also need to ensure that the actual
MMIO_INTCAPXT_xxx_OFFSET registers are restored too. Unless you
actually trigger something to generate faults, you'll never know.
I don't see offhand how that was working in the pretend-to-be-MSI case
either.

 drivers/iommu/amd/init.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index f54cd79b43e4..6a1f7048dacc 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -1973,8 +1973,6 @@ static int iommu_setup_msi(struct amd_iommu *iommu)
return r;
}
 
-   iommu->int_enabled = true;
-
return 0;
 }
 
@@ -2169,6 +2167,7 @@ static int iommu_init_irq(struct amd_iommu *iommu)
if (ret)
return ret;
 
+   iommu->int_enabled = true;
 enable_faults:
iommu_feature_enable(iommu, CONTROL_EVT_INT_EN);
 
-- 
2.29.2



smime.p7s
Description: S/MIME cryptographic signature


  1   2   3   4   5   6   7   8   9   10   >