Re: [PATCH 05/17] arch, mm: pull out allocation of NODE_DATA to generic code

2024-07-19 Thread David Hildenbrand

On 19.07.24 17:51, Jonathan Cameron wrote:

On Fri, 19 Jul 2024 17:07:35 +0200
David Hildenbrand  wrote:


-* Allocate node data.  Try node-local memory and then any node.
-* Never allocate in DMA zone.
-*/
-   nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
-   if (!nd_pa) {
-   pr_err("Cannot find %zu bytes in any node (initial node: %d)\n",
-  nd_size, nid);
-   return;
-   }
-   nd = __va(nd_pa);
-
-   /* report and initialize */
-   printk(KERN_INFO "NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid,
-  nd_pa, nd_pa + nd_size - 1);
-   tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
-   if (tnid != nid)
-   printk(KERN_INFO "NODE_DATA(%d) on node %d\n", nid, tnid);
-
-   node_data[nid] = nd;
-   memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
-
-   node_set_online(nid);
-}
-
/**
 * numa_cleanup_meminfo - Cleanup a numa_meminfo
 * @mi: numa_meminfo to clean up
@@ -571,6 +538,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
continue;
alloc_node_data(nid);
+   node_set_online(nid);
}


I can spot that we only remove a single node_set_online() call from x86.

What about all the other architectures? Will there be any change in behavior
for them? Or do we simply set the nodes online later once more?


On x86 node_set_online() was a part of alloc_node_data() and I moved it
outside so it's called right after alloc_node_data(). On other
architectures the allocation didn't include that call, so there should be
no difference there.


But won't their arch code try setting the nodes online at a later stage?

And I think some architectures only set nodes online conditionally
(see most other node_set_online() calls).

Sorry if I'm confused here, but now that node_set_online() is unconditional, won't
we change the behavior of other architectures?

This is moving x86 code to x86 code, not to a generic location,
so how would that affect anyone else? Their onlining should be the same as
before.


Yes, see my reply to Mike.



The node onlining differences are a pain (I recall that fun from adding
generic initiators), as the ordering differs on x86 and arm64 at least.


That's part of the reason I was confused, because I remember some nasty 
inconsistency.


--
Cheers,

David / dhildenb




Re: [PATCH] virt: acrn: Remove unused list 'acrn_irqfd_clients'

2024-07-19 Thread Dr. David Alan Gilbert
* Li, Fei1 (fei1...@intel.com) wrote:
> > -Original Message-
> > From: Dr. David Alan Gilbert 
> > Sent: Friday, July 19, 2024 1:44 AM
> > To: Li, Fei1 
> > Cc: linux-kernel@vger.kernel.org; virtualizat...@lists.linux.dev
> > Subject: Re: [PATCH] virt: acrn: Remove unused list 'acrn_irqfd_clients'
> > 
> > * Fei Li (fei1...@intel.com) wrote:
> > > On 2024-05-18 at 00:12:46 +, Dr. David Alan Gilbert wrote:
> > > > * li...@treblig.org (li...@treblig.org) wrote:
> > > > > From: "Dr. David Alan Gilbert" 
> > > > >
> > > > > It doesn't look like this was ever used.
> > > > >
> > > > > Build tested only.
> > > > >
> > > > > Signed-off-by: Dr. David Alan Gilbert 
> > > >
> > > > Ping
> > >
> > > Acked-by: Fei Li 
> > 
> > Hi Fei,
> >   Do you know which way this is likely to get picked up?
> > (I don't see it in -next)
> > 
> >  Thanks,
> > 
> > Dave
> > 
> > > Thanks.
> 
> Hi Dave
> 
> For this patch, you could refer to 
> https://lore.kernel.org/all/20210910094708.3430340-1-bige...@linutronix.de/ 
> to cc Thomas Gleixner  to help review.

OK, I've added Thomas to the To: on this mail;
note he's not listed in Maintainers for drivers/virt/acrn

Dave

> Thanks.
> 
> > >
> > > >
> > > > > ---
> > > > >  drivers/virt/acrn/irqfd.c | 2 --
> > > > >  1 file changed, 2 deletions(-)
> > > > >
> > > > > diff --git a/drivers/virt/acrn/irqfd.c b/drivers/virt/acrn/irqfd.c
> > > > > index d4ad211dce7a3..346cf0be4aac7 100644
> > > > > --- a/drivers/virt/acrn/irqfd.c
> > > > > +++ b/drivers/virt/acrn/irqfd.c
> > > > > @@ -16,8 +16,6 @@
> > > > >
> > > > >  #include "acrn_drv.h"
> > > > >
> > > > > -static LIST_HEAD(acrn_irqfd_clients);
> > > > > -
> > > > >  /**
> > > > >   * struct hsm_irqfd - Properties of HSM irqfd
> > > > >   * @vm:  Associated VM pointer
> > > > > --
> > > > > 2.45.0
> > > > >
> > > > --
> > > >  -Open up your eyes, open up your mind, open up your code ---
> > > > / Dr. David Alan Gilbert|   Running GNU/Linux   | Happy  \
> > > > \dave @ treblig.org |   | In Hex /
> > > >  \ _|_ http://www.treblig.org   |___/
> > > >
> > >
> > --
> >  -Open up your eyes, open up your mind, open up your code ---
> > / Dr. David Alan Gilbert|   Running GNU/Linux   | Happy  \
> > \dave @ treblig.org |   | In Hex /
> >  \ _|_ http://www.treblig.org   |___/
> 
-- 
 -Open up your eyes, open up your mind, open up your code ---   
/ Dr. David Alan Gilbert|   Running GNU/Linux   | Happy  \ 
\dave @ treblig.org |   | In Hex /
 \ _|_ http://www.treblig.org   |___/



Re: [PATCH 05/17] arch, mm: pull out allocation of NODE_DATA to generic code

2024-07-19 Thread David Hildenbrand

On 19.07.24 17:34, Mike Rapoport wrote:

On Fri, Jul 19, 2024 at 05:07:35PM +0200, David Hildenbrand wrote:

-* Allocate node data.  Try node-local memory and then any node.
-* Never allocate in DMA zone.
-*/
-   nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
-   if (!nd_pa) {
-   pr_err("Cannot find %zu bytes in any node (initial node: %d)\n",
-  nd_size, nid);
-   return;
-   }
-   nd = __va(nd_pa);
-
-   /* report and initialize */
-   printk(KERN_INFO "NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid,
-  nd_pa, nd_pa + nd_size - 1);
-   tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
-   if (tnid != nid)
-   printk(KERN_INFO "NODE_DATA(%d) on node %d\n", nid, tnid);
-
-   node_data[nid] = nd;
-   memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
-
-   node_set_online(nid);
-}
-
/**
 * numa_cleanup_meminfo - Cleanup a numa_meminfo
 * @mi: numa_meminfo to clean up
@@ -571,6 +538,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
continue;
alloc_node_data(nid);
+   node_set_online(nid);
}


I can spot that we only remove a single node_set_online() call from x86.

What about all the other architectures? Will there be any change in behavior
for them? Or do we simply set the nodes online later once more?


On x86 node_set_online() was a part of alloc_node_data() and I moved it
outside so it's called right after alloc_node_data(). On other
architectures the allocation didn't include that call, so there should be
no difference there.


But won't their arch code try setting the nodes online at a later stage?

And I think some architectures only set nodes online conditionally
(see most other node_set_online() calls).

Sorry if I'm confused here, but now that node_set_online() is unconditional, won't
we change the behavior of other architectures?


The generic alloc_node_data() does not set the node online:

+/* Allocate NODE_DATA for a node on the local memory */
+void __init alloc_node_data(int nid)
+{
+   const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
+   u64 nd_pa;
+   void *nd;
+   int tnid;
+
+   /* Allocate node data.  Try node-local memory and then any node. */
+   nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
+   if (!nd_pa)
+   panic("Cannot allocate %zu bytes for node %d data\n",
+ nd_size, nid);
+   nd = __va(nd_pa);
+
+   /* report and initialize */
+   pr_info("NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid,
+   nd_pa, nd_pa + nd_size - 1);
+   tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
+   if (tnid != nid)
+   pr_info("NODE_DATA(%d) on node %d\n", nid, tnid);
+
+   node_data[nid] = nd;
+   memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
+}

I might have missed some architecture other than x86 that calls
node_set_online() in its alloc_node_data(), but the intention was to leave
that call outside the allocation and explicitly add it after the call to
alloc_node_data() where needed, as on x86.


I'm stupid, I didn't realize it is still only called from x86 :(

Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH 05/17] arch, mm: pull out allocation of NODE_DATA to generic code

2024-07-19 Thread David Hildenbrand

-* Allocate node data.  Try node-local memory and then any node.
-* Never allocate in DMA zone.
-*/
-   nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
-   if (!nd_pa) {
-   pr_err("Cannot find %zu bytes in any node (initial node: %d)\n",
-  nd_size, nid);
-   return;
-   }
-   nd = __va(nd_pa);
-
-   /* report and initialize */
-   printk(KERN_INFO "NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid,
-  nd_pa, nd_pa + nd_size - 1);
-   tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
-   if (tnid != nid)
-   printk(KERN_INFO "NODE_DATA(%d) on node %d\n", nid, tnid);
-
-   node_data[nid] = nd;
-   memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
-
-   node_set_online(nid);
-}
-
   /**
* numa_cleanup_meminfo - Cleanup a numa_meminfo
* @mi: numa_meminfo to clean up
@@ -571,6 +538,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
continue;
alloc_node_data(nid);
+   node_set_online(nid);
}


I can spot that we only remove a single node_set_online() call from x86.

What about all the other architectures? Will there be any change in behavior
for them? Or do we simply set the nodes online later once more?


On x86 node_set_online() was a part of alloc_node_data() and I moved it
outside so it's called right after alloc_node_data(). On other
architectures the allocation didn't include that call, so there should be
no difference there.


But won't their arch code try setting the nodes online at a later stage?

And I think some architectures only set nodes online conditionally
(see most other node_set_online() calls).

Sorry if I'm confused here, but now that node_set_online() is unconditional, won't
we change the behavior of other architectures?

--
Cheers,

David / dhildenb




Re: [PATCH] virt: acrn: Remove unused list 'acrn_irqfd_clients'

2024-07-18 Thread Dr. David Alan Gilbert
* Fei Li (fei1...@intel.com) wrote:
> On 2024-05-18 at 00:12:46 +, Dr. David Alan Gilbert wrote:
> > * li...@treblig.org (li...@treblig.org) wrote:
> > > From: "Dr. David Alan Gilbert" 
> > > 
> > > It doesn't look like this was ever used.
> > > 
> > > Build tested only.
> > > 
> > > Signed-off-by: Dr. David Alan Gilbert 
> > 
> > Ping
> 
> Acked-by: Fei Li 

Hi Fei,
  Do you know which way this is likely to get picked up?
(I don't see it in -next)

 Thanks,

Dave

> Thanks.
> 
> > 
> > > ---
> > >  drivers/virt/acrn/irqfd.c | 2 --
> > >  1 file changed, 2 deletions(-)
> > > 
> > > diff --git a/drivers/virt/acrn/irqfd.c b/drivers/virt/acrn/irqfd.c
> > > index d4ad211dce7a3..346cf0be4aac7 100644
> > > --- a/drivers/virt/acrn/irqfd.c
> > > +++ b/drivers/virt/acrn/irqfd.c
> > > @@ -16,8 +16,6 @@
> > >  
> > >  #include "acrn_drv.h"
> > >  
> > > -static LIST_HEAD(acrn_irqfd_clients);
> > > -
> > >  /**
> > >   * struct hsm_irqfd - Properties of HSM irqfd
> > >   * @vm:  Associated VM pointer
> > > -- 
> > > 2.45.0
> > > 
> > -- 
> >  -Open up your eyes, open up your mind, open up your code ---   
> > / Dr. David Alan Gilbert|   Running GNU/Linux   | Happy  \ 
> > \dave @ treblig.org |   | In Hex /
> >  \ _|_ http://www.treblig.org   |___/
> > 
> 
-- 
 -Open up your eyes, open up your mind, open up your code ---   
/ Dr. David Alan Gilbert|   Running GNU/Linux   | Happy  \ 
\dave @ treblig.org |   | In Hex /
 \ _|_ http://www.treblig.org   |___/



Re: [PATCH 05/17] arch, mm: pull out allocation of NODE_DATA to generic code

2024-07-17 Thread David Hildenbrand

On 16.07.24 13:13, Mike Rapoport wrote:

From: "Mike Rapoport (Microsoft)" 

Architectures that support NUMA duplicate the code that allocates
NODE_DATA on the node-local memory with slight variations in reporting
of the addresses where the memory was allocated.

Use x86 version as the basis for the generic alloc_node_data() function
and call this function in architecture specific numa initialization.

Signed-off-by: Mike Rapoport (Microsoft) 
---


[...]


diff --git a/arch/mips/loongson64/numa.c b/arch/mips/loongson64/numa.c
index 9208eaadf690..909f6cec3a26 100644
--- a/arch/mips/loongson64/numa.c
+++ b/arch/mips/loongson64/numa.c
@@ -81,12 +81,8 @@ static void __init init_topology_matrix(void)
  
  static void __init node_mem_init(unsigned int node)

  {
-   struct pglist_data *nd;
unsigned long node_addrspace_offset;
unsigned long start_pfn, end_pfn;
-   unsigned long nd_pa;
-   int tnid;
-   const size_t nd_size = roundup(sizeof(pg_data_t), SMP_CACHE_BYTES);


One interesting change is that we now always round up to full pages on 
architectures where we previously rounded up to SMP_CACHE_BYTES.


I assume we don't really expect a significant growth in memory
consumption that we care about, especially because most systems with
many nodes also have quite some memory around.
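
As a back-of-the-envelope illustration (my numbers; the real
sizeof(pg_data_t) is config-dependent), the extra cost of page rounding
is bounded by PAGE_SIZE - SMP_CACHE_BYTES per node:

	/*
	 * Illustration only: worst-case per-node overhead of rounding to
	 * PAGE_SIZE instead of SMP_CACHE_BYTES. 173 KiB is an assumed
	 * example value for sizeof(pg_data_t).
	 */
	#include <stdio.h>

	#define ROUNDUP(x, a) ((((x) + (a) - 1) / (a)) * (a))

	int main(void)
	{
		size_t nd = 173 * 1024;			/* assumed sizeof(pg_data_t) */
		size_t by_cache = ROUNDUP(nd, 64);	/* old: SMP_CACHE_BYTES */
		size_t by_page  = ROUNDUP(nd, 4096);	/* new: PAGE_SIZE, 4 KiB */

		/* prints 3072 here: under one page of waste per NUMA node */
		printf("delta per node: %zu bytes\n", by_page - by_cache);
		return 0;
	}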




-/* Allocate NODE_DATA for a node on the local memory */
-static void __init alloc_node_data(int nid)
-{
-   const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
-   u64 nd_pa;
-   void *nd;
-   int tnid;
-
-   /*
-* Allocate node data.  Try node-local memory and then any node.
-* Never allocate in DMA zone.
-*/
-   nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
-   if (!nd_pa) {
-   pr_err("Cannot find %zu bytes in any node (initial node: %d)\n",
-  nd_size, nid);
-   return;
-   }
-   nd = __va(nd_pa);
-
-   /* report and initialize */
-   printk(KERN_INFO "NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid,
-  nd_pa, nd_pa + nd_size - 1);
-   tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
-   if (tnid != nid)
-   printk(KERN_INFO "NODE_DATA(%d) on node %d\n", nid, tnid);
-
-   node_data[nid] = nd;
-   memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
-
-   node_set_online(nid);
-}
-
  /**
   * numa_cleanup_meminfo - Cleanup a numa_meminfo
   * @mi: numa_meminfo to clean up
@@ -571,6 +538,7 @@ static int __init numa_register_memblks(struct numa_meminfo 
*mi)
continue;
  
  		alloc_node_data(nid);

+   node_set_online(nid);
}


I can spot that we only remove a single node_set_online() call from x86.

What about all the other architectures? Will there be any change in 
behavior for them? Or do we simply set the nodes online later once more?


--
Cheers,

David / dhildenb




Re: [PATCH 01/17] mm: move kernel/numa.c to mm/

2024-07-17 Thread David Hildenbrand

On 16.07.24 13:13, Mike Rapoport wrote:

From: "Mike Rapoport (Microsoft)" 

The stub functions in kernel/numa.c belong to mm/ rather than to kernel/

Signed-off-by: Mike Rapoport (Microsoft) 
---


Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH 04/17] arch, mm: move definition of node_data to generic code

2024-07-17 Thread David Hildenbrand

On 16.07.24 13:13, Mike Rapoport wrote:

From: "Mike Rapoport (Microsoft)" 

Every architecture that supports NUMA defines node_data in the same way:

struct pglist_data *node_data[MAX_NUMNODES];

No reason to keep multiple copies of this definition and its forward
declarations, especially when such forward declaration is the only thing
in include/asm/mmzone.h for many architectures.

Add definition and declaration of node_data to generic code and drop
architecture-specific versions.

Signed-off-by: Mike Rapoport (Microsoft) 
---
  arch/arm64/include/asm/Kbuild  |  1 +
  arch/arm64/include/asm/mmzone.h| 13 -
  arch/arm64/include/asm/topology.h  |  1 +
  arch/loongarch/include/asm/Kbuild  |  1 +
  arch/loongarch/include/asm/mmzone.h| 16 
  arch/loongarch/include/asm/topology.h  |  1 +
  arch/loongarch/kernel/numa.c   |  3 ---
  arch/mips/include/asm/mach-ip27/mmzone.h   |  4 
  arch/mips/include/asm/mach-loongson64/mmzone.h |  4 
  arch/mips/loongson64/numa.c|  2 --
  arch/mips/sgi-ip27/ip27-memory.c   |  3 ---
  arch/powerpc/include/asm/mmzone.h  |  6 --
  arch/powerpc/mm/numa.c |  2 --
  arch/riscv/include/asm/Kbuild  |  1 +
  arch/riscv/include/asm/mmzone.h| 13 -
  arch/riscv/include/asm/topology.h  |  4 
  arch/s390/include/asm/Kbuild   |  1 +
  arch/s390/include/asm/mmzone.h | 17 -
  arch/s390/kernel/numa.c|  3 ---
  arch/sh/include/asm/mmzone.h   |  3 ---
  arch/sh/mm/numa.c  |  3 ---
  arch/sparc/include/asm/mmzone.h|  4 
  arch/sparc/mm/init_64.c|  2 --
  arch/x86/include/asm/Kbuild|  1 +
  arch/x86/include/asm/mmzone.h  |  6 --
  arch/x86/include/asm/mmzone_32.h   | 17 -
  arch/x86/include/asm/mmzone_64.h   | 18 --
  arch/x86/mm/numa.c |  3 ---
  drivers/base/arch_numa.c   |  2 --
  include/asm-generic/mmzone.h   |  5 +
  include/linux/numa.h   |  3 +++
  mm/numa.c  |  3 +++
  32 files changed, 22 insertions(+), 144 deletions(-)
  delete mode 100644 arch/arm64/include/asm/mmzone.h
  delete mode 100644 arch/loongarch/include/asm/mmzone.h
  delete mode 100644 arch/riscv/include/asm/mmzone.h
  delete mode 100644 arch/s390/include/asm/mmzone.h
  delete mode 100644 arch/x86/include/asm/mmzone.h
  delete mode 100644 arch/x86/include/asm/mmzone_32.h
  delete mode 100644 arch/x86/include/asm/mmzone_64.h
  create mode 100644 include/asm-generic/mmzone.h


Nice!

Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH 03/17] MIPS: loongson64: rename __node_data to node_data

2024-07-17 Thread David Hildenbrand

On 16.07.24 13:13, Mike Rapoport wrote:

From: "Mike Rapoport (Microsoft)" 

Make definition of node_data match other architectures.
This will allow pulling declaration of node_data to the generic mm code in
the following commit.

Signed-off-by: Mike Rapoport (Microsoft) 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH 02/17] MIPS: sgi-ip27: make NODE_DATA() the same as on all other architectures

2024-07-17 Thread David Hildenbrand

On 16.07.24 13:13, Mike Rapoport wrote:

From: "Mike Rapoport (Microsoft)" 

sgi-ip27 is the only system that defines NODE_DATA() differently than
the rest of NUMA machines.

Add node_data array of struct pglist_data pointers that will point to
__node_data[node]->pglist and redefine NODE_DATA() to use node_data
array.

This will allow pulling declaration of node_data to the generic mm code
in the next commit.

Signed-off-by: Mike Rapoport (Microsoft) 
---
  arch/mips/include/asm/mach-ip27/mmzone.h | 5 -
  arch/mips/sgi-ip27/ip27-memory.c | 5 -
  2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/mips/include/asm/mach-ip27/mmzone.h b/arch/mips/include/asm/mach-ip27/mmzone.h
index 08c36e50a860..629c3f290203 100644
--- a/arch/mips/include/asm/mach-ip27/mmzone.h
+++ b/arch/mips/include/asm/mach-ip27/mmzone.h
@@ -22,7 +22,10 @@ struct node_data {
  
  extern struct node_data *__node_data[];
  
-#define NODE_DATA(n)		(&__node_data[(n)]->pglist)

  #define hub_data(n)   (&__node_data[(n)]->hub)
  
+extern struct pglist_data *node_data[];

+
+#define NODE_DATA(nid) (node_data[nid])
+
  #endif /* _ASM_MACH_MMZONE_H */
diff --git a/arch/mips/sgi-ip27/ip27-memory.c b/arch/mips/sgi-ip27/ip27-memory.c
index b8ca94cfb4fe..c30ef6958b97 100644
--- a/arch/mips/sgi-ip27/ip27-memory.c
+++ b/arch/mips/sgi-ip27/ip27-memory.c
@@ -34,8 +34,10 @@
  #define SLOT_PFNSHIFT (SLOT_SHIFT - PAGE_SHIFT)
  #define PFN_NASIDSHFT (NASID_SHFT - PAGE_SHIFT)
  
-struct node_data *__node_data[MAX_NUMNODES];

+struct pglist_data *node_data[MAX_NUMNODES];
+EXPORT_SYMBOL(node_data);
  
+struct node_data *__node_data[MAX_NUMNODES];

  EXPORT_SYMBOL(__node_data);
  
  static u64 gen_region_mask(void)

@@ -361,6 +363,7 @@ static void __init node_mem_init(nasid_t node)
 */
__node_data[node] = __va(slot_freepfn << PAGE_SHIFT);
memset(__node_data[node], 0, PAGE_SIZE);
+   node_data[node] = &__node_data[node]->pglist;
  
  	NODE_DATA(node)->node_start_pfn = start_pfn;

NODE_DATA(node)->node_spanned_pages = end_pfn - start_pfn;


I was assuming we could get rid of __node_data->pglist.

But now I am confused where that is actually set.

Anyhow

Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [RFC PATCH v4] ptp: Add vDSO-style vmclock support

2024-07-17 Thread David Woodhouse
On Tue, 2024-07-16 at 15:20 +0200, Peter Hilber wrote:
> On 16.07.24 14:32, David Woodhouse wrote:
> > On 16 July 2024 12:54:52 BST, Peter Hilber  
> > wrote:
> > > On 11.07.24 09:50, David Woodhouse wrote:
> > > > On Thu, 2024-07-11 at 09:25 +0200, Peter Hilber wrote:
> > > > > 
> > > > > IMHO this phrasing is better, since it directly refers to the state 
> > > > > of the
> > > > > structure.
> > > > 
> > > > Thanks. I'll update it.
> > > > 
> > > > > AFAIU if there would be abnormal delays in store buffers, causing some
> > > > > driver to still see the old clock for some time, the monotonicity 
> > > > > could be
> > > > > violated:
> > > > > 
> > > > > 1. device writes new, much slower clock to store buffer
> > > > > 2. some time passes
> > > > > 3. driver reads old, much faster clock
> > > > > 4. device writes store buffer to cache
> > > > > 5. driver reads new, much slower clock
> > > > > 
> > > > > But I hope such delays do not occur.
> > > > 
> > > > For the case of the hypervisor←→guest interface this should be handled
> > > > by the use of memory barriers and the seqcount lock.
> > > > 
> > > > The guest driver reads the seqcount, performs a read memory barrier,
> > > > then reads the contents of the structure. Then performs *another* read
> > > > memory barrier, and checks the seqcount hasn't changed:
> > > > https://git.infradead.org/?p=users/dwmw2/linux.git;a=blob;f=drivers/ptp/ptp_vmclock.c;hb=vmclock#l351
> > > > 
> > > > The converse happens with write barriers on the hypervisor side:
> > > > https://git.infradead.org/?p=users/dwmw2/qemu.git;a=blob;f=hw/acpi/vmclock.c;hb=vmclock#l68
> > > 
> > > My point is that, looking at the above steps 1. - 5.:
> > > 
> > > 3. read HW counter, smp_rmb, read seqcount
> > > 4. store seqcount, smp_wmb, stores, smp_wmb, store seqcount become 
> > > effective
> > > 5. read seqcount, smp_rmb, read HW counter
> > > 
> > > AFAIU this would still be a theoretical problem suggesting the use of
> > > stronger barriers.
> > 
> > This seems like a bug on the guest side. The HW counter needs to be read 
> > *within* the (paired, matching) seqcount reads, not before or after.
> > 
> > 
> 
> There would be paired reads:
> 
> 1. device writes new, much slower clock to store buffer
> 2. some time passes
> 3. read seqcount, smp_rmb, ..., read HW counter, smp_rmb, read seqcount
> 4. store seqcount, smp_wmb, stores, smp_wmb, store seqcount all become
>    effective only now
> 5. read seqcount, smp_rmb, read HW counter, ..., smp_rmb, read seqcount
> 
> I just omitted the parts which do not necessarily need to happen close to
> 4. for the monotonicity to be violated. My point is that 1. could become
> visible to other cores long after it happened on the local core (during
> 4.).

Oh, I see. That would be a bug on the device side then. And as you say,
it could be fixed by using the appropriate barriers. Or my alternative
of just documenting "Don't Do That Then".
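
For reference, a minimal sketch of the device-side ordering under
discussion (my reconstruction with Linux-style barrier names, not the
actual QEMU code; the real thing is at the hw/acpi/vmclock.c link above):

	/*
	 * Sketch only: publish new clock fields between two seq_count
	 * transitions. The write barriers keep the data stores ordered
	 * against the count stores; Peter's scenario argues that stronger
	 * (fully serialising) barriers are needed if the data stores can
	 * linger in a store buffer arbitrarily long.
	 */
	static void vmclock_update(struct vmclock_abi *clk,
				   uint64_t sec, uint64_t frac)
	{
		clk->seq_count++;	/* odd: update in progress */
		smp_wmb();		/* odd count visible before data stores */

		clk->time_sec = sec;
		clk->time_frac_sec = frac;

		smp_wmb();		/* data stores visible before even count */
		clk->seq_count++;	/* even: update complete */
	}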





Re: [RFC PATCH v4] ptp: Add vDSO-style vmclock support

2024-07-17 Thread David Woodhouse
On Tue, 2024-07-16 at 13:54 +0200, Peter Hilber wrote:
> On 08.07.24 11:27, David Woodhouse wrote:
> > +
> > +   /*
> > +    * Time according to time_type field above.
> > +    */
> > +   uint64_t time_sec;  /* Seconds since time_type epoch */
> > +   uint64_t time_frac_sec; /* (seconds >> 64) */
> > +   uint64_t time_esterror_picosec; /* (± picoseconds) */
> > +   uint64_t time_maxerror_picosec; /* (± picoseconds) */
> 
> Is this unsigned or signed?

The field itself is unsigned, as it provides the absolute value of the
error (which can be in either direction). Probably better just to drop
the ± from the comment.

Julien is now back from vacation and I'm expecting to see his opinion
on whether we can change that to nanoseconds for consistency.




Re: [RFC PATCH v4] ptp: Add vDSO-style vmclock support

2024-07-16 Thread David Woodhouse
On 16 July 2024 12:54:52 BST, Peter Hilber  wrote:
>On 11.07.24 09:50, David Woodhouse wrote:
>> On Thu, 2024-07-11 at 09:25 +0200, Peter Hilber wrote:
>>>
>>> IMHO this phrasing is better, since it directly refers to the state of the
>>> structure.
>> 
>> Thanks. I'll update it.
>> 
>>> AFAIU if there would be abnormal delays in store buffers, causing some
>>> driver to still see the old clock for some time, the monotonicity could be
>>> violated:
>>>
>>> 1. device writes new, much slower clock to store buffer
>>> 2. some time passes
>>> 3. driver reads old, much faster clock
>>> 4. device writes store buffer to cache
>>> 5. driver reads new, much slower clock
>>>
>>> But I hope such delays do not occur.
>> 
>> For the case of the hypervisor←→guest interface this should be handled
>> by the use of memory barriers and the seqcount lock.
>> 
>> The guest driver reads the seqcount, performs a read memory barrier,
>> then reads the contents of the structure. Then performs *another* read
>> memory barrier, and checks the seqcount hasn't changed:
>> https://git.infradead.org/?p=users/dwmw2/linux.git;a=blob;f=drivers/ptp/ptp_vmclock.c;hb=vmclock#l351
>> 
>> The converse happens with write barriers on the hypervisor side:
>> https://git.infradead.org/?p=users/dwmw2/qemu.git;a=blob;f=hw/acpi/vmclock.c;hb=vmclock#l68
>
>My point is that, looking at the above steps 1. - 5.:
>
>3. read HW counter, smp_rmb, read seqcount
>4. store seqcount, smp_wmb, stores, smp_wmb, store seqcount become effective
>5. read seqcount, smp_rmb, read HW counter
>
>AFAIU this would still be a theoretical problem suggesting the use of
>stronger barriers.

This seems like a bug on the guest side. The HW counter needs to be read 
*within* the (paired, matching) seqcount reads, not before or after.





RE: [RFC PATCH v4 12/17] powerpc64/ftrace: Move ftrace sequence out of line

2024-07-15 Thread David Laight
From: Nicholas Piggin
> Sent: 15 July 2024 09:25
> 
> On Sun Jul 14, 2024 at 6:27 PM AEST, Naveen N Rao wrote:
> > Function profile sequence on powerpc includes two instructions at the
> > beginning of each function:
> > 	mflr	r0
> > 	bl	ftrace_caller
> >
> > The call to ftrace_caller() gets nop'ed out during kernel boot and is
> > patched in when ftrace is enabled.
> >
> > Given the sequence, we cannot return from ftrace_caller with 'blr' as we
> > need to keep LR and r0 intact. This results in link stack (return
> > address predictor) imbalance when ftrace is enabled. To address that, we
> > would like to use a three instruction sequence:
> > 	mflr	r0
> > 	bl	ftrace_caller
> > 	mtlr	r0
> >
> > Furthermore, to support DYNAMIC_FTRACE_WITH_CALL_OPS, we need to
> > reserve two instruction slots before the function. This results in a
> > total of five instruction slots to be reserved for ftrace use on each
> > function that is traced.
> >
> > Move the function profile sequence out-of-line to minimize its impact.
> > To do this, we reserve a single nop at function entry using
> > -fpatchable-function-entry=1 and add a pass on vmlinux.o to determine
> > the total number of functions that can be traced. This is then used to
> > generate a .S file reserving the appropriate amount of space for use as
> > ftrace stubs, which is built and linked into vmlinux.
> 
> These are all going into .tramp.ftrace.text AFAIKS? Should that be
> moved after some of the other text in the linker script then if it
> could get quite large? sched and lock and other things should be
> closer to the rest of text and hot code.

Can't you allocate the space for the 'function profile sequence'
at run-time when (if) ftrace is enabled?
When ftrace gets disabled it is likely possible to save the trampoline
number in the nop - so the same memory can be used next time.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


Re: [PATCH] arm64: dts: exynos: exynos7885-jackpotlte: Correct RAM amount to 4GB

2024-07-13 Thread David Virag
On Sat, 2024-07-13 at 20:54 +0200, Krzysztof Kozlowski wrote:
> On 13/07/2024 19:58, David Virag wrote:
> > All known jackpotlte variants have 4GB of RAM, let's use it all.
> > RAM was set to 3GB from a mistake in the vendor provided DTS file.
> 
> Hm, vendor DTS rarely has a mistake of missing 1 GB of RAM, so I assume
> there was some reason behind it. Trusted apps? Some shared memory for
> another co-processor?

Honestly I'm not sure, maybe some prototype had 3GB of RAM?
The stock bootloader does update it to 4GB, but the stock bootloader
also doesn't even respect the arm64 boot protocol, and doesn't let us
change the kernel cmdline, so we don't like using it.

> 
> Anyway, if this works 100% for you, then I am fine with it.

Yup, works perfectly!

> 
> It is too late in the cycle for me to pick it up. I will take it
> after
> the merge window.

That's fine with me.

> 
> 
> 
> 
> Best regards,
> Krzysztof
> 

Best regards,
David



[PATCH] arm64: dts: exynos: exynos7885-jackpotlte: Correct RAM amount to 4GB

2024-07-13 Thread David Virag
All known jackpotlte variants have 4GB of RAM, let's use it all.
RAM was set to 3GB from a mistake in the vendor provided DTS file.

Fixes: 06874015327b ("arm64: dts: exynos: Add initial device tree support for 
Exynos7885 SoC")
Signed-off-by: David Virag 
---
 arch/arm64/boot/dts/exynos/exynos7885-jackpotlte.dts | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/boot/dts/exynos/exynos7885-jackpotlte.dts b/arch/arm64/boot/dts/exynos/exynos7885-jackpotlte.dts
index ed2925b4715f..0d5c26a197d8 100644
--- a/arch/arm64/boot/dts/exynos/exynos7885-jackpotlte.dts
+++ b/arch/arm64/boot/dts/exynos/exynos7885-jackpotlte.dts
@@ -57,7 +57,7 @@ memory@80000000 {
 	device_type = "memory";
 	reg = <0x0 0x80000000 0x3da00000>,
 	      <0x0 0xc0000000 0x40000000>,
-	      <0x8 0x80000000 0x40000000>;
+	      <0x8 0x80000000 0x80000000>;
};
 
gpio-keys {
-- 
2.45.2




Re: [RFC PATCH v4] ptp: Add vDSO-style vmclock support

2024-07-11 Thread David Woodhouse
On Thu, 2024-07-11 at 09:25 +0200, Peter Hilber wrote:
> 
> IMHO this phrasing is better, since it directly refers to the state of the
> structure.

Thanks. I'll update it.

> AFAIU if there would be abnormal delays in store buffers, causing some
> driver to still see the old clock for some time, the monotonicity could be
> violated:
> 
> 1. device writes new, much slower clock to store buffer
> 2. some time passes
> 3. driver reads old, much faster clock
> 4. device writes store buffer to cache
> 5. driver reads new, much slower clock
> 
> But I hope such delays do not occur.

For the case of the hypervisor←→guest interface this should be handled
by the use of memory barriers and the seqcount lock.

The guest driver reads the seqcount, performs a read memory barrier,
then reads the contents of the structure. Then performs *another* read
memory barrier, and checks the seqcount hasn't changed:
https://git.infradead.org/?p=users/dwmw2/linux.git;a=blob;f=drivers/ptp/ptp_vmclock.c;hb=vmclock#l351

The converse happens with write barriers on the hypervisor side:
https://git.infradead.org/?p=users/dwmw2/qemu.git;a=blob;f=hw/acpi/vmclock.c;hb=vmclock#l68
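
For illustration, a minimal sketch of that read protocol (my own sketch
with Linux-style barrier names, not the actual ptp_vmclock.c code, which
also bounds the retry loop with a timeout):

	/*
	 * Retry until a consistent snapshot is read. The low bit of
	 * seq_count marks an update in progress; the two reads of
	 * seq_count bracket the data reads, ordered by read barriers.
	 */
	static void vmclock_read_snapshot(const struct vmclock_abi *clk,
					  uint64_t *sec, uint64_t *frac)
	{
		uint32_t seq;

		for (;;) {
			seq = READ_ONCE(clk->seq_count);
			if (seq & 1)
				continue;	/* writer mid-update: retry */
			smp_rmb();	/* order data reads after count read */

			*sec  = clk->time_sec;
			*frac = clk->time_frac_sec;

			smp_rmb();	/* order data reads before recheck */
			if (READ_ONCE(clk->seq_count) == seq)
				break;	/* snapshot is consistent */
		}
	}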

Do we need to think harder about the ordering across a real PCI bus? It
isn't entirely unreasonable for this to be implemented in hardware if
we eventually add a counter_id value for a bus-visible counter like the
Intel Always Running Timer (ART). I'm also OK with saying that device
implementations may only provide the shared memory structure if they
can ensure memory ordering.




Re: [RFC PATCH v4] ptp: Add vDSO-style vmclock support

2024-07-10 Thread David Woodhouse
On Wed, 2024-07-10 at 15:07 +0200, Peter Hilber wrote:
> On 08.07.24 11:27, David Woodhouse wrote:
> > From: David Woodhouse 
> > 
> > The vmclock "device" provides a shared memory region with precision clock
> > information. By using shared memory, it is safe across Live Migration.
> > 
> > Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> > KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> > actually helpful.
> > 
> > The memory region of the device is also exposed to userspace so it can be
> > read or memory mapped by applications which need reliable notification of
> > clock disruptions.
> > 
> > Signed-off-by: David Woodhouse 
> 
> [...]
> 
> > +
> > +struct vmclock_abi {
> > +   /* CONSTANT FIELDS */
> > +   uint32_t magic;
> > +#define VMCLOCK_MAGIC  0x4b4c4356 /* "VCLK" */
> > +   uint32_t size;  /* Size of region containing this structure */
> > +   uint16_t version;   /* 1 */
> > +   uint8_t counter_id; /* Matches VIRTIO_RTC_COUNTER_xxx except INVALID */
> > +#define VMCLOCK_COUNTER_ARM_VCNT   0
> > +#define VMCLOCK_COUNTER_X86_TSC    1
> > +#define VMCLOCK_COUNTER_INVALID    0xff
> > +   uint8_t time_type; /* Matches VIRTIO_RTC_TYPE_xxx */
> > +#define VMCLOCK_TIME_UTC   0   /* Since 1970-01-01 00:00:00z */
> > +#define VMCLOCK_TIME_TAI   1   /* Since 1970-01-01 00:00:00z */
> > +#define VMCLOCK_TIME_MONOTONIC 2   /* Since undefined epoch */
> > +#define VMCLOCK_TIME_INVALID_SMEARED   3   /* Not supported */
> > +#define VMCLOCK_TIME_INVALID_MAYBE_SMEARED 4   /* Not supported */
> > +
> > +   /* NON-CONSTANT FIELDS PROTECTED BY SEQCOUNT LOCK */
> > +   uint32_t seq_count; /* Low bit means an update is in progress */
> > +   /*
> > +    * This field changes to another non-repeating value when the CPU
> > +    * counter is disrupted, for example on live migration. This lets
> > +    * the guest know that it should discard any calibration it has
> > +    * performed of the counter against external sources (NTP/PTP/etc.).
> > +    */
> > +   uint64_t disruption_marker;
> > +   uint64_t flags;
> > +   /* Indicates that the tai_offset_sec field is valid */
> > +#define VMCLOCK_FLAG_TAI_OFFSET_VALID  (1 << 0)
> > +   /*
> > +    * Optionally used to notify guests of pending maintenance events.
> > +    * A guest which provides latency-sensitive services may wish to
> > +    * remove itself from service if an event is coming up. Two flags
> > +    * indicate the approximate imminence of the event.
> > +    */
> > +#define VMCLOCK_FLAG_DISRUPTION_SOON   (1 << 1) /* About a day */
> > +#define VMCLOCK_FLAG_DISRUPTION_IMMINENT   (1 << 2) /* About an hour */
> > +#define VMCLOCK_FLAG_PERIOD_ESTERROR_VALID (1 << 3)
> > +#define VMCLOCK_FLAG_PERIOD_MAXERROR_VALID (1 << 4)
> > +#define VMCLOCK_FLAG_TIME_ESTERROR_VALID   (1 << 5)
> > +#define VMCLOCK_FLAG_TIME_MAXERROR_VALID   (1 << 6)
> > +   /*
> > +    * Even regardless of leap seconds, the time presented through this
> > +    * mechanism may not be strictly monotonic. If the counter slows down
> > +    * and the host adapts to this discovery, the time calculated from
> > +    * the value of the counter immediately after an update to this
> > +    * structure, may appear to be *earlier* than a calculation just
> > +    * before the update (while the counter was believed to be running
> > +    * faster than it now is). A guest operating system will typically
> > +    * *skew* its own system clock back towards the reference clock
> > +    * exposed here, rather than following this clock directly. If,
> > +    * however, this structure is being populated from such a system
> > +    * clock which is already handled in such a fashion and the results
> > +    * *are* guaranteed to be monotonic, such monotonicity can be
> > +    * advertised by setting this bit.
> > +    */
> 
> I wonder if this might be difficult to define in a standard.

I'm sure we could do better than my attempt above, but surely it isn't
*so* hard to define monotonicity?
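
(For concreteness, one way to state the property, in my words rather than
any draft spec text: for any two reads A and B of the clock by the same
consumer, where A completes before B begins, the time computed from B must
be >= the time computed from A; i.e. t_A < t_B implies C(t_A) <= C(t_B).)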

> Is there a need to define device and driver behavior in more detail? What
> would

Re: [PATCH 2/2] virtio: fix vq # when vq skipped

2024-07-09 Thread David Hildenbrand

On 05.07.24 12:09, Michael S. Tsirkin wrote:

virtio balloon communicates to the core that in some
configurations vq #s are non-contiguous by setting name
pointer to NULL.

Unfortunately, core then turned around and just made them
contiguous again. Result is that driver is out of spec.

Implement what the API was supposed to do
in the 1st place. Compatibility with buggy hypervisors
is handled inside virtio-balloon, which is the only driver
making use of this facility, so far.

Signed-off-by: Michael S. Tsirkin 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH 1/2] virtio_balloon: add work around for out of spec QEMU

2024-07-09 Thread David Hildenbrand

On 05.07.24 12:08, Michael S. Tsirkin wrote:

QEMU implemented the configuration
VIRTIO_BALLOON_F_REPORTING && ! VIRTIO_BALLOON_F_FREE_PAGE_HINT
incorrectly: it then uses vq3 for reporting, spec says it is always 4.

This is masked by a corresponding bug in driver:
add a work around as I'm going to try and fix the driver bug.

Signed-off-by: Michael S. Tsirkin 
---
  drivers/virtio/virtio_balloon.c | 19 +--
  1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 9a61febbd2f7..7dc3fcd56238 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -597,8 +597,23 @@ static int init_vqs(struct virtio_balloon *vb)
  
 	err = virtio_find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX, vqs,
 			      callbacks, names, NULL);
-	if (err)
-		return err;
+	if (err) {
+		/*
+		 * Try to work around QEMU bug which since 2020 confused vq
+		 * numbers when VIRTIO_BALLOON_F_REPORTING but not
+		 * VIRTIO_BALLOON_F_FREE_PAGE_HINT are offered.
+		 */
+		if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING) &&
+		    !virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
+			names[VIRTIO_BALLOON_VQ_FREE_PAGE] = "reporting_vq";
+			callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = balloon_ack;
+			err = virtio_find_vqs(vb->vdev,
+					      VIRTIO_BALLOON_VQ_REPORTING, vqs,
+					      callbacks, names, NULL);
+		}
+
+		if (err)
+			return err;
+	}
  
 	vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
 	vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];


Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb




[RFC PATCH v4] ptp: Add vDSO-style vmclock support

2024-07-08 Thread David Woodhouse
From: David Woodhouse 

The vmclock "device" provides a shared memory region with precision clock
information. By using shared memory, it is safe across Live Migration.

Like the KVM PTP clock, this can convert TSC-based cross timestamps into
KVM clock values. Unlike the KVM PTP clock, it does so only when such is
actually helpful.

The memory region of the device is also exposed to userspace so it can be
read or memory mapped by applications which need reliable notification of
clock disruptions.

Signed-off-by: David Woodhouse 
---

QEMU implementation at
https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock

Remaining questions/TODO for virtio-rtc adoption:
 • Use of signed integer for tai_offset field
 • Explicit little-endianness
 • Is picoseconds the right unit for absolute error (I was going to make
   this (seconds>>64) but that actually reduces the *range* that can be
   expressed).
 • Are the clock_status values sensible?

v4:
 • Add esterror fields, MONOTONIC flag.
 • Reduce seq_count to 32 bits
 • Expand size to permit 64KiB pages
 • Align with virtio-rtc fields, values and leap handling
 • Drop gettime() method (since we have gettimex())
 • Add leap second smearing hint
 • Use a real _CRS on the ACPI device

v3: (wrong patch sent)

v2:
 • Add gettimex64() support
 • Convert TSC values to KVM clock when appropriate
 • Require int128 support
 • Add counter_period_shift
 • Add timeout when seq_count is invalid
 • Add flags field
 • Better comments in vmclock ABI structure
 • Explicitly forbid smearing (as clock rates would need to change)


 drivers/ptp/Kconfig  |  13 +
 drivers/ptp/Makefile |   1 +
 drivers/ptp/ptp_vmclock.c| 567 +++
 include/uapi/linux/vmclock-abi.h | 187 ++
 4 files changed, 768 insertions(+)
 create mode 100644 drivers/ptp/ptp_vmclock.c
 create mode 100644 include/uapi/linux/vmclock-abi.h

diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index 604541dcb320..e98c9767e0ef 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -131,6 +131,19 @@ config PTP_1588_CLOCK_KVM
  To compile this driver as a module, choose M here: the module
  will be called ptp_kvm.
 
+config PTP_1588_CLOCK_VMCLOCK
+   tristate "Virtual machine PTP clock"
+   depends on X86_TSC || ARM_ARCH_TIMER
+   depends on PTP_1588_CLOCK && ACPI && ARCH_SUPPORTS_INT128
+   default y
+   help
+ This driver adds support for using a virtual precision clock
+ advertised by the hypervisor. This clock is only useful in virtual
+ machines where such a device is present.
+
+ To compile this driver as a module, choose M here: the module
+ will be called ptp_vmclock.
+
 config PTP_1588_CLOCK_IDT82P33
tristate "IDT 82P33xxx PTP clock"
depends on PTP_1588_CLOCK && I2C
diff --git a/drivers/ptp/Makefile b/drivers/ptp/Makefile
index 68bf02078053..01b5cd91eb61 100644
--- a/drivers/ptp/Makefile
+++ b/drivers/ptp/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_PTP_1588_CLOCK_DTE)  += ptp_dte.o
 obj-$(CONFIG_PTP_1588_CLOCK_INES)  += ptp_ines.o
 obj-$(CONFIG_PTP_1588_CLOCK_PCH)   += ptp_pch.o
 obj-$(CONFIG_PTP_1588_CLOCK_KVM)   += ptp_kvm.o
+obj-$(CONFIG_PTP_1588_CLOCK_VMCLOCK)   += ptp_vmclock.o
 obj-$(CONFIG_PTP_1588_CLOCK_QORIQ) += ptp-qoriq.o
 ptp-qoriq-y				+= ptp_qoriq.o
 ptp-qoriq-$(CONFIG_DEBUG_FS)   += ptp_qoriq_debugfs.o
diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
new file mode 100644
index ..30f15d7753bb
--- /dev/null
+++ b/drivers/ptp/ptp_vmclock.c
@@ -0,0 +1,567 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Virtual PTP 1588 clock for use with LM-safe VMclock device.
+ *
+ * Copyright © 2024 Amazon.com, Inc. or its affiliates.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#include 
+
+#ifdef CONFIG_X86
+#include 
+#include 
+#endif
+
+#ifdef CONFIG_KVM_GUEST
+#define SUPPORT_KVMCLOCK
+#endif
+
+static DEFINE_IDA(vmclock_ida);
+
+ACPI_MODULE_NAME("vmclock");
+
+struct vmclock_state {
+   struct resource res;
+   struct vmclock_abi *clk;
+   struct miscdevice miscdev;
+   struct ptp_clock_info ptp_clock_info;
+   struct ptp_clock *ptp_clock;
+   enum clocksource_ids cs_id, sys_cs_id;
+   int index;
+   char *name;
+};
+
+#define VMCLOCK_MAX_WAIT ms_to_ktime(100)
+
+/*
+ * Multiply a 64-bit count by a 64-bit tick 'period' in units of seconds >> 64
+ * and add the fractional second part of the reference time.
+ *
+ * The result is a 128-bit value, the top 64 bits of which are seconds, and
+ * the low 64 bits are (seconds >> 64).
+ *
+ * If __int128 isn't available, perform the calculation 32 bits at a time to
+ * avoid overflow.
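
(A minimal sketch of what that comment describes, assuming unsigned
__int128 is available; this is my illustration, not the driver's exact
code:)

	#include <stdint.h>

	/*
	 * delta: counter ticks since the reference point.
	 * period_frac: tick period in units of (seconds >> 64).
	 * frac_sec: fractional-second part of the reference time,
	 * also in units of (seconds >> 64).
	 */
	static inline void mul_add_frac(uint64_t delta, uint64_t period_frac,
					uint64_t frac_sec,
					uint64_t *sec, uint64_t *frac)
	{
		unsigned __int128 res = (unsigned __int128)delta * period_frac;

		res += frac_sec;		/* may carry into the seconds */
		*sec  = (uint64_t)(res >> 64);	/* whole seconds of the delta */
		*frac = (uint64_t)res;		/* remainder, (seconds >> 64) */
	}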

Re: [RFC PATCH v3] ptp: Add vDSO-style vmclock support

2024-07-08 Thread David Woodhouse
On Mon, 2024-07-08 at 10:17 +0100, Simon Horman wrote:
> Hi David,
> 
> As per my comment on v2, although it is harmless in this case,
> it would be nicer to use strscpy() here and avoid fortification
> warnings.
> 

Oh, pants! I have fixed that; must have sent the wrong version.
V4 coming up shortly. With git-send-email this time.

Thanks.




[RFC PATCH v3] ptp: Add vDSO-style vmclock support

2024-07-06 Thread David Woodhouse
From: David Woodhouse 

The vmclock "device" provides a shared memory region with precision clock
information. By using shared memory, it is safe across Live Migration.

Like the KVM PTP clock, this can convert TSC-based cross timestamps into
KVM clock values. Unlike the KVM PTP clock, it does so only when such is
actually helpful.

The memory region of the device is also exposed to userspace so it can be
read or memory mapped by applications which need reliable notification of
clock disruptions.

The shared memory structure is intended to be adopted into the nascent
virtio-rtc specification (since one might consider a virtio-rtc
specification that doesn't fix the live migration problem to be not fit
for purpose). It can also be presented via a simple ACPI device.

Signed-off-by: David Woodhouse 
---
QEMU implementation at 
https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock

Remaining questions/TODO for virtio-rtc adoption:
 • Use of signed integer for tai_offset field
 • Explicit little-endianness
 • Is picoseconds the right unit for absolute error (I was going to make
   this (seconds>>64) but that actually reduces the *range* that can be
   expressed).
 • Are the clock_status values sensible?

v3:
 • Add esterror fields
 • Reduce seq_count to 32 bits
 • Expand size to permit 64KiB pages
 • Align with virtio-rtc fields, values and leap handling
 • Drop gettime() method (since we have gettimex())
 • Add leap second smearing hint
 • Use a real _CRS on the ACPI device

v2: 
 • Add gettimex64() support
 • Convert TSC values to KVM clock when appropriate
 • Require int128 support
 • Add counter_period_shift
 • Add timeout when seq_count is invalid
 • Add flags field
 • Better comments in vmclock ABI structure
 • Explicitly forbid smearing (as clock rates would need to change)

 drivers/ptp/Kconfig  |  13 +
 drivers/ptp/Makefile |   1 +
 drivers/ptp/ptp_vmclock.c| 516 +++
 include/uapi/linux/vmclock.h | 138 ++
 4 files changed, 668 insertions(+)
 create mode 100644 drivers/ptp/ptp_vmclock.c
 create mode 100644 include/uapi/linux/vmclock.h

diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index 604541dcb320..e98c9767e0ef 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -131,6 +131,19 @@ config PTP_1588_CLOCK_KVM
  To compile this driver as a module, choose M here: the module
  will be called ptp_kvm.
 
+config PTP_1588_CLOCK_VMCLOCK
+   tristate "Virtual machine PTP clock"
+   depends on X86_TSC || ARM_ARCH_TIMER
+   depends on PTP_1588_CLOCK && ACPI && ARCH_SUPPORTS_INT128
+   default y
+   help
+ This driver adds support for using a virtual precision clock
+ advertised by the hypervisor. This clock is only useful in virtual
+ machines where such a device is present.
+
+ To compile this driver as a module, choose M here: the module
+ will be called ptp_vmclock.
+
 config PTP_1588_CLOCK_IDT82P33
tristate "IDT 82P33xxx PTP clock"
depends on PTP_1588_CLOCK && I2C
diff --git a/drivers/ptp/Makefile b/drivers/ptp/Makefile
index 68bf02078053..01b5cd91eb61 100644
--- a/drivers/ptp/Makefile
+++ b/drivers/ptp/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_PTP_1588_CLOCK_DTE)  += ptp_dte.o
 obj-$(CONFIG_PTP_1588_CLOCK_INES)  += ptp_ines.o
 obj-$(CONFIG_PTP_1588_CLOCK_PCH)   += ptp_pch.o
 obj-$(CONFIG_PTP_1588_CLOCK_KVM)   += ptp_kvm.o
+obj-$(CONFIG_PTP_1588_CLOCK_VMCLOCK)   += ptp_vmclock.o
 obj-$(CONFIG_PTP_1588_CLOCK_QORIQ) += ptp-qoriq.o
 ptp-qoriq-y				+= ptp_qoriq.o
 ptp-qoriq-$(CONFIG_DEBUG_FS)   += ptp_qoriq_debugfs.o
diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
new file mode 100644
index ..e19c2eed8009
--- /dev/null
+++ b/drivers/ptp/ptp_vmclock.c
@@ -0,0 +1,516 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Virtual PTP 1588 clock for use with LM-safe VMclock device.
+ *
+ * Copyright © 2024 Amazon.com, Inc. or its affiliates.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#ifdef CONFIG_X86
+#include 
+#include 
+#endif
+
+static DEFINE_IDA(vmclock_ida);
+
+ACPI_MODULE_NAME("vmclock");
+
+struct vmclock_state {
+   phys_addr_t phys_addr;
+   struct vmclock_abi *clk;
+   struct miscdevice miscdev;
+   struct ptp_clock_info ptp_clock_info;
+   struct ptp_clock *ptp_clock;
+   enum clocksource_ids cs_id, sys_cs_id;
+   int index;
+   char *name;
+};
+
+#define VMCLOCK_MAX_WAIT ms_to_ktime(100)
+
+/*
+ * Multiply a 64-bit count by a 64-bit tick 'period' in units of seconds >> 64
+ * and add the fractional second part of the reference time.
+ *
+ * The result is a 128-bit value, the top 64 bits of which are seconds, and
+ * the low 64 bits are (seconds >> 64).

Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

2024-07-05 Thread David Woodhouse
On Fri, 2024-07-05 at 10:12 +0200, Peter Hilber wrote:
> On 03.07.24 12:40, David Woodhouse wrote:
> 
> [...]
> 
> > 
> > 
> > This is what I currently have for 'struct vmclock_abi' that I'd like to
> > persuade you to adopt. I need to tweak it some more, for at least the
> > following reasons, as well as any more you can see:
> > 
> >  • size isn't big enough for 64KiB pages
> >  • Should be explicitly little-endian
> >  • Does it need esterror as well as maxerror?
> 
> I have no opinion about this. I can drop esterror if unwanted.

I also don't care. I'm just observing the inconsistency.

> >  • Why is maxerror in picoseconds? It's the only use of that unit

Between us we now have picoseconds, nanoseconds, (seconds >> 64) and
(seconds >> 64+n).

The power-of-two fractions seem to make a lot of sense for the counter
period, because they mean we don't have to perform divisions.

Does it make sense to harmonise on (seconds >> 64) for all of the
fractional seconds? Again I don't have a strong opinion; I only want us
to have a *reason* for any differences that exist.
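
(For scale, my arithmetic: one unit of (seconds >> 64) is 2^-64 s, about
5.4e-20 s, so it is far finer than either nanoseconds or picoseconds, but
a 64-bit field of it spans less than one second by construction. A uint64
count of picoseconds, by contrast, spans about 2^64 * 1e-12 s, roughly 213
days. That is the range-vs-unit trade-off mentioned in the patch notes.)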

> >  • Where do the clock_status values come from? Do they make sense?
> >  • Are signed integers OK? (I think so!).
> 
> Signed integers would need to be introduced to Virtio, which so far only
> uses explicitly unsigned types: u8, le16 etc.

Perhaps. Although it would also be possible (if not ideal) to define
that e.g. the tai_offset field is a 16-bit "unsigned" integer according
to virtio, but to be interpreted as follows:

If the number is <= 32767 then the TAI offset is that value, but if the
number is >= 32768 then the TAI offset is that value minus 65536.

Perhaps not pretty, but there isn't a *fundamental* dependency on
virtio supporting signed integers as a primary type.
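
(A minimal sketch of decoding such a field on the guest side; my
illustration, assuming C99 fixed-width types:)

	#include <stdint.h>

	/*
	 * Decode a 16-bit "unsigned" field carrying a two's-complement
	 * value, per the interpretation above: raw values >= 32768 map to
	 * (raw - 65536). A cast to int16_t performs exactly that mapping
	 * on two's-complement targets.
	 */
	static inline int16_t decode_tai_offset(uint16_t raw)
	{
		return (int16_t)raw;
	}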





Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

2024-07-03 Thread David Woodhouse
On Wed, 2024-07-03 at 11:56 +0200, Peter Hilber wrote:
> On 02.07.24 20:40, David Woodhouse wrote:
> > On 2 July 2024 19:12:00 BST, Peter Hilber  
> > wrote:
> > > On 02.07.24 18:39, David Woodhouse wrote:
> > > > To clarify then, the main types are
> > > > 
> > > >  VIRTIO_RTC_CLOCK_UTC == 0
> > > >  VIRTIO_RTC_CLOCK_TAI == 1
> > > >  VIRTIO_RTC_CLOCK_MONOTONIC == 2
> > > >  VIRTIO_RTC_CLOCK_SMEARED_UTC == 3
> > > > 
> > > > And the subtypes are *only* for the case of
> > > > VIRTIO_RTC_CLOCK_SMEARED_UTC. They include
> > > > 
> > > >  VIRTIO_RTC_SUBTYPE_STRICT
> > > >  VIRTIO_RTC_SUBTYPE_UNDEFINED /* or whatever you want to call it */
> > > >  VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 
> > > >  VIRTIO_RTC_SUBTYPE_UTC_SLS /* if it's worth doing this one */
> > > > 
> > > > Is that what we just agreed on?
> > > > 
> > > > 
> > > 
> > > This is a misunderstanding. My idea was that the main types are
> > > 
> > > >  VIRTIO_RTC_CLOCK_UTC == 0
> > > >  VIRTIO_RTC_CLOCK_TAI == 1
> > > >  VIRTIO_RTC_CLOCK_MONOTONIC == 2
> > > >  VIRTIO_RTC_CLOCK_SMEARED_UTC == 3
> > > 
> > > VIRTIO_RTC_CLOCK_MAYBE_SMEARED_UTC == 4
> > > 
> > > The subtypes would be (1st for clocks other than
> > > VIRTIO_RTC_CLOCK_SMEARED_UTC, 2nd to last for
> > > VIRTIO_RTC_CLOCK_SMEARED_UTC):
> > > 
> > > #define VIRTIO_RTC_SUBTYPE_STRICT 0
> > > #define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 1
> > > #define VIRTIO_RTC_SUBTYPE_SMEAR_UTC_SLS 2
> > > 
> > 
> > Thanks. I really do think that from the guest point of view there's
> > really no distinction between "maybe smeared" and "undefined
> > smearing", and have a preference for using the latter form, which
> > is the key difference there?
> > 
> > Again though, not a hill for me to die on.
> 
> I have no issue with staying with "undefined smearing", so would you agree
> to something like
> 
> VIRTIO_RTC_CLOCK_SMEAR_UNDEFINED_UTC == 4
> 
> (or another name if you prefer)?

Well, the point of contention was really whether that was a *type* or a
*subtype*.

Either way, it's a "precision clock" telling its consumer that the
device *itself* doesn't really know what time is being exposed. Which
seems like a bizarre thing to support.

But I think I've constructed an argument which persuades me to your
point of view that *if* we permit it, it should be a primary type...

A clock can *either* be UTC, *or* it can be monotonic. The whole point
of smearing is to produce a monotonic clock, of course.

VIRTIO_RTC_CLOCK_UTC is UTC. It is not monotonic.

VIRTIO_RTC_CLOCK_SMEARED is, presumably, monotonic (and I think we
should explicitly require that to be true in virtio-rtc).


But VIRTIO_RTC_CLOCK_MAYBE_SMEARED is the worst of both worlds. It is
neither known to be correct UTC, *nor* is it known to be monotonic. So
(again, if we permit it at all) I think it probably does make sense for
that to be a primary type.


This is what I currently have for 'struct vmclock_abi' that I'd like to
persuade you to adopt. I need to tweak it some more, for at least the
following reasons, as well as any more you can see:

 • size isn't big enough for 64KiB pages
 • Should be explicitly little-endian
 • Does it need esterror as well as maxerror?
 • Why is maxerror in picoseconds? It's the only use of that unit
 • Where do the clock_status values come from? Do they make sense?
 • Are signed integers OK? (I think so!).

 
/*
 * This structure provides a vDSO-style clock to VM guests, exposing the
 * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch
 * counter, etc.) and real time. It is designed to address the problem of
 * live migration, which other clock enlightenments do not.
 *
 * When a guest is live migrated, this affects the clock in two ways.
 *
 * First, even between identical hosts the actual frequency of the underlying
 * counter will change within the tolerances of its specification (typically
 * ±50PPM, or 4 seconds a day). This frequency also varies over time on the
 * same host, but can be tracked by NTP as it generally varies slowly. With
 * live migration there is a step change in the frequency, with no warning.
 *
 * Second, there may be a step change in the value of the counter itself, as
 * its accuracy is limited by the precision of the NTP synchronization on the
 * source and destination hosts.
 *
 * So any calibration (NTP, PTP, etc.) which the guest has done on the source
 * host before migration is invalid, and needs to be redone on the new host.
 *
 * In its m

Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

2024-07-02 Thread David Woodhouse
On 2 July 2024 19:12:00 BST, Peter Hilber  wrote:
>On 02.07.24 18:39, David Woodhouse wrote:
>> To clarify then, the main types are
>> 
>>  VIRTIO_RTC_CLOCK_UTC == 0
>>  VIRTIO_RTC_CLOCK_TAI == 1
>>  VIRTIO_RTC_CLOCK_MONOTONIC == 2
>>  VIRTIO_RTC_CLOCK_SMEARED_UTC == 3
>> 
>> And the subtypes are *only* for the case of
>> VIRTIO_RTC_CLOCK_SMEARED_UTC. They include
>> 
>>  VIRTIO_RTC_SUBTYPE_STRICT
>>  VIRTIO_RTC_SUBTYPE_UNDEFINED /* or whatever you want to call it */
>>  VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 
>>  VIRTIO_RTC_SUBTYPE_UTC_SLS /* if it's worth doing this one */
>> 
>> Is that what we just agreed on?
>> 
>> 
>
>This is a misunderstanding. My idea was that the main types are
>
>>  VIRTIO_RTC_CLOCK_UTC == 0
>>  VIRTIO_RTC_CLOCK_TAI == 1
>>  VIRTIO_RTC_CLOCK_MONOTONIC == 2
>>  VIRTIO_RTC_CLOCK_SMEARED_UTC == 3
>
>VIRTIO_RTC_CLOCK_MAYBE_SMEARED_UTC == 4
>
>The subtypes would be (1st for clocks other than
>VIRTIO_RTC_CLOCK_SMEARED_UTC, 2nd to last for
>VIRTIO_RTC_CLOCK_SMEARED_UTC):
>
>#define VIRTIO_RTC_SUBTYPE_STRICT 0
>#define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 1
>#define VIRTIO_RTC_SUBTYPE_SMEAR_UTC_SLS 2
>

Thanks. I really do think that from the guest point of view there's really no 
distinction between "maybe smeared" and "undefined smearing", and have a 
preference for using the latter form, which is the key difference there?

Again though, not a hill for me to die on.



Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

2024-07-02 Thread David Woodhouse
On Tue, 2024-07-02 at 17:03 +0200, Peter Hilber wrote:
> > On 01.07.24 10:57, David Woodhouse wrote:
> > > > If my proposed memory structure is subsumed into the virtio-rtc
> > > > proposal we'd literally use the same names, but for the time being I'll
> > > > update mine to:
> > 
> > Do you intend vmclock and virtio-rtc to be ABI compatible?

I intend you to incorporate a shared memory structure like the vmclock
one into the virtio-rtc standard :)

Because precision clocks in a VM are fundamentally nonsensical without
a way for the guest to know when live migration has screwed them up.

So yes, to facilitate that, I'm trying to align things so that the
numeric values of fields like the time_type and smearing hint are
*precisely* the same as the VIRTIO_RTC_xxx values.

Time pressure *may* mean I have to ship an ACPI-based device with a
preliminary version of the structure before I've finished persuading
you, and before we've completely finalized the structure. I am hoping
to avoid that.

(In fact, my time pressure only applies to a version of the structure
which carries the disruption_marker field; the actual clock calibration
information doesn't have to be present in the interim implementation.)


> >  FYI, I see a
> > potential problem in that Virtio does avoid the use of signed integers so
> > far. I did not check carefully if there might be other problems, yet.

Hm, you use an unsigned integer to convey the tai_offset. I suppose at
+37 and with a plan to stop doing leap seconds in the next decade,
we're unlikely to get back below zero?

The other signed integer I had was for the leap second direction, but I
think I'm happy to drop that and just adopt your uint8_t leap field
with VIRTIO_RTC_LEAP_{PRE_POS,PRE_NEG,etc.}.
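
On the guest side the tai_offset consumption is then trivial; a sketch
only, assuming tai_offset_sec is defined as TAI - UTC (so +37 today) and
the field names from my draft structure:

        /* Sketch: recover both time standards from a single reading. */
        if (clk->flags & VMCLOCK_FLAG_TAI_OFFSET_VALID) {
                if (clk->time_type == VMCLOCK_TIME_UTC)
                        tai_sec = utc_sec + clk->tai_offset_sec;
                else if (clk->time_type == VMCLOCK_TIME_TAI)
                        utc_sec = tai_sec - clk->tai_offset_sec;
        }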





> > > > 
> > > > /*
> > > >  * What time is exposed in the time_sec/time_frac_sec fields?
> > > >  */
> > > > uint8_t time_type;
> > > > #define VMCLOCK_TIME_UTC        0   /* Since 1970-01-01 00:00:00z */
> > > > #define VMCLOCK_TIME_TAI        1   /* Since 1970-01-01 00:00:00z */
> > > > #define VMCLOCK_TIME_MONOTONIC  2   /* Since undefined epoch */
> > > > #define VMCLOCK_TIME_INVALID    3   /* virtio-rtc uses this for smeared UTC */
> > > > 
> > > > 
> > > > I can then use your smearing subtype values as the 'hint' field in the
> > > > shared memory structure. You currently have:
> > > > 
> > > > +\begin{lstlisting}
> > > > +#define VIRTIO_RTC_SUBTYPE_STRICT 0
> > > > +#define VIRTIO_RTC_SUBTYPE_SMEAR 1
> > > > +#define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 2
> > > > +#define VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED 3
> > > > +\end{lstlisting}
> > > > 
> > 
> > I agree with the above part of your proposal.
> > 
> > > > I can certainly ensure that 'noon linear' has the same value. I don't
> > > > think you need both 'SMEAR' and 'LEAP_UNSPECIFIED' though:
> > > > 
> > > > 
> > > > +\item VIRTIO_RTC_SUBTYPE_SMEAR deviates from the UTC standard by
> > > > +   smearing time in the vicinity of the leap second, in a not
> > > > +   precisely defined manner. This avoids clock steps due to UTC
> > > > +   leap seconds.
> > > > 
> > > > ...
> > > > 
> > > > +\item VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED may deviate from the UTC
> > > > +   standard w.r.t.\ leap second introduction in an unspecified
> > > > way
> > > > +   (leap seconds may, or may not, be smeared).
> > > > 
> > > > To the client, both of those just mean "for a day or so around a leap
> > > > second event, you can't trust this device to know what the time is".
> > > > There isn't any point in separating "does lie to you" from "might lie
> > > > to you", surely? The guest can't do anything useful with that
> > > > distinction. Let's drop SMEAR and keep only LEAP_UNSPECIFIED?
> > 
> > As for VIRTIO_RTC_SUBTYPE_SMEAR, I think this could be dropped indeed
> > (resp., UTC_SLS may be added).
> > 
> > But VIRTIO_RTC_CLOCK_SMEARED_UTC is an assurance that there will be no
> > steps (in particular, steps backwards, which some clients might not like)
> > due to leap seconds, while LEAP_UNSPECIFIED provides no such guarantee.
> > 
> > So I think this might be better handled by adding, alon

Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

2024-07-01 Thread David Woodhouse
On Fri, 2024-06-28 at 22:27 +0100, David Woodhouse wrote:
> On 28 June 2024 17:38:15 BST, Peter Hilber  
> wrote:
> > On 28.06.24 14:15, David Woodhouse wrote:
> > > On Fri, 2024-06-28 at 13:33 +0200, Peter Hilber wrote:
> > > > On 27.06.24 16:52, David Woodhouse wrote:
> > > > > I already added a flags field, so this might look something like:
> > > > > 
> > > > >     /*
> > > > >  * Smearing flags. The UTC clock exposed through this 
> > > > > structure
> > > > >  * is only ever true UTC, but a guest operating system may
> > > > >  * choose to offer a monotonic smeared clock to its users. 
> > > > > This
> > > > >  * merely offers a hint about what kind of smearing to 
> > > > > perform,
> > > > >  * for consistency with systems in the nearby environment.
> > > > >  */
> > > > > #define VMCLOCK_FLAGS_SMEAR_UTC_SLS (1<<5) /* draft-kuhn-leapsecond-00.txt */
> > > > > 
> > > > > (UTC-SLS is probably a bad example but are there formal definitions 
> > > > > for
> > > > > anything else?)
> > > > 
> > > > I think it could also be more generic, like flags for linear smearing,
> > > > cosine smearing(?), and smear_start_sec and smear_end_sec fields 
> > > > (relative
> > > > to the leap second start). That could also represent UTC-SLS, and
> > > > noon-to-noon, and it would be well-defined.
> > > > 
> > > > This should reduce the likelihood that the guest doesn't know the 
> > > > smearing
> > > > variant.
> > > 
> > > I'm wary of making it too generic. That would seem to encourage a
> > > *proliferation* of false "UTC-like" clocks.
> > > 
> > > It's bad enough that we do smearing at all, let alone that we don't
> > > have a single definition of how to do it.
> > > 
> > > I made the smearing hint a full uint8_t instead of using bits in flags,
> > > in the end. That gives us a full 255 ways of lying to users about what
> > > the time is, so we're unlikely to run out. And it's easy enough to add
> > > a new VMCLOCK_SMEARING_XXX type to the 'registry' for any new methods
> > > that get invented.
> > > 
> > > 
> > 
> > My concern is that the registry update may come after a driver has already
> > been implemented, so that it may be hard to ensure that the smearing which
> > has been chosen is actually implemented.
> 
> Well yes, but why in the name of all that is holy would anyone want
> to invent *new* ways to lie to users about the time? If we capture
> the existing ones as we write this, surely it's a good thing that
> there's a barrier to entry for adding more?

Ultimately though, this isn't the hill for me to die on. I'm pushing on
that topic because I want to avoid the proliferation of *ambiguity*. If
we have a precision clock, we should *know* what the time is.

So how about this proposal. I line up the fields in the proposed shared
memory structure to match your virtio-rtc proposal, using 'subtype' as
you proposed. But, instead of the 'subtype' being valid only for
VIRTIO_RTC_CLOCK_UTC, we define a new top-level type for *smeared* UTC.

So, you have:

+\begin{lstlisting}
+#define VIRTIO_RTC_CLOCK_UTC 0
+#define VIRTIO_RTC_CLOCK_TAI 1
+#define VIRTIO_RTC_CLOCK_MONO 2
+\end{lstlisting}

I propose that you add

#define VIRTIO_RTC_CLOCK_SMEARED_UTC 3

If my proposed memory structure is subsumed into the virtio-rtc
proposal we'd literally use the same names, but for the time being I'll
update mine to:

/*
 * What time is exposed in the time_sec/time_frac_sec fields?
 */
uint8_t time_type;
#define VMCLOCK_TIME_UTC        0   /* Since 1970-01-01 00:00:00z */
#define VMCLOCK_TIME_TAI        1   /* Since 1970-01-01 00:00:00z */
#define VMCLOCK_TIME_MONOTONIC  2   /* Since undefined epoch */
#define VMCLOCK_TIME_INVALID    3   /* virtio-rtc uses this for smeared UTC */


I can then use your smearing subtype values as the 'hint' field in the
shared memory structure. You currently have:

+\begin{lstlisting}
+#define VIRTIO_RTC_SUBTYPE_STRICT 0
+#define VIRTIO_RTC_SUBTYPE_SMEAR 1
+#define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 2
+#define VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED 3
+\end{lstlisting}

I can certainly ensure that 'noon linear' has the same value. I don't
think you need both 'SMEAR' and 'LEAP_UNSPECIFIED' though:


+\item VIRTIO_RTC_SUBTYPE_SMEAR deviates from the UTC standard 

Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

2024-06-28 Thread David Woodhouse
On 28 June 2024 17:38:15 BST, Peter Hilber  wrote:
>On 28.06.24 14:15, David Woodhouse wrote:
>> On Fri, 2024-06-28 at 13:33 +0200, Peter Hilber wrote:
>>> On 27.06.24 16:52, David Woodhouse wrote:
>>>> I already added a flags field, so this might look something like:
>>>>
>>>>     /*
>>>>  * Smearing flags. The UTC clock exposed through this structure
>>>>  * is only ever true UTC, but a guest operating system may
>>>>  * choose to offer a monotonic smeared clock to its users. This
>>>>  * merely offers a hint about what kind of smearing to perform,
>>>>  * for consistency with systems in the nearby environment.
>>>>  */
>>>> #define VMCLOCK_FLAGS_SMEAR_UTC_SLS (1<<5) /* draft-kuhn-leapsecond-00.txt */
>>>>
>>>> (UTC-SLS is probably a bad example but are there formal definitions for
>>>> anything else?)
>>>
>>> I think it could also be more generic, like flags for linear smearing,
>>> cosine smearing(?), and smear_start_sec and smear_end_sec fields (relative
>>> to the leap second start). That could also represent UTC-SLS, and
>>> noon-to-noon, and it would be well-defined.
>>>
>>> This should reduce the likelihood that the guest doesn't know the smearing
>>> variant.
>> 
>> I'm wary of making it too generic. That would seem to encourage a
>> *proliferation* of false "UTC-like" clocks.
>> 
>> It's bad enough that we do smearing at all, let alone that we don't
>> have a single definition of how to do it.
>> 
>> I made the smearing hint a full uint8_t instead of using bits in flags,
>> in the end. That gives us a full 255 ways of lying to users about what
>> the time is, so we're unlikely to run out. And it's easy enough to add
>> a new VMCLOCK_SMEARING_XXX type to the 'registry' for any new methods
>> that get invented.
>> 
>> 
>
>My concern is that the registry update may come after a driver has already
>been implemented, so that it may be hard to ensure that the smearing which
>has been chosen is actually implemented.

Well yes, but why in the name of all that is holy would anyone want to invent 
*new* ways to lie to users about the time? If we capture the existing ones as 
we write this, surely it's a good thing that there's a barrier to entry for 
adding more?


>But the error bounds could be large or missing. I am trying to address use
>cases where the host steps or slews the clock as well.

The host is absolutely intended to be skewing the clock to keep it accurate as 
the frequency of the underlying oscillator changes, and the seq_count field 
will change every time the host does so.

Do we need to handle steps differently? Or just let the guest deal with it?

If an NTP server suddenly steps the time it reports, what does the client do?




Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

2024-06-28 Thread David Woodhouse
On Fri, 2024-06-28 at 13:33 +0200, Peter Hilber wrote:
> On 27.06.24 16:52, David Woodhouse wrote:
> > I already added a flags field, so this might look something like:
> > 
> >     /*
> >  * Smearing flags. The UTC clock exposed through this structure
> >  * is only ever true UTC, but a guest operating system may
> >  * choose to offer a monotonic smeared clock to its users. This
> >  * merely offers a hint about what kind of smearing to perform,
> >  * for consistency with systems in the nearby environment.
> >  */
> > #define VMCLOCK_FLAGS_SMEAR_UTC_SLS (1<<5) /* draft-kuhn-leapsecond-00.txt */
> > 
> > (UTC-SLS is probably a bad example but are there formal definitions for
> > anything else?)
> 
> I think it could also be more generic, like flags for linear smearing,
> cosine smearing(?), and smear_start_sec and smear_end_sec fields (relative
> to the leap second start). That could also represent UTC-SLS, and
> noon-to-noon, and it would be well-defined.
> 
> This should reduce the likelihood that the guest doesn't know the smearing
> variant.

I'm wary of making it too generic. That would seem to encourage a
*proliferation* of false "UTC-like" clocks.

It's bad enough that we do smearing at all, let alone that we don't
have a single definition of how to do it.

I made the smearing hint a full uint8_t instead of using bits in flags,
in the end. That gives us a full 255 ways of lying to users about what
the time is, so we're unlikely to run out. And it's easy enough to add
a new VMCLOCK_SMEARING_XXX type to the 'registry' for any new methods
that get invented.


> > > > +   /*
> > > > +    * This field changes to another non-repeating value when the 
> > > > CPU
> > > > +    * counter is disrupted, for example on live migration.
> > > > +    */
> > > > +   uint64_t disruption_marker;
> > > 
> > > The field could also change when the clock is stepped (leap seconds
> > > excepted), or when the clock frequency is slewed.
> > 
> > I'm not sure. The concept of the disruption marker is that it tells the
> > guest to throw away any calibration of the counter that the guest has
> > done for *itself* (with NTP, other PTP devices, etc.).
> > 
> > One mode for this device would be not to populate the clock fields at
> > all, but *only* to signal disruption when it occurs. So the guest can
> > abort transactions until it's resynced its clocks (to avoid incurring
> > fines if breaking databases, etc.).
> > 
> > Exposing the host timekeeping through the structure means that the
> > migrated guest can keep working because it can trust the timekeeping
> > performed by the (new) host and exposed to it.
> > 
> > If the counter is actually varying in frequency over time, and the host
> > is slewing the clock frequency that it reports, that *isn't* a step
> > change and doesn't mean that the guest should throw away any
> > calibration that it's been doing for itself. One hopes that the guest
> > would have detected the *same* frequency change, and be adapting for
> > itself. So I don't think that should indicate a disruption.
> > 
> > I think the same is even true if the clock is stepped by the host. The
> > actual *counter* hasn't changed, so the guest is better off ignoring
> > the vacillating host and continuing to derive its idea of time from the
> > hardware counter itself, as calibrated against some external NTP/PTP
> > sources. Surely we actively *don't* to tell the guest to throw its own
> > calibrations away, in this case?
> 
> In case the guest is also considering other time sources, it might indeed
> not be a good idea to mix host clock changes into the hardware counter
> disruption marker.
> 
> But if the vmclock is the authoritative source of time, it can still be
> helpful to know about such changes, maybe through another marker.

Could that be the existing seq_count field?

Skewing the counter_period_frac_sec as the underlying oscillator speeds
up and slows down is perfectly normal and expected, and we already
expect the seq_count to change when that happens.

Maybe step changes are different, but arguably if the time advertised
by the host steps *outside* the error bounds previously advertised,
that's just broken?

Depending on how the clock information is fed, a change in seq_count
may even result in non-monotonicity. If the underlying oscillator has
sped up and the structure is updated accordingly, the time calculated
the moment *before* that update may appear later than the time
calculated immediately after it.

It's up to the guest operating system to feed that information into its
own timekeeping system and skew towards correctness instead of stepping
the time it reports to its users.
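
A sketch of how a guest might consume the structure under seq_count,
assuming the field names from the draft (a real reader also needs read
barriers, e.g. virt_rmb(), between the seq_count loads and the data
loads):

        static void vmclock_read(volatile struct vmclock_abi *clk,
                                 uint64_t *counter, uint64_t *sec,
                                 uint64_t *frac)
        {
                uint32_t seq;

                do {
                        seq = clk->seq_count;
                        if (seq & 1)
                                continue;   /* update in progress, retry */
                        *counter = clk->counter_value;
                        *sec     = clk->time_sec;
                        *frac    = clk->time_frac_sec;
                } while (clk->seq_count != seq);
        }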





Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

2024-06-28 Thread David Woodhouse
On Fri, 2024-06-28 at 13:33 +0200, Peter Hilber wrote:
> 
> > 
> > /*
> >  * What time is exposed in the time_sec/time_frac_sec fields?
> >  */
> > uint8_t time_type;
> > #define VMCLOCK_TIME_UNKNOWN    0   /* Invalid / no time exposed */
> > #define VMCLOCK_TIME_UTC        1   /* Since 1970-01-01 00:00:00z */
> > #define VMCLOCK_TIME_TAI        2   /* Since 1970-01-01 00:00:00z */
> > #define VMCLOCK_TIME_MONOTONIC  3   /* Since undefined epoch */
> > 
> > /* Bit shift for counter_period_frac_sec and its error rate */
> > uint8_t counter_period_shift;
> > 
> > /*
> >  * Unlike in NTP, this can indicate a leap second in the past. This
> >  * is needed to allow guests to derive an imprecise clock with
> >  * smeared leap seconds for themselves, as some modes of smearing
> >  * need the adjustments to continue even after the moment at which
> >  * the leap second should have occurred.
> >  */
> > int8_t leapsecond_direction;
> > uint64_t leapsecond_tai_sec; /* Since 1970-01-01 00:00:00z */
> > 
> > /*
> >  * Paired values of counter and UTC at a given point in time.
> >  */
> > uint64_t counter_value;
> > uint64_t time_sec; /* Since 1970-01-01 00:00:00z */
> 
> Nitpick: The comment is not valid any more for TIME_MONOTONIC.

Ah yes, I "moved" that comment up to the UTC/TAI time_type values, but
neglected to actually delete it from here. Fixed; thanks.




Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

2024-06-27 Thread David Woodhouse

I've updated the tree at
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/vmclock
(but not yet the qemu one).

I think I've taken into account all your comments apart from the one
about non-64-bit counters wrapping. I reduced the seq_count to 32 bit
to make room for a 32-bit flags field, added the time type
(UTC/TAI/MONOTONIC) and a smearing hint, with some straw man
definitions for smearing algorithms for which I could actually find
definitions.

The structure now looks like this:


struct vmclock_abi {
uint32_t magic;
#define VMCLOCK_MAGIC   0x4b4c4356 /* "VCLK" */
uint16_t size;  /* Size of page containing this structure */
uint16_t version;   /* 1 */

/* Sequence lock. Low bit means an update is in progress. */
uint32_t seq_count;

uint32_t flags;
/* Indicates that the tai_offset_sec field is valid */
#define VMCLOCK_FLAG_TAI_OFFSET_VALID   (1 << 0)
/*
 * Optionally used to notify guests of pending maintenance events.
 * A guest may wish to remove itself from service if an event is
 * coming up. Two flags indicate the rough imminence of the event.
 */
#define VMCLOCK_FLAG_DISRUPTION_SOON(1 << 1) /* About a day */
#define VMCLOCK_FLAG_DISRUPTION_IMMINENT(1 << 2) /* About an hour */
/* Indicates that the utc_time_maxerror_picosec field is valid */
#define VMCLOCK_FLAG_UTC_MAXERROR_VALID (1 << 3)
/* Indicates counter_period_error_rate_frac_sec is valid */
#define VMCLOCK_FLAG_PERIOD_ERROR_VALID (1 << 4)

/*
 * This field changes to another non-repeating value when the CPU
 * counter is disrupted, for example on live migration. This lets
 * the guest know that it should discard any calibration it has
 * performed of the counter against external sources (NTP/PTP/etc.).
 */
uint64_t disruption_marker;

uint8_t clock_status;
#define VMCLOCK_STATUS_UNKNOWN  0
#define VMCLOCK_STATUS_INITIALIZING 1
#define VMCLOCK_STATUS_SYNCHRONIZED 2
#define VMCLOCK_STATUS_FREERUNNING  3
#define VMCLOCK_STATUS_UNRELIABLE   4

uint8_t counter_id;
#define VMCLOCK_COUNTER_INVALID 0
#define VMCLOCK_COUNTER_X86_TSC 1
#define VMCLOCK_COUNTER_ARM_VCNT2
#define VMCLOCK_COUNTER_X86_ART 3

/*
 * By providing the offset from UTC to TAI, the guest can know both
 * UTC and TAI reliably, whichever is indicated in the time_type
 * field. Valid if VMCLOCK_FLAG_TAI_OFFSET_VALID is set in flags.
 */
int16_t tai_offset_sec;

/*
 * The time exposed through this device is never smeared; if it
 * claims to be VMCLOCK_TIME_UTC then it MUST be UTC. This field
 * provides a hint to the guest operating system, such that *if*
 * the guest OS wants to provide its users with an alternative
 * clock which does not follow the POSIX CLOCK_REALTIME standard,
 * it may do so in a fashion consistent with the other systems
 * in the nearby environment.
 */
uint8_t leap_second_smearing_hint;
/* Provide true UTC to users, unsmeared. */
#define VMCLOCK_SMEARING_NONE   0
/*
 * https://aws.amazon.com/blogs/aws/look-before-you-leap-the-coming-leap-second-and-aws/
 * From noon on the day before to noon on the day after, smear the
 * clock by a linear 1/86400s per second.
 */
#define VMCLOCK_SMEARING_LINEAR_86400   1
/*
 * draft-kuhn-leapsecond-00
 * For the 1000s leading up to the leap second, smear the clock by
 * a linear 1ms per second.
 */
#define VMCLOCK_SMEARING_UTC_SLS        2
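
/*
 * (Worked examples: for LINEAR_86400 the offset grows by 1/86400 s per
 * second, so six hours into the 24-hour window it is 21600/86400 = 0.25 s;
 * for UTC_SLS it grows by 1 ms per second over the final 1000 s. Either
 * way the full leap second is absorbed by the end of the window.)
 */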

/*
 * What time is exposed in the time_sec/time_frac_sec fields?
 */
uint8_t time_type;
#define VMCLOCK_TIME_UNKNOWN    0   /* Invalid / no time exposed */
#define VMCLOCK_TIME_UTC        1   /* Since 1970-01-01 00:00:00z */
#define VMCLOCK_TIME_TAI        2   /* Since 1970-01-01 00:00:00z */
#define VMCLOCK_TIME_MONOTONIC  3   /* Since undefined epoch */

/* Bit shift for counter_period_frac_sec and its error rate */
uint8_t counter_period_shift;

/*
 * Unlike in NTP, this can indicate a leap second in the past. This
 * is needed to allow guests to derive an imprecise clock with
 * smeared leap seconds for themselves, as some modes of smearing
 * need the adjustments to continue even after the moment at which
 * the leap second should have occurred.
 */
int8_t leapsecond_direction;
uint64_t leapsecond_tai_sec; /* Since 1970-01-01 00:00:00z */

/*
 * Paired values of counter and UTC at a given point in time.
 */
uint64_t 

Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

2024-06-27 Thread David Woodhouse
On Thu, 2024-06-27 at 15:50 +0200, Peter Hilber wrote:
> On 25.06.24 21:01, David Woodhouse wrote:
> > From: David Woodhouse 
> > 
> > The vmclock "device" provides a shared memory region with precision clock
> > information. By using shared memory, it is safe across Live Migration.
> > 
> > Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> > KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> > actually helpful.
> > 
> > The memory region of the device is also exposed to userspace so it can be
> > read or memory mapped by application which need reliable notification of
> > clock disruptions.
> > 
> > Signed-off-by: David Woodhouse 
> > ---
> > 
> > v2: 
> >  • Add gettimex64() support
> >  • Convert TSC values to KVM clock when appropriate
> >  • Require int128 support
> >  • Add counter_period_shift 
> >  • Add timeout when seq_count is invalid
> >  • Add flags field
> >  • Better comments in vmclock ABI structure
> >  • Explicitly forbid smearing (as clock rates would need to change)
> 
> Leap second smearing information could still be conveyed through the
> vmclock_abi. AFAIU, to cover the popular smearing variants, it should be
> enough to indicate whether the driver should apply linear or cosine
> smearing, and the start time and end time.

Yes. The clock information actually conveyed through the {counter,
time, rate} tuple should never be smeared, and should only ever be UTC.

But we could provide a hint to the guest operating system about what
type of smearing to perform, *if* it chooses to offer a clock other
than the standard CLOCK_REALTIME to its users.

I already added a flags field, so this might look something like:

/*
 * Smearing flags. The UTC clock exposed through this structure
 * is only ever true UTC, but a guest operating system may
 * choose to offer a monotonic smeared clock to its users. This
 * merely offers a hint about what kind of smearing to perform,
 * for consistency with systems in the nearby environment.
 */
#define VMCLOCK_FLAGS_SMEAR_UTC_SLS (1<<5) /* draft-kuhn-leapsecond-00.txt */


(UTC-SLS is probably a bad example but are there formal definitions for
anything else?)


> > But we 
> >  drivers/ptp/Kconfig  |  13 +
> >  drivers/ptp/Makefile |   1 +
> >  drivers/ptp/ptp_vmclock.c    | 516 +++
> >  include/uapi/linux/vmclock.h | 138 ++
> >  4 files changed, 668 insertions(+)
> >  create mode 100644 drivers/ptp/ptp_vmclock.c
> >  create mode 100644 include/uapi/linux/vmclock.h
> > 
> 
> [...]
> 
> > +
> > +/*
> > + * Multiply a 64-bit count by a 64-bit tick 'period' in units of seconds 
> > >> 64
> > + * and add the fractional second part of the reference time.
> > + *
> > + * The result is a 128-bit value, the top 64 bits of which are seconds, and
> > + * the low 64 bits are (seconds >> 64).
> > + *
> > + * If __int128 isn't available, perform the calculation 32 bits at a time 
> > to
> > + * avoid overflow.
> > + */
> > +static inline uint64_t mul_u64_u64_shr_add_u64(uint64_t *res_hi, uint64_t delta,
> > +  uint64_t period, uint8_t shift,
> > +  uint64_t frac_sec)
> > +{
> > +   unsigned __int128 res = (unsigned __int128)delta * period;
> > +
> > +   res >>= shift;
> > +   res += frac_sec;
> > +   *res_hi = res >> 64;
> > +   return (uint64_t)res;
> > +}
> > +
> > +static int vmclock_get_crosststamp(struct vmclock_state *st,
> > +  struct ptp_system_timestamp *sts,
> > +  struct system_counterval_t *system_counter,
> > +  struct timespec64 *tspec)
> > +{
> > +   ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT);
> > +   struct system_time_snapshot systime_snapshot;
> > +   uint64_t cycle, delta, seq, frac_sec;
> > +
> > +#ifdef CONFIG_X86
> > +   /*
> > +    * We'd expect the hypervisor to know this and to report the clock
> > +    * status as VMCLOCK_STATUS_UNRELIABLE. But be paranoid.
> > +    */
> > +   if (check_tsc_unstable())
> > +   return -EINVAL;
> > +#endif
> > +
> > +   while (1) {
> > +   seq = st->clk->seq_count & ~1ULL;
> > +   virt

Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

2024-06-26 Thread David Woodhouse
On Tue, 2024-06-25 at 15:22 -0700, John Stultz wrote:
> On Tue, Jun 25, 2024 at 2:48 PM David Woodhouse  wrote:
> > On Tue, 2024-06-25 at 23:34 +0200, Thomas Gleixner wrote:
> > > On Tue, Jun 25 2024 at 20:01, David Woodhouse wrote:
> > > > From: David Woodhouse 
> > > > 
> > > > The vmclock "device" provides a shared memory region with precision 
> > > > clock
> > > > information. By using shared memory, it is safe across Live Migration.
> > > > 
> > > > Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> > > > KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> > > > actually helpful.
> > > > 
> > > > The memory region of the device is also exposed to userspace so it can be
> > > > read or memory mapped by applications which need reliable notification of
> > > > clock disruptions.
> > > 
> > > There is effort underway to expose PTP clocks to user space via VDSO.
> > 
> > Ooh, interesting. Got a reference to that please?
> > 
> > >  Can we please not expose an ad hoc interface for that?
> > 
> > Absolutely. I'm explicitly trying to intercept the virtio-rtc
> > specification here, to *avoid* having to do anything ad hoc.
> > 
> > Note that this is a "vDSO-style" interface from hypervisor to guest via
> > a shared memory region, not necessarily an actual vDSO.
> > 
> > But yes, it *is* intended to be exposed to userspace, so that userspace
> > can know the *accurate* time without a system call, and know that it
> > hasn't been perturbed by live migration.
> 
> Yea, I was going to raise a concern that just defining an mmaped
> structure means it has to trust the guest logic is as expected. It's
> good that it's versioned! :)

Right. Although it's basically a pvclock, and we've had those for ages.

The main difference here is that we add an indicator that tells the
guest that it's been live migrated, so any additional NTP/PTP
refinement that the *guest* has done of its oscillator, should now be
discarded.

It's also designed to be useful in "disruption-only" mode, where the
pvclock information isn't actually populated, so *all* it does is tell
guests that their clock is now hosed due to live migration.

That part is why it needs to be mappable directly to userspace, so that
userspace can not only get a timestamp but *also* know that it's
actually valid. All without a system call.

The critical use cases are financial systems where they incur massive
fines if they submit mis-timestamped transactions, and distributed
databases which rely on accurate timestamps (and error bounds) for
eventual coherence. Live migration can screw those completely.

I'm open to changing fairly much anything about the proposal as long as
we can address those use cases (which the existing virtio-rtc and other
KVM enlightenments do not).

> I'd fret a bit about exposing this to userland. It feels very similar
> to the old powerpc systemcfg implementation that similarly mapped just
> kernel data out to userland and was difficult to maintain as changes
> were made. Would including a code page like a proper vdso make sense
> to make this more flexible of an UABI to maintain?

I think the structure itself should be stable once we've bikeshedded it
a bit. But there is certainly some potential for vDSO functions which
help us expose it to the user...

This structure exposes a 'disruption count' which is updated every time
the TSC/counter is messed with by live migration. But what is userspace
actually going to *compare* it with?

It basically needs to compare it with the disruption count when the
clock was last synchronized, so maybe the kernel could export *that* to
vDSO too, then expose a simple vDSO function which reports whether the
clock is valid?

The 'invalid' code path could turn into an actual system call which
makes the kernel (check for itself and) call ntp_clear() when the
disruption occurs. Or maybe not just ntp_clear() but actually consume
the pvclock rate information directly and apply the *new* calibration?

That kind of thing would be great, and I've definitely tried to design
the structure so that it *can* be made a first-class citizen within the
kernel's timekeeping code and used like that.

But I was going to start with a more modest proposal that it's "just a
device", and applications which care about reliable time after LM would
have to open /dev/vmclock0 and mmap it and check for themselves. (Which
would be assisted by things like the ClockBound library).
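
For illustration, the application side could be as simple as the sketch
below; the device node name is hypothetical, struct vmclock_abi is from
the proposed uapi header, and error handling is omitted:

        #include <fcntl.h>
        #include <stdint.h>
        #include <sys/mman.h>
        #include <linux/vmclock.h>      /* proposed uapi header */

        static volatile struct vmclock_abi *clk;
        static uint64_t marker_at_sync;

        /* Call once after NTP/PTP synchronization. */
        static void vmclock_open_and_sync(void)
        {
                int fd = open("/dev/vmclock0", O_RDONLY);

                clk = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
                marker_at_sync = clk->disruption_marker;
        }

        /* Call before trusting any timestamp; zero means the counter was
         * disrupted (e.g. by live migration) and calibration is stale. */
        static int vmclock_still_valid(void)
        {
                return clk->disruption_marker == marker_at_sync;
        }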






Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

2024-06-25 Thread David Woodhouse
On Tue, 2024-06-25 at 23:34 +0200, Thomas Gleixner wrote:
> On Tue, Jun 25 2024 at 20:01, David Woodhouse wrote:
> > From: David Woodhouse 
> > 
> > The vmclock "device" provides a shared memory region with precision clock
> > information. By using shared memory, it is safe across Live Migration.
> > 
> > Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> > KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> > actually helpful.
> > 
> > The memory region of the device is also exposed to userspace so it can be
> > read or memory mapped by applications which need reliable notification of
> > clock disruptions.
> 
> There is effort underway to expose PTP clocks to user space via VDSO.

Ooh, interesting. Got a reference to that please?

>  Can we please not expose an ad hoc interface for that?

Absolutely. I'm explicitly trying to intercept the virtio-rtc
specification here, to *avoid* having to do anything ad hoc.

Note that this is a "vDSO-style" interface from hypervisor to guest via
a shared memory region, not necessarily an actual vDSO.

But yes, it *is* intended to be exposed to userspace, so that userspace
can know the *accurate* time without a system call, and know that it
hasn't been perturbed by live migration.

> As you might have heard the sad news, I'm not feeling up to the task to
> dig deeper into this right now. Give me a couple of days to look at this
> with working brain.

I have not heard any news, although now I'm making inferences.

Wishing you the best!




[RFC PATCH v2] ptp: Add vDSO-style vmclock support

2024-06-25 Thread David Woodhouse
From: David Woodhouse 

The vmclock "device" provides a shared memory region with precision clock
information. By using shared memory, it is safe across Live Migration.

Like the KVM PTP clock, this can convert TSC-based cross timestamps into
KVM clock values. Unlike the KVM PTP clock, it does so only when such is
actually helpful.

The memory region of the device is also exposed to userspace so it can be
read or memory mapped by applications which need reliable notification of
clock disruptions.

Signed-off-by: David Woodhouse 
---

v2: 
 • Add gettimex64() support
 • Convert TSC values to KVM clock when appropriate
 • Require int128 support
 • Add counter_period_shift 
 • Add timeout when seq_count is invalid
 • Add flags field
 • Better comments in vmclock ABI structure
 • Explicitly forbid smearing (as clock rates would need to change)

 drivers/ptp/Kconfig  |  13 +
 drivers/ptp/Makefile |   1 +
 drivers/ptp/ptp_vmclock.c| 516 +++
 include/uapi/linux/vmclock.h | 138 ++
 4 files changed, 668 insertions(+)
 create mode 100644 drivers/ptp/ptp_vmclock.c
 create mode 100644 include/uapi/linux/vmclock.h

diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index 604541dcb320..e98c9767e0ef 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -131,6 +131,19 @@ config PTP_1588_CLOCK_KVM
  To compile this driver as a module, choose M here: the module
  will be called ptp_kvm.
 
+config PTP_1588_CLOCK_VMCLOCK
+   tristate "Virtual machine PTP clock"
+   depends on X86_TSC || ARM_ARCH_TIMER
+   depends on PTP_1588_CLOCK && ACPI && ARCH_SUPPORTS_INT128
+   default y
+   help
+ This driver adds support for using a virtual precision clock
+ advertised by the hypervisor. This clock is only useful in virtual
+ machines where such a device is present.
+
+ To compile this driver as a module, choose M here: the module
+ will be called ptp_vmclock.
+
 config PTP_1588_CLOCK_IDT82P33
tristate "IDT 82P33xxx PTP clock"
depends on PTP_1588_CLOCK && I2C
diff --git a/drivers/ptp/Makefile b/drivers/ptp/Makefile
index 68bf02078053..01b5cd91eb61 100644
--- a/drivers/ptp/Makefile
+++ b/drivers/ptp/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_PTP_1588_CLOCK_DTE)  += ptp_dte.o
 obj-$(CONFIG_PTP_1588_CLOCK_INES)  += ptp_ines.o
 obj-$(CONFIG_PTP_1588_CLOCK_PCH)   += ptp_pch.o
 obj-$(CONFIG_PTP_1588_CLOCK_KVM)   += ptp_kvm.o
+obj-$(CONFIG_PTP_1588_CLOCK_VMCLOCK)   += ptp_vmclock.o
 obj-$(CONFIG_PTP_1588_CLOCK_QORIQ) += ptp-qoriq.o
 ptp-qoriq-y+= ptp_qoriq.o
 ptp-qoriq-$(CONFIG_DEBUG_FS)   += ptp_qoriq_debugfs.o
diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
new file mode 100644
index ..e19c2eed8009
--- /dev/null
+++ b/drivers/ptp/ptp_vmclock.c
@@ -0,0 +1,516 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Virtual PTP 1588 clock for use with LM-safe VMclock device.
+ *
+ * Copyright © 2024 Amazon.com, Inc. or its affiliates.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#ifdef CONFIG_X86
+#include 
+#include 
+#endif
+
+static DEFINE_IDA(vmclock_ida);
+
+ACPI_MODULE_NAME("vmclock");
+
+struct vmclock_state {
+   phys_addr_t phys_addr;
+   struct vmclock_abi *clk;
+   struct miscdevice miscdev;
+   struct ptp_clock_info ptp_clock_info;
+   struct ptp_clock *ptp_clock;
+   enum clocksource_ids cs_id, sys_cs_id;
+   int index;
+   char *name;
+};
+
+#define VMCLOCK_MAX_WAIT ms_to_ktime(100)
+
+/*
+ * Multiply a 64-bit count by a 64-bit tick 'period' in units of seconds >> 64
+ * and add the fractional second part of the reference time.
+ *
+ * The result is a 128-bit value, the top 64 bits of which are seconds, and
+ * the low 64 bits are (seconds >> 64).
+ *
+ * If __int128 isn't available, perform the calculation 32 bits at a time to
+ * avoid overflow.
+ */
+static inline uint64_t mul_u64_u64_shr_add_u64(uint64_t *res_hi, uint64_t delta,
+  uint64_t period, uint8_t shift,
+  uint64_t frac_sec)
+{
+   unsigned __int128 res = (unsigned __int128)delta * period;
+
+   res >>= shift;
+   res += frac_sec;
+   *res_hi = res >> 64;
+   return (uint64_t)res;
+}
+
+static int vmclock_get_crosststamp(struct vmclock_state *st,
+  struct ptp_system_timestamp *sts,
+  struct system_counterval_t *system_counter,
+  struct timespec64 *tspec)
+{
+   ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT);
+   struct system_time_snapshot systime_snapshot;
+   uint64_t cycle, delta, seq,
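
A sketch of how the helper above is meant to be consumed (not the
verbatim patch text); mul_u64_u64_shr() is the existing helper from
linux/math64.h:

        uint64_t sec_hi;

        frac_sec = mul_u64_u64_shr_add_u64(&sec_hi, delta,
                                           st->clk->counter_period_frac_sec,
                                           st->clk->counter_period_shift,
                                           st->clk->time_frac_sec);
        tspec->tv_sec  = st->clk->time_sec + sec_hi;
        tspec->tv_nsec = mul_u64_u64_shr(frac_sec, NSEC_PER_SEC, 64);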

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-06-21 Thread David Woodhouse
On Thu, 2024-06-20 at 14:37 +0200, Peter Hilber wrote:
> Should implement .gettimex64 instead.

Thanks. This look sane?

As noted in the code comment, in the *ideal* case we just build all
three pre/post/device timestamps from the very same counter read. So
sts->pre_ts == sts->post_ts.

In the less ideal case (which will happen on x86 when kvmclock is being
used for the system time), we use the time from ktime_get_snapshot() as
the pre_ts and take a new snapshot immediately after the get_cycles().


diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
index e8c65405a8f3..07a81a94d29a 100644
--- a/drivers/ptp/ptp_vmclock.c
+++ b/drivers/ptp/ptp_vmclock.c
@@ -96,9 +96,11 @@ static inline uint64_t mul_u64_u64_add_u64(uint64_t *res_hi, 
uint64_t delta,
 }
 
 static int vmclock_get_crosststamp(struct vmclock_state *st,
+  struct ptp_system_timestamp *sts,
   struct system_counterval_t *system_counter,
   struct timespec64 *tspec)
 {
+   struct system_time_snapshot systime_snapshot;
uint64_t cycle, delta, seq, frac_sec;
int ret = 0;
 
@@ -119,7 +121,17 @@ static int vmclock_get_crosststamp(struct vmclock_state 
*st,
continue;
}
 
-   cycle = get_cycles();
+   if (sts) {
+   ktime_get_snapshot(&systime_snapshot);
+
+   if (systime_snapshot.cs_id == st->cs_id) {
+   cycle = systime_snapshot.cycles;
+   } else {
+   cycle = get_cycles();
+   ptp_read_system_postts(sts);
+   }
+   } else
+   cycle = get_cycles();
 
delta = cycle - st->clk->counter_value;
 
@@ -139,6 +151,21 @@ static int vmclock_get_crosststamp(struct vmclock_state 
*st,
if (ret)
return ret;
 
+   /*
+* When invoked for gettimex64, fill in the pre/post system times.
+* The ideal case is when system time is based on the same
+* counter as st->cs_id, in which case all three pre/post/device
+* times are derived from the *same* counter value. If cs_id does
+* not match, then the value from ktime_get_snapshot() is used as
+* pre_ts, and ptp_read_system_postts() was already called above
+* for the post_ts. Those are either side of the get_cycles() call.
+*/
+   if (sts) {
+   sts->pre_ts = ktime_to_timespec64(systime_snapshot.real);
+   if (systime_snapshot.cs_id == st->cs_id)
+   sts->post_ts = sts->pre_ts;
+   }
+
if (system_counter) {
system_counter->cycles = cycle;
system_counter->cs_id = st->cs_id;
@@ -155,7 +182,7 @@ static int ptp_vmclock_get_time_fn(ktime_t *device_time,
struct timespec64 tspec;
int ret;
 
-   ret = vmclock_get_crosststamp(st, system_counter, &tspec);
+   ret = vmclock_get_crosststamp(st, NULL, system_counter, &tspec);
if (!ret)
*device_time = timespec64_to_ktime(tspec);
 
@@ -198,7 +225,16 @@ static int ptp_vmclock_gettime(struct ptp_clock_info *ptp, 
struct timespec64 *ts
struct vmclock_state *st = container_of(ptp, struct vmclock_state,
ptp_clock_info);
 
-   return vmclock_get_crosststamp(st, NULL, ts);
+   return vmclock_get_crosststamp(st, NULL, NULL, ts);
+}
+
+static int ptp_vmclock_gettimex(struct ptp_clock_info *ptp, struct timespec64 *ts,
+   struct ptp_system_timestamp *sts)
+{
+   struct vmclock_state *st = container_of(ptp, struct vmclock_state,
+   ptp_clock_info);
+
+   return vmclock_get_crosststamp(st, sts, NULL, ts);
 }
 
 static int ptp_vmclock_enable(struct ptp_clock_info *ptp,
@@ -216,6 +252,7 @@ static const struct ptp_clock_info ptp_vmclock_info = {
.adjfine= ptp_vmclock_adjfine,
.adjtime= ptp_vmclock_adjtime,
.gettime64  = ptp_vmclock_gettime,
+   .gettimex64 = ptp_vmclock_gettimex,
.settime64  = ptp_vmclock_settime,
.enable = ptp_vmclock_enable,
.getcrosststamp = ptp_vmclock_getcrosststamp,






Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-06-21 Thread David Woodhouse
On Thu, 2024-06-20 at 17:19 +0100, David Woodhouse wrote:
> 
> > 
> > > +
> > > +   /* Counter frequency, and error margin. Units of (second >> 64) */
> > > +   uint64_t counter_period_frac_sec;
> > 
> > AFAIU this might limit the precision in case of high counter frequencies.
> > Could the unit be aligned to the expected frequency band of counters?
> 
> This field indicates the period of a single tick, in units of 1>>64 of
> a second. That's about 5.4e-20 seconds, or 54 zeptoseconds? 
> 
> Can you walk me through a calculation where you believe that level of
> precision is insufficient?
> 
> I guess the precision matters if the structure isn't updated for a long
> period of time, and the delta between the current counter and the
> snapshot is high? That's a *lot* of 54 zeptosecondses? But you really
> would need a *lot* of them before you care? And if nobody's been
> calibrating your counter for that long, surely you have bigger worries?
> 
> Am I missing something there?

Hm, that was a bit rushed at the end of the day; let's take a better look...

Let's take a hypothetical example of a 100GHz counter. That's two
orders of magnitude more than today's Arm arch counter.

The period of such a counter would be 10 picoseconds. 

(Let's ignore the question of how far light actually travels in that
time and how *realistic* that example is, for the moment.)

It turns out that at that rate, there *are* a lot of 54 zeptosecondses
of precision loss in the day. It could be half a millisecond a day, or
20µs an hour.

That particular example of 10 picoseconds is 184467440.7370955
(seconds>>64) which could be truncated to 184467440 — losing about 4PPB
(a third of a millisecond a day; 14µs an hour).

So yeah, I suppose a 'shift' field could make sense. It's easy enough
to consume on the guest side as it doesn't really perturb the 128-bit
multiplication very much; especially if we don't let it be negative.

And implementations *can* just set it to zero. It hurts nobody.

Or were you thinking of just using a fixed shift like (seconds>>80)
instead?
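
Concretely, for that 10ps example the two encodings would be (a sketch,
values truncated as discussed above):

        /* Period of a hypothetical 100GHz counter, i.e. 10 picoseconds. */
        uint64_t period_shift0  = 184467440;         /* seconds >> 64, ~4PPB loss */
        uint64_t period_shift16 = 12089258196146ULL; /* seconds >> 80, ~2e-14 loss */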




Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-06-20 Thread David Woodhouse
On Thu, 2024-06-20 at 14:37 +0200, Peter Hilber wrote:
> Changing virtio-dev address to the new one. The discussion might also be
> relevant for virtio-comment, but it is discouraged to forward it to both.

I will happily take it to whichever forum you think is most
appropriate. (And you have my permission to direct replies to whichever
you choose.)

> On 15.06.24 10:40, David Woodhouse wrote:
> > As discussed before, I don't think it makes sense to design a new high-
> > precision virtual clock which only gets it right *most* of the time. We
> > absolutely need to address the issue of live migration.
 ...
> > Here's a first attempt at defining such a memory structure. For now
> > I've done it as a "vmclock" ACPI device based loosely on vmgenid, but I
> > think it makes most sense for this to be part of the virtio_rtc spec.
> > Ultimately it doesn't matter *how* the guest finds the memory region.
> 
> This looks sensible to me. I also think it would be possible to adapt this for
> the virtio-rtc spec. The proposal also supports some other use cases which are
> not in the virtio-rtc RFC spec v4 [2], notably vDSO-style clock reading, and
> others such as indication of a past leap second.

Right. The vDSO-style clock reading is key to solving the live
migration problem.

The other key thing this adds is the error bounds, which some
applications care deeply about. I've been working with the team that
owns ClockBound on that part: https://github.com/aws/clock-bound

> Compared to the virtio-rtc RFC spec v4 [2], this proposal does not seem to
> support leap second smearing. 

That's kind of intentional. Leap second smearing should be considered
the *antithesis* of precise time. Anyone who wants a monotonic realtime
clock should be using the POSIX CLOCK_TAI.

Part of my motivation for fixing the LM problem is because some
financial institutions can incur significant penalties for putting
inaccurate timestamps on transactions — even the disruption caused by
live migration is enough to trigger that. So deliberately lying to them
about what the UTC time is, by up to a second in either direction, is
not necessarily in their best interest.

As you noted, this proposal does expose leap seconds in the recent
past, which is all that's needed to allow a guest to generate a smeared
clock *from* the accurate clock that is provided through this
mechanism.

(Knowledge of past leap seconds is needed because in some modes,
smearing adjustments continue for some hours *after* the leap second
should have occurred. So the NTP style of leap indicator isn't
sufficient).

> Also, it would be helpful to allow indicating
> when some of the fields are not valid (such as leapsecond_counter_value,
> leapsecond_tai_time, tai_offset_sec, utc_time_maxerror_picosec, ...).

Right. For some of those the answer can just be 'zero means invalid';
for the max error, perhaps MAX_UINT64. But we should definitely make
that explicit.

I'm also not entirely sure I understood Julien's insistence that we
include the leapsecond_counter_value as *well* as the
leapsecond_tai_time. It seems to me that the implementation would have
to recalculate that every time the frequency is adjusted. 

For some of those fields, I was expecting a certain amount of
bikeshedding to occur and figured it was better to post an early straw
man and solicit feedback.

> Do you have plans to contribute this to the Virtio specification and Linux
> driver implementation?

Yes, absolutely. For now I've implemented it in the Linux guest¹ and in
QEMU² as an ACPI device modelled on vmgenid, but I'd love *not* to have
to do that, and just to do it based on virtio instead.

¹ https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/vmclock
² https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock


> > +static const struct ptp_clock_info ptp_vmclock_info = {
> > +   .owner  = THIS_MODULE,
> > +   .max_adj= 0,
> > +   .n_ext_ts   = 0,
> > +   .n_pins = 0,
> > +   .pps= 0,
> > +   .adjfine= ptp_vmclock_adjfine,
> > +   .adjtime= ptp_vmclock_adjtime,
> > +   .gettime64  = ptp_vmclock_gettime,
> 
> Should implement .gettimex64 instead.

Ack, thanks. I'll go play with that.

> 
> > +
> > +   /* Counter frequency, and error margin. Units of (second >> 64) */
> > +   uint64_t counter_period_frac_sec;
> 
> AFAIU this might limit the precision in case of high counter frequencies.
> Could the unit be aligned to the expected frequency band of counters?

This field indicates the period of a single tick, in units of 1>>64 of
a second. That's about 5.4e-20 seconds, or 54 zeptoseconds? 

Can you walk me through a calculation where you believe that level of
precision is insuffi

Re: [RFC PATCH v3 5/7] virtio_rtc: Add PTP clocks

2024-06-20 Thread David Woodhouse
On Thu, 2024-06-20 at 14:01 +0200, Peter Hilber wrote:
> On 15.06.24 10:01, David Woodhouse wrote:
> > On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote:
> > > 
> > > +   ret = viortc_hw_xtstamp_params(&hw_counter, &cs_id);
> > > +   if (ret)
> > > +   return ret;
> > > +
> > > +   ktime_get_snapshot(&history_begin);
> > > +   if (history_begin.cs_id != cs_id)
> > > +   return -EOPNOTSUPP;
> > 
> > I think you have to call ktime_get_snapshot() anyway to get a snapshot
> > from before your crosststamp? But I still don't much like the fact that
> > you need to use it to work out which cs_id is being used.
> 
> The actual cs_id check is in get_device_system_crosststamp(), where it was
> added recently [1]. So this additional check is just verifying that the
> history_begin is usable.
>
> > Shouldn't get_device_system_crosststamp() pass that to its get_time_fn
> > as a hint?
> 
> This is unneeded in this case, since get_device_system_crosststamp() does
> the check already (but the driver is free to pass it through the
> get_time_fn parameter ctx).

The *check* is a different thing.

As things stand, the device has to *choose* a cs_id to use, and takes a
gamble on that check in get_device_system_crosststamp() throwing the
crosststamp away with -ENODEV because the device picked the wrong
cs_id.

That's why I'm saying it would be nicer if the core code *told* the
device what cs_id to use. Rather than just throwing it away if the
device guesses wrong.

(Yes, it would have to be considered a hint, because it could
theoretically have *changed* by the time the result is obtained, just
as with your code above.)

> > 
> > On x86, you are likely to find that history_begin.cs_id is the KVM
> > clock, so this will return -EOPNOTSUPP and userspace will have to fall
> > back to PTP_SYS_OFFSET. I note the KVM PTP clock actually *converts* a
> > TSC-based crosststamp to kvmclock µs for itself, so that it can report 
> > *cs_id = CSID_X86_KVM_CLK. Not sure how I feel about that though. I'm
> > inclined to suggest that it shouldn't, as anyone who wants accurate
> > timekeeping shouldn't be using the KVM clock anyway.
> > 
> > But we should at least be relatively consistent about it.
> 
> ATM, the driver does indeed not have TSC support (for cross-timestamping)
> enabled at all, so would always use fallback. If *not* using the KVM clock,
> I think TSC can just be enabled by adding architecture-specific code
> similar to virtio_rtc_arm.c.
> 
> I am not familiar with the KVM clock, but maybe it would be sufficient to
> allow CSID_X86_KVM_CLK as well?

Sure, that's what the ptp_kvm clock does. It actually obtains a TSC
reading from the "hardware", and then manually (and unconditionally)
converts that to a kvmclock value so that it can return a clock pairing
based on CSID_X86_KVM_CLK.

Which works until the user configures the clocksource to be the TSC
instead of kvmclock, and then hits that -ENODEV check and has to do the
fallback.

We should just tell the device which cs_id to use.






Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-06-15 Thread David Woodhouse
On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote:
> RFC v3 updates
> --
> 
> This series implements a driver for a virtio-rtc device conforming to spec
> RFC v3 [1]. It now includes an RTC class driver with alarm, in addition to
> the PTP clock driver already present before.
> 
> This patch series depends on the patch series "treewide: Use clocksource id
> for get_device_system_crosststamp()" [3]. Pull [4] to get the combined
> series on top of mainline.
> 
> Overview
> 
> 
> This patch series adds the virtio_rtc module, and related bugfixes. The
> virtio_rtc module implements a driver compatible with the proposed Virtio
> RTC device specification [1]. The Virtio RTC (Real Time Clock) device
> provides information about current time. The device can provide different
> clocks, e.g. for the UTC or TAI time standards, or for physical time
> elapsed since some past epoch. The driver can read the clocks with simple
> or more accurate methods. Optionally, the driver can set an alarm.
> 
> The series first fixes some bugs in the get_device_system_crosststamp()
> interpolation code, which is required for reliable virtio_rtc operation.
> Then, add the virtio_rtc implementation.
> 
> For the Virtio RTC device, there is currently a proprietary implementation,
> which has been used for provisional testing.

As discussed before, I don't think it makes sense to design a new high-
precision virtual clock which only gets it right *most* of the time. We
absolutely need to address the issue of live migration.

When live migration occurs, the guest's time precision suffers in two
ways.

First, even when migrating to a supposedly identical host the precise
rate of the underlying counter (TSC, arch counter, etc.) can change
within the tolerances (e.g. ±50PPM) of the hardware. Unlike the natural
changes that NTP normally manages to track, this is a *step* change,
potentially from -50PPM to +50PPM from one host to the next.

Second, the accuracy of the counter as preserved across migration is
limited by the accuracy of each host's NTP synchronization. So there is
also a step change in the value of the counter itself.

At the moment of migration, the guest's timekeeping should be
considered invalid. Any previous cross-timestamps are meaningless, and
blindly using the previously-known relationship between the counter and
real time can lead to problems such as corruption in distributed
databases, fines for mis-timestamped transactions, and other general
unhappiness.

We obviously can't get a new timestamp from the virtio_rtc device every
time an application wants to know the time reliably. We don't even want
*system* calls for that, which is why we have it in a vDSO.

We can take the same approach to warning the guest about clock
disruption due to live migration. A shared memory region from the
virtual clock device even be mapped all the way to userspace, for those
applications which need precise and *reliable* time to check a
'disruption' marker in it, and do whatever is appropriate (e.g. abort
transactions and wait for time to resync) when it happens.

We can do better than just letting the guest know about disruption, of
course. We can put the actual counter→realtime relationship into the
memory region too. As well as being able to provide a PTP driver with
this, the applications which care about *reliable* timestamps can mmap
the page directly and use it, vDSO-style, to have accurate timestamps
even from the first cycle after migration.

When disruption is signalled, the guest needs to throw away any
*additional* refinement that it's done with NTP/PTP/PPS/etc. and revert
to knowing nothing more than what the hypervisor advertises here.

Here's a first attempt at defining such a memory structure. For now
I've done it as a "vmclock" ACPI device based loosely on vmgenid, but I
think it makes most sense for this to be part of the virtio_rtc spec.
Ultimately it doesn't matter *how* the guest finds the memory region.

Very preliminary QEMU hacks at 
https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock
(I still need to do the KVM side helper for actually filling in the
host clock information, converted to the *guest* TSC)

This is the guest side. Having heckled your assumption that we can use
the virt counter on Arm, I concede I'm doing the same thing for now.
The structure itself is at the end, or see
https://git.infradead.org/?p=users/dwmw2/linux.git;a=blob;f=include/uapi/linux/vmclock.h;hb=vmclock

From 9e1c3b823d497efa4e0acb21b226a72e4d6e8a53 Mon Sep 17 00:00:00 2001
From: David Woodhouse 
Date: Mon, 10 Jun 2024 15:10:11 +0100
Subject: [PATCH] ptp: Add vDSO-style vmclock support

The vmclock "device" provides a shared memory region with precision clock
information. By using shared memory, it is safe across Live Migration.

Signed-off-by: David Woodhouse 
--

Re: [RFC PATCH v3 5/7] virtio_rtc: Add PTP clocks

2024-06-15 Thread David Woodhouse
On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote:
> 
> +   ret = viortc_hw_xtstamp_params(&hw_counter, &cs_id);
> +   if (ret)
> +   return ret;
> +
> +   ktime_get_snapshot(&history_begin);
> +   if (history_begin.cs_id != cs_id)
> +   return -EOPNOTSUPP;

I think you have to call ktime_get_snapshot() anyway to get a snapshot
from before your crosststamp? But I still don't much like the fact that
you need to use it to work out which cs_id is being used.

Shouldn't get_device_system_crosststamp() pass that to its get_time_fn
as a hint?

On x86, you are likely to find that history_begin.cs_id is the KVM
clock, so this will return -EOPNOTSUPP and userspace will have to fall
back to PTP_SYS_OFFSET. I note the KVM PTP clock actually *converts* a
TSC-based crosststamp to kvmclock µs for itself, so that it can report 
*cs_id = CSID_X86_KVM_CLK. Not sure how I feel about that though. I'm
inclined to suggest that it shouldn't, as anyone who wants accurate
timekeeping shouldn't be using the KVM clock anyway.

But we should at least be relatively consistent about it.




Re: [RFC PATCH v3 6/7] virtio_rtc: Add Arm Generic Timer cross-timestamping

2024-06-15 Thread David Woodhouse
On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote:
> 
> +int viortc_hw_xtstamp_params(u16 *hw_counter, enum clocksource_ids *cs_id)
> +{
> +   *hw_counter = VIRTIO_RTC_COUNTER_ARM_VIRT;

Hm, but what if it isn't? I think you need to put this in
drivers/clocksource/arm_arch_timer.c where it can do something like
kvm_arch_ptp_get_crosststamp() does to decide:

if (arch_timer_uses_ppi == ARCH_TIMER_VIRT_PPI)
ptp_counter = KVM_PTP_VIRT_COUNTER;
else
ptp_counter = KVM_PTP_PHYS_COUNTER;


> +   *cs_id = CSID_ARM_ARCH_COUNTER;
> +
> +   return 0;
> +}





Re: [RFC v3 net-next 1/7] net: add rx_sk to trace_kfree_skb

2024-06-07 Thread David Ahern
On 6/6/24 9:37 AM, Yan Zhai wrote:
> # cat /sys/kernel/debug/tracing/events/skb/kfree_skb/format
> name: kfree_skb
> ID: 2260
> format:
> field:unsigned short common_type;   offset:0;
> size:2; signed:0;
> field:unsigned char common_flags;   offset:2;
> size:1; signed:0;
> field:unsigned char common_preempt_count;   offset:3;
>  size:1; signed:0;
> field:int common_pid;   offset:4;   size:4; signed:1;
> 
> field:void * skbaddr;   offset:8;   size:8; signed:0;
> field:void * location;  offset:16;  size:8; signed:0;
> field:unsigned short protocol;  offset:24;  size:2; signed:0;
> field:enum skb_drop_reason reason;  offset:28;
> size:4; signed:0;
> field:void * rx_skaddr; offset:32;  size:8; signed:0;
> 
> Do you think we still need to change the order?
> 

yes. Keeping proper order now ensures no holes creep in with later changes.
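
For anyone wondering what "holes" means here, a quick userspace illustration
of the general rule (nothing to do with the trace event itself): interleaving
small and large members forces padding, while ordering members by size does
not.

#include <stdio.h>

struct interleaved {
	short a;	/* 0..2, then 6 bytes of padding */
	long  b;	/* 8..16 */
	short c;	/* 16..18, then 6 bytes of padding */
	long  d;	/* 24..32 */
};

struct sorted {
	long  b;	/* 0..8 */
	long  d;	/* 8..16 */
	short a;	/* 16..18 */
	short c;	/* 18..20, plus 4 bytes of tail padding */
};

int main(void)
{
	printf("interleaved: %zu bytes\n", sizeof(struct interleaved)); /* 32 on LP64 */
	printf("sorted:      %zu bytes\n", sizeof(struct sorted));      /* 24 on LP64 */
	return 0;
}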




Re: [PATCH RFC 0/4] static key support for error injection functions

2024-06-02 Thread David Rientjes
On Fri, 31 May 2024, Vlastimil Babka wrote:

> Patches 3 and 4 implement the static keys for the two mm fault injection
> sites in slab and page allocators. For a quick demonstration I've run a
> VM and the simple test from [1] that stresses the slab allocator and got
> this time before the series:
> 
> real0m8.349s
> user0m0.694s
> sys 0m7.648s
> 
> with perf showing
> 
>0.61%  nonexistent  [kernel.kallsyms]  [k] should_failslab.constprop.0
>0.00%  nonexistent  [kernel.kallsyms]  [k] should_fail_alloc_page
> 
> And after the series
> 
> real0m7.924s
> user0m0.727s
> sys 0m7.191s
> 
> and the functions gone from perf report.
> 

Impressive results that will no doubt be a win for kernels that enable 
these options.

Both CONFIG_FAILSLAB and CONFIG_FAIL_PAGE_ALLOC go out of their way to 
have no overhead, both in performance and kernel text overhead, when the 
.config options are disabled.

Do we have any insight into the distros or users that enable either of 
these options and are expecting optimal performance?  I would have assumed 
that while CONFIG_FAULT_INJECTION may be enabled that any users who would 
care deeply about this would have disabled both of these debug options.
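
For reference, the mechanism the series uses boils down to hiding the
out-of-line check behind a key that is flipped only when a fault attribute is
actually configured. A minimal sketch (key and helper names made up, not the
series' actual identifiers):

#include <linux/jump_label.h>

DEFINE_STATIC_KEY_FALSE(failslab_key);		/* hypothetical key */

bool __should_failslab(struct kmem_cache *s, gfp_t gfpflags);	/* out of line */

static __always_inline bool should_failslab(struct kmem_cache *s, gfp_t gfpflags)
{
	/* Compiles to a NOP in the hot path until the key is enabled. */
	if (static_branch_unlikely(&failslab_key))
		return __should_failslab(s, gfpflags);
	return false;
}

The fault-injection configuration path would then arm it with
static_branch_enable(&failslab_key) and disarm it again with
static_branch_disable() once no faults are requested.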

> There might be other such fault injection callsites in hotpaths of other
> subsystems but I didn't search for them at this point.
> 
> [1] https://lore.kernel.org/all/6d5bb852-8703-4abf-a52b-90816bccb...@suse.cz/
> [2] 
> https://lore.kernel.org/all/3j5d3p22ssv7xoaghzraa7crcfih3h2qqjlhmjppbp6f42pg2t@kg7qoicog5ye/
> 
> Signed-off-by: Vlastimil Babka 
> ---
> Vlastimil Babka (4):
>   fault-inject: add support for static keys around fault injection sites
>   error-injection: support static keys around injectable functions
>   mm, slab: add static key for should_failslab()
>   mm, page_alloc: add static key for should_fail_alloc_page()
> 
>  include/asm-generic/error-injection.h | 13 ++-
>  include/asm-generic/vmlinux.lds.h |  2 +-
>  include/linux/error-injection.h   |  9 +---
>  include/linux/fault-inject.h  |  7 +-
>  kernel/fail_function.c| 22 +++---
>  lib/error-inject.c|  6 -
>  lib/fault-inject.c| 43 
> ++-
>  mm/fail_page_alloc.c  |  3 ++-
>  mm/failslab.c |  2 +-
>  mm/internal.h |  2 ++
>  mm/page_alloc.c   | 11 ++---
>  mm/slab.h |  3 +++
>  mm/slub.c | 10 +---
>  13 files changed, 114 insertions(+), 19 deletions(-)
> ---
> base-commit: 1613e604df0cd359cf2a7fbd9be7a0bcfacfabd0
> change-id: 20240530-fault-injection-statickeys-66b7222e91b7
> 
> Best regards,
> -- 
> Vlastimil Babka 
> 
> 

[PATCH] arm64: dts: qcom: sm8550-samsung-q5q: fix typo

2024-05-31 Thread David Wronek
It looks like "cdsp_mem" was pasted in the license header by accident.
Fix the typo by removing it.

Signed-off-by: David Wronek 
---
 arch/arm64/boot/dts/qcom/sm8550-samsung-q5q.dts | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/boot/dts/qcom/sm8550-samsung-q5q.dts 
b/arch/arm64/boot/dts/qcom/sm8550-samsung-q5q.dts
index 4654ae1364ba..3d351e90bb39 100644
--- a/arch/arm64/boot/dts/qcom/sm8550-samsung-q5q.dts
+++ b/arch/arm64/boot/dts/qcom/sm8550-samsung-q5q.dts
@@ -1,4 +1,4 @@
-// SPDX-License-cdsp_memIdentifier: BSD-3-Clause
+// SPDX-License-Identifier: BSD-3-Clause
 /*
  * Copyright (c) 2024, Alexandru Marc Serdeliuc 
  * Copyright (c) 2024, David Wronek 

---
base-commit: 0e1980c40b6edfa68b6acf926bab22448a6e40c9
change-id: 20240531-fix-typo-q5q-5d34423b7bb4

Best regards,
-- 
David Wronek 




Re: [linus:master] [mm] d99e3140a4: BUG:KCSAN:data-race_in_folio_remove_rmap_ptes/print_report

2024-05-28 Thread David Hildenbrand

Am 28.05.24 um 09:11 schrieb kernel test robot:



Hello,

kernel test robot noticed 
"BUG:KCSAN:data-race_in_folio_remove_rmap_ptes/print_report" on:

commit: d99e3140a4d33e26066183ff727d8f02f56bec64 ("mm: turn folio_test_hugetlb into 
a PageType")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

[test failed on linus/master  c760b3725e52403dc1b28644fb09c47a83cacea6]
[test failed on linux-next/master 3689b0ef08b70e4e03b82ebd37730a03a672853a]

in testcase: trinity
version: trinity-i386-abe9de86-1_20230429
with following parameters:

runtime: 300s
group: group-04
nr_groups: 5



compiler: gcc-13
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)


We noticed this issue does not always happen. We also noticed different
random KCSAN issues for both this commit and its parent, but the four below
happen only on this commit at a non-trivial rate and stay clean on the parent.



Likely that's just a page_type check racing against concurrent
mapcount changes.

In __folio_rmap_sanity_checks() we check
VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);

To make sure we don't get hugetlb folios in the wrong rmap code path. That
can easily race with concurrent mapcount changes, just like any other
page_type checks that end up in folio_test_type/page_has_type e.g., from
PFN walkers.

Load tearing in these functions shouldn't really result in false positives
(what we care about), but READ_ONCE shouldn't hurt or make a difference.


From b03dc9bf27571442d886d8da624a4e4f737433f2 Mon Sep 17 00:00:00 2001
From: David Hildenbrand 
Date: Tue, 28 May 2024 09:37:20 +0200
Subject: [PATCH] mm: read page_type using READ_ONCE

KCSAN complains about possible data races: while we check for a
page_type -- for example for sanity checks -- we might concurrently
modify the mapcount that overlays page_type.

Let's use READ_ONCE to avoid laod tearing (shouldn't make a difference)
and to make KCSAN happy.

Note: nothing should really be broken besides wrong KCSAN complaints.

Reported-by: kernel test robot 
Closes: https://lore.kernel.org/oe-lkp/202405281431.c46a3be9-...@intel.com
Signed-off-by: David Hildenbrand 
---
 include/linux/page-flags.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 104078afe0b1..e46ccbb9aa58 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -955,9 +955,9 @@ PAGEFLAG_FALSE(HasHWPoisoned, has_hwpoisoned)
 #define PG_slab0x1000
 
 #define PageType(page, flag)		\

-   ((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
+   ((READ_ONCE(page->page_type) & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
 #define folio_test_type(folio, flag)   \
-   ((folio->page.page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
+   ((READ_ONCE(folio->page.page_type) & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
 
 static inline int page_type_has_type(unsigned int page_type)

 {
@@ -966,7 +966,7 @@ static inline int page_type_has_type(unsigned int page_type)
 
 static inline int page_has_type(const struct page *page)

 {
-   return page_type_has_type(page->page_type);
+   return page_type_has_type(READ_ONCE(page->page_type));
 }
 
 #define FOLIO_TYPE_OPS(lname, fname)	\

--
2.45.1


--
Thanks,

David / dhildenb




Re: [PATCH] virt: acrn: Remove unusted list 'acrn_irqfd_clients'

2024-05-17 Thread Dr. David Alan Gilbert
* li...@treblig.org (li...@treblig.org) wrote:
> From: "Dr. David Alan Gilbert" 
> 
> It doesn't look like this was ever used.
> 
> Build tested only.
> 
> Signed-off-by: Dr. David Alan Gilbert 

Ping

> ---
>  drivers/virt/acrn/irqfd.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/drivers/virt/acrn/irqfd.c b/drivers/virt/acrn/irqfd.c
> index d4ad211dce7a3..346cf0be4aac7 100644
> --- a/drivers/virt/acrn/irqfd.c
> +++ b/drivers/virt/acrn/irqfd.c
> @@ -16,8 +16,6 @@
>  
>  #include "acrn_drv.h"
>  
> -static LIST_HEAD(acrn_irqfd_clients);
> -
>  /**
>   * struct hsm_irqfd - Properties of HSM irqfd
>   * @vm:  Associated VM pointer
> -- 
> 2.45.0
> 
-- 
 -Open up your eyes, open up your mind, open up your code ---   
/ Dr. David Alan Gilbert|   Running GNU/Linux   | Happy  \ 
\dave @ treblig.org |   | In Hex /
 \ _|_ http://www.treblig.org   |___/



Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-05-16 Thread David Hildenbrand

On 16.05.24 19:44, Guillaume Morin wrote:

On 02 May  5:59, Guillaume Morin wrote:


On 30 Apr 21:25, David Hildenbrand wrote:

I tried to get the hugepd stuff right but this was the first I heard
about it :-) Afaict follow_huge_pmd and friends were already DTRT


I'll have to have a closer look at some details (the hugepd writability
check looks a bit odd), but it's mostly what I would have expected!


Ok in the meantime, here is the uprobe change on your current
uprobes_cow trying to address the comments you made in your previous
message. Some of them were not 100% clear to me, so it's a best effort
patch :-) Again lightly tested


David, have you had a chance to take a look at both patches?


Not in detail, last weeks were busy (currently traveling back home from 
LSF/MM). I'll try to find time within the next two weeks to polish my 
changes and send them out. It would be great if you could send your 
stuff based on top of that then.


(the merge window just opened on Saturday, so we have plenty of time to 
make it to the next one :) )


--
Cheers,

David / dhildenb




Re: [PATCH v2 6/8] riscv: Enable memory hotplugging for RISC-V

2024-05-14 Thread David Hildenbrand

On 14.05.24 20:17, Björn Töpel wrote:

Alexandre Ghiti  writes:


On Tue, May 14, 2024 at 4:05 PM Björn Töpel  wrote:


From: Björn Töpel 

Enable ARCH_ENABLE_MEMORY_HOTPLUG and ARCH_ENABLE_MEMORY_HOTREMOVE for
RISC-V.

Signed-off-by: Björn Töpel 
---
  arch/riscv/Kconfig | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 6bec1bce6586..b9398b64bb69 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -16,6 +16,8 @@ config RISCV
 select ACPI_REDUCED_HARDWARE_ONLY if ACPI
 select ARCH_DMA_DEFAULT_COHERENT
 select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
+   select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM && 64BIT && MMU


I think this should be SPARSEMEM_VMEMMAP here.


Hmm, care to elaborate? I thought that was optional.


There was a discussion at LSF/MM today to maybe require 
SPARSEMEM_VMEMMAP for hotplug. Would that work here as well?


--
Cheers,

David / dhildenb




Re: [PATCH v2 5/8] riscv: mm: Take memory hotplug read-lock during kernel page table dump

2024-05-14 Thread David Hildenbrand

On 14.05.24 16:04, Björn Töpel wrote:

From: Björn Töpel 

During memory hot remove, the ptdump functionality can end up touching
stale data. Avoid any potential crashes (or worse), by holding the
memory hotplug read-lock while traversing the page table.

This change is analogous to arm64's commit bf2b59f60ee1 ("arm64/mm:
Hold memory hotplug lock while walking for kernel page table dump").

Signed-off-by: Björn Töpel 
---
  arch/riscv/mm/ptdump.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/arch/riscv/mm/ptdump.c b/arch/riscv/mm/ptdump.c
index 1289cc6d3700..9d5f657a251b 100644
--- a/arch/riscv/mm/ptdump.c
+++ b/arch/riscv/mm/ptdump.c
@@ -6,6 +6,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  
@@ -370,7 +371,9 @@ bool ptdump_check_wx(void)
  
  static int ptdump_show(struct seq_file *m, void *v)

  {
+   get_online_mems();
ptdump_walk(m, m->private);
+   put_online_mems();
  
  	return 0;

  }


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v2 4/8] riscv: mm: Add memory hotplugging support

2024-05-14 Thread David Hildenbrand

On 14.05.24 16:04, Björn Töpel wrote:

From: Björn Töpel 

For an architecture to support memory hotplugging, a couple of
callbacks needs to be implemented:

  arch_add_memory()
   This callback is responsible for adding the physical memory into the
   direct map, and call into the memory hotplugging generic code via
   __add_pages() that adds the corresponding struct page entries, and
   updates the vmemmap mapping.

  arch_remove_memory()
   This is the inverse of the callback above.

  vmemmap_free()
   This function tears down the vmemmap mappings (if
   CONFIG_SPARSEMEM_VMEMMAP is enabled), and also deallocates the
   backing vmemmap pages. Note that for persistent memory, an
   alternative allocator for the backing pages can be used; The
   vmem_altmap. This means that when the backing pages are cleared,
   extra care is needed so that the correct deallocation method is
   used.

  arch_get_mappable_range()
   This functions returns the PA range that the direct map can map.
   Used by the MHP internals for sanity checks.

The page table unmap/teardown functions are heavily based on code from
the x86 tree. The same remove_pgd_mapping() function is used in both
vmemmap_free() and arch_remove_memory(), but in the latter function
the backing pages are not removed.
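
In other words, the general shape is roughly the following (the direct-map
helper names and their signatures are assumptions standing in for the patch's
actual code):

int __ref arch_add_memory(int nid, u64 start, u64 size,
			  struct mhp_params *params)
{
	int ret;

	/* Map the new range into the direct map (assumed helper). */
	create_linear_mapping_range(start, start + size, &params->pgprot);

	ret = __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT, params);
	if (ret)
		remove_linear_mapping(start, size);	/* assumed helper */
	return ret;
}

void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
{
	__remove_pages(start >> PAGE_SHIFT, size >> PAGE_SHIFT, altmap);
	/* Tear down the direct map, but keep the backing pages. */
	remove_linear_mapping(start, size);
}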

Signed-off-by: Björn Töpel 
---
  arch/riscv/mm/init.c | 242 +++
  1 file changed, 242 insertions(+)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 6f72b0b2b854..7f0b921a3d3a 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -1493,3 +1493,245 @@ void __init pgtable_cache_init(void)
}
  }
  #endif
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
+{
+   pte_t *pte;
+   int i;
+
+   for (i = 0; i < PTRS_PER_PTE; i++) {
+   pte = pte_start + i;
+   if (!pte_none(*pte))
+   return;
+   }
+
+   free_pages((unsigned long)page_address(pmd_page(*pmd)), 0);
+   pmd_clear(pmd);
+}
+
+static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
+{
+   pmd_t *pmd;
+   int i;
+
+   for (i = 0; i < PTRS_PER_PMD; i++) {
+   pmd = pmd_start + i;
+   if (!pmd_none(*pmd))
+   return;
+   }
+
+   free_pages((unsigned long)page_address(pud_page(*pud)), 0);
+   pud_clear(pud);
+}
+
+static void __meminit free_pud_table(pud_t *pud_start, p4d_t *p4d)
+{
+   pud_t *pud;
+   int i;
+
+   for (i = 0; i < PTRS_PER_PUD; i++) {
+   pud = pud_start + i;
+   if (!pud_none(*pud))
+   return;
+   }
+
+   free_pages((unsigned long)page_address(p4d_page(*p4d)), 0);
+   p4d_clear(p4d);
+}
+
+static void __meminit free_vmemmap_storage(struct page *page, size_t size,
+  struct vmem_altmap *altmap)
+{
+   if (altmap)
+   vmem_altmap_free(altmap, size >> PAGE_SHIFT);
+   else
+   free_pages((unsigned long)page_address(page), get_order(size));


If you unplug a DIMM that was added during boot (can happen on x86-64, 
can it happen on riscv?), free_pages() would not be sufficient. You'd be 
freeing a PG_reserved page that has to be freed differently.


--
Cheers,

David / dhildenb




Re: [PATCH v2 3/8] riscv: mm: Refactor create_linear_mapping_range() for memory hot add

2024-05-14 Thread David Hildenbrand

On 14.05.24 16:04, Björn Töpel wrote:

From: Björn Töpel 

Add a parameter to the direct map setup function, so it can be used in
arch_add_memory() later.

Signed-off-by: Björn Töpel 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v2 2/8] riscv: mm: Change attribute from __init to __meminit for page functions

2024-05-14 Thread David Hildenbrand

On 14.05.24 16:04, Björn Töpel wrote:

From: Björn Töpel 

Prepare for memory hotplugging support by changing from __init to
__meminit for the page table functions that are used by the upcoming
architecture specific callbacks.

Changing the __init attribute to __meminit, avoids that the functions
are removed after init. The __meminit attribute makes sure the
functions are kept in the kernel text post init, but only if memory
hotplugging is enabled for the build.

Also, make sure that the altmap parameter is properly passed on to
vmemmap_populate_hugepages().
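
In effect (sketch only, not part of the patch):

static void __init boot_only(void)
{
	/* lives in .init.text, freed once boot completes */
}

static void __meminit kept_for_hotplug(void)
{
	/*
	 * With CONFIG_MEMORY_HOTPLUG=y this stays in regular .text so the
	 * hotplug callbacks can reach it at runtime; without it, __meminit
	 * decays to __init and the code is freed as before.
	 */
}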

Signed-off-by: Björn Töpel 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v2 7/8] virtio-mem: Enable virtio-mem for RISC-V

2024-05-14 Thread David Hildenbrand

On 14.05.24 16:04, Björn Töpel wrote:

From: Björn Töpel 

Now that RISC-V has memory hotplugging support, virtio-mem can be used
on the platform.

Signed-off-by: Björn Töpel 
---
  drivers/virtio/Kconfig | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index c17193544268..4e5cebf1b82a 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -122,7 +122,7 @@ config VIRTIO_BALLOON
  
  config VIRTIO_MEM

tristate "Virtio mem driver"
-   depends on X86_64 || ARM64
+   depends on X86_64 || ARM64 || RISCV
depends on VIRTIO
depends on MEMORY_HOTPLUG
depends on MEMORY_HOTREMOVE



Nice!

Acked-by: David Hildenbrand 
--
Cheers,

David / dhildenb




Re: [PATCH v23 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-05-10 Thread David Hildenbrand
+   cpu_buffer->subbuf_ids = subbuf_ids;
+
+   meta->meta_page_size = PAGE_SIZE;
+   meta->meta_struct_len = sizeof(*meta);
+   meta->nr_subbufs = nr_subbufs;
+   meta->subbuf_size = cpu_buffer->buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
+
+   rb_update_meta_page(cpu_buffer);
+}
+
+static struct ring_buffer_per_cpu *
+rb_get_mapped_buffer(struct trace_buffer *buffer, int cpu)
+{
+   struct ring_buffer_per_cpu *cpu_buffer;
+
+   if (!cpumask_test_cpu(cpu, buffer->cpumask))
+   return ERR_PTR(-EINVAL);
+
+   cpu_buffer = buffer->buffers[cpu];
+
+   mutex_lock(&cpu_buffer->mapping_lock);
+
+   if (!cpu_buffer->mapped) {
+   mutex_unlock(&cpu_buffer->mapping_lock);
+   return ERR_PTR(-ENODEV);
+   }
+
+   return cpu_buffer;
+}
+
+static void rb_put_mapped_buffer(struct ring_buffer_per_cpu *cpu_buffer)
+{
+   mutex_unlock(&cpu_buffer->mapping_lock);
+}
+
+/*
+ * Fast-path for rb_buffer_(un)map(). Called whenever the meta-page doesn't 
need
+ * to be set-up or torn-down.
+ */
+static int __rb_inc_dec_mapped(struct ring_buffer_per_cpu *cpu_buffer,
+  bool inc)
+{
+   unsigned long flags;
+
+   lockdep_assert_held(&cpu_buffer->mapping_lock);
+
+   if (inc && cpu_buffer->mapped == UINT_MAX)
+   return -EBUSY;
+
+   if (WARN_ON(!inc && cpu_buffer->mapped == 0))
+   return -EINVAL;
+
+   mutex_lock(&cpu_buffer->buffer->mutex);
+   raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
+
+   if (inc)
+   cpu_buffer->mapped++;
+   else
+   cpu_buffer->mapped--;
+
+   raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
+   mutex_unlock(&cpu_buffer->buffer->mutex);
+
+   return 0;
+}
+
+/*
+ *   +--+  pgoff == 0
+ *   |   meta page  |
+ *   +--+  pgoff == 1
+ *   | subbuffer 0  |
+ *   |  |
+ *   +--+  pgoff == (1 + (1 << subbuf_order))
+ *   | subbuffer 1  |
+ *   |  |
+ * ...
+ */
+#ifdef CONFIG_MMU
+static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
+   struct vm_area_struct *vma)
+{
+   unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff;
+   unsigned int subbuf_pages, subbuf_order;
+   struct page **pages;
+   int p = 0, s = 0;
+   int err;
+
+   /* Refuse MP_PRIVATE or writable mappings */
+   if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC ||
+   !(vma->vm_flags & VM_MAYSHARE))
+   return -EPERM;
+
+   /*
+* Make sure the mapping cannot become writable later. Also tell the VM
+* to not touch these pages (VM_DONTCOPY | VM_DONTEXPAND).


Comment a bit outdated.

Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v22 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-05-10 Thread David Hildenbrand

On 08.05.24 04:34, Steven Rostedt wrote:

On Tue, 30 Apr 2024 12:13:51 +0100
Vincent Donnefort  wrote:


+#ifdef CONFIG_MMU
+static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
+   struct vm_area_struct *vma)
+{
+   unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff;
+   unsigned int subbuf_pages, subbuf_order;
+   struct page **pages;
+   int p = 0, s = 0;
+   int err;
+
+   /* Refuse MP_PRIVATE or writable mappings */
+   if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC ||
+   !(vma->vm_flags & VM_MAYSHARE))
+   return -EPERM;
+
+   /*
+* Make sure the mapping cannot become writable later. Also tell the VM
+* to not touch these pages (VM_DONTCOPY | VM_DONTEXPAND). Finally,
+* prevent migration, GUP and dump (VM_IO).
+*/
+   vm_flags_mod(vma, VM_DONTCOPY | VM_DONTEXPAND | VM_IO, VM_MAYWRITE);


Do we really need the VM_IO?

When testing this in gdb, I would get:

(gdb) p tmap->map->subbuf_size
Cannot access memory at address 0x77fc2008

It appears that you can't ptrace IO memory. When I removed that flag,
gdb has no problem reading that memory.

I think we should drop that flag.

Can you send a v23 with that removed, Shuah's update, and also the
change below:


+
+   lockdep_assert_held(_buffer->mapping_lock);
+
+   subbuf_order = cpu_buffer->buffer->subbuf_order;
+   subbuf_pages = 1 << subbuf_order;
+
+   nr_subbufs = cpu_buffer->nr_pages + 1; /* + reader-subbuf */
+   nr_pages = ((nr_subbufs) << subbuf_order) - pgoff + 1; /* + meta-page */
+
+   vma_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+   if (!vma_pages || vma_pages > nr_pages)
+   return -EINVAL;
+
+   nr_pages = vma_pages;
+
+   pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
+   if (!pages)
+   return -ENOMEM;
+
+   if (!pgoff) {
+   pages[p++] = virt_to_page(cpu_buffer->meta_page);
+
+   /*
+* TODO: Align sub-buffers on their size, once
+* vm_insert_pages() supports the zero-page.
+*/
+   } else {
+   /* Skip the meta-page */
+   pgoff--;
+
+   if (pgoff % subbuf_pages) {
+   err = -EINVAL;
+   goto out;
+   }
+
+   s += pgoff / subbuf_pages;
+   }
+
+   while (s < nr_subbufs && p < nr_pages) {
+   struct page *page = virt_to_page(cpu_buffer->subbuf_ids[s]);
+   int off = 0;
+
+   for (; off < (1 << (subbuf_order)); off++, page++) {
+   if (p >= nr_pages)
+   break;
+
+   pages[p++] = page;
+   }
+   s++;
+   }


The above can be made to:

while (p < nr_pages) {
struct page *page;
int off = 0;

if (WARN_ON_ONCE(s >= nr_subbufs))
break;


I'm not particularly happy about us calling vm_insert_pages with NULL 
pointers stored in pages.


Should we instead do

if (WARN_ON_ONCE(s >= nr_subbufs)) {
err = -EINVAL;
goto out;
}

?
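
Putting the two suggestions together, the loop might end up looking like this
(a sketch combining Steve's restructure with the error return above):

	while (p < nr_pages) {
		struct page *page;
		int off = 0;

		if (WARN_ON_ONCE(s >= nr_subbufs)) {
			err = -EINVAL;
			goto out;
		}

		page = virt_to_page(cpu_buffer->subbuf_ids[s++]);

		for (; off < (1 << subbuf_order) && p < nr_pages; off++, page++)
			pages[p++] = page;
	}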

--
Cheers,

David / dhildenb




Re: [PATCH v22 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-05-10 Thread David Hildenbrand

On 09.05.24 13:05, Vincent Donnefort wrote:

On Tue, May 07, 2024 at 10:34:02PM -0400, Steven Rostedt wrote:

On Tue, 30 Apr 2024 12:13:51 +0100
Vincent Donnefort  wrote:


+#ifdef CONFIG_MMU
+static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
+   struct vm_area_struct *vma)
+{
+   unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff;
+   unsigned int subbuf_pages, subbuf_order;
+   struct page **pages;
+   int p = 0, s = 0;
+   int err;
+
+   /* Refuse MP_PRIVATE or writable mappings */
+   if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC ||
+   !(vma->vm_flags & VM_MAYSHARE))
+   return -EPERM;
+
+   /*
+* Make sure the mapping cannot become writable later. Also tell the VM
+* to not touch these pages (VM_DONTCOPY | VM_DONTEXPAND). Finally,
+* prevent migration, GUP and dump (VM_IO).
+*/
+   vm_flags_mod(vma, VM_DONTCOPY | VM_DONTEXPAND | VM_IO, VM_MAYWRITE);


Do we really need the VM_IO?

When testing this in gdb, I would get:

(gdb) p tmap->map->subbuf_size
Cannot access memory at address 0x77fc2008

It appears that you can't ptrace IO memory. When I removed that flag,
gdb has no problem reading that memory.


Yeah, VM_IO indeed implies DONTDUMP. VM_IO was part of Linus recommendations.


Yes, the VM should recognize that memory to some degree as being special 
already due to VM_MIXEDMAP and VM_DONTEXPAND.


#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP | VM_MIXEDMAP)

So any of these flag achieve that (e.g., mlock_fixup() checks 
VM_SPECIAL). KSM similarly skips VM_DONTEXPAND and VM_MIXEDMAP (likely 
we should be using VM_SPECIAL in vma_ksm_compatible()). Not sure about 
page migration, likely its fine.


Thinking about MADV_DONTNEED, I can spot in 
madvise_dontneed_free_valid_vma() only that we disallow primarily VM_PFNMAP.


... I assume if user space MADV_DONTNEED's some pages we'll simply get a 
page fault later on access that will SIGBUS, handling that gracefully 
(we should double-check!).




But perhaps, VM_DONTEXPAND and MIXEDMAP (implicitely set by vm_insert_pages) are
enough protection?


Do we want to dump these pages? VM_DONTDUMP might be reasonabe then.



I don't see how anything could use GUP there and as David pointed-out on the
previous version, it doesn't event prevent the GUP-fast path.


Yes, GUP-fast would still have worked under some conditions.

--
Cheers,

David / dhildenb




Re: [PATCH] ftrace: Remove unused global 'ftrace_direct_func_count'

2024-05-06 Thread Dr. David Alan Gilbert
* li...@treblig.org (li...@treblig.org) wrote:
> From: "Dr. David Alan Gilbert" 
> 
> Commit 8788ca164eb4b ("ftrace: Remove the legacy _ftrace_direct API")
> stopped setting the 'ftrace_direct_func_count' variable, but left
> it around.  Clean it up.
> 
> Signed-off-by: Dr. David Alan Gilbert 

FYI this is on top of my earlier 'ftrace: Remove unused list 
'ftrace_direct_funcs'

Dave

> ---
>  include/linux/ftrace.h |  2 --
>  kernel/trace/fgraph.c  | 11 ---
>  kernel/trace/ftrace.c  |  1 -
>  3 files changed, 14 deletions(-)
> 
> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> index b01cca36147ff..e3a83ebd1b333 100644
> --- a/include/linux/ftrace.h
> +++ b/include/linux/ftrace.h
> @@ -413,7 +413,6 @@ struct ftrace_func_entry {
>  };
>  
>  #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
> -extern int ftrace_direct_func_count;
>  unsigned long ftrace_find_rec_direct(unsigned long ip);
>  int register_ftrace_direct(struct ftrace_ops *ops, unsigned long addr);
>  int unregister_ftrace_direct(struct ftrace_ops *ops, unsigned long addr,
> @@ -425,7 +424,6 @@ void ftrace_stub_direct_tramp(void);
>  
>  #else
>  struct ftrace_ops;
> -# define ftrace_direct_func_count 0
>  static inline unsigned long ftrace_find_rec_direct(unsigned long ip)
>  {
>   return 0;
> diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
> index c83c005e654e3..a130b2d898f7c 100644
> --- a/kernel/trace/fgraph.c
> +++ b/kernel/trace/fgraph.c
> @@ -125,17 +125,6 @@ int function_graph_enter(unsigned long ret, unsigned 
> long func,
>  {
>   struct ftrace_graph_ent trace;
>  
> -#ifndef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS
> - /*
> -  * Skip graph tracing if the return location is served by direct 
> trampoline,
> -  * since call sequence and return addresses are unpredictable anyway.
> -  * Ex: BPF trampoline may call original function and may skip frame
> -  * depending on type of BPF programs attached.
> -  */
> - if (ftrace_direct_func_count &&
> - ftrace_find_rec_direct(ret - MCOUNT_INSN_SIZE))
> - return -EBUSY;
> -#endif
>   trace.func = func;
>   trace.depth = ++current->curr_ret_depth;
>  
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index b18b4ece3d7c9..adf34167c3418 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -2538,7 +2538,6 @@ ftrace_find_unique_ops(struct dyn_ftrace *rec)
>  /* Protected by rcu_tasks for reading, and direct_mutex for writing */
>  static struct ftrace_hash __rcu *direct_functions = EMPTY_HASH;
>  static DEFINE_MUTEX(direct_mutex);
> -int ftrace_direct_func_count;
>  
>  /*
>   * Search the direct_functions hash to see if the given instruction pointer
> -- 
> 2.45.0
> 
-- 
 -Open up your eyes, open up your mind, open up your code ---   
/ Dr. David Alan Gilbert|   Running GNU/Linux   | Happy  \ 
\dave @ treblig.org |   | In Hex /
 \ _|_ http://www.treblig.org   |___/



Re: ftrace_direct_func_count ?

2024-05-06 Thread Dr. David Alan Gilbert
* Steven Rostedt (rost...@goodmis.org) wrote:
> On Sat, 4 May 2024 13:35:26 +
> "Dr. David Alan Gilbert"  wrote:
> 
> > Hi,
> >   I've just posted a patch 'ftrace: Remove unused list 
> > 'ftrace_direct_funcs''
> > that clears out some old code, but while at it I noticed the global
> > 'ftrace_direct_func_count'.
> > 
> > As far as I can tell, it's never assigned (or initialised) but it is tested:
> > 
> > kernel/trace/fgraph.c:
> > #ifndef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS
> >   /*
> >* Skip graph tracing if the return location is served by direct 
> > trampoline,
> >* since call sequence and return addresses are unpredictable anyway.
> >* Ex: BPF trampoline may call original function and may skip frame
> >* depending on type of BPF programs attached.
> >*/
> >   if (ftrace_direct_func_count &&
> >   ftrace_find_rec_direct(ret - MCOUNT_INSN_SIZE))
> > return -EBUSY;
> > #endif
> > 
> > So I wasn't sure whether it was just safe to nuke that section
> > or whether it really needed fixing?
> 
> Yes, after commit 8788ca164eb4bad ("ftrace: Remove the legacy
> _ftrace_direct API") that variable is no longer used.

OK, thanks, I'll send a follow up patch to my other patch to nuke
this as well.

Dave

> -- Steve
-- 
 -Open up your eyes, open up your mind, open up your code ---   
/ Dr. David Alan Gilbert|   Running GNU/Linux   | Happy  \ 
\dave @ treblig.org |   | In Hex /
 \ _|_ http://www.treblig.org   |___/



RE: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().

2024-05-04 Thread David Laight
From: Waiman Long
> Sent: 03 May 2024 23:14
> 
> 
> On 5/3/24 17:10, David Laight wrote:
> > From: Waiman Long
> >> Sent: 03 May 2024 17:00
> > ...
> >> David,
> >>
> >> Could you respin the series based on the latest upstream code?
> > I've just reapplied the patches to 'master' and they all apply
> > cleanly and diffing the new patches to the old ones gives no differences.
> > So I think they should still apply.
> >
> > Were you seeing a specific problem?
> >
> > I don't remember any suggested changes either.
> > (Apart from a very local variable I used to keep a patch isolated.)
> 
> No, I just want to make sure that your patches will still apply. Anyway,
> it will be easier for the maintainer to merge your remaining patches if
> you can send out a new version even if they are almost the same as the
> old ones.

I don't think any changes are needed.
So the existing versions are fine.
They applied (well my copy of what I think I sent applied) and built.
So there shouldn't be any issues.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


ftrace_direct_func_count ?

2024-05-04 Thread Dr. David Alan Gilbert
Hi,
  I've just posted a patch 'ftrace: Remove unused list 'ftrace_direct_funcs''
that clears out some old code, but while at it I noticed the global
'ftrace_direct_func_count'.

As far as I can tell, it's never assigned (or initialised) but it is tested:

kernel/trace/fgraph.c:
#ifndef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS
  /*
   * Skip graph tracing if the return location is served by direct trampoline,
   * since call sequence and return addresses are unpredictable anyway.
   * Ex: BPF trampoline may call original function and may skip frame
   * depending on type of BPF programs attached.
   */
  if (ftrace_direct_func_count &&
  ftrace_find_rec_direct(ret - MCOUNT_INSN_SIZE))
return -EBUSY;
#endif

So I wasn't sure whether it was just safe to nuke that section
or whether it really needed fixing?

Dave

-- 
 -Open up your eyes, open up your mind, open up your code ---   
/ Dr. David Alan Gilbert|   Running GNU/Linux   | Happy  \ 
\dave @ treblig.org |   | In Hex /
 \ _|_ http://www.treblig.org   |___/



RE: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().

2024-05-03 Thread David Laight
From: Waiman Long
> Sent: 03 May 2024 17:00
...
> David,
> 
> Could you respin the series based on the latest upstream code?

I've just reapplied the patches to 'master' and they all apply
cleanly and diffing the new patches to the old ones gives no differences.
So I think they should still apply.

Were you seeing a specific problem?

I don't remember any suggested changes either.
(Apart from a very local variable I used to keep a patch isolated.)

    David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().

2024-05-03 Thread David Laight
From: Waiman Long
> Sent: 03 May 2024 17:00
> To: David Laight; 'linux-kernel@vger.kernel.org'; 'pet...@infradead.org'
> Cc: 'mi...@redhat.com'; 'w...@kernel.org'; 'boqun.f...@gmail.com'; 'Linus Torvalds'; 'virtualization@lists.linux-foundation.org'; 'Zeng Heng'
> Subject: Re: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().
> 
> 
> On 12/31/23 23:14, Waiman Long wrote:
> >
> > On 12/31/23 16:55, David Laight wrote:
> >> per_cpu_ptr() indexes __per_cpu_offset[] with the cpu number.
> >> This requires the cpu number be 64bit.
> >> However the value in osq_lock() comes from a 32bit xchg() and there
> >> isn't a way of telling gcc the high bits are zero (they are) so
> >> there will always be an instruction to clear the high bits.
> >>
> >> The cpu number is also offset by one (to make the initialiser 0)
> >> It seems to be impossible to get gcc to convert
> >> __per_cpu_offset[cpu_p1 - 1]
> >> into (__per_cpu_offset - 1)[cpu_p1] (transferring the offset to the
> >> address).
> >>
> >> Converting the cpu number to 32bit unsigned prior to the decrement means
> >> that gcc knows the decrement has set the high bits to zero and doesn't
> >> add a register-register move (or cltq) to zero/sign extend the value.
> >>
> >> Not massive but saves two instructions.
> >>
> >> Signed-off-by: David Laight 
> >> ---
> >>   kernel/locking/osq_lock.c | 6 ++
> >>   1 file changed, 2 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> >> index 35bb99e96697..37a4fa872989 100644
> >> --- a/kernel/locking/osq_lock.c
> >> +++ b/kernel/locking/osq_lock.c
> >> @@ -29,11 +29,9 @@ static inline int encode_cpu(int cpu_nr)
> >>   return cpu_nr + 1;
> >>   }
> >>   -static inline struct optimistic_spin_node *decode_cpu(int
> >> encoded_cpu_val)
> >> +static inline struct optimistic_spin_node *decode_cpu(unsigned int
> >> encoded_cpu_val)
> >>   {
> >> -    int cpu_nr = encoded_cpu_val - 1;
> >> -
> >> -    return per_cpu_ptr(_node, cpu_nr);
> >> +    return per_cpu_ptr(_node, encoded_cpu_val - 1);
> >>   }
> >>     /*
> >
> > You really like micro-optimization.
> >
> > Anyway,
> >
> > Reviewed-by: Waiman Long 
> >
> David,
> 
> Could you respin the series based on the latest upstream code?

Looks like a wet bank holiday weekend.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


Re: [PATCH v22 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-05-02 Thread David Hildenbrand

On 30.04.24 13:13, Vincent Donnefort wrote:

In preparation for allowing the user-space to map a ring-buffer, add
a set of mapping functions:

   ring_buffer_{map,unmap}()

And controls on the ring-buffer:

   ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

   A unique ID for each subbuf of the ring-buffer, currently they are
   only identified through their in-kernel VA.

   A meta-page, where are stored ring-buffer statistics and a
   description for the current reader

The linear mapping exposes the meta-page, and each subbuf of the
ring-buffer, ordered following their unique ID, assigned during the
first mapping.

Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
size will remain unmodified and the splice enabling functions will in
reality simply memcpy the data instead of swapping subbufs.
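
From the user-space side, consuming the mapping would look roughly like the
sketch below -- the struct layout is abbreviated from the meta-page setup in
this patch, and the exact uapi header and tracefs path are assumptions:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

struct trace_buffer_meta {	/* abbreviated; see the series' uapi header */
	uint32_t meta_page_size;
	uint32_t meta_struct_len;
	uint32_t subbuf_size;
	uint32_t nr_subbufs;
	/* ... reader state and counters ... */
};

int main(void)
{
	int fd = open("/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw",
		      O_RDONLY);
	struct trace_buffer_meta *meta;

	/* pgoff 0 is the meta-page; sub-buffer i follows at
	 * pgoff 1 + i * (subbuf_size / page_size). */
	meta = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
	if (meta == MAP_FAILED)
		return 1;

	printf("%u sub-buffers of %u bytes\n",
	       meta->nr_subbufs, meta->subbuf_size);
	return 0;
}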

CC: 
Signed-off-by: Vincent Donnefort 

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index dc5ae4e96aee..96d2140b471e 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h


[...]


+/*
+ *   +--+  pgoff == 0
+ *   |   meta page  |
+ *   +--+  pgoff == 1
+ *   | subbuffer 0  |
+ *   |  |
+ *   +--+  pgoff == (1 + (1 << subbuf_order))
+ *   | subbuffer 1  |
+ *   |  |
+ * ...
+ */
+#ifdef CONFIG_MMU
+static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
+   struct vm_area_struct *vma)
+{
+   unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff;
+   unsigned int subbuf_pages, subbuf_order;
+   struct page **pages;
+   int p = 0, s = 0;
+   int err;
+
+   /* Refuse MP_PRIVATE or writable mappings */
+   if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC ||
+   !(vma->vm_flags & VM_MAYSHARE))
+   return -EPERM;
+
+   /*
+* Make sure the mapping cannot become writable later. Also tell the VM
+* to not touch these pages (VM_DONTCOPY | VM_DONTEXPAND). Finally,
+* prevent migration, GUP and dump (VM_IO).
+*/
+   vm_flags_mod(vma, VM_DONTCOPY | VM_DONTEXPAND | VM_IO, VM_MAYWRITE);
+
+   lockdep_assert_held(&cpu_buffer->mapping_lock);
+
+   subbuf_order = cpu_buffer->buffer->subbuf_order;
+   subbuf_pages = 1 << subbuf_order;
+
+   nr_subbufs = cpu_buffer->nr_pages + 1; /* + reader-subbuf */
+   nr_pages = ((nr_subbufs) << subbuf_order) - pgoff + 1; /* + meta-page */
+
+   vma_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+   if (!vma_pages || vma_pages > nr_pages)
+   return -EINVAL;
+
+   nr_pages = vma_pages;
+
+   pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
+   if (!pages)
+   return -ENOMEM;
+
+   if (!pgoff) {
+   pages[p++] = virt_to_page(cpu_buffer->meta_page);
+
+   /*
+* TODO: Align sub-buffers on their size, once
+* vm_insert_pages() supports the zero-page.
+*/
+   } else {
+   /* Skip the meta-page */
+   pgoff--;
+
+   if (pgoff % subbuf_pages) {
+   err = -EINVAL;
+   goto out;
+   }
+
+   s += pgoff / subbuf_pages;
+   }
+
+   while (s < nr_subbufs && p < nr_pages) {
+   struct page *page = virt_to_page(cpu_buffer->subbuf_ids[s]);
+   int off = 0;
+
+   for (; off < (1 << (subbuf_order)); off++, page++) {
+   if (p >= nr_pages)
+   break;
+
+   pages[p++] = page;
+   }
+   s++;
+   }
+
+   err = vm_insert_pages(vma, vma->vm_start, pages, &nr_pages);


Nit: I did not immediately understand if we could end here with p < 
nr_pages (IOW, pages[] not completely filled).


One source of confusion is the "s < nr_subbufs" check in the while loop: 
why is "p < nr_pages" insufficient?



For the MM bits:

Acked-by: David Hildenbrand 



--
Cheers,

David / dhildenb




Re: [PATCH v22 1/5] ring-buffer: Allocate sub-buffers with __GFP_COMP

2024-05-02 Thread David Hildenbrand

On 30.04.24 13:13, Vincent Donnefort wrote:

In preparation for the ring-buffer memory mapping, allocate compound
pages for the ring-buffer sub-buffers to enable us to map them to
user-space with vm_insert_pages().

Signed-off-by: Vincent Donnefort 


Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-04-30 Thread David Hildenbrand

On 26.04.24 21:55, Guillaume Morin wrote:

On 26 Apr  9:19, David Hildenbrand wrote:

A couple of points:

a) Don't use page_mapcount(). Either folio_mapcount(), but likely you want
to check PageAnonExclusive.

b) If you're not following the can_follow_write_pte/_pmd model, you are
doing something wrong :)

c) The code was heavily changed in mm/mm-unstable. It was merged with t
the common code.

Likely, in mm/mm-unstable, the existing can_follow_write_pte and
can_follow_write_pmd checks will already cover what you want in most cases.

We'd need a can_follow_write_pud() to cover follow_huge_pud() and
(unfortunately) something to handle follow_hugepd() as well similarly.

Copy-pasting what we do in can_follow_write_pte() and adjusting for
different PTE types is the right thing to do. Maybe now it's time to factor
out the common checks into a separate helper.


I tried to get the hugepd stuff right but this was the first I heard
about it :-) Afaict follow_huge_pmd and friends were already DTRT


I'll have to have a closer look at some details (the hugepd writability 
check looks a bit odd), but it's mostly what I would have expected!


--
Cheers,

David / dhildenb




Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-04-30 Thread David Hildenbrand

On 30.04.24 17:22, Guillaume Morin wrote:

On 26 Apr 21:55, Guillaume Morin wrote:


On 26 Apr  9:19, David Hildenbrand wrote:

A couple of points:

a) Don't use page_mapcount(). Either folio_mapcount(), but likely you want
to check PageAnonExclusive.

b) If you're not following the can_follow_write_pte/_pmd model, you are
doing something wrong :)

c) The code was heavily changed in mm/mm-unstable. It was merged with t
the common code.

Likely, in mm/mm-unstable, the existing can_follow_write_pte and
can_follow_write_pmd checks will already cover what you want in most cases.

We'd need a can_follow_write_pud() to cover follow_huge_pud() and
(unfortunately) something to handle follow_hugepd() as well similarly.

Copy-pasting what we do in can_follow_write_pte() and adjusting for
different PTE types is the right thing to do. Maybe now it's time to factor
out the common checks into a separate helper.


I tried to get the hugepd stuff right but this was the first I heard
about it :-) Afaict follow_huge_pmd and friends were already DTRT


I got it working on top of your uprobes-cow branch with the foll force
patch sent friday. Still pretty lightly tested


Sorry for not replying earlier, was busy with other stuff. I'll try 
getiing that stuff into shape and send it out soonish.




I went with using one write uprobe function with some additional
branches. I went back and forth between that and making them 2 different
functions.


All the folio_test_hugetlb() special casing is a bit suboptimal. Likely 
we want a separate variant, because we should be using hugetlb PTE
functions consistently (e.g., huge_pte_uffd_wp() vs pte_uffd_wp(), 
softdirty does not exist etc.)




diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 2f4e88552d3f..8a33e380f7ea 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -83,6 +83,10 @@ static const struct fs_parameter_spec 
hugetlb_fs_parameters[] = {
{}
  };
  
+bool hugetlbfs_mapping(struct address_space *mapping)
+{
+	return mapping->a_ops == &hugetlbfs_aops;
+}


is_vm_hugetlb_page() might be what you are looking for.

[...]


  }
  
-static void copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len)

+static void copy_from_page(struct page *page, unsigned long vaddr, void *dst, 
int len, unsigned long page_mask)
  {
void *kaddr = kmap_atomic(page);
-   memcpy(dst, kaddr + (vaddr & ~PAGE_MASK), len);
+   memcpy(dst, kaddr + (vaddr & ~page_mask), len);
kunmap_atomic(kaddr);
  }


  
-static void copy_to_page(struct page *page, unsigned long vaddr, const void *src, int len)

+static void copy_to_page(struct page *page, unsigned long vaddr, const void 
*src, int len, unsigned long page_mask)
  {
void *kaddr = kmap_atomic(page);
-   memcpy(kaddr + (vaddr & ~PAGE_MASK), src, len);
+   memcpy(kaddr + (vaddr & ~page_mask), src, len);
kunmap_atomic(kaddr);
  }


These two changes really are rather ugly ...

An why are they even required? We get a PAGE_SIZED-based subpage of a 
hugetlb page. We only kmap that one and copy within that one.


In other words, I don't think the copy_from_page() and copy_to_page() 
changes are even required when we consistently work on subpages and not 
suddenly on head pages.


  
-static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode)

+static int verify_opcode(struct page *page, unsigned long vaddr, 
uprobe_opcode_t *new_opcode, unsigned long page_mask)
  {
uprobe_opcode_t old_opcode;
bool is_swbp;
@@ -191,7 +192,8 @@ static int verify_opcode(struct page *page, unsigned long 
vaddr, uprobe_opcode_t
 * is a trap variant; uprobes always wins over any other (gdb)
 * breakpoint.
 */
-   copy_from_page(page, vaddr, &old_opcode, UPROBE_SWBP_INSN_SIZE);
+   copy_from_page(page, vaddr, &old_opcode, UPROBE_SWBP_INSN_SIZE,
+  page_mask);
is_swbp = is_swbp_insn(_opcode);
  
  	if (is_swbp_insn(new_opcode)) {

@@ -376,8 +378,8 @@ struct uwo_data {
uprobe_opcode_t opcode;
  };
  
-static int __write_opcode_pte(pte_t *ptep, unsigned long vaddr,

-   unsigned long next, struct mm_walk *walk)
+static int __write_opcode(pte_t *ptep, unsigned long vaddr,
+ unsigned long page_mask, struct mm_walk *walk)



Unrelated alignment change.


  {
	struct uwo_data *data = walk->private;
	const bool is_register = !!is_swbp_insn(&data->opcode);
@@ -415,9 +417,12 @@ static int __write_opcode_pte(pte_t *ptep, unsigned long 
vaddr,
  
  	/* Unmap + flush the TLB, such that we can write atomically .*/

flush_cache_page(vma, vaddr, pte_pfn(pte));
-   pte = ptep_clear_flush(vma, vaddr, ptep);
+   if (folio_test_hugetlb(folio))
+   pte = huge_ptep_clear_flush(vma, vaddr, ptep);
+   else
+   pte = ptep_clear_flush(vma, vaddr, ptep);
copy_to_page(page, da

Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-04-26 Thread David Hildenbrand

On 26.04.24 02:09, Guillaume Morin wrote:

On 25 Apr 21:56, David Hildenbrand wrote:


On 25.04.24 17:19, Guillaume Morin wrote:

On 24 Apr 23:00, David Hildenbrand wrote:

One issue here is that FOLL_FORCE|FOLL_WRITE is not implemented for
hugetlb mappings. However this was also on my TODO and I have a draft
patch that implements it.


Yes, I documented it back then and added sanity checks in GUP code to fence
it off. Shouldn't be too hard to implement (famous last words) and would be
the cleaner thing to use here once I manage to switch over to
FOLL_WRITE|FOLL_FORCE to break COW.


Yes, my patch seems to be working. The hugetlb code is pretty simple.
And it allows ptrace and the proc pid mem file to work on the executable
private hugetlb mappings.

There is one thing I am unclear about though. hugetlb enforces that
huge_pte_write() is true on FOLL_WRITE in both the fault and
follow_page_mask paths. I am not sure if we can simply assume in the
hugetlb code that if the pte is not writable and this is a write fault
then we're in the FOLL_FORCE|FOLL_WRITE case.  Or do we want to keep the
checks simply not enforce it for FOLL_FORCE|FOLL_WRITE?

The latter is more complicated in the fault path because there is no
FAULT_FLAG_FORCE flag.



I just pushed something to
https://github.com/davidhildenbrand/linux/tree/uprobes_cow

Only very lightly tested so far. Expect the worst :)



I'll try it out and send you the hugetlb bits



I still detest having the zapping logic there, but to get it all right I
don't see a clean way around that.


For hugetlb, we'd primarily have to implement the
mm_walk_ops->hugetlb_entry() callback (well, and FOLL_FORCE).


For FOLL_FORCE, heer is my draft. Let me know if this is what you had in
mind.


diff --git a/mm/gup.c b/mm/gup.c
index 1611e73b1121..ac60e0ae64e8 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1056,9 +1056,6 @@ static int check_vma_flags(struct vm_area_struct *vma, 
unsigned long gup_flags)
if (!(vm_flags & VM_WRITE) || (vm_flags & VM_SHADOW_STACK)) {
if (!(gup_flags & FOLL_FORCE))
return -EFAULT;
-   /* hugetlb does not support FOLL_FORCE|FOLL_WRITE. */
-   if (is_vm_hugetlb_page(vma))
-   return -EFAULT;
/*
 * We used to let the write,force case do COW in a
 * VM_MAYWRITE VM_SHARED !VM_WRITE vma, so ptrace could
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3548eae42cf9..73f86eddf888 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5941,7 +5941,8 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct 
vm_area_struct *vma,
   struct folio *pagecache_folio, spinlock_t *ptl,
   struct vm_fault *vmf)
  {
-   const bool unshare = flags & FAULT_FLAG_UNSHARE;
+   const bool make_writable = !(flags & FAULT_FLAG_UNSHARE) &&
+   (vma->vm_flags & VM_WRITE);
pte_t pte = huge_ptep_get(ptep);
struct hstate *h = hstate_vma(vma);
struct folio *old_folio;
@@ -5959,16 +5960,9 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, 
struct vm_area_struct *vma,
 * can trigger this, because hugetlb_fault() will always resolve
 * uffd-wp bit first.
 */
-   if (!unshare && huge_pte_uffd_wp(pte))
+   if (make_writable && huge_pte_uffd_wp(pte))
return 0;
  
-	/*

-* hugetlb does not support FOLL_FORCE-style write faults that keep the
-* PTE mapped R/O such as maybe_mkwrite() would do.
-*/
-   if (WARN_ON_ONCE(!unshare && !(vma->vm_flags & VM_WRITE)))
-   return VM_FAULT_SIGSEGV;
-
/* Let's take out MAP_SHARED mappings first. */
if (vma->vm_flags & VM_MAYSHARE) {
set_huge_ptep_writable(vma, haddr, ptep);
@@ -5989,7 +5983,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct 
vm_area_struct *vma,
folio_move_anon_rmap(old_folio, vma);
SetPageAnonExclusive(&old_folio->page);
}
-   if (likely(!unshare))
+   if (likely(make_writable))
set_huge_ptep_writable(vma, haddr, ptep);


Maybe we want to refactor that similarly into a 
set_huge_ptep_maybe_writable, and handle the VM_WRITE check internally.


Then, here you'd do

if (unshare)
set_huge_ptep(vma, haddr, ptep);
else
set_huge_ptep_maybe_writable(vma, haddr, ptep);

Something like that.
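
I.e. something like (hypothetical helper, just to make the shape concrete):

static void set_huge_ptep_maybe_writable(struct vm_area_struct *vma,
					 unsigned long addr, pte_t *ptep)
{
	if (vma->vm_flags & VM_WRITE)
		set_huge_ptep_writable(vma, addr, ptep);
	else
		set_huge_ptep(vma, addr, ptep);	/* keep R/O, FOLL_FORCE style */
}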




/* Break COW or unshare */
huge_ptep_clear_flush(vma, haddr, ptep);
@@ -6883,6 +6878,17 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
  }
  #endif /* CONFIG_USERFAULTFD */
  
+static bool is_force_follow(struct vm_area_struct* vma, unsigned int flags,

+struct p

Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-04-25 Thread David Hildenbrand

On 25.04.24 17:19, Guillaume Morin wrote:

On 24 Apr 23:00, David Hildenbrand wrote:

One issue here is that FOLL_FORCE|FOLL_WRITE is not implemented for
hugetlb mappings. However this was also on my TODO and I have a draft
patch that implements it.


Yes, I documented it back then and added sanity checks in GUP code to fence
it off. Shouldn't be too hard to implement (famous last words) and would be
the cleaner thing to use here once I manage to switch over to
FOLL_WRITE|FOLL_FORCE to break COW.


Yes, my patch seems to be working. The hugetlb code is pretty simple.
And it allows ptrace and the proc pid mem file to work on the executable
private hugetlb mappings.

There is one thing I am unclear about though. hugetlb enforces that
huge_pte_write() is true on FOLL_WRITE in both the fault and
follow_page_mask paths. I am not sure if we can simply assume in the
hugetlb code that if the pte is not writable and this is a write fault
then we're in the FOLL_FORCE|FOLL_WRITE case.  Or do we want to keep the
checks simply not enforce it for FOLL_FORCE|FOLL_WRITE?

The latter is more complicated in the fault path because there is no
FAULT_FLAG_FORCE flag.



I just pushed something to
https://github.com/davidhildenbrand/linux/tree/uprobes_cow

Only very lightly tested so far. Expect the worst :)

I still detest having the zapping logic there, but to get it all right I 
don't see a clean way around that.



For hugetlb, we'd primarily have to implement the 
mm_walk_ops->hugetlb_entry() callback (well, and FOLL_FORCE).


Likely vaddr and PAGE_SIZE in uprobe_write_opcode() would have to be 
expanded to cover the full hugetlb page.
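
I.e. roughly (callback signature from include/linux/pagewalk.h, body left as
a stub):

static int uwo_hugetlb_entry(pte_t *ptep, unsigned long hmask,
			     unsigned long addr, unsigned long next,
			     struct mm_walk *walk)
{
	/* Resolve the huge PTE, then reuse the write-opcode logic with
	 * the huge-page mask (~hmask) instead of PAGE_MASK. */
	return 0;
}

static const struct mm_walk_ops uwo_walk_ops = {
	.pte_entry	= __write_opcode_pte,
	.hugetlb_entry	= uwo_hugetlb_entry,
};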


--
Cheers,

David / dhildenb




Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-04-25 Thread David Hildenbrand

On 25.04.24 17:19, Guillaume Morin wrote:

On 24 Apr 23:00, David Hildenbrand wrote:

One issue here is that FOLL_FORCE|FOLL_WRITE is not implemented for
hugetlb mappings. However this was also on my TODO and I have a draft
patch that implements it.


Yes, I documented it back then and added sanity checks in GUP code to fence
it off. Shouldn't be too hard to implement (famous last words) and would be
the cleaner thing to use here once I manage to switch over to
FOLL_WRITE|FOLL_FORCE to break COW.


Yes, my patch seems to be working. The hugetlb code is pretty simple.
And it allows ptrace and the proc pid mem file to work on the executable
private hugetlb mappings.

There is one thing I am unclear about though. hugetlb enforces that
huge_pte_write() is true on FOLL_WRITE in both the fault and
follow_page_mask paths. I am not sure if we can simply assume in the
hugetlb code that if the pte is not writable and this is a write fault
then we're in the FOLL_FORCE|FOLL_WRITE case.  Or do we want to keep the
checks simply not enforce it for FOLL_FORCE|FOLL_WRITE?

The latter is more complicated in the fault path because there is no
FAULT_FLAG_FORCE flag.



handle_mm_fault()->sanitize_fault_flags() makes sure that we'll only 
proceed with a fault either if

* we have VM_WRITE set
* we are in a COW mapping (MAP_PRIVATE with at least VM_MAYWRITE)

Once you see FAULT_FLAG_WRITE and you do have VM_WRITE, you don't care 
about FOLL_FORCE, it's simply a write fault.


Once you see FAULT_FLAG_WRITE and you *don't* have VM_WRITE, you must 
have VM_MAYWRITE and are essentially in FOLL_FORCE.


In a VMA without VM_WRITE, you must never map a PTE writable. In 
ordinary COW code, that's done in wp_page_copy(), where we *always* use 
maybe_mkwrite(), to do exactly what a write fault would do, but without 
mapping the PTE writable.


That's what the whole can_follow_write_pmd()/can_follow_write_pte() is 
about: writing to PTEs that are not writable.


You'll have to follow the exact same model in hugetlb 
(can_follow_write_pmd(), hugetlb_maybe_mkwrite(), ...).
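
Spelled out, the rule is roughly (a simplified paraphrase of
can_follow_write_pte(); the real check in mm/gup.c has more cases):

static bool can_force_write(pte_t pte, struct page *page,
			    struct vm_area_struct *vma, unsigned int flags)
{
	if (pte_write(pte))
		return true;			/* ordinary writable PTE */
	if (!(flags & FOLL_FORCE))
		return false;			/* no override requested */
	if (vma->vm_flags & VM_SHARED)
		return false;			/* FOLL_FORCE is for COW mappings */
	if (!(vma->vm_flags & VM_MAYWRITE))
		return false;
	/* Only write to anonymous pages we own exclusively, so a page
	 * still shared via COW is never modified in place. */
	return page && PageAnon(page) && PageAnonExclusive(page);
}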


--
Cheers,

David / dhildenb




Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-04-24 Thread David Hildenbrand

On 24.04.24 22:44, Guillaume Morin wrote:

On 24 Apr 22:09, David Hildenbrand wrote:

Let me try to see if we can get this done cleaner.

One ugly part (in general here) is the custom page replacement in the
registration part.

We are guaranteed to have a MAP_PRIVATE mapping. Instead of replacing pages
ourselves (which we likely shouldn't do ...) ... maybe we could use
FAULT_FLAG_UNSHARE faults such that we will get an anonymous folio
populated. (like KSM does nowadays)

Punching FOLL_PIN|FOLL_LONGTERM into GUP would achieve the same thing, but
using FOLL_WRITE would not work on many file systems. So maybe we have to
trigger an unsharing fault ourselves.


^ realizing that we already use FOLL_FORCE, so we can just use FOLL_WRITE to
break COW.


It was never clear to me why uprobes was not doing FOLL_WRITE in the
first place, I must say.


It's quite dated code ...

The use of FOLL_FORCE really is ugly here. When registering, we require 
VM_WRITE but ... when unregistering, we don't ...




One issue here is that FOLL_FORCE|FOLL_WRITE is not implemented for
hugetlb mappings. However this was also on my TODO and I have a draft
patch that implements it.


Yes, I documented it back then and added sanity checks in GUP code to 
fence it off. Shouldn't be too hard to implement (famous last words) and 
would be the cleaner thing to use here once I manage to switch over to 
FOLL_WRITE|FOLL_FORCE to break COW.


--
Cheers,

David / dhildenb




Re: [PATCH v21 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-04-24 Thread David Hildenbrand

On 24.04.24 22:31, Vincent Donnefort wrote:

Hi David,

Thanks for your quick response.

On Wed, Apr 24, 2024 at 05:26:39PM +0200, David Hildenbrand wrote:


I gave it some more thought, and I think we are still missing something (I
wish PFNMAP/MIXEDMAP wouldn't be that hard).


+
+/*
+ *   +--+  pgoff == 0
+ *   |   meta page  |
+ *   +--+  pgoff == 1
+ *   | subbuffer 0  |
+ *   |  |
+ *   +--+  pgoff == (1 + (1 << subbuf_order))
+ *   | subbuffer 1  |
+ *   |  |
+ * ...
+ */
+#ifdef CONFIG_MMU
+static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
+   struct vm_area_struct *vma)
+{
+   unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff;
+   unsigned int subbuf_pages, subbuf_order;
+   struct page **pages;
+   int p = 0, s = 0;
+   int err;
+


I'd add some comments here like

/* Refuse any MAP_PRIVATE or writable mappings. */

+   if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC ||
+   !(vma->vm_flags & VM_MAYSHARE))
+   return -EPERM;
+


/*
  * Make sure the mapping cannot become writable later. Also, tell the VM
  * to not touch these pages (VM_DONTCOPY | VM_DONTDUMP) and tell
  * GUP to leave them alone as well (VM_IO).
  */

+   vm_flags_mod(vma,
+VM_MIXEDMAP | VM_PFNMAP |
+VM_DONTCOPY | VM_DONTDUMP | VM_DONTEXPAND | VM_IO,
+VM_MAYWRITE);


I am still really unsure about VM_PFNMAP ... it's not a PFNMAP at all and,
as stated, vm_insert_pages() even complains quite a lot when it would have
to set VM_MIXEDMAP and VM_PFNMAP is already set, likely for a very good
reason.

Can't we limit ourselves to VM_IO?

But then, I wonder if it really helps much regarding GUP: yes, it blocks
ordinary GUP (see check_vma_flags()) but as insert_page_into_pte_locked()
does *not* set pte_special(), GUP-fast (gup_fast_pte_range()) will not
reject it.

Really, if you want GUP-fast to reject it, remap_pfn_range() and friends are
the way to go, that will set pte_special() such that also GUP-fast will
leave it alone, just like vm_normal_page() would.

So ... I know Linus recommended VM_PFNMAP/VM_IO to stop GUP, but it alone
won't stop all of GUP. We really have to mark the PTE as special, which
vm_insert_page() must not do (because it is refcounted!).


Hum, apologies, I am not sure I follow the connection here. Why do you think
the recommendation was to prevent GUP?


Ah, I'm hallucinating! :) "not let people play games with the mapping" to me
implied "make sure nobody touches it". If GUP is acceptable that makes stuff
a lot easier. VM_IO will block some GUP, but not all of it.





Which means: do we really have to stop GUP from grabbing that page?

Using vm_insert_page() only with VM_MIXEDMAP (and without VM_PFNMAP|VM_IO)
would be better.


Under the assumption we do not want to stop all GUP, why not use VM_IO over
VM_MIXEDMAP, which I believe is more restrictive?


VM_MIXEDMAP will be implicitly set by vm_insert_page(). There is a lengthy 
comment
for vm_normal_page() that explains all this madness. VM_MIXEDMAP is primarily
relevant for COW mappings, which you just forbid completely.

remap_pfn_range_notrack() documents the semantics of some of the other flags:

 *   VM_IO tells people not to look at these pages
 *  (accesses can have side effects).
 *   VM_PFNMAP tells the core MM that the base pages are just
 *  raw PFN mappings, and do not have a "struct page" associated
 *  with them.
 *   VM_DONTEXPAND
 *  Disable vma merging and expanding with mremap().
 *   VM_DONTDUMP
 *  Omit vma from core dump, even when VM_IO turned off.

VM_PFNMAP is very likely really not what we want, unless we really perform raw
PFN mappings ... VM_IO we can set without doing much harm.

So I would suggest dropping VM_PFNMAP when using vm_insert_pages(), using only 
VM_IO
and likely just letting vm_insert_pages() set VM_MIXEDMAP for you.
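
I.e., a sketch of the suggested setup (assuming we stick with
vm_insert_pages()):

/*
 * Let vm_insert_pages() set VM_MIXEDMAP itself; only add VM_IO plus the
 * "don't touch" flags, and make sure the mapping can never become writable.
 */
vm_flags_mod(vma,
	     VM_DONTCOPY | VM_DONTDUMP | VM_DONTEXPAND | VM_IO,
	     VM_MAYWRITE);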

[...]



vm_insert_pages() documents: "In case of error, we may have mapped a subset
of the provided pages. It is the caller's responsibility to account for this
case."

Which could for example happen, when allocating a page table fails.

Would we able to deal with that here?


As we are in the mmap path, on an error, I would expect the vma to be destroyed
and those pages whose insertion succeeded to be unmapped?



Ah, we simply fail ->mmap().

In mmap_region(), if call_mmap() failed, we "goto unmap_and_free_vma" where we 
have

/* Undo any partial mapping done by a device driver. */
unmap_region(mm, &vmi.mas, vma, prev, next, vma->vm_start, vma->vm_end, 
vma->vm_end, true);



But perhaps we should proactively zap_page_range_single()?


No, mmap_region() should indeed be handling it correctly already!

--
Cheers,

David / dhildenb




Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-04-24 Thread David Hildenbrand

On 22.04.24 22:53, Guillaume Morin wrote:

On 22 Apr 20:59, David Hildenbrand wrote:

The benefit - to me - is very clear. People do use hugetlb mappings to
run code in production environments. The perf benefits are there for some
workloads. Intel has published a whitepaper about it etc.
Uprobes are a very good tool to do live tracing. If you can restart the
process and reproduce, you should be able to disable hugetlb remapping
but if you need to look at a live process, there are not many options.
Not being able to use uprobes is crippling.


Please add all that as motivation to the patch description or cover letter.


Yes, libhugetlbfs exists. But why do we have to support uprobes with it?
Nobody cared until now, why care now?


I think you could ask the same question for every new feature patch :)


I have to, because it usually indicates a lack of motivation in the
cover-letter/patch description :P


My cover letter was indeed lacking. I will make sure to add this kind of
details next time.
  

Since the removal a few releases ago of the __morecore() hook in glibc,
the main feature of libhugetlbfs is ELF segments remapping. I think
there are definitely a lot of users that simply deal with this
unnecessary limitation.

I am certainly not shoving this patch through anyone's throat if there
is no interest. But we definitely find it a very useful feature ...


Let me try to see if we can get this done cleaner.

One ugly part (in general here) is the custom page replacement in the
registration part.

We are guaranteed to have a MAP_PRIVATE mapping. Instead of replacing pages
ourselves (which we likely shouldn't do ...) ... maybe we could use
FAULT_FLAG_UNSHARE faults such that we will get an anonymous folio
populated. (like KSM does nowadays)

Punching FOLL_PIN|FOLL_LONGTERM into GUP would achieve the same thing, but
using FOLL_WRITE would not work on many file systems. So maybe we have to
trigger an unsharing fault ourselves.


^ realizing that we already use FOLL_FORCE, so we can just use 
FOLL_WRITE to break COW.




That would do the page replacement for us and we "should" be able to lookup
an anonymous folio that we can then just modify, like ptrace would.

But then, there is also the unregistration part, with weird conditional page
replacement. Zapping the anon page if the content matches the content of the
original page is one thing. But why are we replacing an existing anonymous
page with a new anonymous page when the content of the original page differs
(but matches the one from the just-copied page)?

I'll have to further think about that one. It's all a bit nasty.


Sounds good to me. I am willing to help with the code when you have a
plan or testing as you see fit. Let me know.


I'm hacking on a redesign that removes the manual COW breaking logic and 
*might* make it easier to integrate hugetlb. (very likely, but until I 
have the redesign running I cannot promise anything :) )


I'll let you know once I have something ready so you could integrate the 
hugetlb portion.


--
Cheers,

David / dhildenb




Re: [PATCH v21 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-04-24 Thread David Hildenbrand



I gave it some more thought, and I think we are still missing something 
(I wish PFNMAP/MIXEDMAP wouldn't be that hard).



+
+/*
+ *   +--+  pgoff == 0
+ *   |   meta page  |
+ *   +--+  pgoff == 1
+ *   | subbuffer 0  |
+ *   |  |
+ *   +--+  pgoff == (1 + (1 << subbuf_order))
+ *   | subbuffer 1  |
+ *   |  |
+ * ...
+ */
+#ifdef CONFIG_MMU
+static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
+   struct vm_area_struct *vma)
+{
+   unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff;
+   unsigned int subbuf_pages, subbuf_order;
+   struct page **pages;
+   int p = 0, s = 0;
+   int err;
+


I'd add some comments here like

/* Refuse any MAP_PRIVATE or writable mappings. */

+   if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC ||
+   !(vma->vm_flags & VM_MAYSHARE))
+   return -EPERM;
+


/*
 * Make sure the mapping cannot become writable later. Also, tell the VM
 * to not touch these pages (VM_DONTCOPY | VM_DONTDUMP) and tell
 * GUP to leave them alone as well (VM_IO).
 */

+   vm_flags_mod(vma,
+VM_MIXEDMAP | VM_PFNMAP |
+VM_DONTCOPY | VM_DONTDUMP | VM_DONTEXPAND | VM_IO,
+VM_MAYWRITE);


I am still really unsure about VM_PFNMAP ... it's not a PFNMAP at all 
and, as stated, vm_insert_pages() even complains quite a lot when it 
would have to set VM_MIXEDMAP and VM_PFNMAP is already set, likely for a 
very good reason.


Can't we limit ourselves to VM_IO?

But then, I wonder if it really helps much regarding GUP: yes, it blocks 
ordinary GUP (see check_vma_flags()) but as 
insert_page_into_pte_locked() does *not* set pte_special(), GUP-fast 
(gup_fast_pte_range()) will not reject it.


Really, if you want GUP-fast to reject it, remap_pfn_range() and friends 
are the way to go, that will set pte_special() such that also GUP-fast 
will leave it alone, just like vm_normal_page() would.


So ... I know Linus recommended VM_PFNMAP/VM_IO to stop GUP, but it 
alone won't stop all of GUP. We really have to mark the PTE as special, 
which vm_insert_page() must not do (because it is refcounted!).


Which means: do we really have to stop GUP from grabbing that page?

Using vm_insert_page() only with VM_MIXEDMAP (and without 
VM_PFNMAP|VM_IO) would be better.


If we want to stop all of GUP, remap_pfn_range() currently seems 
unavoidable :(
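
For completeness, a sketch of that route -- one call per physically
contiguous range, with made-up 'uaddr'/'subbuf' variables; note that the
caller then fully owns page lifetime, as remap_pfn_range() takes no
references:

/* Marks the PTEs special, so both ordinary GUP and GUP-fast stay away. */
err = remap_pfn_range(vma, uaddr,
		      virt_to_phys(subbuf) >> PAGE_SHIFT,
		      subbuf_pages << PAGE_SHIFT,
		      vma->vm_page_prot);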




+
+   lockdep_assert_held(&cpu_buffer->mapping_lock);
+
+   subbuf_order = cpu_buffer->buffer->subbuf_order;
+   subbuf_pages = 1 << subbuf_order;
+
+   nr_subbufs = cpu_buffer->nr_pages + 1; /* + reader-subbuf */
+   nr_pages = ((nr_subbufs) << subbuf_order) - pgoff + 1; /* + meta-page */
+
+   vma_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+   if (!vma_pages || vma_pages > nr_pages)
+   return -EINVAL;
+
+   nr_pages = vma_pages;
+
+   pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
+   if (!pages)
+   return -ENOMEM;
+
+   if (!pgoff) {
+   pages[p++] = virt_to_page(cpu_buffer->meta_page);
+
+   /*
+* TODO: Align sub-buffers on their size, once
+* vm_insert_pages() supports the zero-page.
+*/
+   } else {
+   /* Skip the meta-page */
+   pgoff--;
+
+   if (pgoff % subbuf_pages) {
+   err = -EINVAL;
+   goto out;
+   }
+
+   s += pgoff / subbuf_pages;
+   }
+
+   while (s < nr_subbufs && p < nr_pages) {
+   struct page *page = virt_to_page(cpu_buffer->subbuf_ids[s]);
+   int off = 0;
+
+   for (; off < (1 << (subbuf_order)); off++, page++) {
+   if (p >= nr_pages)
+   break;
+
+   pages[p++] = page;
+   }
+   s++;
+   }
+
+   err = vm_insert_pages(vma, vma->vm_start, pages, &nr_pages);


vm_insert_pages() documents: "In case of error, we may have mapped a 
subset of the provided pages. It is the caller's responsibility to 
account for this case."


Which could for example happen, when allocating a page table fails.

Would we able to deal with that here?


Again, I wish it would all be easier ...

--
Cheers,

David / dhildenb




Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-04-23 Thread David Hildenbrand

On 22.04.24 22:31, Vincent Donnefort wrote:

On Mon, Apr 22, 2024 at 08:27:17PM +0200, David Hildenbrand wrote:

On 22.04.24 20:20, Vincent Donnefort wrote:

Hi David,

Thanks for having a look, very much appreciated!

On Mon, Apr 22, 2024 at 11:27:11AM +0200, David Hildenbrand wrote:

On 19.04.24 20:25, David Hildenbrand wrote:

On 06.04.24 19:36, Vincent Donnefort wrote:

In preparation for allowing the user-space to map a ring-buffer, add
a set of mapping functions:

  ring_buffer_{map,unmap}()

And controls on the ring-buffer:

  ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

  A unique ID for each subbuf of the ring-buffer, currently they are
  only identified through their in-kernel VA.

  A meta-page, where are stored ring-buffer statistics and a
  description for the current reader

The linear mapping exposes the meta-page, and each subbuf of the
ring-buffer, ordered following their unique ID, assigned during the
first mapping.

Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
size will remain unmodified and the splice enabling functions will in
reality simply memcpy the data instead of swapping subbufs.

CC: 
Signed-off-by: Vincent Donnefort 

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index dc5ae4e96aee..96d2140b471e 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -6,6 +6,8 @@
 #include 
 #include 
+#include 
+
 struct trace_buffer;
 struct ring_buffer_iter;
@@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct 
hlist_node *node);
 #define trace_rb_cpu_prepare   NULL
 #endif
+int ring_buffer_map(struct trace_buffer *buffer, int cpu,
+   struct vm_area_struct *vma);
+int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
+int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
 #endif /* _LINUX_RING_BUFFER_H */
diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
new file mode 100644
index ..ffcd8dfcaa4f
--- /dev/null
+++ b/include/uapi/linux/trace_mmap.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _TRACE_MMAP_H_
+#define _TRACE_MMAP_H_
+
+#include 
+
+/**
+ * struct trace_buffer_meta - Ring-buffer Meta-page description
+ * @meta_page_size:Size of this meta-page.
+ * @meta_struct_len:   Size of this structure.
+ * @subbuf_size:   Size of each sub-buffer.
+ * @nr_subbufs:Number of subbufs in the ring-buffer, including 
the reader.
+ * @reader.lost_events:Number of events lost at the time of the reader 
swap.
+ * @reader.id: subbuf ID of the current reader. ID range [0 : 
@nr_subbufs - 1]
+ * @reader.read:   Number of bytes read on the reader subbuf.
+ * @flags: Placeholder for now, 0 until new features are supported.
+ * @entries:   Number of entries in the ring-buffer.
+ * @overrun:   Number of entries lost in the ring-buffer.
+ * @read:  Number of entries that have been read.
+ * @Reserved1: Reserved for future use.
+ * @Reserved2: Reserved for future use.
+ */
+struct trace_buffer_meta {
+   __u32   meta_page_size;
+   __u32   meta_struct_len;
+
+   __u32   subbuf_size;
+   __u32   nr_subbufs;
+
+   struct {
+   __u64   lost_events;
+   __u32   id;
+   __u32   read;
+   } reader;
+
+   __u64   flags;
+
+   __u64   entries;
+   __u64   overrun;
+   __u64   read;
+
+   __u64   Reserved1;
+   __u64   Reserved2;
+};
+
+#endif /* _TRACE_MMAP_H_ */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index cc9ebe593571..793ecc454039 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -26,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
@@ -338,6 +340,7 @@ struct buffer_page {
local_t  entries;   /* entries on this page */
unsigned longreal_end;  /* real end of data */
unsigned order; /* order of the page */
+   u32  id;/* ID for external mapping */
struct buffer_data_page *page;  /* Actual data page */
 };
@@ -484,6 +487,12 @@ struct ring_buffer_per_cpu {
u64 read_stamp;
/* pages removed since last reset */
unsigned long   pages_removed;
+
+   unsigned int    mapped;
+   struct mutex    mapping_lock;
+   unsigned long   *subbuf_ids;    /* ID to subbuf VA */
+   struct trace_buffer_meta    *meta_page;
+
/* ring buffer pages

Re: [PATCH v3 1/4] virtio_balloon: separate vm events into a function

2024-04-23 Thread David Hildenbrand

On 23.04.24 05:41, zhenwei pi wrote:

All the VM-event-related statistics depend on
'CONFIG_VM_EVENT_COUNTERS'; separate these events into a function to
keep the code clean. Then we can remove 'CONFIG_VM_EVENT_COUNTERS' from
'update_balloon_stats'.

Signed-off-by: zhenwei pi 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-04-22 Thread David Hildenbrand

On 22.04.24 20:11, Guillaume Morin wrote:

(Dropping Mike Kravetz as CC since he has retired and his email is no
longer valid, adding Muchun since he's the current hugetlb maintainer,
as well as linux-trace-kernel)

On 22 Apr 11:39, David Hildenbrand wrote:


On 19.04.24 20:37, Guillaume Morin wrote:

Both libhugetlbfs and the Intel iodlr code allow remapping .text onto a
private hugetlb mapping. It's also pretty easy to do manually.
One drawback of using this functionality is the lack of support for
uprobes (NOTE: uprobes ignores shareable vmas)

This patch adds support for private hugetlb mappings.  It does require exposing
some hugetlbfs innards and relies on copy_user_large_folio(), which is only
available when CONFIG_HUGETLBFS is used, so I had to use an ugly #ifdef

If there is some interest in applying this patch in some form or
another, I am open to any refactoring suggestions (esp. getting rid of the
#ifdef in uprobes.c). I tried to limit the
amount of branching.


All that hugetlb special casing ... oh my. What's the benefit, and why should we
be interested in making that code less clean -- to phrase it in a nice way
;) ?


I do appreciate the nice phrasing. Believe me, I did try to limit the
special casing to a minimum :-).

Outside of __replace_page, I added only 3-ish branches so I do not think
it's *too* bad. The uprobe code is using PAGE_{SHIFT,MASK} quite liberally so I
had to add calls to retrieve these for the hugetlb vmas.
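
For illustration, a hugetlb-aware helper along these lines (the name is made
up) is roughly what is needed wherever uprobes currently assumes PAGE_SIZE:

/* Hypothetical helper: the page size backing a given VMA. */
static unsigned long uprobe_vma_pagesize(struct vm_area_struct *vma)
{
	if (is_vm_hugetlb_page(vma))
		return huge_page_size(hstate_vma(vma));
	return PAGE_SIZE;
}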

__replace_page has a lot of special casing. I certainly agree (and
unfortunately for me it's at the beginning of the patch :)).  It's doing
something pretty uncommon outside of the mm code so it has to make a
bunch of specific hugetlb calls. I am not quite sure how to improve it
but if you have suggestions, I'd be happy to refactor.


See below.



The benefit - to me - is very clear. People do use hugetlb mappings to
run code in production environments. The perf benefits are there for some
workloads. Intel has published a whitepaper about it etc.
Uprobes are a very good tool to do live tracing. If you can restart the
process and reproduce, you should be able to disable hugetlb remapping
but if you need to look at a live process, there are not many options.
Not being able to use uprobes is crippling.


Please add all that as motivation to the patch description or cover letter.




Yes, libhugetlbfs exists. But why do we have to support uprobes with it?
Nobody cared until now, why care now?


I think you could ask the same question for every new feature patch :)


I have to, because it usually indicates a lack of motivation in the 
cover-letter/patch description :P


People will have to maintain that code, and maintaining hugetlb code in 
odd places is no fun ...




Since the removal a few releases ago of the __morecore() hook in glibc,
the main feature of libhugetlbfs is ELF segments remapping. I think
there are definitely a lot of users that simply deal with this
unnecessary limitation.

I am certainly not shoving this patch through anyone's throat if there
is no interest. But we definitely find it a very useful feature ...


Let me try to see if we can get this done cleaner.

One ugly part (in general here) is the custom page replacement in the 
registration part.


We are guaranteed to have a MAP_PRIVATE mapping. Instead of replacing 
pages ourselves (which we likely shouldn't do ...) ... maybe we could 
use FAULT_FLAG_UNSHARE faults such that we will get an anonymous folio 
populated. (like KSM does nowadays)


Punching FOLL_PIN|FOLL_LONGTERM into GUP would achieve the same thing, 
but using FOLL_WRITE would not work on many file systems. So maybe we 
have to trigger an unsharing fault ourselves.


That would do the page replacement for us and we "should" be able to 
lookup an anonymous folio that we can then just modify, like ptrace would.


But then, there is also the unregistration part, with weird conditional page 
replacement. Zapping the anon page if the content matches the content of 
the original page is one thing. But why are we replacing an existing 
anonymous page with a new anonymous page when the content of the 
original page differs (but matches the one from the just-copied page)?


I'll have to further think about that one. It's all a bit nasty.


One thing to note is that hugetlb folios don't grow on trees. Likely, 
many setups *don't* reserve extra hugetlb folios and you might just 
easily be running out of free hugetlb folios that you can use to break 
COW here (replacing a hugetlb file page with a fresh anon hugetlb page). Likely 
it's easy to make register or unregister fail.


--
Cheers,

David / dhildenb




Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-04-22 Thread David Hildenbrand

On 22.04.24 20:20, Vincent Donnefort wrote:

Hi David,

Thanks for having a look, very much appreciated!

On Mon, Apr 22, 2024 at 11:27:11AM +0200, David Hildenbrand wrote:

On 19.04.24 20:25, David Hildenbrand wrote:

On 06.04.24 19:36, Vincent Donnefort wrote:

In preparation for allowing the user-space to map a ring-buffer, add
a set of mapping functions:

 ring_buffer_{map,unmap}()

And controls on the ring-buffer:

 ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

 A unique ID for each subbuf of the ring-buffer, currently they are
 only identified through their in-kernel VA.

 A meta-page, where are stored ring-buffer statistics and a
 description for the current reader

The linear mapping exposes the meta-page, and each subbuf of the
ring-buffer, ordered following their unique ID, assigned during the
first mapping.

Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
size will remain unmodified and the splice enabling functions will in
reality simply memcpy the data instead of swapping subbufs.

CC: 
Signed-off-by: Vincent Donnefort 

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index dc5ae4e96aee..96d2140b471e 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -6,6 +6,8 @@
#include 
#include 
+#include 
+
struct trace_buffer;
struct ring_buffer_iter;
@@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct 
hlist_node *node);
#define trace_rb_cpu_prepare    NULL
#endif
+int ring_buffer_map(struct trace_buffer *buffer, int cpu,
+   struct vm_area_struct *vma);
+int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
+int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
#endif /* _LINUX_RING_BUFFER_H */
diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
new file mode 100644
index ..ffcd8dfcaa4f
--- /dev/null
+++ b/include/uapi/linux/trace_mmap.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _TRACE_MMAP_H_
+#define _TRACE_MMAP_H_
+
+#include 
+
+/**
+ * struct trace_buffer_meta - Ring-buffer Meta-page description
+ * @meta_page_size:Size of this meta-page.
+ * @meta_struct_len:   Size of this structure.
+ * @subbuf_size:   Size of each sub-buffer.
+ * @nr_subbufs:Number of subbufs in the ring-buffer, including 
the reader.
+ * @reader.lost_events:Number of events lost at the time of the reader 
swap.
+ * @reader.id: subbuf ID of the current reader. ID range [0 : 
@nr_subbufs - 1]
+ * @reader.read:   Number of bytes read on the reader subbuf.
+ * @flags: Placeholder for now, 0 until new features are supported.
+ * @entries:   Number of entries in the ring-buffer.
+ * @overrun:   Number of entries lost in the ring-buffer.
+ * @read:  Number of entries that have been read.
+ * @Reserved1: Reserved for future use.
+ * @Reserved2: Reserved for future use.
+ */
+struct trace_buffer_meta {
+   __u32   meta_page_size;
+   __u32   meta_struct_len;
+
+   __u32   subbuf_size;
+   __u32   nr_subbufs;
+
+   struct {
+   __u64   lost_events;
+   __u32   id;
+   __u32   read;
+   } reader;
+
+   __u64   flags;
+
+   __u64   entries;
+   __u64   overrun;
+   __u64   read;
+
+   __u64   Reserved1;
+   __u64   Reserved2;
+};
+
+#endif /* _TRACE_MMAP_H_ */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index cc9ebe593571..793ecc454039 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -9,6 +9,7 @@
#include 
#include 
#include 
+#include 
#include 
#include 
#include 
@@ -26,6 +27,7 @@
#include 
#include 
#include 
+#include 
#include 
#include 
@@ -338,6 +340,7 @@ struct buffer_page {
local_t  entries;   /* entries on this page */
unsigned longreal_end;  /* real end of data */
unsigned order; /* order of the page */
+   u32  id;/* ID for external mapping */
struct buffer_data_page *page;  /* Actual data page */
};
@@ -484,6 +487,12 @@ struct ring_buffer_per_cpu {
u64 read_stamp;
/* pages removed since last reset */
unsigned long   pages_removed;
+
+   unsigned int    mapped;
+   struct mutex    mapping_lock;
+   unsigned long   *subbuf_ids;    /* ID to subbuf VA */
+   struct trace_buffer_meta    *meta_page;
+
/* ring buffer pages to update, > 0 to add, < 0 to remove */
long    nr_pages_to_update;
struct lis

Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-04-22 Thread David Hildenbrand

On 19.04.24 20:25, David Hildenbrand wrote:

On 06.04.24 19:36, Vincent Donnefort wrote:

In preparation for allowing the user-space to map a ring-buffer, add
a set of mapping functions:

ring_buffer_{map,unmap}()

And controls on the ring-buffer:

ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

A unique ID for each subbuf of the ring-buffer, currently they are
only identified through their in-kernel VA.

A meta-page, where are stored ring-buffer statistics and a
description for the current reader

The linear mapping exposes the meta-page, and each subbuf of the
ring-buffer, ordered following their unique ID, assigned during the
first mapping.

Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
size will remain unmodified and the splice enabling functions will in
reality simply memcpy the data instead of swapping subbufs.

CC: 
Signed-off-by: Vincent Donnefort 

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index dc5ae4e96aee..96d2140b471e 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -6,6 +6,8 @@
   #include 
   #include 
   
+#include 

+
   struct trace_buffer;
   struct ring_buffer_iter;
   
@@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node);

   #define trace_rb_cpu_prepare NULL
   #endif
   
+int ring_buffer_map(struct trace_buffer *buffer, int cpu,

+   struct vm_area_struct *vma);
+int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
+int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
   #endif /* _LINUX_RING_BUFFER_H */
diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
new file mode 100644
index ..ffcd8dfcaa4f
--- /dev/null
+++ b/include/uapi/linux/trace_mmap.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _TRACE_MMAP_H_
+#define _TRACE_MMAP_H_
+
+#include 
+
+/**
+ * struct trace_buffer_meta - Ring-buffer Meta-page description
+ * @meta_page_size:Size of this meta-page.
+ * @meta_struct_len:   Size of this structure.
+ * @subbuf_size:   Size of each sub-buffer.
+ * @nr_subbufs:Number of subbufs in the ring-buffer, including 
the reader.
+ * @reader.lost_events:Number of events lost at the time of the reader 
swap.
+ * @reader.id: subbuf ID of the current reader. ID range [0 : 
@nr_subbufs - 1]
+ * @reader.read:   Number of bytes read on the reader subbuf.
+ * @flags: Placeholder for now, 0 until new features are supported.
+ * @entries:   Number of entries in the ring-buffer.
+ * @overrun:   Number of entries lost in the ring-buffer.
+ * @read:  Number of entries that have been read.
+ * @Reserved1: Reserved for future use.
+ * @Reserved2: Reserved for future use.
+ */
+struct trace_buffer_meta {
+   __u32   meta_page_size;
+   __u32   meta_struct_len;
+
+   __u32   subbuf_size;
+   __u32   nr_subbufs;
+
+   struct {
+   __u64   lost_events;
+   __u32   id;
+   __u32   read;
+   } reader;
+
+   __u64   flags;
+
+   __u64   entries;
+   __u64   overrun;
+   __u64   read;
+
+   __u64   Reserved1;
+   __u64   Reserved2;
+};
+
+#endif /* _TRACE_MMAP_H_ */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index cc9ebe593571..793ecc454039 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -9,6 +9,7 @@
   #include 
   #include 
   #include 
+#include 
   #include 
   #include 
   #include 
@@ -26,6 +27,7 @@
   #include 
   #include 
   #include 
+#include 
   
   #include 

   #include 
@@ -338,6 +340,7 @@ struct buffer_page {
local_t  entries;   /* entries on this page */
unsigned longreal_end;  /* real end of data */
unsigned order; /* order of the page */
+   u32  id;/* ID for external mapping */
struct buffer_data_page *page;  /* Actual data page */
   };
   
@@ -484,6 +487,12 @@ struct ring_buffer_per_cpu {

u64 read_stamp;
/* pages removed since last reset */
unsigned long   pages_removed;
+
+   unsigned int    mapped;
+   struct mutex    mapping_lock;
+   unsigned long   *subbuf_ids;    /* ID to subbuf VA */
+   struct trace_buffer_meta    *meta_page;
+
/* ring buffer pages to update, > 0 to add, < 0 to remove */
long    nr_pages_to_update;
struct list_head    new_pages; /* new pages to add */
@@ -1599,6 +1608,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long 
nr_pages, int cpu)
init_irq_work(&cpu_buffer->irq_

Re: [PATCH v2 1/4] virtio_balloon: separate vm events into a function

2024-04-22 Thread David Hildenbrand

On 22.04.24 09:42, zhenwei pi wrote:

All the VM-event-related statistics depend on
'CONFIG_VM_EVENT_COUNTERS'; once any stack variable is required by any
VM event in the future, we would have code like:
  #ifdef CONFIG_VM_EVENT_COUNTERS
   unsigned long foo;
  #endif
   ...
  #ifdef CONFIG_VM_EVENT_COUNTERS
   foo = events[XXX] + events[YYY];
   update_stat(vb, idx++, VIRTIO_BALLOON_S_XXX, foo);
  #endif

Separate the VM events into a single function, and also remove
'CONFIG_VM_EVENT_COUNTERS' from 'update_balloon_stats'.

Signed-off-by: zhenwei pi 
---
  drivers/virtio/virtio_balloon.c | 44 ++---
  1 file changed, 29 insertions(+), 15 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 1f5b3dd31fcf..59fe157e5722 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -316,34 +316,48 @@ static inline void update_stat(struct virtio_balloon *vb, 
int idx,
  
  #define pages_to_bytes(x) ((u64)(x) << PAGE_SHIFT)
  
-static unsigned int update_balloon_stats(struct virtio_balloon *vb)

+/* Return the number of entries filled by vm events */
+static inline unsigned int update_balloon_vm_stats(struct virtio_balloon *vb,
+  unsigned int start)
  {
+#ifdef CONFIG_VM_EVENT_COUNTERS
unsigned long events[NR_VM_EVENT_ITEMS];
-   struct sysinfo i;
-   unsigned int idx = 0;
-   long available;
-   unsigned long caches;
+   unsigned int idx = start;
  
  	all_vm_events(events);

-   si_meminfo();
-
-   available = si_mem_available();
-   caches = global_node_page_state(NR_FILE_PAGES);
-
-#ifdef CONFIG_VM_EVENT_COUNTERS
update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_IN,
-   pages_to_bytes(events[PSWPIN]));
+   pages_to_bytes(events[PSWPIN]));
update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_OUT,
-   pages_to_bytes(events[PSWPOUT]));
+   pages_to_bytes(events[PSWPOUT]));
update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]);
update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
+
  #ifdef CONFIG_HUGETLB_PAGE
update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC,
events[HTLB_BUDDY_PGALLOC]);
update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGFAIL,
events[HTLB_BUDDY_PGALLOC_FAIL]);
-#endif
-#endif
+#endif /* CONFIG_HUGETLB_PAGE */
+
+   return idx - start;
+#else /* CONFIG_VM_EVENT_COUNTERS */
+
+   return 0;
+#endif /* CONFIG_VM_EVENT_COUNTERS */
+}
+
+static unsigned int update_balloon_stats(struct virtio_balloon *vb)
+{
+   struct sysinfo i;
+   unsigned int idx = 0;
+   long available;
+   unsigned long caches;
+
+   idx += update_balloon_vm_stats(vb, idx);


No need to handle idx in such a complicated way now. Just do

unsigned int idx;

idx = update_balloon_vm_stats(vb);

We can go down that path if we ever want to rearrange the code and not 
have the vm_stats first.



+
+   si_meminfo();
+   available = si_mem_available();
+   caches = global_node_page_state(NR_FILE_PAGES);
update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMFREE,
pages_to_bytes(i.freeram));
update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMTOT,


--
Cheers,

David / dhildenb




Re: [PATCH v2 1/4] virtio_balloon: separate vm events into a function

2024-04-22 Thread David Hildenbrand

On 22.04.24 10:04, zhenwei pi wrote:



On 4/22/24 15:47, David Hildenbrand wrote:

On 22.04.24 09:42, zhenwei pi wrote:

All the VM-event-related statistics depend on
'CONFIG_VM_EVENT_COUNTERS'; once any stack variable is required by any
VM event in the future, we would have code like:
   #ifdef CONFIG_VM_EVENT_COUNTERS
    unsigned long foo;
   #endif
    ...
   #ifdef CONFIG_VM_EVENT_COUNTERS
    foo = events[XXX] + events[YYY];
    update_stat(vb, idx++, VIRTIO_BALLOON_S_XXX, foo);
   #endif

Separate the VM events into a single function, and also remove


Why not simply use __maybe_unused for that variable?



1>
static unsigned int update_balloon_stats()
{
  unsigned __maybe_unused long foo;

  ...
#ifdef CONFIG_VM_EVENT_COUNTERS
  foo = events[XXX] + events[YYY];
  update_stat(vb, idx++, VIRTIO_BALLOON_S_XXX, foo);
#endif
}

2>
static inline unsigned int update_balloon_vm_stats()
{
#ifdef CONFIG_VM_EVENT_COUNTERS
  unsigned long foo;

  foo = events[XXX] + events[YYY];
  update_stat(vb, idx++, VIRTIO_BALLOON_S_XXX, foo);
#endif
}

  From my point of view, I don't need to compile code in my brain
when reading the code for case 2. :)


But for #1? :)

I mean, you didn't compile the code in your brain when you sent out v1 :P

But I agree that moving that to a separate function is cleaner, staring 
at the resulting update_balloon_stats().


Let me comment on some nits as a fresh reply.

--
Cheers,

David / dhildenb




Re: [PATCH v2 3/4] virtio_balloon: introduce memory allocation stall counter

2024-04-22 Thread David Hildenbrand

On 22.04.24 09:42, zhenwei pi wrote:

The memory allocation stall counter represents the performance/latency of
memory allocation; expose this counter to the host side via the virtio
balloon device in an out-of-band way.

Signed-off-by: zhenwei pi 
---
  drivers/virtio/virtio_balloon.c | 8 
  include/uapi/linux/virtio_balloon.h | 6 --
  2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 87a1d6fa77fb..ab039e83bc6f 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -323,6 +323,8 @@ static inline unsigned int update_balloon_vm_stats(struct 
virtio_balloon *vb,
  #ifdef CONFIG_VM_EVENT_COUNTERS
unsigned long events[NR_VM_EVENT_ITEMS];
unsigned int idx = start;
+   unsigned int zid;
+   unsigned long stall = 0;
  
  	all_vm_events(events);

update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_IN,
@@ -333,6 +335,12 @@ static inline unsigned int update_balloon_vm_stats(struct 
virtio_balloon *vb,
update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
update_stat(vb, idx++, VIRTIO_BALLOON_S_OOM_KILL, events[OOM_KILL]);
  
+	/* sum all the stall events */

+   for (zid = 0; zid < MAX_NR_ZONES; zid++)
+   stall += events[ALLOCSTALL_NORMAL - ZONE_NORMAL + zid];
+
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_ALLOC_STALL, stall);
+
  #ifdef CONFIG_HUGETLB_PAGE
update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC,
events[HTLB_BUDDY_PGALLOC]);
diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index b17bbe033697..487b893a160e 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -72,7 +72,8 @@ struct virtio_balloon_config {
  #define VIRTIO_BALLOON_S_HTLB_PGALLOC  8  /* Hugetlb page allocations */
  #define VIRTIO_BALLOON_S_HTLB_PGFAIL   9  /* Hugetlb page allocation failures 
*/
  #define VIRTIO_BALLOON_S_OOM_KILL  10 /* OOM killer invocations */
-#define VIRTIO_BALLOON_S_NR   11
+#define VIRTIO_BALLOON_S_ALLOC_STALL   11 /* Stall count of memory allocation 
*/
+#define VIRTIO_BALLOON_S_NR   12
  
  #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \

VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \
@@ -85,7 +86,8 @@ struct virtio_balloon_config {
VIRTIO_BALLOON_S_NAMES_prefix "disk-caches", \
VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \
VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \
-   VIRTIO_BALLOON_S_NAMES_prefix "oom-kills" \
+   VIRTIO_BALLOON_S_NAMES_prefix "oom-kills", \
+   VIRTIO_BALLOON_S_NAMES_prefix "alloc-stalls" \
  }
  
  #define VIRTIO_BALLOON_S_NAMES VIRTIO_BALLOON_S_NAMES_WITH_PREFIX("")


Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v2 1/4] virtio_balloon: separate vm events into a function

2024-04-22 Thread David Hildenbrand

On 22.04.24 09:42, zhenwei pi wrote:

All the VM-event-related statistics depend on
'CONFIG_VM_EVENT_COUNTERS'; once any stack variable is required by any
VM event in the future, we would have code like:
  #ifdef CONFIG_VM_EVENT_COUNTERS
   unsigned long foo;
  #endif
   ...
  #ifdef CONFIG_VM_EVENT_COUNTERS
   foo = events[XXX] + events[YYY];
   update_stat(vb, idx++, VIRTIO_BALLOON_S_XXX, foo);
  #endif

Separate the VM events into a single function, and also remove


Why not simply use __maybe_unused for that variable?

--
Cheers,

David / dhildenb




[PATCH v2 1/1] virtio: Add support for the virtio suspend feature

2024-04-22 Thread David Stevens
Add support for the VIRTIO_F_SUSPEND feature. When this feature is
negotiated, power management can use it to suspend virtio devices
instead of resorting to resetting the devices entirely.

Signed-off-by: David Stevens 
---
 drivers/virtio/virtio.c| 60 ++
 drivers/virtio/virtio_pci_common.c | 34 -
 drivers/virtio/virtio_pci_modern.c | 19 ++
 include/linux/virtio.h |  8 
 include/uapi/linux/virtio_config.h | 10 -
 5 files changed, 112 insertions(+), 19 deletions(-)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index f4080692b351..c7685a0dc995 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0-only
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -498,6 +499,13 @@ void unregister_virtio_device(struct virtio_device *dev)
 }
 EXPORT_SYMBOL_GPL(unregister_virtio_device);
 
+void virtio_device_mark_removed(struct virtio_device *dev)
+{
+   /* Pairs with READ_ONCE() in virtio_device_set_suspend_bit(). */
+   WRITE_ONCE(dev->removed, true);
+}
+EXPORT_SYMBOL_GPL(virtio_device_mark_removed);
+
 #ifdef CONFIG_PM_SLEEP
 int virtio_device_freeze(struct virtio_device *dev)
 {
@@ -580,6 +588,58 @@ int virtio_device_restore(struct virtio_device *dev)
return ret;
 }
 EXPORT_SYMBOL_GPL(virtio_device_restore);
+
+static int virtio_device_set_suspend_bit(struct virtio_device *dev, bool 
enabled)
+{
+   u8 status, target;
+
+   status = dev->config->get_status(dev);
+   if (enabled)
+   target = status | VIRTIO_CONFIG_S_SUSPEND;
+   else
+   target = status & ~VIRTIO_CONFIG_S_SUSPEND;
+
+   if (target == status)
+   return 0;
+
+   dev->config->set_status(dev, target);
+
+   while ((status = dev->config->get_status(dev)) != target) {
+   if (status & VIRTIO_CONFIG_S_NEEDS_RESET)
+   return -EIO;
+   /* Pairs with WRITE_ONCE() in virtio_device_mark_removed(). */
+   if (READ_ONCE(dev->removed))
+   return -EIO;
+   msleep(10);
+   }
+   return 0;
+}
+
+bool virtio_device_can_suspend(struct virtio_device *dev)
+{
+   return virtio_has_feature(dev, VIRTIO_F_SUSPEND) &&
+  (dev->config->get_status(dev) & VIRTIO_CONFIG_S_FEATURES_OK);
+}
+EXPORT_SYMBOL_GPL(virtio_device_can_suspend);
+
+int virtio_device_suspend(struct virtio_device *dev)
+{
+   return virtio_device_set_suspend_bit(dev, true);
+}
+EXPORT_SYMBOL_GPL(virtio_device_suspend);
+
+bool virtio_device_can_resume(struct virtio_device *dev)
+{
+   return virtio_has_feature(dev, VIRTIO_F_SUSPEND) &&
+  (dev->config->get_status(dev) & VIRTIO_CONFIG_S_SUSPEND);
+}
+EXPORT_SYMBOL_GPL(virtio_device_can_resume);
+
+int virtio_device_resume(struct virtio_device *dev)
+{
+   return virtio_device_set_suspend_bit(dev, false);
+}
+EXPORT_SYMBOL_GPL(virtio_device_resume);
 #endif
 
 static int virtio_init(void)
diff --git a/drivers/virtio/virtio_pci_common.c 
b/drivers/virtio/virtio_pci_common.c
index b655fccaf773..a6ca7718b5dc 100644
--- a/drivers/virtio/virtio_pci_common.c
+++ b/drivers/virtio/virtio_pci_common.c
@@ -495,31 +495,26 @@ static int virtio_pci_restore(struct device *dev)
	return virtio_device_restore(&vp_dev->vdev);
 }
 
-static bool vp_supports_pm_no_reset(struct device *dev)
+static int virtio_pci_suspend(struct device *dev)
 {
struct pci_dev *pci_dev = to_pci_dev(dev);
-   u16 pmcsr;
-
-   if (!pci_dev->pm_cap)
-   return false;
-
-   pci_read_config_word(pci_dev, pci_dev->pm_cap + PCI_PM_CTRL, &pmcsr);
-   if (PCI_POSSIBLE_ERROR(pmcsr)) {
-   dev_err(dev, "Unable to query pmcsr");
-   return false;
-   }
+   struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
 
-   return pmcsr & PCI_PM_CTRL_NO_SOFT_RESET;
-}
+   if (virtio_device_can_suspend(&vp_dev->vdev))
+   return virtio_device_suspend(&vp_dev->vdev);
 
-static int virtio_pci_suspend(struct device *dev)
-{
-   return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_freeze(dev);
+   return virtio_pci_freeze(dev);
 }
 
 static int virtio_pci_resume(struct device *dev)
 {
-   return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_restore(dev);
+   struct pci_dev *pci_dev = to_pci_dev(dev);
+   struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
+
+   if (virtio_device_can_resume(&vp_dev->vdev))
+   return virtio_device_resume(&vp_dev->vdev);
+
+   return virtio_pci_restore(dev);
 }
 
 static const struct dev_pm_ops virtio_pci_pm_ops = {
@@ -623,9 +618,12 @@ static void virtio_pci_remove(struct pci_dev *pci_dev)
 * Device is marked broken on surprise removal so that virtio upper
  

[PATCH v2 0/1] virtio: Add suspend support

2024-04-22 Thread David Stevens
This series implements support for the virtio device suspend feature
that is under discussion. Unfortunately, the virtio mailing list is
currently being migrated, so recent discussion of the proposal is not
archived anywhere. The current version of the proposal is a
combination of [1] and [2].

[1] https://lore.kernel.org/all/20230906081637.32185-3-lingshan@intel.com/
[2] https://lists.oasis-open.org/archives/virtio-comment/202402/msg00088.html

v1 -> v2:
 - Check for device removal while waiting for suspend bit.
 - Don't try to suspend uninitialized devices.
 - Use msleep instead of mdelay.

David Stevens (1):
  virtio: Add support for the virtio suspend feature

 drivers/virtio/virtio.c| 60 ++
 drivers/virtio/virtio_pci_common.c | 34 -
 drivers/virtio/virtio_pci_modern.c | 19 ++
 include/linux/virtio.h |  8 
 include/uapi/linux/virtio_config.h | 10 -
 5 files changed, 112 insertions(+), 19 deletions(-)


base-commit: e8f897f4afef0031fe618a8e94127a0934896aba
-- 
2.44.0.769.g3c40516874-goog




Re: [PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging

2024-04-19 Thread David Matlack
On 2024-04-19 01:47 PM, James Houghton wrote:
> On Thu, Apr 11, 2024 at 10:28 AM David Matlack  wrote:
> > On 2024-04-11 10:08 AM, David Matlack wrote:
> > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > {
> > bool young = false;
> >
> > if (!range->arg.metadata->bitmap && kvm_memslots_have_rmaps(kvm))
> > young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> >
> > if (tdp_mmu_enabled)
> > young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
> >
> > return young;
> > }
> >
> > bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > {
> > bool young = false;
> >
> > if (!range->arg.metadata->bitmap && kvm_memslots_have_rmaps(kvm))
> > young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
> >
> > if (tdp_mmu_enabled)
> > young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
> >
> > return young;
> 
> 
> Yeah I think this is the right thing to do. Given your other
> suggestions (on patch 3), I think this will look something like this
> -- let me know if I've misunderstood something:
> 
> bool check_rmap = !bitmap && kvm_memslots_have_rmaps(kvm);
> 
> if (check_rmap)
>   KVM_MMU_LOCK(kvm);
> 
> rcu_read_lock(); // perhaps only do this when we don't take the MMU lock?
> 
> if (check_rmap)
>   kvm_handle_gfn_range(/* ... */ kvm_test_age_rmap)
> 
> if (tdp_mmu_enabled)
>   kvm_tdp_mmu_test_age_gfn() // modified to be RCU-safe
> 
> rcu_read_unlock();
> if (check_rmap)
>   KVM_MMU_UNLOCK(kvm);

I was thinking a little different. If you follow my suggestion to first
make the TDP MMU aging lockless, you'll end up with something like this
prior to adding bitmap support (note: the comments are just for
demonstrative purposes):

bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool young = false;

/* Shadow MMU aging holds write-lock. */
if (kvm_memslots_have_rmaps(kvm)) {
write_lock(&kvm->mmu_lock);
young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
write_unlock(&kvm->mmu_lock);
}

/* TDP MMU aging is lockless. */
if (tdp_mmu_enabled)
young |= kvm_tdp_mmu_age_gfn_range(kvm, range);

return young;
}

Then when you add bitmap support it would look something like this:

bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
unsigned long *bitmap = range->arg.metadata->bitmap;
bool young = false;

/* Shadow MMU aging holds write-lock and does not support bitmap. */
if (kvm_memslots_have_rmaps(kvm) && !bitmap) {
write_lock(&kvm->mmu_lock);
young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
write_unlock(&kvm->mmu_lock);
}

/* TDP MMU aging is lockless and supports bitmap. */
if (tdp_mmu_enabled)
young |= kvm_tdp_mmu_age_gfn_range(kvm, range);

return young;
}

rcu_read_lock/unlock() would be called in kvm_tdp_mmu_age_gfn_range().
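
I.e., something along these lines inside the TDP MMU (a sketch only; the
actual SPTE walk is elided):

bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
	bool young = false;

	rcu_read_lock();
	/* Walk the SPTEs for the GFN range and test/clear accessed bits,
	 * consulting/updating the bitmap when one was provided. */
	rcu_read_unlock();

	return young;
}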

That brings up a question I've been wondering about. If KVM only
advertises support for the bitmap lookaround when shadow roots are not
allocated, does that mean MGLRU will be blind to accesses made by L2
when nested virtualization is enabled? And does that mean the Linux MM
will think all L2 memory is cold (i.e. a good candidate for swapping)
because it isn't seeing accesses made by L2?



Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-04-19 Thread David Hildenbrand
  s++;
+   }
+
+   err = vm_insert_pages(vma, vma->vm_start, pages, &nr_pages);


I think Linus suggested it ("avoid all the sub-page ref-counts entirely 
by using VM_PFNMAP, and use vm_insert_pages()"), but ... 
vm_insert_pages() will:

* Mess with mapcounts
* Mess with refcounts

See 
insert_pages()->insert_page_in_batch_locked()->insert_page_into_pte_locked().


So we'll mess with the mapcount and refcount of the shared zeropage ...
hmm.


If I am not wrong, vm_normal_page() would not return the shared zeropage 
in case we don't have CONFIG_ARCH_HAS_PTE_SPECIAL ... so 
unmap()->...->zap_present_ptes() would not decrement the refcount and we 
could overflow it. ... we also shouldn't ever mess with the mapcount of 
the shared zeropage in the first place.


vm_insert_page() is clearer on that: "This allows drivers to insert 
individual pages they've allocated into a user vma". You didn't allocate 
the shared zeropage.


I'm wondering if we even expect VM_MIXEDMAP and VM_PFNMAP to be set at 
the same time? vm_insert_pages() would BUG_ON in case VM_PFNMAP is 
already set and it would set VM_MIXEDMAP ... similarly 
vmf_insert_pfn_prot() would not be happy about that at all ...


BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) == 
(VM_PFNMAP|VM_MIXEDMAP));


... remap_pfn_range() is used by io_uring_mmap for a similar purpose. 
But it only supports a single PFN range at a time and requires the 
caller to handle refcounting of pages.


It's getting late in Germany, so I might be missing something; but using 
the shared zeropage here might be a problem.


--
Cheers,

David / dhildenb




Re: [PATCH 3/3] virtio_balloon: introduce memory scan/reclaim info

2024-04-18 Thread David Hildenbrand

On 18.04.24 08:26, zhenwei pi wrote:

Expose memory scan/reclaim information to the host side via virtio
balloon device.

Now we have a metric to analyze the memory performance:

y: counter increases
n: counter does not changes
h: the rate of counter change is high
l: the rate of counter change is low

OOM: VIRTIO_BALLOON_S_OOM_KILL
STALL: VIRTIO_BALLOON_S_ALLOC_STALL
ASCAN: VIRTIO_BALLOON_S_SCAN_ASYNC
DSCAN: VIRTIO_BALLOON_S_SCAN_DIRECT
ARCLM: VIRTIO_BALLOON_S_RECLAIM_ASYNC
DRCLM: VIRTIO_BALLOON_S_RECLAIM_DIRECT

- OOM[y], STALL[*], ASCAN[*], DSCAN[*], ARCLM[*], DRCLM[*]:
   the guest runs under really critical memory pressure

- OOM[n], STALL[h], ASCAN[*], DSCAN[l], ARCLM[*], DRCLM[l]:
   the memory allocation stalls due to cgroup, not the global memory
   pressure.

- OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[h]:
   the memory allocation stalls due to global memory pressure. The
   performance gets hurt a lot. A high ratio between DRCLM/DSCAN shows
   quite effective memory reclaiming.

- OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[l]:
   the memory allocation stalls due to global memory pressure.
   the ratio between DRCLM/DSCAN gets low, the guest OS is thrashing
   heavily; in the serious case this leads to poor performance and difficult
   troubleshooting. E.g., sshd may block on memory allocation when
   accepting new connections, and a user can't log in to the VM via ssh.

- OOM[n], STALL[n], ASCAN[h], DSCAN[n], ARCLM[l], DRCLM[n]:
   the low ratio between ARCLM/ASCAN shows that the guest tries to
   reclaim more memory, but it can't. Once more memory is required in
   the future, it will struggle to reclaim memory.

Signed-off-by: zhenwei pi 
---
  drivers/virtio/virtio_balloon.c |  9 +
  include/uapi/linux/virtio_balloon.h | 12 ++--
  2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index e88e6573afa5..bc9332c1ae85 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -356,6 +356,15 @@ static unsigned int update_balloon_stats(struct 
virtio_balloon *vb)
stall += events[ALLOCSTALL_MOVABLE];
update_stat(vb, idx++, VIRTIO_BALLOON_S_ALLOC_STALL, stall);
  
+	update_stat(vb, idx++, VIRTIO_BALLOON_S_ASYNC_SCAN,

+   pages_to_bytes(events[PGSCAN_KSWAPD]));
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_DIRECT_SCAN,
+   pages_to_bytes(events[PGSCAN_DIRECT]));
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_ASYNC_RECLAIM,
+   pages_to_bytes(events[PGSTEAL_KSWAPD]));
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_DIRECT_RECLAIM,
+   pages_to_bytes(events[PGSTEAL_DIRECT]));
+
  #ifdef CONFIG_HUGETLB_PAGE
update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC,
events[HTLB_BUDDY_PGALLOC]);
diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index 487b893a160e..ee35a372805d 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -73,7 +73,11 @@ struct virtio_balloon_config {
  #define VIRTIO_BALLOON_S_HTLB_PGFAIL   9  /* Hugetlb page allocation failures 
*/
  #define VIRTIO_BALLOON_S_OOM_KILL  10 /* OOM killer invocations */
  #define VIRTIO_BALLOON_S_ALLOC_STALL   11 /* Stall count of memory allocation 
*/
-#define VIRTIO_BALLOON_S_NR   12
+#define VIRTIO_BALLOON_S_ASYNC_SCAN12 /* Amount of memory scanned 
asynchronously */
+#define VIRTIO_BALLOON_S_DIRECT_SCAN   13 /* Amount of memory scanned directly 
*/
+#define VIRTIO_BALLOON_S_ASYNC_RECLAIM 14 /* Amount of memory reclaimed 
asynchronously */
+#define VIRTIO_BALLOON_S_DIRECT_RECLAIM 15 /* Amount of memory reclaimed 
directly */
+#define VIRTIO_BALLOON_S_NR   16
  
  #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \

VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \
@@ -87,7 +91,11 @@ struct virtio_balloon_config {
VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \
VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \
VIRTIO_BALLOON_S_NAMES_prefix "oom-kills", \
-   VIRTIO_BALLOON_S_NAMES_prefix "alloc-stalls" \
+   VIRTIO_BALLOON_S_NAMES_prefix "alloc-stalls", \
+   VIRTIO_BALLOON_S_NAMES_prefix "async-scans", \
+   VIRTIO_BALLOON_S_NAMES_prefix "direct-scans", \
+   VIRTIO_BALLOON_S_NAMES_prefix "async-reclaims", \
+   VIRTIO_BALLOON_S_NAMES_prefix "direct-reclaims" \
  }
  
  #define VIRTIO_BALLOON_S_NAMES VIRTIO_BALLOON_S_NAMES_WITH_PREFIX("")


Not an expert on these counters/events, but LGTM

Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH 2/3] virtio_balloon: introduce memory allocation stall counter

2024-04-18 Thread David Hildenbrand

On 18.04.24 08:26, zhenwei pi wrote:

The memory allocation stall counter represents the performance/latency of
memory allocation; expose this counter to the host side via the virtio
balloon device in an out-of-band way.

Signed-off-by: zhenwei pi 
---
  drivers/virtio/virtio_balloon.c | 20 +++-
  include/uapi/linux/virtio_balloon.h |  6 --
  2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index fd19934a847f..e88e6573afa5 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -321,7 +321,7 @@ static unsigned int update_balloon_stats(struct 
virtio_balloon *vb)
unsigned long events[NR_VM_EVENT_ITEMS];
struct sysinfo i;
unsigned int idx = 0;
-   long available;
+   long available, stall = 0;
unsigned long caches;
  
  	all_vm_events(events);

@@ -338,6 +338,24 @@ static unsigned int update_balloon_stats(struct 
virtio_balloon *vb)
update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]);
update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
update_stat(vb, idx++, VIRTIO_BALLOON_S_OOM_KILL, events[OOM_KILL]);
+
+   /* sum all the stall events */
+#ifdef CONFIG_ZONE_DMA
+   stall += events[ALLOCSTALL_DMA];
+#endif
+#ifdef CONFIG_ZONE_DMA32
+   stall += events[ALLOCSTALL_DMA32];
+#endif
+#ifdef CONFIG_HIGHMEM
+   stall += events[ALLOCSTALL_HIGH];
+#endif
+#ifdef CONFIG_ZONE_DEVICE
+   stall += events[ALLOCSTALL_DEVICE];
+#endif


Naive me would think that ALLOCSTALL_DEVICE is always 0. :)

Likely we should just do:

for (zid = 0; zid < MAX_NR_ZONES; zid++)
stall += events[ALLOCSTALL_NORMAL - ZONE_NORMAL + zid];

(see isolate_lru_folios() -> __count_zid_vm_events(), where we rely on 
the same ordering)


Apart form that, LGTM.

--
Cheers,

David / dhildenb




Re: [PATCH 1/3] virtio_balloon: introduce oom-kill invocations

2024-04-18 Thread David Hildenbrand

On 18.04.24 08:26, zhenwei pi wrote:

When the guest OS runs under critical memory pressure, the guest
starts to kill processes. A guest monitor agent may scan 'oom_kill'
from /proc/vmstat and report the OOM-kill event. However, the agent
itself may be killed, and we will lose this critical event (and the
later events).

For now we can also grep for magic words in the guest kernel log from
the host side. Rather than relying on this unstable approach, let the
virtio balloon device report OOM-kill invocations instead.

Signed-off-by: zhenwei pi 


Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb




[PATCH 1/1] virtio: Add support for the virtio suspend feature

2024-04-17 Thread David Stevens
Add support for the VIRTIO_F_SUSPEND feature. When this feature is
negotiated, power management can use it to suspend virtio devices
instead of resorting to resetting the devices entirely.

Signed-off-by: David Stevens 
---
 drivers/virtio/virtio.c| 32 ++
 drivers/virtio/virtio_pci_common.c | 29 +++
 drivers/virtio/virtio_pci_modern.c | 19 ++
 include/linux/virtio.h |  2 ++
 include/uapi/linux/virtio_config.h | 10 +-
 5 files changed, 74 insertions(+), 18 deletions(-)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index f4080692b351..cd11495a5098 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0-only
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -580,6 +581,37 @@ int virtio_device_restore(struct virtio_device *dev)
return ret;
 }
 EXPORT_SYMBOL_GPL(virtio_device_restore);
+
+static int virtio_device_set_suspend_bit(struct virtio_device *dev, bool 
enabled)
+{
+   u8 status, target;
+
+   status = dev->config->get_status(dev);
+   if (enabled)
+   target = status | VIRTIO_CONFIG_S_SUSPEND;
+   else
+   target = status & ~VIRTIO_CONFIG_S_SUSPEND;
+   dev->config->set_status(dev, target);
+
+   while ((status = dev->config->get_status(dev)) != target) {
+   if (status & VIRTIO_CONFIG_S_NEEDS_RESET)
+   return -EIO;
+   mdelay(10);
+   }
+   return 0;
+}
+
+int virtio_device_suspend(struct virtio_device *dev)
+{
+   return virtio_device_set_suspend_bit(dev, true);
+}
+EXPORT_SYMBOL_GPL(virtio_device_suspend);
+
+int virtio_device_resume(struct virtio_device *dev)
+{
+   return virtio_device_set_suspend_bit(dev, false);
+}
+EXPORT_SYMBOL_GPL(virtio_device_resume);
 #endif
 
 static int virtio_init(void)
diff --git a/drivers/virtio/virtio_pci_common.c 
b/drivers/virtio/virtio_pci_common.c
index b655fccaf773..4d542de05970 100644
--- a/drivers/virtio/virtio_pci_common.c
+++ b/drivers/virtio/virtio_pci_common.c
@@ -495,31 +495,26 @@ static int virtio_pci_restore(struct device *dev)
	return virtio_device_restore(&vp_dev->vdev);
 }
 
-static bool vp_supports_pm_no_reset(struct device *dev)
+static int virtio_pci_suspend(struct device *dev)
 {
struct pci_dev *pci_dev = to_pci_dev(dev);
-   u16 pmcsr;
-
-   if (!pci_dev->pm_cap)
-   return false;
-
-   pci_read_config_word(pci_dev, pci_dev->pm_cap + PCI_PM_CTRL, &pmcsr);
-   if (PCI_POSSIBLE_ERROR(pmcsr)) {
-   dev_err(dev, "Unable to query pmcsr");
-   return false;
-   }
+   struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
 
-   return pmcsr & PCI_PM_CTRL_NO_SOFT_RESET;
-}
+   if (virtio_has_feature(&vp_dev->vdev, VIRTIO_F_SUSPEND))
+   return virtio_device_suspend(&vp_dev->vdev);
 
-static int virtio_pci_suspend(struct device *dev)
-{
-   return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_freeze(dev);
+   return virtio_pci_freeze(dev);
 }
 
 static int virtio_pci_resume(struct device *dev)
 {
-   return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_restore(dev);
+   struct pci_dev *pci_dev = to_pci_dev(dev);
+   struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
+
+   if (virtio_has_feature(&vp_dev->vdev, VIRTIO_F_SUSPEND))
+   return virtio_device_resume(&vp_dev->vdev);
+
+   return virtio_pci_restore(dev);
 }
 
 static const struct dev_pm_ops virtio_pci_pm_ops = {
diff --git a/drivers/virtio/virtio_pci_modern.c 
b/drivers/virtio/virtio_pci_modern.c
index f62b530aa3b5..ac8734526b8d 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -209,6 +209,22 @@ static void vp_modern_avq_deactivate(struct virtio_device 
*vdev)
__virtqueue_break(admin_vq->info.vq);
 }
 
+static bool vp_supports_pm_no_reset(struct pci_dev *pci_dev)
+{
+   u16 pmcsr;
+
+   if (!pci_dev->pm_cap)
+   return false;
+
+   pci_read_config_word(pci_dev, pci_dev->pm_cap + PCI_PM_CTRL, &pmcsr);
+   if (PCI_POSSIBLE_ERROR(pmcsr)) {
+   dev_err(&pci_dev->dev, "Unable to query pmcsr");
+   return false;
+   }
+
+   return pmcsr & PCI_PM_CTRL_NO_SOFT_RESET;
+}
+
 static void vp_transport_features(struct virtio_device *vdev, u64 features)
 {
struct virtio_pci_device *vp_dev = to_vp_device(vdev);
@@ -223,6 +239,9 @@ static void vp_transport_features(struct virtio_device 
*vdev, u64 features)
 
if (features & BIT_ULL(VIRTIO_F_ADMIN_VQ))
__virtio_set_bit(vdev, VIRTIO_F_ADMIN_VQ);
+
+   if (features & BIT_ULL(VIRTIO_F_SUSPEND) && 
vp_supports_pm_no_reset(pci_dev))
+   __virtio_set_bit(vdev, VIRTIO_F_SUSPEND);
 }
 
 static i
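
As a usage note, other transports would presumably wire the new helpers
into their dev_pm_ops the same way the PCI code above does; a minimal,
hypothetical sketch (the virtio_foo_* names and the dev_to_vfoo()
accessor are made up):

	/* Hypothetical transport PM callback built on the new helpers. */
	static int virtio_foo_suspend(struct device *dev)
	{
		struct virtio_device *vdev = &dev_to_vfoo(dev)->vdev;

		/* Use the lightweight SUSPEND handshake when negotiated... */
		if (virtio_has_feature(vdev, VIRTIO_F_SUSPEND))
			return virtio_device_suspend(vdev);

		/* ...otherwise fall back to the transport's full freeze path. */
		return virtio_foo_freeze(dev);
	}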

[PATCH 0/1] virtio: Add suspend support

2024-04-17 Thread David Stevens
This series implements support for the virtio device suspend feature
that is under discussion. Unfortunately, the virtio mailing list is
currently being migrated, so recent discussion of the proposal is not
archived anywhere. The current version of the proposal is a
combination of [1] and [2].

[1] https://lore.kernel.org/all/20230906081637.32185-3-lingshan@intel.com/
[2] https://lists.oasis-open.org/archives/virtio-comment/202402/msg00088.html

David Stevens (1):
  virtio: Add support for the virtio suspend feature

 drivers/virtio/virtio.c| 32 ++
 drivers/virtio/virtio_pci_common.c | 29 +++
 drivers/virtio/virtio_pci_modern.c | 19 ++
 include/linux/virtio.h |  2 ++
 include/uapi/linux/virtio_config.h | 10 +-
 5 files changed, 74 insertions(+), 18 deletions(-)


base-commit: e8f897f4afef0031fe618a8e94127a0934896aba
-- 
2.44.0.683.g7961c838ac-goog




Re: [PATCH] module: ban '.', '..' as module names, ban '/' in module names

2024-04-15 Thread Dr. David Alan Gilbert
* Alexey Dobriyan (adobri...@gmail.com) wrote:
> On Sun, Apr 14, 2024 at 01:58:55PM -0700, Luis Chamberlain wrote:
> > On Sun, Apr 14, 2024 at 10:05:05PM +0300, Alexey Dobriyan wrote:
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -3616,4 +3616,12 @@ extern int vfs_fadvise(struct file *file, loff_t 
> > > offset, loff_t len,
> > >  extern int generic_fadvise(struct file *file, loff_t offset, loff_t len,
> > >  int advice);
> > >  
> > > +/*
> > > + * Use this if data from userspace end up as directory/filename on
> > > + * some virtual filesystem.
> > > + */
> > > +static inline bool string_is_vfs_ready(const char *s)
> > > +{
> > > + return strcmp(s, ".") != 0 && strcmp(s, "..") != 0 && !strchr(s, '/');
> > > +}
> > >  #endif /* _LINUX_FS_H */
> > > --- a/kernel/module/main.c
> > > +++ b/kernel/module/main.c
> > > @@ -2893,6 +2893,11 @@ static int load_module(struct load_info *info, 
> > > const char __user *uargs,
> > >  
> > >   audit_log_kern_module(mod->name);
> > >  
> > > + if (!string_is_vfs_ready(mod->name)) {
> > > + err = -EINVAL;
> > > + goto free_module;
> > > + }
> > > +
> > 
> > Sensible change however to put string_is_vfs_ready() in include/linux/fs.h 
> > is a stretch if there really are no other users.
> 
> This is forward thinking patch :-)
> 
> Other subsystems may create files/directories in proc/sysfs, and should
> check for bad names as well:
> 
>   /proc/2821/net/dev_snmp6/eth0
> 
> This looks exactly like something coming from userspace and making it
> into /proc, so the filter function doesn't belong to kernel/module/internal.h

You mean like:

[24180.292204] tuxthe: renamed from tuxthe
root@dalek:/home/dg# ls /sys/class/net/
enp5s0  lo  tuxthe  tuxthe  tuxthe  virbr0  virbr1

?

Dave
> 
-- 
 -Open up your eyes, open up your mind, open up your code ---   
/ Dr. David Alan Gilbert|   Running GNU/Linux   | Happy  \ 
\dave @ treblig.org |   | In Hex /
 \ _|_ http://www.treblig.org   |___/
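
For illustration, the helper from Alexey's patch is meant to sit at the
point where a userspace-controlled string is about to become a dirent;
a minimal hypothetical sketch (validate_object_name() is made up):

	/* Hypothetical check before a user-supplied name lands in /proc or sysfs. */
	static int validate_object_name(const char *name)
	{
		/* rejects ".", ".." and any name containing '/' */
		if (!string_is_vfs_ready(name))
			return -EINVAL;

		return 0;
	}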



Re: [RFC 3/3] virtio_balloon: introduce memory scan/reclaim info

2024-04-15 Thread David Hildenbrand

On 15.04.24 10:41, zhenwei pi wrote:

Expose memory scan/reclaim information to the host side via virtio
balloon device.

Now we have a metric to analyze the memory performance:

y: counter increases
n: counter does not change
h: the rate of counter change is high
l: the rate of counter change is low

OOM: VIRTIO_BALLOON_S_OOM_KILL
STALL: VIRTIO_BALLOON_S_ALLOC_STALL
ASCAN: VIRTIO_BALLOON_S_SCAN_ASYNC
DSCAN: VIRTIO_BALLOON_S_SCAN_DIRECT
ARCLM: VIRTIO_BALLOON_S_RECLAIM_ASYNC
DRCLM: VIRTIO_BALLOON_S_RECLAIM_DIRECT

- OOM[y], STALL[*], ASCAN[*], DSCAN[*], ARCLM[*], DRCLM[*]:
   the guest runs under really critical memory pressure

- OOM[n], STALL[h], ASCAN[*], DSCAN[l], ARCLM[*], DRCLM[l]:
   the memory allocation stalls due to cgroup, not the global memory
   pressure.

- OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[h]:
   the memory allocation stalls due to global memory pressure. The
   performance gets hurt a lot. A high ratio between DRCLM/DSCAN shows
   quite effective memory reclaiming.

- OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[l]:
   the memory allocation stalls due to global memory pressure.
   The ratio between DRCLM/DSCAN gets low and the guest OS is thrashing
   heavily; in serious cases this leads to poor performance and
   difficult troubleshooting. For example, sshd may block on memory
   allocation when accepting new connections, so a user can't log in to
   the VM via ssh.

- OOM[n], STALL[n], ASCAN[h], DSCAN[n], ARCLM[l], DRCLM[n]:
   the low ratio between ARCLM/ASCAN shows that the guest tries to
   reclaim more memory, but it can't. Once more memory is required in
   future, it will struggle to reclaim memory.
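
To make the matrix above concrete, a host-side monitor could classify
the pressure from per-interval deltas of these counters roughly as
follows (an illustrative sketch; the thresholds and names are
invented):

	enum pressure { P_OK, P_CGROUP_STALL, P_GLOBAL, P_THRASHING };

	static enum pressure classify(long oom, long stall, long dscan, long drclm)
	{
		if (oom)
			return P_THRASHING;	/* OOM[y]: really critical */
		if (!stall)
			return P_OK;
		if (!dscan)
			return P_CGROUP_STALL;	/* stalls without direct scanning */
		/* a low DRCLM/DSCAN ratio means reclaim is ineffective */
		return (4 * drclm < dscan) ? P_THRASHING : P_GLOBAL;
	}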

Signed-off-by: zhenwei pi 
---
  drivers/virtio/virtio_balloon.c |  9 +
  include/uapi/linux/virtio_balloon.h | 12 ++--
  2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 4b9c9569f6e5..7b86514e99d4 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -372,6 +372,15 @@ static unsigned int update_balloon_stats(struct 
virtio_balloon *vb)
stall += events[ALLOCSTALL_MOVABLE];
update_stat(vb, idx++, VIRTIO_BALLOON_S_ALLOC_STALL, stall);
  
+	update_stat(vb, idx++, VIRTIO_BALLOON_S_SCAN_ASYNC,

+   pages_to_bytes(events[PGSCAN_KSWAPD]));
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_SCAN_DIRECT,
+   pages_to_bytes(events[PGSCAN_DIRECT]));
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_RECLAIM_ASYNC,
+   pages_to_bytes(events[PGSTEAL_KSWAPD]));
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_RECLAIM_DIRECT,
+   pages_to_bytes(events[PGSTEAL_DIRECT]));
+
return idx;
  }
  
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h

index 13d0c32ba27c..0875a9cccb01 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -73,7 +73,11 @@ struct virtio_balloon_config {
  #define VIRTIO_BALLOON_S_HTLB_PGFAIL   9  /* Hugetlb page allocation failures 
*/
  #define VIRTIO_BALLOON_S_OOM_KILL  10 /* OOM killer invocations */
#define VIRTIO_BALLOON_S_ALLOC_STALL   11 /* Stall count of memory allocation */
-#define VIRTIO_BALLOON_S_NR   12
+#define VIRTIO_BALLOON_S_SCAN_ASYNC    12 /* Amount of memory scanned asynchronously */
+#define VIRTIO_BALLOON_S_SCAN_DIRECT   13 /* Amount of memory scanned directly */
+#define VIRTIO_BALLOON_S_RECLAIM_ASYNC 14 /* Amount of memory reclaimed asynchronously */
+#define VIRTIO_BALLOON_S_RECLAIM_DIRECT 15 /* Amount of memory reclaimed directly */


Should these be the other way around:

ASYN_SCAN
...
ASYNC_RECLAIM

so we can get ...


+#define VIRTIO_BALLOON_S_NR   16
  
  #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \

VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \
@@ -87,7 +91,11 @@ struct virtio_balloon_config {
VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \
VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \
VIRTIO_BALLOON_S_NAMES_prefix "oom-kill", \
-   VIRTIO_BALLOON_S_NAMES_prefix "alloc-stall" \
+   VIRTIO_BALLOON_S_NAMES_prefix "alloc-stall", \
+   VIRTIO_BALLOON_S_NAMES_prefix "scan-async", \
+   VIRTIO_BALLOON_S_NAMES_prefix "scan-direct", \
+   VIRTIO_BALLOON_S_NAMES_prefix "reclaim-async", \
+   VIRTIO_BALLOON_S_NAMES_prefix "reclaim-direct" \


...

"async-scans", "async-reclaims" ...

--
Cheers,

David / dhildenb




Re: [RFC 2/3] virtio_balloon: introduce memory allocation stall counter

2024-04-15 Thread David Hildenbrand

On 15.04.24 10:41, zhenwei pi wrote:

The memory allocation stall counter represents the performance/latency
of memory allocation; expose this counter to the host side via the
virtio balloon device in an out-of-band way.

Signed-off-by: zhenwei pi 
---
  drivers/virtio/virtio_balloon.c | 19 ++-
  include/uapi/linux/virtio_balloon.h |  6 --
  2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index fd8daa742734..4b9c9569f6e5 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -321,7 +321,7 @@ static unsigned int update_balloon_stats(struct 
virtio_balloon *vb)
unsigned long events[NR_VM_EVENT_ITEMS];
struct sysinfo i;
unsigned int idx = 0;
-   long available;
+   long available, stall = 0;
unsigned long caches;
  
  	all_vm_events(events);

@@ -355,6 +355,23 @@ static unsigned int update_balloon_stats(struct 
virtio_balloon *vb)
update_stat(vb, idx++, VIRTIO_BALLOON_S_OOM_KILL,
events[OOM_KILL]);
  
+	/* sum all the stall events */

+#ifdef CONFIG_ZONE_DMA
+   stall += events[ALLOCSTALL_DMA];
+#endif
+#ifdef CONFIG_ZONE_DMA32
+   stall += events[ALLOCSTALL_DMA32];
+#endif
+#ifdef CONFIG_HIGHMEM
+   stall += events[ALLOCSTALL_HIGH];
+#endif
+#ifdef CONFIG_ZONE_DEVICE
+   stall += events[ALLOCSTALL_DEVICE];
+#endif
+   stall += events[ALLOCSTALL_NORMAL];
+   stall += events[ALLOCSTALL_MOVABLE];
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_ALLOC_STALL, stall);
+
return idx;
  }
  
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h

index cde5547e64a7..13d0c32ba27c 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -72,7 +72,8 @@ struct virtio_balloon_config {
  #define VIRTIO_BALLOON_S_HTLB_PGALLOC  8  /* Hugetlb page allocations */
  #define VIRTIO_BALLOON_S_HTLB_PGFAIL   9  /* Hugetlb page allocation failures 
*/
  #define VIRTIO_BALLOON_S_OOM_KILL  10 /* OOM killer invocations */
-#define VIRTIO_BALLOON_S_NR   11
+#define VIRTIO_BALLOON_S_ALLOC_STALL   11 /* Stall count of memory allocation */
+#define VIRTIO_BALLOON_S_NR   12
  
  #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \

VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \
@@ -85,7 +86,8 @@ struct virtio_balloon_config {
VIRTIO_BALLOON_S_NAMES_prefix "disk-caches", \
VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \
VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \
-   VIRTIO_BALLOON_S_NAMES_prefix "oom-kill" \
+   VIRTIO_BALLOON_S_NAMES_prefix "oom-kill", \
+   VIRTIO_BALLOON_S_NAMES_prefix "alloc-stall" \


"alloc-stalls"

--
Cheers,

David / dhildenb




Re: [RFC 1/3] virtio_balloon: introduce oom-kill invocations

2024-04-15 Thread David Hildenbrand

On 15.04.24 10:41, zhenwei pi wrote:

When the guest OS runs under critical memory pressure, the guest
starts to kill processes. A guest monitor agent may scan 'oom_kill'
from /proc/vmstat and report the OOM KILL event. However, the agent
itself may be killed, and we will lose this critical event (and any
later events).

For now we can also grep for magic words in the guest kernel log from
the host side. Rather than relying on this unstable approach, have
virtio balloon report OOM-KILL invocations instead.

Signed-off-by: zhenwei pi 
---
  drivers/virtio/virtio_balloon.c | 2 ++
  include/uapi/linux/virtio_balloon.h | 6 --
  2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 1f5b3dd31fcf..fd8daa742734 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -352,6 +352,8 @@ static unsigned int update_balloon_stats(struct 
virtio_balloon *vb)
pages_to_bytes(available));
update_stat(vb, idx++, VIRTIO_BALLOON_S_CACHES,
pages_to_bytes(caches));
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_OOM_KILL,
+   events[OOM_KILL]);
  
  	return idx;

  }
diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index ddaa45e723c4..cde5547e64a7 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -71,7 +71,8 @@ struct virtio_balloon_config {
  #define VIRTIO_BALLOON_S_CACHES   7   /* Disk caches */
  #define VIRTIO_BALLOON_S_HTLB_PGALLOC  8  /* Hugetlb page allocations */
  #define VIRTIO_BALLOON_S_HTLB_PGFAIL   9  /* Hugetlb page allocation failures 
*/
-#define VIRTIO_BALLOON_S_NR   10
+#define VIRTIO_BALLOON_S_OOM_KILL  10 /* OOM killer invocations */
+#define VIRTIO_BALLOON_S_NR   11
  
  #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \

VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \
@@ -83,7 +84,8 @@ struct virtio_balloon_config {
VIRTIO_BALLOON_S_NAMES_prefix "available-memory", \
VIRTIO_BALLOON_S_NAMES_prefix "disk-caches", \
VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \
-   VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures" \
+   VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \
+   VIRTIO_BALLOON_S_NAMES_prefix "oom-kill" \


"oom-kills"

--
Cheers,

David / dhildenb




Re: [RFC 0/3] Improve memory statistics for virtio balloon

2024-04-15 Thread David Hildenbrand

On 15.04.24 10:41, zhenwei pi wrote:

Hi,

When the guest runs under critical memory pressure, the guest becomes
too slow; even sshd enters the D (uninterruptible) state on memory
allocation. We can't log in to this VM to do any troubleshooting.

The guest kernel log via virtual TTY (on the host side) only provides a
few necessary lines after an OOM. More detailed memory statistics are
required, so we can see the explicit memory events and estimate the
pressure.

I'm going to introduce several VM counters for virtio balloon:
- oom-kill
- alloc-stall
- scan-async
- scan-direct
- reclaim-async
- reclaim-direct


IIUC, we're only exposing events that are already getting provided via 
all_vm_events(), correct?


In that case, I don't really see a major issue. Some considerations:

(1) These new events are fairly Linux specific.

PSWPIN and friends are fairly generic, but HTLB is already fairly
Linux-specific. OOM-kills don't really exist on Windows, for example.
We'll have to be careful to properly describe what the semantics are.


(2) How should we handle it if Linux ever stops supporting a certain
event (e.g., after a major reclaim rework)? I assume we'd simply report
nothing, like we currently do for VIRTIO_BALLOON_S_HTLB_PGALLOC without
CONFIG_HUGETLB_PAGE.
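
(For reference, that is the pattern the driver already follows for the
hugetlb counters; quoted from memory, so the event names may be
slightly off:)

	#ifdef CONFIG_HUGETLB_PAGE
	/* simply not reported when the kernel lacks hugetlb support */
	update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC,
		    events[HTLB_BUDDY_PGALLOC]);
	update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGFAIL,
		    events[HTLB_BUDDY_PGALLOC_FAIL]);
	#endif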


--
Cheers,

David / dhildenb




Re: [PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging

2024-04-12 Thread David Matlack
On 2024-04-01 11:29 PM, James Houghton wrote:
> Only handle the TDP MMU case for now. In other cases, if a bitmap was
> not provided, fall back to the slow path that takes mmu_lock, or, if a
> bitmap was provided, inform the caller that the bitmap is unreliable.

I think this patch will trigger a lockdep assert in

  kvm_tdp_mmu_age_gfn_range
kvm_tdp_mmu_handle_gfn
  for_each_tdp_mmu_root
__for_each_tdp_mmu_root
  kvm_lockdep_assert_mmu_lock_held

... because it walks tdp_mmu_roots without holding mmu_lock.

Yu's patch[1] added a lockless walk to the TDP MMU. We'd need something
similar here and also update the comment above tdp_mmu_roots describing
how tdp_mmu_roots can be read locklessly.

[1] https://lore.kernel.org/kvmarm/zitx64bbx5vdj...@google.com/
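
The lockless walk there is roughly of the following shape (a sketch
from memory, not the exact code; handle_gfn_range() is a stand-in and
the refcount helpers may differ):

	rcu_read_lock();
	list_for_each_entry_rcu(root, &kvm->arch.tdp_mmu_roots, link) {
		if (!kvm_tdp_mmu_get_root(root))	/* root already being freed */
			continue;
		ret |= handle_gfn_range(kvm, root, range);
		kvm_tdp_mmu_put_root(kvm, root);
	}
	rcu_read_unlock();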



Re: [PATCH v3 3/7] KVM: Add basic bitmap support into kvm_mmu_notifier_test/clear_young

2024-04-12 Thread David Matlack
On 2024-04-01 11:29 PM, James Houghton wrote:
> Add kvm_arch_prepare_bitmap_age() for architectures to indicate that
> they support bitmap-based aging in kvm_mmu_notifier_test_clear_young()
> and that they do not need KVM to grab the MMU lock for writing. This
> function allows architectures to do other locking or other preparatory
> work that they need.

There's a lot going on here. I know it's extra work but I think the
series would be easier to understand and simplify if you introduced the
KVM support for lockless test/clear_young() first, and then introduce
support for the bitmap-based look-around.

Specifically:

 1. Make all test/clear_young() notifiers lockless. i.e. Move the
mmu_lock into the architecture-specific code (kvm_age_gfn() and
kvm_test_age_gfn()).

 2. Convert KVM/x86's kvm_{test,}_age_gfn() to be lockless for the TDP
MMU.

 3. Convert KVM/arm64's kvm_{test,}_age_gfn() to hold the mmu_lock in
read-mode.

 4. Add bitmap-based look-around support to KVM/x86 and KVM/arm64
(probably 2-3 patches).

> 
> If an architecture does not implement kvm_arch_prepare_bitmap_age() or
> is unable to do bitmap-based aging at runtime (and marks the bitmap as
> unreliable):
>  1. If a bitmap was provided, we inform the caller that the bitmap is
> unreliable (MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE).
>  2. If a bitmap was not provided, fall back to the old logic.
> 
> Also add logic for architectures to easily use the provided bitmap if
> they are able. The expectation is that the architecture's implementation
> of kvm_gfn_test_age() will use kvm_gfn_record_young(), and
> kvm_gfn_age() will use kvm_gfn_should_age().
> 
> Suggested-by: Yu Zhao 
> Signed-off-by: James Houghton 
> ---
>  include/linux/kvm_host.h | 60 ++
>  virt/kvm/kvm_main.c  | 92 +---
>  2 files changed, 127 insertions(+), 25 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 1800d03a06a9..5862fd7b5f9b 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1992,6 +1992,26 @@ extern const struct _kvm_stats_desc 
> kvm_vm_stats_desc[];
>  extern const struct kvm_stats_header kvm_vcpu_stats_header;
>  extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[];
>  
> +/*
> + * Architectures that support using bitmaps for kvm_age_gfn() and
> + * kvm_test_age_gfn should return true for kvm_arch_prepare_bitmap_age()
> + * and do any work they need to prepare. The subsequent walk will not
> + * automatically grab the KVM MMU lock, so some architectures may opt
> + * to grab it.
> + *
> + * If true is returned, a subsequent call to kvm_arch_finish_bitmap_age() is
> + * guaranteed.
> + */
> +#ifndef kvm_arch_prepare_bitmap_age
> +static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)

I find the name of these architecture callbacks misleading/confusing.
The lockless path is used even when a bitmap is not provided. i.e.
bitmap can be NULL in between kvm_arch_prepare/finish_bitmap_age().

> +{
> + return false;
> +}
> +#endif
> +#ifndef kvm_arch_finish_bitmap_age
> +static inline void kvm_arch_finish_bitmap_age(struct mmu_notifier *mn) {}
> +#endif

kvm_arch_finish_bitmap_age() seems unnecessary. I think the KVM/arm64
code could acquire/release the mmu_lock in read-mode in
kvm_test_age_gfn() and kvm_age_gfn() right?
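
i.e. something along these lines (a hypothetical sketch; the stage-2
walker call is a placeholder, not the real arm64 API):

	bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
	{
		bool young;

		read_lock(&kvm->mmu_lock);
		/* placeholder for the stage-2 walk that tests/clears the AF */
		young = stage2_age_range(kvm, range->start, range->end);
		read_unlock(&kvm->mmu_lock);

		return young;
	}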

> +
>  #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
>  static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
>  {
> @@ -2076,9 +2096,16 @@ static inline bool 
> mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm,
>   return READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq;
>  }
>  
> +struct test_clear_young_metadata {
> + unsigned long *bitmap;
> + unsigned long bitmap_offset_end;

bitmap_offset_end is unused.

> + unsigned long end;
> + bool unreliable;
> +};
>  union kvm_mmu_notifier_arg {
>   pte_t pte;
>   unsigned long attributes;
> + struct test_clear_young_metadata *metadata;

nit: Maybe s/metadata/test_clear_young/ ?

>  };
>  
>  struct kvm_gfn_range {
> @@ -2087,11 +2114,44 @@ struct kvm_gfn_range {
>   gfn_t end;
>   union kvm_mmu_notifier_arg arg;
>   bool may_block;
> + bool lockless;

Please document this as it's somewhat subtle. A reader might think this
implies the entire operation runs without taking the mmu_lock.

>  };
>  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
>  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
>  bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
>  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> +
> +static inline void kvm_age_set_unreliable(struct kvm_gfn_range *range)
> +{
> + struct test_clear_young_metadata *args = range->arg.metadata;
> +
> + args->unreliable = true;
> +}
> +static inline unsigned long kvm_young_bitmap_offset(struct kvm_gfn_range 
> *range,
> + 
