[ovmf test] 172926: regressions - FAIL

2022-09-01 Thread osstest service owner
flight 172926 ovmf real [real]
http://logs.test-lab.xenproject.org/osstest/logs/172926/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-i386-libvirt    6 libvirt-build fail REGR. vs. 172136
 build-amd64-libvirt   6 libvirt-build fail REGR. vs. 172136

version targeted for testing:
 ovmf ec87181192f013f4f7ff916b2a39ff2c87b079f3
baseline version:
 ovmf 444260d45ec2a84e8f8c192b3539a3cd5591d009

 Last test of basis   172136  2022-08-04 06:43:42 Z   28 days
 Failing since        172151  2022-08-05 02:40:28 Z   28 days  223 attempts
 Testing same since   172926  2022-09-02 02:30:44 Z    0 days    1 attempts


People who touched revisions under test:
  Abdul Lateef Attar 
  Abner Chang 
  Ard Biesheuvel 
  Bob Feng 
  Chasel Chiu 
  Chen, Xiao X 
  Czajkowski, Maciej 
  Dimitrije Pavlov 
  Dun Tan 
  Edward Pickup 
  Foster Nong 
  Gregx Yeh 
  Guo Dong 
  Igor Kulchytskyy 
  James Lu 
  Jeff Brasen 
  Jiaxin Wu 
  Jose Marinho 
  KasimX Liu 
  Kavya 
  Konstantin Aladyshev 
  Kun Qin 
  Liming Gao 
  Liu, Zhiguang 
  Maciej Czajkowski 
  Michael D Kinney 
  Michael Kubacki 
  Pierre Gondois 
  Ray Ni 
  Rebecca Cran 
  Sainadh Nagolu 
  Sami Mujawar 
  Shengfengx Xue 
  Wu, Jiaxin 
  Xiao X Chen 
  Yuanhao Xie 
  Zhiguang Liu 

jobs:
 build-amd64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-i386   pass
 build-amd64-libvirt  fail
 build-i386-libvirt   fail
 build-amd64-pvops    pass
 build-i386-pvops pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64 pass
 test-amd64-i386-xl-qemuu-ovmf-amd64  pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Not pushing.

(No revision log; it would be 1457 lines long.)



Re: [RFC PATCH 16/30] mm: enable slab allocation tagging for kmalloc and friends

2022-09-01 Thread Roman Gushchin
On Tue, Aug 30, 2022 at 02:49:05PM -0700, Suren Baghdasaryan wrote:
> Redefine kmalloc, krealloc, kzalloc, kcalloc, etc. to record allocations
> and deallocations done by these functions.

One particular case where this functionality might be very useful:
in the past we've seen examples (at Fb) where it was hard to understand
the difference between slab memory sizes of two different kernel versions
due to slab cache merging. Once a slab cache is merged with another large
cache, this data is pretty much lost. So I definitely see value in stats
which are independent of kmem caches.
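
A user-space sketch of what such kmem-cache-independent, per-callsite stats could look like (alloc_tag, my_alloc and last_tag are illustrative names, not the proposed kernel API; uses GNU C statement expressions):

```c
#include <assert.h>
#include <stdlib.h>

/* Per-callsite allocation stats: each call site owns a static struct,
 * so the numbers survive even if the underlying allocator later merges
 * its caches. Illustrative sketch only. */
struct alloc_tag {
    const char *file;
    int line;
    unsigned long bytes;   /* total bytes requested at this site */
    unsigned long calls;   /* number of allocations at this site */
};

struct alloc_tag *last_tag; /* lets the demo inspect the counter */

#define my_alloc(size) ({                                        \
    static struct alloc_tag tag = { __FILE__, __LINE__, 0, 0 };  \
    tag.bytes += (size);                                         \
    tag.calls++;                                                 \
    last_tag = &tag;                                             \
    malloc(size);                                                \
})
```

Because the counters live at the call site rather than in the cache, merging two caches cannot mix up their stats.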

The performance overhead is a concern here, so more data would be useful.

Thanks!



Re: [RFC PATCH 00/30] Code tagging framework and applications

2022-09-01 Thread Roman Gushchin
On Wed, Aug 31, 2022 at 01:56:08PM -0700, Yosry Ahmed wrote:
> On Wed, Aug 31, 2022 at 12:02 PM Kent Overstreet
>  wrote:
> >
> > On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote:
> > > On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> > > > Whatever asking for an explanation as to why equivalent functionality
> > > > cannot be created from ftrace/kprobe/eBPF/whatever is reasonable.
> > >
> > > Fully agreed and this is especially true for a change this size
> > > 77 files changed, 3406 insertions(+), 703 deletions(-)
> >
> > In the case of memory allocation accounting, you flat cannot do this
> > with ftrace - you could maybe do a janky version that isn't fully
> > accurate, much slower, more complicated for the developer to understand
> > and debug, and more complicated for the end user.
> >
> > But please, I invite anyone who's actually been doing this with ftrace
> > to demonstrate otherwise.
> >
> > Ftrace just isn't the right tool for the job here - we're talking about
> > adding per-callsite accounting to some of the fastest fast paths in the
> > kernel.
> >
> > And the size of the changes for memory allocation accounting is much
> > more reasonable:
> >  33 files changed, 623 insertions(+), 99 deletions(-)
> >
> > The code tagging library should exist anyway; it's been open-coded half
> > a dozen times in the kernel already.
> >
> > And once we've got that, the time stats code is _also_ far simpler than
> > doing it with ftrace would be. If anyone here has successfully debugged
> > latency issues with ftrace, I'd really like to hear it. Again, for
> > debugging latency issues you want something that can always be on, and
> > that's not cheap with ftrace - and never mind the hassle of correlating
> > start and end wait trace events, building up histograms, etc. - that's
> > all handled here.
> >
> > Cheap, simple, easy to use. What more could you want?
> >
> 
> This is very interesting work! Do you have any data about the overhead
> this introduces, especially in a production environment? I am
> especially interested in memory allocations tracking and detecting
> leaks.

+1

I think the main question is whether it can indeed always be turned on
in production or not. If not, the advantage over ftrace/bpf/... is not
that obvious. Otherwise it will indeed be a VERY useful thing.

Also, there is a lot of interesting stuff within this patchset, which
might be useful elsewhere. So thanks to Kent and Suren for this work!

Thanks!



Re: [RFC PATCH 14/30] mm: prevent slabobj_ext allocations for slabobj_ext and kmem_cache objects

2022-09-01 Thread Roman Gushchin
On Tue, Aug 30, 2022 at 02:49:03PM -0700, Suren Baghdasaryan wrote:
> Use __GFP_NO_OBJ_EXT to prevent recursions when allocating slabobj_ext
> objects. Also prevent slabobj_ext allocations for kmem_cache objects.
> 
> Signed-off-by: Suren Baghdasaryan 

Patches 12-14 look good to me.
It's probably too early to ack anything, but otherwise I'd ack them.

Thanks!



Re: [RFC PATCH 00/30] Code tagging framework and applications

2022-09-01 Thread Roman Gushchin
On Thu, Sep 01, 2022 at 08:17:47PM -0400, Kent Overstreet wrote:
> On Thu, Sep 01, 2022 at 03:53:57PM -0700, Roman Gushchin wrote:
> > I'd suggest running something like iperf on fast hardware. And maybe
> > some io_uring stuff too. These are two places which were historically
> > most sensitive to the (kernel) memory accounting speed.
> 
> I'm getting wildly inconsistent results with iperf.
> 
> io_uring-echo-server and rust_echo_bench gets me:
> Benchmarking: 127.0.0.1:12345
> 50 clients, running 512 bytes, 60 sec.
> 
> Without alloc tagging:120547 request/sec
> With: 116748 request/sec
> 
> https://github.com/frevib/io_uring-echo-server
> https://github.com/haraldh/rust_echo_bench
> 
> How's that look to you? Close enough? :)

Yes, this looks good (a bit too good).
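
For reference, the quoted rust_echo_bench numbers imply roughly a 3% throughput cost; a quick check (the figures are taken from the message above):

```c
#include <assert.h>

/* Throughput overhead implied by the benchmark numbers quoted above. */
static double overhead_pct(double baseline, double patched)
{
    return (baseline - patched) / baseline * 100.0;
}
/* overhead_pct(120547.0, 116748.0) is about 3.15 */
```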

I'm not that familiar with io_uring; Jens and Pavel should have a better
idea of what and how to run (I know they've worked around the kernel
memory accounting because of its performance in the past, which is why I
suspect it might be an issue here as well).

This is a recent optimization on the networking side:
https://lore.kernel.org/linux-mm/20220825000506.239406-1-shake...@google.com/

Maybe you can try to repeat this experiment.

Thanks!



Re: [RFC PATCH 11/30] mm: introduce slabobj_ext to support slab object extensions

2022-09-01 Thread Roman Gushchin
On Tue, Aug 30, 2022 at 02:49:00PM -0700, Suren Baghdasaryan wrote:
> Currently slab pages can store only vectors of obj_cgroup pointers in
> page->memcg_data. Introduce slabobj_ext structure to allow more data
> to be stored for each slab object. Wraps obj_cgroup into slabobj_ext
> to support current functionality while allowing to extend slabobj_ext
> in the future.
> 
> Note: ideally the config dependency should be turned the other way around:
> MEMCG should depend on SLAB_OBJ_EXT and {page|slab|folio}.memcg_data would
> be renamed to something like {page|slab|folio}.objext_data. However doing
> this in RFC would introduce considerable churn unrelated to the overall
> idea, so avoiding this until v1.

Hi Suren!

I'd say CONFIG_MEMCG_KMEM and CONFIG_YOUR_NEW_STUFF should both depend
on SLAB_OBJ_EXT.
CONFIG_MEMCG_KMEM depends on CONFIG_MEMCG anyway.
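
Sketched as a Kconfig fragment, the suggested dependency structure would look something like this (CONFIG_YOUR_NEW_STUFF is a placeholder for the new tagging option):

```
config SLAB_OBJ_EXT
	bool

config MEMCG_KMEM
	bool
	depends on MEMCG && !SLOB
	select SLAB_OBJ_EXT
	default y

config YOUR_NEW_STUFF
	bool "Allocation tagging"
	select SLAB_OBJ_EXT
```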

> 
> Signed-off-by: Suren Baghdasaryan 
> ---
>  include/linux/memcontrol.h |  18 --
>  init/Kconfig   |   5 ++
>  mm/kfence/core.c   |   2 +-
>  mm/memcontrol.c|  60 ++-
>  mm/page_owner.c|   2 +-
>  mm/slab.h  | 119 +
>  6 files changed, 131 insertions(+), 75 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 6257867fbf95..315399f77173 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -227,6 +227,14 @@ struct obj_cgroup {
>   };
>  };
>  
> +/*
> + * Extended information for slab objects stored as an array in 
> page->memcg_data
> + * if MEMCG_DATA_OBJEXTS is set.
> + */
> +struct slabobj_ext {
> + struct obj_cgroup *objcg;
> +} __aligned(8);

Why do we need this alignment requirement?

> +
>  /*
>   * The memory controller data structure. The memory controller controls both
>   * page cache and RSS per cgroup. We would eventually like to provide
> @@ -363,7 +371,7 @@ extern struct mem_cgroup *root_mem_cgroup;
>  
>  enum page_memcg_data_flags {
>   /* page->memcg_data is a pointer to an objcgs vector */
> - MEMCG_DATA_OBJCGS = (1UL << 0),
> + MEMCG_DATA_OBJEXTS = (1UL << 0),
>   /* page has been accounted as a non-slab kernel page */
>   MEMCG_DATA_KMEM = (1UL << 1),
>   /* the next bit after the last actual flag */
> @@ -401,7 +409,7 @@ static inline struct mem_cgroup *__folio_memcg(struct 
> folio *folio)
>   unsigned long memcg_data = folio->memcg_data;
>  
>   VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
> - VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio);
> + VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
>   VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_KMEM, folio);
>  
>   return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
> @@ -422,7 +430,7 @@ static inline struct obj_cgroup *__folio_objcg(struct 
> folio *folio)
>   unsigned long memcg_data = folio->memcg_data;
>  
>   VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
> - VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio);
> + VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
>   VM_BUG_ON_FOLIO(!(memcg_data & MEMCG_DATA_KMEM), folio);
>  
>   return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
> @@ -517,7 +525,7 @@ static inline struct mem_cgroup *page_memcg_check(struct 
> page *page)
>*/
>   unsigned long memcg_data = READ_ONCE(page->memcg_data);
>  
> - if (memcg_data & MEMCG_DATA_OBJCGS)
> + if (memcg_data & MEMCG_DATA_OBJEXTS)
>   return NULL;
>  
>   if (memcg_data & MEMCG_DATA_KMEM) {
> @@ -556,7 +564,7 @@ static inline struct mem_cgroup 
> *get_mem_cgroup_from_objcg(struct obj_cgroup *ob
>  static inline bool folio_memcg_kmem(struct folio *folio)
>  {
>   VM_BUG_ON_PGFLAGS(PageTail(&folio->page), &folio->page);
> - VM_BUG_ON_FOLIO(folio->memcg_data & MEMCG_DATA_OBJCGS, folio);
> + VM_BUG_ON_FOLIO(folio->memcg_data & MEMCG_DATA_OBJEXTS, folio);
>   return folio->memcg_data & MEMCG_DATA_KMEM;
>  }
>  
> diff --git a/init/Kconfig b/init/Kconfig
> index 532362fcfe31..82396d7a2717 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -958,6 +958,10 @@ config MEMCG
>   help
> Provides control over the memory footprint of tasks in a cgroup.
>  
> +config SLAB_OBJ_EXT
> + bool
> + depends on MEMCG
> +
>  config MEMCG_SWAP
>   bool
>   depends on MEMCG && SWAP
> @@ -966,6 +970,7 @@ config MEMCG_SWAP
>  config MEMCG_KMEM
>   bool
>   depends on MEMCG && !SLOB
> + select SLAB_OBJ_EXT
>   default y
>  
>  config BLK_CGROUP
> diff --git a/mm/kfence/core.c b/mm/kfence/core.c
> index c252081b11df..c0958e4a32e2 100644
> --- a/mm/kfence/core.c
> +++ b/mm/kfence/core.c
> @@ -569,7 +569,7 @@ static unsigned long kfence_init_pool(void)
>   __folio_set_slab(slab_folio(slab));
>  #ifdef CONFIG_MEMCG
>   slab->memcg_data = (unsigned long)&kfence_metadata[i / 2 - 
> 1].objcg 

Re: [RFC PATCH 00/30] Code tagging framework and applications

2022-09-01 Thread Roman Gushchin
On Thu, Sep 01, 2022 at 06:37:20PM -0400, Kent Overstreet wrote:
> On Thu, Sep 01, 2022 at 03:27:27PM -0700, Roman Gushchin wrote:
> > On Wed, Aug 31, 2022 at 01:56:08PM -0700, Yosry Ahmed wrote:
> > > This is very interesting work! Do you have any data about the overhead
> > > this introduces, especially in a production environment? I am
> > > especially interested in memory allocations tracking and detecting
> > > leaks.
> > 
> > +1
> > 
> > I think the question whether it indeed can be always turned on in the 
> > production
> > or not is the main one. If not, the advantage over ftrace/bpf/... is not 
> > that
> > obvious. Otherwise it will be indeed a VERY useful thing.
> 
> Low enough overhead to run in production was my primary design goal.
> 
> Stats are kept in a struct that's defined at the callsite. So this adds
> _no_ pointer chasing to the allocation path, unless we've switched to
> percpu counters at that callsite (see the lazy percpu counters patch),
> where we need to deref one percpu pointer to save an atomic.
>
> Then we need to stash a pointer to the alloc_tag, so that kfree() can
> find it. For slab allocations this uses the same storage area as memcg,
> so for allocations that are using that we won't be touching any
> additional cachelines. (I wanted the pointer to the alloc_tag to be
> stored inline with the allocation, but that would've caused alignment
> difficulties.)
>
> Then there's a pointer deref introduced to the kfree() path, to get back
> to the original alloc_tag and subtract the allocation from that
> callsite. That one won't be free, and with percpu counters we've got
> another dependent load too - hmm, it might be worth benchmarking with
> just atomics, skipping the percpu counters.
>
> So the overhead won't be zero; I expect it'll show up in some synthetic
> benchmarks, but yes, I do definitely expect this to be worth enabling in
> production in many scenarios.

I'm somewhat sceptical, but I usually am. And in this case I'll be really happy
to be wrong.

On the bright side, maybe most of the overhead will come from a few
allocations, so an option to explicitly exclude them will do the trick.

I'd suggest running something like iperf on fast hardware. And maybe some
io_uring stuff too. These are two places which were historically most
sensitive to the (kernel) memory accounting speed.

Thanks!
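
The scheme Kent describes in the quoted message (stash a pointer to the callsite tag with each allocation so the free path can find it) can be sketched in user space like this; tagged_alloc/tagged_free and the inline header are illustrative only, since the kernel version stores the tag pointer in the slab's memcg storage area rather than inline:

```c
#include <assert.h>
#include <stdlib.h>

struct alloc_tag {
    unsigned long bytes_in_use; /* per-callsite outstanding bytes */
};

struct hdr {
    struct alloc_tag *tag;      /* back-pointer so free can account */
    size_t size;
};

static void *tagged_alloc(struct alloc_tag *tag, size_t size)
{
    struct hdr *h = malloc(sizeof(*h) + size);

    if (!h)
        return NULL;
    h->tag = tag;
    h->size = size;
    tag->bytes_in_use += size;  /* account at the call site */
    return h + 1;               /* hand out the payload */
}

static void tagged_free(void *p)
{
    struct hdr *h = (struct hdr *)p - 1;

    h->tag->bytes_in_use -= h->size; /* the extra deref on the free path */
    free(h);
}
```

The one pointer deref in tagged_free() is exactly the cost discussed above: the free path has to follow the stored tag pointer to subtract the allocation from its callsite.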



[linux-linus test] 172922: regressions - FAIL

2022-09-01 Thread osstest service owner
flight 172922 linux-linus real [real]
http://logs.test-lab.xenproject.org/osstest/logs/172922/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-amd64-libvirt   6 libvirt-build fail REGR. vs. 172133
 build-arm64-libvirt   6 libvirt-build fail REGR. vs. 172133
 build-i386-libvirt    6 libvirt-build fail REGR. vs. 172133
 build-armhf-libvirt   6 libvirt-build fail REGR. vs. 172133
 build-arm64-pvops     6 kernel-build  fail REGR. vs. 172133

Regressions which are regarded as allowable (not blocking):
 test-amd64-amd64-xl-rtds 20 guest-localmigrate/x10   fail REGR. vs. 172133
 test-armhf-armhf-xl-rtds18 guest-start/debian.repeat fail REGR. vs. 172133

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-libvirt  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt-qcow2  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt-raw  1 build-check(1)   blocked  n/a
 test-arm64-arm64-examine  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-pair  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-qcow2  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a
 test-amd64-amd64-libvirt-raw  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-xsm  1 build-check(1)   blocked  n/a
 test-arm64-arm64-libvirt-raw  1 build-check(1)   blocked  n/a
 test-arm64-arm64-libvirt-xsm  1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl   1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl-credit1   1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl-credit2   1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl-seattle   1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl-thunderx  1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl-vhd   1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl-xsm   1 build-check(1)   blocked  n/a
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stop fail like 172133
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 172133
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop fail like 172133
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop fail like 172133
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stop fail like 172133
 test-armhf-armhf-xl-arndale  15 migrate-support-check fail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-check fail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-check fail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-check fail  never pass
 test-armhf-armhf-xl  15 migrate-support-check fail   never pass
 test-armhf-armhf-xl  16 saverestore-support-check fail   never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-check fail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-check fail   never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-check fail   never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-check fail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-check fail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-check fail   never pass
 test-armhf-armhf-xl-vhd  14 migrate-support-check fail   never pass
 test-armhf-armhf-xl-vhd  15 saverestore-support-check fail   never pass

version targeted for testing:
 linux                42e66b1cc3a070671001f8a1e933a80818a192bf
baseline version:
 linux                b44f2fd87919b5ae6e1756d4c7ba2cbba22238e1

 Last test of basis   172133  2022-08-04 05:14:48 Z   28 days
 Failing since        172152  2022-08-05 04:01:26 Z   28 days   64 attempts
 Testing same since   172922  2022-09-01 22:42:52 Z    0 days    1 attempts


1629 people touched revisions under test,
not listing them all

jobs:
 build-amd64-xsm  pass
 build-arm64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-arm64  pass
 build-armhf  pass
 build-i386   pass

[PATCH v4 6/6] xen: introduce a Kconfig option to configure NUMA nodes number

2022-09-01 Thread Wei Chen
Currently the maximum number of NUMA nodes is a hardcoded value.
This provides little flexibility short of changing the code.

Introduce a new Kconfig option to change the maximum number of
NUMA nodes conveniently. Also, considering that not all
architectures support NUMA, this Kconfig option is only visible
on NUMA-enabled architectures. Architectures not supporting NUMA
still use 1 for MAX_NUMNODES.

As NODES_SHIFT is currently unused, we take this opportunity
to remove it.

Signed-off-by: Wei Chen 
Acked-by: Jan Beulich 
---
v3 -> v4:
1. Update the commit log to follow Jan's suggestion.
2. Add Acked-by.
v2 -> v3:
1. Fix indent.
2. Use 2-64 for node range.
v1 -> v2:
1. Add NODES_SHIFT remove message in commit log
2. Change NR_NUMA_NODES upper bound from 4095 to 255.
---
 xen/arch/Kconfig| 11 +++
 xen/arch/x86/include/asm/numa.h |  2 --
 xen/include/xen/numa.h  | 11 ++-
 3 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/xen/arch/Kconfig b/xen/arch/Kconfig
index f16eb0df43..7028f7b74f 100644
--- a/xen/arch/Kconfig
+++ b/xen/arch/Kconfig
@@ -17,3 +17,14 @@ config NR_CPUS
  For CPU cores which support Simultaneous Multi-Threading or similar
  technologies, this the number of logical threads which Xen will
  support.
+
+config NR_NUMA_NODES
+   int "Maximum number of NUMA nodes supported"
+   range 2 64
+   default "64"
+   depends on NUMA
+   help
+ Controls the build-time size of various arrays and bitmaps
+ associated with multiple-nodes management. It is the upper bound of
+ the number of NUMA nodes that the scheduler, memory allocation and
+ other NUMA-aware components can handle.
diff --git a/xen/arch/x86/include/asm/numa.h b/xen/arch/x86/include/asm/numa.h
index 2ca3475271..7866afa408 100644
--- a/xen/arch/x86/include/asm/numa.h
+++ b/xen/arch/x86/include/asm/numa.h
@@ -3,8 +3,6 @@
 
 #include 
 
-#define NODES_SHIFT 6
-
 typedef u8 nodeid_t;
 
 extern int srat_rev;
diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
index e593115ba2..e16a7c3764 100644
--- a/xen/include/xen/numa.h
+++ b/xen/include/xen/numa.h
@@ -3,14 +3,15 @@
 
 #include 
 
-#ifndef NODES_SHIFT
-#define NODES_SHIFT 0
-#endif
-
 #define NUMA_NO_NODE 0xFF
 #define NUMA_NO_DISTANCE 0xFF
 
-#define MAX_NUMNODES(1 << NODES_SHIFT)
+#ifdef CONFIG_NR_NUMA_NODES
+#define MAX_NUMNODES CONFIG_NR_NUMA_NODES
+#else
+#define MAX_NUMNODES 1
+#endif
+
 #define NR_NODE_MEMBLKS (MAX_NUMNODES * 2)
 
 #define vcpu_to_node(v) (cpu_to_node((v)->processor))
-- 
2.25.1




[PATCH v4 0/6] Device tree based NUMA support for Arm - Part#2

2022-09-01 Thread Wei Chen
(The Arm device tree based NUMA support patch set contains 35
patches. To make review easier, I have split them into 3 parts:
1. Preparation. I have re-sorted the patch series and moved
   independent patches to the head of the series - merged in [1].
2. Move generically usable code from x86 to common - this series.
3. Add new code to support Arm.

This series contains only the second part. As the whole NUMA
series has already been through one round of review in [2],
this series would be v3.)

Xen's memory allocation and scheduler modules are NUMA aware.
But only x86 has implemented the architecture APIs needed to
support NUMA. Arm has been providing a set of fake architecture
APIs to stay compatible with the NUMA-aware memory allocation
and scheduler.

Arm systems worked well as single-node NUMA systems with these
fake APIs, because there were no multi-node NUMA systems on
Arm. But in recent years, more and more Arm devices have come
with multi-node NUMA support.

So now we have a new problem. When Xen runs on these Arm
devices, it still treats them as single-node SMP systems. The
NUMA affinity capability of Xen's memory allocation and
scheduler becomes meaningless, because they rely on input data
that does not reflect the real NUMA layout.

Xen still assumes the access time to all memory is the same for
all CPUs. However, Xen may allocate memory to a VM from
different NUMA nodes with different access speeds. This
difference can be amplified by workloads inside the VM, causing
performance instability and timeouts.

So in this patch series, we implement a set of NUMA APIs that
use the device tree to describe the NUMA layout. We reuse most
of the x86 NUMA code to create and maintain the mapping between
memory and CPUs, and to create the distance matrix between any
two NUMA nodes. Except for ACPI and some x86-specific code, we
have moved the rest to common code. In the next stage, when we
implement ACPI-based NUMA for Arm64, we may move the ACPI NUMA
code to common too, but for now we keep it x86-only.

This patch series has been tested and boots well on one Arm64
NUMA machine and one HPE x86 NUMA machine.

[1] https://lists.xenproject.org/archives/html/xen-devel/2022-06/msg00499.html
[2] https://lists.xenproject.org/archives/html/xen-devel/2021-09/msg01903.html

---
v3 -> v4:
 1. Add init_as_disable as an arch_numa_disabled parameter in the
patch where it is used.
 2. Drop unnecessary "else" from arch_numa_setup, and fix its
   indentation.
 3. Restore compute_hash_shift's return value to int.
 4. Remove unnecessary parentheses for macros.
 5. Use unsigned int for proper variables.
 6. Fix some code-style.
 7. Move arch_get_ram_range function comment to header file.
 8. Use bool for found, and add a new "err" for the return
value of arch_get_ram_range.
 9. Use -ENODATA instead of -EINVAL for non-RAM type ranges.
10. Use bool as return value for functions that only return
0/1 or 0/-EINVAL.
11. Move mem_hotplug to a proper place in mm.h
12. Remove useless "size" in numa_scan_nodes.
13. Add CONFIG_HAS_NUMA_NODE_FWID to gate printing the mapping
between node id and architectural node id (fw node id).

v2 -> v3:
 1. Drop enumeration of numa status.
 2. Use helpers to get/update acpi_numa.
 3. Insert spaces among parameters of strncmp in numa_setup.
 4. Drop helpers to access mem_hotplug. Export mem_hotplug for all arch.
 5. Remove acpi.h from common/numa.c.
 6. Rename acpi_scan_nodes to numa_scan_nodes.
 7. Replace u8 by uint8_t for memnodemap.
 8. Use unsigned int for memnode_shift and adjust related functions
(compute_hash_shift, populate_memnodemap) to use correct types for
return values or parameters.
 9. Use nodeid_t for nodeid and node numbers.
10. Use __read_mostly and __ro_after_init for appropriate variables.
11. Adjust the __read_mostly and __initdata location for some variables.
12. Convert from plain int to unsigned int for cpuid and other proper variables.
13. Remove unnecessary change items in history.
14. Rename arch_get_memory_map to arch_get_ram_range.
15. Use -ENOENT instead of -ENODEV to indicate end of memory map.
16. Add description to code comment that arch_get_ram_range returns
RAM range in [start, end) format.
17. Rename bad_srat to numa_fw_bad.
18. Rename node_to_pxm to numa_node_to_arch_nid.
19. Merge patch#7 and #8 into patch#6.
20. Move NR_NODE_MEMBLKS from x86/acpi.h to common/numa.h
22. Use 2-64 for node range.

v1 -> v2:
 1. Refine the commit messages of several patches.
 2. Merge v1 patch#9,10 into one patch. Introduce the new functions
in the same patch that this patch will be used first time.
 3. Fold if ( end > mem_hotplug ) to mem_hotplug_update_boundary,
in this case, we can drop mem_hotplug_boundary.
 4. Remove fw_numa, use enumeration to replace numa_off and acpi_numa.
 5. Correct return value of srat_disabled.
 6. Introduce numa_enabled_with_firmware.
 7. Refine the justification of using !node_data[nid].node_spanned_pages.
 8. Use ASSERT to replace VIRTUAL_BUG_O

[PATCH v4 5/6] xen/x86: move NUMA scan nodes codes from x86 to common

2022-09-01 Thread Wei Chen
x86 has implemented code to scan NUMA nodes. This code parses
NUMA memory and processor information from the ACPI SRAT table.
But apart from some ACPI-specific parts, most of the scan code,
such as memory block validation, node memory range updates and
some sanity checks, can be reused by other NUMA
implementations.

So in this patch, we move some variables and related functions
for NUMA memory and processors to common code as a library. At
the same time, numa_set_processor_nodes_parsed has been
introduced for ACPI-specific code to update processor parsing
results. With this helper, we can reuse most of the NUMA memory
affinity init code from ACPI. The bad_srat and node_to_pxm
functions are used in common code for architectural fallback
and node to architectural node info translation, but it doesn't
make sense to keep those names in common code, so we have
renamed them to neutral names as well.

PXM is an ACPI-specific concept, so we can't use it in common
code directly. As an alternative, we extend the parameters of
numa_update_node_memblks. The caller can pass the PXM as the
prefix for printed messages or as the architectural node id. We
also introduce CONFIG_HAS_NUMA_NODE_FWID to avoid printing the
mapping between node id and architectural node id (fw node id)
on architectures that do not have an architectural node id.
This way we do not need to retain a lot of per-arch code but
can still print architectural log messages for different NUMA
implementations.

mem_hotplug is also accessed by common code. Besides x86, other
architectures like Arm will also want to implement memory
hotplug in the future. Exporting mem_hotplug to common does no
harm for Arm, and also lets us drop some per-arch helpers for
accessing mem_hotplug.

As asm/acpi.h has been removed from common/numa.c, we have to
move NR_NODE_MEMBLKS from asm/acpi.h to xen/numa.h in this patch
as well.

Signed-off-by: Wei Chen 
---
v3 -> v4:
1. Use bool as return value for functions that only return
   0/1 or 0/-EINVAL.
2. Move mem_hotplug to a proper place in mm.h
3. Remove useless "size" in numa_scan_nodes.
4. Use unsigned int or const for proper variables.
5. Fix code-style.
6. Add init_as_disable as arch_numa_disabled parameter.
7. Add CONFIG_HAS_NUMA_NODE_FWID to gate printing the mapping
   between node id and architectural node id (fw node id).
v2 -> v3:
1. Add __ro_after_init to proper variables.
2. Rename bad_srat to numa_fw_bad.
3. Rename node_to_pxm to numa_node_to_arch_nid.
4. Merge patch#7 and #8 into this patch.
5. Correct int to unsigned int in proper places.
6. Move NR_NODE_MEMBLKS from x86/acpi.h to common/numa.h
7. Drop helpers to access mem_hotplug, we export mem_hotplug
   from x86/mm.c to common/page_alloc.c
v1 -> v2:
1. Add code comment for numa_update_node_memblks to explain:
   Assumes all memory regions belonging to a single node
   are in one chunk. Holes between them will be included
   in the node.
2. Merge this single patch instead of serval patches to move
   x86 SRAT code to common.
3. Export node_to_pxm to keep pxm information in NUMA scan
   nodes error messages.
4. Change the code style to target file's Xen code-style.
5. Adjust some __init and __initdata for some functions and
   variables.
6. Merge two patches into this patch:
   1. replace CONFIG_ACPI_NUMA by CONFIG_NUMA.
   2. replace "SRAT" texts.
7. Turn numa_scan_nodes to static.
---
 xen/arch/x86/include/asm/acpi.h |   1 -
 xen/arch/x86/include/asm/mm.h   |   2 -
 xen/arch/x86/include/asm/numa.h |   3 +-
 xen/arch/x86/mm.c   |   2 -
 xen/arch/x86/numa.c |   7 +-
 xen/arch/x86/srat.c | 311 +++
 xen/common/numa.c   | 321 +++-
 xen/common/page_alloc.c |   2 +
 xen/drivers/acpi/Kconfig|   1 +
 xen/include/xen/mm.h|   2 +
 xen/include/xen/numa.h  |  12 +-
 11 files changed, 363 insertions(+), 301 deletions(-)

diff --git a/xen/arch/x86/include/asm/acpi.h b/xen/arch/x86/include/asm/acpi.h
index 5c2dd5da2d..c453450a74 100644
--- a/xen/arch/x86/include/asm/acpi.h
+++ b/xen/arch/x86/include/asm/acpi.h
@@ -102,7 +102,6 @@ extern unsigned long acpi_wakeup_address;
 #define ARCH_HAS_POWER_INIT1
 
 extern s8 acpi_numa;
-#define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
 
 extern struct acpi_sleep_info acpi_sinfo;
 #define acpi_video_flags bootsym(video_flags)
diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
index 0fc826de46..95ff71a83a 100644
--- a/xen/arch/x86/include/asm/mm.h
+++ b/xen/arch/x86/include/asm/mm.h
@@ -474,8 +474,6 @@ static inline int get_page_and_type(struct page_info *page,
 ASSERT(((_p)->count_info & PGC_count_mask) != 0);  \
 ASSERT(page_get_owner(_p) == (_d))
 
-extern paddr_t mem_hotplug;
-
 /**
  * With shadow pagetables, the different kinds of address start
  * to get get confusing.
diff --git a/xen/arch/x86/include/

[PATCH v4 4/6] xen/x86: use arch_get_ram_range to get information from E820 map

2022-09-01 Thread Wei Chen
The sanity check of nodes_cover_memory is also a requirement of
other architectures that support NUMA. But currently the code
of nodes_cover_memory is tied to the x86 E820 map. We therefore
introduce arch_get_ram_range to decouple the
architecture-specific memory map from this function. This means
other architectures like Arm can also use it to check their
node and memory coverage against bootmem info.

Based on arch_get_ram_range, nodes_cover_memory becomes
architecture independent. We also use neutral words to replace
SRAT and E820 in this function's printed messages, to make the
messages more generic.

As arch_get_ram_range uses unsigned int for the index, we also
adjust the index in nodes_cover_memory from int to unsigned int.

Signed-off-by: Wei Chen 
---
v3 -> v4:
1. Move function comment to header file.
2. Use bool for found, and add a new "err" for the return
   value of arch_get_ram_range.
3. Use -ENODATA instead of -EINVAL for non-RAM type ranges.
v2 -> v3:
1. Rename arch_get_memory_map to arch_get_ram_range.
2. Use -ENOENT instead of -ENODEV to indicate end of memory map.
3. Add description to code comment that arch_get_ram_range returns
   RAM range in [start, end) format.
v1 -> v2:
1. Use arch_get_memory_map to replace arch_get_memory_bank_range
   and arch_get_memory_bank_number.
2. Remove the !start || !end check, because caller guarantee
   these two pointers will not be NULL.
---
 xen/arch/x86/numa.c| 15 +++
 xen/arch/x86/srat.c| 30 ++
 xen/include/xen/numa.h | 13 +
 3 files changed, 46 insertions(+), 12 deletions(-)

diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
index 90b2a22591..fa8caaa084 100644
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifndef Dprintk
 #define Dprintk(x...)
@@ -93,3 +94,17 @@ unsigned int __init arch_get_dma_bitsize(void)
  flsl(node_start_pfn(node) + node_spanned_pages(node) / 4 - 1)
  + PAGE_SHIFT, 32);
 }
+
+int __init arch_get_ram_range(unsigned int idx, paddr_t *start, paddr_t *end)
+{
+if ( idx >= e820.nr_map )
+return -ENOENT;
+
+if ( e820.map[idx].type != E820_RAM )
+return -ENODATA;
+
+*start = e820.map[idx].addr;
+*end = *start + e820.map[idx].size;
+
+return 0;
+}
diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index 4c7f0c547e..bd9694db24 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -428,37 +428,43 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
Make sure the PXMs cover all memory. */
 static int __init nodes_cover_memory(void)
 {
-   int i;
+   unsigned int i;
 
-   for (i = 0; i < e820.nr_map; i++) {
-   int j, found;
+   for (i = 0; ; i++) {
+   int err;
+   unsigned int j;
+   bool found;
paddr_t start, end;
 
-   if (e820.map[i].type != E820_RAM) {
-   continue;
-   }
+   /* Try to loop memory map from index 0 to end to get RAM ranges. */
+   err = arch_get_ram_range(i, &start, &end);
 
-   start = e820.map[i].addr;
-   end = e820.map[i].addr + e820.map[i].size;
+   /* Reach the end of arch's memory map */
+   if (err == -ENOENT)
+   break;
+
+   /* The entry at this index is not RAM, skip it. */
+   if (err)
+   continue;
 
do {
-   found = 0;
+   found = false;
for_each_node_mask(j, memory_nodes_parsed)
if (start < nodes[j].end
&& end > nodes[j].start) {
if (start >= nodes[j].start) {
start = nodes[j].end;
-   found = 1;
+   found = true;
}
if (end <= nodes[j].end) {
end = nodes[j].start;
-   found = 1;
+   found = true;
}
}
} while (found && start < end);
 
if (start < end) {
-   printk(KERN_ERR "SRAT: No PXM for e820 range: "
+   printk(KERN_ERR "NUMA: No NODE for RAM range: "
"[%"PRIpaddr", %"PRIpaddr"]\n", start, end - 1);
return 0;
}
diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
index af0abfc8cf..38be7db960 100644
--- a/xen/include/xen/numa.h
+++ b/xen/include/xen/numa.h
@@

[PATCH v4 2/6] xen/x86: move generically usable NUMA code from x86 to common

2022-09-01 Thread Wei Chen
There is some code in x86/numa.c that can be shared by other
architectures to implement NUMA support: variables and functions
to check and store the NUMA memory map, and variables and
functions to do NUMA initialization.

In this patch, we move them to common/numa.c and xen/numa.h
and use CONFIG_NUMA to gate them for architectures without
NUMA support. As the target header file is Xen-style, we also
trim some spaces and replace tabs in the code that has been
moved to xen/numa.h.

As acpi_scan_nodes is now used in a common function, it doesn't
make sense to keep an acpi_ prefix in common code, so we rename
it to numa_scan_nodes in this patch too. After that, it no longer
makes sense to gate numa_scan_nodes in numa_initmem_init with
CONFIG_ACPI_NUMA, since CONFIG_NUMA is selected by
CONFIG_ACPI_NUMA on x86. So we replace CONFIG_ACPI_NUMA with
CONFIG_NUMA to gate numa_scan_nodes.

As arch_numa_disabled has been implemented for ACPI NUMA,
we can rename srat_disabled to numa_disabled and move it
to common code as well.

Signed-off-by: Wei Chen 
---
v3 -> v4:
 1. Restore compute_hash_shift's return value to int.
 2. Remove unnecessary parentheses for macros.
 3. Use unsigned int for proper variables.
 4. Fix some code-style.
v2 -> v3:
 1. Remove acpi.h from common/numa.c.
 2. Rename acpi_scan_nodes to numa_scan_nodes.
 3. Replace u8 by uint8_t for memnodemap.
 4. Use unsigned int for memnode_shift and adjust related functions
(compute_hash_shift, populate_memnodemap) to use correct types for
return values or parameters.
 5. Use nodeid_t for nodeid and node numbers.
 6. Use __read_mostly and __ro_after_init for appropriate variables.
 7. Adjust the __read_mostly and __initdata location for some variables.
 8. convert from plain int to unsigned for cpuid and other proper variables.
 9. Use __attribute_pure__ instead of __attribute__((pure)).
10. Replace CONFIG_ACPI_NUMA by CONFIG_NUMA in numa_initmem_init.
11. Add const for some functions' parameters.
12. Move srat_disabled to common code with new name numa_disabled.
13. Fix some spaces code-style for numa_emulation.
14. Change from int to unsigned int for numa_fake.
v1 -> v2:
1. New patch in v2.
---
 xen/arch/x86/include/asm/acpi.h  |   1 -
 xen/arch/x86/include/asm/numa.h  |  57 +---
 xen/arch/x86/include/asm/setup.h |   1 -
 xen/arch/x86/numa.c  | 430 +-
 xen/arch/x86/smpboot.c   |   2 +-
 xen/arch/x86/srat.c  |   8 +-
 xen/common/Makefile  |   1 +
 xen/common/numa.c| 442 +++
 xen/include/xen/numa.h   |  67 +
 9 files changed, 517 insertions(+), 492 deletions(-)
 create mode 100644 xen/common/numa.c

diff --git a/xen/arch/x86/include/asm/acpi.h b/xen/arch/x86/include/asm/acpi.h
index 9a9cc4c240..5c2dd5da2d 100644
--- a/xen/arch/x86/include/asm/acpi.h
+++ b/xen/arch/x86/include/asm/acpi.h
@@ -102,7 +102,6 @@ extern unsigned long acpi_wakeup_address;
 #define ARCH_HAS_POWER_INIT1
 
 extern s8 acpi_numa;
-extern int acpi_scan_nodes(u64 start, u64 end);
 #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
 
 extern struct acpi_sleep_info acpi_sinfo;
diff --git a/xen/arch/x86/include/asm/numa.h b/xen/arch/x86/include/asm/numa.h
index 237f2c6dbf..6c87942d43 100644
--- a/xen/arch/x86/include/asm/numa.h
+++ b/xen/arch/x86/include/asm/numa.h
@@ -9,72 +9,17 @@ typedef u8 nodeid_t;
 
 extern int srat_rev;
 
-extern nodeid_t  cpu_to_node[NR_CPUS];
-extern cpumask_t node_to_cpumask[];
-
-#define cpu_to_node(cpu)   (cpu_to_node[cpu])
-#define parent_node(node)  (node)
-#define node_to_first_cpu(node)  (__ffs(node_to_cpumask[node]))
-#define node_to_cpumask(node)(node_to_cpumask[node])
-
-struct node { 
-   paddr_t start, end;
-};
-
-extern int compute_hash_shift(struct node *nodes, int numnodes,
- nodeid_t *nodeids);
 extern nodeid_t pxm_to_node(unsigned int pxm);
 
 #define ZONE_ALIGN (1UL << (MAX_ORDER+PAGE_SHIFT))
-#define VIRTUAL_BUG_ON(x) 
 
-extern void numa_add_cpu(int cpu);
-extern void numa_init_array(void);
-extern bool numa_off;
-
-extern int arch_numa_setup(const char *opt);
-extern bool arch_numa_disabled(void);
-extern bool srat_disabled(void);
-extern void numa_set_node(int cpu, nodeid_t node);
+extern bool numa_disabled(void);
 extern nodeid_t setup_node(unsigned int pxm);
 extern void srat_detect_node(int cpu);
 
-extern void setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end);
 extern nodeid_t apicid_to_node[];
 extern void init_cpu_to_node(void);
 
-static inline void clear_node_cpumask(int cpu)
-{
-   cpumask_clear_cpu(cpu, &node_to_cpumask[cpu_to_node(cpu)]);
-}
-
-/* Simple perfect hash to map pdx to node numbers */
-extern int memnode_shift; 
-extern unsigned long memnodemapsize;
-extern u8 *memnodemap;
-
-struct node_data {
-unsigned long node_start_pfn;
-unsigned long node_spanned_pages;
-}

[PATCH v4 3/6] xen/x86: Use ASSERT instead of VIRTUAL_BUG_ON for phys_to_nid

2022-09-01 Thread Wei Chen
VIRTUAL_BUG_ON is an empty macro used in phys_to_nid. As a
result, the two lines of error-checking code in phys_to_nid do
nothing, and enabling them would cause two compilation errors:
1. error: "MAX_NUMNODES" undeclared (first use in this function).
   This is because in the common header file, "MAX_NUMNODES" is
   defined after the common header file includes the ARCH header
   file, where phys_to_nid had already tried to use "MAX_NUMNODES".
   This error was resolved once we moved phys_to_nid from the
   x86 ARCH header file to the common header file.
2. error: wrong type argument to unary exclamation mark.
   This is because the error-checking code contains !node_data[nid],
   but node_data is a structure variable, not a pointer.

So, in this patch, we use ASSERT instead of VIRTUAL_BUG_ON to
enable the two lines of error-checking code, and fix the remaining
compilation error by replacing !node_data[nid] with
!node_data[nid].node_spanned_pages. Although NUMA allows a node
to have only CPUs and no memory, and a node with 0 bytes of memory
might theoretically have an entry in memnodemap[], that doesn't
mean phys_to_nid can find any valid address in a node with 0 bytes
of memory.

Signed-off-by: Wei Chen 
Tested-by: Jiamei Xie 
Acked-by: Jan Beulich 
---
v3 -> v4:
no change.
v2 -> v3:
1. Remove unnecessary change items in history.
2. Add Acked-by.
v1 -> v2:
1. Use ASSERT to replace VIRTUAL_BUG_ON in phys_to_nid.
2. Adjust the conditional express for ASSERT.
3. Refine the justification of using !node_data[nid].node_spanned_pages.
---
 xen/include/xen/numa.h | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
index baef4cd553..af0abfc8cf 100644
--- a/xen/include/xen/numa.h
+++ b/xen/include/xen/numa.h
@@ -36,8 +36,6 @@ struct node {
 extern int compute_hash_shift(const struct node *nodes,
   nodeid_t numnodes, nodeid_t *nodeids);
 
-#define VIRTUAL_BUG_ON(x)
-
 extern bool numa_off;
 
 extern void numa_add_cpu(unsigned int cpu);
@@ -70,9 +68,9 @@ extern struct node_data node_data[];
 static inline nodeid_t __attribute_pure__ phys_to_nid(paddr_t addr)
 {
 nodeid_t nid;
-VIRTUAL_BUG_ON((paddr_to_pdx(addr) >> memnode_shift) >= memnodemapsize);
+ASSERT((paddr_to_pdx(addr) >> memnode_shift) < memnodemapsize);
 nid = memnodemap[paddr_to_pdx(addr) >> memnode_shift];
-VIRTUAL_BUG_ON(nid >= MAX_NUMNODES || !node_data[nid]);
+ASSERT(nid < MAX_NUMNODES && node_data[nid].node_spanned_pages);
 return nid;
 }
 
-- 
2.25.1




[PATCH v4 1/6] xen/x86: Provide helpers for common code to access acpi_numa

2022-09-01 Thread Wei Chen
acpi_numa is a NUMA switch specific to the ACPI NUMA implementation.
Other NUMA implementations may not need this switch. But this switch is
not only used by ACPI code; it is also used directly in some generic
NUMA logic. So far this hasn't caused any problem because Xen only
has x86 implementing ACPI NUMA, but now Arm is implementing device tree
based NUMA. Accessing acpi_numa directly in some functions would block
reusing the common NUMA code. It is also difficult for us to replace
it with a new generic switch, because it is hard to prove that the new
switch states can guarantee the original code will work correctly.

So in this patch, we provide two helpers for common code to update and
get the state of acpi_numa. Other new NUMA implementations just need
to provide the same helpers for common code. This way, the generic
NUMA logic can be reused by all NUMA implementations.

Signed-off-by: Wei Chen 
---
v3 -> v4:
1. Drop parameter from arch_numa_disabled, the parameter will be
   introduced in later patch where use it.
2. Drop unnecessary "else" from arch_numa_setup, and fix its
   indentation.
v2 -> v3:
1. Drop enumeration of numa status.
2. Use helpers to get/update acpi_numa.
3. Insert spaces among parameters of strncmp in numa_setup.
v1 -> v2:
1. Remove fw_numa.
2. Use enumeration to replace numa_off and acpi_numa.
3. Correct return value of srat_disabled.
4. Introduce numa_enabled_with_firmware.
---
 xen/arch/x86/include/asm/numa.h |  5 +++--
 xen/arch/x86/numa.c | 38 ++---
 2 files changed, 28 insertions(+), 15 deletions(-)

diff --git a/xen/arch/x86/include/asm/numa.h b/xen/arch/x86/include/asm/numa.h
index c32ccffde3..237f2c6dbf 100644
--- a/xen/arch/x86/include/asm/numa.h
+++ b/xen/arch/x86/include/asm/numa.h
@@ -32,8 +32,9 @@ extern void numa_add_cpu(int cpu);
 extern void numa_init_array(void);
 extern bool numa_off;
 
-
-extern int srat_disabled(void);
+extern int arch_numa_setup(const char *opt);
+extern bool arch_numa_disabled(void);
+extern bool srat_disabled(void);
 extern void numa_set_node(int cpu, nodeid_t node);
 extern nodeid_t setup_node(unsigned int pxm);
 extern void srat_detect_node(int cpu);
diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
index 627ae8aa95..a5bc7a78c9 100644
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -50,9 +50,28 @@ nodemask_t __read_mostly node_online_map = { { [0] = 1UL } };
 bool numa_off;
 s8 acpi_numa = 0;
 
-int srat_disabled(void)
+int __init arch_numa_setup(const char *opt)
 {
-return numa_off || acpi_numa < 0;
+#ifdef CONFIG_ACPI_NUMA
+if ( !strncmp(opt, "noacpi", 6) )
+{
+numa_off = false;
+acpi_numa = -1;
+return 0;
+}
+#endif
+
+return -EINVAL;
+}
+
+bool arch_numa_disabled(void)
+{
+return acpi_numa < 0;
+}
+
+bool srat_disabled(void)
+{
+return numa_off || arch_numa_disabled();
 }
 
 /*
@@ -291,28 +310,21 @@ void numa_set_node(int cpu, nodeid_t node)
 /* [numa=off] */
 static int __init cf_check numa_setup(const char *opt)
 {
-if ( !strncmp(opt,"off",3) )
+if ( !strncmp(opt, "off", 3) )
 numa_off = true;
-else if ( !strncmp(opt,"on",2) )
+else if ( !strncmp(opt, "on", 2) )
 numa_off = false;
 #ifdef CONFIG_NUMA_EMU
 else if ( !strncmp(opt, "fake=", 5) )
 {
 numa_off = false;
-numa_fake = simple_strtoul(opt+5,NULL,0);
+numa_fake = simple_strtoul(opt + 5, NULL, 0);
 if ( numa_fake >= MAX_NUMNODES )
 numa_fake = MAX_NUMNODES;
 }
-#endif
-#ifdef CONFIG_ACPI_NUMA
-else if ( !strncmp(opt,"noacpi",6) )
-{
-numa_off = false;
-acpi_numa = -1;
-}
 #endif
 else
-return -EINVAL;
+return arch_numa_setup(opt);
 
 return 0;
 } 
-- 
2.25.1




RE: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and heap allocator

2022-09-01 Thread Henry Wang
Hi Julien,

> -Original Message-
> From: Julien Grall 
> >> This code is now becoming quite confusing to understanding. This loop is
> >> meant to map the xenheap. If I follow your documentation, it would
> mean
> >> that only the reserved region should be mapped.
> >
> > Yes I think this is the same question that I raised in the scissors line of 
> > the
> > commit message of this patch.
> 
> Sorry I didn't notice the comment after the scissors line. This is the
> same question :)
> 
> > What I intend to do is still mapping the whole
> > RAM because of the xenheap_* variables that you mentioned in...
> >
> >>
> >> More confusingly, xenheap_* variables will cover the full RAM.
> >
> > ...here. But only adding the reserved region to the boot allocator so the
> > reserved region can become the heap later on. I am wondering if we
> > have a more clear way to do that, any suggestions?
> 
> I think your code is correct. It only needs some renaming of the
> existing variable (maybe to directmap_*?) to make clear the area is used
> to access the RAM easily.

Thanks for the clarification. I checked the code and found the xenheap_*
variables are:
xenheap_virt_start, xenheap_virt_end, xenheap_mfn_start,
xenheap_mfn_end, xenheap_base_pdx.

For clarification, do we need to change all of them to directmap_*?

A pure renaming should be easy (and I guess also safe), but maybe I am
overthinking, because arm32 also uses xenheap_virt_end, xenheap_mfn_start
and xenheap_mfn_end. These variables refer to the real xenheap there, so
I am not sure whether renaming them would reduce readability for arm32.

Kind regards,
Henry

> 
> Cheers,
> 
> --
> Julien Grall


RE: [PATCH v6 1/9] xen/arm: introduce static shared memory

2022-09-01 Thread Penny Zheng
Hi Julien

> -Original Message-
> From: Julien Grall 
> Sent: Thursday, September 1, 2022 11:40 PM
> To: Penny Zheng ; xen-devel@lists.xenproject.org
> Cc: Stefano Stabellini ; Bertrand Marquis
> ; Volodymyr Babchuk
> 
> Subject: Re: [PATCH v6 1/9] xen/arm: introduce static shared memory
> 
> Hi Penny,
> 
> On 29/08/2022 07:57, Penny Zheng wrote:
> >> -Original Message-
> >> From: Julien Grall 
> >> Sent: Friday, August 26, 2022 9:17 PM
> >> To: Penny Zheng ; xen-
> de...@lists.xenproject.org
> >> Cc: Stefano Stabellini ; Bertrand Marquis
> >> ; Volodymyr Babchuk
> >> 
> >> Subject: Re: [PATCH v6 1/9] xen/arm: introduce static shared memory
> >>
> >> Hi Penny,
> >>
> >
> > Hi Julien
> >
> >> On 21/07/2022 14:21, Penny Zheng wrote:
> >>> From: Penny Zheng 
> >>>
> >>> This patch series introduces a new feature: setting up static shared
> >>> memory on a dom0less system, through device tree configuration.
> >>>
> >>> This commit parses shared memory node at boot-time, and reserve it
> >>> in bootinfo.reserved_mem to avoid other use.
> >>>
> >>> This commits proposes a new Kconfig CONFIG_STATIC_SHM to wrap
> >>> static-shm-related codes, and this option depends on static memory(
> >>> CONFIG_STATIC_MEMORY). That's because that later we want to reuse a
> >>> few helpers, guarded with CONFIG_STATIC_MEMORY, like
> >>> acquire_staticmem_pages, etc, on static shared memory.
> >>>
> >>> Signed-off-by: Penny Zheng 
> >>> ---
> >>> v6 change:
> >>> - when host physical address is omitted, output the error message
> >>> since xen doesn't support it at the moment
> >>> - add the following check: 1) The shm ID matches and the region
> >>> exactly match
> >>> 2) The shm ID doesn't match and the region doesn't overlap
> >>> - change it to "unsigned int" to be aligned with nr_banks
> >>> - check the len of the property to confirm is it big enough to
> >>> contain "paddr", "size", and "gaddr"
> >>> - shm_id defined before nr_shm_domain, so we could re-use the
> >>> existing hole and avoid increasing the size of the structure.
> >>> - change "nr_shm_domain" to "nr_shm_borrowers", to not increment if
> >>> the role is owner in parsing code
> >>> - make "xen,shm_id" property as arbitrary string, with a strict
> >>> limit on the number of characters, MAX_SHM_ID_LENGTH
> >>> ---
> >>> v5 change:
> >>> - no change
> >>> ---
> >>> v4 change:
> >>> - nit fix on doc
> >>> ---
> >>> v3 change:
> >>> - make nr_shm_domain unsigned int
> >>> ---
> >>> v2 change:
> >>> - document refinement
> >>> - remove bitmap and use the iteration to check
> >>> - add a new field nr_shm_domain to keep the number of shared domain
> >>> ---
> >>>docs/misc/arm/device-tree/booting.txt | 124 
> >>>xen/arch/arm/Kconfig  |   6 +
> >>>xen/arch/arm/bootfdt.c| 157 ++
> >>>xen/arch/arm/include/asm/setup.h  |   9 ++
> >>>4 files changed, 296 insertions(+)
> >>>
> >>> diff --git a/docs/misc/arm/device-tree/booting.txt
> >>> b/docs/misc/arm/device-tree/booting.txt
> >>> index 98253414b8..8013fb98fe 100644
> >>> --- a/docs/misc/arm/device-tree/booting.txt
> >>> +++ b/docs/misc/arm/device-tree/booting.txt
> >>> @@ -378,3 +378,127 @@ device-tree:
> >>>
> >>>This will reserve a 512MB region starting at the host physical address
> >>>0x3000 to be exclusively used by DomU1.
> >>> +
> >>> +Static Shared Memory
> >>> +
> >>> +
> >>> +The static shared memory device tree nodes allow users to
> >>> +statically set up shared memory on dom0less system, enabling
> >>> +domains to do shm-based communication.
> >>> +
> >>> +- compatible
> >>> +
> >>> +"xen,domain-shared-memory-v1"
> >>> +
> >>> +- xen,shm-id
> >>> +
> >>> +An arbitrary string that represents the unique identifier of the 
> >>> shared
> >>> +memory region, with a strict limit on the number of
> >>> + characters(\0
> >> included),
> >>> +`MAX_SHM_ID_LENGTH(16)`. e.g. "xen,shm-id = "my-shared-mem-
> 1"".
> >>> +
> >>> +- xen,shared-mem
> >>> +
> >>> +An array takes a physical address, which is the base address of the
> >>> +shared memory region in host physical address space, a size,
> >>> + and a
> >> guest
> >>> +physical address, as the target address of the mapping.
> >>> +e.g. xen,shared-mem = < [host physical address] [size] [guest
> >>> + address] >
> >>
> >> Your implementation below is checking for overlap and also have some
> >> restriction. Can they be documented in the binding?
> >>
> >>> +
> >>> +The number of cells for the host address (and size) is the same as 
> >>> the
> >>> +guest pseudo-physical address and they are inherited from the
> >>> + parent
> >> node.
> >>
> >> In v5, we discussed to have the host address optional. However, the
> >> binding has not been updated to reflect that. Note that I am not
> >> asking to implement, but instead request that the binding can be used for
> such setup.
> >>
> >
> > How

RE: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and heap allocator

2022-09-01 Thread Wei Chen


> -Original Message-
> From: Xen-devel  On Behalf Of Wei
> Chen
> Sent: 2022年9月2日 11:03
> To: Stefano Stabellini ; Julien Grall
> 
> Cc: Henry Wang ; xen-devel@lists.xenproject.org;
> Bertrand Marquis ; Volodymyr Babchuk
> 
> Subject: RE: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and
> heap allocator
> 
> Hi Julien and Stefano,
> 
> > -Original Message-
> > From: Stefano Stabellini 
> > Sent: 2022年9月2日 9:51
> > To: Julien Grall 
> > Cc: Stefano Stabellini ; Henry Wang
> > ; xen-devel@lists.xenproject.org; Bertrand Marquis
> > ; Wei Chen ; Volodymyr
> Babchuk
> > 
> > Subject: Re: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and
> > heap allocator
> >
> > On Thu, 1 Sep 2022, Julien Grall wrote:
> > > Hi Stefano,
> >
> 
> > In any case, I think we can postpone to after the release.

Maybe we can add some notes to say that this feature is still
experimental in EFI + DTS boot?

Cheers,
Wei Chen





RE: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and heap allocator

2022-09-01 Thread Wei Chen
Hi Julien and Stefano,

> -Original Message-
> From: Stefano Stabellini 
> Sent: 2022年9月2日 9:51
> To: Julien Grall 
> Cc: Stefano Stabellini ; Henry Wang
> ; xen-devel@lists.xenproject.org; Bertrand Marquis
> ; Wei Chen ; Volodymyr Babchuk
> 
> Subject: Re: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and
> heap allocator
> 
> On Thu, 1 Sep 2022, Julien Grall wrote:
> > Hi Stefano,
> >
> > On 01/09/2022 18:08, Stefano Stabellini wrote:
> > > > > Also, what happen with UEFI? Is it easy to guarantee the region
> will not
> > > > > be used?
> > > >
> > > > For now I think it is not easy to guarantee that, do you have some
> ideas
> > > > in mind? I think I can follow this in above follow-up series to
> improve
> > > > things.
> > >
> > > For clarity, are we worried that the region is used by the bootloader
> > > for other things? For instance U-Boot or Tianocore placing some
> > > firmware tables inside the range specified for xenheap?
> >
> > Yes. I think it would be difficult for an admin to figure out which
> regions
> > are used. Although they are likely (?) going to be static for a given
> > UEFI/U-boot build.
> >
> > My major concern is such bug can be very difficult to root cause because
> we
> > have no safety in Xen. The most likely symptom would be random
> corruption.
> 
> Thanks for the clarification. Yeah, I think we'll have to do some
> "creative thinking" to figure out a solution to this issue. I wonder if
> U-boot or Tianocore have some sort of API (or build-time data) to know
> the unavailable ranges.
> 

When Xen is booted through EFI, any memory statically defined in the
device tree has some probability of conflicting with memory occupied
by EFI. This is difficult to avoid without the EFI bootloader intervening,
but such a conflict can be detected.

EFI reserved memory regions (like runtime services) will not be
reported as usable memory to Xen. The usable memory regions will be added
to bootinfo.memblk, as in a device tree boot. So I think all statically
defined memory regions can be checked against bootinfo.memblk to find
conflicts. As for the EFI-relocated Xen image and DTB, Xen itself knows
these addresses and can easily check them.

But if we don't add code to U-Boot or EDK2, all we can do is check,
print an error and panic. Looking at the actual usage scenario, though,
systems using a static heap are normally highly customized, so their EFI
firmware will be built to bypass the static memory we defined in the
device tree. So maybe a conflict check is enough?

Cheers,
Wei Chen

> In any case, I think we can postpone to after the release.


[qemu-mainline test] 172916: regressions - FAIL

2022-09-01 Thread osstest service owner
flight 172916 qemu-mainline real [real]
http://logs.test-lab.xenproject.org/osstest/logs/172916/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-amd64-libvirt   6 libvirt-buildfail REGR. vs. 172123
 build-i386-libvirt6 libvirt-buildfail REGR. vs. 172123
 build-arm64-libvirt   6 libvirt-buildfail REGR. vs. 172123
 build-i386-xsm6 xen-buildfail REGR. vs. 172123
 build-armhf-libvirt   6 libvirt-buildfail REGR. vs. 172123

Tests which did not succeed, but are not blocking:
 test-amd64-i386-xl-xsm1 build-check(1)   blocked  n/a
 test-arm64-arm64-libvirt-raw  1 build-check(1)   blocked  n/a
 test-arm64-arm64-libvirt-xsm  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt-qcow2  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt-raw  1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt-xsm   1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt-raw   1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a
 test-amd64-i386-libvirt-pair  1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt   1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-pair  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a
 test-amd64-amd64-libvirt-vhd  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-xsm  1 build-check(1)   blocked  n/a
 test-amd64-i386-xl-qemuu-debianhvm-i386-xsm  1 build-check(1)  blocked n/a
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 172123
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 172123
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 172123
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 172123
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 172123
 test-amd64-i386-xl-pvshim14 guest-start  fail   never pass
 test-arm64-arm64-xl-seattle  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-checkfail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-checkfail never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  15 saverestore-support-checkfail   never pass

version targeted for testing:
 qemuue93ded1bf6c94ab95015b33e188bc8b0b0c32670
baseline version:
 qemuu2480f3bbd03814b0651a1f74959f5c6631ee58

[ovmf test] 172923: regressions - FAIL

2022-09-01 Thread osstest service owner
flight 172923 ovmf real [real]
http://logs.test-lab.xenproject.org/osstest/logs/172923/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-i386-libvirt6 libvirt-buildfail REGR. vs. 172136
 build-amd64-libvirt   6 libvirt-buildfail REGR. vs. 172136

version targeted for testing:
 ovmf 6edd257861fd8ef44d23a6c1afb329efd95a9ae7
baseline version:
 ovmf 444260d45ec2a84e8f8c192b3539a3cd5591d009

Last test of basis   172136  2022-08-04 06:43:42 Z   28 days
Failing since172151  2022-08-05 02:40:28 Z   27 days  222 attempts
Testing same since   172923  2022-09-01 23:42:02 Z0 days1 attempts


People who touched revisions under test:
  Abdul Lateef Attar 
  Abner Chang 
  Ard Biesheuvel 
  Bob Feng 
  Chasel Chiu 
  Chen, Xiao X 
  Czajkowski, Maciej 
  Dimitrije Pavlov 
  Dun Tan 
  Edward Pickup 
  Foster Nong 
  Gregx Yeh 
  Guo Dong 
  Igor Kulchytskyy 
  James Lu 
  Jeff Brasen 
  Jiaxin Wu 
  Jose Marinho 
  KasimX Liu 
  Kavya 
  Konstantin Aladyshev 
  Kun Qin 
  Liming Gao 
  Liu, Zhiguang 
  Maciej Czajkowski 
  Michael D Kinney 
  Michael Kubacki 
  Pierre Gondois 
  Ray Ni 
  Rebecca Cran 
  Rebecca Cran 
  Sainadh Nagolu 
  Sami Mujawar 
  Shengfengx Xue 
  Wu, Jiaxin 
  Xiao X Chen 
  Yuanhao Xie 
  Zhiguang Liu 

jobs:
 build-amd64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-i386   pass
 build-amd64-libvirt  fail
 build-i386-libvirt   fail
 build-amd64-pvopspass
 build-i386-pvops pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64 pass
 test-amd64-i386-xl-qemuu-ovmf-amd64  pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Not pushing.

(No revision log; it would be 1440 lines long.)



Re: [PATCH v3 6/7] xen/arm: introduce xen-evtchn dom0less property

2022-09-01 Thread Stefano Stabellini
On Thu, 1 Sep 2022, Julien Grall wrote:
> On 01/09/2022 10:13, Rahul Singh wrote:
> > Introduce a new sub-node under /chosen node to establish static event
> > channel communication between domains on dom0less systems.
> > 
> > An event channel will be created beforehand to allow the domains to
> > send notifications to each other.
> > 
> > Signed-off-by: Rahul Singh 
> > ---
> > Changes in v3:
> >   - use device-tree used_by to find the domain id of the evtchn node.
> >   - add new static_evtchn_create variable in struct dt_device_node to
> > hold the information if evtchn is already created.
> >   - fix minor comments
> > Changes in v2:
> >   - no change
> > ---
> >   docs/misc/arm/device-tree/booting.txt |  64 -
> >   xen/arch/arm/domain_build.c   | 128 ++
> >   xen/arch/arm/include/asm/setup.h  |   1 +
> >   xen/arch/arm/setup.c  |   2 +
> >   xen/include/xen/device_tree.h |  13 +++
> >   5 files changed, 207 insertions(+), 1 deletion(-)
> > 
> > diff --git a/docs/misc/arm/device-tree/booting.txt
> > b/docs/misc/arm/device-tree/booting.txt
> > index 98253414b8..edef98e6d1 100644
> > --- a/docs/misc/arm/device-tree/booting.txt
> > +++ b/docs/misc/arm/device-tree/booting.txt
> > @@ -212,7 +212,7 @@ with the following properties:
> >   enable only selected interfaces.
> > Under the "xen,domain" compatible node, one or more sub-nodes are
> > present
> > -for the DomU kernel and ramdisk.
> > +for the DomU kernel, ramdisk and static event channel.
> > The kernel sub-node has the following properties:
> >   @@ -254,11 +254,44 @@ The ramdisk sub-node has the following properties:
> >   property because it will be created by the UEFI stub on boot.
> >   This option is needed only when UEFI boot is used.
> >   +The static event channel sub-node has the following properties:
> > +
> > +- compatible
> > +
> > +"xen,evtchn"
> > +
> > +- xen,evtchn
> > +
> > +The property is tuples of two numbers
> > +(local-evtchn link-to-foreign-evtchn) where:
> > +
> > +local-evtchn is an integer value that will be used to allocate local
> > port
> > +for a domain to send and receive event notifications to/from the remote
> > +domain. Maximum supported value is 2^17 for FIFO ABI and 4096 for 2L
> > ABI.
> > +It is recommended to use low event channel IDs.
> > +
> > +link-to-foreign-evtchn is a single phandle to a remote evtchn to which
> > +local-evtchn will be connected.
> > Example
> >   ===
> > chosen {
> > +
> > +module@0 {
> > +compatible = "multiboot,kernel", "multiboot,module";
> > +xen,uefi-binary = "...";
> > +bootargs = "...";
> > +
> > +};
> 
> NIT: Describing this node in the example seems unnecessary.
> 
> > +
> > +/* one sub-node per local event channel */
> > +ec1: evtchn@1 {
> > + compatible = "xen,evtchn-v1";
> > + /* local-evtchn link-to-foreign-evtchn */
> > + xen,evtchn = <0xa &ec2>;
> > +};
> > +
> 
> Here you provide an example for dom0. But the position where you describe the
> binding suggests this binding only exists for domUs.
> 
> Either we duplicate the binding or we re-order the documentation to have
> common binding in a single place. My preference would be the latter.
> 
> >   domU1 {
> >   compatible = "xen,domain";
> >   #address-cells = <0x2>;
> > @@ -277,6 +310,23 @@ chosen {
> >   compatible = "multiboot,ramdisk", "multiboot,module";
> >   reg = <0x0 0x4b00 0xff>;
> >   };
> > +
> > +/* one sub-node per local event channel */
> > +ec2: evtchn@2 {
> > +compatible = "xen,evtchn-v1";
> > +/* local-evtchn link-to-foreign-evtchn */
> > +xen,evtchn = <0xa &ec1>;
> > +};
> > +
> > +ec3: evtchn@3 {
> > +compatible = "xen,evtchn-v1";
> > +xen,evtchn = <0xb &ec5>;
> > +};
> > +
> > +ec4: evtchn@4 {
> > +compatible = "xen,evtchn-v1";
> > +xen,evtchn = <0xc &ec6>;
> > +};
> >   };
> > domU2 {
> > @@ -296,6 +346,18 @@ chosen {
> >   compatible = "multiboot,ramdisk", "multiboot,module";
> >   reg = <0x0 0x4d00 0xff>;
> >   };
> > +
> > +/* one sub-node per local event channel */
> > +ec5: evtchn@5 {
> > +compatible = "xen,evtchn-v1";
> > +/* local-evtchn link-to-foreign-evtchn */
> > +xen,evtchn = <0xb &ec3>;
> > +};
> > +
> > +ec6: evtchn@6 {
> > +compatible = "xen,evtchn-v1";
> > +xen,evtchn = <0xd &ec4>;
> > +};
> >   };
> >   };
> >   diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
> > index 707e247f6a..4b24261825 100644
> > --- a/xen/arch/arm/domain_build.c
> > +++ b/xen/arch/arm/domain_build.c
> > @@ -33,6 +33,8 @@
> >   #

Re: [PATCH 0/7] Fix MISRA C 2012 Rule 20.7 violations

2022-09-01 Thread Stefano Stabellini
On Thu, 1 Sep 2022, Bertrand Marquis wrote:
> Hi Xenia,
> 
> > On 1 Sep 2022, at 10:27, Xenia Ragiadakou  wrote:
> > 
> > 
> > On 9/1/22 01:35, Stefano Stabellini wrote:
> >> Patches 1, 4, and 6 are already committed. I plan to commit patches 2, 3
> >> and 5 in the next couple of days.
> >> Patch 7 needs further discussions and it is best addressed during the
> >> next MISRA C sync-up.
> > 
> > I would like to share here, before the next MISRA C sync, my understanding, 
> > which will hopefully correct a wrong impression of mine that I may have 
> > spread around regarding this rule.
> > There was a misunderstanding regarding the rule 20.7 on my part and I 
> > think that Jan is absolutely right that parenthesizing macro parameters 
> > used as function arguments is not required by the rule.
> > 
> > The rule 20.7 states "Expressions resulting from the expansion of macro 
> > parameters shall be enclosed in parentheses", and the rationale of the 
> > rule states "If a macro parameter is not being used as an expression then 
> > the parentheses are not necessary because no operators are involved.".
> > 
> > Initially, based on the title, my understanding was that it requires for 
> > the expression resulting from the expansion of the macro to be enclosed in 
> > parentheses. Then, based on the rule explanation and the examples given,  
> > my understanding was that it requires the macro parameters that are used as 
> > expressions to be enclosed in parentheses.
> > But, after re-thinking about it, the most probable reading, and what makes more 
> > sense, is that it requires parentheses around the macro parameters that are 
> > part of an expression and not around those that are used as expressions.
> > 
> > Therefore, macro parameters being used as function arguments are not 
> > required to be enclosed in parentheses, because the function arguments are 
> > part of an expression list, not of an expression (comma is evaluated as 
> > separator, not as operator).
> > While, macro parameters used as rhs and lhs expressions of the assignment 
> > operator are required to be enclosed in parentheses because they are part 
> > of an assignment expression.
> > 
> > I verified that the violation reported by cppcheck is not due to missing 
> > parentheses around the function argument (though I still have not 
> > understood the origin of the warning). Also, Eclair does not report it.
> > 
> > Hence, it was a misunderstanding of mine and there is no inconsistency, 
> > with respect to this rule, in adding parentheses around macro parameters 
> > used as rhs of assignments. The rule does not require adding parentheses 
> > around macro parameters used as function arguments, and neither cppcheck nor 
> > Eclair reports a violation for missing parentheses around macro parameters 
> > used as function arguments.
> 
> 
> Thanks a lot for the detailed explanation :-)
> 
> What you say does make sense and I agree with your analysis here: only 
> protect a macro parameter when it is part of an expression, not when it is 
> used as an argument (to a function or another macro).

Yeah I also agree with your analysis, and many thanks for
double-checking the cppcheck and Eclair's reports.



Setting constant-time mode CPU flag

2022-09-01 Thread Demi Marie Obenour
On Intel chips (Ice Lake and later) and ARM64, a bit needs to be set in
a CPU register to enforce constant-time execution.  Linux plans to set
this bit by default; Xen should do the same.  See
https://lore.kernel.org/lkml/ywgcrqutxmx0w...@gmail.com/T/ for details.
I recommend setting the bit unconditionally and ignoring guest attempts
to change it.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab




Re: [PATCH v6 5/9] xen/arm: Add additional reference to owner domain when the owner is allocated

2022-09-01 Thread Stefano Stabellini
On Thu, 1 Sep 2022, Julien Grall wrote:
> Hi Penny,
> 
> On 21/07/2022 14:21, Penny Zheng wrote:
> > Borrower domain will fail to get a page ref using the owner domain
> > during allocation, when the owner is created after borrower.
> > 
> > So here, we decide to get and add the right amount of reference, which
> > is the number of borrowers, when the owner is allocated.
> > 
> > Signed-off-by: Penny Zheng 
> > Reviewed-by: Stefano Stabellini 
> 
> IMHO, this tag should not have been kept given...
> 
> > ---
> > v6 change:
> > - adapt to the change of "nr_shm_borrowers"
> 
> ... this change. 'reviewed-by' means that the person reviewed the code and
> therefore agrees with the logic. So I would only keep the tag if the change is
> trivial (including typo, coding style) and would drop it (or confirm with the
> person) otherwise.
> 
> Stefano, can you confirm you are happy that your reviewed-by tag is kept?

Yes I confirm

Reviewed-by: Stefano Stabellini 


> > - add in-code comment to explain if the borrower is created first, we intend
> > to
> > add pages in the P2M without reference.
> > ---
> > v5 change:
> > - no change
> > ---
> > v4 changes:
> > - no change
> > ---
> > v3 change:
> > - printk rather than dprintk since it is a serious error
> > ---
> > v2 change:
> > - new commit
> > ---
> >   xen/arch/arm/domain_build.c | 60 +
> >   1 file changed, 60 insertions(+)
> > 
> > diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
> > index a7e95c34a7..e891e800a7 100644
> > --- a/xen/arch/arm/domain_build.c
> > +++ b/xen/arch/arm/domain_build.c
> > @@ -761,6 +761,30 @@ static void __init assign_static_memory_11(struct
> > domain *d,
> >   }
> > #ifdef CONFIG_STATIC_SHM
> > +static int __init acquire_nr_borrower_domain(struct domain *d,
> > + paddr_t pbase, paddr_t psize,
> > + unsigned long *nr_borrowers)
> > +{
> > +unsigned long bank;
> 
> NIT: AFAICT nr_banks is an "unsigned int".
> 
> Other than that:
> 
> Acked-by: Julien Grall 
> 
> Cheers,
> 
> -- 
> Julien Grall
> 



Re: [RFC PATCH 27/30] Code tagging based latency tracking

2022-09-01 Thread Steven Rostedt
On Thu, 1 Sep 2022 21:35:32 -0400
Kent Overstreet  wrote:

> On Thu, Sep 01, 2022 at 08:23:11PM -0400, Steven Rostedt wrote:
> > If ftrace, perf, bpf can't do what you want, take a harder look to see if
> > you can modify them to do so.  
> 
> Maybe we can use this exchange to make both of our tools better. I like your
> histograms - the quantiles algorithm I've had for years is janky, I've been
> meaning to rip that out, I'd love to take a look at your code for that. And
> having an on/off switch is a good idea, I'll try to add that at some point.
> Maybe you got some ideas from my stuff too.
> 
> I'd love to get better tracepoints for measuring latency - what I added to
> init_wait() and finish_wait() was really only a starting point. Figuring out
> the right places to measure is where I'd like to be investing my time in this
> area, and there's no reason we couldn't both be making use of that.

Yes, this is exactly what I'm talking about. I'm not against your work, I
just want you to work more with everyone to come up with ideas that can
help everyone as a whole. That's how "open source communities" are supposed
to work ;-)

The histogram and synthetic events can use some more clean ups. There's a
lot of places that can be improved in that code. But I feel the ideas
behind that code are sound. It's just getting the implementation to be a bit
more efficient.

> 
> e.g. with kernel waitqueues, I looked at hooking prepare_to_wait() first but 
> not
> all code uses that, init_wait() got me better coverage. But I've already seen
> that that misses things, too, there's more work to be done.

I picked prepare_to_wait() just because I was hacking up something quick
and thought that was "close enough" ;-)

-- Steve




Re: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and heap allocator

2022-09-01 Thread Stefano Stabellini
On Thu, 1 Sep 2022, Julien Grall wrote:
> Hi Stefano,
> 
> On 01/09/2022 18:08, Stefano Stabellini wrote:
> > > > Also, what happen with UEFI? Is it easy to guarantee the region will not
> > > > be used?
> > > 
> > > For now I think it is not easy to guarantee that, do you have some ideas
> > > in mind? I think I can follow this in above follow-up series to improve
> > > things.
> > 
> > For clarity, are we worried that the region is used by the bootloader
> > for other things? For instance U-Boot or Tianocore placing some
> > firmware tables inside the range specified for xenheap?
> 
> Yes. I think it would be difficult for an admin to figure out which regions
> are used. Although they are likely (?) going to be static for a given
> UEFI/U-boot build.
> 
> My major concern is that such a bug can be very difficult to root cause because we
> have no safety in Xen. The most likely symptom would be random corruption.

Thanks for the clarification. Yeah, I think we'll have to do some
"creative thinking" to figure out a solution to this issue. I wonder if
U-boot or Tianocore have some sort of API (or build-time data) to know
the unavailable ranges.

In any case, I think we can postpone to after the release.



Re: [RFC PATCH 27/30] Code tagging based latency tracking

2022-09-01 Thread Kent Overstreet
On Thu, Sep 01, 2022 at 08:23:11PM -0400, Steven Rostedt wrote:
> If ftrace, perf, bpf can't do what you want, take a harder look to see if
> you can modify them to do so.

Maybe we can use this exchange to make both of our tools better. I like your
histograms - the quantiles algorithm I've had for years is janky, I've been
meaning to rip that out, I'd love to take a look at your code for that. And
having an on/off switch is a good idea, I'll try to add that at some point.
Maybe you got some ideas from my stuff too.

I'd love to get better tracepoints for measuring latency - what I added to
init_wait() and finish_wait() was really only a starting point. Figuring out
the right places to measure is where I'd like to be investing my time in this
area, and there's no reason we couldn't both be making use of that.

e.g. with kernel waitqueues, I looked at hooking prepare_to_wait() first but not
all code uses that, init_wait() got me better coverage. But I've already seen
that that misses things, too, there's more work to be done.

random thought: might try adding a warning in schedule() any time it's called
and codetag_time_stats_start() hasn't been called, that'll be a starting
point...



RE: [PATCH 1/2] docs, xen/arm: Introduce reserved heap memory

2022-09-01 Thread Henry Wang
Hi Julien,

> -Original Message-
> From: Julien Grall 
> Subject: Re: [PATCH 1/2] docs, xen/arm: Introduce reserved heap memory
> 
> Hi Henry,
> 
> > Also, this commit introduces the logic to parse the reserved heap
> > configuation in device tree by reusing the device tree entry definition
> 
> typo: s/configuation/configuration/

Oops, sorry for that...

> 
> >   docs/misc/arm/device-tree/booting.txt | 46
> +
> 
> I have skipped the documentation and looked at the code instead.

No problem, Stefano and Michal have already provided some comments
in the doc.

> 
> It sounds like to me, we want to have an enum rather than multiple
> boolean. This would also make easier...
> 
> >   {
> >   const struct fdt_property *prop;
> >   unsigned int i, banks;
> > @@ -96,6 +97,7 @@ static int __init device_tree_get_meminfo(const void
> *fdt, int node,
> >   mem->bank[mem->nr_banks].start = start;
> >   mem->bank[mem->nr_banks].size = size;
> >   mem->bank[mem->nr_banks].xen_domain = xen_domain;
> > +mem->bank[mem->nr_banks].xen_heap = xen_heap;
> >   mem->nr_banks++;
> >   }
> >
> > @@ -185,7 +187,7 @@ static int __init process_memory_node(const void
> *fdt, int node,
> > void *data)
> >   {
> >   return device_tree_get_meminfo(fdt, node, "reg", address_cells,
> size_cells,
> > -   data, false);
> > +   data, false, false);
> 
> ... to understand the two "false" here.

I agree, will change in v2.

> 
> >   }
> >
> >   static int __init process_reserved_memory_node(const void *fdt, int node,
> > @@ -293,7 +295,7 @@ static void __init process_multiboot_node(const
> void *fdt, int node,
> >kind, start, domU);
> >   }
> >
> > -static void __init process_chosen_node(const void *fdt, int node,
> > +static int __init process_chosen_node(const void *fdt, int node,
> >  const char *name,
> >  u32 address_cells, u32 size_cells) 
> > >   {
> > @@ -301,16 +303,40 @@ static void __init process_chosen_node(const
> void *fdt, int node,
> >   paddr_t start, end;
> >   int len;
> >
> > +if ( fdt_get_property(fdt, node, "xen,static-mem", NULL) )
> > +{
> > +u32 address_cells = device_tree_get_u32(fdt, node,
> > +
> > "#xen,static-mem-address-cells",
> > +0);
> > +u32 size_cells = device_tree_get_u32(fdt, node,
> > + "#xen,static-mem-size-cells", 
> > 0);
> > +int rc;
> > +
> > +printk("Checking for reserved heap in /chosen\n");
> > +if ( address_cells < 1 || size_cells < 1 )
> > +{
> > +printk("fdt: node `%s': invalid #xen,static-mem-address-cells 
> > or
> #xen,static-mem-size-cells\n",
> > +   name);
> > +return -EINVAL;
> > +}
> > +
> > +rc = device_tree_get_meminfo(fdt, node, "xen,static-mem",
> address_cells,
> > + size_cells, &bootinfo.reserved_mem, 
> > false,
> > + true);
> > +if ( rc )
> > +return rc;
> > +}
> > +
> >   printk("Checking for initrd in /chosen\n");
> >
> >   prop = fdt_get_property(fdt, node, "linux,initrd-start", &len);
> >   if ( !prop )
> >   /* No initrd present. */
> > -return;
> > +return 0;
> >   if ( len != sizeof(u32) && len != sizeof(u64) )
> >   {
> >   printk("linux,initrd-start property has invalid length %d\n", 
> > len);
> > -return;
> > +return -EINVAL;
> 
> This is technically a change in behavior for Xen (we would panic rather
> than continue). I am happy with the proposal. However, this doesn't seem
> to be explained in the commit message.
> 
> That said, I think this should be split in a separate patch along with
> the ones below (including the prototype changes).

According to Michal's comment, I've removed the return type and function
prototype change in my local v2. So this patch itself is fine. My question now
would be: do maintainers think this change of behavior when processing the
chosen node would be helpful? Do we prefer an instant panic or the current behavior?

I am more than happy to add a patch for changing the return/panic behavior
if everyone is happy with it.

> 
> >   }
> >   start = dt_read_number((void *)&prop->data, dt_size_to_cells(len));
> >
> > @@ -318,12 +344,12 @@ static void __init process_chosen_node(const
> void *fdt, int node,
> >   if ( !prop )
> >   {
> >   printk("linux,initrd-end not present but -start was\n");
> > -return;
> > +return -EINVAL;
> >   }
> >   if ( len != sizeof(u32) && len != sizeof(u64) )
> >   {

Re: [RFC PATCH 00/30] Code tagging framework and applications

2022-09-01 Thread Kent Overstreet
On Thu, Sep 01, 2022 at 06:04:46PM -0700, Roman Gushchin wrote:
> On Thu, Sep 01, 2022 at 08:17:47PM -0400, Kent Overstreet wrote:
> > On Thu, Sep 01, 2022 at 03:53:57PM -0700, Roman Gushchin wrote:
> > > I'd suggest to run something like iperf on a fast hardware. And maybe some
> > > io_uring stuff too. These are two places which were historically most 
> > > sensitive
> > > to the (kernel) memory accounting speed.
> > 
> > I'm getting wildly inconsistent results with iperf.
> > 
> > io_uring-echo-server and rust_echo_bench gets me:
> > Benchmarking: 127.0.0.1:12345
> > 50 clients, running 512 bytes, 60 sec.
> > 
> > Without alloc tagging:  120547 request/sec
> > With:   116748 request/sec
> > 
> > https://github.com/frevib/io_uring-echo-server
> > https://github.com/haraldh/rust_echo_bench
> > 
> > How's that look to you? Close enough? :)
> 
> Yes, this looks good (a bit too good).

Eh, I was hoping for better :)

> I'm not that familiar with io_uring, Jens and Pavel should have a better idea
> what and how to run (I know they've worked around the kernel memory accounting
> because of the performance in the past, this is why I suspect it might be an
> issue here as well).
> 
> This is a recent optimization on the networking side:
> https://lore.kernel.org/linux-mm/20220825000506.239406-1-shake...@google.com/
> 
> Maybe you can try to repeat this experiment.

I'd be more interested in a synthetic benchmark, if you know of any.



[linux-5.4 test] 172914: regressions - FAIL

2022-09-01 Thread osstest service owner
flight 172914 linux-5.4 real [real]
http://logs.test-lab.xenproject.org/osstest/logs/172914/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-i386-libvirt6 libvirt-buildfail REGR. vs. 172128
 build-amd64-libvirt   6 libvirt-buildfail REGR. vs. 172128
 build-arm64-libvirt   6 libvirt-buildfail REGR. vs. 172128
 build-armhf-libvirt   6 libvirt-buildfail REGR. vs. 172128

Tests which did not succeed, but are not blocking:
 test-arm64-arm64-libvirt-raw  1 build-check(1)   blocked  n/a
 test-arm64-arm64-libvirt-xsm  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-pair  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a
 test-amd64-amd64-libvirt-vhd  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-xsm  1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt   1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt-pair  1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a
 test-amd64-i386-libvirt-raw   1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt-xsm   1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt-qcow2  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt-raw  1 build-check(1)   blocked  n/a
 test-armhf-armhf-xl-credit1 18 guest-start/debian.repeat fail blocked in 172128
 test-armhf-armhf-xl-credit2  18 guest-start/debian.repeatfail  like 172108
 test-armhf-armhf-xl-multivcpu 18 guest-start/debian.repeatfail like 172108
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 172108
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 172128
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 172128
 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop fail like 172128
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 172128
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 172128
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 172128
 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop fail like 172128
 test-armhf-armhf-xl-rtds 18 guest-start/debian.repeatfail  like 172128
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 172128
 test-amd64-i386-xl-pvshim14 guest-start  fail   never pass
 test-arm64-arm64-xl-seattle  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-checkfail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-checkfail never pass
 test-armhf-armhf-xl-vhd  14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-checkfail   never pass
 test-armh

Re: [RFC PATCH 14/30] mm: prevent slabobj_ext allocations for slabobj_ext and kmem_cache objects

2022-09-01 Thread Suren Baghdasaryan
On Thu, Sep 1, 2022 at 4:41 PM Roman Gushchin  wrote:
>
> On Tue, Aug 30, 2022 at 02:49:03PM -0700, Suren Baghdasaryan wrote:
> > Use __GFP_NO_OBJ_EXT to prevent recursions when allocating slabobj_ext
> > objects. Also prevent slabobj_ext allocations for kmem_cache objects.
> >
> > Signed-off-by: Suren Baghdasaryan 
>
> Patches 12-14 look good to me.
> It's probably too early to ack anything, but otherwise I'd ack them.

Thank you for reviewing!

>
> Thanks!



Re: [RFC PATCH 11/30] mm: introduce slabobj_ext to support slab object extensions

2022-09-01 Thread Suren Baghdasaryan
On Thu, Sep 1, 2022 at 4:36 PM Roman Gushchin  wrote:
>
> On Tue, Aug 30, 2022 at 02:49:00PM -0700, Suren Baghdasaryan wrote:
> > Currently slab pages can store only vectors of obj_cgroup pointers in
> > page->memcg_data. Introduce slabobj_ext structure to allow more data
> > to be stored for each slab object. Wraps obj_cgroup into slabobj_ext
> > to support current functionality while allowing to extend slabobj_ext
> > in the future.
> >
> > Note: ideally the config dependency should be turned the other way around:
> > MEMCG should depend on SLAB_OBJ_EXT and {page|slab|folio}.memcg_data would
> > be renamed to something like {page|slab|folio}.objext_data. However doing
> > this in RFC would introduce considerable churn unrelated to the overall
> > idea, so avoiding this until v1.
>
> Hi Suren!

Hi Roman,

>
> I'd say CONFIG_MEMCG_KMEM and CONFIG_YOUR_NEW_STUFF should both depend on
> SLAB_OBJ_EXT.
> CONFIG_MEMCG_KMEM depend on CONFIG_MEMCG anyway.

Yes, I agree. I wanted to mention here that the current dependency is
incorrect and should be reworked. Having both depending on
SLAB_OBJ_EXT seems like the right approach.

>
> >
> > Signed-off-by: Suren Baghdasaryan 
> > ---
> >  include/linux/memcontrol.h |  18 --
> >  init/Kconfig   |   5 ++
> >  mm/kfence/core.c   |   2 +-
> >  mm/memcontrol.c|  60 ++-
> >  mm/page_owner.c|   2 +-
> >  mm/slab.h  | 119 +
> >  6 files changed, 131 insertions(+), 75 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 6257867fbf95..315399f77173 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -227,6 +227,14 @@ struct obj_cgroup {
> >   };
> >  };
> >
> > +/*
> > + * Extended information for slab objects stored as an array in 
> > page->memcg_data
> > + * if MEMCG_DATA_OBJEXTS is set.
> > + */
> > +struct slabobj_ext {
> > + struct obj_cgroup *objcg;
> > +} __aligned(8);
>
> Why do we need this aligment requirement?

To save space by avoiding padding; however, all members today will be
pointers, so it's unnecessary and we can safely drop it.

>
> > +
> >  /*
> >   * The memory controller data structure. The memory controller controls 
> > both
> >   * page cache and RSS per cgroup. We would eventually like to provide
> > @@ -363,7 +371,7 @@ extern struct mem_cgroup *root_mem_cgroup;
> >
> >  enum page_memcg_data_flags {
> >   /* page->memcg_data is a pointer to an objcgs vector */
> > - MEMCG_DATA_OBJCGS = (1UL << 0),
> > + MEMCG_DATA_OBJEXTS = (1UL << 0),
> >   /* page has been accounted as a non-slab kernel page */
> >   MEMCG_DATA_KMEM = (1UL << 1),
> >   /* the next bit after the last actual flag */
> > @@ -401,7 +409,7 @@ static inline struct mem_cgroup *__folio_memcg(struct 
> > folio *folio)
> >   unsigned long memcg_data = folio->memcg_data;
> >
> >   VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
> > - VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio);
> > + VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
> >   VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_KMEM, folio);
> >
> >   return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
> > @@ -422,7 +430,7 @@ static inline struct obj_cgroup *__folio_objcg(struct 
> > folio *folio)
> >   unsigned long memcg_data = folio->memcg_data;
> >
> >   VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
> > - VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio);
> > + VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
> >   VM_BUG_ON_FOLIO(!(memcg_data & MEMCG_DATA_KMEM), folio);
> >
> >   return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
> > @@ -517,7 +525,7 @@ static inline struct mem_cgroup 
> > *page_memcg_check(struct page *page)
> >*/
> >   unsigned long memcg_data = READ_ONCE(page->memcg_data);
> >
> > - if (memcg_data & MEMCG_DATA_OBJCGS)
> > + if (memcg_data & MEMCG_DATA_OBJEXTS)
> >   return NULL;
> >
> >   if (memcg_data & MEMCG_DATA_KMEM) {
> > @@ -556,7 +564,7 @@ static inline struct mem_cgroup 
> > *get_mem_cgroup_from_objcg(struct obj_cgroup *ob
> >  static inline bool folio_memcg_kmem(struct folio *folio)
> >  {
> >   VM_BUG_ON_PGFLAGS(PageTail(&folio->page), &folio->page);
> > - VM_BUG_ON_FOLIO(folio->memcg_data & MEMCG_DATA_OBJCGS, folio);
> > + VM_BUG_ON_FOLIO(folio->memcg_data & MEMCG_DATA_OBJEXTS, folio);
> >   return folio->memcg_data & MEMCG_DATA_KMEM;
> >  }
> >
> > diff --git a/init/Kconfig b/init/Kconfig
> > index 532362fcfe31..82396d7a2717 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -958,6 +958,10 @@ config MEMCG
> >   help
> > Provides control over the memory footprint of tasks in a cgroup.
> >
> > +config SLAB_OBJ_EXT
> > + bool
> > + depends on MEMCG
> > +
> >  config MEMCG_SWAP
> >   bool
> 

Re: [RFC PATCH 27/30] Code tagging based latency tracking

2022-09-01 Thread Steven Rostedt
On Thu, 1 Sep 2022 18:55:15 -0400
Kent Overstreet  wrote:

> On Thu, Sep 01, 2022 at 06:34:30PM -0400, Steven Rostedt wrote:
> > On Thu, 1 Sep 2022 17:54:38 -0400
> > Kent Overstreet  wrote:  
> > > 
> > > So this looks like it's gotten better since I last looked, but it's still 
> > > not
> > > there yet.
> > > 
> > > Part of the problem is that the tracepoints themselves are in the wrong 
> > > place:
> > > your end event is when a task is woken up, but that means spurious 
> > > wakeups will  
> > 
> > The end event is when a task is scheduled onto the CPU. The start event is
> > the first time it is woken up.  
> 
> Yeah, that's not what I want. You're just tracing latency due to having more
> processes runnable than CPUs.
> 
> I don't care about that for debugging, though! I specifically want latency at
> the wait_event() level, and related - every time a process blocked _on some
> condition_, until that condition became true. Not until some random, 
> potentially
> spurious wakeup.

This would be better if we could pass the stack trace from one event to
the next; that wouldn't be too hard to implement, it just needs to be
done.

But anyway:

 # echo 'p:wait prepare_to_wait_event' > /sys/kernel/tracing/kprobe_events

// created an event on prepare_to_wait_event as that's usually called just
// before wait event.

 # sqlhist -e -n wait_sched 'select start.common_pid as pid,(end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as delta from wait as start join sched_switch as end on start.common_pid = end.prev_pid where end.prev_state & 3'

// Create a "wait_sched" event that traces the time between the
// prepare_to_wait_event call and the scheduler. Only trigger it if the
// schedule happens in the interruptible or uninterruptible states.

 # sqlhist -e -n wake_sched 'select start.pid,(end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as delta2 from wait_sched as start join sched_switch as end on start.pid = end.next_pid where start.delta < 50'

// Now attach the wait_event to the sched_switch where the task gets
// scheduled back in. But we only care if the delta between the
// prepare_to_wait_event and the schedule is less than 50us. This is a
// hack to focus on cases where a prepare_to_wait_event was done just before
// scheduling out.

 # echo 'hist:keys=pid,delta2.buckets=10:sort=delta2' > /sys/kernel/tracing/events/synthetic/wake_sched/trigger

// Now we look at the deltas for the time the task was sleeping on an
// event. But this just gives pids and deltas.

 # echo 'hist:keys=pid,stacktrace if delta < 50' >> /sys/kernel/tracing/events/synthetic/wait_sched/trigger

// And this is to get the backtraces of where the task was. This is because
// the stack trace is not available at the schedule in, because the
// sched_switch can only give the stack trace of when a task schedules out.
// Again, this is somewhat a hack.

 # cat /sys/kernel/tracing/events/synthetic/wake_sched/hist
# event histogram
#
# trigger info: hist:keys=pid,delta2.buckets=10:vals=hitcount:sort=delta2.buckets=10:size=2048 [active]
#

{ pid:   2114, delta2: ~ 10-19 } hitcount:  1
{ pid:   1389, delta2: ~ 160-169 } hitcount:  1
{ pid:   1389, delta2: ~ 660-669 } hitcount:  1
{ pid:   1389, delta2: ~ 1020-1029 } hitcount:  1
{ pid:   1189, delta2: ~ 500020-500029 } hitcount:  1
{ pid:   1189, delta2: ~ 500030-500039 } hitcount:  1
{ pid:   1195, delta2: ~ 500030-500039 } hitcount:  2
{ pid:   1189, delta2: ~ 500040-500049 } hitcount: 10
{ pid:   1193, delta2: ~ 500040-500049 } hitcount:  3
{ pid:   1197, delta2: ~ 500040-500049 } hitcount:  3
{ pid:   1195, delta2: ~ 500040-500049 } hitcount:  9
{ pid:   1190, delta2: ~ 500050-500059 } hitcount: 55
{ pid:   1197, delta2: ~ 500050-500059 } hitcount: 51
{ pid:   1191, delta2: ~ 500050-500059 } hitcount: 61
{ pid:   1198, delta2: ~ 500050-500059 } hitcount: 56
{ pid:   1195, delta2: ~ 500050-500059 } hitcount: 48
{ pid:   1192, delta2: ~ 500050-500059 } hitcount: 54
{ pid:   1194, delta2: ~ 500050-500059 } hitcount: 50
{ pid:   1196, delta2: ~ 500050-500059 } hitcount: 57
{ pid:   1189, delta2: ~ 500050-500059 } hitcount: 48
{ pid:   1193, delta2: ~ 500050-500059 } hitcount: 52
{ pid:   1194, delta2: ~ 500060-500069 } hitcount: 12
{ pid:   1191, delta2: ~ 500060-500069 } hitcount:  2
{ pid:   1190, delta2: ~ 500060-500069 } hitcount:  7
{ pid:   1198, delta2: ~ 500060-500069 } hitcount:  9
{ pid:   1193, delta2: ~ 500060-500069 } hitcount:  6
{ pid:   1196, delta2: ~ 500060-500069 } hitcount:  5
{ pid:   1192, delta2: ~ 500060-500069 } hitcount:  9
{ pid:   1197, delta2: ~ 500060-500069 } hitcount:  9
{ pid:   

Re: [RFC PATCH 00/30] Code tagging framework and applications

2022-09-01 Thread Kent Overstreet
On Thu, Sep 01, 2022 at 03:53:57PM -0700, Roman Gushchin wrote:
> I'd suggest to run something like iperf on a fast hardware. And maybe some
> io_uring stuff too. These are two places which were historically most sensitive
> to the (kernel) memory accounting speed.

I'm getting wildly inconsistent results with iperf.

io_uring-echo-server and rust_echo_bench gets me:
Benchmarking: 127.0.0.1:12345
50 clients, running 512 bytes, 60 sec.

Without alloc tagging:  120547 request/sec
With:   116748 request/sec

https://github.com/frevib/io_uring-echo-server
https://github.com/haraldh/rust_echo_bench

How's that look to you? Close enough? :)



Re: [RFC PATCH 00/30] Code tagging framework and applications

2022-09-01 Thread Suren Baghdasaryan
On Thu, Sep 1, 2022 at 3:54 PM Roman Gushchin  wrote:
>
> On Thu, Sep 01, 2022 at 06:37:20PM -0400, Kent Overstreet wrote:
> > On Thu, Sep 01, 2022 at 03:27:27PM -0700, Roman Gushchin wrote:
> > > On Wed, Aug 31, 2022 at 01:56:08PM -0700, Yosry Ahmed wrote:
> > > > This is very interesting work! Do you have any data about the overhead
> > > > this introduces, especially in a production environment? I am
> > > > especially interested in memory allocations tracking and detecting
> > > > leaks.
> > >
> > > +1
> > >
> > > I think the question whether it indeed can be always turned on in the production
> > > or not is the main one. If not, the advantage over ftrace/bpf/... is not that
> > > obvious. Otherwise it will be indeed a VERY useful thing.
> >
> > Low enough overhead to run in production was my primary design goal.
> >
> > Stats are kept in a struct that's defined at the callsite. So this adds _no_
> > pointer chasing to the allocation path, unless we've switched to percpu counters
> > at that callsite (see the lazy percpu counters patch), where we need to deref
> > one percpu pointer to save an atomic.
> >
> > Then we need to stash a pointer to the alloc_tag, so that kfree() can find it.
> > For slab allocations this uses the same storage area as memcg, so for
> > allocations that are using that we won't be touching any additional cachelines.
> > (I wanted the pointer to the alloc_tag to be stored inline with the allocation,
> > but that would've caused alignment difficulties).
> >
> > Then there's a pointer deref introduced to the kfree() path, to get back to the
> > original alloc_tag and subtract the allocation from that callsite. That one
> > won't be free, and with percpu counters we've got another dependent load too -
> > hmm, it might be worth benchmarking with just atomics, skipping the percpu
> > counters.
> >
> > So the overhead won't be zero, I expect it'll show up in some synthetic
> > benchmarks, but yes I do definitely expect this to be worth enabling in
> > production in many scenarios.
>
> I'm somewhat sceptical, but I usually am. And in this case I'll be really happy
> to be wrong.
>
> On a bright side, maybe most of the overhead will come from few allocations,
> so an option to explicitly exclude them will do the trick.
>
> I'd suggest to run something like iperf on a fast hardware. And maybe some
> io_uring stuff too. These are two places which were historically most sensitive
> to the (kernel) memory accounting speed.

Thanks for the suggestions, Roman. I'll see how I can get this done.
I'll have to find someone with access to fast hardware (Android is not
great for that) and backporting the patchset to the supported kernel
version. Will do my best.
Thanks,
Suren.

>
> Thanks!
>



Re: [RFC PATCH 28/30] Improved symbolic error names

2022-09-01 Thread Kent Overstreet
On Thu, Sep 01, 2022 at 04:19:35PM -0700, Joe Perches wrote:
> On Tue, 2022-08-30 at 14:49 -0700, Suren Baghdasaryan wrote:
> > From: Kent Overstreet 
> > 
> > This patch adds per-error-site error codes, with error strings that
> > include their file and line number.
> > 
> > To use, change code that returns an error, e.g.
> > return -ENOMEM;
> > to
> > return -ERR(ENOMEM);
> > 
> > Then, errname() will return a string that includes the file and line
> > number of the ERR() call, for example
> > printk("Got error %s!\n", errname(err));
> > will result in
> > Got error ENOMEM at foo.c:1234
> 
> Why? Something wrong with just using %pe ?
> 
>   printk("Got error %pe at %s:%d!\n", ERR_PTR(err), __FILE__, __LINE__);
> 
> Likely __FILE__ and __LINE__ aren't particularly useful.

That doesn't do what this patchset does. If it only did that, it wouldn't make
much sense, would it? :)

With this patchset,
 printk("Got error %pe!\n", ptr);

prints out a file and line number, but it's _not_ the file/line number of the
printk statement - it's the file/line number where the error originated!

:)



[ovmf test] 172921: regressions - FAIL

2022-09-01 Thread osstest service owner
flight 172921 ovmf real [real]
http://logs.test-lab.xenproject.org/osstest/logs/172921/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-i386-libvirt            6 libvirt-build            fail REGR. vs. 172136
 build-amd64-libvirt           6 libvirt-build            fail REGR. vs. 172136

version targeted for testing:
 ovmf ac55fcb051e4f288d29432043a42c13866419598
baseline version:
 ovmf 444260d45ec2a84e8f8c192b3539a3cd5591d009

Last test of basis   172136  2022-08-04 06:43:42 Z   28 days
Failing since        172151  2022-08-05 02:40:28 Z   27 days  221 attempts
Testing same since   172917  2022-09-01 15:42:06 Z    0 days    2 attempts


People who touched revisions under test:
  Abdul Lateef Attar 
  Abner Chang 
  Ard Biesheuvel 
  Bob Feng 
  Chasel Chiu 
  Chen, Xiao X 
  Czajkowski, Maciej 
  Dimitrije Pavlov 
  Dun Tan 
  Edward Pickup 
  Foster Nong 
  Gregx Yeh 
  Guo Dong 
  Igor Kulchytskyy 
  James Lu 
  Jeff Brasen 
  Jiaxin Wu 
  Jose Marinho 
  KasimX Liu 
  Kavya 
  Konstantin Aladyshev 
  Kun Qin 
  Liming Gao 
  Liu, Zhiguang 
  Maciej Czajkowski 
  Michael D Kinney 
  Michael Kubacki 
  Ray Ni 
  Rebecca Cran 
  Rebecca Cran 
  Sainadh Nagolu 
  Sami Mujawar 
  Shengfengx Xue 
  Wu, Jiaxin 
  Xiao X Chen 
  Yuanhao Xie 
  Zhiguang Liu 

jobs:
 build-amd64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-i386   pass
 build-amd64-libvirt  fail
 build-i386-libvirt   fail
 build-amd64-pvopspass
 build-i386-pvops pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64 pass
 test-amd64-i386-xl-qemuu-ovmf-amd64  pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Not pushing.

(No revision log; it would be 1425 lines long.)



Re: [RFC PATCH 28/30] Improved symbolic error names

2022-09-01 Thread Joe Perches
On Tue, 2022-08-30 at 14:49 -0700, Suren Baghdasaryan wrote:
> From: Kent Overstreet 
> 
> This patch adds per-error-site error codes, with error strings that
> include their file and line number.
> 
> To use, change code that returns an error, e.g.
> return -ENOMEM;
> to
> return -ERR(ENOMEM);
> 
> Then, errname() will return a string that includes the file and line
> number of the ERR() call, for example
> printk("Got error %s!\n", errname(err));
> will result in
> Got error ENOMEM at foo.c:1234

Why? Something wrong with just using %pe ?

printk("Got error %pe at %s:%d!\n", ERR_PTR(err), __FILE__, __LINE__);

Likely __FILE__ and __LINE__ aren't particularly useful.

And using ERR would add rather a lot of bloat as each codetag_error_code
struct would be unique.

+#define ERR(_err)  \
+({ \
+   static struct codetag_error_code\
+   __used  \
+   __section("error_code_tags")\
+   __aligned(8) e = {  \
+   .str= #_err " at " __FILE__ ":" __stringify(__LINE__),\
+   .err= _err, \
+   };  \
+   \
+   e.err;  \
+})




[xen-unstable test] 172910: tolerable FAIL - PUSHED

2022-09-01 Thread osstest service owner
flight 172910 xen-unstable real [real]
flight 172920 xen-unstable real-retest [real]
http://logs.test-lab.xenproject.org/osstest/logs/172910/
http://logs.test-lab.xenproject.org/osstest/logs/172920/

Failures :-/ but no regressions.

Tests which are failing intermittently (not blocking):
 test-amd64-amd64-xl-credit1 20 guest-localmigrate/x10 fail pass in 172920-retest
 test-armhf-armhf-xl-cubietruck  8 xen-boot  fail pass in 172920-retest
 test-amd64-i386-xl-qemut-debianhvm-i386-xsm 12 debian-hvm-install fail pass in 172920-retest

Tests which did not succeed, but are not blocking:
 test-amd64-i386-libvirt   1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt-pair  1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a
 test-amd64-i386-libvirt-raw   1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt-xsm   1 build-check(1)   blocked  n/a
 test-arm64-arm64-libvirt-raw  1 build-check(1)   blocked  n/a
 test-arm64-arm64-libvirt-xsm  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-pair  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a
 test-amd64-amd64-libvirt-vhd  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-xsm  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt-qcow2  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt-raw  1 build-check(1)   blocked  n/a
 test-armhf-armhf-xl-cubietruck 15 migrate-support-check fail in 172920 never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-check fail in 172920 never pass
 build-amd64-libvirt   6 libvirt-build    fail  like 172901
 build-i386-libvirt    6 libvirt-build    fail  like 172901
 build-arm64-libvirt   6 libvirt-build    fail  like 172901
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stop    fail like 172901
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 172901
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop    fail like 172901
 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop     fail like 172901
 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop     fail like 172901
 build-armhf-libvirt   6 libvirt-build    fail  like 172901
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stop    fail like 172901
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop     fail like 172901
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop     fail like 172901
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop    fail like 172901
 test-arm64-arm64-xl-seattle  15 migrate-support-checkfail   never pass
 test-amd64-i386-xl-pvshim14 guest-start  fail   never pass
 test-arm64-arm64-xl-seattle  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-

Re: [RFC PATCH 27/30] Code tagging based latency tracking

2022-09-01 Thread Kent Overstreet
On Thu, Sep 01, 2022 at 06:34:30PM -0400, Steven Rostedt wrote:
> On Thu, 1 Sep 2022 17:54:38 -0400
> Kent Overstreet  wrote:
> > 
> > So this looks like it's gotten better since I last looked, but it's still not
> > there yet.
> > 
> > Part of the problem is that the tracepoints themselves are in the wrong place:
> > your end event is when a task is woken up, but that means spurious wakeups will
> 
> The end event is when a task is scheduled onto the CPU. The start event is
> the first time it is woken up.

Yeah, that's not what I want. You're just tracing latency due to having more
processes runnable than CPUs.

I don't care about that for debugging, though! I specifically want latency at
the wait_event() level, and related - every time a process blocked _on some
condition_, until that condition became true. Not until some random, potentially
spurious wakeup.


> Not the prettiest thing to read. But hey, we got the full stack of where
> these latencies happened!

Most of the time I _don't_ want full stacktraces, though!

That means I have a ton more output to sort through, and the data is far more
expensive to collect.

I don't know why it's what people go to first - see the page_owner stuff - but
that doesn't get used much either because the output is _really hard to sort
through_.

Most of the time, just a single file and line number is all you want - and
tracing has always made it hard to get at that.


> Yes, it adds some overhead when the events are triggered due to the
> stacktrace code, but it's extremely useful information.
> 
> > 
> > So, it looks like tracing has made some progress over the past 10 years,
> > but for debugging latency issues it's still not there yet in general. I
> 
> I call BS on that statement. Just because you do not know what has been
> added to the kernel in the last 10 years (like you had no idea about
> seq_buf and that was added in 2014) means to me that you are totally
> clueless on what tracing can and can not do.
> 
> It appears to me that you are too focused on inventing your own wheel that
> does exactly what you want before looking to see how things are today. Just
> because something didn't fit your needs 10 years ago doesn't mean that it
> can't fit your needs today.

...And the ad hominem attacks start.

Steve, I'm not attacking you, and there's room enough in this world for the both
of us to be doing our thing creating new and useful tools.

> I'm already getting complaints from customers/users that are saying there's
> too many tools in the toolbox already. (Do we use ftrace/perf/bpf?). The
> idea is to have the tools using mostly the same infrastructure, and not be
> 100% off on its own, unless there's a clear reason to invent a new wheel
> that several people are asking for, not just one or two.

I would like to see more focus on usability.

That means, in a best case scenario, always-on data collection that I can just
look at, and it'll already be in the format most likely to be useful.

Surely you can appreciate the usefulness of that..?

Tracing started out as a tool for efficiently getting lots of data out of the
kernel, and it's great for that. But I think your focus on the cool thing you
built may be blinding you a bit to alternative approaches...



Re: [RFC PATCH 00/30] Code tagging framework and applications

2022-09-01 Thread Kent Overstreet
On Thu, Sep 01, 2022 at 03:27:27PM -0700, Roman Gushchin wrote:
> On Wed, Aug 31, 2022 at 01:56:08PM -0700, Yosry Ahmed wrote:
> > This is very interesting work! Do you have any data about the overhead
> > this introduces, especially in a production environment? I am
> > especially interested in memory allocations tracking and detecting
> > leaks.
> 
> +1
> 
> I think the question whether it indeed can be always turned on in the 
> production
> or not is the main one. If not, the advantage over ftrace/bpf/... is not that
> obvious. Otherwise it will be indeed a VERY useful thing.

Low enough overhead to run in production was my primary design goal.

Stats are kept in a struct that's defined at the callsite. So this adds _no_
pointer chasing to the allocation path, unless we've switched to percpu counters
at that callsite (see the lazy percpu counters patch), where we need to deref
one percpu pointer to save an atomic.

Then we need to stash a pointer to the alloc_tag, so that kfree() can find it.
For slab allocations this uses the same storage area as memcg, so for
allocations that are using that we won't be touching any additional cachelines.
(I wanted the pointer to the alloc_tag to be stored inline with the allocation,
but that would've caused alignment difficulties).

Then there's a pointer deref introduced to the kfree() path, to get back to the
original alloc_tag and subtract the allocation from that callsite. That one
won't be free, and with percpu counters we've got another dependent load too -
hmm, it might be worth benchmarking with just atomics, skipping the percpu
counters.

So the overhead won't be zero, I expect it'll show up in some synthetic
benchmarks, but yes I do definitely expect this to be worth enabling in
production in many scenarios.



Re: [RFC PATCH 27/30] Code tagging based latency tracking

2022-09-01 Thread Steven Rostedt
On Thu, 1 Sep 2022 17:54:38 -0400
Kent Overstreet  wrote:
> 
> So this looks like it's gotten better since I last looked, but it's still not
> there yet.
> 
> Part of the problem is that the tracepoints themselves are in the wrong place:
> your end event is when a task is woken up, but that means spurious wakeups will

The end event is when a task is scheduled onto the CPU. The start event is
the first time it is woken up.

> cause one wait_event() call to be reported as multiple smaller waits, not one
> long wait - oops, now I can't actually find the thing that's causing my
> multi-second delay.
> 
> Also, in your example you don't have it broken out by callsite. That would be
> the first thing I'd need for any real world debugging.

OK, how about this (currently we can only have 3 keys, but you can create
multiple histograms on the same event).

 # echo 'hist:keys=comm,stacktrace,delta.buckets=10:sort=delta' > /sys/kernel/tracing/events/synthetic/wakeup_lat/trigger

(notice the "stacktrace" in the keys)

# cat /sys/kernel/tracing/events/synthetic/wakeup_lat/hist
# event histogram
#
# trigger info: hist:keys=comm,stacktrace,delta.buckets=10:vals=hitcount:sort=delta.buckets=10:size=2048 [active]
#

{ comm: migration/2   , stacktrace:
 event_hist_trigger+0x290/0x2b0
 event_triggers_call+0x52/0xe0
 trace_event_buffer_commit+0x193/0x240
 trace_event_raw_event_sched_switch+0x120/0x180
 __traceiter_sched_switch+0x39/0x50
 __schedule+0x310/0x700
 schedule_idle+0x26/0x40
 do_idle+0xb4/0xd0
 cpu_startup_entry+0x19/0x20
 secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount:  7
{ comm: migration/5   , stacktrace:
 event_hist_trigger+0x290/0x2b0
 event_triggers_call+0x52/0xe0
 trace_event_buffer_commit+0x193/0x240
 trace_event_raw_event_sched_switch+0x120/0x180
 __traceiter_sched_switch+0x39/0x50
 __schedule+0x310/0x700
 schedule_idle+0x26/0x40
 do_idle+0xb4/0xd0
 cpu_startup_entry+0x19/0x20
 secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount:  7
{ comm: migration/1   , stacktrace:
 event_hist_trigger+0x290/0x2b0
 event_triggers_call+0x52/0xe0
 trace_event_buffer_commit+0x193/0x240
 trace_event_raw_event_sched_switch+0x120/0x180
 __traceiter_sched_switch+0x39/0x50
 __schedule+0x310/0x700
 schedule_idle+0x26/0x40
 do_idle+0xb4/0xd0
 cpu_startup_entry+0x19/0x20
 secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount:  7
{ comm: migration/7   , stacktrace:
 event_hist_trigger+0x290/0x2b0
 event_triggers_call+0x52/0xe0
 trace_event_buffer_commit+0x193/0x240
 trace_event_raw_event_sched_switch+0x120/0x180
 __traceiter_sched_switch+0x39/0x50
 __schedule+0x310/0x700
 schedule_idle+0x26/0x40
 do_idle+0xb4/0xd0
 cpu_startup_entry+0x19/0x20
 secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount:  7
{ comm: migration/0   , stacktrace:
 event_hist_trigger+0x290/0x2b0
 event_triggers_call+0x52/0xe0
 trace_event_buffer_commit+0x193/0x240
 trace_event_raw_event_sched_switch+0x120/0x180
 __traceiter_sched_switch+0x39/0x50
 __schedule+0x310/0x700
 schedule_idle+0x26/0x40
 do_idle+0xb4/0xd0
 cpu_startup_entry+0x19/0x20
 start_kernel+0x595/0x5be
 secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount:  7
{ comm: migration/4   , stacktrace:
 event_hist_trigger+0x290/0x2b0
 event_triggers_call+0x52/0xe0
 trace_event_buffer_commit+0x193/0x240
 trace_event_raw_event_sched_switch+0x120/0x180
 __traceiter_sched_switch+0x39/0x50
 __schedule+0x310/0x700
 schedule_idle+0x26/0x40
 do_idle+0xb4/0xd0
 cpu_startup_entry+0x19/0x20
 secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount:  7
{ comm: rtkit-daemon  , stacktrace:
 event_hist_trigger+0x290/0x2b0
 event_triggers_call+0x52/0xe0
 trace_event_buffer_commit+0x193/0x240
 trace_event_raw_event_sched_switch+0x120/0x180
 __traceiter_sched_switch+0x39/0x50
 __schedule+0x310/0x700
 preempt_schedule_common+0x2d/0x70
 preempt_schedule_thunk+0x16/0x18
 _raw_spin_unlock_irq+0x2e/0x40
 eventfd_write+0xc8/0x290
 vfs_write+0xc0/0x2a0
 ksys_write+0x5f/0xe0
 do_syscall_64+0x3b/0x90
 entry_SYSCALL_64_after_hwframe+0x61/0xcb
, delta: ~ 10

[linux-linus test] 172911: regressions - FAIL

2022-09-01 Thread osstest service owner
flight 172911 linux-linus real [real]
http://logs.test-lab.xenproject.org/osstest/logs/172911/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-amd64-libvirt   6 libvirt-build    fail REGR. vs. 172133
 build-arm64-libvirt   6 libvirt-build    fail REGR. vs. 172133
 build-i386-libvirt    6 libvirt-build    fail REGR. vs. 172133
 build-armhf-libvirt   6 libvirt-build    fail REGR. vs. 172133

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-libvirt  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt-qcow2  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt-raw  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-pair  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-qcow2  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a
 test-amd64-amd64-libvirt-raw  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-xsm  1 build-check(1)   blocked  n/a
 test-arm64-arm64-libvirt-raw  1 build-check(1)   blocked  n/a
 test-arm64-arm64-libvirt-xsm  1 build-check(1)   blocked  n/a
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stop    fail like 172133
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 172133
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop    fail like 172133
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop    fail like 172133
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stop    fail like 172133
 test-arm64-arm64-xl-seattle  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-checkfail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-checkfail never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  15 saverestore-support-checkfail   never pass

version targeted for testing:
 linux    c5e4d5e99162ba8025d58a3af7ad103f155d2df7
baseline version:
 linux    b44f2fd87919b5ae6e1756d4c7ba2cbba22238e1

Last test of basis   172133  2022-08-04 05:14:48 Z   28 days
Failing since        172152  2022-08-05 04:01:26 Z   27 days   63 attempts
Testing same since   172902  2022-08-31 23:12:06 Z    0 days    2 attempts


1599 people touched revisions under test,
not listing them all

jobs:
 build-amd64-xsm  pass
 build-arm64-xsm  pass
 build-i386-xsm   pass
 build-amd64

Re: [RFC PATCH 27/30] Code tagging based latency tracking

2022-09-01 Thread Kent Overstreet
On Thu, Sep 01, 2022 at 05:38:44PM -0400, Steven Rostedt wrote:
> On Tue, 30 Aug 2022 14:49:16 -0700
> Suren Baghdasaryan  wrote:
> 
> > From: Kent Overstreet 
> > 
> > This adds the ability to easily instrument code for measuring latency.
> > To use, add the following to calls to your code, at the start and end of
> > the event you wish to measure:
> > 
> >   code_tag_time_stats_start(start_time);
> >   code_tag_time_stats_finish(start_time);
> 
> So you need to modify the code to see what you want?

Figuring out the _correct_ place to measure is often a significant amount of the
total effort.

Having done so once, why not annotate that in the source code?

> For function length you could just do something like this:
> 
>  # cd /sys/kernel/tracing
>  # echo __skb_wait_for_more_packets > set_ftrace_filter
>  # echo 1 > function_profile_enabled
>  # cat trace_stat/function*
>   Function                     Hit    Time       Avg        s^2
>   --------                     ---    ----       ---        ---
>   __skb_wait_for_more_packets  1      0.000 us   0.000 us   0.000 us
>   Function                     Hit    Time       Avg        s^2
>   --------                     ---    ----       ---        ---
>   __skb_wait_for_more_packets  1      74.813 us  74.813 us  0.000 us
>   Function                     Hit    Time       Avg        s^2
>   --------                     ---    ----       ---        ---
>   Function                     Hit    Time       Avg        s^2
>   --------                     ---    ----       ---        ---
> 
> The above is for a 4 CPU machine. The s^2 is the square of the standard
> deviation (makes not having to do divisions while it runs).
> 
> But if you are looking for latency between two events (which can be kprobes
> too, where you do not need to rebuild your kernel):
> 
> From: https://man.archlinux.org/man/sqlhist.1.en
> which comes in: https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/
>   if not already installed on your distro.
> 
>  # sqlhist -e -n wakeup_lat 'select end.next_comm as comm,start.pid,start.prio,(end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as delta from sched_waking as start join sched_switch as end on start.pid = end.next_pid where start.prio < 100'
> 
> The above creates a synthetic event called "wakeup_lat" that joins two
> events (sched_waking and sched_switch) when the pid field of sched_waking
> matches the next_pid field of sched_switch. When there is a match, it will
> trigger the wakeup_lat event only if the prio of the sched_waking event is
> less than 100 (which in the kernel means any real-time task). The
> wakeup_lat event will record the next_comm (as comm field), the pid of
> woken task and the time delta in microseconds between the two events.

So this looks like it's gotten better since I last looked, but it's still not
there yet.

Part of the problem is that the tracepoints themselves are in the wrong place:
your end event is when a task is woken up, but that means spurious wakeups will
cause one wait_event() call to be reported as multiple smaller waits, not one
long wait - oops, now I can't actually find the thing that's causing my
multi-second delay.

Also, in your example you don't have it broken out by callsite. That would be
the first thing I'd need for any real world debugging.

So, it looks like tracing has made some progress over the past 10 years, but
for debugging latency issues it's still not there yet in general. I will
definitely remember function latency tracing the next time I'm doing performance
work, but I expect that to be far too heavy to enable on a live server.

This thing is only a couple hundred lines of code though, so perhaps tracing
shouldn't be the only tool in our toolbox :)



Re: [RFC PATCH 27/30] Code tagging based latency tracking

2022-09-01 Thread Steven Rostedt
On Thu, 1 Sep 2022 17:38:44 -0400
Steven Rostedt  wrote:

>  # echo 'hist:keys=comm,prio,delta.buckets=10:sort=delta' > 
> /sys/kernel/tracing/events/synthetic/wakeup_lat/trigger

The above could almost be done with sqlhist, but I haven't implemented
"buckets=10" yet because that's a new feature. For now, let's use log2:

 # sqlhist -e 'select comm,prio,cast(delta as log2) from wakeup_lat'

("-e" is to execute the command, as it normally only displays what commands
need to be run to create the synthetic events and histograms)

# cat /sys/kernel/tracing/events/synthetic/wakeup_lat/hist
# event histogram
#
# trigger info: hist:keys=comm,prio,delta.log2:vals=hitcount:sort=hitcount:size=2048 [active]
#

{ comm: migration/4,  prio: 0,  delta: ~ 2^5 } hitcount: 1
{ comm: migration/0,  prio: 0,  delta: ~ 2^4 } hitcount: 2
{ comm: rtkit-daemon, prio: 0,  delta: ~ 2^7 } hitcount: 2
{ comm: rtkit-daemon, prio: 0,  delta: ~ 2^6 } hitcount: 4
{ comm: migration/0,  prio: 0,  delta: ~ 2^5 } hitcount: 8
{ comm: migration/4,  prio: 0,  delta: ~ 2^4 } hitcount: 9
{ comm: migration/2,  prio: 0,  delta: ~ 2^4 } hitcount: 10
{ comm: migration/5,  prio: 0,  delta: ~ 2^4 } hitcount: 10
{ comm: migration/7,  prio: 0,  delta: ~ 2^4 } hitcount: 10
{ comm: migration/1,  prio: 0,  delta: ~ 2^4 } hitcount: 10
{ comm: migration/6,  prio: 0,  delta: ~ 2^4 } hitcount: 10

Totals:
Hits: 76
Entries: 11
Dropped: 0


-- Steve



Re: [RFC PATCH 27/30] Code tagging based latency tracking

2022-09-01 Thread Steven Rostedt
On Tue, 30 Aug 2022 14:49:16 -0700
Suren Baghdasaryan  wrote:

> From: Kent Overstreet 
> 
> This adds the ability to easily instrument code for measuring latency.
> To use, add the following to calls to your code, at the start and end of
> the event you wish to measure:
> 
>   code_tag_time_stats_start(start_time);
>   code_tag_time_stats_finish(start_time);

So you need to modify the code to see what you want?

> 
> Statistics will then show up in debugfs under
> /sys/kernel/debug/time_stats, listed by file and line number.
> 
> Statistics measured include weighted averages of frequency, duration, max
> duration, as well as quantiles.
> 
> This patch also instruments all calls to init_wait and finish_wait,
> which includes all calls to wait_event. Example debugfs output:
> 
> fs/xfs/xfs_trans_ail.c:746 module:xfs func:xfs_ail_push_all_sync
> count:  17
> rate:   0/sec
> frequency:  2 sec
> avg duration:   10 us
> max duration:   232 us
> quantiles (ns): 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128
> 
> lib/sbitmap.c:813 module:sbitmap func:sbitmap_finish_wait
> count:  3
> rate:   0/sec
> frequency:  4 sec
> avg duration:   4 sec
> max duration:   4 sec
> quantiles (ns): 0 4288669120 4288669120 5360836048 5360836048 5360836048 
> 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 
> 5360836048 5360836048
> 
> net/core/datagram.c:122 module:datagram func:__skb_wait_for_more_packets
> count:  10
> rate:   1/sec
> frequency:  859 ms
> avg duration:   472 ms
> max duration:   30 sec
> quantiles (ns): 0 12279 12279 15669 15669 15669 15669 17217 17217 17217 17217 
> 17217 17217 17217 17217

For function latency you could just do something like this:

 # cd /sys/kernel/tracing
 # echo __skb_wait_for_more_packets > set_ftrace_filter
 # echo 1 > function_profile_enabled
 # cat trace_stat/function*
  Function                       Hit      Time         Avg          s^2
  --------                       ---      ----         ---          ---
  __skb_wait_for_more_packets    1        0.000 us     0.000 us     0.000 us
  Function                       Hit      Time         Avg          s^2
  --------                       ---      ----         ---          ---
  __skb_wait_for_more_packets    1        74.813 us    74.813 us    0.000 us
  Function                       Hit      Time         Avg          s^2
  --------                       ---      ----         ---          ---
  Function                       Hit      Time         Avg          s^2
  --------                       ---      ----         ---          ---

The above is for a 4 CPU machine. The s^2 is the square of the standard
deviation (this avoids having to do divisions while it runs).

But if you are looking for latency between two events (which can be kprobes
too, where you do not need to rebuild your kernel):

From: https://man.archlinux.org/man/sqlhist.1.en
which comes in: https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/
  if not already installed on your distro.

 # sqlhist -e -n wakeup_lat 'select end.next_comm as 
comm,start.pid,start.prio,(end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as 
delta from sched_waking as start join sched_switch as end on start.pid = 
end.next_pid where start.prio < 100'

The above creates a synthetic event called "wakeup_lat" that joins two
events (sched_waking and sched_switch) when the pid field of sched_waking
matches the next_pid field of sched_switch. When there is a match, it will
trigger the wakeup_lat event only if the prio of the sched_waking event is
less than 100 (which in the kernel means any real-time task). The
wakeup_lat event will record the next_comm (as comm field), the pid of
woken task and the time delta in microseconds between the two events.

 # echo 'hist:keys=comm,prio,delta.buckets=10:sort=delta' > 
/sys/kernel/tracing/events/synthetic/wakeup_lat/trigger

The above starts a histogram tracing the name of the woken task, the
priority and the delta (but placing the delta in buckets of size 10, as we
do not need to see every latency number).

 # chrt -f 20 sleep 1

Force something to be woken up that is interesting.

 # cat /sys/kernel/tracing/events/synthetic/wakeup_lat/hist
# event histogram
#
# trigger info: hist:keys=comm,prio,delta.buckets=10:vals=hitcount:sort=delta.buckets=10:size=2048 [active]
#

{ comm: migration/5,  prio: 0,  delta: ~ 10-19 } hitcount: 1
{ comm: migration/2,  prio: 0,  delta: ~ 10-19 } hitcount: 1
{ comm: sleep,        prio: 79, delta: ~ 10-19 } hitcount: 1
{ comm: migration/7,  prio: 0,  delta: ~ 10-19 } hitcount:

Re: [RFC PATCH 00/30] Code tagging framework and applications

2022-09-01 Thread Kent Overstreet
On Thu, Sep 01, 2022 at 12:39:11PM -0700, Suren Baghdasaryan wrote:
> kmemleak is known to be slow and it's even documented [1], so I hope I
> can skip that part. For page_owner to provide the comparable
> information we would have to capture the call stacks for all page
> allocations, unlike our proposal, which allows doing that selectively
> for specific call sites. I'll post the overhead numbers of call stack
> capturing once I'm finished with profiling the latest code, hopefully
> sometime tomorrow, in the worst case after the long weekend.

To expand on this further: we're stashing a pointer to the alloc_tag, which is
defined at the allocation callsite. That's how we're able to decrement the
proper counter on free, and why this beats any tracing based approach - with
tracing you'd instead have to correlate allocate/free events. Ouch.

> > Yes, tracking back the call trace would be really needed. The question
> > is whether this is really prohibitively expensive. How much overhead are
> > we talking about? There is no free lunch here, really.  You either have
> > the overhead during runtime when the feature is used or on the source
> > code level for all the future development (with a maze of macros and
> > wrappers).

The full call stack is really not what you want in most applications - that's
what people think they want at first, and why page_owner works the way it does,
but it turns out that then combining all the different but related stack traces
_sucks_ (so why were you saving them in the first place?), and then you have to
do a separate memory allocation for each stack trace, which destroys performance.

> 
> Will post the overhead numbers soon.
> What I hear loud and clear is that we need a kernel command-line kill
> switch that mitigates the overhead for having this feature. That seems
> to be the main concern.
> Thanks,

After looking at this more I don't think we should commit just yet - there's
some tradeoffs to be evaluated, and maybe the thing to do first will be to see
if we can cut down on the (huge!) number of allocation interfaces before adding
more complexity.

The ideal approach, from a performance POV, would be to pass a pointer to the
alloc tag to kmalloc() et al., and then we'd have the actual accounting code in
one place and use a jump label to skip over it when this feature is disabled.

However, there are _many, many_ wrapper functions in our allocation code, and
this approach is going to make the plumbing for the hooks quite a bit bigger
than what we have now - and then, do we want to have this extra alloc_tag
parameter that's not used when CONFIG_ALLOC_TAGGING=n? It's a tiny cost for an
extra unused parameter, but it's a cost - or do we get rid of that with some
extra macro hackery (eww, gross)?

If we do the boot parameter before submission, I think we'll have something
that's maybe not strictly ideal from a performance POV when
CONFIG_ALLOC_TAGGING=y but boot parameter=n, but it'll introduce the minimum
amount of macro insanity.

What we should be able to do pretty easily is discard the alloc_tag structs when
the boot parameter is disabled, because they're in special elf sections and we
already do that (e.g. for .init).



[ovmf test] 172917: regressions - FAIL

2022-09-01 Thread osstest service owner
flight 172917 ovmf real [real]
http://logs.test-lab.xenproject.org/osstest/logs/172917/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-i386-libvirt6 libvirt-buildfail REGR. vs. 172136
 build-amd64-libvirt   6 libvirt-buildfail REGR. vs. 172136

version targeted for testing:
 ovmf ac55fcb051e4f288d29432043a42c13866419598
baseline version:
 ovmf 444260d45ec2a84e8f8c192b3539a3cd5591d009

Last test of basis   172136  2022-08-04 06:43:42 Z   28 days
Failing since        172151  2022-08-05 02:40:28 Z   27 days  220 attempts
Testing same since   172917  2022-09-01 15:42:06 Z    0 days    1 attempts


People who touched revisions under test:
  Abdul Lateef Attar 
  Abner Chang 
  Ard Biesheuvel 
  Bob Feng 
  Chasel Chiu 
  Chen, Xiao X 
  Czajkowski, Maciej 
  Dimitrije Pavlov 
  Dun Tan 
  Edward Pickup 
  Foster Nong 
  Gregx Yeh 
  Guo Dong 
  Igor Kulchytskyy 
  James Lu 
  Jeff Brasen 
  Jiaxin Wu 
  Jose Marinho 
  KasimX Liu 
  Kavya 
  Konstantin Aladyshev 
  Kun Qin 
  Liming Gao 
  Liu, Zhiguang 
  Maciej Czajkowski 
  Michael D Kinney 
  Michael Kubacki 
  Ray Ni 
  Rebecca Cran 
  Rebecca Cran 
  Sainadh Nagolu 
  Sami Mujawar 
  Shengfengx Xue 
  Wu, Jiaxin 
  Xiao X Chen 
  Yuanhao Xie 
  Zhiguang Liu 

jobs:
 build-amd64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-i386   pass
 build-amd64-libvirt  fail
 build-i386-libvirt   fail
 build-amd64-pvopspass
 build-i386-pvops pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64 pass
 test-amd64-i386-xl-qemuu-ovmf-amd64  pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Not pushing.

(No revision log; it would be 1425 lines long.)



Re: [RFC PATCH 00/30] Code tagging framework and applications

2022-09-01 Thread Suren Baghdasaryan
On Thu, Sep 1, 2022 at 12:15 PM Michal Hocko  wrote:
>
> On Thu 01-09-22 08:33:19, Suren Baghdasaryan wrote:
> > On Thu, Sep 1, 2022 at 12:18 AM Michal Hocko  wrote:
> [...]
> > > So I find Peter's question completely appropriate while your response to
> > > that not so much! Maybe ftrace is not the right tool for the intended
> > > job. Maybe there are other ways and it would be really great to show
> > > that those have been evaluated and they are not suitable for a), b) and
> > > c) reasons.
> >
> > That's fair.
> > For memory tracking I looked into using kmemleak and page_owner which
> > can't match the required functionality at an overhead acceptable for
> > production and pre-production testing environments.
>
> Being more specific would be really helpful. Especially when your cover
> letter suggests that you rely on page_owner/memcg metadata as well to
> match allocation and their freeing parts.

kmemleak is known to be slow and it's even documented [1], so I hope I
can skip that part. For page_owner to provide the comparable
information we would have to capture the call stacks for all page
allocations, unlike our proposal, which allows doing that selectively
for specific call sites. I'll post the overhead numbers of call stack
capturing once I'm finished with profiling the latest code, hopefully
sometime tomorrow, in the worst case after the long weekend.

>
> > traces + BPF I
> > haven't evaluated myself but heard from other members of my team who
> > tried using that in production environment with poor results. I'll try
> > to get more specific information on that.
>
> That would be helpful as well.

Ack.

>
> > > E.g. Oscar has been working on extending page_ext to track number of
> > > allocations for specific calltrace[1]. Is this 1:1 replacement? No! But
> > > it can help in environments where page_ext can be enabled and it is
> > > completely non-intrusive to the MM code.
> >
> > Thanks for pointing out this work. I'll need to review and maybe
> > profile it before making any claims.
> >
> > >
> > > If the page_ext overhead is not desirable/acceptable then I am sure
> > > there are other options. E.g. kprobes/LivePatching framework can hook
> > > into functions and alter their behavior. So why not use that for data
> > > collection? Has this been evaluated at all?
> >
> > I'm not sure how I can hook into say alloc_pages() to find out where
> > it was called from without capturing the call stack (which would
> > introduce an overhead at every allocation). Would love to discuss this
> > or other alternatives if they can be done with low enough overhead.
>
> Yes, tracking back the call trace would be really needed. The question
> is whether this is really prohibitively expensive. How much overhead are
> we talking about? There is no free lunch here, really.  You either have
> the overhead during runtime when the feature is used or on the source
> code level for all the future development (with a maze of macros and
> wrappers).

Will post the overhead numbers soon.
What I hear loud and clear is that we need a kernel command-line kill
switch that mitigates the overhead for having this feature. That seems
to be the main concern.
Thanks,
Suren.

[1] https://docs.kernel.org/dev-tools/kmemleak.html#limitations-and-drawbacks

>
> Thanks!
> --
> Michal Hocko
> SUSE Labs



Re: [RFC PATCH 00/30] Code tagging framework and applications

2022-09-01 Thread Michal Hocko
On Thu 01-09-22 08:33:19, Suren Baghdasaryan wrote:
> On Thu, Sep 1, 2022 at 12:18 AM Michal Hocko  wrote:
[...]
> > So I find Peter's question completely appropriate while your response to
> that not so much! Maybe ftrace is not the right tool for the intended
> > job. Maybe there are other ways and it would be really great to show
> > that those have been evaluated and they are not suitable for a), b) and
> > c) reasons.
> 
> That's fair.
> For memory tracking I looked into using kmemleak and page_owner which
> can't match the required functionality at an overhead acceptable for
> production and pre-production testing environments.

Being more specific would be really helpful. Especially when your cover
letter suggests that you rely on page_owner/memcg metadata as well to
match allocation and their freeing parts.

> traces + BPF I
> haven't evaluated myself but heard from other members of my team who
> tried using that in production environment with poor results. I'll try
> to get more specific information on that.

That would be helpful as well.

> > E.g. Oscar has been working on extending page_ext to track number of
> > allocations for specific calltrace[1]. Is this 1:1 replacement? No! But
> > it can help in environments where page_ext can be enabled and it is
> > completely non-intrusive to the MM code.
> 
> Thanks for pointing out this work. I'll need to review and maybe
> profile it before making any claims.
> 
> >
> > If the page_ext overhead is not desirable/acceptable then I am sure
> > there are other options. E.g. kprobes/LivePatching framework can hook
> > into functions and alter their behavior. So why not use that for data
> > collection? Has this been evaluated at all?
> 
> I'm not sure how I can hook into say alloc_pages() to find out where
> it was called from without capturing the call stack (which would
> introduce an overhead at every allocation). Would love to discuss this
> or other alternatives if they can be done with low enough overhead.

Yes, tracking back the call trace would be really needed. The question
is whether this is really prohibitively expensive. How much overhead are
we talking about? There is no free lunch here, really.  You either have
the overhead during runtime when the feature is used or on the source
code level for all the future development (with a maze of macros and
wrappers).

Thanks!
-- 
Michal Hocko
SUSE Labs



Re: [RFC PATCH 03/30] Lazy percpu counters

2022-09-01 Thread Peter Zijlstra
On Thu, Sep 01, 2022 at 10:32:19AM -0400, Kent Overstreet wrote:
> On Thu, Sep 01, 2022 at 08:51:31AM +0200, Peter Zijlstra wrote:
> > On Tue, Aug 30, 2022 at 02:48:52PM -0700, Suren Baghdasaryan wrote:
> > > +static void lazy_percpu_counter_switch_to_pcpu(struct 
> > > raw_lazy_percpu_counter *c)
> > > +{
> > > + u64 __percpu *pcpu_v = alloc_percpu_gfp(u64, GFP_ATOMIC|__GFP_NOWARN);
> > 
> > Realize that this is incorrect when used under a raw_spinlock_t.
> 
> Can you elaborate?

required lock order: raw_spinlock_t < spinlock_t < mutex

allocators lives at spinlock_t.

Also see CONFIG_PROVE_RAW_LOCK_NESTING and there might be a document
mentioning all this somewhere.

Additionally, this (obviously) also isn't NMI safe.



Re: [PATCH v3 7/7] xen/arm: introduce new xen,enhanced property value

2022-09-01 Thread Julien Grall

Hi Rahul,

On 01/09/2022 10:13, Rahul Singh wrote:

Introduce a new "xen,enhanced" dom0less property value "no-xenstore" to
disable xenstore interface for dom0less guests.

Signed-off-by: Rahul Singh 
---
Changes in v3:
  - new patch in this version
---
  docs/misc/arm/device-tree/booting.txt |  4 
  xen/arch/arm/domain_build.c   | 10 +++---
  xen/arch/arm/include/asm/kernel.h |  3 +++
  3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/docs/misc/arm/device-tree/booting.txt 
b/docs/misc/arm/device-tree/booting.txt
index edef98e6d1..87f57f8889 100644
--- a/docs/misc/arm/device-tree/booting.txt
+++ b/docs/misc/arm/device-tree/booting.txt
@@ -204,6 +204,10 @@ with the following properties:
  - "disabled"
  Xen PV interfaces are disabled.
  
+- no-xenstore

+Xen PV interfaces, including grant-table will be enabled for the VM but
+xenstore will be disabled for the VM.


NIT: I would drop one of the "for the VM" as it seems to be redundant.


+
  If the xen,enhanced property is present with no value, it defaults
  to "enabled". If the xen,enhanced property is not present, PV
  interfaces are disabled.
diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
index 4b24261825..8dd9984225 100644
--- a/xen/arch/arm/domain_build.c
+++ b/xen/arch/arm/domain_build.c
@@ -3336,10 +3336,14 @@ static int __init construct_domU(struct domain *d,
   (rc == 0 && !strcmp(dom0less_enhanced, "enabled")) )
  {
  if ( hardware_domain )
-kinfo.dom0less_enhanced = true;
+kinfo.dom0less_xenstore = true;
  else
-panic("Tried to use xen,enhanced without dom0\n");
+panic("Tried to use xen,enhanced without dom0 without 
no-xenstore\n");


This is a bit hard to parse. How about:

"At the moment, Xenstore support requires dom0 to be present"


  }
+else if ( rc == 0 && !strcmp(dom0less_enhanced, "no-xenstore") )
+kinfo.dom0less_xenstore = false;
+
+kinfo.dom0less_enhanced = true;


Wouldn't this now set dom0less_enhanced unconditionally?

  
  if ( vcpu_create(d, 0) == NULL )

  return -ENOMEM;
@@ -3379,7 +3383,7 @@ static int __init construct_domU(struct domain *d,
  if ( rc < 0 )
  return rc;
  
-if ( kinfo.dom0less_enhanced )

+if ( kinfo.dom0less_xenstore )
  {
  ASSERT(hardware_domain);
  rc = alloc_xenstore_evtchn(d);
diff --git a/xen/arch/arm/include/asm/kernel.h 
b/xen/arch/arm/include/asm/kernel.h
index c4dc039b54..3d7fa94910 100644
--- a/xen/arch/arm/include/asm/kernel.h
+++ b/xen/arch/arm/include/asm/kernel.h
@@ -39,6 +39,9 @@ struct kernel_info {
  /* Enable PV drivers */
  bool dom0less_enhanced;
  
+/* Enable Xenstore */

+bool dom0less_xenstore;
+


AFAIU, it is not possible to have *_xenstore = true and *_enhanced = 
false. I think it would be clearer if ``dom0less_enhanced`` is turned to 
an enum with 3 values:

 - None
 - NOXENSTORE/BASIC
 - FULLY_ENHANCED

If we want to be future proof, I would use a field 'flags' where 
non-zero means enhanced. Each bit would indicate which features of Xen 
is exposed.


Cheers,

--
Julien Grall



Re: [PATCH v3 6/7] xen/arm: introduce xen-evtchn dom0less property

2022-09-01 Thread Julien Grall

Hi Rahul,

On 01/09/2022 10:13, Rahul Singh wrote:

Introduce a new sub-node under /chosen node to establish static event
channel communication between domains on dom0less systems.

An event channel will be created beforehand to allow the domains to
send notifications to each other.

Signed-off-by: Rahul Singh 
---
Changes in v3:
  - use device-tree used_by to find the domain id of the evtchn node.
  - add new static_evtchn_create variable in struct dt_device_node to
hold the information if evtchn is already created.
  - fix minor comments
Changes in v2:
  - no change
---
  docs/misc/arm/device-tree/booting.txt |  64 -
  xen/arch/arm/domain_build.c   | 128 ++
  xen/arch/arm/include/asm/setup.h  |   1 +
  xen/arch/arm/setup.c  |   2 +
  xen/include/xen/device_tree.h |  13 +++
  5 files changed, 207 insertions(+), 1 deletion(-)

diff --git a/docs/misc/arm/device-tree/booting.txt 
b/docs/misc/arm/device-tree/booting.txt
index 98253414b8..edef98e6d1 100644
--- a/docs/misc/arm/device-tree/booting.txt
+++ b/docs/misc/arm/device-tree/booting.txt
@@ -212,7 +212,7 @@ with the following properties:
  enable only selected interfaces.
  
  Under the "xen,domain" compatible node, one or more sub-nodes are present

-for the DomU kernel and ramdisk.
+for the DomU kernel, ramdisk and static event channel.
  
  The kernel sub-node has the following properties:
  
@@ -254,11 +254,44 @@ The ramdisk sub-node has the following properties:

  property because it will be created by the UEFI stub on boot.
  This option is needed only when UEFI boot is used.
  
+The static event channel sub-node has the following properties:

+
+- compatible
+
+"xen,evtchn"
+
+- xen,evtchn
+
+The property is tuples of two numbers
+(local-evtchn link-to-foreign-evtchn) where:
+
+local-evtchn is an integer value that will be used to allocate local port
+for a domain to send and receive event notifications to/from the remote
+domain. Maximum supported value is 2^17 for FIFO ABI and 4096 for 2L ABI.
+It is recommended to use low event channel IDs.
+
+link-to-foreign-evtchn is a single phandle to a remote evtchn to which
+local-evtchn will be connected.
  
  Example

  ===
  
  chosen {

+
+module@0 {
+compatible = "multiboot,kernel", "multiboot,module";
+xen,uefi-binary = "...";
+bootargs = "...";
+
+};


NIT: Describing this node in the example seems unnecessary.


+
+/* one sub-node per local event channel */
+ec1: evtchn@1 {
+ compatible = "xen,evtchn-v1";
+ /* local-evtchn link-to-foreign-evtchn */
+ xen,evtchn = <0xa &ec2>;
+};
+


Here you provide an example for dom0. But the position where you 
describe the binding suggests this binding only exists for domUs.


Either we duplicate the binding or we re-order the documentation to have 
common binding in a single place. My preference would be the latter.



  domU1 {
  compatible = "xen,domain";
  #address-cells = <0x2>;
@@ -277,6 +310,23 @@ chosen {
  compatible = "multiboot,ramdisk", "multiboot,module";
  reg = <0x0 0x4b00 0xff>;
  };
+
+/* one sub-node per local event channel */
+ec2: evtchn@2 {
+compatible = "xen,evtchn-v1";
+/* local-evtchn link-to-foreign-evtchn */
+xen,evtchn = <0xa &ec1>;
+};
+
+ec3: evtchn@3 {
+compatible = "xen,evtchn-v1";
+xen,evtchn = <0xb &ec5>;
+};
+
+ec4: evtchn@4 {
+compatible = "xen,evtchn-v1";
+xen,evtchn = <0xc &ec6>;
+};
  };
  
  domU2 {

@@ -296,6 +346,18 @@ chosen {
  compatible = "multiboot,ramdisk", "multiboot,module";
  reg = <0x0 0x4d00 0xff>;
  };
+
+/* one sub-node per local event channel */
+ec5: evtchn@5 {
+compatible = "xen,evtchn-v1";
+/* local-evtchn link-to-foreign-evtchn */
+xen,evtchn = <0xb &ec3>;
+};
+
+ec6: evtchn@6 {
+compatible = "xen,evtchn-v1";
+xen,evtchn = <0xd &ec4>;
+};
  };
  };
  
diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c

index 707e247f6a..4b24261825 100644
--- a/xen/arch/arm/domain_build.c
+++ b/xen/arch/arm/domain_build.c
@@ -33,6 +33,8 @@
  #include 
  #include 
  
+#define STATIC_EVTCHN_NODE_SIZE_CELLS 2

+
  static unsigned int __initdata opt_dom0_max_vcpus;
  integer_param("dom0_max_vcpus", opt_dom0_max_vcpus);
  
@@ -3052,6 +3054,131 @@ void __init evtchn_allocate(struct domain *d)

  d->arch.hvm.params[HVM_PARAM_CALLBACK_IRQ] = val;
  }
  
+static int __init get_evtchn_dt_property(const struct dt_device_node *np,

+ uint32_t *port, uint32_t *phandle)
+{
+const __be32 *prop = NULL;
+uint32_t len;
+
+pr

Re: [PATCH v3 3/7] xen/evtchn: restrict the maximum number of evtchn supported for domUs

2022-09-01 Thread Julien Grall

Hi,

On 01/09/2022 14:53, Michal Orzel wrote:

On 01/09/2022 11:13, Rahul Singh wrote:


Restrict the maximum number of evtchn supported for domUs to avoid
allocating a large amount of memory in Xen.

Set the default value of max_evtchn_port to 1023. The value of 1023
should be sufficient for domU guests because on ARM we don't bind
physical interrupts to event channels. The only use of the evtchn port
is inter-domain communications.

Following the previous discussion, I think the only missing piece is
an explanation that 1023 was chosen to follow the default behavior of libxl.


+1. The current explanation only justify why we haven't added a 
device-tree property to change the default value.




Apart from that:
Reviewed-by: Michal Orzel 


Cheers,

--
Julien Grall



Re: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and heap allocator

2022-09-01 Thread Julien Grall

Hi Stefano,

On 01/09/2022 18:08, Stefano Stabellini wrote:

Also, what happen with UEFI? Is it easy to guarantee the region will not
be used?


For now I think it is not easy to guarantee that, do you have some ideas
in mind? I think I can follow this up in the above follow-up series to improve things.


For clarity, are we worried that the region is used by the bootloader
for other things? For instance U-Boot or Tianocore placing some
firmware tables inside the range specified for xenheap?


Yes. I think it would be difficult for an admin to figure out which 
regions are used. Although they are likely (?) going to be static for a 
given UEFI/U-boot build.


My major concern is such bug can be very difficult to root cause because 
we have no safety in Xen. The most likely symptom would be random 
corruption.


Cheers,

--
Julien Grall



Re: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and heap allocator

2022-09-01 Thread Julien Grall

Hi Henry,

On 01/09/2022 17:05, Henry Wang wrote:

@@ -755,17 +779,21 @@ static void __init setup_mm(void)

   do
   {
-e = consider_modules(ram_start, ram_end,
+e = !reserved_heap ?
+consider_modules(ram_start, ram_end,
pfn_to_paddr(xenheap_pages),
- 32<<20, 0);
+ 32<<20, 0) :
+reserved_heap_end;


Not entirely related to this series. Now the assumption is the admin
will make sure that none of the reserved regions will overlap.

Do we have any tool to help the admin to verify it? If yes, can we have
a pointer in the documentation? If not, should this be done in Xen?


In the RFC we had the same discussion of this issue [1] and I think a
follow-up series might be needed to do the overlap check if we want to
do that in Xen. For the existing tool, I am thinking of ImageBuilder, but
I am curious about Stefano's opinion.



Also, what happen with UEFI? Is it easy to guarantee the region will not
be used?


For now I think it is not easy to guarantee that, do you have some ideas
in mind? I think I can follow this up in the above follow-up series to improve things.


I don't have any ideas how we can guarantee it (even when using image 
builder). I think we may have to end up checking the overlaps in Xen.







+
   if ( e )
   break;

   xenheap_pages >>= 1;
    } while ( !opt_xenheap_megabytes && xenheap_pages > 32<<(20-PAGE_SHIFT) );


-if ( ! e )
-panic("Not not enough space for xenheap\n");
+if ( ! e ||
+ ( reserved_heap && reserved_heap_pages < 32<<(20-PAGE_SHIFT) ) )
+panic("Not enough space for xenheap\n");


So on arm32, the xenheap *must* be contiguous. AFAICT,
reserved_heap_pages is the total number of pages in the heap. They may
not be contiguous. So I think this wants to be reworked so we look for
one of the region that match the definition written above the loop.


Thanks for raising this concern, I will do this in V2.





   domheap_pages = heap_pages - xenheap_pages;

@@ -810,9 +838,9 @@ static void __init setup_mm(void)
   static void __init setup_mm(void)
   {
   const struct meminfo *banks = &bootinfo.mem;
-paddr_t ram_start = ~0;
-paddr_t ram_end = 0;
-paddr_t ram_size = 0;
+paddr_t ram_start = ~0, bank_start = ~0;
+paddr_t ram_end = 0, bank_end = 0;
+paddr_t ram_size = 0, bank_size = 0;
   unsigned int i;

   init_pdx();
@@ -821,17 +849,36 @@ static void __init setup_mm(void)
* We need some memory to allocate the page-tables used for the

xenheap

* mappings. But some regions may contain memory already allocated
* for other uses (e.g. modules, reserved-memory...).
- *
+ * If reserved heap regions are properly defined, (only) add these regions
+ * in the boot allocator.
+ */
+if ( reserved_heap )
+{
+for ( i = 0 ; i < bootinfo.reserved_mem.nr_banks; i++ )
+{
+if ( bootinfo.reserved_mem.bank[i].xen_heap )
+{
+bank_start = bootinfo.reserved_mem.bank[i].start;
+bank_size = bootinfo.reserved_mem.bank[i].size;
+bank_end = bank_start + bank_size;
+
+init_boot_pages(bank_start, bank_end);
+}
+}
+}
+/*
+ * No reserved heap regions:
* For simplicity, add all the free regions in the boot allocator.
*/
-populate_boot_allocator();
+else
+populate_boot_allocator();

   total_pages = 0;

   for ( i = 0; i < banks->nr_banks; i++ )
   {


This code is now becoming quite confusing to understand. This loop is
meant to map the xenheap. If I follow your documentation, it would mean
that only the reserved region should be mapped.


Yes I think this is the same question that I raised in the scissors line of the
commit message of this patch.


Sorry I didn't notice the comment after the scissors line. This is the 
same question :)



What I intend to do is to still map the whole
RAM because of the xenheap_* variables that you mentioned in...



More confusingly, xenheap_* variables will cover the full RAM.


...here. But only adding the reserved region to the boot allocator, so the
reserved region can become the heap later on. I am wondering if we
have a clearer way to do that; any suggestions?


I think your code is correct. It only needs some renaming of the 
existing variable (maybe to directmap_*?) to make it clear the area is 
used to access the RAM easily.


Cheers,

--
Julien Grall



Re: [PATCH v6 5/9] xen/arm: Add additional reference to owner domain when the owner is allocated

2022-09-01 Thread Julien Grall

Hi Penny,

On 21/07/2022 14:21, Penny Zheng wrote:

The borrower domain will fail to get a page ref using the owner domain
during allocation, when the owner is created after the borrower.

So here, we decide to get and add the right amount of reference, which
is the number of borrowers, when the owner is allocated.

Signed-off-by: Penny Zheng 
Reviewed-by: Stefano Stabellini 


IMHO, this tag should not have been kept given...


---
v6 change:
- adapt to the change of "nr_shm_borrowers"


... this change. 'Reviewed-by' means that the person reviewed the code 
and therefore agrees with the logic. So I would only keep the tag if the 
change is trivial (including typos, coding style) and would drop it (or 
confirm with the person) otherwise.


Stefano, can you confirm you are happy that your reviewed-by tag is kept?


- add in-code comment to explain if the borrower is created first, we intend to
add pages in the P2M without reference.
---
v5 change:
- no change
---
v4 changes:
- no change
---
v3 change:
- printk rather than dprintk since it is a serious error
---
v2 change:
- new commit
---
  xen/arch/arm/domain_build.c | 60 +
  1 file changed, 60 insertions(+)

diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
index a7e95c34a7..e891e800a7 100644
--- a/xen/arch/arm/domain_build.c
+++ b/xen/arch/arm/domain_build.c
@@ -761,6 +761,30 @@ static void __init assign_static_memory_11(struct domain 
*d,
  }
  
  #ifdef CONFIG_STATIC_SHM

+static int __init acquire_nr_borrower_domain(struct domain *d,
+ paddr_t pbase, paddr_t psize,
+ unsigned long *nr_borrowers)
+{
+unsigned long bank;


NIT: AFAICT nr_banks is an "unsigned int".

Other than that:

Acked-by: Julien Grall 

Cheers,

--
Julien Grall



Re: [PATCH v3 2/2] xen/pci: replace call to is_memory_hole to pci_check_bar

2022-09-01 Thread Oleksandr



On 01.09.22 12:29, Rahul Singh wrote:

Hello Rahul


is_memory_hole was implemented for x86 but not for ARM when introduced.
Replace the is_memory_hole call with pci_check_bar, as the function should
check whether the device BAR is in a defined memory range. Also, add an
implementation for ARM, which is required for PCI passthrough.

On x86, pci_check_bar will call is_memory_hole which will check if BAR
is not overlapping with any memory region defined in the memory map.

On ARM, pci_check_bar will go through the host bridge ranges and check
if the BAR is in the range of defined ranges.

Signed-off-by: Rahul Singh 



Reviewed-by: Oleksandr Tyshchenko 


Thanks!



---
Changes in v3:
  - fix minor comments
---
  xen/arch/arm/include/asm/pci.h |  2 ++
  xen/arch/arm/pci/pci-host-common.c | 43 ++
  xen/arch/x86/include/asm/pci.h | 10 +++
  xen/drivers/passthrough/pci.c  |  8 +++---
  4 files changed, 59 insertions(+), 4 deletions(-)

diff --git a/xen/arch/arm/include/asm/pci.h b/xen/arch/arm/include/asm/pci.h
index 80a2431804..8cb46f6b71 100644
--- a/xen/arch/arm/include/asm/pci.h
+++ b/xen/arch/arm/include/asm/pci.h
@@ -126,6 +126,8 @@ int pci_host_iterate_bridges_and_count(struct domain *d,
  
  int pci_host_bridge_mappings(struct domain *d);
  
+bool pci_check_bar(const struct pci_dev *pdev, mfn_t start, mfn_t end);

+
  #else   /*!CONFIG_HAS_PCI*/
  
  struct arch_pci_dev { };

diff --git a/xen/arch/arm/pci/pci-host-common.c 
b/xen/arch/arm/pci/pci-host-common.c
index 89ef30028e..0eb121666d 100644
--- a/xen/arch/arm/pci/pci-host-common.c
+++ b/xen/arch/arm/pci/pci-host-common.c
@@ -24,6 +24,16 @@
  
  #include 
  
+/*

+ * struct to hold pci device bar.
+ */
+struct pdev_bar
+{
+mfn_t start;
+mfn_t end;
+bool is_valid;
+};
+
  /*
   * List for all the pci host bridges.
   */
@@ -363,6 +373,39 @@ int __init pci_host_bridge_mappings(struct domain *d)
  return 0;
  }
  
+static int is_bar_valid(const struct dt_device_node *dev,

+uint64_t addr, uint64_t len, void *data)
+{
+struct pdev_bar *bar_data = data;
+unsigned long s = mfn_x(bar_data->start);
+unsigned long e = mfn_x(bar_data->end);
+
+if ( (s <= e) && (s >= PFN_DOWN(addr)) && (e <= PFN_UP(addr + len - 1)) )
+bar_data->is_valid =  true;
+
+return 0;
+}
+
+bool pci_check_bar(const struct pci_dev *pdev, mfn_t start, mfn_t end)
+{
+int ret;
+const struct dt_device_node *dt_node;
+struct pdev_bar bar_data =  {
+.start = start,
+.end = end,
+.is_valid = false
+};
+
+dt_node = pci_find_host_bridge_node(pdev);
+if ( !dt_node )
+return false;
+
+ret = dt_for_each_range(dt_node, &is_bar_valid, &bar_data);
+if ( ret < 0 )
+return false;
+
+return bar_data.is_valid;
+}
  /*
   * Local variables:
   * mode: C
diff --git a/xen/arch/x86/include/asm/pci.h b/xen/arch/x86/include/asm/pci.h
index c8e1a9ecdb..f4a58c8acf 100644
--- a/xen/arch/x86/include/asm/pci.h
+++ b/xen/arch/x86/include/asm/pci.h
@@ -57,4 +57,14 @@ static always_inline bool is_pci_passthrough_enabled(void)
  
  void arch_pci_init_pdev(struct pci_dev *pdev);
  
+static inline bool pci_check_bar(const struct pci_dev *pdev,

+ mfn_t start, mfn_t end)
+{
+/*
+ * Check if BAR is not overlapping with any memory region defined
+ * in the memory map.
+ */
+return is_memory_hole(start, end);
+}
+
  #endif /* __X86_PCI_H__ */
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index cdaf5c247f..149f68bb6e 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -304,8 +304,8 @@ static void check_pdev(const struct pci_dev *pdev)
  if ( rc < 0 )
  /* Unable to size, better leave memory decoding disabled. */
  return;
-if ( size && !is_memory_hole(maddr_to_mfn(addr),
- maddr_to_mfn(addr + size - 1)) )
+if ( size && !pci_check_bar(pdev, maddr_to_mfn(addr),
+maddr_to_mfn(addr + size - 1)) )
  {
  /*
   * Return without enabling memory decoding if BAR position is not
@@ -331,8 +331,8 @@ static void check_pdev(const struct pci_dev *pdev)
  
  if ( rc < 0 )

  return;
-if ( size && !is_memory_hole(maddr_to_mfn(addr),
- maddr_to_mfn(addr + size - 1)) )
+if ( size && !pci_check_bar(pdev, maddr_to_mfn(addr),
+maddr_to_mfn(addr + size - 1)) )
  {
  printk(warn, &pdev->sbdf, "ROM ", PFN_DOWN(addr),
 PFN_DOWN(addr + size - 1));
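The containment check used by is_bar_valid() above can be exercised on its own; here is a standalone sketch with PFN_DOWN/PFN_UP reimplemented for illustration, assuming a 4K page size (PAGE_SHIFT of 12 is an assumption of this example, not taken from the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustration only: these macros mimic Xen's PFN_DOWN/PFN_UP for the
 * purpose of this example, assuming 4K pages. */
#define PAGE_SHIFT 12
#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)
#define PFN_UP(x)   (((x) + (1ULL << PAGE_SHIFT) - 1) >> PAGE_SHIFT)

/* Same predicate as in is_bar_valid(): the BAR frame range [s, e] must sit
 * inside the bridge range starting at addr with length len. */
static bool bar_in_range(uint64_t s, uint64_t e, uint64_t addr, uint64_t len)
{
    return (s <= e) && (s >= PFN_DOWN(addr)) && (e <= PFN_UP(addr + len - 1));
}
```

Note that pci_check_bar() reports success as soon as any one host-bridge range contains the BAR, which is why is_bar_valid() only ever sets is_valid to true.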


--
Regards,

Oleksandr Tyshchenko




RE: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and heap allocator

2022-09-01 Thread Stefano Stabellini
On Thu, 1 Sep 2022, Henry Wang wrote:
> > -Original Message-
> > From: Julien Grall 
> > Subject: Re: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and
> > heap allocator
> > 
> > Hi Henry,
> > 
> > On 24/08/2022 08:31, Henry Wang wrote:
> > > This commit firstly adds a global variable `reserved_heap`.
> > > This newly introduced global variable is set at the device tree
> > > parsing time if the reserved heap ranges are defined in the device
> > > tree chosen node.
> > >
> > > For Arm32, In `setup_mm`, if the reserved heap is enabled, we use
> > > the reserved heap region for both domheap and xenheap allocation.
> > >
> > > For Arm64, In `setup_mm`, if the reserved heap is enabled and used,
> > > we make sure that only these reserved heap pages are added to the
> > > boot allocator. These reserved heap pages in the boot allocator are
> > > added to the heap allocator at `end_boot_allocator()`.
> > >
> > > If the reserved heap is disabled, we stick to current page allocation
> > > strategy at boot time.
> > >
> > > Also, take the chance to correct a "double not" print in Arm32
> > > `setup_mm()`.
> > >
> > > Signed-off-by: Henry Wang 
> > > ---
> > > With reserved heap enabled, for Arm64, naming of global variables such
> > > as `xenheap_mfn_start` and `xenheap_mfn_end` seems to be ambiguous,
> > > wondering if we should rename these variables.
> > > ---
> > > Changes from RFC to v1:
> > > - Rebase on top of latest `setup_mm()` changes.
> > > - Added Arm32 logic in `setup_mm()`.
> > > ---
> > >   xen/arch/arm/bootfdt.c   |  2 +
> > >   xen/arch/arm/include/asm/setup.h |  2 +
> > >   xen/arch/arm/setup.c | 79 +---
> > >   3 files changed, 67 insertions(+), 16 deletions(-)
> > >
> > > diff --git a/xen/arch/arm/bootfdt.c b/xen/arch/arm/bootfdt.c
> > > index 33704ca487..ab73b6e212 100644
> > > --- a/xen/arch/arm/bootfdt.c
> > > +++ b/xen/arch/arm/bootfdt.c
> > > @@ -325,6 +325,8 @@ static int __init process_chosen_node(const void
> > *fdt, int node,
> > >true);
> > >   if ( rc )
> > >   return rc;
> > > +
> > > +reserved_heap = true;
> > >   }
> > >
> > >   printk("Checking for initrd in /chosen\n");
> > > diff --git a/xen/arch/arm/include/asm/setup.h
> > b/xen/arch/arm/include/asm/setup.h
> > > index e80f3d6201..00536a6d55 100644
> > > --- a/xen/arch/arm/include/asm/setup.h
> > > +++ b/xen/arch/arm/include/asm/setup.h
> > > @@ -92,6 +92,8 @@ extern struct bootinfo bootinfo;
> > >
> > >   extern domid_t max_init_domid;
> > >
> > > +extern bool reserved_heap;
> > > +
> > >   void copy_from_paddr(void *dst, paddr_t paddr, unsigned long len);
> > >
> > >   size_t estimate_efi_size(unsigned int mem_nr_banks);
> > > diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
> > > index 500307edc0..fe76cf6325 100644
> > > --- a/xen/arch/arm/setup.c
> > > +++ b/xen/arch/arm/setup.c
> > > @@ -73,6 +73,8 @@ integer_param("xenheap_megabytes",
> > opt_xenheap_megabytes);
> > >
> > >   domid_t __read_mostly max_init_domid;
> > >
> > > +bool __read_mostly reserved_heap;
> > > +
> > >   static __used void init_done(void)
> > >   {
> > >   /* Must be done past setting system_state. */
> > > @@ -699,8 +701,10 @@ static void __init populate_boot_allocator(void)
> > >   #ifdef CONFIG_ARM_32
> > >   static void __init setup_mm(void)
> > >   {
> > > -paddr_t ram_start, ram_end, ram_size, e;
> > > -unsigned long ram_pages;
> > > +paddr_t ram_start, ram_end, ram_size, e, bank_start, bank_end, bank_size;
> > > +paddr_t reserved_heap_start = ~0, reserved_heap_end = 0,
> > > +reserved_heap_size = 0;
> > > +unsigned long ram_pages, reserved_heap_pages = 0;
> > >   unsigned long heap_pages, xenheap_pages, domheap_pages;
> > >   unsigned int i;
> > >   const uint32_t ctr = READ_CP32(CTR);
> > > @@ -720,9 +724,9 @@ static void __init setup_mm(void)
> > >
> > >   for ( i = 1; i < bootinfo.mem.nr_banks; i++ )
> > >   {
> > > -paddr_t bank_start = bootinfo.mem.bank[i].start;
> > > -paddr_t bank_size = bootinfo.mem.bank[i].size;
> > > -paddr_t bank_end = bank_start + bank_size;
> > > +bank_start = bootinfo.mem.bank[i].start;
> > > +bank_size = bootinfo.mem.bank[i].size;
> > > +bank_end = bank_start + bank_size;
> > >
> > >   ram_size  = ram_size + bank_size;
> > >   ram_start = min(ram_start,bank_start);
> > > @@ -731,6 +735,25 @@ static void __init setup_mm(void)
> > >
> > >   total_pages = ram_pages = ram_size >> PAGE_SHIFT;
> > >
> > > +if ( reserved_heap )
> > > +{
> > > +for ( i = 0 ; i < bootinfo.reserved_mem.nr_banks; i++ )
> > > +{
> > > +if ( bootinfo.reserved_mem.bank[i].xen_heap )
> > > +{
> > > +bank_start = bootinfo.reserved_mem.bank[i].start;
> > > +bank_size = bootinfo.reserved_mem.bank

Re: [RFC PATCH 00/30] Code tagging framework and applications

2022-09-01 Thread Kent Overstreet
On Thu, Sep 01, 2022 at 12:05:01PM +0100, Mel Gorman wrote:
> As pointed out elsewhere, attaching to the tracepoint and recording relevant
> state is an option other than trying to parse a raw ftrace feed. For memory
> leaks, there are already tracepoints for page allocation and free that could
> be used to track allocations that are not freed at a given point in time.

Page allocation tracepoints are not sufficient for what we're trying to do here,
and a substantial amount of effort in this patchset has gone into just getting
the hooking locations right - our memory allocation interfaces are not trivial.

That's something people should keep in mind when commenting on the size of this
patchset, since that's effort that would have to be spent for /any/ complete
solution, be it tracepoint based or not.

Additionally, we need to be able to write assertions that verify that our hook
locations are correct, that allocations or frees aren't getting double counted
or missed - highly necessary given the maze of nested memory allocation
interfaces we have (i.e. slab.h), and it's something a tracepoint based
implementation would have to account for - otherwise, a tool isn't very useful
if you can't trust the numbers it's giving you.

And then you have to correlate the allocate and free events, so that you know
which allocate callsite to decrement the amount freed from.

How would you plan on doing that with tracepoints?
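For context, correlating the two event streams means keeping per-pointer state between the allocate and free events. A toy sketch of the bookkeeping a tracepoint consumer would need (made-up event handlers and a fixed-size table, not the actual tracepoint payloads or any real tool):

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: remember which callsite allocated each pointer so the free
 * event can decrement the right per-callsite counter. Illustrative only. */
#define MAX_LIVE 64

struct live { void *ptr; int callsite; };

static struct live table[MAX_LIVE];
static long bytes_by_site[8];

static void on_alloc(void *ptr, int callsite, long bytes)
{
    for (int i = 0; i < MAX_LIVE; i++)
        if (!table[i].ptr) {
            table[i] = (struct live){ ptr, callsite };
            bytes_by_site[callsite] += bytes;
            return;
        }
}

static void on_free(void *ptr, long bytes)
{
    for (int i = 0; i < MAX_LIVE; i++)
        if (table[i].ptr == ptr) {
            bytes_by_site[table[i].callsite] -= bytes;
            table[i].ptr = NULL;
            return;
        }
}
```

In a real consumer this table would have to survive arbitrarily long alloc/free gaps and scale to millions of live objects, which is part of the cost being debated here.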

> There is also the kernel memory leak detector although I never had reason
> to use it (https://www.kernel.org/doc/html/v6.0-rc3/dev-tools/kmemleak.html)
> and it sounds like it would be expensive.

Kmemleak is indeed expensive, and in the past I've had issues with it not
catching everything (I've noticed the kmemleak annotations growing, so maybe
this is less of an issue than it was).

And this is a more complete solution (though not something that could strictly
replace kmemleak): strict memory leaks aren't the only issue, it's also drivers
unexpectedly consuming more memory than expected.

I'll bet you a beer that when people have had this awhile, we're going to have a
bunch of bugs discovered and fixed along the lines of "oh hey, this driver
wasn't supposed to be using this 1 MB of memory, I never noticed that before".

> > > It's also unclear *who* would enable this. It looks like it would mostly
> > > have value during the development stage of an embedded platform to track
> > > kernel memory usage on a per-application basis in an environment where it
> > > may be difficult to setup tracing and tracking. Would it ever be enabled
> > > in production? Would a distribution ever enable this? If it's enabled, any
> > > overhead cannot be disabled/enabled at run or boot time so anyone enabling
> > > this would carry the cost without ever necessarily consuming the data.
> > 
> > The whole point of this is to be cheap enough to enable in production -
> > especially the latency tracing infrastructure. There's a lot of value to
> > always-on system visibility infrastructure, so that when a live machine 
> > starts
> > to do something wonky the data is already there.
> > 
> 
> Sure, there is value but nothing stops the tracepoints being attached as
> a boot-time service where interested. For latencies, there is already
> bpf examples for tracing individual function latency over time e.g.
> https://github.com/iovisor/bcc/blob/master/tools/funclatency.py although
> I haven't used it recently.

So this is cool, I'll check it out today.

Tracing of /function/ latency is definitely something you'd want tracing/kprobes
for - that's way more practical than any code tagging-based approach. And if the
output is reliable and useful I could definitely see myself using this, thank
you.

But for data collection where it makes sense to annotate in the source code
where the data collection points are, I see the code-tagging based approach as
simpler - it cuts out a whole bunch of indirection. The diffstat on the code
tagging time stats patch is

 8 files changed, 233 insertions(+), 6 deletions(-)

And that includes hooking wait.h - this is really simple, easy stuff.

The memory allocation tracking patches are more complicated because we've got a
ton of memory allocation interfaces and we're aiming for strict correctness
there - because that tool needs strict correctness in order to be useful.

> Live parsing of ftrace is possible, albeit expensive.
> https://github.com/gormanm/mmtests/blob/master/monitors/watch-highorder.pl
> tracks counts of high-order allocations and dumps a report on interrupt as
> an example of live parsing ftrace and only recording interesting state. It's
> not tracking state you are interested in but it demonstrates it is possible
> to rely on ftrace alone and monitor from userspace. It's bit-rotted but
> can be fixed with

Yeah, if this is as far as people have gotten with ftrace on memory allocations
then I don't think tracing is credible here, sorry.

> The ease of use is a criticism as there 

RE: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and heap allocator

2022-09-01 Thread Henry Wang
Hi Julien,

Thanks for your review.

> -Original Message-
> From: Julien Grall 
> Subject: Re: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and
> heap allocator
> 
> Hi Henry,
> 
> On 24/08/2022 08:31, Henry Wang wrote:
> > This commit firstly adds a global variable `reserved_heap`.
> > This newly introduced global variable is set at the device tree
> > parsing time if the reserved heap ranges are defined in the device
> > tree chosen node.
> >
> > For Arm32, In `setup_mm`, if the reserved heap is enabled, we use
> > the reserved heap region for both domheap and xenheap allocation.
> >
> > For Arm64, In `setup_mm`, if the reserved heap is enabled and used,
> > we make sure that only these reserved heap pages are added to the
> > boot allocator. These reserved heap pages in the boot allocator are
> > added to the heap allocator at `end_boot_allocator()`.
> >
> > If the reserved heap is disabled, we stick to current page allocation
> > strategy at boot time.
> >
> > Also, take the chance to correct a "double not" print in Arm32
> > `setup_mm()`.
> >
> > Signed-off-by: Henry Wang 
> > ---
> > With reserved heap enabled, for Arm64, naming of global variables such
> > as `xenheap_mfn_start` and `xenheap_mfn_end` seems to be ambiguous,
> > wondering if we should rename these variables.
> > ---
> > Changes from RFC to v1:
> > - Rebase on top of latest `setup_mm()` changes.
> > - Added Arm32 logic in `setup_mm()`.
> > ---
> >   xen/arch/arm/bootfdt.c   |  2 +
> >   xen/arch/arm/include/asm/setup.h |  2 +
> >   xen/arch/arm/setup.c | 79 +---
> >   3 files changed, 67 insertions(+), 16 deletions(-)
> >
> > diff --git a/xen/arch/arm/bootfdt.c b/xen/arch/arm/bootfdt.c
> > index 33704ca487..ab73b6e212 100644
> > --- a/xen/arch/arm/bootfdt.c
> > +++ b/xen/arch/arm/bootfdt.c
> > @@ -325,6 +325,8 @@ static int __init process_chosen_node(const void
> *fdt, int node,
> >true);
> >   if ( rc )
> >   return rc;
> > +
> > +reserved_heap = true;
> >   }
> >
> >   printk("Checking for initrd in /chosen\n");
> > diff --git a/xen/arch/arm/include/asm/setup.h
> b/xen/arch/arm/include/asm/setup.h
> > index e80f3d6201..00536a6d55 100644
> > --- a/xen/arch/arm/include/asm/setup.h
> > +++ b/xen/arch/arm/include/asm/setup.h
> > @@ -92,6 +92,8 @@ extern struct bootinfo bootinfo;
> >
> >   extern domid_t max_init_domid;
> >
> > +extern bool reserved_heap;
> > +
> >   void copy_from_paddr(void *dst, paddr_t paddr, unsigned long len);
> >
> >   size_t estimate_efi_size(unsigned int mem_nr_banks);
> > diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
> > index 500307edc0..fe76cf6325 100644
> > --- a/xen/arch/arm/setup.c
> > +++ b/xen/arch/arm/setup.c
> > @@ -73,6 +73,8 @@ integer_param("xenheap_megabytes",
> opt_xenheap_megabytes);
> >
> >   domid_t __read_mostly max_init_domid;
> >
> > +bool __read_mostly reserved_heap;
> > +
> >   static __used void init_done(void)
> >   {
> >   /* Must be done past setting system_state. */
> > @@ -699,8 +701,10 @@ static void __init populate_boot_allocator(void)
> >   #ifdef CONFIG_ARM_32
> >   static void __init setup_mm(void)
> >   {
> > -paddr_t ram_start, ram_end, ram_size, e;
> > -unsigned long ram_pages;
> > +paddr_t ram_start, ram_end, ram_size, e, bank_start, bank_end, bank_size;
> > +paddr_t reserved_heap_start = ~0, reserved_heap_end = 0,
> > +reserved_heap_size = 0;
> > +unsigned long ram_pages, reserved_heap_pages = 0;
> >   unsigned long heap_pages, xenheap_pages, domheap_pages;
> >   unsigned int i;
> >   const uint32_t ctr = READ_CP32(CTR);
> > @@ -720,9 +724,9 @@ static void __init setup_mm(void)
> >
> >   for ( i = 1; i < bootinfo.mem.nr_banks; i++ )
> >   {
> > -paddr_t bank_start = bootinfo.mem.bank[i].start;
> > -paddr_t bank_size = bootinfo.mem.bank[i].size;
> > -paddr_t bank_end = bank_start + bank_size;
> > +bank_start = bootinfo.mem.bank[i].start;
> > +bank_size = bootinfo.mem.bank[i].size;
> > +bank_end = bank_start + bank_size;
> >
> >   ram_size  = ram_size + bank_size;
> >   ram_start = min(ram_start,bank_start);
> > @@ -731,6 +735,25 @@ static void __init setup_mm(void)
> >
> >   total_pages = ram_pages = ram_size >> PAGE_SHIFT;
> >
> > +if ( reserved_heap )
> > +{
> > +for ( i = 0 ; i < bootinfo.reserved_mem.nr_banks; i++ )
> > +{
> > +if ( bootinfo.reserved_mem.bank[i].xen_heap )
> > +{
> > +bank_start = bootinfo.reserved_mem.bank[i].start;
> > +bank_size = bootinfo.reserved_mem.bank[i].size;
> > +bank_end = bank_start + bank_size;
> > +
> > +reserved_heap_size += bank_size;
> > +reserved_heap_start = min(reserved_heap_start, bank_start);
> > +reserved_heap_end =

Re: [RFC PATCH 00/30] Code tagging framework and applications

2022-09-01 Thread Kent Overstreet
On Thu, Sep 01, 2022 at 05:07:06PM +0200, David Hildenbrand wrote:
> Skimming over the patches (that I was CCed on) and skimming over the
> cover letter, I got the impression that everything after patch 7 is
> introducing something new instead of refactoring something out.

You skimmed over the dynamic debug patch then...



RE: Push 4.17 Feature freeze and Code freeze one week later

2022-09-01 Thread Henry Wang
Hi Bertrand,

> -Original Message-
> From: Bertrand Marquis 
> Subject: Push 4.17 Feature freeze and Code freeze one week later
> 
> Hi,
> 
> Seeing that we have lots of series quite advanced in review but not already
> merged, could we push
> both Feature freeze and Code freeze deadline one week later to have a
> chance to finish those ?

I am good with that. The updated schedule now would be:

- Feature freeze Fri Sep 9, 2022 
- Code freezeFri Sep 30, 2022 (+3 weeks from Feature freeze)
- Hard code freeze   Fri Oct 14, 2022 (+2 weeks from Code freeze)
- Final commits  Fri Oct 28, 2022 (+2 weeks from Hard code freeze)
- ReleaseWed Nov 2, 2022

The planned RCs would then be (1 per week as soon as the code freeze):
- RC1: Fri Oct 7, 2022.
- RC2: Fri Oct 14, 2022.
- RC3: Fri Oct 21, 2022.
- RC4 (Final): Fri Oct 28, 2022.

Please raise concerns and questions about above schedule and we can have
a discussion.

Kind regards,
Henry

> 
> Cheers
> Bertrand
> 




Re: [PATCH v6 1/9] xen/arm: introduce static shared memory

2022-09-01 Thread Bertrand Marquis
Hi,

> On 1 Sep 2022, at 16:40, Julien Grall  wrote:
> 
> Hi Penny,
> 
> On 29/08/2022 07:57, Penny Zheng wrote:
>>> -Original Message-
>>> From: Julien Grall 
>>> Sent: Friday, August 26, 2022 9:17 PM
>>> To: Penny Zheng ; xen-devel@lists.xenproject.org
>>> Cc: Stefano Stabellini ; Bertrand Marquis
>>> ; Volodymyr Babchuk
>>> 
>>> Subject: Re: [PATCH v6 1/9] xen/arm: introduce static shared memory
>>> 
>>> Hi Penny,
>>> 
>> Hi Julien
>>  
>>> On 21/07/2022 14:21, Penny Zheng wrote:
 From: Penny Zheng 
 
 This patch series introduces a new feature: setting up static shared
 memory on a dom0less system, through device tree configuration.
 
This commit parses the shared memory node at boot-time, and reserves it in
 bootinfo.reserved_mem to avoid other use.
 
This commit proposes a new Kconfig CONFIG_STATIC_SHM to wrap
 static-shm-related codes, and this option depends on static memory(
CONFIG_STATIC_MEMORY). That's because later we want to reuse a
 few helpers, guarded with CONFIG_STATIC_MEMORY, like
 acquire_staticmem_pages, etc, on static shared memory.
 
 Signed-off-by: Penny Zheng 
 ---
 v6 change:
- when host physical address is omitted, output the error message
 since xen doesn't support it at the moment
 - add the following check: 1) The shm ID matches and the region
 exactly match
 2) The shm ID doesn't match and the region doesn't overlap
 - change it to "unsigned int" to be aligned with nr_banks
- check the len of the property to confirm it is big enough to contain
 "paddr", "size", and "gaddr"
 - shm_id defined before nr_shm_domain, so we could re-use the existing
 hole and avoid increasing the size of the structure.
 - change "nr_shm_domain" to "nr_shm_borrowers", to not increment if
 the role is owner in parsing code
 - make "xen,shm_id" property as arbitrary string, with a strict limit
 on the number of characters, MAX_SHM_ID_LENGTH
 ---
 v5 change:
 - no change
 ---
 v4 change:
 - nit fix on doc
 ---
 v3 change:
 - make nr_shm_domain unsigned int
 ---
 v2 change:
 - document refinement
 - remove bitmap and use the iteration to check
 - add a new field nr_shm_domain to keep the number of shared domain
 ---
   docs/misc/arm/device-tree/booting.txt | 124 
   xen/arch/arm/Kconfig  |   6 +
   xen/arch/arm/bootfdt.c| 157 ++
   xen/arch/arm/include/asm/setup.h  |   9 ++
   4 files changed, 296 insertions(+)
 
 diff --git a/docs/misc/arm/device-tree/booting.txt
 b/docs/misc/arm/device-tree/booting.txt
 index 98253414b8..8013fb98fe 100644
 --- a/docs/misc/arm/device-tree/booting.txt
 +++ b/docs/misc/arm/device-tree/booting.txt
 @@ -378,3 +378,127 @@ device-tree:
 
   This will reserve a 512MB region starting at the host physical address
   0x3000 to be exclusively used by DomU1.
 +
 +Static Shared Memory
 +
 +
 +The static shared memory device tree nodes allow users to statically
 +set up shared memory on dom0less system, enabling domains to do
 +shm-based communication.
 +
 +- compatible
 +
 +"xen,domain-shared-memory-v1"
 +
 +- xen,shm-id
 +
 +An arbitrary string that represents the unique identifier of the shared
 +memory region, with a strict limit on the number of characters(\0 included),
 +`MAX_SHM_ID_LENGTH(16)`. e.g. "xen,shm-id = "my-shared-mem-1"".
 +
 +- xen,shared-mem
 +
 +An array takes a physical address, which is the base address of the
 +shared memory region in host physical address space, a size, and a guest
 +physical address, as the target address of the mapping.
 +e.g. xen,shared-mem = < [host physical address] [size] [guest
 + address] >
>>> 
>>> Your implementation below is checking for overlap and also have some
>>> restriction. Can they be documented in the binding?
>>> 
 +
 +The number of cells for the host address (and size) is the same as the
 +guest pseudo-physical address and they are inherited from the parent node.
>>> 
>>> In v5, we discussed to have the host address optional. However, the binding
>>> has not been updated to reflect that. Note that I am not asking to 
>>> implement,
>>> but instead request that the binding can be used for such setup.
>>> 
>> How about:
>> "
The host physical address can be omitted by users, letting Xen decide where it
is located.
>> "
> 
> I am fine with that.
> 
Do you think I shall further point out that, right now, this part of the
feature is not implemented in the code?
> 
I have made a couple of suggestions for the code. But I think the binding 
> would look a bit odd without the host physical address. We woul

Re: [RFC PATCH 03/30] Lazy percpu counters

2022-09-01 Thread Kent Overstreet
On Thu, Sep 01, 2022 at 10:48:39AM -0400, Steven Rostedt wrote:
> On Thu, 1 Sep 2022 10:32:19 -0400
> Kent Overstreet  wrote:
> 
> > On Thu, Sep 01, 2022 at 08:51:31AM +0200, Peter Zijlstra wrote:
> > > On Tue, Aug 30, 2022 at 02:48:52PM -0700, Suren Baghdasaryan wrote:  
> > > > +static void lazy_percpu_counter_switch_to_pcpu(struct 
> > > > raw_lazy_percpu_counter *c)
> > > > +{
> > > > +   u64 __percpu *pcpu_v = alloc_percpu_gfp(u64, 
> > > > GFP_ATOMIC|__GFP_NOWARN);  
> > > 
> > > Realize that this is incorrect when used under a raw_spinlock_t.  
> > 
> > Can you elaborate?
> 
> All allocations (including GFP_ATOMIC) grab normal spin_locks. When
> PREEMPT_RT is configured, normal spin_locks turn into a mutex, where as
> raw_spinlock's do not.
> 
> Thus, if this is done within a raw_spinlock with PREEMPT_RT configured, it
> can cause a schedule while holding a spinlock.

Thanks, I think we should be good here but I'll document it anyways.



Re: [PATCH v6 1/9] xen/arm: introduce static shared memory

2022-09-01 Thread Julien Grall

Hi Penny,

On 29/08/2022 07:57, Penny Zheng wrote:

-Original Message-
From: Julien Grall 
Sent: Friday, August 26, 2022 9:17 PM
To: Penny Zheng ; xen-devel@lists.xenproject.org
Cc: Stefano Stabellini ; Bertrand Marquis
; Volodymyr Babchuk

Subject: Re: [PATCH v6 1/9] xen/arm: introduce static shared memory

Hi Penny,



Hi Julien
  

On 21/07/2022 14:21, Penny Zheng wrote:

From: Penny Zheng 

This patch series introduces a new feature: setting up static shared
memory on a dom0less system, through device tree configuration.

This commit parses the shared memory node at boot-time, and reserves it in
bootinfo.reserved_mem to avoid other use.

This commit proposes a new Kconfig CONFIG_STATIC_SHM to wrap
static-shm-related codes, and this option depends on static memory(
CONFIG_STATIC_MEMORY). That's because later we want to reuse a
few helpers, guarded with CONFIG_STATIC_MEMORY, like
acquire_staticmem_pages, etc, on static shared memory.

Signed-off-by: Penny Zheng 
---
v6 change:
- when host physical address is omitted, output the error message
since xen doesn't support it at the moment
- add the following check: 1) The shm ID matches and the region
exactly match
2) The shm ID doesn't match and the region doesn't overlap
- change it to "unsigned int" to be aligned with nr_banks
- check the len of the property to confirm it is big enough to contain
"paddr", "size", and "gaddr"
- shm_id defined before nr_shm_domain, so we could re-use the existing
hole and avoid increasing the size of the structure.
- change "nr_shm_domain" to "nr_shm_borrowers", to not increment if
the role is owner in parsing code
- make "xen,shm_id" property as arbitrary string, with a strict limit
on the number of characters, MAX_SHM_ID_LENGTH
---
v5 change:
- no change
---
v4 change:
- nit fix on doc
---
v3 change:
- make nr_shm_domain unsigned int
---
v2 change:
- document refinement
- remove bitmap and use the iteration to check
- add a new field nr_shm_domain to keep the number of shared domain
---
   docs/misc/arm/device-tree/booting.txt | 124 
   xen/arch/arm/Kconfig  |   6 +
   xen/arch/arm/bootfdt.c| 157 ++
   xen/arch/arm/include/asm/setup.h  |   9 ++
   4 files changed, 296 insertions(+)

diff --git a/docs/misc/arm/device-tree/booting.txt
b/docs/misc/arm/device-tree/booting.txt
index 98253414b8..8013fb98fe 100644
--- a/docs/misc/arm/device-tree/booting.txt
+++ b/docs/misc/arm/device-tree/booting.txt
@@ -378,3 +378,127 @@ device-tree:

   This will reserve a 512MB region starting at the host physical address
   0x3000 to be exclusively used by DomU1.
+
+Static Shared Memory
+
+
+The static shared memory device tree nodes allow users to statically
+set up shared memory on dom0less system, enabling domains to do
+shm-based communication.
+
+- compatible
+
+"xen,domain-shared-memory-v1"
+
+- xen,shm-id
+
+An arbitrary string that represents the unique identifier of the shared
+memory region, with a strict limit on the number of characters(\0 included),
+`MAX_SHM_ID_LENGTH(16)`. e.g. "xen,shm-id = "my-shared-mem-1"".
+
+- xen,shared-mem
+
+An array takes a physical address, which is the base address of the
+shared memory region in host physical address space, a size, and a guest
+physical address, as the target address of the mapping.
+e.g. xen,shared-mem = < [host physical address] [size] [guest address] >


Your implementation below checks for overlap and also has some
restrictions. Can they be documented in the binding?


+
+The number of cells for the host address (and size) is the same as the
+guest pseudo-physical address and they are inherited from the parent node.
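Putting the two properties together, a node for the binding as currently documented might look like this (node name, addresses, and sizes are purely illustrative):

    domU1 {
        ...
        domU1-shared-mem@50000000 {
            compatible = "xen,domain-shared-memory-v1";
            xen,shm-id = "my-shared-mem-1";
            /* < [host physical address] [size] [guest address] > */
            xen,shared-mem = <0x50000000 0x20000000 0x7f000000>;
        };
    };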

In v5, we discussed making the host address optional. However, the binding
has not been updated to reflect that. Note that I am not asking you to
implement it, but instead requesting that the binding can be used for such a
setup.



How about:
"
The host physical address can be omitted by users, letting Xen decide
where the region is located.
"


I am fine with that.


Do you think I should further point out that, right now, this part of the
feature is not implemented in the code?


I have made a couple of suggestions for the code. But I think the binding
would look a bit odd without the host physical address. We would now have:


< [size] [guest address]>

I think it would be more natural if we had

<[guest address] [size]>

And

<[guest address] [size] [host physical address]>




a/xen/arch/arm/include/asm/setup.h

b/xen/arch/arm/include/asm/setup.h

index 2bb01ecfa8..39d4e93b8b 100644
--- a/xen/arch/arm/include/asm/setup.h
+++ b/xen/arch/arm/include/asm/setup.h
@@ -23,10 +23,19 @@ typedef enum {
   }  bootmodule_kind;


+#ifdef CONFIG_STATIC_SHM
+/* Indicates the maximum number of characters (\0 included) for shm_id */
+#define MAX_SHM_ID_LENGTH 16
+#endif


Is the #ifdef really needed?


+
   struct membank {
   paddr_t start;
 

Re: [RFC PATCH 00/30] Code tagging framework and applications

2022-09-01 Thread Suren Baghdasaryan
On Thu, Sep 1, 2022 at 8:07 AM David Hildenbrand  wrote:
>
> On 01.09.22 16:23, Kent Overstreet wrote:
> > On Thu, Sep 01, 2022 at 10:05:03AM +0200, David Hildenbrand wrote:
> >> On 31.08.22 21:01, Kent Overstreet wrote:
> >>> On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote:
>  On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> > Whatever asking for an explanation as to why equivalent functionality
> > cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable.
> 
>  Fully agreed and this is especially true for a change this size
>  77 files changed, 3406 insertions(+), 703 deletions(-)
> >>>
> >>> In the case of memory allocation accounting, you flat cannot do this with 
> >>> ftrace
> >>> - you could maybe do a janky version that isn't fully accurate, much 
> >>> slower,
> >>> more complicated for the developer to understand and debug and more 
> >>> complicated
> >>> for the end user.
> >>>
> >>> But please, I invite anyone who's actually been doing this with ftrace to
> >>> demonstrate otherwise.
> >>>
> >>> Ftrace just isn't the right tool for the job here - we're talking about 
> >>> adding
> >>> per callsite accounting to some of the fastest fast paths in the kernel.
> >>>
> >>> And the size of the changes for memory allocation accounting are much more
> >>> reasonable:
> >>>  33 files changed, 623 insertions(+), 99 deletions(-)
> >>>
> >>> The code tagging library should exist anyways, it's been open coded half 
> >>> a dozen
> >>> times in the kernel already.
> >>
> >> Hi Kent,
> >>
> >> independent of the other discussions, if it's open coded already, does
> >> it make sense to factor that already-open-coded part out independently
> >> of the remainder of the full series here?
> >
> > It's discussed in the cover letter, that is exactly how the patch series is
> > structured.
>
> Skimming over the patches (that I was CCed on) and skimming over the
> cover letter, I got the impression that everything after patch 7 is
> introducing something new instead of refactoring something out.

Hi David,
Yes, you are right, the RFC does incorporate lots of parts which can
be considered separately. They are sent together to present the
overall scope of the proposal but I do intend to send them separately
once we decide if it's worth working on.
Thanks,
Suren.

>
> >
> >> [I didn't immediately spot if this series also attempts already to
> >> replace that open-coded part]
> >
> > Uh huh.
> >
> > Honestly, some days it feels like lkml is just as bad as slashdot, with 
> > people
> > wanting to get in their two cents without actually reading...
>
> ... and of course you had to reply like that. I should just have learned
> from my last upstream experience with you and kept you on my spam list.
>
> Thanks, bye
>
> --
> Thanks,
>
> David / dhildenb
>



[PATCH] xen/grants: prevent integer overflow in gnttab_dma_alloc_pages()

2022-09-01 Thread Dan Carpenter
The change from kcalloc() to kvcalloc() means that args->nr_pages
might now be large enough that "args->nr_pages << PAGE_SHIFT" can
result in an integer overflow.

Fixes: b3f7931f5c61 ("xen/gntdev: switch from kcalloc() to kvcalloc()")
Signed-off-by: Dan Carpenter 
---
 drivers/xen/grant-table.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
index 738029de3c67..e1ec725c2819 100644
--- a/drivers/xen/grant-table.c
+++ b/drivers/xen/grant-table.c
@@ -1047,6 +1047,9 @@ int gnttab_dma_alloc_pages(struct gnttab_dma_alloc_args 
*args)
size_t size;
int i, ret;
 
+   if (args->nr_pages < 0 || args->nr_pages > (INT_MAX >> PAGE_SHIFT))
+   return -ENOMEM;
+
size = args->nr_pages << PAGE_SHIFT;
if (args->coherent)
args->vaddr = dma_alloc_coherent(args->dev, size,
-- 
2.35.1




Re: [RFC PATCH 00/30] Code tagging framework and applications

2022-09-01 Thread Suren Baghdasaryan
On Thu, Sep 1, 2022 at 12:18 AM Michal Hocko  wrote:
>
> On Wed 31-08-22 15:01:54, Kent Overstreet wrote:
> > On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote:
> > > On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> > > > Whatever asking for an explanation as to why equivalent functionality
> > > > cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable.
> > >
> > > Fully agreed and this is especially true for a change this size
> > > 77 files changed, 3406 insertions(+), 703 deletions(-)
> >
> > In the case of memory allocation accounting, you flat cannot do this with 
> > ftrace
> > - you could maybe do a janky version that isn't fully accurate, much slower,
> > more complicated for the developer to understand and debug and more 
> > complicated
> > for the end user.
> >
> > But please, I invite anyone who's actually been doing this with ftrace to
> > demonstrate otherwise.
> >
> > Ftrace just isn't the right tool for the job here - we're talking about 
> > adding
> > per callsite accounting to some of the fastest fast paths in the kernel.
> >
> > And the size of the changes for memory allocation accounting are much more
> > reasonable:
> >  33 files changed, 623 insertions(+), 99 deletions(-)
> >
> > The code tagging library should exist anyways, it's been open coded half a 
> > dozen
> > times in the kernel already.
> >
> > And once we've got that, the time stats code is _also_ far simpler than 
> > doing it
> > with ftrace would be. If anyone here has successfully debugged latency 
> > issues
> > with ftrace, I'd really like to hear it. Again, for debugging latency 
> > issues you
> > want something that can always be on, and that's not cheap with ftrace - and
> > never mind the hassle of correlating start and end wait trace events, 
> > builting
> > up histograms, etc. - that's all handled here.
> >
> > Cheap, simple, easy to use. What more could you want?
>
> A big ad on a banner. But more seriously.
>
> This patchset is _huge_ and touching a lot of different areas. It will
> be not only hard to review but even harder to maintain longterm. So
> it is completely reasonable to ask for potential alternatives with a
> smaller code footprint. I am pretty sure you are aware of that workflow.

The patchset is huge because it introduces a reusable part (the first
6 patches introducing code tagging) and 6 different applications in
very different areas of the kernel. We wanted to present all of them
in the RFC to show the variety of cases this mechanism can be reused
for. If the code tagging is accepted, each application can be posted
separately to the appropriate group of people. Hopefully that makes it
easier to review. Those first 6 patches are not that big and are quite
isolated IMHO:

 include/linux/codetag.h |  83 ++
 include/linux/lazy-percpu-counter.h |  67 
 include/linux/module.h  |   1 +
 kernel/module/internal.h|   1 -
 kernel/module/main.c|   4 +
 lib/Kconfig |   3 +
 lib/Kconfig.debug   |   4 +
 lib/Makefile|   3 +
 lib/codetag.c   | 248 
 lib/lazy-percpu-counter.c   | 141 
 lib/string_helpers.c|   3 +-
 scripts/kallsyms.c  |  13 ++

>
> So I find Peter's question completely appropriate while your response to
> that not so much! Maybe ftrace is not the right tool for the intented
> job. Maybe there are other ways and it would be really great to show
> that those have been evaluated and they are not suitable for a), b) and
> c) reasons.

That's fair.
For memory tracking I looked into using kmemleak and page_owner which
can't match the required functionality at an overhead acceptable for
production and pre-production testing environments. traces + BPF I
haven't evaluated myself but heard from other members of my team who
tried using that in production environment with poor results. I'll try
to get more specific information on that.

>
> E.g. Oscar has been working on extending page_ext to track number of
> allocations for specific calltrace[1]. Is this 1:1 replacement? No! But
> it can help in environments where page_ext can be enabled and it is
> completely non-intrusive to the MM code.

Thanks for pointing out this work. I'll need to review and maybe
profile it before making any claims.

>
> If the page_ext overhead is not desirable/acceptable then I am sure
> there are other options. E.g. kprobes/LivePatching framework can hook
> into functions and alter their behavior. So why not use that for data
> collection? Has this been evaluated at all?

I'm not sure how I can hook into say alloc_pages() to find out where
it was called from without capturing the call stack (which would
introduce an overhead at every allocation). Would love to discuss this
or other alternatives if they can be done with low enough overhead.
Thanks,
Suren.

>
> And please

Re: Push 4.17 Feature freeze and Code freeze one week later

2022-09-01 Thread Stefano Stabellini
On Thu, 1 Sep 2022, Bertrand Marquis wrote:
> Hi,
> 
> Seeing that we have lots of series quite advanced in review but not already
> merged, could we push both the Feature freeze and Code freeze deadlines one
> week later to have a chance to finish those?

+1

On the ARM side we have a higher-than-usual number of patch series still
to commit. This is different from past times when people rushed at the
last week many patch series, completely new and unreviewed. These series
went through the review process properly, they are in good condition,
and are typically at v4 or above. They only miss one last round of
review.

I think we should give them a chance to finish, so it would be great to
have 1 more week.



Re: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and heap allocator

2022-09-01 Thread Julien Grall

Hi Henry,

On 24/08/2022 08:31, Henry Wang wrote:

This commit firstly adds a global variable `reserved_heap`.
This newly introduced global variable is set at the device tree
parsing time if the reserved heap ranges are defined in the device
tree chosen node.

For Arm32, in `setup_mm`, if the reserved heap is enabled, we use
the reserved heap region for both domheap and xenheap allocation.

For Arm64, in `setup_mm`, if the reserved heap is enabled and used,
we make sure that only these reserved heap pages are added to the
boot allocator. These reserved heap pages in the boot allocator are
added to the heap allocator at `end_boot_allocator()`.

If the reserved heap is disabled, we stick to current page allocation
strategy at boot time.

Also, take the chance to correct a "double not" print in Arm32
`setup_mm()`.

Signed-off-by: Henry Wang 
---
With reserved heap enabled, for Arm64, naming of global variables such
as `xenheap_mfn_start` and `xenheap_mfn_end` seems to be ambiguous,
wondering if we should rename these variables.
---
Changes from RFC to v1:
- Rebase on top of latest `setup_mm()` changes.
- Added Arm32 logic in `setup_mm()`.
---
  xen/arch/arm/bootfdt.c   |  2 +
  xen/arch/arm/include/asm/setup.h |  2 +
  xen/arch/arm/setup.c | 79 +---
  3 files changed, 67 insertions(+), 16 deletions(-)

diff --git a/xen/arch/arm/bootfdt.c b/xen/arch/arm/bootfdt.c
index 33704ca487..ab73b6e212 100644
--- a/xen/arch/arm/bootfdt.c
+++ b/xen/arch/arm/bootfdt.c
@@ -325,6 +325,8 @@ static int __init process_chosen_node(const void *fdt, int 
node,
   true);
  if ( rc )
  return rc;
+
+reserved_heap = true;
  }
  
  printk("Checking for initrd in /chosen\n");

diff --git a/xen/arch/arm/include/asm/setup.h b/xen/arch/arm/include/asm/setup.h
index e80f3d6201..00536a6d55 100644
--- a/xen/arch/arm/include/asm/setup.h
+++ b/xen/arch/arm/include/asm/setup.h
@@ -92,6 +92,8 @@ extern struct bootinfo bootinfo;
  
  extern domid_t max_init_domid;
  
+extern bool reserved_heap;
+
  void copy_from_paddr(void *dst, paddr_t paddr, unsigned long len);
  
  size_t estimate_efi_size(unsigned int mem_nr_banks);

diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
index 500307edc0..fe76cf6325 100644
--- a/xen/arch/arm/setup.c
+++ b/xen/arch/arm/setup.c
@@ -73,6 +73,8 @@ integer_param("xenheap_megabytes", opt_xenheap_megabytes);
  
  domid_t __read_mostly max_init_domid;
  
+bool __read_mostly reserved_heap;
+
  static __used void init_done(void)
  {
  /* Must be done past setting system_state. */
@@ -699,8 +701,10 @@ static void __init populate_boot_allocator(void)
  #ifdef CONFIG_ARM_32
  static void __init setup_mm(void)
  {
-    paddr_t ram_start, ram_end, ram_size, e;
-    unsigned long ram_pages;
+    paddr_t ram_start, ram_end, ram_size, e, bank_start, bank_end, bank_size;
+    paddr_t reserved_heap_start = ~0, reserved_heap_end = 0,
+            reserved_heap_size = 0;
+    unsigned long ram_pages, reserved_heap_pages = 0;
  unsigned long heap_pages, xenheap_pages, domheap_pages;
  unsigned int i;
  const uint32_t ctr = READ_CP32(CTR);
@@ -720,9 +724,9 @@ static void __init setup_mm(void)
  
  for ( i = 1; i < bootinfo.mem.nr_banks; i++ )

  {
-paddr_t bank_start = bootinfo.mem.bank[i].start;
-paddr_t bank_size = bootinfo.mem.bank[i].size;
-paddr_t bank_end = bank_start + bank_size;
+bank_start = bootinfo.mem.bank[i].start;
+bank_size = bootinfo.mem.bank[i].size;
+bank_end = bank_start + bank_size;
  
  ram_size  = ram_size + bank_size;

  ram_start = min(ram_start,bank_start);
@@ -731,6 +735,25 @@ static void __init setup_mm(void)
  
  total_pages = ram_pages = ram_size >> PAGE_SHIFT;
  
+    if ( reserved_heap )
+    {
+        for ( i = 0 ; i < bootinfo.reserved_mem.nr_banks; i++ )
+        {
+            if ( bootinfo.reserved_mem.bank[i].xen_heap )
+            {
+                bank_start = bootinfo.reserved_mem.bank[i].start;
+                bank_size = bootinfo.reserved_mem.bank[i].size;
+                bank_end = bank_start + bank_size;
+
+                reserved_heap_size += bank_size;
+                reserved_heap_start = min(reserved_heap_start, bank_start);
+                reserved_heap_end = max(reserved_heap_end, bank_end);
+            }
+        }
+
+        reserved_heap_pages = reserved_heap_size >> PAGE_SHIFT;
+    }
+
  /*
   * If the user has not requested otherwise via the command line
   * then locate the xenheap using these constraints:
@@ -743,7 +766,8 @@ static void __init setup_mm(void)
   * We try to allocate the largest xenheap possible within these
   * constraints.
   */
-    heap_pages = ram_pages;
+    heap_pages = !reserved_heap ? ram_pages : reserved_heap_pages;
+
  if ( opt_xenheap_megabytes )
  xenheap_pages = opt_xenheap_megabytes << 

[ovmf test] 172912: regressions - FAIL

2022-09-01 Thread osstest service owner
flight 172912 ovmf real [real]
http://logs.test-lab.xenproject.org/osstest/logs/172912/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-i386-libvirt6 libvirt-buildfail REGR. vs. 172136
 build-amd64-libvirt   6 libvirt-buildfail REGR. vs. 172136

version targeted for testing:
 ovmf 033ba8bb2976629fdb664d7131f44f8b0b8f6777
baseline version:
 ovmf 444260d45ec2a84e8f8c192b3539a3cd5591d009

Last test of basis   172136  2022-08-04 06:43:42 Z   28 days
Failing since172151  2022-08-05 02:40:28 Z   27 days  219 attempts
Testing same since   172912  2022-09-01 11:42:11 Z0 days1 attempts


People who touched revisions under test:
  Abdul Lateef Attar 
  Abner Chang 
  Ard Biesheuvel 
  Bob Feng 
  Chasel Chiu 
  Chen, Xiao X 
  Czajkowski, Maciej 
  Dimitrije Pavlov 
  Dun Tan 
  Edward Pickup 
  Foster Nong 
  Gregx Yeh 
  Guo Dong 
  Igor Kulchytskyy 
  James Lu 
  Jiaxin Wu 
  Jose Marinho 
  KasimX Liu 
  Kavya 
  Konstantin Aladyshev 
  Kun Qin 
  Liming Gao 
  Liu, Zhiguang 
  Maciej Czajkowski 
  Michael D Kinney 
  Michael Kubacki 
  Ray Ni 
  Rebecca Cran 
  Sainadh Nagolu 
  Sami Mujawar 
  Shengfengx Xue 
  Wu, Jiaxin 
  Xiao X Chen 
  Yuanhao Xie 
  Zhiguang Liu 

jobs:
 build-amd64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-i386   pass
 build-amd64-libvirt  fail
 build-i386-libvirt   fail
 build-amd64-pvopspass
 build-i386-pvops pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64 pass
 test-amd64-i386-xl-qemuu-ovmf-amd64  pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Not pushing.

(No revision log; it would be 1360 lines long.)



[qemu-mainline test] 172905: regressions - FAIL

2022-09-01 Thread osstest service owner
flight 172905 qemu-mainline real [real]
http://logs.test-lab.xenproject.org/osstest/logs/172905/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-amd64-libvirt   6 libvirt-buildfail REGR. vs. 172123
 build-i386-libvirt6 libvirt-buildfail REGR. vs. 172123
 build-arm64-libvirt   6 libvirt-buildfail REGR. vs. 172123
 build-armhf-libvirt   6 libvirt-buildfail REGR. vs. 172123
 test-amd64-i386-xl-vhd   22 guest-start.2fail REGR. vs. 172123

Regressions which are regarded as allowable (not blocking):
 test-armhf-armhf-xl-rtds18 guest-start/debian.repeat fail REGR. vs. 172123

Tests which did not succeed, but are not blocking:
 test-arm64-arm64-libvirt-raw  1 build-check(1)   blocked  n/a
 test-arm64-arm64-libvirt-xsm  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt-qcow2  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt-raw  1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt-xsm   1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt-raw   1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a
 test-amd64-i386-libvirt-pair  1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt   1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-pair  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a
 test-amd64-amd64-libvirt-vhd  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-xsm  1 build-check(1)   blocked  n/a
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 172123
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 172123
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 172123
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 172123
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 172123
 test-amd64-i386-xl-pvshim14 guest-start  fail   never pass
 test-arm64-arm64-xl-seattle  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-checkfail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-checkfail never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  15 saverestore-support-checkfail   never pass

version targeted for testing:
 qemuue93ded1bf6c94ab95015b33e188bc8b0b0c32670
baseline version:
 qemuu2480f3bbd03814b0651a1f74959f5c6631ee5819

Last test 

Re: [PATCH v2 0/3] xen-blk{front,back}: Fix the broken semantic and flow of feature-persistent

2022-09-01 Thread Juergen Gross

On 31.08.22 18:58, SeongJae Park wrote:

Changes from v1
(https://lore.kernel.org/xen-devel/20220825161511.94922-1...@kernel.org/)
- Fix the wrong feature_persistent caching position of blkfront
- Set blkfront's feature_persistent field setting with simple '&&'
   instead of 'if' (Pratyush Yadav)

This patchset fixes misuse of the 'feature-persistent' advertisement
semantic (patches 1 and 2), and the wrong timing of the
'feature_persistent' value caching, which made persistent grants feature
always disabled.

SeongJae Park (3):
   xen-blkback: Advertise feature-persistent as user requested
   xen-blkfront: Advertise feature-persistent as user requested
   xen-blkfront: Cache feature_persistent value before advertisement

  drivers/block/xen-blkback/common.h |  3 +++
  drivers/block/xen-blkback/xenbus.c |  6 --
  drivers/block/xen-blkfront.c   | 20 
  3 files changed, 19 insertions(+), 10 deletions(-)



For the whole series:

Reviewed-by: Juergen Gross 


Juergen




Re: [PATCH v2 0/3] xen-blk{front,back}: Fix the broken semantic and flow of feature-persistent

2022-09-01 Thread Marek Marczykowski-Górecki
On Wed, Aug 31, 2022 at 05:08:17PM +, SeongJae Park wrote:
> On Wed, 31 Aug 2022 16:58:21 + SeongJae Park  wrote:
> 
> > Changes from v1
> > (https://lore.kernel.org/xen-devel/20220825161511.94922-1...@kernel.org/)
> > - Fix the wrong feature_persistent caching position of blkfront
> > - Set blkfront's feature_persistent field setting with simple '&&'
> >   instead of 'if' (Pratyush Yadav)
> > 
> > This patchset fixes misuse of the 'feature-persistent' advertisement
> > semantic (patches 1 and 2), and the wrong timing of the
> > 'feature_persistent' value caching, which made persistent grants feature
> > always disabled.
> 
> Please note that I have some problem in my test setup and therefore was unable
> to fully test this patchset.  I am posting this though, as the impact of the
> bug is not trivial (always disabling persistent grants), and to make testing 
> of
> my proposed fix from others easier.  Hope to get someone's test results or 
> code
> review of this patchset even before I fix my test setup problem.

I can confirm it fixes the issue, thanks!

Tested-by: Marek Marczykowski-Górecki 

> Juergen, I didn't add your 'Reviewed-by:'s to the first two patches of this
> series because I changed some of the description for making it clear which bug
> and commit it is really fixing.  Specifically, I wordsmithed the wording and
> changed the 'Fixes:' tag.  The code change is almost the same, though.
> 
> 
> Thanks,
> SJ
> 
> > 
> > SeongJae Park (3):
> >   xen-blkback: Advertise feature-persistent as user requested
> >   xen-blkfront: Advertise feature-persistent as user requested
> >   xen-blkfront: Cache feature_persistent value before advertisement
> > 
> >  drivers/block/xen-blkback/common.h |  3 +++
> >  drivers/block/xen-blkback/xenbus.c |  6 --
> >  drivers/block/xen-blkfront.c   | 20 
> >  3 files changed, 19 insertions(+), 10 deletions(-)
> > 
> > -- 
> > 2.25.1
> > 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Push 4.17 Feature freeze and Code freeze one week later

2022-09-01 Thread Bertrand Marquis
Hi,

Seeing that we have lots of series quite advanced in review but not already
merged, could we push both the Feature freeze and Code freeze deadlines one
week later to have a chance to finish those?

Cheers
Bertrand




Re: [RFC PATCH 00/30] Code tagging framework and applications

2022-09-01 Thread David Hildenbrand
On 01.09.22 16:23, Kent Overstreet wrote:
> On Thu, Sep 01, 2022 at 10:05:03AM +0200, David Hildenbrand wrote:
>> On 31.08.22 21:01, Kent Overstreet wrote:
>>> On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote:
 On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> Whatever asking for an explanation as to why equivalent functionality
> cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable.

 Fully agreed and this is especially true for a change this size
 77 files changed, 3406 insertions(+), 703 deletions(-)
>>>
>>> In the case of memory allocation accounting, you flat cannot do this with 
>>> ftrace
>>> - you could maybe do a janky version that isn't fully accurate, much slower,
>>> more complicated for the developer to understand and debug and more 
>>> complicated
>>> for the end user.
>>>
>>> But please, I invite anyone who's actually been doing this with ftrace to
>>> demonstrate otherwise.
>>>
>>> Ftrace just isn't the right tool for the job here - we're talking about 
>>> adding
>>> per callsite accounting to some of the fastest fast paths in the kernel.
>>>
>>> And the size of the changes for memory allocation accounting are much more
>>> reasonable:
>>>  33 files changed, 623 insertions(+), 99 deletions(-)
>>>
>>> The code tagging library should exist anyways, it's been open coded half a 
>>> dozen
>>> times in the kernel already.
>>
>> Hi Kent,
>>
>> independent of the other discussions, if it's open coded already, does
>> it make sense to factor that already-open-coded part out independently
>> of the remainder of the full series here?
> 
> It's discussed in the cover letter, that is exactly how the patch series is
> structured.

Skimming over the patches (that I was CCed on) and skimming over the
cover letter, I got the impression that everything after patch 7 is
introducing something new instead of refactoring something out.

>  
>> [I didn't immediately spot if this series also attempts already to
>> replace that open-coded part]
> 
> Uh huh.
> 
> Honestly, some days it feels like lkml is just as bad as slashdot, with people
> wanting to get in their two cents without actually reading...

... and of course you had to reply like that. I should just have learned
from my last upstream experience with you and kept you on my spam list.

Thanks, bye

-- 
Thanks,

David / dhildenb




Re: [RFC PATCH 03/30] Lazy percpu counters

2022-09-01 Thread Steven Rostedt
On Thu, 1 Sep 2022 10:32:19 -0400
Kent Overstreet  wrote:

> On Thu, Sep 01, 2022 at 08:51:31AM +0200, Peter Zijlstra wrote:
> > On Tue, Aug 30, 2022 at 02:48:52PM -0700, Suren Baghdasaryan wrote:  
> > > +static void lazy_percpu_counter_switch_to_pcpu(struct 
> > > raw_lazy_percpu_counter *c)
> > > +{
> > > + u64 __percpu *pcpu_v = alloc_percpu_gfp(u64, GFP_ATOMIC|__GFP_NOWARN);  
> > 
> > Realize that this is incorrect when used under a raw_spinlock_t.  
> 
> Can you elaborate?

All allocations (including GFP_ATOMIC) grab normal spin_locks. When
PREEMPT_RT is configured, normal spin_locks turn into mutexes, whereas
raw_spinlocks do not.

Thus, if this is done within a raw_spinlock with PREEMPT_RT configured, it
can cause a schedule while holding a spinlock.

-- Steve



Re: [RFC PATCH 27/30] Code tagging based latency tracking

2022-09-01 Thread Kent Overstreet
On Thu, Sep 01, 2022 at 09:11:17AM +0200, Peter Zijlstra wrote:
> On Tue, Aug 30, 2022 at 02:49:16PM -0700, Suren Baghdasaryan wrote:
> > From: Kent Overstreet 
> > 
> > This adds the ability to easily instrument code for measuring latency.
> > To use, add the following to calls to your code, at the start and end of
> > the event you wish to measure:
> > 
> >   code_tag_time_stats_start(start_time);
> >   code_tag_time_stats_finish(start_time);
> > 
> > Statistics will then show up in debugfs under
> > /sys/kernel/debug/time_stats, listed by file and line number.
> > 
> > Statistics measured include weighted averages of frequency, duration, max
> > duration, as well as quantiles.
> > 
> > This patch also instruments all calls to init_wait and finish_wait,
> > which includes all calls to wait_event. Example debugfs output:
> 
> How can't you do this with a simple eBPF script on top of
> trace_sched_stat_* and friends?

I know about those tracepoints, and I've never found them to be usable. I've
never successfully used them for debugging latency issues, or known anyone who
has.

And an eBPF script to do everything this does wouldn't be simple at all.
Honestly, the time stats stuff looks _far_ simpler to me than anything involving
tracing - and with tracing you have to correlate the start and end events after
the fact.



Re: [PATCH 1/2] docs, xen/arm: Introduce reserved heap memory

2022-09-01 Thread Julien Grall

Hi Henry,

On 24/08/2022 08:31, Henry Wang wrote:

This commit introduces the reserved heap memory, which is the part of RAM
reserved at the beginning of boot time for the heap.

A new boolean field `xen_heap` in `struct membank` is added to store the
configuration telling if the memory bank is reserved as heap through
`xen,static-mem` property in device tree `chosen` node.

Also, this commit introduces the logic to parse the reserved heap
configuation in device tree by reusing the device tree entry definition


typo: s/configuation/configuration/


of the static memory allocation feature:

- Add a boolean parameter `xen_heap` to `device_tree_get_meminfo` to
reflect whether the memory bank is reserved as heap.

- Use `device_tree_get_meminfo` to parse the reserved heap configuation
in `chosen` node of the device tree.

- In order to reuse the function `device_tree_get_meminfo`, the
return type of `process_chosen_node` is changed from void to int.

A documentation section is added, describing the definition of reserved
heap memory and the method of enabling the reserved heap memory through
device tree at boot time.

Signed-off-by: Henry Wang 
Signed-off-by: Penny Zheng 
---
The name of the device tree property was chosen because we want to
reuse as much of the device tree parsing helpers from the static
memory allocation feature as possible, but we would like to hear the
upstream reviewers' opinion on whether using "xen,static-heap" is better.
---
Changes from RFC to v1:
- Rename the terminology to reserved heap.
---
  docs/misc/arm/device-tree/booting.txt | 46 +


I have skipped the documentation and looked at the code instead.


  xen/arch/arm/bootfdt.c| 49 +--
  xen/arch/arm/domain_build.c   |  5 +--
  xen/arch/arm/include/asm/setup.h  |  1 +
  4 files changed, 89 insertions(+), 12 deletions(-)

diff --git a/docs/misc/arm/device-tree/booting.txt 
b/docs/misc/arm/device-tree/booting.txt
index 98253414b8..e064f64d9a 100644
--- a/docs/misc/arm/device-tree/booting.txt
+++ b/docs/misc/arm/device-tree/booting.txt
@@ -378,3 +378,49 @@ device-tree:
  
  This will reserve a 512MB region starting at the host physical address

  0x30000000 to be exclusively used by DomU1.
+
+
+Reserved Heap Memory
+
+
+The reserved heap memory (also known as the statically-configured heap) refers
+to parts of RAM reserved at the beginning of boot for heap use. The memory is reserved by
+configuration in the device tree using physical address ranges.
+
+The reserved heap memory declared in the device tree defines the memory areas
+that will be reserved to be used exclusively as heap.
+
+- For Arm32, since there can be separate heaps, the reserved heap will be used
+for both domheap and xenheap.
+- For Arm64, since domheap and xenheap are the same, the defined reserved heap
+areas shall always go to the heap allocator.
+
+The reserved heap memory is an optional feature and can be enabled by adding a
+device tree property in the `chosen` node. Currently, this feature reuses the
+static memory allocation device tree configuration.
+
+The dtb property should look as follows:
+
+- property name
+
+"xen,static-mem" (Should be used in the `chosen` node)
+
+- cells
+
+Specify the start address and the length of the reserved heap memory.
+The number of cells for the address and the size should be defined
+using the properties `#xen,static-mem-address-cells` and
+`#xen,static-mem-size-cells` respectively.
+
+Below is an example of how to specify the reserved heap in the device tree:
+
+/ {
+chosen {
+#xen,static-mem-address-cells = <0x2>;
+#xen,static-mem-size-cells = <0x2>;
+xen,static-mem = <0x0 0x30000000 0x0 0x40000000>;
+};
+};
+
+RAM at 0x30000000 of 1G size will be reserved as heap.
+
diff --git a/xen/arch/arm/bootfdt.c b/xen/arch/arm/bootfdt.c
index ec81a45de9..33704ca487 100644
--- a/xen/arch/arm/bootfdt.c
+++ b/xen/arch/arm/bootfdt.c
@@ -64,7 +64,8 @@ void __init device_tree_get_reg(const __be32 **cell, u32 
address_cells,
  static int __init device_tree_get_meminfo(const void *fdt, int node,
const char *prop_name,
u32 address_cells, u32 size_cells,
-  void *data, bool xen_domain)
+  void *data, bool xen_domain,
+  bool xen_heap)


It sounds to me like we want to have an enum rather than multiple
booleans. This would also make it easier...



  {
  const struct fdt_property *prop;
  unsigned int i, banks;
@@ -96,6 +97,7 @@ static int __init device_tree_get_meminfo(const void *fdt, 
int node,
  mem->bank[mem->nr_banks].start = start;
  mem->bank[mem->nr_banks].size = size;
  mem->bank[mem->nr_banks].xen_domain = xen_domain;
+mem->bank[mem->nr_banks].xen_heap = xen_heap;

Re: [RFC PATCH 03/30] Lazy percpu counters

2022-09-01 Thread Kent Overstreet
On Thu, Sep 01, 2022 at 08:51:31AM +0200, Peter Zijlstra wrote:
> On Tue, Aug 30, 2022 at 02:48:52PM -0700, Suren Baghdasaryan wrote:
> > +static void lazy_percpu_counter_switch_to_pcpu(struct 
> > raw_lazy_percpu_counter *c)
> > +{
> > +   u64 __percpu *pcpu_v = alloc_percpu_gfp(u64, GFP_ATOMIC|__GFP_NOWARN);
> 
> Realize that this is incorrect when used under a raw_spinlock_t.

Can you elaborate?



Re: [RFC PATCH 00/30] Code tagging framework and applications

2022-09-01 Thread Kent Overstreet
On Thu, Sep 01, 2022 at 09:00:17AM +0200, Peter Zijlstra wrote:
> On Wed, Aug 31, 2022 at 11:19:48AM +0100, Mel Gorman wrote:
> 
> > It's also unclear *who* would enable this. It looks like it would mostly
> > have value during the development stage of an embedded platform to track
> > kernel memory usage on a per-application basis in an environment where it
> > may be difficult to setup tracing and tracking. Would it ever be enabled
> > in production? 
> 
> Afaict this is developer only; it is all unconditional code.
> 
> > Would a distribution ever enable this? 
> 
> I would sincerely hope not. Because:
> 
> > If it's enabled, any overhead cannot be disabled/enabled at run or
> > boot time so anyone enabling this would carry the cost without ever
> > necessarily consuming the data.
> 
> this.

We could make it a boot parameter, with the alternatives infrastructure - with a
bit of refactoring there'd be a single function call to nop out, and then we
could also drop the elf sections as well, so that when built in but disabled the
overhead would be practically nil.



Re: [RFC PATCH 00/30] Code tagging framework and applications

2022-09-01 Thread Kent Overstreet
On Thu, Sep 01, 2022 at 10:05:03AM +0200, David Hildenbrand wrote:
> On 31.08.22 21:01, Kent Overstreet wrote:
> > On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote:
> >> On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> >>> Whatever asking for an explanation as to why equivalent functionality
> >>> cannot be created from ftrace/kprobe/eBPF/whatever is reasonable.
> >>
> >> Fully agreed and this is especially true for a change this size
> >> 77 files changed, 3406 insertions(+), 703 deletions(-)
> > 
> > In the case of memory allocation accounting, you flat cannot do this with 
> > ftrace
> > - you could maybe do a janky version that isn't fully accurate, much slower,
> > more complicated for the developer to understand and debug and more 
> > complicated
> > for the end user.
> > 
> > But please, I invite anyone who's actually been doing this with ftrace to
> > demonstrate otherwise.
> > 
> > Ftrace just isn't the right tool for the job here - we're talking about 
> > adding
> > per callsite accounting to some of the fastest fast paths in the kernel.
> > 
> > And the size of the changes for memory allocation accounting are much more
> > reasonable:
> >  33 files changed, 623 insertions(+), 99 deletions(-)
> > 
> > The code tagging library should exist anyways, it's been open coded half a 
> > dozen
> > times in the kernel already.
> 
> Hi Kent,
> 
> independent of the other discussions, if it's open coded already, does
> it make sense to factor that already-open-coded part out independently
> of the remainder of the full series here?

It's discussed in the cover letter, that is exactly how the patch series is
structured.
 
> [I didn't immediately spot if this series also attempts already to
> replace that open-coded part]

Uh huh.

Honestly, some days it feels like lkml is just as bad as slashdot, with people
wanting to get in their two cents without actually reading...



Re: [PATCH] xen/arm: Add xen/arch/arm/efi/stub.c in .gitignore

2022-09-01 Thread Bertrand Marquis
Hi Julien,

This one passed through my filtering, sorry for that.

> On 12 Aug 2022, at 20:19, Julien Grall  wrote:
> 
> From: Julien Grall 
> 
> The Xen build system creates the symbolic link xen/arch/arm/efi/stub.c. So we want
> to ignore it.
> 
> Fixes: 7f96859b0d00 ("xen: reuse x86 EFI stub functions for Arm")
> Signed-off-by: Julien Grall 
Reviewed-by: Bertrand Marquis 

Cheers
Bertrand

> ---
> .gitignore | 1 +
> 1 file changed, 1 insertion(+)
> 
> diff --git a/.gitignore b/.gitignore
> index ed7bd8bdc76c..0d53eb304993 100644
> --- a/.gitignore
> +++ b/.gitignore
> @@ -295,6 +295,7 @@ xen/.config
> xen/.config.old
> xen/.xen.elf32
> xen/System.map
> +xen/arch/arm/efi/stub.c
> xen/arch/x86/boot/mkelf32
> xen/arch/x86/boot/cmdline.S
> xen/arch/x86/boot/reloc.S
> -- 
> 2.37.1
> 




Re: [PATCH 01/10] xen/arm: smmuv3: Fix l1 stream table size in the error message

2022-09-01 Thread Julien Grall




On 01/09/2022 11:27, Rahul Singh wrote:

Hi Julien,


Hi Rahul,


On 24 Aug 2022, at 3:58 pm, Julien Grall  wrote:

Hi Rahul,

On 24/08/2022 14:53, Rahul Singh wrote:

Backport Linux commit dc898eb84b25c39ea46f28c48a169bdbd0e2c7e0
iommu/arm-smmu-v3: Fix l1 stream table size in the error message


We have a tag for this (see Origin). If you use it, then…

Ok.



Original commit message:


You don't need to add "original commit message" here and the content is exactly 
the same.

Ok.



 iommu/arm-smmu-v3: Fix l1 stream table size in the error message
 The actual size of level-1 stream table is l1size. This looks like
 an oversight on commit d2e88e7c081ef ("iommu/arm-smmu: Fix LOG2SIZE
 setting for 2-level stream tables") which forgot to update the @size
 in error message as well.
 As memory allocation failure is already bad enough, nothing worse
 would happen. But let's be careful.
 Signed-off-by: Zenghui Yu 


AFAICT, you didn't make any change to this patch. So the "From:" should still 
be from Zenghui Yu. For an example how to do backport, see
9c432b876bf518866d431bda73f2be1250f688eb "x86/mwait-idle: add SPR support".


Ok.


Also, it would be good to clarify whether they are clean backports or required
some changes (other than context changes). I assume they are clean backports?

Yes, all patches are clean backports.
  
Here is the commit msg, please have a look. Once you confirm, I will modify the other patches and send them for review.

It looks fine to me.

Cheers,

--
Julien Grall



Re: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and heap allocator

2022-09-01 Thread Julien Grall




On 01/09/2022 15:01, Henry Wang wrote:

Hi Julien,


-Original Message-
From: Julien Grall 

+paddr_t reserved_heap_start = ~0, reserved_heap_end = 0,


INVALID_PADDR or ~0ULL


I would strongly prefer the former. It is more difficult to understand
what the value means in the latter.


Thanks for the input. You mean the INVALID_PADDR correct? 


That's correct.

Cheers,

--
Julien Grall



RE: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and heap allocator

2022-09-01 Thread Henry Wang
Hi Julien,

> -Original Message-
> From: Julien Grall 
> >> +paddr_t reserved_heap_start = ~0, reserved_heap_end = 0,
> >
> > INVALID_PADDR or ~0ULL
> 
> I would strongly prefer the former. It is more difficult to understand
> what the value means in the latter.

Thanks for the input. You mean the INVALID_PADDR correct? Yeah
this is the one that I used in my local v2, will send it tomorrow after
doing the bootinfo change.

Kind regards,
Henry

> 
> Cheers,
> 
> --
> Julien Grall


Re: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and heap allocator

2022-09-01 Thread Julien Grall




On 30/08/2022 02:04, Stefano Stabellini wrote:

  size_t estimate_efi_size(unsigned int mem_nr_banks);
diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
index 500307edc0..fe76cf6325 100644
--- a/xen/arch/arm/setup.c
+++ b/xen/arch/arm/setup.c
@@ -73,6 +73,8 @@ integer_param("xenheap_megabytes", opt_xenheap_megabytes);
  
  domid_t __read_mostly max_init_domid;
  
+bool __read_mostly reserved_heap;

+
  static __used void init_done(void)
  {
  /* Must be done past setting system_state. */
@@ -699,8 +701,10 @@ static void __init populate_boot_allocator(void)
  #ifdef CONFIG_ARM_32
  static void __init setup_mm(void)
  {
-paddr_t ram_start, ram_end, ram_size, e;
-unsigned long ram_pages;
+paddr_t ram_start, ram_end, ram_size, e, bank_start, bank_end, bank_size;
+paddr_t reserved_heap_start = ~0, reserved_heap_end = 0,


INVALID_PADDR or ~0ULL


I would strongly prefer the former. It is more difficult to understand 
what the value means in the latter.


Cheers,

--
Julien Grall



Re: [PATCH v3 3/7] xen/evtchn: restrict the maximum number of evtchn supported for domUs

2022-09-01 Thread Michal Orzel
Hi Rahul,

On 01/09/2022 11:13, Rahul Singh wrote:
> 
> Restrict the maximum number of evtchn supported for domUs to avoid
> allocating a large amount of memory in Xen.
> 
> Set the default value of max_evtchn_port to 1023. The value of 1023
> should be sufficient for domUs guests because on ARM we don't bind
> physical interrupts to event channels. The only use of the evtchn port
> is inter-domain communications.
Following the previous discussion, I think the only missing piece is
an explanation that 1023 was chosen to follow the default behavior of libxl.

Apart from that:
Reviewed-by: Michal Orzel 

~Michal



Re: [PATCH v11 6/6] xen: retrieve reserved pages on populate_physmap

2022-09-01 Thread Julien Grall

Hi Penny,

On 31/08/2022 03:40, Penny Zheng wrote:

+/*
+ * Acquire a page from the reserved page list (resv_page_list) when
+ * populating memory for a static domain at runtime.
+ */
+mfn_t acquire_reserved_page(struct domain *d, unsigned int memflags)
+{
+struct page_info *page;
+
+ASSERT_ALLOC_CONTEXT();
+
+/* Acquire a page from reserved page list(resv_page_list). */
+spin_lock(&d->page_alloc_lock);
+page = page_list_remove_head(&d->resv_page_list);
+spin_unlock(&d->page_alloc_lock);
+if ( unlikely(!page) )
+return INVALID_MFN;
+
+if ( !prepare_staticmem_pages(page, 1, memflags) )
+goto fail;
+
+if ( assign_domstatic_pages(d, page, 1, memflags) )
+goto fail_assign;
+
+return page_to_mfn(page);
+
+ fail_assign:
+unprepare_staticmem_pages(page, 1, false);


Looking at assign_domstatic_pages(): it will already call
unprepare_staticmem_pages() in one of the error paths. It doesn't look
like the latter can be called twice on a page.


To be honest, I find a bit odd that assign_domstatic_pages() is calling 
unprepare_staticmem_pages() because the former doesn't call the 
"prepare" function.


AFAICT, this is an issue introduced in this patch. So I would remove the 
call from assign_domstatic_pages() and then let the caller call
unprepare_staticmem_pages() (this would need to be added in 
acquire_domstatic_pages()).


Also, I think it would be good to explain why we don't need to scrub. 
Something like:


"The page was never accessible by the domain. So scrubbing can be skipped".

Cheers,

--
Julien Grall



RE: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and heap allocator

2022-09-01 Thread Henry Wang
Hi Bertrand,

> -Original Message-
> From: Bertrand Marquis 
> > On 1 Sep 2022, at 02:03, Henry Wang  wrote:
> >
> > Hi Arm maintainers,
> >
> >> -Original Message-
> >> Hi Henry,
> >>
> >> On 30/08/2022 08:11, Henry Wang wrote:
> >>>
> >>> Hi Michal,
> >>>
> >>> Sorry about the late reply - I had a couple of days off. Thank you very
> >>> much for the review! I will add my reply and answer some of your
> >>> questions below.
> >>>
>  -Original Message-
>  From: Michal Orzel 
>  Subject: Re: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot
> >> and
>  heap allocator
> 
> > This commit firstly adds a global variable `reserved_heap`.
> > This newly introduced global variable is set at the device tree
> > parsing time if the reserved heap ranges are defined in the device
> > tree chosen node.
> >
>  Did you consider putting reserved_heap into bootinfo structure?
> >>>
> >>> Actually I did, but I saw current bootinfo only contains some structs so
> >>> I was not sure if this is the preferred way, but since you are raising 
> >>> this
> >>> question, I will follow this method in v2.
> >>
> >> This is what I think would be better but maintainers will have a decisive
> vote.
> >
> > I think this is the only uncertain comment that I received during the latest
> > review of this series. I agree that Michal is making a good point (Thanks!)
> but we
> > are curious about what maintainers think. Could you please kindly share
> your
> > opinion on the more preferred approach? I will do that in v2. Thanks very
> much!
> 
> I think it does make sense to put this in bootinfo.

I am good with that, then I think I will move this to bootinfo in v2 unless 
other
objections. Thank you for the input.

Kind regards,
Henry

> 
> Cheers
> Bertrand
> 




Re: [PATCH 0/7] Fix MISRA C 2012 Rule 20.7 violations

2022-09-01 Thread Bertrand Marquis
Hi Xenia,

> On 1 Sep 2022, at 10:27, Xenia Ragiadakou  wrote:
> 
> 
> On 9/1/22 01:35, Stefano Stabellini wrote:
>> Patches 1, 4, and 6 are already committed. I plan to commit patches 2, 3
>> and 5 in the next couple of days.
>> Patch 7 needs further discussions and it is best addressed during the
>> next MISRA C sync-up.
> 
> I would like to share here, before the next MISRA C sync, my understandings 
> that will hopefully resolve a wrong impression of mine, that I may have 
> spread around, regarding this rule.
> There was a misunderstanding regarding the rule 20.7 on my part and I think 
> that Jan is absolutely right that parenthesizing macro parameters used as 
> function arguments is not required by the rule.
> 
> The rule 20.7 states "Expressions resulting from the expansion of macro 
> parameters shall be enclosed in parentheses" and in the rationale of the rule 
> states "If a macro parameter is not being used as an expression then the 
> parentheses are not necessary because no operators are involved.".
> 
> Initially, based on the title, my understanding was that it requires for the 
> expression resulting from the expansion of the macro to be enclosed in 
> parentheses. Then, based on the rule explanation and the examples given,  my 
> understanding was that it requires the macro parameters that are used as 
> expressions to be enclosed in parentheses.
> But, after re-thinking about it, what is most probable and makes more sense
> is that it requires parentheses around the macro parameters that are part of
> an expression, and not around those that are used as expressions.
> 
> Therefore, macro parameters being used as function arguments are not required 
> to be enclosed in parentheses, because the function arguments are part of an 
> expression list, not of an expression (comma is evaluated as separator, not 
> as operator).
> While, macro parameters used as rhs and lhs expressions of the assignment 
> operator are required to be enclosed in parentheses because they are part of 
> an assignment expression.
> 
> I verified that the violation reported by cppcheck is not due to missing 
> parentheses around the function argument (though still I have not understood 
> the origin of the warning). Also, Eclair does not report it.
> 
> Hence, it was a misunderstanding of mine and there is no inconsistency, with 
> respect to this rule, in adding parentheses around macro parameters used as 
> rhs of assignments. The rule does not require adding parentheses around macro 
> parameters used as function arguments and neither cppcheck nor Eclair report 
> violation for missing parentheses around macro parameters used as function 
> arguments.


Thanks a lot for the detailed explanation :-)

What you say does make sense and I agree with your analysis here: only protect
when part of an expression, and not when used as a subsequent parameter (for a
function or another macro).

Regards
Bertrand

> 
>> On Fri, 19 Aug 2022, Xenia Ragiadakou wrote:
>>> Xenia Ragiadakou (7):
>>>   xen/arm: gic_v3_its: Fix MISRA C 2012 Rule 20.7 violations
>>>   xsm/flask: sidtab: Fix MISRA C 2012 Rule 20.7 violations
>>>   xen/elf: Fix MISRA C 2012 Rule 20.7 violations
>>>   xen/vgic: Fix MISRA C 2012 Rule 20.7 violation
>>>   xen/rbtree: Fix MISRA C 2012 Rule 20.7 violation
>>>   xen/arm: processor: Fix MISRA C 2012 Rule 20.7 violations
>>>   xen/device_tree: Fix MISRA C 2012 Rule 20.7 violations
>>> 
>>>  xen/arch/arm/include/asm/gic_v3_its.h | 10 +-
>>>  xen/arch/arm/include/asm/new_vgic.h   |  2 +-
>>>  xen/arch/arm/include/asm/processor.h  |  4 ++--
>>>  xen/include/xen/device_tree.h |  6 +++---
>>>  xen/include/xen/elfstructs.h  |  4 ++--
>>>  xen/lib/rbtree.c  |  2 +-
>>>  xen/xsm/flask/ss/sidtab.c |  8 
>>>  7 files changed, 18 insertions(+), 18 deletions(-)
>>> 
>>> -- 
>>> 2.34.1
>>> 
> 
> -- 
> Xenia




Re: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and heap allocator

2022-09-01 Thread Bertrand Marquis
Hi,

> On 1 Sep 2022, at 02:03, Henry Wang  wrote:
> 
> Hi Arm maintainers,
> 
>> -Original Message-
>> Hi Henry,
>> 
>> On 30/08/2022 08:11, Henry Wang wrote:
>>> 
>>> Hi Michal,
>>> 
>>> Sorry about the late reply - I had a couple of days off. Thank you very
>>> much for the review! I will add my reply and answer some of your
>>> questions below.
>>> 
 -Original Message-
 From: Michal Orzel 
 Subject: Re: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot
>> and
 heap allocator
 
> This commit firstly adds a global variable `reserved_heap`.
> This newly introduced global variable is set at the device tree
> parsing time if the reserved heap ranges are defined in the device
> tree chosen node.
> 
 Did you consider putting reserved_heap into bootinfo structure?
>>> 
>>> Actually I did, but I saw current bootinfo only contains some structs so
>>> I was not sure if this is the preferred way, but since you are raising this
>>> question, I will follow this method in v2.
>> 
>> This is what I think would be better but maintainers will have a decisive 
>> vote.
> 
> I think this is the only uncertain comment that I received during the latest
> review of this series. I agree that Michal is making a good point (Thanks!) 
> but we
> are curious about what maintainers think. Could you please kindly share your
> opinion on the more preferred approach? I will do that in v2. Thanks very 
> much!

I think it does make sense to put this in bootinfo. 

Cheers
Bertrand




[linux-5.4 test] 172904: regressions - FAIL

2022-09-01 Thread osstest service owner
flight 172904 linux-5.4 real [real]
http://logs.test-lab.xenproject.org/osstest/logs/172904/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-i386-libvirt6 libvirt-buildfail REGR. vs. 172128
 build-amd64-libvirt   6 libvirt-buildfail REGR. vs. 172128
 build-arm64-libvirt   6 libvirt-buildfail REGR. vs. 172128
 build-armhf-libvirt   6 libvirt-buildfail REGR. vs. 172128

Tests which are failing intermittently (not blocking):
 test-armhf-armhf-xl-rtds 14 guest-start  fail in 172894 pass in 172904
 test-amd64-amd64-qemuu-freebsd12-amd64 18 guest-saverestore.2 fail pass in 
172894

Tests which did not succeed, but are not blocking:
 test-arm64-arm64-libvirt-raw  1 build-check(1)   blocked  n/a
 test-arm64-arm64-libvirt-xsm  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-pair  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a
 test-amd64-amd64-libvirt-vhd  1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt-xsm  1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt   1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt-pair  1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a
 test-amd64-i386-libvirt-raw   1 build-check(1)   blocked  n/a
 test-amd64-i386-libvirt-xsm   1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt-qcow2  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt-raw  1 build-check(1)   blocked  n/a
 test-armhf-armhf-xl-credit1 18 guest-start/debian.repeat fail blocked in 172128
 test-armhf-armhf-xl-credit2 18 guest-start/debian.repeat fail in 172894 like 
172108
 test-armhf-armhf-xl-credit1  14 guest-start fail in 172894 like 172128
 test-armhf-armhf-xl-multivcpu 14 guest-startfail in 172894 like 172128
 test-armhf-armhf-xl-credit2 15 migrate-support-check fail in 172894 never pass
 test-armhf-armhf-xl-credit2 16 saverestore-support-check fail in 172894 never 
pass
 test-armhf-armhf-xl-multivcpu 18 guest-start/debian.repeatfail like 172108
 test-armhf-armhf-xl-vhd  13 guest-start  fail  like 172108
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 172108
 test-armhf-armhf-xl-credit2  14 guest-start  fail  like 172128
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 172128
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 172128
 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop fail like 172128
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 172128
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 172128
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 172128
 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop fail like 172128
 test-armhf-armhf-xl-rtds 18 guest-start/debian.repeatfail  like 172128
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 172128
 test-amd64-i386-xl-pvshim14 guest-start  fail   never pass
 test-arm64-arm64-xl-seattle  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-x

Re: [PATCH v11 5/6] xen: rename free_staticmem_pages to unprepare_staticmem_pages

2022-09-01 Thread Julien Grall

Hi Penny,

On 31/08/2022 03:40, Penny Zheng wrote:

The name of free_staticmem_pages is inappropriate, considering it is
the opposite of the function prepare_staticmem_pages.

Rename free_staticmem_pages to unprepare_staticmem_pages.

Signed-off-by: Penny Zheng 
Acked-by: Jan Beulich 


Reviewed-by: Julien Grall 

Cheers,

--
Julien Grall



Re: [PATCH] xen-pcifront: Handle missed Connected state

2022-09-01 Thread Jason Andryuk
On Wed, Aug 31, 2022 at 10:35 PM Rich Persaud  wrote:
>
> On Aug 29, 2022, at 11:16 AM, Jason Andryuk  wrote:
> >
> > An HVM guest with linux stubdom and 2 PCI devices failed to start as
> > libxl timed out waiting for the PCI devices to be added.  It happens
> > intermittently but with some regularity.  libxl wrote the two xenstore
> > entries for the devices, but then timed out waiting for backend state 4
> > (Connected) - the state stayed at 7 (Reconfiguring).  (PCI passthrough
> > to an HVM with stubdomain is PV passthrough to the stubdomain and then
> > HVM passthrough with the QEMU inside the stubdomain.)
> >
> > The stubdom kernel never printed "pcifront pci-0: Installing PCI
> > frontend", so it seems to have missed state 4 which would have
> > called pcifront_try_connect -> pcifront_connect_and_init_dma
>
> Is there a state machine doc/flowchart for LibXL and Xen PCI device 
> passthrough to Linux? This would be a valuable addition to Xen's developer 
> docs, even as a whiteboard photo in this thread.

I am not aware of one.

-Jason



Re: [PATCH] acpi: Add TPM2 interface definition and make the TPM version configurable.

2022-09-01 Thread Jason Andryuk
On Tue, Aug 30, 2022 at 4:30 PM Jennifer Herbert
 wrote:
>
> This patch introduces an optional TPM 2 interface definition to the ACPI 
> table,
> which is to be used as part of a vTPM 2 implementation.
> To enable the new interface - I have made the TPM interface version
> configurable in the acpi_config, with the default being the existing 
> 1.2.(TCPA)
> I have also added to hvmloader an option to utilise this new config, which can
> be triggered by setting the platform/tpm_verion xenstore key.
>
> Signed-off-by: Jennifer Herbert 

Reviewed-by: Jason Andryuk 

Thanks.

Is there a particular reason why CRB (Command Response Buffer) was
chosen over TIS (TPM Interface Specification)?  I think of CRB as more
of an embedded device TPM interface, and TIS is what is usually used
with physical TPMs.  My experiences have only been with TIS devices,
so that is influencing my outlook.  Hmm, this patch seems to reference
the Intel Platform Trust Technology (PTT) fTPM (firmware-TPM) as using
the CRB interface:
https://patchwork.kernel.org/project/tpmdd-devel/patch/1417672167-3489-8-git-send-email-jarkko.sakki...@linux.intel.com/
 If PTT fTPMs are using CRB, then it's more than just embedded
devices.

Regards,
Jason



Re: [PATCH v3 1/7] xen/evtchn: Make sure all buckets below d->valid_evtchns are allocated

2022-09-01 Thread Rahul Singh
Hi Michal,

> On 1 Sep 2022, at 1:47 pm, Michal Orzel  wrote:
> 
> Hi Rahul,
> 
> On 01/09/2022 11:13, Rahul Singh wrote:
>> 
>> From: Julien Grall 
>> 
>> Since commit 01280dc19cf3 "evtchn: simplify port_is_valid()", the event
>> channels code assumes that all the buckets below d->valid_evtchns are
>> always allocated.
>> 
>> This assumption holds in most situations because a guest is not
>> allowed to choose the port. Instead, it will be the first free from port
>> 0.
>>
>> When static event channel support is added for dom0less domains, the
>> user can request to allocate evtchn port numbers that are scattered
>> in nature.
>>
>> The existing implementation of evtchn_allocate_port() is not able to
>> deal with such a situation and will end up overriding a bucket and/or
>> leaving some buckets unallocated. The latter will result in a crash if
>> the event channel belongs to an unallocated bucket.
>>
>> This can be solved by making sure that all the buckets below
>> d->valid_evtchns are allocated. There should be no impact in most
>> situations other than LM/LU, as only one bucket would be allocated. For
>> LM/LU, we may end up allocating multiple buckets if the ports in use are
>> sparse.
>>
>> A potential alternative is to check that the bucket is valid in
>> is_port_valid(). This would still be possible without taking the
>> per-domain lock but would result in a couple more memory accesses.
>> 
>> Signed-off-by: Julien Grall 
>> Signed-off-by: Rahul Singh 
>> ---
>> Changes in v3:
>> - fix comments in commit msg.
>> - modify code related to d->valid_evtchns and {read,write}_atomic()
>> Changes in v2:
>> - new patch in this version to avoid the security issue
>> ---
>> xen/common/event_channel.c | 55 --
>> 1 file changed, 35 insertions(+), 20 deletions(-)
>> 
>> diff --git a/xen/common/event_channel.c b/xen/common/event_channel.c
>> index c2c6f8c151..80b06d9743 100644
>> --- a/xen/common/event_channel.c
>> +++ b/xen/common/event_channel.c
>> @@ -193,6 +193,15 @@ static struct evtchn *alloc_evtchn_bucket(struct domain 
>> *d, unsigned int port)
>> return NULL;
>> }
>> 
>> +/*
>> + * Allocate a given port and ensure all the buckets up to that port
>> + * have been allocated.
>> + *
>> + * The last part is important because the rest of the event channel code
>> + * relies on all the buckets up to d->valid_evtchns to be valid. However,
>> + * event channels may be sparse when restoring a domain during Guest
>> + * Transparent Migration and Live Update.
> You got rid of mentioning these non-existent features in the commit msg,
> but you still mention them here.

I missed that. I will fix it in the next version.

Regards,
Rahul




Re: [PATCH v3 1/7] xen/evtchn: Make sure all buckets below d->valid_evtchns are allocated

2022-09-01 Thread Michal Orzel
Hi Rahul,

On 01/09/2022 11:13, Rahul Singh wrote:
> 
> From: Julien Grall 
> 
> Since commit 01280dc19cf3 "evtchn: simplify port_is_valid()", the event
> channels code assumes that all the buckets below d->valid_evtchns are
> always allocated.
> 
> This assumption holds in most situations because a guest is not allowed
> to choose the port; instead, it is given the first free port from
> 0.
> 
> When static event channel support is added for dom0less domains, users
> will be able to request allocation of evtchn port numbers that are
> scattered in nature.
> 
> The existing implementation of evtchn_allocate_port() is not able to
> deal with such a situation and will end up overriding buckets and/or
> leaving some buckets unallocated. The latter will result in a crash if
> the event channel belongs to an unallocated bucket.
> 
> This can be solved by making sure that all the buckets below
> d->valid_evtchns are allocated. There should be no impact in most
> situations other than LM/LU, as only one bucket would be allocated. For
> LM/LU, we may end up allocating multiple buckets if the ports in use
> are sparse.
> 
> A potential alternative is to check that the bucket is valid in
> is_port_valid(). This should still be possible without taking the
> per-domain lock, but would result in a couple more memory accesses.
> 
> Signed-off-by: Julien Grall 
> Signed-off-by: Rahul Singh 
> ---
> Changes in v3:
>  - fix comments in commit msg.
>  - modify code related to d->valid_evtchns and {read,write}_atomic()
> Changes in v2:
>  - new patch in this version to avoid the security issue
> ---
>  xen/common/event_channel.c | 55 --
>  1 file changed, 35 insertions(+), 20 deletions(-)
> 
> diff --git a/xen/common/event_channel.c b/xen/common/event_channel.c
> index c2c6f8c151..80b06d9743 100644
> --- a/xen/common/event_channel.c
> +++ b/xen/common/event_channel.c
> @@ -193,6 +193,15 @@ static struct evtchn *alloc_evtchn_bucket(struct domain *d, unsigned int port)
>  return NULL;
>  }
> 
> +/*
> + * Allocate a given port and ensure all the buckets up to that port
> + * have been allocated.
> + *
> + * The last part is important because the rest of the event channel code
> + * relies on all the buckets up to d->valid_evtchns to be valid. However,
> + * event channels may be sparse when restoring a domain during Guest
> + * Transparent Migration and Live Update.
You got rid of mentioning these non-existent features in the commit msg,
but you still mention them here.

Apart from that, all of Jan's comments were addressed, so:
Reviewed-by: Michal Orzel 

~Michal


