[ovmf test] 172926: regressions - FAIL
flight 172926 ovmf real [real]
http://logs.test-lab.xenproject.org/osstest/logs/172926/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-i386-libvirt            6 libvirt-build            fail REGR. vs. 172136
 build-amd64-libvirt           6 libvirt-build            fail REGR. vs. 172136

version targeted for testing:
 ovmf                 ec87181192f013f4f7ff916b2a39ff2c87b079f3
baseline version:
 ovmf                 444260d45ec2a84e8f8c192b3539a3cd5591d009

Last test of basis   172136  2022-08-04 06:43:42 Z   28 days
Failing since        172151  2022-08-05 02:40:28 Z   28 days  223 attempts
Testing same since   172926  2022-09-02 02:30:44 Z    0 days    1 attempts

People who touched revisions under test:
  Abdul Lateef Attar
  Abner Chang
  Ard Biesheuvel
  Bob Feng
  Chasel Chiu
  Chen, Xiao X
  Czajkowski, Maciej
  Dimitrije Pavlov
  Dun Tan
  Edward Pickup
  Foster Nong
  Gregx Yeh
  Guo Dong
  Igor Kulchytskyy
  James Lu
  Jeff Brasen
  Jiaxin Wu
  Jose Marinho
  KasimX Liu
  Kavya
  Konstantin Aladyshev
  Kun Qin
  Liming Gao
  Liu, Zhiguang
  Maciej Czajkowski
  Michael D Kinney
  Michael Kubacki
  Pierre Gondois
  Ray Ni
  Rebecca Cran
  Rebecca Cran
  Sainadh Nagolu
  Sami Mujawar
  Shengfengx Xue
  Wu, Jiaxin
  Xiao X Chen
  Yuanhao Xie
  Zhiguang Liu

jobs:
 build-amd64-xsm                                              pass
 build-i386-xsm                                               pass
 build-amd64                                                  pass
 build-i386                                                   pass
 build-amd64-libvirt                                          fail
 build-i386-libvirt                                           fail
 build-amd64-pvops                                            pass
 build-i386-pvops                                             pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64                         pass
 test-amd64-i386-xl-qemuu-ovmf-amd64                          pass

sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
 http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
 http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
 http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
 http://xenbits.xen.org/gitweb?p=osstest.git;a=summary

Not pushing.

(No revision log; it would be 1457 lines long.)
Re: [RFC PATCH 16/30] mm: enable slab allocation tagging for kmalloc and friends
On Tue, Aug 30, 2022 at 02:49:05PM -0700, Suren Baghdasaryan wrote:
> Redefine kmalloc, krealloc, kzalloc, kcalloc, etc. to record allocations
> and deallocations done by these functions.

One particular case where this functionality might be very useful: in the past
we've seen examples (at Fb) where it was hard to understand the difference
between slab memory sizes of two different kernel versions due to slab cache
merging. Once a slab cache is merged with another large cache, this data is
pretty much lost. So I definitely see value in stats which are independent
from kmem caches.

The performance overhead is a concern here, so more data would be useful.

Thanks!
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Wed, Aug 31, 2022 at 01:56:08PM -0700, Yosry Ahmed wrote:
> On Wed, Aug 31, 2022 at 12:02 PM Kent Overstreet wrote:
> >
> > On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote:
> > > On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> > > > Whatever asking for an explanation as to why equivalent functionality
> > > > cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable.
> > >
> > > Fully agreed and this is especially true for a change this size
> > > 77 files changed, 3406 insertions(+), 703 deletions(-)
> >
> > In the case of memory allocation accounting, you flat cannot do this with
> > ftrace - you could maybe do a janky version that isn't fully accurate, much
> > slower, more complicated for the developer to understand and debug and more
> > complicated for the end user.
> >
> > But please, I invite anyone who's actually been doing this with ftrace to
> > demonstrate otherwise.
> >
> > Ftrace just isn't the right tool for the job here - we're talking about
> > adding per callsite accounting to some of the fastest fast paths in the
> > kernel.
> >
> > And the size of the changes for memory allocation accounting are much more
> > reasonable:
> > 33 files changed, 623 insertions(+), 99 deletions(-)
> >
> > The code tagging library should exist anyways, it's been open coded half a
> > dozen times in the kernel already.
> >
> > And once we've got that, the time stats code is _also_ far simpler than
> > doing it with ftrace would be. If anyone here has successfully debugged
> > latency issues with ftrace, I'd really like to hear it. Again, for
> > debugging latency issues you want something that can always be on, and
> > that's not cheap with ftrace - and never mind the hassle of correlating
> > start and end wait trace events, building up histograms, etc. - that's all
> > handled here.
> >
> > Cheap, simple, easy to use. What more could you want?
>
> This is very interesting work! Do you have any data about the overhead
> this introduces, especially in a production environment? I am
> especially interested in memory allocations tracking and detecting
> leaks.

+1

I think the main question is whether it can indeed always be turned on in
production or not. If not, the advantage over ftrace/bpf/... is not that
obvious. Otherwise it will indeed be a VERY useful thing.

Also, there is a lot of interesting stuff within this patchset which might be
useful elsewhere. So thanks to Kent and Suren for this work!

Thanks!
Re: [RFC PATCH 14/30] mm: prevent slabobj_ext allocations for slabobj_ext and kmem_cache objects
On Tue, Aug 30, 2022 at 02:49:03PM -0700, Suren Baghdasaryan wrote:
> Use __GFP_NO_OBJ_EXT to prevent recursions when allocating slabobj_ext
> objects. Also prevent slabobj_ext allocations for kmem_cache objects.
>
> Signed-off-by: Suren Baghdasaryan

Patches 12-14 look good to me. It's probably too early to ack anything,
but otherwise I'd ack them.

Thanks!
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Thu, Sep 01, 2022 at 08:17:47PM -0400, Kent Overstreet wrote:
> On Thu, Sep 01, 2022 at 03:53:57PM -0700, Roman Gushchin wrote:
> > I'd suggest to run something like iperf on fast hardware. And maybe some
> > io_uring stuff too. These are two places which were historically most
> > sensitive to the (kernel) memory accounting speed.
>
> I'm getting wildly inconsistent results with iperf.
>
> io_uring-echo-server and rust_echo_bench gets me:
> Benchmarking: 127.0.0.1:12345
> 50 clients, running 512 bytes, 60 sec.
>
> Without alloc tagging: 120547 request/sec
> With:                  116748 request/sec
>
> https://github.com/frevib/io_uring-echo-server
> https://github.com/haraldh/rust_echo_bench
>
> How's that look to you? Close enough? :)

Yes, this looks good (a bit too good).

I'm not that familiar with io_uring; Jens and Pavel should have a better idea
what and how to run (I know they've worked around the kernel memory accounting
because of its performance in the past, which is why I suspect it might be an
issue here as well).

This is a recent optimization on the networking side:
https://lore.kernel.org/linux-mm/20220825000506.239406-1-shake...@google.com/

Maybe you can try to repeat this experiment.

Thanks!
Re: [RFC PATCH 11/30] mm: introduce slabobj_ext to support slab object extensions
On Tue, Aug 30, 2022 at 02:49:00PM -0700, Suren Baghdasaryan wrote:
> Currently slab pages can store only vectors of obj_cgroup pointers in
> page->memcg_data. Introduce slabobj_ext structure to allow more data
> to be stored for each slab object. Wraps obj_cgroup into slabobj_ext
> to support current functionality while allowing to extend slabobj_ext
> in the future.
>
> Note: ideally the config dependency should be turned the other way around:
> MEMCG should depend on SLAB_OBJ_EXT and {page|slab|folio}.memcg_data would
> be renamed to something like {page|slab|folio}.objext_data. However doing
> this in RFC would introduce considerable churn unrelated to the overall
> idea, so avoiding this until v1.

Hi Suren!

I'd say CONFIG_MEMCG_KMEM and CONFIG_YOUR_NEW_STUFF should both depend on
SLAB_OBJ_EXT. CONFIG_MEMCG_KMEM depends on CONFIG_MEMCG anyway.

>
> Signed-off-by: Suren Baghdasaryan
> ---
>  include/linux/memcontrol.h |  18 --
>  init/Kconfig               |   5 ++
>  mm/kfence/core.c           |   2 +-
>  mm/memcontrol.c            |  60 ++-
>  mm/page_owner.c            |   2 +-
>  mm/slab.h                  | 119 +
>  6 files changed, 131 insertions(+), 75 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 6257867fbf95..315399f77173 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -227,6 +227,14 @@ struct obj_cgroup {
>  	};
>  };
>
> +/*
> + * Extended information for slab objects stored as an array in
> + * page->memcg_data if MEMCG_DATA_OBJEXTS is set.
> + */
> +struct slabobj_ext {
> +	struct obj_cgroup *objcg;
> +} __aligned(8);

Why do we need this alignment requirement?

> +
>  /*
>   * The memory controller data structure. The memory controller controls both
>   * page cache and RSS per cgroup. We would eventually like to provide
> @@ -363,7 +371,7 @@ extern struct mem_cgroup *root_mem_cgroup;
>
>  enum page_memcg_data_flags {
>  	/* page->memcg_data is a pointer to an objcgs vector */
> -	MEMCG_DATA_OBJCGS = (1UL << 0),
> +	MEMCG_DATA_OBJEXTS = (1UL << 0),
>  	/* page has been accounted as a non-slab kernel page */
>  	MEMCG_DATA_KMEM = (1UL << 1),
>  	/* the next bit after the last actual flag */
> @@ -401,7 +409,7 @@ static inline struct mem_cgroup *__folio_memcg(struct folio *folio)
>  	unsigned long memcg_data = folio->memcg_data;
>
>  	VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
> -	VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio);
> +	VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
>  	VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_KMEM, folio);
>
>  	return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
> @@ -422,7 +430,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
>  	unsigned long memcg_data = folio->memcg_data;
>
>  	VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
> -	VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio);
> +	VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
>  	VM_BUG_ON_FOLIO(!(memcg_data & MEMCG_DATA_KMEM), folio);
>
>  	return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
> @@ -517,7 +525,7 @@ static inline struct mem_cgroup *page_memcg_check(struct page *page)
>  	 */
>  	unsigned long memcg_data = READ_ONCE(page->memcg_data);
>
> -	if (memcg_data & MEMCG_DATA_OBJCGS)
> +	if (memcg_data & MEMCG_DATA_OBJEXTS)
>  		return NULL;
>
>  	if (memcg_data & MEMCG_DATA_KMEM) {
> @@ -556,7 +564,7 @@ static inline struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *ob
>  static inline bool folio_memcg_kmem(struct folio *folio)
>  {
>  	VM_BUG_ON_PGFLAGS(PageTail(&folio->page), &folio->page);
> -	VM_BUG_ON_FOLIO(folio->memcg_data & MEMCG_DATA_OBJCGS, folio);
> +	VM_BUG_ON_FOLIO(folio->memcg_data & MEMCG_DATA_OBJEXTS, folio);
>  	return folio->memcg_data & MEMCG_DATA_KMEM;
>  }
>
> diff --git a/init/Kconfig b/init/Kconfig
> index 532362fcfe31..82396d7a2717 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -958,6 +958,10 @@ config MEMCG
>  	help
>  	  Provides control over the memory footprint of tasks in a cgroup.
>
> +config SLAB_OBJ_EXT
> +	bool
> +	depends on MEMCG
> +
>  config MEMCG_SWAP
>  	bool
>  	depends on MEMCG && SWAP
> @@ -966,6 +970,7 @@ config MEMCG_SWAP
>  config MEMCG_KMEM
>  	bool
>  	depends on MEMCG && !SLOB
> +	select SLAB_OBJ_EXT
>  	default y
>
>  config BLK_CGROUP
> diff --git a/mm/kfence/core.c b/mm/kfence/core.c
> index c252081b11df..c0958e4a32e2 100644
> --- a/mm/kfence/core.c
> +++ b/mm/kfence/core.c
> @@ -569,7 +569,7 @@ static unsigned long kfence_init_pool(void)
>  		__folio_set_slab(slab_folio(slab));
>  #ifdef CONFIG_MEMCG
>  		slab->memcg_data = (unsigned long)&kfence_metadata[i / 2 - 1].objcg
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Thu, Sep 01, 2022 at 06:37:20PM -0400, Kent Overstreet wrote:
> On Thu, Sep 01, 2022 at 03:27:27PM -0700, Roman Gushchin wrote:
> > On Wed, Aug 31, 2022 at 01:56:08PM -0700, Yosry Ahmed wrote:
> > > This is very interesting work! Do you have any data about the overhead
> > > this introduces, especially in a production environment? I am
> > > especially interested in memory allocations tracking and detecting
> > > leaks.
> >
> > +1
> >
> > I think the question whether it indeed can always be turned on in
> > production or not is the main one. If not, the advantage over
> > ftrace/bpf/... is not that obvious. Otherwise it will indeed be a VERY
> > useful thing.
>
> Low enough overhead to run in production was my primary design goal.
>
> Stats are kept in a struct that's defined at the callsite. So this adds _no_
> pointer chasing to the allocation path, unless we've switched to percpu
> counters at that callsite (see the lazy percpu counters patch), where we
> need to deref one percpu pointer to save an atomic.
>
> Then we need to stash a pointer to the alloc_tag, so that kfree() can find
> it. For slab allocations this uses the same storage area as memcg, so for
> allocations that are using that we won't be touching any additional
> cachelines. (I wanted the pointer to the alloc_tag to be stored inline with
> the allocation, but that would've caused alignment difficulties).
>
> Then there's a pointer deref introduced to the kfree() path, to get back to
> the original alloc_tag and subtract the allocation from that callsite. That
> one won't be free, and with percpu counters we've got another dependent load
> too - hmm, it might be worth benchmarking with just atomics, skipping the
> percpu counters.
>
> So the overhead won't be zero, I expect it'll show up in some synthetic
> benchmarks, but yes I do definitely expect this to be worth enabling in
> production in many scenarios.

I'm somewhat sceptical, but I usually am. And in this case I'll be really
happy to be wrong.

On the bright side, maybe most of the overhead will come from a few
allocations, so an option to explicitly exclude them will do the trick.

I'd suggest to run something like iperf on fast hardware. And maybe some
io_uring stuff too. These are two places which were historically most
sensitive to the (kernel) memory accounting speed.

Thanks!
[linux-linus test] 172922: regressions - FAIL
flight 172922 linux-linus real [real]
http://logs.test-lab.xenproject.org/osstest/logs/172922/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-amd64-libvirt           6 libvirt-build            fail REGR. vs. 172133
 build-arm64-libvirt           6 libvirt-build            fail REGR. vs. 172133
 build-i386-libvirt            6 libvirt-build            fail REGR. vs. 172133
 build-armhf-libvirt           6 libvirt-build            fail REGR. vs. 172133
 build-arm64-pvops             6 kernel-build             fail REGR. vs. 172133

Regressions which are regarded as allowable (not blocking):
 test-amd64-amd64-xl-rtds     20 guest-localmigrate/x10   fail REGR. vs. 172133
 test-armhf-armhf-xl-rtds     18 guest-start/debian.repeat fail REGR. vs. 172133

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-libvirt      1 build-check(1)           blocked n/a
 test-armhf-armhf-libvirt-qcow2 1 build-check(1)          blocked n/a
 test-armhf-armhf-libvirt-raw  1 build-check(1)           blocked n/a
 test-arm64-arm64-examine      1 build-check(1)           blocked n/a
 test-amd64-amd64-libvirt      1 build-check(1)           blocked n/a
 test-amd64-amd64-libvirt-pair 1 build-check(1)           blocked n/a
 test-amd64-amd64-libvirt-qcow2 1 build-check(1)          blocked n/a
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a
 test-amd64-amd64-libvirt-raw  1 build-check(1)           blocked n/a
 test-amd64-amd64-libvirt-xsm  1 build-check(1)           blocked n/a
 test-arm64-arm64-libvirt-raw  1 build-check(1)           blocked n/a
 test-arm64-arm64-libvirt-xsm  1 build-check(1)           blocked n/a
 test-arm64-arm64-xl           1 build-check(1)           blocked n/a
 test-arm64-arm64-xl-credit1   1 build-check(1)           blocked n/a
 test-arm64-arm64-xl-credit2   1 build-check(1)           blocked n/a
 test-arm64-arm64-xl-seattle   1 build-check(1)           blocked n/a
 test-arm64-arm64-xl-thunderx  1 build-check(1)           blocked n/a
 test-arm64-arm64-xl-vhd       1 build-check(1)           blocked n/a
 test-arm64-arm64-xl-xsm       1 build-check(1)           blocked n/a
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stop       fail like 172133
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 172133
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop       fail like 172133
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop       fail like 172133
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stop       fail like 172133
 test-armhf-armhf-xl-arndale  15 migrate-support-check    fail never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-check   fail never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl          15 migrate-support-check    fail never pass
 test-armhf-armhf-xl          16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-rtds     15 migrate-support-check    fail never pass
 test-armhf-armhf-xl-rtds     16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-check  fail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-check    fail never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-check    fail never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-vhd      14 migrate-support-check    fail never pass
 test-armhf-armhf-xl-vhd      15 saverestore-support-check fail never pass

version targeted for testing:
 linux                42e66b1cc3a070671001f8a1e933a80818a192bf
baseline version:
 linux                b44f2fd87919b5ae6e1756d4c7ba2cbba22238e1

Last test of basis   172133  2022-08-04 05:14:48 Z   28 days
Failing since        172152  2022-08-05 04:01:26 Z   28 days   64 attempts
Testing same since   172922  2022-09-01 22:42:52 Z    0 days    1 attempts

1629 people touched revisions under test, not listing them all

jobs:
 build-amd64-xsm                                              pass
 build-arm64-xsm                                              pass
 build-i386-xsm                                               pass
 build-amd64                                                  pass
 build-arm64                                                  pass
 build-armhf                                                  pass
 build-i386                                                   pass
[PATCH v4 6/6] xen: introduce a Kconfig option to configure NUMA nodes number
Currently the maximum number of NUMA nodes is a hardcoded value, which
provides little flexibility unless the code is changed. Introduce a new
Kconfig option to change the maximum number of NUMA nodes conveniently.

Considering that not all architectures support NUMA, this Kconfig option is
only visible on NUMA-enabled architectures. Architectures not supporting NUMA
still use 1 for MAX_NUMNODES. As NODES_SHIFT is currently unused, we're
taking this opportunity to remove it.

Signed-off-by: Wei Chen
Acked-by: Jan Beulich
---
v3 -> v4:
1. Update the commit log to follow Jan's suggestion.
2. Add Acked-by.
v2 -> v3:
1. Fix indent.
2. Use 2-64 for node range.
v1 -> v2:
1. Add NODES_SHIFT removal message in commit log.
2. Change NR_NUMA_NODES upper bound from 4095 to 255.
---
 xen/arch/Kconfig                | 11 +++
 xen/arch/x86/include/asm/numa.h |  2 --
 xen/include/xen/numa.h          | 11 ++-
 3 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/xen/arch/Kconfig b/xen/arch/Kconfig
index f16eb0df43..7028f7b74f 100644
--- a/xen/arch/Kconfig
+++ b/xen/arch/Kconfig
@@ -17,3 +17,14 @@ config NR_CPUS
 	  For CPU cores which support Simultaneous Multi-Threading or similar
 	  technologies, this the number of logical threads which Xen will
 	  support.
+
+config NR_NUMA_NODES
+	int "Maximum number of NUMA nodes supported"
+	range 2 64
+	default "64"
+	depends on NUMA
+	help
+	  Controls the build-time size of various arrays and bitmaps
+	  associated with multiple-nodes management. It is the upper bound of
+	  the number of NUMA nodes that the scheduler, memory allocation and
+	  other NUMA-aware components can handle.
diff --git a/xen/arch/x86/include/asm/numa.h b/xen/arch/x86/include/asm/numa.h
index 2ca3475271..7866afa408 100644
--- a/xen/arch/x86/include/asm/numa.h
+++ b/xen/arch/x86/include/asm/numa.h
@@ -3,8 +3,6 @@

 #include

-#define NODES_SHIFT 6
-
 typedef u8 nodeid_t;

 extern int srat_rev;
diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
index e593115ba2..e16a7c3764 100644
--- a/xen/include/xen/numa.h
+++ b/xen/include/xen/numa.h
@@ -3,14 +3,15 @@

 #include

-#ifndef NODES_SHIFT
-#define NODES_SHIFT 0
-#endif
-
 #define NUMA_NO_NODE     0xFF
 #define NUMA_NO_DISTANCE 0xFF

-#define MAX_NUMNODES    (1 << NODES_SHIFT)
+#ifdef CONFIG_NR_NUMA_NODES
+#define MAX_NUMNODES CONFIG_NR_NUMA_NODES
+#else
+#define MAX_NUMNODES 1
+#endif
+
 #define NR_NODE_MEMBLKS (MAX_NUMNODES * 2)

 #define vcpu_to_node(v) (cpu_to_node((v)->processor))
-- 
2.25.1
[PATCH v4 0/6] Device tree based NUMA support for Arm - Part#2
(The Arm device tree based NUMA support patch set contains 35 patches. In
order to make things easier for reviewers, I split them into 3 parts:
1. Preparation; I have re-sorted the patch series and moved independent
   patches to the head of the series - merged in [1].
2. Move generically usable code from x86 to common - this series.
3. Add new code to support Arm.
This series only contains the second part. As the whole NUMA series has been
reviewed for 1 round in [2], this series would be v3.)

The Xen memory allocation and scheduler modules are NUMA aware. But only x86
has implemented the architecture APIs to support NUMA; Arm was providing a
set of fake architecture APIs to stay compatible with the NUMA-aware memory
allocation and scheduler. An Arm system worked well as a single-node NUMA
system with these fake APIs, because we didn't have multi-node NUMA systems
on Arm.

But in recent years, more and more Arm devices support multiple NUMA nodes.
So now we have a new problem: when Xen runs on these Arm devices, it still
treats them as single-node SMP systems. The NUMA affinity capability of the
Xen memory allocator and scheduler becomes meaningless, because they rely on
input data that does not reflect the real NUMA layout. Xen still thinks the
access time for all of the memory is the same for all CPUs. However, Xen may
allocate memory to a VM from different NUMA nodes with different access
speeds. This difference can be amplified by workloads inside the VM, causing
performance instability and timeouts.

So in this patch series, we implement a set of NUMA APIs that use the device
tree to describe the NUMA layout. We reuse most of the x86 NUMA code to
create and maintain the mapping between memory and CPUs, and to create the
distance matrix between any two NUMA nodes. Except for ACPI and some x86
specific code, we have moved the rest to common code.

In the next stage, when we implement ACPI based NUMA for Arm64, we may move
the ACPI NUMA code to common too, but at the current stage we keep it x86
only. This patch series has been tested and boots well on one Arm64 NUMA
machine and one HPE x86 NUMA machine.

[1] https://lists.xenproject.org/archives/html/xen-devel/2022-06/msg00499.html
[2] https://lists.xenproject.org/archives/html/xen-devel/2021-09/msg01903.html

---
v3 -> v4:
1. Add init_as_disable as an arch_numa_disabled parameter in the patch where
   it is used.
2. Drop unnecessary "else" from arch_numa_setup, and fix its indentation.
3. Restore compute_hash_shift's return value to int.
4. Remove unnecessary parentheses for macros.
5. Use unsigned int for proper variables.
6. Fix some code-style.
7. Move the arch_get_ram_range function comment to the header file.
8. Use bool for found, and add a new "err" for the return value of
   arch_get_ram_range.
9. Use -ENODATA instead of -EINVAL for non-RAM type ranges.
10. Use bool as return value for functions that only return 0/1 or 0/-EINVAL.
11. Move mem_hotplug to a proper place in mm.h.
12. Remove useless "size" in numa_scan_nodes.
13. Add CONFIG_HAS_NUMA_NODE_FWID to gate printing the mapping between node
    id and architectural node id (fw node id).
v2 -> v3:
1. Drop enumeration of numa status.
2. Use helpers to get/update acpi_numa.
3. Insert spaces among parameters of strncmp in numa_setup.
4. Drop helpers to access mem_hotplug. Export mem_hotplug for all arches.
5. Remove acpi.h from common/numa.c.
6. Rename acpi_scan_nodes to numa_scan_nodes.
7. Replace u8 by uint8_t for memnodemap.
8. Use unsigned int for memnode_shift and adjust related functions
   (compute_hash_shift, populate_memnodemap) to use correct types for return
   values or parameters.
9. Use nodeid_t for nodeid and node numbers.
10. Use __read_mostly and __ro_after_init for appropriate variables.
11. Adjust the __read_mostly and __initdata location for some variables.
12. Convert from plain int to unsigned int for cpuid and other proper
    variables.
13. Remove unnecessary change items in history.
14. Rename arch_get_memory_map to arch_get_ram_range.
15. Use -ENOENT instead of -ENODEV to indicate end of memory map.
16. Add description to code comment that arch_get_ram_range returns RAM
    range in [start, end) format.
17. Rename bad_srat to numa_fw_bad.
18. Rename node_to_pxm to numa_node_to_arch_nid.
19. Merge patch#7 and #8 into patch#6.
20. Move NR_NODE_MEMBLKS from x86/acpi.h to common/numa.h.
22. Use 2-64 for node range.
v1 -> v2:
1. Refine the commit messages of several patches.
2. Merge v1 patch#9,10 into one patch. Introduce the new functions in the
   same patch where they are first used.
3. Fold if ( end > mem_hotplug ) into mem_hotplug_update_boundary; in this
   case, we can drop mem_hotplug_boundary.
4. Remove fw_numa; use enumeration to replace numa_off and acpi_numa.
5. Correct return value of srat_disabled.
6. Introduce numa_enabled_with_firmware.
7. Refine the justification of using !node_data[nid].node_spanned_pages.
8. Use ASSERT to replace VIRTUAL_BUG_O
[PATCH v4 5/6] xen/x86: move NUMA scan nodes codes from x86 to common
x86 has implemented a set of code to scan NUMA nodes. This code parses NUMA
memory and processor information from the ACPI SRAT table. But except for
some ACPI specific parts, most of the scan code - memory block validation,
node memory range updates and some sanity checks - can be reused by other
NUMA implementations.

So in this patch, we move some variables and related functions for NUMA
memory and processors to common code as a library. At the same time,
numa_set_processor_nodes_parsed has been introduced for ACPI specific code
to update processor parsing results. With this helper, we can reuse most of
the NUMA memory affinity init code from ACPI.

The bad_srat and node_to_pxm functions are used in common code to do the
architectural fallback and the node to architectural node info translation.
But it doesn't make sense to reuse these names in common code, so we have
renamed them to neutral names as well.

PXM is an ACPI specific item, so we can't use it in common code directly. As
an alternative, we extend the parameters of numa_update_node_memblks: the
caller can pass the PXM as the print messages' prefix or as the
architectural node id. We also introduce CONFIG_HAS_NUMA_NODE_FWID to gate
printing the mapping between node id and architectural node id on those
architectures that do not have an architectural node id. This way we do not
need to retain a lot of per-arch code but can still print architectural log
messages for different NUMA implementations.

mem_hotplug is also accessed by common code. Besides x86, other
architectures like Arm will also want to implement memory hotplug in the
future. Exporting mem_hotplug to common code does no harm to Arm and lets
us drop some per-arch helpers to access mem_hotplug.

As asm/acpi.h has been removed from common/numa.c, we have to move
NR_NODE_MEMBLKS from asm/acpi.h to xen/numa.h in this patch as well.

Signed-off-by: Wei Chen
---
v3 -> v4:
1. Use bool as return value for functions that only return 0/1 or 0/-EINVAL.
2. Move mem_hotplug to a proper place in mm.h.
3. Remove useless "size" in numa_scan_nodes.
4. Use unsigned int or const for proper variables.
5. Fix code-style.
6. Add init_as_disable as arch_numa_disabled parameter.
7. Add CONFIG_HAS_NUMA_NODE_FWID to gate printing the mapping between node
   id and architectural node id (fw node id).
v2 -> v3:
1. Add __ro_after_init to proper variables.
2. Rename bad_srat to numa_fw_bad.
3. Rename node_to_pxm to numa_node_to_arch_nid.
4. Merge patch#7 and #8 into this patch.
5. Correct int to unsigned int in proper places.
6. Move NR_NODE_MEMBLKS from x86/acpi.h to common/numa.h.
7. Drop helpers to access mem_hotplug; we export mem_hotplug from x86/mm.c
   to common/page_alloc.c.
v1 -> v2:
1. Add a code comment for numa_update_node_memblks to explain: it assumes
   all memory regions belonging to a single node are in one chunk. Holes
   between them will be included in the node.
2. Merge into this single patch instead of several patches to move the x86
   SRAT code to common.
3. Export node_to_pxm to keep pxm information in NUMA scan nodes error
   messages.
4. Change the code style to the target file's Xen code-style.
5. Adjust some __init and __initdata for some functions and variables.
6. Merge two patches into this patch:
   1. replace CONFIG_ACPI_NUMA by CONFIG_NUMA.
   2. replace "SRAT" texts.
7. Turn numa_scan_nodes to static.
---
 xen/arch/x86/include/asm/acpi.h |   1 -
 xen/arch/x86/include/asm/mm.h   |   2 -
 xen/arch/x86/include/asm/numa.h |   3 +-
 xen/arch/x86/mm.c               |   2 -
 xen/arch/x86/numa.c             |   7 +-
 xen/arch/x86/srat.c             | 311 +++
 xen/common/numa.c               | 321 +++-
 xen/common/page_alloc.c         |   2 +
 xen/drivers/acpi/Kconfig        |   1 +
 xen/include/xen/mm.h            |   2 +
 xen/include/xen/numa.h          |  12 +-
 11 files changed, 363 insertions(+), 301 deletions(-)

diff --git a/xen/arch/x86/include/asm/acpi.h b/xen/arch/x86/include/asm/acpi.h
index 5c2dd5da2d..c453450a74 100644
--- a/xen/arch/x86/include/asm/acpi.h
+++ b/xen/arch/x86/include/asm/acpi.h
@@ -102,7 +102,6 @@ extern unsigned long acpi_wakeup_address;
 #define ARCH_HAS_POWER_INIT	1

 extern s8 acpi_numa;
-#define NR_NODE_MEMBLKS (MAX_NUMNODES*2)

 extern struct acpi_sleep_info acpi_sinfo;
 #define acpi_video_flags bootsym(video_flags)
diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
index 0fc826de46..95ff71a83a 100644
--- a/xen/arch/x86/include/asm/mm.h
+++ b/xen/arch/x86/include/asm/mm.h
@@ -474,8 +474,6 @@ static inline int get_page_and_type(struct page_info *page,
     ASSERT(((_p)->count_info & PGC_count_mask) != 0);          \
     ASSERT(page_get_owner(_p) == (_d))

-extern paddr_t mem_hotplug;
-
 /**
  * With shadow pagetables, the different kinds of address start
  * to get get confusing.
diff --git a/xen/arch/x86/include/
[PATCH v4 4/6] xen/x86: use arch_get_ram_range to get information from E820 map
The sanity check in nodes_cover_memory is also a requirement of other
architectures that support NUMA. But at the moment the code of
nodes_cover_memory is tied to the x86 E820 map. So we introduce
arch_get_ram_range to decouple the architecture specific memory map from
this function. This means other architectures like Arm can also use it to
check node and memory coverage against their bootmem info.

Based on arch_get_ram_range, we make nodes_cover_memory architecture
independent. We also use neutral words to replace SRAT and E820 in this
function's print messages, to make the messages more generic.

As arch_get_ram_range uses unsigned int for the index, we also adjust the
index in nodes_cover_memory from int to unsigned int.

Signed-off-by: Wei Chen
---
v3 -> v4:
1. Move function comment to header file.
2. Use bool for found, and add a new "err" for the return value of
   arch_get_ram_range.
3. Use -ENODATA instead of -EINVAL for non-RAM type ranges.
v2 -> v3:
1. Rename arch_get_memory_map to arch_get_ram_range.
2. Use -ENOENT instead of -ENODEV to indicate end of memory map.
3. Add description to code comment that arch_get_ram_range returns RAM
   range in [start, end) format.
v1 -> v2:
1. Use arch_get_memory_map to replace arch_get_memory_bank_range and
   arch_get_memory_bank_number.
2. Remove the !start || !end check, because the caller guarantees these two
   pointers will not be NULL.
---
 xen/arch/x86/numa.c    | 15 +++
 xen/arch/x86/srat.c    | 30 ++
 xen/include/xen/numa.h | 13 +
 3 files changed, 46 insertions(+), 12 deletions(-)

diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
index 90b2a22591..fa8caaa084 100644
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -9,6 +9,7 @@
 #include
 #include
 #include
+#include

 #ifndef Dprintk
 #define Dprintk(x...)
@@ -93,3 +94,17 @@ unsigned int __init arch_get_dma_bitsize(void)
                  flsl(node_start_pfn(node) + node_spanned_pages(node) / 4 - 1)
                  + PAGE_SHIFT, 32);
 }
+
+int __init arch_get_ram_range(unsigned int idx, paddr_t *start, paddr_t *end)
+{
+    if ( idx >= e820.nr_map )
+        return -ENOENT;
+
+    if ( e820.map[idx].type != E820_RAM )
+        return -ENODATA;
+
+    *start = e820.map[idx].addr;
+    *end = *start + e820.map[idx].size;
+
+    return 0;
+}
diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index 4c7f0c547e..bd9694db24 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -428,37 +428,43 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
    Make sure the PXMs cover all memory. */
 static int __init nodes_cover_memory(void)
 {
-	int i;
+	unsigned int i;

-	for (i = 0; i < e820.nr_map; i++) {
-		int j, found;
+	for (i = 0; ; i++) {
+		int err;
+		unsigned int j;
+		bool found;
 		paddr_t start, end;

-		if (e820.map[i].type != E820_RAM) {
-			continue;
-		}
+		/* Try to loop memory map from index 0 to end to get RAM ranges. */
+		err = arch_get_ram_range(i, &start, &end);

-		start = e820.map[i].addr;
-		end = e820.map[i].addr + e820.map[i].size;
+		/* Reached the end of the arch's memory map */
+		if (err == -ENOENT)
+			break;
+
+		/* Index-related entry is not RAM, skip it. */
+		if (err)
+			continue;

 		do {
-			found = 0;
+			found = false;
 			for_each_node_mask(j, memory_nodes_parsed)
 				if (start < nodes[j].end
 				    && end > nodes[j].start) {
 					if (start >= nodes[j].start) {
 						start = nodes[j].end;
-						found = 1;
+						found = true;
 					}
 					if (end <= nodes[j].end) {
 						end = nodes[j].start;
-						found = 1;
+						found = true;
 					}
 				}
 		} while (found && start < end);

 		if (start < end) {
-			printk(KERN_ERR "SRAT: No PXM for e820 range: "
+			printk(KERN_ERR "NUMA: No NODE for RAM range: "
 			       "[%"PRIpaddr", %"PRIpaddr"]\n", start, end - 1);
 			return 0;
 		}
diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
index af0abfc8cf..38be7db960 100644
--- a/xen/include/xen/numa.h
+++ b/xen/include/xen/numa.h
@@
[PATCH v4 2/6] xen/x86: move generically usable NUMA code from x86 to common
Some code in x86/numa.c can be shared by other architectures to implement NUMA support: some variables and functions to check and store the NUMA memory map, and some variables and functions to do NUMA initialization. In this patch, we move them to common/numa.c and xen/numa.h, and use CONFIG_NUMA to gate them for architectures without NUMA support. As the target header file is Xen-style, we trim some spaces and replace tabs in the code moved to xen/numa.h at the same time. As acpi_scan_nodes is now used in a common function, it doesn't make sense to keep an acpi_ prefix in common code, so we rename it to numa_scan_nodes in this patch too. After that, gating numa_scan_nodes with CONFIG_ACPI_NUMA in numa_initmem_init no longer makes sense: as CONFIG_NUMA will be selected by CONFIG_ACPI_NUMA for x86, we replace CONFIG_ACPI_NUMA with CONFIG_NUMA to gate numa_scan_nodes. As arch_numa_disabled has been implemented for ACPI NUMA, we can rename srat_disabled to numa_disabled and move it to common code as well. Signed-off-by: Wei Chen --- v3 -> v4: 1. Restore compute_hash_shift's return value to int. 2. Remove unnecessary parentheses for macros. 3. Use unsigned int for proper variables. 4. Fix some code-style. v2 -> v3: 1. Remove acpi.h from common/numa.c. 2. Rename acpi_scan_nodes to numa_scan_nodes. 3. Replace u8 by uint8_t for memnodemap. 4. Use unsigned int for memnode_shift and adjust related functions (compute_hash_shift, populate_memnodemap) to use correct types for return values or parameters. 5. Use nodeid_t for nodeid and node numbers. 6. Use __read_mostly and __ro_after_init for appropriate variables. 7. Adjust the __read_mostly and __initdata location for some variables. 8. Convert from plain int to unsigned for cpuid and other proper variables. 9. Use __attribute_pure__ instead of __attribute__((pure)). 10. Replace CONFIG_ACPI_NUMA by CONFIG_NUMA in numa_initmem_init. 11. 
Add const for some functions' parameters. 12. Move srat_disabled to common code with new name numa_disabled. 13. Fix some spaces code-style for numa_emulation. 14. Change from int to unsigned int for numa_fake. v1 -> v2: 1. New patch in v2. --- xen/arch/x86/include/asm/acpi.h | 1 - xen/arch/x86/include/asm/numa.h | 57 +--- xen/arch/x86/include/asm/setup.h | 1 - xen/arch/x86/numa.c | 430 +- xen/arch/x86/smpboot.c | 2 +- xen/arch/x86/srat.c | 8 +- xen/common/Makefile | 1 + xen/common/numa.c| 442 +++ xen/include/xen/numa.h | 67 + 9 files changed, 517 insertions(+), 492 deletions(-) create mode 100644 xen/common/numa.c diff --git a/xen/arch/x86/include/asm/acpi.h b/xen/arch/x86/include/asm/acpi.h index 9a9cc4c240..5c2dd5da2d 100644 --- a/xen/arch/x86/include/asm/acpi.h +++ b/xen/arch/x86/include/asm/acpi.h @@ -102,7 +102,6 @@ extern unsigned long acpi_wakeup_address; #define ARCH_HAS_POWER_INIT1 extern s8 acpi_numa; -extern int acpi_scan_nodes(u64 start, u64 end); #define NR_NODE_MEMBLKS (MAX_NUMNODES*2) extern struct acpi_sleep_info acpi_sinfo; diff --git a/xen/arch/x86/include/asm/numa.h b/xen/arch/x86/include/asm/numa.h index 237f2c6dbf..6c87942d43 100644 --- a/xen/arch/x86/include/asm/numa.h +++ b/xen/arch/x86/include/asm/numa.h @@ -9,72 +9,17 @@ typedef u8 nodeid_t; extern int srat_rev; -extern nodeid_t cpu_to_node[NR_CPUS]; -extern cpumask_t node_to_cpumask[]; - -#define cpu_to_node(cpu) (cpu_to_node[cpu]) -#define parent_node(node) (node) -#define node_to_first_cpu(node) (__ffs(node_to_cpumask[node])) -#define node_to_cpumask(node)(node_to_cpumask[node]) - -struct node { - paddr_t start, end; -}; - -extern int compute_hash_shift(struct node *nodes, int numnodes, - nodeid_t *nodeids); extern nodeid_t pxm_to_node(unsigned int pxm); #define ZONE_ALIGN (1UL << (MAX_ORDER+PAGE_SHIFT)) -#define VIRTUAL_BUG_ON(x) -extern void numa_add_cpu(int cpu); -extern void numa_init_array(void); -extern bool numa_off; - -extern int arch_numa_setup(const char *opt); -extern bool 
arch_numa_disabled(void); -extern bool srat_disabled(void); -extern void numa_set_node(int cpu, nodeid_t node); +extern bool numa_disabled(void); extern nodeid_t setup_node(unsigned int pxm); extern void srat_detect_node(int cpu); -extern void setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end); extern nodeid_t apicid_to_node[]; extern void init_cpu_to_node(void); -static inline void clear_node_cpumask(int cpu) -{ - cpumask_clear_cpu(cpu, &node_to_cpumask[cpu_to_node(cpu)]); -} - -/* Simple perfect hash to map pdx to node numbers */ -extern int memnode_shift; -extern unsigned long memnodemapsize; -extern u8 *memnodemap; - -struct node_data { -unsigned long node_start_pfn; -unsigned long node_spanned_pages; -}
[PATCH v4 3/6] xen/x86: Use ASSERT instead of VIRTUAL_BUG_ON for phys_to_nid
VIRTUAL_BUG_ON is an empty macro used in phys_to_nid. This results in two lines of error-checking code in phys_to_nid that do not actually work and that cause two compilation errors once enabled: 1. error: "MAX_NUMNODES" undeclared (first use in this function). This is because, in the common header file, "MAX_NUMNODES" is defined after the common header file includes the arch header file, where phys_to_nid had attempted to use "MAX_NUMNODES". This error was resolved after we moved phys_to_nid from the x86 arch header file to the common header file. 2. error: wrong type argument to unary exclamation mark. This is because the error-checking code contains !node_data[nid], but node_data is a structure variable, not a pointer. So, in this patch, we use ASSERT instead of VIRTUAL_BUG_ON to enable the two lines of error-checking code, and fix the remaining compilation error by replacing !node_data[nid] with !node_data[nid].node_spanned_pages. Although NUMA allows a node to have only CPUs without any memory, and a node with 0 bytes of memory might theoretically have an entry in memnodemap[], that doesn't mean phys_to_nid can find any valid address in a node with 0 bytes of memory. Signed-off-by: Wei Chen Tested-by: Jiamei Xie Acked-by: Jan Beulich --- v3 -> v4: no change. v2 -> v3: 1. Remove unnecessary change items in history. 2. Add Acked-by. v1 -> v2: 1. Use ASSERT to replace VIRTUAL_BUG_ON in phys_to_nid. 2. Adjust the conditional expression for ASSERT. 3. Refine the justification of using !node_data[nid].node_spanned_pages. 
--- xen/include/xen/numa.h | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h index baef4cd553..af0abfc8cf 100644 --- a/xen/include/xen/numa.h +++ b/xen/include/xen/numa.h @@ -36,8 +36,6 @@ struct node { extern int compute_hash_shift(const struct node *nodes, nodeid_t numnodes, nodeid_t *nodeids); -#define VIRTUAL_BUG_ON(x) - extern bool numa_off; extern void numa_add_cpu(unsigned int cpu); @@ -70,9 +68,9 @@ extern struct node_data node_data[]; static inline nodeid_t __attribute_pure__ phys_to_nid(paddr_t addr) { nodeid_t nid; -VIRTUAL_BUG_ON((paddr_to_pdx(addr) >> memnode_shift) >= memnodemapsize); +ASSERT((paddr_to_pdx(addr) >> memnode_shift) < memnodemapsize); nid = memnodemap[paddr_to_pdx(addr) >> memnode_shift]; -VIRTUAL_BUG_ON(nid >= MAX_NUMNODES || !node_data[nid]); +ASSERT(nid < MAX_NUMNODES && node_data[nid].node_spanned_pages); return nid; } -- 2.25.1
[PATCH v4 1/6] xen/x86: Provide helpers for common code to access acpi_numa
acpi_numa is a switch specific to the ACPI NUMA implementation. Other NUMA implementations may not need this switch, but it is not only used by ACPI code: it is also used directly in some general NUMA logic. So far this hasn't caused any problem because x86 is the only Xen architecture implementing ACPI NUMA, but now Arm is implementing device tree based NUMA. Accessing acpi_numa directly in some functions would block reusing the NUMA common code. It is also difficult to replace it with a new generic switch, because it is hard to prove that the new switch states can guarantee the original code will work correctly. So in this patch, we provide two helpers for common code to update and get the state of acpi_numa. Other new NUMA implementations just need to provide the same helpers for common code. This way, the generic NUMA logic can be reused by all NUMA implementations. Signed-off-by: Wei Chen --- v3 -> v4: 1. Drop parameter from arch_numa_disabled, the parameter will be introduced in a later patch where it is used. 2. Drop unnecessary "else" from arch_numa_setup, and fix its indentation. v2 -> v3: 1. Drop enumeration of numa status. 2. Use helpers to get/update acpi_numa. 3. Insert spaces among parameters of strncmp in numa_setup. v1 -> v2: 1. Remove fw_numa. 2. Use enumeration to replace numa_off and acpi_numa. 3. Correct return value of srat_disabled. 4. Introduce numa_enabled_with_firmware. 
--- xen/arch/x86/include/asm/numa.h | 5 +++-- xen/arch/x86/numa.c | 38 ++--- 2 files changed, 28 insertions(+), 15 deletions(-) diff --git a/xen/arch/x86/include/asm/numa.h b/xen/arch/x86/include/asm/numa.h index c32ccffde3..237f2c6dbf 100644 --- a/xen/arch/x86/include/asm/numa.h +++ b/xen/arch/x86/include/asm/numa.h @@ -32,8 +32,9 @@ extern void numa_add_cpu(int cpu); extern void numa_init_array(void); extern bool numa_off; - -extern int srat_disabled(void); +extern int arch_numa_setup(const char *opt); +extern bool arch_numa_disabled(void); +extern bool srat_disabled(void); extern void numa_set_node(int cpu, nodeid_t node); extern nodeid_t setup_node(unsigned int pxm); extern void srat_detect_node(int cpu); diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c index 627ae8aa95..a5bc7a78c9 100644 --- a/xen/arch/x86/numa.c +++ b/xen/arch/x86/numa.c @@ -50,9 +50,28 @@ nodemask_t __read_mostly node_online_map = { { [0] = 1UL } }; bool numa_off; s8 acpi_numa = 0; -int srat_disabled(void) +int __init arch_numa_setup(const char *opt) { -return numa_off || acpi_numa < 0; +#ifdef CONFIG_ACPI_NUMA +if ( !strncmp(opt, "noacpi", 6) ) +{ +numa_off = false; +acpi_numa = -1; +return 0; +} +#endif + +return -EINVAL; +} + +bool arch_numa_disabled(void) +{ +return acpi_numa < 0; +} + +bool srat_disabled(void) +{ +return numa_off || arch_numa_disabled(); } /* @@ -291,28 +310,21 @@ void numa_set_node(int cpu, nodeid_t node) /* [numa=off] */ static int __init cf_check numa_setup(const char *opt) { -if ( !strncmp(opt,"off",3) ) +if ( !strncmp(opt, "off", 3) ) numa_off = true; -else if ( !strncmp(opt,"on",2) ) +else if ( !strncmp(opt, "on", 2) ) numa_off = false; #ifdef CONFIG_NUMA_EMU else if ( !strncmp(opt, "fake=", 5) ) { numa_off = false; -numa_fake = simple_strtoul(opt+5,NULL,0); +numa_fake = simple_strtoul(opt + 5, NULL, 0); if ( numa_fake >= MAX_NUMNODES ) numa_fake = MAX_NUMNODES; } -#endif -#ifdef CONFIG_ACPI_NUMA -else if ( !strncmp(opt,"noacpi",6) ) -{ -numa_off = false; 
-acpi_numa = -1; -} #endif else -return -EINVAL; +return arch_numa_setup(opt); return 0; } -- 2.25.1
RE: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and heap allocator
Hi Julien, > -Original Message- > From: Julien Grall > >> This code is now becoming quite confusing to understand. This loop is > >> meant to map the xenheap. If I follow your documentation, it would > mean > >> that only the reserved region should be mapped. > > > > Yes I think this is the same question that I raised in the scissors line of > > the > > commit message of this patch. > > Sorry I didn't notice the comment after the scissors line. This is the > same question :) > > > What I intend to do is still mapping the whole > > RAM because of the xenheap_* variables that you mentioned in... > > > >> > >> More confusingly, xenheap_* variables will cover the full RAM. > > > > ...here. But only adding the reserved region to the boot allocator so the > > reserved region can become the heap later on. I am wondering if we > > have a more clear way to do that, any suggestions? > > I think your code is correct. It only needs some renaming of the > existing variable (maybe to directmap_*?) to make clear the area is used > to access the RAM easily. Thanks for the clarification. I checked the code and found the xenheap_* variables are: xenheap_virt_start, xenheap_virt_end, xenheap_mfn_start, xenheap_mfn_end, xenheap_base_pdx. For clarification, do we need to change all of them to directmap_*? A pure renaming should be easy (and I guess also safe), but maybe I am overthinking because arm32 also uses xenheap_virt_end, xenheap_mfn_start and xenheap_mfn_end. On arm32 these variables refer to the real xenheap, so I am not sure whether renaming them would reduce readability there. Kind regards, Henry > > Cheers, > > -- > Julien Grall
RE: [PATCH v6 1/9] xen/arm: introduce static shared memory
Hi Julien > -Original Message- > From: Julien Grall > Sent: Thursday, September 1, 2022 11:40 PM > To: Penny Zheng ; xen-devel@lists.xenproject.org > Cc: Stefano Stabellini ; Bertrand Marquis > ; Volodymyr Babchuk > > Subject: Re: [PATCH v6 1/9] xen/arm: introduce static shared memory > > Hi Penny, > > On 29/08/2022 07:57, Penny Zheng wrote: > >> -Original Message- > >> From: Julien Grall > >> Sent: Friday, August 26, 2022 9:17 PM > >> To: Penny Zheng ; xen- > de...@lists.xenproject.org > >> Cc: Stefano Stabellini ; Bertrand Marquis > >> ; Volodymyr Babchuk > >> > >> Subject: Re: [PATCH v6 1/9] xen/arm: introduce static shared memory > >> > >> Hi Penny, > >> > > > > Hi Julien > > > >> On 21/07/2022 14:21, Penny Zheng wrote: > >>> From: Penny Zheng > >>> > >>> This patch series introduces a new feature: setting up static shared > >>> memory on a dom0less system, through device tree configuration. > >>> > >>> This commit parses shared memory node at boot-time, and reserve it > >>> in bootinfo.reserved_mem to avoid other use. > >>> > >>> This commits proposes a new Kconfig CONFIG_STATIC_SHM to wrap > >>> static-shm-related codes, and this option depends on static memory( > >>> CONFIG_STATIC_MEMORY). That's because that later we want to reuse a > >>> few helpers, guarded with CONFIG_STATIC_MEMORY, like > >>> acquire_staticmem_pages, etc, on static shared memory. 
> >>> > >>> Signed-off-by: Penny Zheng > >>> --- > >>> v6 change: > >>> - when host physical address is ommited, output the error message > >>> since xen doesn't support it at the moment > >>> - add the following check: 1) The shm ID matches and the region > >>> exactly match > >>> 2) The shm ID doesn't match and the region doesn't overlap > >>> - change it to "unsigned int" to be aligned with nr_banks > >>> - check the len of the property to confirm is it big enough to > >>> contain "paddr", "size", and "gaddr" > >>> - shm_id defined before nr_shm_domain, so we could re-use the > >>> existing hole and avoid increasing the size of the structure. > >>> - change "nr_shm_domain" to "nr_shm_borrowers", to not increment if > >>> the role is owner in parsing code > >>> - make "xen,shm_id" property as arbitrary string, with a strict > >>> limit on the number of characters, MAX_SHM_ID_LENGTH > >>> --- > >>> v5 change: > >>> - no change > >>> --- > >>> v4 change: > >>> - nit fix on doc > >>> --- > >>> v3 change: > >>> - make nr_shm_domain unsigned int > >>> --- > >>> v2 change: > >>> - document refinement > >>> - remove bitmap and use the iteration to check > >>> - add a new field nr_shm_domain to keep the number of shared domain > >>> --- > >>>docs/misc/arm/device-tree/booting.txt | 124 > >>>xen/arch/arm/Kconfig | 6 + > >>>xen/arch/arm/bootfdt.c| 157 ++ > >>>xen/arch/arm/include/asm/setup.h | 9 ++ > >>>4 files changed, 296 insertions(+) > >>> > >>> diff --git a/docs/misc/arm/device-tree/booting.txt > >>> b/docs/misc/arm/device-tree/booting.txt > >>> index 98253414b8..8013fb98fe 100644 > >>> --- a/docs/misc/arm/device-tree/booting.txt > >>> +++ b/docs/misc/arm/device-tree/booting.txt > >>> @@ -378,3 +378,127 @@ device-tree: > >>> > >>>This will reserve a 512MB region starting at the host physical address > >>>0x3000 to be exclusively used by DomU1. 
> >>> + > >>> +Static Shared Memory > >>> + > >>> + > >>> +The static shared memory device tree nodes allow users to > >>> +statically set up shared memory on dom0less system, enabling > >>> +domains to do shm-based communication. > >>> + > >>> +- compatible > >>> + > >>> +"xen,domain-shared-memory-v1" > >>> + > >>> +- xen,shm-id > >>> + > >>> +An arbitrary string that represents the unique identifier of the > >>> shared > >>> +memory region, with a strict limit on the number of > >>> + characters(\0 > >> included), > >>> +`MAX_SHM_ID_LENGTH(16)`. e.g. "xen,shm-id = "my-shared-mem- > 1"". > >>> + > >>> +- xen,shared-mem > >>> + > >>> +An array takes a physical address, which is the base address of the > >>> +shared memory region in host physical address space, a size, > >>> + and a > >> guest > >>> +physical address, as the target address of the mapping. > >>> +e.g. xen,shared-mem = < [host physical address] [size] [guest > >>> + address] > > >> > >> Your implementation below is checking for overlap and also have some > >> restriction. Can they be documented in the binding? > >> > >>> + > >>> +The number of cells for the host address (and size) is the same as > >>> the > >>> +guest pseudo-physical address and they are inherited from the > >>> + parent > >> node. > >> > >> In v5, we discussed to have the host address optional. However, the > >> binding has not been updated to reflect that. Note that I am not > >> asking to implement, but instead request that the binding can be used for > such setup. > >> > > > > How
RE: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and heap allocator
> -Original Message- > From: Xen-devel On Behalf Of Wei > Chen > Sent: 2022年9月2日 11:03 > To: Stefano Stabellini ; Julien Grall > > Cc: Henry Wang ; xen-devel@lists.xenproject.org; > Bertrand Marquis ; Volodymyr Babchuk > > Subject: RE: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and > heap allocator > > Hi Julien and Stefano, > > > -Original Message- > > From: Stefano Stabellini > > Sent: 2022年9月2日 9:51 > > To: Julien Grall > > Cc: Stefano Stabellini ; Henry Wang > > ; xen-devel@lists.xenproject.org; Bertrand Marquis > > ; Wei Chen ; Volodymyr > Babchuk > > > > Subject: Re: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and > > heap allocator > > > > On Thu, 1 Sep 2022, Julien Grall wrote: > > > Hi Stefano, > > > > > In any case, I think we can postpone to after the release. Maybe we can add some notes to say that this feature is still experimental in EFI + DTS boot? Cheers, Wei Chen
RE: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and heap allocator
Hi Julien and Stefano, > -Original Message- > From: Stefano Stabellini > Sent: 2022年9月2日 9:51 > To: Julien Grall > Cc: Stefano Stabellini ; Henry Wang > ; xen-devel@lists.xenproject.org; Bertrand Marquis > ; Wei Chen ; Volodymyr Babchuk > > Subject: Re: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and > heap allocator > > On Thu, 1 Sep 2022, Julien Grall wrote: > > Hi Stefano, > > > > On 01/09/2022 18:08, Stefano Stabellini wrote: > > > > > Also, what happen with UEFI? Is it easy to guarantee the region > will not > > > > > be used? > > > > > > > > For now I think it is not easy to guarantee that, do you have some > ideas > > > > in mind? I think I can follow this in above follow-up series to > improve > > > > things. > > > > > > For clarity, are we worried that the region is used by the bootloader > > > for other things? For instance U-Boot or Tianocore placing some > > > firmware tables inside the range specified for xenheap? > > > > Yes. I think it would be difficult for an admin to figure out which > regions > > are used. Although they are likely (?) going to be static for a given > > UEFI/U-boot build. > > > > My major concern is such bug can be very difficult to root cause because > we > > have no safety in Xen. The most likely symptom would be random > corruption. > > Thanks for the clarification. Yeah, I think we'll have to do some > "creative thinking" to figure out a solution to this issue. I wonder if > U-boot or Tianocore have some sort of API (or build-time data) to know > the unavailable ranges. > When Xen is booted through EFI, all the memory statically defined in the Device tree has a certain probability of conflicting with the memory occupied by EFI. This is difficult to avoid without the EFI bootloader intervening, but it is possible to detect such a conflict. For EFI reserved memory regions (like runtime service), they will not be reported as usable memory to Xen. 
The usable memory regions will be added to bootinfo.memblk, the same as in a device tree boot. So I think all statically defined memory regions could be checked against bootinfo.memblk to find conflicts. As for the addresses where EFI relocates Xen and the DTB, Xen itself knows them and they are easy to check. But if we don't add code to U-Boot or EDK2, all we can do is check, print an error and panic. Looking at the actual usage scenario, though: because setups using a static heap are normally highly customized, their EFI firmware will be customized to bypass the static memory we defined in the device tree. So maybe checking for conflicts is enough? Cheers, Wei Chen > In any case, I think we can postpone to after the release.
[qemu-mainline test] 172916: regressions - FAIL
flight 172916 qemu-mainline real [real] http://logs.test-lab.xenproject.org/osstest/logs/172916/ Regressions :-( Tests which did not succeed and are blocking, including tests which could not be run: build-amd64-libvirt 6 libvirt-buildfail REGR. vs. 172123 build-i386-libvirt6 libvirt-buildfail REGR. vs. 172123 build-arm64-libvirt 6 libvirt-buildfail REGR. vs. 172123 build-i386-xsm6 xen-buildfail REGR. vs. 172123 build-armhf-libvirt 6 libvirt-buildfail REGR. vs. 172123 Tests which did not succeed, but are not blocking: test-amd64-i386-xl-xsm1 build-check(1) blocked n/a test-arm64-arm64-libvirt-raw 1 build-check(1) blocked n/a test-arm64-arm64-libvirt-xsm 1 build-check(1) blocked n/a test-armhf-armhf-libvirt 1 build-check(1) blocked n/a test-armhf-armhf-libvirt-qcow2 1 build-check(1) blocked n/a test-armhf-armhf-libvirt-raw 1 build-check(1) blocked n/a test-amd64-i386-libvirt-xsm 1 build-check(1) blocked n/a test-amd64-i386-libvirt-raw 1 build-check(1) blocked n/a test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a test-amd64-i386-libvirt-pair 1 build-check(1) blocked n/a test-amd64-i386-libvirt 1 build-check(1) blocked n/a test-amd64-amd64-libvirt 1 build-check(1) blocked n/a test-amd64-amd64-libvirt-pair 1 build-check(1) blocked n/a test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a test-amd64-amd64-libvirt-vhd 1 build-check(1) blocked n/a test-amd64-amd64-libvirt-xsm 1 build-check(1) blocked n/a test-amd64-i386-xl-qemuu-debianhvm-i386-xsm 1 build-check(1) blocked n/a test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 172123 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 172123 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 172123 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 172123 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 172123 test-amd64-i386-xl-pvshim14 guest-start fail never pass test-arm64-arm64-xl-seattle 15 
migrate-support-checkfail never pass test-arm64-arm64-xl-seattle 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail never pass test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-xsm 15 migrate-support-checkfail never pass test-arm64-arm64-xl-xsm 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-credit2 15 migrate-support-checkfail never pass test-arm64-arm64-xl-credit2 16 saverestore-support-checkfail never pass test-arm64-arm64-xl 15 migrate-support-checkfail never pass test-arm64-arm64-xl 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-credit1 15 migrate-support-checkfail never pass test-arm64-arm64-xl-credit1 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-arndale 15 migrate-support-checkfail never pass test-armhf-armhf-xl-arndale 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-vhd 14 migrate-support-checkfail never pass test-arm64-arm64-xl-vhd 15 saverestore-support-checkfail never pass test-armhf-armhf-xl-credit2 15 migrate-support-checkfail never pass test-armhf-armhf-xl-credit2 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail never pass test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-rtds 15 migrate-support-checkfail never pass test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-cubietruck 15 migrate-support-checkfail never pass test-armhf-armhf-xl-cubietruck 16 saverestore-support-checkfail never pass test-armhf-armhf-xl 15 migrate-support-checkfail never pass test-armhf-armhf-xl 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-credit1 15 migrate-support-checkfail never pass test-armhf-armhf-xl-credit1 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-vhd 14 migrate-support-checkfail never pass test-armhf-armhf-xl-vhd 15 
saverestore-support-checkfail never pass version targeted for testing: qemuue93ded1bf6c94ab95015b33e188bc8b0b0c32670 baseline version: qemuu2480f3bbd03814b0651a1f74959f5c6631ee58
[ovmf test] 172923: regressions - FAIL
flight 172923 ovmf real [real] http://logs.test-lab.xenproject.org/osstest/logs/172923/ Regressions :-( Tests which did not succeed and are blocking, including tests which could not be run: build-i386-libvirt6 libvirt-buildfail REGR. vs. 172136 build-amd64-libvirt 6 libvirt-buildfail REGR. vs. 172136 version targeted for testing: ovmf 6edd257861fd8ef44d23a6c1afb329efd95a9ae7 baseline version: ovmf 444260d45ec2a84e8f8c192b3539a3cd5591d009 Last test of basis 172136 2022-08-04 06:43:42 Z 28 days Failing since172151 2022-08-05 02:40:28 Z 27 days 222 attempts Testing same since 172923 2022-09-01 23:42:02 Z0 days1 attempts People who touched revisions under test: Abdul Lateef Attar Abner Chang Ard Biesheuvel Bob Feng Chasel Chiu Chen, Xiao X Czajkowski, Maciej Dimitrije Pavlov Dun Tan Edward Pickup Foster Nong Gregx Yeh Guo Dong Igor Kulchytskyy James Lu Jeff Brasen Jiaxin Wu Jose Marinho KasimX Liu Kavya Konstantin Aladyshev Kun Qin Liming Gao Liu, Zhiguang Maciej Czajkowski Michael D Kinney Michael Kubacki Pierre Gondois Ray Ni Rebecca Cran Rebecca Cran Sainadh Nagolu Sami Mujawar Shengfengx Xue Wu, Jiaxin Xiao X Chen Yuanhao Xie Zhiguang Liu jobs: build-amd64-xsm pass build-i386-xsm pass build-amd64 pass build-i386 pass build-amd64-libvirt fail build-i386-libvirt fail build-amd64-pvopspass build-i386-pvops pass test-amd64-amd64-xl-qemuu-ovmf-amd64 pass test-amd64-i386-xl-qemuu-ovmf-amd64 pass sg-report-flight on osstest.test-lab.xenproject.org logs: /home/logs/logs images: /home/logs/images Logs, config files, etc. are available at http://logs.test-lab.xenproject.org/osstest/logs Explanation of these reports, and of osstest in general, is at http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master Test harness code can be found at http://xenbits.xen.org/gitweb?p=osstest.git;a=summary Not pushing. (No revision log; it would be 1440 lines long.)
Re: [PATCH v3 6/7] xen/arm: introduce xen-evtchn dom0less property
On Thu, 1 Sep 2022, Julien Grall wrote: > On 01/09/2022 10:13, Rahul Singh wrote: > > Introduce a new sub-node under /chosen node to establish static event > > channel communication between domains on dom0less systems. > > > > An event channel will be created beforehand to allow the domains to > > send notifications to each other. > > > > Signed-off-by: Rahul Singh > > --- > > Changes in v3: > > - use device-tree used_by to find the domain id of the evtchn node. > > - add new static_evtchn_create variable in struct dt_device_node to > > hold the information if evtchn is already created. > > - fix minor comments > > Changes in v2: > > - no change > > --- > > docs/misc/arm/device-tree/booting.txt | 64 - > > xen/arch/arm/domain_build.c | 128 ++ > > xen/arch/arm/include/asm/setup.h | 1 + > > xen/arch/arm/setup.c | 2 + > > xen/include/xen/device_tree.h | 13 +++ > > 5 files changed, 207 insertions(+), 1 deletion(-) > > > > diff --git a/docs/misc/arm/device-tree/booting.txt > > b/docs/misc/arm/device-tree/booting.txt > > index 98253414b8..edef98e6d1 100644 > > --- a/docs/misc/arm/device-tree/booting.txt > > +++ b/docs/misc/arm/device-tree/booting.txt > > @@ -212,7 +212,7 @@ with the following properties: > > enable only selected interfaces. > > Under the "xen,domain" compatible node, one or more sub-nodes are > > present > > -for the DomU kernel and ramdisk. > > +for the DomU kernel, ramdisk and static event channel. > > The kernel sub-node has the following properties: > > @@ -254,11 +254,44 @@ The ramdisk sub-node has the following properties: > > property because it will be created by the UEFI stub on boot. > > This option is needed only when UEFI boot is used. 
> > +The static event channel sub-node has the following properties: > > + > > +- compatible > > + > > +"xen,evtchn" > > + > > +- xen,evtchn > > + > > +The property is tuples of two numbers > > +(local-evtchn link-to-foreign-evtchn) where: > > + > > +local-evtchn is an integer value that will be used to allocate local > > port > > +for a domain to send and receive event notifications to/from the remote > > +domain. Maximum supported value is 2^17 for FIFO ABI and 4096 for 2L > > ABI. > > +It is recommended to use low event channel IDs. > > + > > +link-to-foreign-evtchn is a single phandle to a remote evtchn to which > > +local-evtchn will be connected. > > Example > > === > > chosen { > > + > > +module@0 { > > +compatible = "multiboot,kernel", "multiboot,module"; > > +xen,uefi-binary = "..."; > > +bootargs = "..."; > > + > > +}; > > NIT: Describing this node in the example seems unnecessary. > > > + > > +/* one sub-node per local event channel */ > > +ec1: evtchn@1 { > > + compatible = "xen,evtchn-v1"; > > + /* local-evtchn link-to-foreign-evtchn */ > > + xen,evtchn = <0xa &ec2>; > > +}; > > + > > Here you provide an example for dom0. But the position where you describe the > binding suggests this binding only exists for domUs. > > Either we duplicate the binding or we re-order the documentation to have > common binding in a single place. My preference would be the latter. 
> > > domU1 { > > compatible = "xen,domain"; > > #address-cells = <0x2>; > > @@ -277,6 +310,23 @@ chosen { > > compatible = "multiboot,ramdisk", "multiboot,module"; > > reg = <0x0 0x4b00 0xff>; > > }; > > + > > +/* one sub-node per local event channel */ > > +ec2: evtchn@2 { > > +compatible = "xen,evtchn-v1"; > > +/* local-evtchn link-to-foreign-evtchn */ > > +xen,evtchn = <0xa &ec1>; > > +}; > > + > > +ec3: evtchn@3 { > > +compatible = "xen,evtchn-v1"; > > +xen,evtchn = <0xb &ec5>; > > +}; > > + > > +ec4: evtchn@4 { > > +compatible = "xen,evtchn-v1"; > > +xen,evtchn = <0xc &ec6>; > > +}; > > }; > > domU2 { > > @@ -296,6 +346,18 @@ chosen { > > compatible = "multiboot,ramdisk", "multiboot,module"; > > reg = <0x0 0x4d00 0xff>; > > }; > > + > > +/* one sub-node per local event channel */ > > +ec5: evtchn@5 { > > +compatible = "xen,evtchn-v1"; > > +/* local-evtchn link-to-foreign-evtchn */ > > +xen,evtchn = <0xb &ec3>; > > +}; > > + > > +ec6: evtchn@6 { > > +compatible = "xen,evtchn-v1"; > > +xen,evtchn = <0xd &ec4>; > > +}; > > }; > > }; > > diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c > > index 707e247f6a..4b24261825 100644 > > --- a/xen/arch/arm/domain_build.c > > +++ b/xen/arch/arm/domain_build.c > > @@ -33,6 +33,8 @@ > > #
Re: [PATCH 0/7] Fix MISRA C 2012 Rule 20.7 violations
On Thu, 1 Sep 2022, Bertrand Marquis wrote: > Hi Xenia, > > > On 1 Sep 2022, at 10:27, Xenia Ragiadakou wrote: > > > > > > On 9/1/22 01:35, Stefano Stabellini wrote: > >> Patches 1, 4, and 6 are already committed. I plan to commit patches 2, 3 > >> and 5 in the next couple of days. > >> Patch 7 needs further discussion and is best addressed during the > >> next MISRA C sync-up. > > > > I would like to share here, before the next MISRA C sync, my understanding, which will hopefully resolve a wrong impression of mine that I may have > > spread around regarding this rule. > > There was a misunderstanding regarding Rule 20.7 on my part, and I > > think that Jan is absolutely right that parenthesizing macro parameters > > used as function arguments is not required by the rule. > > > > Rule 20.7 states "Expressions resulting from the expansion of macro > > parameters shall be enclosed in parentheses" and the rationale of the > > rule states "If a macro parameter is not being used as an expression then > > the parentheses are not necessary because no operators are involved.". > > > > Initially, based on the title, my understanding was that it requires the expression resulting from the expansion of the macro to be enclosed in > > parentheses. Then, based on the rule explanation and the examples given, > > my understanding was that it requires the macro parameters that are used as > > expressions to be enclosed in parentheses. > > But, after re-thinking about it, what is most probable and makes more > > sense is that it requires parentheses around the macro parameters that are > > part of an expression and not around those that are used as expressions. > > > > Therefore, macro parameters being used as function arguments are not > > required to be enclosed in parentheses, because the function arguments are > > part of an expression list, not of an expression (the comma is evaluated as a separator, not as an operator). 
> > In contrast, macro parameters used as the rhs and lhs expressions of the assignment > > operator are required to be enclosed in parentheses because they are part > > of an assignment expression. > > > > I verified that the violation reported by cppcheck is not due to missing > > parentheses around the function argument (though I still have not > > understood the origin of the warning). Also, Eclair does not report it. > > > > Hence, it was a misunderstanding of mine and there is no inconsistency, > > with respect to this rule, in adding parentheses around macro parameters > > used as the rhs of assignments. The rule does not require adding parentheses > > around macro parameters used as function arguments, and neither cppcheck nor > > Eclair reports a violation for missing parentheses around macro parameters > > used as function arguments. > > > Thanks a lot for the detailed explanation :-) > > What you say does make sense and I agree with your analysis here: only > protect when part of an expression, and not when used as a subsequent parameter (for > a function or another macro). Yeah I also agree with your analysis, and many thanks for double-checking the cppcheck and Eclair reports.
Setting constant-time mode CPU flag
On Intel chips (Ice Lake and later) and ARM64, a bit needs to be set in a CPU register to enforce constant-time execution. Linux plans to set this bit by default; Xen should do the same. See https://lore.kernel.org/lkml/ywgcrqutxmx0w...@gmail.com/T/ for details. I recommend setting the bit unconditionally and ignoring guest attempts to change it. -- Sincerely, Demi Marie Obenour (she/her/hers) Invisible Things Lab
Re: [PATCH v6 5/9] xen/arm: Add additional reference to owner domain when the owner is allocated
On Thu, 1 Sep 2022, Julien Grall wrote: > Hi Penny, > > On 21/07/2022 14:21, Penny Zheng wrote: > > Borrower domain will fail to get a page ref using the owner domain > > during allocation, when the owner is created after borrower. > > > > So here, we decide to get and add the right amount of reference, which > > is the number of borrowers, when the owner is allocated. > > > > Signed-off-by: Penny Zheng > > Reviewed-by: Stefano Stabellini > > IMHO, this tag should not have been kept given... > > > --- > > v6 change: > > - adapt to the change of "nr_shm_borrowers" > > ... this change. 'reviewed-by' means that the person reviewed the code and > therefore agree with the logic. So I would only keep the tag if the change is > trivial (including typo, coding style) and would drop it (or confirm with the > person) otherwise. > > Stefano, can you confirm you are happy that your reviewed-by tag is kept? Yes I confirm Reviewed-by: Stefano Stabellini > > - add in-code comment to explain if the borrower is created first, we intend > > to > > add pages in the P2M without reference. > > --- > > v5 change: > > - no change > > --- > > v4 changes: > > - no change > > --- > > v3 change: > > - printk rather than dprintk since it is a serious error > > --- > > v2 change: > > - new commit > > --- > > xen/arch/arm/domain_build.c | 60 + > > 1 file changed, 60 insertions(+) > > > > diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c > > index a7e95c34a7..e891e800a7 100644 > > --- a/xen/arch/arm/domain_build.c > > +++ b/xen/arch/arm/domain_build.c > > @@ -761,6 +761,30 @@ static void __init assign_static_memory_11(struct > > domain *d, > > } > > #ifdef CONFIG_STATIC_SHM > > +static int __init acquire_nr_borrower_domain(struct domain *d, > > + paddr_t pbase, paddr_t psize, > > + unsigned long *nr_borrowers) > > +{ > > +unsigned long bank; > > NIT: AFAICT nr_banks is an "unsigned int". > > Other than that: > > Acked-by: Julien Grall > > Cheers, > > -- > Julien Grall >
Re: [RFC PATCH 27/30] Code tagging based latency tracking
On Thu, 1 Sep 2022 21:35:32 -0400 Kent Overstreet wrote: > On Thu, Sep 01, 2022 at 08:23:11PM -0400, Steven Rostedt wrote: > > If ftrace, perf, bpf can't do what you want, take a harder look to see if > > you can modify them to do so. > > Maybe we can use this exchange to make both of our tools better. I like your > histograms - the quantiles algorithm I've had for years is janky, I've been > meaning to rip that out, I'd love to take a look at your code for that. And > having an on/off switch is a good idea, I'll try to add that at some point. > Maybe you got some ideas from my stuff too. > > I'd love to get better tracepoints for measuring latency - what I added to > init_wait() and finish_wait() was really only a starting point. Figuring out > the right places to measure is where I'd like to be investing my time in this > area, and there's no reason we couldn't both be making use of that. Yes, this is exactly what I'm talking about. I'm not against your work, I just want you to work more with everyone to come up with ideas that can help everyone as a whole. That's how "open source communities" are supposed to work ;-) The histogram and synthetic events could use some more cleanups. There are a lot of places that can be improved in that code. But I feel the ideas behind that code are sound. It's just a matter of getting the implementation to be a bit more efficient. > > e.g. with kernel waitqueues, I looked at hooking prepare_to_wait() first but > not > all code uses that, init_wait() got me better coverage. But I've already seen > that that misses things, too, there's more work to be done. I picked prepare_to_wait() just because I was hacking up something quick and thought that was "close enough" ;-) -- Steve
Re: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and heap allocator
On Thu, 1 Sep 2022, Julien Grall wrote: > Hi Stefano, > > On 01/09/2022 18:08, Stefano Stabellini wrote: > > > > Also, what happen with UEFI? Is it easy to guarantee the region will not > > > > be used? > > > > > > For now I think it is not easy to guarantee that, do you have some ideas > > > in mind? I think I can follow this in above follow-up series to improve > > > things. > > > > For clarity, are we worried that the region is used by the bootloader > > for other things? For instance U-Boot or Tianocore placing some > > firmware tables inside the range specified for xenheap? > > Yes. I think it would be difficult for an admin to figure out which regions > are used. Although they are likely (?) going to be static for a given > UEFI/U-boot build. > > My major concern is such bug can be very difficult to root cause because we > have no safety in Xen. The most likely symptom would be random corruption. Thanks for the clarification. Yeah, I think we'll have to do some "creative thinking" to figure out a solution to this issue. I wonder if U-boot or Tianocore have some sort of API (or build-time data) to know the unavailable ranges. In any case, I think we can postpone to after the release.
Re: [RFC PATCH 27/30] Code tagging based latency tracking
On Thu, Sep 01, 2022 at 08:23:11PM -0400, Steven Rostedt wrote: > If ftrace, perf, bpf can't do what you want, take a harder look to see if > you can modify them to do so. Maybe we can use this exchange to make both of our tools better. I like your histograms - the quantiles algorithm I've had for years is janky, I've been meaning to rip that out, I'd love to take a look at your code for that. And having an on/off switch is a good idea, I'll try to add that at some point. Maybe you got some ideas from my stuff too. I'd love to get better tracepoints for measuring latency - what I added to init_wait() and finish_wait() was really only a starting point. Figuring out the right places to measure is where I'd like to be investing my time in this area, and there's no reason we couldn't both be making use of that. e.g. with kernel waitqueues, I looked at hooking prepare_to_wait() first but not all code uses that, init_wait() got me better coverage. But I've already seen that that misses things, too, there's more work to be done. random thought: might try adding a warning in schedule() any time it's called and codetag_time_stats_start() hasn't been called, that'll be a starting point...
RE: [PATCH 1/2] docs, xen/arm: Introduce reserved heap memory
Hi Julien, > -Original Message- > From: Julien Grall > Subject: Re: [PATCH 1/2] docs, xen/arm: Introduce reserved heap memory > > Hi Henry, > > > Also, this commit introduces the logic to parse the reserved heap > > configuation in device tree by reusing the device tree entry definition > > typo: s/configuation/configuration/ Oops, sorry for that... > > > docs/misc/arm/device-tree/booting.txt | 46 > + > > I have skipped the documentation and looked at the code instead. No problem, Stefano and Michal have already provided some comments in the doc. > > It sounds like to me, we want to have an enum rather than multiple > boolean. This would also make easier... > > > { > > const struct fdt_property *prop; > > unsigned int i, banks; > > @@ -96,6 +97,7 @@ static int __init device_tree_get_meminfo(const void > *fdt, int node, > > mem->bank[mem->nr_banks].start = start; > > mem->bank[mem->nr_banks].size = size; > > mem->bank[mem->nr_banks].xen_domain = xen_domain; > > +mem->bank[mem->nr_banks].xen_heap = xen_heap; > > mem->nr_banks++; > > } > > > > @@ -185,7 +187,7 @@ static int __init process_memory_node(const void > *fdt, int node, > > void *data) > > { > > return device_tree_get_meminfo(fdt, node, "reg", address_cells, > size_cells, > > - data, false); > > + data, false, false); > > ... to understand the two "false" here. I agree, will change in v2. 
> > > } > > > > static int __init process_reserved_memory_node(const void *fdt, int node, > > @@ -293,7 +295,7 @@ static void __init process_multiboot_node(const > void *fdt, int node, > >kind, start, domU); > > } > > > > -static void __init process_chosen_node(const void *fdt, int node, > > +static int __init process_chosen_node(const void *fdt, int node, > > const char *name, > > u32 address_cells, u32 size_cells) > > > { > > @@ -301,16 +303,40 @@ static void __init process_chosen_node(const > void *fdt, int node, > > paddr_t start, end; > > int len; > > > > +if ( fdt_get_property(fdt, node, "xen,static-mem", NULL) ) > > +{ > > +u32 address_cells = device_tree_get_u32(fdt, node, > > + > > "#xen,static-mem-address-cells", > > +0); > > +u32 size_cells = device_tree_get_u32(fdt, node, > > + "#xen,static-mem-size-cells", > > 0); > > +int rc; > > + > > +printk("Checking for reserved heap in /chosen\n"); > > +if ( address_cells < 1 || size_cells < 1 ) > > +{ > > +printk("fdt: node `%s': invalid #xen,static-mem-address-cells > > or > #xen,static-mem-size-cells\n", > > + name); > > +return -EINVAL; > > +} > > + > > +rc = device_tree_get_meminfo(fdt, node, "xen,static-mem", > address_cells, > > + size_cells, &bootinfo.reserved_mem, > > false, > > + true); > > +if ( rc ) > > +return rc; > > +} > > + > > printk("Checking for initrd in /chosen\n"); > > > > prop = fdt_get_property(fdt, node, "linux,initrd-start", &len); > > if ( !prop ) > > /* No initrd present. */ > > -return; > > +return 0; > > if ( len != sizeof(u32) && len != sizeof(u64) ) > > { > > printk("linux,initrd-start property has invalid length %d\n", > > len); > > -return; > > +return -EINVAL; > > This is technically a change in behavior for Xen (we would panic rather > than continue). I am happy with the proposal. However, this doesn't seem > to be explained in the commit message. > > That said, I think this should be split in a separate patch along with > the ones below (including the prototype changes). 
According to Michal's comment, I've removed the return type and function prototype change in my local v2. So this patch itself is fine. My question now would be: do maintainers think this change of behavior when processing the chosen node would be helpful? Do we prefer an instant panic or the current behavior? I am more than happy to add a patch changing the return/panic behavior if everyone is happy with it. > > > } > > start = dt_read_number((void *)&prop->data, dt_size_to_cells(len)); > > > > @@ -318,12 +344,12 @@ static void __init process_chosen_node(const > void *fdt, int node, > > if ( !prop ) > > { > > printk("linux,initrd-end not present but -start was\n"); > > -return; > > +return -EINVAL; > > } > > if ( len != sizeof(u32) && len != sizeof(u64) ) > > {
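For reference, the /chosen binding that the quoted parsing code handles might look like the following sketch. This is a hedged reconstruction: the property names (`xen,static-mem`, `#xen,static-mem-address-cells`, `#xen,static-mem-size-cells`) are taken directly from the quoted patch, but the base address and size values are purely illustrative.

```dts
chosen {
    #xen,static-mem-address-cells = <0x2>;
    #xen,static-mem-size-cells = <0x2>;
    /* one reserved-heap bank: illustrative base 0x60000000, size 256 MiB */
    xen,static-mem = <0x0 0x60000000 0x0 0x10000000>;
};
```

With the patch's `device_tree_get_meminfo(..., xen_heap = true)` call, each such (address, size) tuple would be recorded as a `bootinfo.reserved_mem` bank with its `xen_heap` flag set.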
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Thu, Sep 01, 2022 at 06:04:46PM -0700, Roman Gushchin wrote: > On Thu, Sep 01, 2022 at 08:17:47PM -0400, Kent Overstreet wrote: > > On Thu, Sep 01, 2022 at 03:53:57PM -0700, Roman Gushchin wrote: > > > I'd suggest to run something like iperf on a fast hardware. And maybe some > > > io_uring stuff too. These are two places which were historically most > > > sensitive > > > to the (kernel) memory accounting speed. > > > > I'm getting wildly inconsistent results with iperf. > > > > io_uring-echo-server and rust_echo_bench gets me: > > Benchmarking: 127.0.0.1:12345 > > 50 clients, running 512 bytes, 60 sec. > > > > Without alloc tagging: 120547 request/sec > > With: 116748 request/sec > > > > https://github.com/frevib/io_uring-echo-server > > https://github.com/haraldh/rust_echo_bench > > > > How's that look to you? Close enough? :) > > Yes, this looks good (a bit too good). Eh, I was hoping for better :) > I'm not that familiar with io_uring, Jens and Pavel should have a better idea > what and how to run (I know they've workarounded the kernel memory accounting > because of the performance in the past, this is why I suspect it might be an > issue here as well). > > This is a recent optimization on the networking side: > https://lore.kernel.org/linux-mm/20220825000506.239406-1-shake...@google.com/ > > Maybe you can try to repeat this experiment. I'd be more interested in a synthetic benchmark, if you know of any.
[linux-5.4 test] 172914: regressions - FAIL
flight 172914 linux-5.4 real [real] http://logs.test-lab.xenproject.org/osstest/logs/172914/ Regressions :-( Tests which did not succeed and are blocking, including tests which could not be run: build-i386-libvirt6 libvirt-buildfail REGR. vs. 172128 build-amd64-libvirt 6 libvirt-buildfail REGR. vs. 172128 build-arm64-libvirt 6 libvirt-buildfail REGR. vs. 172128 build-armhf-libvirt 6 libvirt-buildfail REGR. vs. 172128 Tests which did not succeed, but are not blocking: test-arm64-arm64-libvirt-raw 1 build-check(1) blocked n/a test-arm64-arm64-libvirt-xsm 1 build-check(1) blocked n/a test-amd64-amd64-libvirt 1 build-check(1) blocked n/a test-amd64-amd64-libvirt-pair 1 build-check(1) blocked n/a test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a test-amd64-amd64-libvirt-vhd 1 build-check(1) blocked n/a test-amd64-amd64-libvirt-xsm 1 build-check(1) blocked n/a test-amd64-i386-libvirt 1 build-check(1) blocked n/a test-amd64-i386-libvirt-pair 1 build-check(1) blocked n/a test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a test-amd64-i386-libvirt-raw 1 build-check(1) blocked n/a test-amd64-i386-libvirt-xsm 1 build-check(1) blocked n/a test-armhf-armhf-libvirt 1 build-check(1) blocked n/a test-armhf-armhf-libvirt-qcow2 1 build-check(1) blocked n/a test-armhf-armhf-libvirt-raw 1 build-check(1) blocked n/a test-armhf-armhf-xl-credit1 18 guest-start/debian.repeat fail blocked in 172128 test-armhf-armhf-xl-credit2 18 guest-start/debian.repeatfail like 172108 test-armhf-armhf-xl-multivcpu 18 guest-start/debian.repeatfail like 172108 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 172108 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 172128 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 172128 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop fail like 172128 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 172128 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 
172128 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 172128 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop fail like 172128 test-armhf-armhf-xl-rtds 18 guest-start/debian.repeatfail like 172128 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 172128 test-amd64-i386-xl-pvshim14 guest-start fail never pass test-arm64-arm64-xl-seattle 15 migrate-support-checkfail never pass test-arm64-arm64-xl-seattle 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail never pass test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-credit2 15 migrate-support-checkfail never pass test-armhf-armhf-xl-credit2 16 saverestore-support-checkfail never pass test-arm64-arm64-xl 15 migrate-support-checkfail never pass test-arm64-arm64-xl 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-xsm 15 migrate-support-checkfail never pass test-arm64-arm64-xl-xsm 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-credit1 15 migrate-support-checkfail never pass test-arm64-arm64-xl-credit1 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-credit1 15 migrate-support-checkfail never pass test-armhf-armhf-xl-credit1 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail never pass test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-credit2 15 migrate-support-checkfail never pass test-arm64-arm64-xl-credit2 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-arndale 15 migrate-support-checkfail never pass test-armhf-armhf-xl-arndale 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-cubietruck 15 migrate-support-checkfail never pass test-armhf-armhf-xl-cubietruck 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-vhd 14 migrate-support-checkfail never pass test-armhf-armhf-xl-vhd 15 saverestore-support-checkfail 
never pass test-armhf-armhf-xl 15 migrate-support-checkfail never pass test-armhf-armhf-xl 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-rtds 15 migrate-support-checkfail never pass test-armh
Re: [RFC PATCH 14/30] mm: prevent slabobj_ext allocations for slabobj_ext and kmem_cache objects
On Thu, Sep 1, 2022 at 4:41 PM Roman Gushchin wrote: > > On Tue, Aug 30, 2022 at 02:49:03PM -0700, Suren Baghdasaryan wrote: > > Use __GFP_NO_OBJ_EXT to prevent recursions when allocating slabobj_ext > > objects. Also prevent slabobj_ext allocations for kmem_cache objects. > > > > Signed-off-by: Suren Baghdasaryan > > Patches 12-14 look good to me. > It's probably too early to ack anything, but otherwise I'd ack them. Thank you for reviewing! > > Thanks!
Re: [RFC PATCH 11/30] mm: introduce slabobj_ext to support slab object extensions
On Thu, Sep 1, 2022 at 4:36 PM Roman Gushchin wrote: > > On Tue, Aug 30, 2022 at 02:49:00PM -0700, Suren Baghdasaryan wrote: > > Currently slab pages can store only vectors of obj_cgroup pointers in > > page->memcg_data. Introduce slabobj_ext structure to allow more data > > to be stored for each slab object. Wraps obj_cgroup into slabobj_ext > > to support current functionality while allowing to extend slabobj_ext > > in the future. > > > > Note: ideally the config dependency should be turned the other way around: > > MEMCG should depend on SLAB_OBJ_EXT and {page|slab|folio}.memcg_data would > > be renamed to something like {page|slab|folio}.objext_data. However doing > > this in RFC would introduce considerable churn unrelated to the overall > > idea, so avoiding this until v1. > > Hi Suren! Hi Roman, > > I'd say CONFIG_MEMCG_KMEM and CONFIG_YOUR_NEW_STUFF should both depend on > SLAB_OBJ_EXT. > CONFIG_MEMCG_KMEM depend on CONFIG_MEMCG anyway. Yes, I agree. I wanted to mention here that the current dependency is incorrect and should be reworked. Having both depending on SLAB_OBJ_EXT seems like the right approach. > > > > > Signed-off-by: Suren Baghdasaryan > > --- > > include/linux/memcontrol.h | 18 -- > > init/Kconfig | 5 ++ > > mm/kfence/core.c | 2 +- > > mm/memcontrol.c| 60 ++- > > mm/page_owner.c| 2 +- > > mm/slab.h | 119 + > > 6 files changed, 131 insertions(+), 75 deletions(-) > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > index 6257867fbf95..315399f77173 100644 > > --- a/include/linux/memcontrol.h > > +++ b/include/linux/memcontrol.h > > @@ -227,6 +227,14 @@ struct obj_cgroup { > > }; > > }; > > > > +/* > > + * Extended information for slab objects stored as an array in > > page->memcg_data > > + * if MEMCG_DATA_OBJEXTS is set. > > + */ > > +struct slabobj_ext { > > + struct obj_cgroup *objcg; > > +} __aligned(8); > > Why do we need this aligment requirement? 
To save space by avoiding padding, however, all members today will be pointers, so it's meaningless and we can safely drop it. > > > + > > /* > > * The memory controller data structure. The memory controller controls > > both > > * page cache and RSS per cgroup. We would eventually like to provide > > @@ -363,7 +371,7 @@ extern struct mem_cgroup *root_mem_cgroup; > > > > enum page_memcg_data_flags { > > /* page->memcg_data is a pointer to an objcgs vector */ > > - MEMCG_DATA_OBJCGS = (1UL << 0), > > + MEMCG_DATA_OBJEXTS = (1UL << 0), > > /* page has been accounted as a non-slab kernel page */ > > MEMCG_DATA_KMEM = (1UL << 1), > > /* the next bit after the last actual flag */ > > @@ -401,7 +409,7 @@ static inline struct mem_cgroup *__folio_memcg(struct > > folio *folio) > > unsigned long memcg_data = folio->memcg_data; > > > > VM_BUG_ON_FOLIO(folio_test_slab(folio), folio); > > - VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio); > > + VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio); > > VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_KMEM, folio); > > > > return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK); > > @@ -422,7 +430,7 @@ static inline struct obj_cgroup *__folio_objcg(struct > > folio *folio) > > unsigned long memcg_data = folio->memcg_data; > > > > VM_BUG_ON_FOLIO(folio_test_slab(folio), folio); > > - VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio); > > + VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio); > > VM_BUG_ON_FOLIO(!(memcg_data & MEMCG_DATA_KMEM), folio); > > > > return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK); > > @@ -517,7 +525,7 @@ static inline struct mem_cgroup > > *page_memcg_check(struct page *page) > >*/ > > unsigned long memcg_data = READ_ONCE(page->memcg_data); > > > > - if (memcg_data & MEMCG_DATA_OBJCGS) > > + if (memcg_data & MEMCG_DATA_OBJEXTS) > > return NULL; > > > > if (memcg_data & MEMCG_DATA_KMEM) { > > @@ -556,7 +564,7 @@ static inline struct mem_cgroup > > 
*get_mem_cgroup_from_objcg(struct obj_cgroup *ob > > static inline bool folio_memcg_kmem(struct folio *folio) > > { > > VM_BUG_ON_PGFLAGS(PageTail(&folio->page), &folio->page); > > - VM_BUG_ON_FOLIO(folio->memcg_data & MEMCG_DATA_OBJCGS, folio); > > + VM_BUG_ON_FOLIO(folio->memcg_data & MEMCG_DATA_OBJEXTS, folio); > > return folio->memcg_data & MEMCG_DATA_KMEM; > > } > > > > diff --git a/init/Kconfig b/init/Kconfig > > index 532362fcfe31..82396d7a2717 100644 > > --- a/init/Kconfig > > +++ b/init/Kconfig > > @@ -958,6 +958,10 @@ config MEMCG > > help > > Provides control over the memory footprint of tasks in a cgroup. > > > > +config SLAB_OBJ_EXT > > + bool > > + depends on MEMCG > > + > > config MEMCG_SWAP > > bool >
Re: [RFC PATCH 27/30] Code tagging based latency tracking
On Thu, 1 Sep 2022 18:55:15 -0400 Kent Overstreet wrote: > On Thu, Sep 01, 2022 at 06:34:30PM -0400, Steven Rostedt wrote: > > On Thu, 1 Sep 2022 17:54:38 -0400 > > Kent Overstreet wrote: > > > > > > So this looks like it's gotten better since I last looked, but it's still > > > not > > > there yet. > > > > > > Part of the problem is that the tracepoints themselves are in the wrong > > > place: > > > your end event is when a task is woken up, but that means spurious > > > wakeups will > > > > The end event is when a task is scheduled onto the CPU. The start event is > > the first time it is woken up. > > Yeah, that's not what I want. You're just tracing latency due to having more > processes runnable than CPUs. > > I don't care about that for debugging, though! I specifically want latency at > the wait_event() level, and related - every time a process blocked _on some > condition_, until that condition became true. Not until some random, > potentially > spurious wakeup. Ideally this would be better if we could pass the stack trace from one event to the next, but that wouldn't be too hard to implement. It just needs to be done. But anyway: # echo 'p:wait prepare_to_wait_event' > /sys/kernel/tracing/kprobe_events // created an event on prepare_to_wait_event as that's usually called just // before wait event. # sqlhist -e -n wait_sched 'select start.common_pid as pid,(end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as delta from wait as start join sched_switch as end on start.common_pid = end.prev_pid where end.prev_state & 3' // Create a "wait_sched" event that traces the time between the // prepare_to_wait_event call and the scheduler. Only trigger it if the // schedule happens in the interruptible or uninterruptible states. 
# sqlhist -e -n wake_sched 'select start.pid,(end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as delta2 from wait_sched as start join sched_switch as end on start.pid = end.next_pid where start.delta < 50' // Now attach the wait_event to the sched_switch where the task gets // scheduled back in. But we are only going to care if the delta between // the prepare_to_wait_event and the schedule is less that 50us. This is a // hack to just care about where a prepare_to_wait_event was done just before // scheduling out. # echo 'hist:keys=pid,delta2.buckets=10:sort=delta2' > /sys/kernel/tracing/events/synthetic/wake_sched/trigger // Now we are going to look at the deltas that the task was sleeping for an // event. But this just gives pids and deltas. # echo 'hist:keys=pid,stacktrace if delta < 50' >> /sys/kernel/tracing/events/synthetic/wait_sched/trigger // And this is to get the backtraces of where the task was. This is because // the stack trace is not available at the schedule in, because the // sched_switch can only give the stack trace of when a task schedules out. // Again, this is somewhat a hack. 
# cat /sys/kernel/tracing/events/synthetic/wake_sched/hist # event histogram # # trigger info: hist:keys=pid,delta2.buckets=10:vals=hitcount:sort=delta2.buckets=10:size=2048 [active] # { pid: 2114, delta2: ~ 10-19 } hitcount: 1 { pid: 1389, delta2: ~ 160-169 } hitcount: 1 { pid: 1389, delta2: ~ 660-669 } hitcount: 1 { pid: 1389, delta2: ~ 1020-1029 } hitcount: 1 { pid: 1189, delta2: ~ 500020-500029 } hitcount: 1 { pid: 1189, delta2: ~ 500030-500039 } hitcount: 1 { pid: 1195, delta2: ~ 500030-500039 } hitcount: 2 { pid: 1189, delta2: ~ 500040-500049 } hitcount: 10 { pid: 1193, delta2: ~ 500040-500049 } hitcount: 3 { pid: 1197, delta2: ~ 500040-500049 } hitcount: 3 { pid: 1195, delta2: ~ 500040-500049 } hitcount: 9 { pid: 1190, delta2: ~ 500050-500059 } hitcount: 55 { pid: 1197, delta2: ~ 500050-500059 } hitcount: 51 { pid: 1191, delta2: ~ 500050-500059 } hitcount: 61 { pid: 1198, delta2: ~ 500050-500059 } hitcount: 56 { pid: 1195, delta2: ~ 500050-500059 } hitcount: 48 { pid: 1192, delta2: ~ 500050-500059 } hitcount: 54 { pid: 1194, delta2: ~ 500050-500059 } hitcount: 50 { pid: 1196, delta2: ~ 500050-500059 } hitcount: 57 { pid: 1189, delta2: ~ 500050-500059 } hitcount: 48 { pid: 1193, delta2: ~ 500050-500059 } hitcount: 52 { pid: 1194, delta2: ~ 500060-500069 } hitcount: 12 { pid: 1191, delta2: ~ 500060-500069 } hitcount: 2 { pid: 1190, delta2: ~ 500060-500069 } hitcount: 7 { pid: 1198, delta2: ~ 500060-500069 } hitcount: 9 { pid: 1193, delta2: ~ 500060-500069 } hitcount: 6 { pid: 1196, delta2: ~ 500060-500069 } hitcount: 5 { pid: 1192, delta2: ~ 500060-500069 } hitcount: 9 { pid: 1197, delta2: ~ 500060-500069 } hitcount: 9 { pid:
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Thu, Sep 01, 2022 at 03:53:57PM -0700, Roman Gushchin wrote: > I'd suggest to run something like iperf on a fast hardware. And maybe some > io_uring stuff too. These are two places which were historically most > sensitive > to the (kernel) memory accounting speed. I'm getting wildly inconsistent results with iperf. io_uring-echo-server and rust_echo_bench gets me: Benchmarking: 127.0.0.1:12345 50 clients, running 512 bytes, 60 sec. Without alloc tagging: 120547 request/sec With: 116748 request/sec https://github.com/frevib/io_uring-echo-server https://github.com/haraldh/rust_echo_bench How's that look to you? Close enough? :)
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Thu, Sep 1, 2022 at 3:54 PM Roman Gushchin wrote:
>
> On Thu, Sep 01, 2022 at 06:37:20PM -0400, Kent Overstreet wrote:
> > On Thu, Sep 01, 2022 at 03:27:27PM -0700, Roman Gushchin wrote:
> > > On Wed, Aug 31, 2022 at 01:56:08PM -0700, Yosry Ahmed wrote:
> > > > This is very interesting work! Do you have any data about the overhead
> > > > this introduces, especially in a production environment? I am
> > > > especially interested in memory allocations tracking and detecting
> > > > leaks.
> > >
> > > +1
> > >
> > > I think the question whether it indeed can be always turned on in the production
> > > or not is the main one. If not, the advantage over ftrace/bpf/... is not that
> > > obvious. Otherwise it will be indeed a VERY useful thing.
> >
> > Low enough overhead to run in production was my primary design goal.
> >
> > Stats are kept in a struct that's defined at the callsite. So this adds _no_
> > pointer chasing to the allocation path, unless we've switched to percpu counters
> > at that callsite (see the lazy percpu counters patch), where we need to deref
> > one percpu pointer to save an atomic.
> >
> > Then we need to stash a pointer to the alloc_tag, so that kfree() can find it.
> > For slab allocations this uses the same storage area as memcg, so for
> > allocations that are using that we won't be touching any additional cachelines.
> > (I wanted the pointer to the alloc_tag to be stored inline with the allocation,
> > but that would've caused alignment difficulties).
> >
> > Then there's a pointer deref introduced to the kfree() path, to get back to the
> > original alloc_tag and subtract the allocation from that callsite. That one
> > won't be free, and with percpu counters we've got another dependent load too -
> > hmm, it might be worth benchmarking with just atomics, skipping the percpu
> > counters.
> >
> > So the overhead won't be zero, I expect it'll show up in some synthetic
> > benchmarks, but yes I do definitely expect this to be worth enabling in
> > production in many scenarios.
>
> I'm somewhat sceptical, but I usually am. And in this case I'll be really happy
> to be wrong.
>
> On a bright side, maybe most of the overhead will come from few allocations,
> so an option to explicitly exclude them will do the trick.
>
> I'd suggest to run something like iperf on a fast hardware. And maybe some
> io_uring stuff too. These are two places which were historically most sensitive
> to the (kernel) memory accounting speed.

Thanks for the suggestions, Roman. I'll see how I can get this done. I'll have
to find someone with access to fast hardware (Android is not great for that)
and backport the patchset to the supported kernel version. Will do my best.
Thanks,
Suren.

>
> Thanks!
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an
> email to kernel-team+unsubscr...@android.com.
>
Re: [RFC PATCH 28/30] Improved symbolic error names
On Thu, Sep 01, 2022 at 04:19:35PM -0700, Joe Perches wrote:
> On Tue, 2022-08-30 at 14:49 -0700, Suren Baghdasaryan wrote:
> > From: Kent Overstreet
> >
> > This patch adds per-error-site error codes, with error strings that
> > include their file and line number.
> >
> > To use, change code that returns an error, e.g.
> >     return -ENOMEM;
> > to
> >     return -ERR(ENOMEM);
> >
> > Then, errname() will return a string that includes the file and line
> > number of the ERR() call, for example
> >     printk("Got error %s!\n", errname(err));
> > will result in
> >     Got error ENOMEM at foo.c:1234
>
> Why? Something wrong with just using %pe ?
>
> 	printk("Got error %pe at %s:%d!\n", ERR_PTR(err), __FILE__, __LINE__);
>
> Likely __FILE__ and __LINE__ aren't particularly useful.

That doesn't do what this patchset does. If it only did that, it wouldn't make
much sense, would it? :)

With this patchset,

	printk("Got error %pe!\n", ptr);

prints out a file and line number, but it's _not_ the file/line number of the
printk statement - it's the file/line number where the error originated! :)
[ovmf test] 172921: regressions - FAIL
flight 172921 ovmf real [real]
http://logs.test-lab.xenproject.org/osstest/logs/172921/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-i386-libvirt            6 libvirt-build   fail REGR. vs. 172136
 build-amd64-libvirt           6 libvirt-build   fail REGR. vs. 172136

version targeted for testing:
 ovmf                 ac55fcb051e4f288d29432043a42c13866419598
baseline version:
 ovmf                 444260d45ec2a84e8f8c192b3539a3cd5591d009

Last test of basis   172136  2022-08-04 06:43:42 Z   28 days
Failing since        172151  2022-08-05 02:40:28 Z   27 days  221 attempts
Testing same since   172917  2022-09-01 15:42:06 Z    0 days    2 attempts

People who touched revisions under test:
 Abdul Lateef Attar
 Abner Chang
 Ard Biesheuvel
 Bob Feng
 Chasel Chiu
 Chen, Xiao X
 Czajkowski, Maciej
 Dimitrije Pavlov
 Dun Tan
 Edward Pickup
 Foster Nong
 Gregx Yeh
 Guo Dong
 Igor Kulchytskyy
 James Lu
 Jeff Brasen
 Jiaxin Wu
 Jose Marinho
 KasimX Liu
 Kavya
 Konstantin Aladyshev
 Kun Qin
 Liming Gao
 Liu, Zhiguang
 Maciej Czajkowski
 Michael D Kinney
 Michael Kubacki
 Ray Ni
 Rebecca Cran
 Rebecca Cran
 Sainadh Nagolu
 Sami Mujawar
 Shengfengx Xue
 Wu, Jiaxin
 Xiao X Chen
 Yuanhao Xie
 Zhiguang Liu

jobs:
 build-amd64-xsm                               pass
 build-i386-xsm                                pass
 build-amd64                                   pass
 build-i386                                    pass
 build-amd64-libvirt                           fail
 build-i386-libvirt                            fail
 build-amd64-pvops                             pass
 build-i386-pvops                              pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64          pass
 test-amd64-i386-xl-qemuu-ovmf-amd64           pass

sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
 http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
 http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
 http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
 http://xenbits.xen.org/gitweb?p=osstest.git;a=summary

Not pushing.

(No revision log; it would be 1425 lines long.)
Re: [RFC PATCH 28/30] Improved symbolic error names
On Tue, 2022-08-30 at 14:49 -0700, Suren Baghdasaryan wrote:
> From: Kent Overstreet
>
> This patch adds per-error-site error codes, with error strings that
> include their file and line number.
>
> To use, change code that returns an error, e.g.
>     return -ENOMEM;
> to
>     return -ERR(ENOMEM);
>
> Then, errname() will return a string that includes the file and line
> number of the ERR() call, for example
>     printk("Got error %s!\n", errname(err));
> will result in
>     Got error ENOMEM at foo.c:1234

Why? Something wrong with just using %pe ?

	printk("Got error %pe at %s:%d!\n", ERR_PTR(err), __FILE__, __LINE__);

Likely __FILE__ and __LINE__ aren't particularly useful. And using ERR would
add rather a lot of bloat as each codetag_error_code struct would be unique.

+#define ERR(_err)						\
+({								\
+	static struct codetag_error_code		\
+	__used						\
+	__section("error_code_tags")			\
+	__aligned(8) e = {				\
+		.str	= #_err " at " __FILE__ ":" __stringify(__LINE__),\
+		.err	= _err,				\
+	};						\
+							\
+	e.err;						\
+})
[xen-unstable test] 172910: tolerable FAIL - PUSHED
flight 172910 xen-unstable real [real]
flight 172920 xen-unstable real-retest [real]
http://logs.test-lab.xenproject.org/osstest/logs/172910/
http://logs.test-lab.xenproject.org/osstest/logs/172920/

Failures :-/ but no regressions.

Tests which are failing intermittently (not blocking):
 test-amd64-amd64-xl-credit1  20 guest-localmigrate/x10     fail pass in 172920-retest
 test-armhf-armhf-xl-cubietruck  8 xen-boot                 fail pass in 172920-retest
 test-amd64-i386-xl-qemut-debianhvm-i386-xsm 12 debian-hvm-install fail pass in 172920-retest

Tests which did not succeed, but are not blocking:
 test-amd64-i386-libvirt       1 build-check(1)             blocked  n/a
 test-amd64-i386-libvirt-pair  1 build-check(1)             blocked  n/a
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a
 test-amd64-i386-libvirt-raw   1 build-check(1)             blocked  n/a
 test-amd64-i386-libvirt-xsm   1 build-check(1)             blocked  n/a
 test-arm64-arm64-libvirt-raw  1 build-check(1)             blocked  n/a
 test-arm64-arm64-libvirt-xsm  1 build-check(1)             blocked  n/a
 test-amd64-amd64-libvirt      1 build-check(1)             blocked  n/a
 test-amd64-amd64-libvirt-pair 1 build-check(1)             blocked  n/a
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a
 test-amd64-amd64-libvirt-vhd  1 build-check(1)             blocked  n/a
 test-amd64-amd64-libvirt-xsm  1 build-check(1)             blocked  n/a
 test-armhf-armhf-libvirt      1 build-check(1)             blocked  n/a
 test-armhf-armhf-libvirt-qcow2 1 build-check(1)            blocked  n/a
 test-armhf-armhf-libvirt-raw  1 build-check(1)             blocked  n/a
 test-armhf-armhf-xl-cubietruck 15 migrate-support-check    fail in 172920 never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-check fail in 172920 never pass
 build-amd64-libvirt           6 libvirt-build              fail  like 172901
 build-i386-libvirt            6 libvirt-build              fail  like 172901
 build-arm64-libvirt           6 libvirt-build              fail  like 172901
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stop         fail  like 172901
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 172901
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop         fail  like 172901
 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop          fail  like 172901
 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop          fail  like 172901
 build-armhf-libvirt           6 libvirt-build              fail  like 172901
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stop         fail  like 172901
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop          fail  like 172901
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop          fail  like 172901
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop         fail  like 172901
 test-arm64-arm64-xl-seattle  15 migrate-support-check      fail  never pass
 test-amd64-i386-xl-pvshim    14 guest-start                fail  never pass
 test-arm64-arm64-xl-seattle  16 saverestore-support-check  fail  never pass
 test-arm64-arm64-xl          15 migrate-support-check      fail  never pass
 test-arm64-arm64-xl          16 saverestore-support-check  fail  never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-check      fail  never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-check      fail  never pass
 test-arm64-arm64-xl-xsm      15 migrate-support-check      fail  never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-check  fail  never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-check  fail  never pass
 test-arm64-arm64-xl-xsm      16 saverestore-support-check  fail  never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-check      fail  never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-check  fail  never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-check     fail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-check fail  never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-check      fail  never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-check  fail  never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-check      fail  never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-check  fail  never pass
 test-arm64-arm64-xl-vhd      14 migrate-support-check      fail  never pass
 test-arm64-arm64-xl-vhd      15 saverestore-support-check  fail  never pass
 test-armhf-armhf-xl-rtds     15 migrate-support-check      fail  never pass
 test-armhf-armhf-xl-rtds     16 saverestore-support-check  fail  never pass
 test-armhf-armhf-xl-vhd      14 migrate-support-check      fail  never pass
 test-armhf-armhf-xl-vhd      15 saverestore-support-check  fail  never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-
Re: [RFC PATCH 27/30] Code tagging based latency tracking
On Thu, Sep 01, 2022 at 06:34:30PM -0400, Steven Rostedt wrote:
> On Thu, 1 Sep 2022 17:54:38 -0400
> Kent Overstreet wrote:
>
> > So this looks like it's gotten better since I last looked, but it's still not
> > there yet.
> >
> > Part of the problem is that the tracepoints themselves are in the wrong place:
> > your end event is when a task is woken up, but that means spurious wakeups will
>
> The end event is when a task is scheduled onto the CPU. The start event is
> the first time it is woken up.

Yeah, that's not what I want. You're just tracing latency due to having more
processes runnable than CPUs.

I don't care about that for debugging, though! I specifically want latency at
the wait_event() level, and related - every time a process blocked _on some
condition_, until that condition became true. Not until some random, potentially
spurious wakeup.

> Not the prettiest thing to read. But hey, we got the full stack of where
> these latencies happened!

Most of the time I _don't_ want full stacktraces, though! That means I have a
ton more output to sort through, and the data is far more expensive to collect.

I don't know why it's what people go to first - see the page_owner stuff - but
that doesn't get used much either because the output is _really hard to sort
through_.

Most of the time, just a single file and line number is all you want - and
tracing has always made it hard to get at that.

> Yes, it adds some overhead when the events are triggered due to the
> stacktrace code, but it's extremely useful information.
>
> > So, it looks like tracing has made some progress over the past 10 years,
> > but for debugging latency issues it's still not there yet in general. I
>
> I call BS on that statement. Just because you do not know what has been
> added to the kernel in the last 10 years (like you had no idea about
> seq_buf and that was added in 2014) means to me that you are totally
> clueless on what tracing can and can not do.
>
> It appears to me that you are too focused on inventing your own wheel that
> does exactly what you want before looking to see how things are today. Just
> because something didn't fit your needs 10 years ago doesn't mean that it
> can't fit your needs today.

...And the ad hominem attacks start.

Steve, I'm not attacking you, and there's room enough in this world for the
both of us to be doing our thing creating new and useful tools.

> I'm already getting complaints from customers/users that are saying there's
> too many tools in the toolbox already. (Do we use ftrace/perf/bpf?). The
> idea is to have the tools using mostly the same infrastructure, and not be
> 100% off on its own, unless there's a clear reason to invent a new wheel
> that several people are asking for, not just one or two.

I would like to see more focus on usability.

That means, in a best case scenario, always-on data collection that I can just
look at, and it'll already be in the format most likely to be useful.

Surely you can appreciate the usefulness of that..?

Tracing started out as a tool for efficiently getting lots of data out of the
kernel, and it's great for that. But I think your focus on the cool thing you
built may be blinding you a bit to alternative approaches...
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Thu, Sep 01, 2022 at 03:27:27PM -0700, Roman Gushchin wrote:
> On Wed, Aug 31, 2022 at 01:56:08PM -0700, Yosry Ahmed wrote:
> > This is very interesting work! Do you have any data about the overhead
> > this introduces, especially in a production environment? I am
> > especially interested in memory allocations tracking and detecting
> > leaks.
>
> +1
>
> I think the question whether it indeed can be always turned on in the production
> or not is the main one. If not, the advantage over ftrace/bpf/... is not that
> obvious. Otherwise it will be indeed a VERY useful thing.

Low enough overhead to run in production was my primary design goal.

Stats are kept in a struct that's defined at the callsite. So this adds _no_
pointer chasing to the allocation path, unless we've switched to percpu counters
at that callsite (see the lazy percpu counters patch), where we need to deref
one percpu pointer to save an atomic.

Then we need to stash a pointer to the alloc_tag, so that kfree() can find it.
For slab allocations this uses the same storage area as memcg, so for
allocations that are using that we won't be touching any additional cachelines.
(I wanted the pointer to the alloc_tag to be stored inline with the allocation,
but that would've caused alignment difficulties).

Then there's a pointer deref introduced to the kfree() path, to get back to the
original alloc_tag and subtract the allocation from that callsite. That one
won't be free, and with percpu counters we've got another dependent load too -
hmm, it might be worth benchmarking with just atomics, skipping the percpu
counters.

So the overhead won't be zero, I expect it'll show up in some synthetic
benchmarks, but yes I do definitely expect this to be worth enabling in
production in many scenarios.
Re: [RFC PATCH 27/30] Code tagging based latency tracking
On Thu, 1 Sep 2022 17:54:38 -0400
Kent Overstreet wrote:

> So this looks like it's gotten better since I last looked, but it's still not
> there yet.
>
> Part of the problem is that the tracepoints themselves are in the wrong place:
> your end event is when a task is woken up, but that means spurious wakeups will

The end event is when a task is scheduled onto the CPU. The start event is
the first time it is woken up.

> cause one wait_event() call to be reported as multiple smaller waits, not one
> long wait - oops, now I can't actually find the thing that's causing my
> multi-second delay.
>
> Also, in your example you don't have it broken out by callsite. That would be
> the first thing I'd need for any real world debugging.

OK, how about this (currently we can only have 3 keys, but you can create
multiple histograms on the same event).

 # echo 'hist:keys=comm,stacktrace,delta.buckets=10:sort=delta' > /sys/kernel/tracing/events/synthetic/wakeup_lat/trigger

(notice the "stacktrace" in the keys)

 # cat /sys/kernel/tracing/events/synthetic/wakeup_lat/hist
 # event histogram
 #
 # trigger info: hist:keys=comm,stacktrace,delta.buckets=10:vals=hitcount:sort=delta.buckets=10:size=2048 [active]
 #

 { comm: migration/2 , stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule_idle+0x26/0x40
         do_idle+0xb4/0xd0
         cpu_startup_entry+0x19/0x20
         secondary_startup_64_no_verify+0xc2/0xcb
 , delta: ~ 10-19 } hitcount: 7
 { comm: migration/5 , stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule_idle+0x26/0x40
         do_idle+0xb4/0xd0
         cpu_startup_entry+0x19/0x20
         secondary_startup_64_no_verify+0xc2/0xcb
 , delta: ~ 10-19 } hitcount: 7
 { comm: migration/1 , stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule_idle+0x26/0x40
         do_idle+0xb4/0xd0
         cpu_startup_entry+0x19/0x20
         secondary_startup_64_no_verify+0xc2/0xcb
 , delta: ~ 10-19 } hitcount: 7
 { comm: migration/7 , stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule_idle+0x26/0x40
         do_idle+0xb4/0xd0
         cpu_startup_entry+0x19/0x20
         secondary_startup_64_no_verify+0xc2/0xcb
 , delta: ~ 10-19 } hitcount: 7
 { comm: migration/0 , stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule_idle+0x26/0x40
         do_idle+0xb4/0xd0
         cpu_startup_entry+0x19/0x20
         start_kernel+0x595/0x5be
         secondary_startup_64_no_verify+0xc2/0xcb
 , delta: ~ 10-19 } hitcount: 7
 { comm: migration/4 , stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         schedule_idle+0x26/0x40
         do_idle+0xb4/0xd0
         cpu_startup_entry+0x19/0x20
         secondary_startup_64_no_verify+0xc2/0xcb
 , delta: ~ 10-19 } hitcount: 7
 { comm: rtkit-daemon , stacktrace:
         event_hist_trigger+0x290/0x2b0
         event_triggers_call+0x52/0xe0
         trace_event_buffer_commit+0x193/0x240
         trace_event_raw_event_sched_switch+0x120/0x180
         __traceiter_sched_switch+0x39/0x50
         __schedule+0x310/0x700
         preempt_schedule_common+0x2d/0x70
         preempt_schedule_thunk+0x16/0x18
         _raw_spin_unlock_irq+0x2e/0x40
         eventfd_write+0xc8/0x290
         vfs_write+0xc0/0x2a0
         ksys_write+0x5f/0xe0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x61/0xcb
 , delta: ~ 10
[linux-linus test] 172911: regressions - FAIL
flight 172911 linux-linus real [real]
http://logs.test-lab.xenproject.org/osstest/logs/172911/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-amd64-libvirt           6 libvirt-build   fail REGR. vs. 172133
 build-arm64-libvirt           6 libvirt-build   fail REGR. vs. 172133
 build-i386-libvirt            6 libvirt-build   fail REGR. vs. 172133
 build-armhf-libvirt           6 libvirt-build   fail REGR. vs. 172133

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-libvirt      1 build-check(1)             blocked  n/a
 test-armhf-armhf-libvirt-qcow2 1 build-check(1)            blocked  n/a
 test-armhf-armhf-libvirt-raw  1 build-check(1)             blocked  n/a
 test-amd64-amd64-libvirt      1 build-check(1)             blocked  n/a
 test-amd64-amd64-libvirt-pair 1 build-check(1)             blocked  n/a
 test-amd64-amd64-libvirt-qcow2 1 build-check(1)            blocked  n/a
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a
 test-amd64-amd64-libvirt-raw  1 build-check(1)             blocked  n/a
 test-amd64-amd64-libvirt-xsm  1 build-check(1)             blocked  n/a
 test-arm64-arm64-libvirt-raw  1 build-check(1)             blocked  n/a
 test-arm64-arm64-libvirt-xsm  1 build-check(1)             blocked  n/a
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stop         fail  like 172133
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 172133
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop         fail  like 172133
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop         fail  like 172133
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stop         fail  like 172133
 test-arm64-arm64-xl-seattle  15 migrate-support-check      fail  never pass
 test-arm64-arm64-xl-seattle  16 saverestore-support-check  fail  never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-check      fail  never pass
 test-arm64-arm64-xl          15 migrate-support-check      fail  never pass
 test-arm64-arm64-xl          16 saverestore-support-check  fail  never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-check  fail  never pass
 test-arm64-arm64-xl-xsm      15 migrate-support-check      fail  never pass
 test-arm64-arm64-xl-xsm      16 saverestore-support-check  fail  never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-check      fail  never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-check  fail  never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-check      fail  never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-check  fail  never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-check      fail  never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-check  fail  never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-check     fail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-check fail  never pass
 test-armhf-armhf-xl          15 migrate-support-check      fail  never pass
 test-armhf-armhf-xl          16 saverestore-support-check  fail  never pass
 test-arm64-arm64-xl-vhd      14 migrate-support-check      fail  never pass
 test-armhf-armhf-xl-rtds     15 migrate-support-check      fail  never pass
 test-arm64-arm64-xl-vhd      15 saverestore-support-check  fail  never pass
 test-armhf-armhf-xl-rtds     16 saverestore-support-check  fail  never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-check    fail  never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-check      fail  never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-check      fail  never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-check  fail  never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-check  fail  never pass
 test-armhf-armhf-xl-vhd      14 migrate-support-check      fail  never pass
 test-armhf-armhf-xl-vhd      15 saverestore-support-check  fail  never pass

version targeted for testing:
 linux                c5e4d5e99162ba8025d58a3af7ad103f155d2df7
baseline version:
 linux                b44f2fd87919b5ae6e1756d4c7ba2cbba22238e1

Last test of basis   172133  2022-08-04 05:14:48 Z   28 days
Failing since        172152  2022-08-05 04:01:26 Z   27 days   63 attempts
Testing same since   172902  2022-08-31 23:12:06 Z    0 days    2 attempts

1599 people touched revisions under test, not listing them all

jobs:
 build-amd64-xsm                               pass
 build-arm64-xsm                               pass
 build-i386-xsm                                pass
 build-amd64
Re: [RFC PATCH 27/30] Code tagging based latency tracking
On Thu, Sep 01, 2022 at 05:38:44PM -0400, Steven Rostedt wrote:
> On Tue, 30 Aug 2022 14:49:16 -0700
> Suren Baghdasaryan wrote:
>
> > From: Kent Overstreet
> >
> > This adds the ability to easily instrument code for measuring latency.
> > To use, add the following to calls to your code, at the start and end of
> > the event you wish to measure:
> >
> > code_tag_time_stats_start(start_time);
> > code_tag_time_stats_finish(start_time);
>
> So you need to modify the code to see what you want?

Figuring out the _correct_ place to measure is often a significant amount of
the total effort. Having done so once, why not annotate that in the source
code?

> For function length you could just do something like this:
>
>  # cd /sys/kernel/tracing
>  # echo __skb_wait_for_more_packets > set_ftrace_filter
>  # echo 1 > function_profile_enabled
>  # cat trace_stat/function*
>   Function                       Hit    Time         Avg          s^2
>   --------                       ---    ----         ---          ---
>   __skb_wait_for_more_packets      1    0.000 us     0.000 us     0.000 us
>   Function                       Hit    Time         Avg          s^2
>   --------                       ---    ----         ---          ---
>   __skb_wait_for_more_packets      1    74.813 us    74.813 us    0.000 us
>   Function                       Hit    Time         Avg          s^2
>   --------                       ---    ----         ---          ---
>   Function                       Hit    Time         Avg          s^2
>   --------                       ---    ----         ---          ---
>
> The above is for a 4 CPU machine. The s^2 is the square of the standard
> deviation (makes not having to do divisions while it runs).
>
> But if you are looking for latency between two events (which can be kprobes
> too, where you do not need to rebuild your kernel):
>
> From: https://man.archlinux.org/man/sqlhist.1.en
> which comes in: https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/
> if not already installed on your distro.
>
> # sqlhist -e -n wakeup_lat 'select end.next_comm as
> comm,start.pid,start.prio,(end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as
> delta from sched_waking as start join sched_switch as end on start.pid =
> end.next_pid where start.prio < 100'
>
> The above creates a synthetic event called "wakeup_lat" that joins two
> events (sched_waking and sched_switch) when the pid field of sched_waking
> matches the next_pid field of sched_switch. When there is a match, it will
> trigger the wakeup_lat event only if the prio of the sched_waking event is
> less than 100 (which in the kernel means any real-time task). The
> wakeup_lat event will record the next_comm (as comm field), the pid of
> woken task and the time delta in microseconds between the two events.

So this looks like it's gotten better since I last looked, but it's still not
there yet.

Part of the problem is that the tracepoints themselves are in the wrong place:
your end event is when a task is woken up, but that means spurious wakeups will
cause one wait_event() call to be reported as multiple smaller waits, not one
long wait - oops, now I can't actually find the thing that's causing my
multi-second delay.

Also, in your example you don't have it broken out by callsite. That would be
the first thing I'd need for any real world debugging.

So, it looks like tracing has made some progress over the past 10 years, but
for debugging latency issues it's still not there yet in general. I will
definitely remember function latency tracing the next time I'm doing
performance work, but I expect that to be far too heavy to enable on a live
server.

This thing is only a couple hundred lines of code though, so perhaps tracing
shouldn't be the only tool in our toolbox :)
Re: [RFC PATCH 27/30] Code tagging based latency tracking
On Thu, 1 Sep 2022 17:38:44 -0400
Steven Rostedt wrote:

> # echo 'hist:keys=comm,prio,delta.buckets=10:sort=delta' > /sys/kernel/tracing/events/synthetic/wakeup_lat/trigger

The above could almost be done with sqlhist (but I haven't implemented
"buckets=10" yet because that's a new feature. But for now, let's do log2):

 # sqlhist -e 'select comm,prio,cast(delta as log2) from wakeup_lat'

("-e" is to execute the command, as it normally only displays what commands
need to be run to create the synthetic events and histograms)

 # cat /sys/kernel/tracing/events/synthetic/wakeup_lat/hist
 # event histogram
 #
 # trigger info: hist:keys=comm,prio,delta.log2:vals=hitcount:sort=hitcount:size=2048 [active]
 #

 { comm: migration/4  , prio: 0, delta: ~ 2^5  } hitcount: 1
 { comm: migration/0  , prio: 0, delta: ~ 2^4  } hitcount: 2
 { comm: rtkit-daemon , prio: 0, delta: ~ 2^7  } hitcount: 2
 { comm: rtkit-daemon , prio: 0, delta: ~ 2^6  } hitcount: 4
 { comm: migration/0  , prio: 0, delta: ~ 2^5  } hitcount: 8
 { comm: migration/4  , prio: 0, delta: ~ 2^4  } hitcount: 9
 { comm: migration/2  , prio: 0, delta: ~ 2^4  } hitcount: 10
 { comm: migration/5  , prio: 0, delta: ~ 2^4  } hitcount: 10
 { comm: migration/7  , prio: 0, delta: ~ 2^4  } hitcount: 10
 { comm: migration/1  , prio: 0, delta: ~ 2^4  } hitcount: 10
 { comm: migration/6  , prio: 0, delta: ~ 2^4  } hitcount: 10

 Totals:
     Hits: 76
     Entries: 11
     Dropped: 0

-- Steve
Re: [RFC PATCH 27/30] Code tagging based latency tracking
On Tue, 30 Aug 2022 14:49:16 -0700
Suren Baghdasaryan wrote:

> From: Kent Overstreet
>
> This adds the ability to easily instrument code for measuring latency.
> To use, add the following to calls to your code, at the start and end of
> the event you wish to measure:
>
> code_tag_time_stats_start(start_time);
> code_tag_time_stats_finish(start_time);

So you need to modify the code to see what you want?

> Statistics will then show up in debugfs under
> /sys/kernel/debug/time_stats, listed by file and line number.
>
> Statistics measured include weighted averages of frequency, duration, max
> duration, as well as quantiles.
>
> This patch also instruments all calls to init_wait and finish_wait,
> which includes all calls to wait_event. Example debugfs output:
>
> fs/xfs/xfs_trans_ail.c:746 module:xfs func:xfs_ail_push_all_sync
> count:          17
> rate:            0/sec
> frequency:       2 sec
> avg duration:   10 us
> max duration:  232 us
> quantiles (ns): 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128
>
> lib/sbitmap.c:813 module:sbitmap func:sbitmap_finish_wait
> count:           3
> rate:            0/sec
> frequency:       4 sec
> avg duration:    4 sec
> max duration:    4 sec
> quantiles (ns): 0 4288669120 4288669120 5360836048 5360836048 5360836048
> 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048
> 5360836048 5360836048
>
> net/core/datagram.c:122 module:datagram func:__skb_wait_for_more_packets
> count:          10
> rate:            1/sec
> frequency:     859 ms
> avg duration:  472 ms
> max duration:   30 sec
> quantiles (ns): 0 12279 12279 15669 15669 15669 15669 17217 17217 17217 17217
> 17217 17217 17217 17217

For function length you could just do something like this:

 # cd /sys/kernel/tracing
 # echo __skb_wait_for_more_packets > set_ftrace_filter
 # echo 1 > function_profile_enabled
 # cat trace_stat/function*
  Function                       Hit    Time         Avg          s^2
  --------                       ---    ----         ---          ---
  __skb_wait_for_more_packets      1    0.000 us     0.000 us     0.000 us
  Function                       Hit    Time         Avg          s^2
  --------                       ---    ----         ---          ---
  __skb_wait_for_more_packets      1    74.813 us    74.813 us    0.000 us
  Function                       Hit    Time         Avg          s^2
  --------                       ---    ----         ---          ---
  Function                       Hit    Time         Avg          s^2
  --------                       ---    ----         ---          ---

The above is for a 4 CPU machine. The s^2 is the square of the standard
deviation (makes not having to do divisions while it runs).

But if you are looking for latency between two events (which can be kprobes
too, where you do not need to rebuild your kernel):

From: https://man.archlinux.org/man/sqlhist.1.en
which comes in: https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/
if not already installed on your distro.

 # sqlhist -e -n wakeup_lat 'select end.next_comm as comm,start.pid,start.prio,(end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as delta from sched_waking as start join sched_switch as end on start.pid = end.next_pid where start.prio < 100'

The above creates a synthetic event called "wakeup_lat" that joins two
events (sched_waking and sched_switch) when the pid field of sched_waking
matches the next_pid field of sched_switch. When there is a match, it will
trigger the wakeup_lat event only if the prio of the sched_waking event is
less than 100 (which in the kernel means any real-time task). The
wakeup_lat event will record the next_comm (as comm field), the pid of
woken task and the time delta in microseconds between the two events.

 # echo 'hist:keys=comm,prio,delta.buckets=10:sort=delta' > /sys/kernel/tracing/events/synthetic/wakeup_lat/trigger

The above starts a histogram tracing the name of the woken task, the
priority and the delta (but placing the delta in buckets of size 10, as we
do not need to see every latency number).

 # chrt -f 20 sleep 1

Force something to be woken up that is interesting.

 # cat /sys/kernel/tracing/events/synthetic/wakeup_lat/hist
 # event histogram
 #
 # trigger info: hist:keys=comm,prio,delta.buckets=10:vals=hitcount:sort=delta.buckets=10:size=2048 [active]
 #

 { comm: migration/5  , prio: 0, delta: ~ 10-19 } hitcount: 1
 { comm: migration/2  , prio: 0, delta: ~ 10-19 } hitcount: 1
 { comm: sleep        , prio: 79, delta: ~ 10-19 } hitcount: 1
 { comm: migration/7  , prio: 0, delta: ~ 10-19 } hitcount:
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Thu, Sep 01, 2022 at 12:39:11PM -0700, Suren Baghdasaryan wrote: > kmemleak is known to be slow and it's even documented [1], so I hope I > can skip that part. For page_owner to provide the comparable > information we would have to capture the call stacks for all page > allocations unlike our proposal which allows to do that selectively > for specific call sites. I'll post the overhead numbers of call stack > capturing once I'm finished with profiling the latest code, hopefully > sometime tomorrow, in the worst case after the long weekend. To expand on this further: we're stashing a pointer to the alloc_tag, which is defined at the allocation callsite. That's how we're able to decrement the proper counter on free, and why this beats any tracing based approach - with tracing you'd instead have to correlate allocate/free events. Ouch. > > Yes, tracking back the call trace would be really needed. The question > > is whether this is really prohibitively expensive. How much overhead are > > we talking about? There is no free lunch here, really. You either have > > the overhead during runtime when the feature is used or on the source > > code level for all the future development (with a maze of macros and > > wrappers). The full call stack is really not what you want in most applications - that's what people think they want at first, and why page_owner works the way it does, but it turns out that then combining all the different but related stack traces _sucks_ (so why were you saving them in the first place?), and then you have to do a separate memory allocate for each stack track, which destroys performance. > > Will post the overhead numbers soon. > What I hear loud and clear is that we need a kernel command-line kill > switch that mitigates the overhead for having this feature. That seems > to be the main concern. 
> Thanks,

After looking at this more I don't think we should commit just yet - there are some tradeoffs to be evaluated, and maybe the thing to do first will be to see if we can cut down on the (huge!) number of allocation interfaces before adding more complexity.

The ideal approach, from a performance POV, would be to pass a pointer to the alloc tag to kmalloc() et al., and then we'd have the actual accounting code in one place and use a jump label to skip over it when this feature is disabled.

However, there are _many, many_ wrapper functions in our allocation code, and this approach is going to make the plumbing for the hooks quite a bit bigger than what we have now - and then, do we want to have this extra alloc_tag parameter that's not used when CONFIG_ALLOC_TAGGING=n? It's a tiny cost for an extra unused parameter, but it's a cost - or do we get rid of that with some extra macro hackery (eww, gross)?

If we do the boot parameter before submission, I think we'll have something that's maybe not strictly ideal from a performance POV when CONFIG_ALLOC_TAGGING=y but boot parameter=n, but it'll introduce the minimum amount of macro insanity.

What we should be able to do pretty easily is discard the alloc_tag structs when the boot parameter is disabled, because they're in special elf sections and we already do that (e.g. for .init).
[ovmf test] 172917: regressions - FAIL
flight 172917 ovmf real [real]
http://logs.test-lab.xenproject.org/osstest/logs/172917/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-i386-libvirt            6 libvirt-build   fail REGR. vs. 172136
 build-amd64-libvirt           6 libvirt-build   fail REGR. vs. 172136

version targeted for testing:
 ovmf                 ac55fcb051e4f288d29432043a42c13866419598
baseline version:
 ovmf                 444260d45ec2a84e8f8c192b3539a3cd5591d009

Last test of basis   172136  2022-08-04 06:43:42 Z  28 days
Failing since        172151  2022-08-05 02:40:28 Z  27 days  220 attempts
Testing same since   172917  2022-09-01 15:42:06 Z   0 days    1 attempts

People who touched revisions under test:
 Abdul Lateef Attar
 Abner Chang
 Ard Biesheuvel
 Bob Feng
 Chasel Chiu
 Chen, Xiao X
 Czajkowski, Maciej
 Dimitrije Pavlov
 Dun Tan
 Edward Pickup
 Foster Nong
 Gregx Yeh
 Guo Dong
 Igor Kulchytskyy
 James Lu
 Jeff Brasen
 Jiaxin Wu
 Jose Marinho
 KasimX Liu
 Kavya
 Konstantin Aladyshev
 Kun Qin
 Liming Gao
 Liu, Zhiguang
 Maciej Czajkowski
 Michael D Kinney
 Michael Kubacki
 Ray Ni
 Rebecca Cran
 Rebecca Cran
 Sainadh Nagolu
 Sami Mujawar
 Shengfengx Xue
 Wu, Jiaxin
 Xiao X Chen
 Yuanhao Xie
 Zhiguang Liu

jobs:
 build-amd64-xsm                        pass
 build-i386-xsm                         pass
 build-amd64                            pass
 build-i386                             pass
 build-amd64-libvirt                    fail
 build-i386-libvirt                     fail
 build-amd64-pvops                      pass
 build-i386-pvops                       pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64   pass
 test-amd64-i386-xl-qemuu-ovmf-amd64    pass

sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
 http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
 http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
 http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
 http://xenbits.xen.org/gitweb?p=osstest.git;a=summary

Not pushing.

(No revision log; it would be 1425 lines long.)
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Thu, Sep 1, 2022 at 12:15 PM Michal Hocko wrote:
>
> On Thu 01-09-22 08:33:19, Suren Baghdasaryan wrote:
> > On Thu, Sep 1, 2022 at 12:18 AM Michal Hocko wrote:
> [...]
> > > So I find Peter's question completely appropriate while your response to
> > > that not so much! Maybe ftrace is not the right tool for the intended
> > > job. Maybe there are other ways and it would be really great to show
> > > that those have been evaluated and they are not suitable for a), b) and
> > > c) reasons.
> >
> > That's fair.
> > For memory tracking I looked into using kmemleak and page_owner which
> > can't match the required functionality at an overhead acceptable for
> > production and pre-production testing environments.
>
> Being more specific would be really helpful. Especially when your cover
> letter suggests that you rely on page_owner/memcg metadata as well to
> match allocations and their freeing parts.

kmemleak is known to be slow and it's even documented [1], so I hope I can skip that part. For page_owner to provide the comparable information we would have to capture the call stacks for all page allocations unlike our proposal which allows to do that selectively for specific call sites. I'll post the overhead numbers of call stack capturing once I'm finished with profiling the latest code, hopefully sometime tomorrow, in the worst case after the long weekend.

> > traces + BPF I
> > haven't evaluated myself but heard from other members of my team who
> > tried using that in production environment with poor results. I'll try
> > to get more specific information on that.
>
> That would be helpful as well.

Ack.

> > > E.g. Oscar has been working on extending page_ext to track number of
> > > allocations for specific calltrace[1]. Is this 1:1 replacement? No! But
> > > it can help in environments where page_ext can be enabled and it is
> > > completely non-intrusive to the MM code.
> >
> > Thanks for pointing out this work. I'll need to review and maybe
> > profile it before making any claims.
> >
> > > If the page_ext overhead is not desirable/acceptable then I am sure
> > > there are other options. E.g. kprobes/LivePatching framework can hook
> > > into functions and alter their behavior. So why not use that for data
> > > collection? Has this been evaluated at all?
> >
> > I'm not sure how I can hook into say alloc_pages() to find out where
> > it was called from without capturing the call stack (which would
> > introduce an overhead at every allocation). Would love to discuss this
> > or other alternatives if they can be done with low enough overhead.
>
> Yes, tracking back the call trace would be really needed. The question
> is whether this is really prohibitively expensive. How much overhead are
> we talking about? There is no free lunch here, really. You either have
> the overhead during runtime when the feature is used or on the source
> code level for all the future development (with a maze of macros and
> wrappers).

Will post the overhead numbers soon.
What I hear loud and clear is that we need a kernel command-line kill
switch that mitigates the overhead for having this feature. That seems
to be the main concern.

Thanks,
Suren.

[1] https://docs.kernel.org/dev-tools/kmemleak.html#limitations-and-drawbacks

> Thanks!
> --
> Michal Hocko
> SUSE Labs
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Thu 01-09-22 08:33:19, Suren Baghdasaryan wrote:
> On Thu, Sep 1, 2022 at 12:18 AM Michal Hocko wrote:
[...]
> > So I find Peter's question completely appropriate while your response to
> > that not so much! Maybe ftrace is not the right tool for the intended
> > job. Maybe there are other ways and it would be really great to show
> > that those have been evaluated and they are not suitable for a), b) and
> > c) reasons.
>
> That's fair.
> For memory tracking I looked into using kmemleak and page_owner which
> can't match the required functionality at an overhead acceptable for
> production and pre-production testing environments.

Being more specific would be really helpful. Especially when your cover letter suggests that you rely on page_owner/memcg metadata as well to match allocations and their freeing parts.

> traces + BPF I
> haven't evaluated myself but heard from other members of my team who
> tried using that in production environment with poor results. I'll try
> to get more specific information on that.

That would be helpful as well.

> > E.g. Oscar has been working on extending page_ext to track number of
> > allocations for specific calltrace[1]. Is this 1:1 replacement? No! But
> > it can help in environments where page_ext can be enabled and it is
> > completely non-intrusive to the MM code.
>
> Thanks for pointing out this work. I'll need to review and maybe
> profile it before making any claims.
>
> > If the page_ext overhead is not desirable/acceptable then I am sure
> > there are other options. E.g. kprobes/LivePatching framework can hook
> > into functions and alter their behavior. So why not use that for data
> > collection? Has this been evaluated at all?
>
> I'm not sure how I can hook into say alloc_pages() to find out where
> it was called from without capturing the call stack (which would
> introduce an overhead at every allocation). Would love to discuss this
> or other alternatives if they can be done with low enough overhead.

Yes, tracking back the call trace would be really needed. The question is whether this is really prohibitively expensive. How much overhead are we talking about? There is no free lunch here, really. You either have the overhead during runtime when the feature is used or on the source code level for all the future development (with a maze of macros and wrappers).

Thanks!
--
Michal Hocko
SUSE Labs
Re: [RFC PATCH 03/30] Lazy percpu counters
On Thu, Sep 01, 2022 at 10:32:19AM -0400, Kent Overstreet wrote:
> On Thu, Sep 01, 2022 at 08:51:31AM +0200, Peter Zijlstra wrote:
> > On Tue, Aug 30, 2022 at 02:48:52PM -0700, Suren Baghdasaryan wrote:
> > > +static void lazy_percpu_counter_switch_to_pcpu(struct raw_lazy_percpu_counter *c)
> > > +{
> > > +	u64 __percpu *pcpu_v = alloc_percpu_gfp(u64, GFP_ATOMIC|__GFP_NOWARN);
> >
> > Realize that this is incorrect when used under a raw_spinlock_t.
>
> Can you elaborate?

The required lock order is: raw_spinlock_t < spinlock_t < mutex, and the allocators live at spinlock_t level. Also see CONFIG_PROVE_RAW_LOCK_NESTING; there might be a document mentioning all this somewhere.

Additionally, this (obviously) also isn't NMI safe.
Re: [PATCH v3 7/7] xen/arm: introduce new xen,enhanced property value
Hi Rahul,

On 01/09/2022 10:13, Rahul Singh wrote:

 Introduce a new "xen,enhanced" dom0less property value "no-xenstore" to
 disable xenstore interface for dom0less guests.

 Signed-off-by: Rahul Singh
 ---
 Changes in v3:
  - new patch in this version
 ---
  docs/misc/arm/device-tree/booting.txt |  4
  xen/arch/arm/domain_build.c           | 10 +++---
  xen/arch/arm/include/asm/kernel.h     |  3 +++
  3 files changed, 14 insertions(+), 3 deletions(-)

 diff --git a/docs/misc/arm/device-tree/booting.txt b/docs/misc/arm/device-tree/booting.txt
 index edef98e6d1..87f57f8889 100644
 --- a/docs/misc/arm/device-tree/booting.txt
 +++ b/docs/misc/arm/device-tree/booting.txt
 @@ -204,6 +204,10 @@ with the following properties:
  - "disabled"
  Xen PV interfaces are disabled.

 +- no-xenstore
 +Xen PV interfaces, including grant-table will be enabled for the VM but
 +xenstore will be disabled for the VM.

NIT: I would drop one of the "for the VM" as it seems to be redundant.

 +
  If the xen,enhanced property is present with no value, it defaults to
  "enabled". If the xen,enhanced property is not present, PV interfaces
  are disabled.

 diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
 index 4b24261825..8dd9984225 100644
 --- a/xen/arch/arm/domain_build.c
 +++ b/xen/arch/arm/domain_build.c
 @@ -3336,10 +3336,14 @@ static int __init construct_domU(struct domain *d,
          (rc == 0 && !strcmp(dom0less_enhanced, "enabled")) )
      {
          if ( hardware_domain )
 -            kinfo.dom0less_enhanced = true;
 +            kinfo.dom0less_xenstore = true;
          else
 -            panic("Tried to use xen,enhanced without dom0\n");
 +            panic("Tried to use xen,enhanced without dom0 without no-xenstore\n");

This is a bit hard to parse. How about: "At the moment, Xenstore support requires dom0 to be present"?

      }
 +    else if ( rc == 0 && !strcmp(dom0less_enhanced, "no-xenstore") )
 +        kinfo.dom0less_xenstore = false;
 +
 +    kinfo.dom0less_enhanced = true;

Wouldn't this now set dom0less_enhanced unconditionally?

      if ( vcpu_create(d, 0) == NULL )
          return -ENOMEM;
 @@ -3379,7 +3383,7 @@ static int __init construct_domU(struct domain *d,
      if ( rc < 0 )
          return rc;

 -    if ( kinfo.dom0less_enhanced )
 +    if ( kinfo.dom0less_xenstore )
      {
          ASSERT(hardware_domain);
          rc = alloc_xenstore_evtchn(d);

 diff --git a/xen/arch/arm/include/asm/kernel.h b/xen/arch/arm/include/asm/kernel.h
 index c4dc039b54..3d7fa94910 100644
 --- a/xen/arch/arm/include/asm/kernel.h
 +++ b/xen/arch/arm/include/asm/kernel.h
 @@ -39,6 +39,9 @@ struct kernel_info {
      /* Enable PV drivers */
      bool dom0less_enhanced;

 +    /* Enable Xenstore */
 +    bool dom0less_xenstore;
 +

AFAIU, it is not possible to have *_xenstore = true and *_enhanced = false. I think it would be clearer if ``dom0less_enhanced`` were turned into an enum with 3 values:
 - None
 - NOXENSTORE/BASIC
 - FULLY_ENHANCED

If we want to be future proof, I would use a field 'flags' where non-zero means enhanced. Each bit would indicate which feature of Xen is exposed.

Cheers,

--
Julien Grall
Re: [PATCH v3 6/7] xen/arm: introduce xen-evtchn dom0less property
Hi Rahul, On 01/09/2022 10:13, Rahul Singh wrote: Introduce a new sub-node under /chosen node to establish static event channel communication between domains on dom0less systems. An event channel will be created beforehand to allow the domains to send notifications to each other. Signed-off-by: Rahul Singh --- Changes in v3: - use device-tree used_by to find the domain id of the evtchn node. - add new static_evtchn_create variable in struct dt_device_node to hold the information if evtchn is already created. - fix minor comments Changes in v2: - no change --- docs/misc/arm/device-tree/booting.txt | 64 - xen/arch/arm/domain_build.c | 128 ++ xen/arch/arm/include/asm/setup.h | 1 + xen/arch/arm/setup.c | 2 + xen/include/xen/device_tree.h | 13 +++ 5 files changed, 207 insertions(+), 1 deletion(-) diff --git a/docs/misc/arm/device-tree/booting.txt b/docs/misc/arm/device-tree/booting.txt index 98253414b8..edef98e6d1 100644 --- a/docs/misc/arm/device-tree/booting.txt +++ b/docs/misc/arm/device-tree/booting.txt @@ -212,7 +212,7 @@ with the following properties: enable only selected interfaces. Under the "xen,domain" compatible node, one or more sub-nodes are present -for the DomU kernel and ramdisk. +for the DomU kernel, ramdisk and static event channel. The kernel sub-node has the following properties: @@ -254,11 +254,44 @@ The ramdisk sub-node has the following properties: property because it will be created by the UEFI stub on boot. This option is needed only when UEFI boot is used. +The static event channel sub-node has the following properties: + +- compatible + +"xen,evtchn" + +- xen,evtchn + +The property is tuples of two numbers +(local-evtchn link-to-foreign-evtchn) where: + +local-evtchn is an integer value that will be used to allocate local port +for a domain to send and receive event notifications to/from the remote +domain. Maximum supported value is 2^17 for FIFO ABI and 4096 for 2L ABI. +It is recommended to use low event channel IDs. 
+ +link-to-foreign-evtchn is a single phandle to a remote evtchn to which +local-evtchn will be connected. Example === chosen { + +module@0 { +compatible = "multiboot,kernel", "multiboot,module"; +xen,uefi-binary = "..."; +bootargs = "..."; + +}; NIT: Describing this node in the example seems unnecessary. + +/* one sub-node per local event channel */ +ec1: evtchn@1 { + compatible = "xen,evtchn-v1"; + /* local-evtchn link-to-foreign-evtchn */ + xen,evtchn = <0xa &ec2>; +}; + Here you provide an example for dom0. But the position where you describe the binding suggests this binding only exists for domUs. Either we duplicate the binding or we re-order the documentation to have common binding in a single place. My preference would be the latter. domU1 { compatible = "xen,domain"; #address-cells = <0x2>; @@ -277,6 +310,23 @@ chosen { compatible = "multiboot,ramdisk", "multiboot,module"; reg = <0x0 0x4b00 0xff>; }; + +/* one sub-node per local event channel */ +ec2: evtchn@2 { +compatible = "xen,evtchn-v1"; +/* local-evtchn link-to-foreign-evtchn */ +xen,evtchn = <0xa &ec1>; +}; + +ec3: evtchn@3 { +compatible = "xen,evtchn-v1"; +xen,evtchn = <0xb &ec5>; +}; + +ec4: evtchn@4 { +compatible = "xen,evtchn-v1"; +xen,evtchn = <0xc &ec6>; +}; }; domU2 { @@ -296,6 +346,18 @@ chosen { compatible = "multiboot,ramdisk", "multiboot,module"; reg = <0x0 0x4d00 0xff>; }; + +/* one sub-node per local event channel */ +ec5: evtchn@5 { +compatible = "xen,evtchn-v1"; +/* local-evtchn link-to-foreign-evtchn */ +xen,evtchn = <0xb &ec3>; +}; + +ec6: evtchn@6 { +compatible = "xen,evtchn-v1"; +xen,evtchn = <0xd &ec4>; +}; }; }; diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c index 707e247f6a..4b24261825 100644 --- a/xen/arch/arm/domain_build.c +++ b/xen/arch/arm/domain_build.c @@ -33,6 +33,8 @@ #include #include +#define STATIC_EVTCHN_NODE_SIZE_CELLS 2 + static unsigned int __initdata opt_dom0_max_vcpus; integer_param("dom0_max_vcpus", opt_dom0_max_vcpus); @@ -3052,6 
+3054,131 @@ void __init evtchn_allocate(struct domain *d) d->arch.hvm.params[HVM_PARAM_CALLBACK_IRQ] = val; } +static int __init get_evtchn_dt_property(const struct dt_device_node *np, + uint32_t *port, uint32_t *phandle) +{ +const __be32 *prop = NULL; +uint32_t len; + +pr
Re: [PATCH v3 3/7] xen/evtchn: restrict the maximum number of evtchn supported for domUs
Hi,

On 01/09/2022 14:53, Michal Orzel wrote:

 On 01/09/2022 11:13, Rahul Singh wrote:

  Restrict the maximum number of evtchn supported for domUs to avoid
  allocating a large amount of memory in Xen. Set the default value of
  max_evtchn_port to 1023. The value of 1023 should be sufficient for
  domU guests because on ARM we don't bind physical interrupts to event
  channels. The only use of the evtchn port is inter-domain
  communications.

 Following the previous discussion, I think the only missing piece is an
 explanation that 1023 was chosen to follow the default behavior of libxl.

+1. The current explanation only justifies why we haven't added a device-tree property to change the default value.

 Apart from that:
 Reviewed-by: Michal Orzel

Cheers,

--
Julien Grall
Re: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and heap allocator
Hi Stefano,

On 01/09/2022 18:08, Stefano Stabellini wrote:

   Also, what happen with UEFI? Is it easy to guarantee the region will
   not be used?

  For now I think it is not easy to guarantee that, do you have some
  ideas in mind? I think I can follow this in above follow-up series to
  improve things.

 For clarity, are we worried that the region is used by the bootloader
 for other things? For instance U-Boot or Tianocore placing some
 firmware tables inside the range specified for xenheap?

Yes. I think it would be difficult for an admin to figure out which regions are used. Although they are likely (?) going to be static for a given UEFI/U-boot build.

My major concern is that such a bug can be very difficult to root cause because we have no safety in Xen. The most likely symptom would be random corruption.

Cheers,

--
Julien Grall
Re: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and heap allocator
Hi Henry, On 01/09/2022 17:05, Henry Wang wrote: @@ -755,17 +779,21 @@ static void __init setup_mm(void) do { -e = consider_modules(ram_start, ram_end, +e = !reserved_heap ? +consider_modules(ram_start, ram_end, pfn_to_paddr(xenheap_pages), - 32<<20, 0); + 32<<20, 0) : +reserved_heap_end; Not entirely related to this series. Now the assumption is the admin will make sure that none of the reserved regions will overlap. Do we have any tool to help the admin to verify it? If yes, can we have a pointer in the documentation? If not, should this be done in Xen? In the RFC we had the same discussion of this issue [1] and I think a follow-up series might needed to do the overlap check if we want to do that in Xen. For the existing tool, I am thinking of ImageBuilder, but I am curious about Stefano's opinion. Also, what happen with UEFI? Is it easy to guarantee the region will not be used? For now I think it is not easy to guarantee that, do you have some ideas in mind? I think I can follow this in above follow-up series to improve things. I don't have any ideas how we can guarantee it (even when using image builder). I think we may have to end up to check the overlaps in Xen. + if ( e ) break; xenheap_pages >>= 1; } while ( !opt_xenheap_megabytes && xenheap_pages > 32<<(20- PAGE_SHIFT) ); -if ( ! e ) -panic("Not not enough space for xenheap\n"); +if ( ! e || + ( reserved_heap && reserved_heap_pages < 32<<(20-PAGE_SHIFT) ) ) +panic("Not enough space for xenheap\n"); So on arm32, the xenheap *must* be contiguous. AFAICT, reserved_heap_pages is the total number of pages in the heap. They may not be contiguous. So I think this wants to be reworked so we look for one of the region that match the definition written above the loop. Thanks for raising this concern, I will do this in V2. 
domheap_pages = heap_pages - xenheap_pages; @@ -810,9 +838,9 @@ static void __init setup_mm(void) static void __init setup_mm(void) { const struct meminfo *banks = &bootinfo.mem; -paddr_t ram_start = ~0; -paddr_t ram_end = 0; -paddr_t ram_size = 0; +paddr_t ram_start = ~0, bank_start = ~0; +paddr_t ram_end = 0, bank_end = 0; +paddr_t ram_size = 0, bank_size = 0; unsigned int i; init_pdx(); @@ -821,17 +849,36 @@ static void __init setup_mm(void) * We need some memory to allocate the page-tables used for the xenheap * mappings. But some regions may contain memory already allocated * for other uses (e.g. modules, reserved-memory...). - * + * If reserved heap regions are properly defined, (only) add these regions + * in the boot allocator. > + */ +if ( reserved_heap ) +{ +for ( i = 0 ; i < bootinfo.reserved_mem.nr_banks; i++ ) +{ +if ( bootinfo.reserved_mem.bank[i].xen_heap ) +{ +bank_start = bootinfo.reserved_mem.bank[i].start; +bank_size = bootinfo.reserved_mem.bank[i].size; +bank_end = bank_start + bank_size; + +init_boot_pages(bank_start, bank_end); +} +} +} +/* + * No reserved heap regions: * For simplicity, add all the free regions in the boot allocator. */ -populate_boot_allocator(); +else +populate_boot_allocator(); total_pages = 0; for ( i = 0; i < banks->nr_banks; i++ ) { This code is now becoming quite confusing to understanding. This loop is meant to map the xenheap. If I follow your documentation, it would mean that only the reserved region should be mapped. Yes I think this is the same question that I raised in the scissors line of the commit message of this patch. Sorry I didn't notice the comment after the scissors line. This is the same question :) What I intend to do is still mapping the whole RAM because of the xenheap_* variables that you mentioned in... More confusingly, xenheap_* variables will cover the full RAM. ...here. But only adding the reserved region to the boot allocator so the reserved region can become the heap later on. 
I am wondering if we have a more clear way to do that, any suggestions? I think your code is correct. It only needs some renaming of the existing variable (maybe to directmap_*?) to make clear the area is used to access the RAM easily. Cheers, -- Julien Grall
Re: [PATCH v6 5/9] xen/arm: Add additional reference to owner domain when the owner is allocated
Hi Penny,

On 21/07/2022 14:21, Penny Zheng wrote:

 Borrower domain will fail to get a page ref using the owner domain
 during allocation, when the owner is created after borrower. So here,
 we decide to get and add the right amount of reference, which is the
 number of borrowers, when the owner is allocated.

 Signed-off-by: Penny Zheng
 Reviewed-by: Stefano Stabellini

IMHO, this tag should not have been kept given...

 ---
 v6 change:
 - adapt to the change of "nr_shm_borrowers"

... this change. 'Reviewed-by' means that the person reviewed the code and therefore agrees with the logic. So I would only keep the tag if the change is trivial (including typo, coding style) and would drop it (or confirm with the person) otherwise.

Stefano, can you confirm you are happy that your reviewed-by tag is kept?

 - add in-code comment to explain if the borrower is created first, we
   intend to add pages in the P2M without reference.
 ---
 v5 change:
 - no change
 ---
 v4 changes:
 - no change
 ---
 v3 change:
 - printk rather than dprintk since it is a serious error
 ---
 v2 change:
 - new commit
 ---
  xen/arch/arm/domain_build.c | 60 +
  1 file changed, 60 insertions(+)

 diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
 index a7e95c34a7..e891e800a7 100644
 --- a/xen/arch/arm/domain_build.c
 +++ b/xen/arch/arm/domain_build.c
 @@ -761,6 +761,30 @@ static void __init assign_static_memory_11(struct domain *d,
  }

  #ifdef CONFIG_STATIC_SHM
 +static int __init acquire_nr_borrower_domain(struct domain *d,
 +                                             paddr_t pbase, paddr_t psize,
 +                                             unsigned long *nr_borrowers)
 +{
 +    unsigned long bank;

NIT: AFAICT nr_banks is an "unsigned int".

Other than that:

Acked-by: Julien Grall

Cheers,

--
Julien Grall
Re: [PATCH v3 2/2] xen/pci: replace call to is_memory_hole to pci_check_bar
On 01.09.22 12:29, Rahul Singh wrote: Hello Rahul is_memory_hole was implemented for x86 and not for ARM when introduced. Replace is_memory_hole call to pci_check_bar as function should check if device BAR is in defined memory range. Also, add an implementation for ARM which is required for PCI passthrough. On x86, pci_check_bar will call is_memory_hole which will check if BAR is not overlapping with any memory region defined in the memory map. On ARM, pci_check_bar will go through the host bridge ranges and check if the BAR is in the range of defined ranges. Signed-off-by: Rahul Singh Reviewed-by: Oleksandr Tyshchenko Thanks! --- Changes in v3: - fix minor comments --- xen/arch/arm/include/asm/pci.h | 2 ++ xen/arch/arm/pci/pci-host-common.c | 43 ++ xen/arch/x86/include/asm/pci.h | 10 +++ xen/drivers/passthrough/pci.c | 8 +++--- 4 files changed, 59 insertions(+), 4 deletions(-) diff --git a/xen/arch/arm/include/asm/pci.h b/xen/arch/arm/include/asm/pci.h index 80a2431804..8cb46f6b71 100644 --- a/xen/arch/arm/include/asm/pci.h +++ b/xen/arch/arm/include/asm/pci.h @@ -126,6 +126,8 @@ int pci_host_iterate_bridges_and_count(struct domain *d, int pci_host_bridge_mappings(struct domain *d); +bool pci_check_bar(const struct pci_dev *pdev, mfn_t start, mfn_t end); + #else /*!CONFIG_HAS_PCI*/ struct arch_pci_dev { }; diff --git a/xen/arch/arm/pci/pci-host-common.c b/xen/arch/arm/pci/pci-host-common.c index 89ef30028e..0eb121666d 100644 --- a/xen/arch/arm/pci/pci-host-common.c +++ b/xen/arch/arm/pci/pci-host-common.c @@ -24,6 +24,16 @@ #include +/* + * struct to hold pci device bar. + */ +struct pdev_bar +{ +mfn_t start; +mfn_t end; +bool is_valid; +}; + /* * List for all the pci host bridges. 
*/ @@ -363,6 +373,39 @@ int __init pci_host_bridge_mappings(struct domain *d) return 0; } +static int is_bar_valid(const struct dt_device_node *dev, +uint64_t addr, uint64_t len, void *data) +{ +struct pdev_bar *bar_data = data; +unsigned long s = mfn_x(bar_data->start); +unsigned long e = mfn_x(bar_data->end); + +if ( (s <= e) && (s >= PFN_DOWN(addr)) && (e <= PFN_UP(addr + len - 1)) ) +bar_data->is_valid = true; + +return 0; +} + +bool pci_check_bar(const struct pci_dev *pdev, mfn_t start, mfn_t end) +{ +int ret; +const struct dt_device_node *dt_node; +struct pdev_bar bar_data = { +.start = start, +.end = end, +.is_valid = false +}; + +dt_node = pci_find_host_bridge_node(pdev); +if ( !dt_node ) +return false; + +ret = dt_for_each_range(dt_node, &is_bar_valid, &bar_data); +if ( ret < 0 ) +return false; + +return bar_data.is_valid; +} /* * Local variables: * mode: C diff --git a/xen/arch/x86/include/asm/pci.h b/xen/arch/x86/include/asm/pci.h index c8e1a9ecdb..f4a58c8acf 100644 --- a/xen/arch/x86/include/asm/pci.h +++ b/xen/arch/x86/include/asm/pci.h @@ -57,4 +57,14 @@ static always_inline bool is_pci_passthrough_enabled(void) void arch_pci_init_pdev(struct pci_dev *pdev); +static inline bool pci_check_bar(const struct pci_dev *pdev, + mfn_t start, mfn_t end) +{ +/* + * Check if BAR is not overlapping with any memory region defined + * in the memory map. + */ +return is_memory_hole(start, end); +} + #endif /* __X86_PCI_H__ */ diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c index cdaf5c247f..149f68bb6e 100644 --- a/xen/drivers/passthrough/pci.c +++ b/xen/drivers/passthrough/pci.c @@ -304,8 +304,8 @@ static void check_pdev(const struct pci_dev *pdev) if ( rc < 0 ) /* Unable to size, better leave memory decoding disabled. 
*/ return; -if ( size && !is_memory_hole(maddr_to_mfn(addr), - maddr_to_mfn(addr + size - 1)) ) +if ( size && !pci_check_bar(pdev, maddr_to_mfn(addr), +maddr_to_mfn(addr + size - 1)) ) { /* * Return without enabling memory decoding if BAR position is not @@ -331,8 +331,8 @@ static void check_pdev(const struct pci_dev *pdev) if ( rc < 0 ) return; -if ( size && !is_memory_hole(maddr_to_mfn(addr), - maddr_to_mfn(addr + size - 1)) ) +if ( size && !pci_check_bar(pdev, maddr_to_mfn(addr), +maddr_to_mfn(addr + size - 1)) ) { printk(warn, &pdev->sbdf, "ROM ", PFN_DOWN(addr), PFN_DOWN(addr + size - 1)); -- Regards, Oleksandr Tyshchenko
RE: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and heap allocator
On Thu, 1 Sep 2022, Henry Wang wrote: > > -Original Message- > > From: Julien Grall > > Subject: Re: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and > > heap allocator > > > > Hi Henry, > > > > On 24/08/2022 08:31, Henry Wang wrote: > > > This commit firstly adds a global variable `reserved_heap`. > > > This newly introduced global variable is set at the device tree > > > parsing time if the reserved heap ranges are defined in the device > > > tree chosen node. > > > > > > For Arm32, In `setup_mm`, if the reserved heap is enabled, we use > > > the reserved heap region for both domheap and xenheap allocation. > > > > > > For Arm64, In `setup_mm`, if the reserved heap is enabled and used, > > > we make sure that only these reserved heap pages are added to the > > > boot allocator. These reserved heap pages in the boot allocator are > > > added to the heap allocator at `end_boot_allocator()`. > > > > > > If the reserved heap is disabled, we stick to current page allocation > > > strategy at boot time. > > > > > > Also, take the chance to correct a "double not" print in Arm32 > > > `setup_mm()`. > > > > > > Signed-off-by: Henry Wang > > > --- > > > With reserved heap enabled, for Arm64, naming of global variables such > > > as `xenheap_mfn_start` and `xenheap_mfn_end` seems to be ambiguous, > > > wondering if we should rename these variables. > > > --- > > > Changes from RFC to v1: > > > - Rebase on top of latest `setup_mm()` changes. > > > - Added Arm32 logic in `setup_mm()`. 
> > > --- > > > xen/arch/arm/bootfdt.c | 2 + > > > xen/arch/arm/include/asm/setup.h | 2 + > > > xen/arch/arm/setup.c | 79 +--- > > > 3 files changed, 67 insertions(+), 16 deletions(-) > > > > > > diff --git a/xen/arch/arm/bootfdt.c b/xen/arch/arm/bootfdt.c > > > index 33704ca487..ab73b6e212 100644 > > > --- a/xen/arch/arm/bootfdt.c > > > +++ b/xen/arch/arm/bootfdt.c > > > @@ -325,6 +325,8 @@ static int __init process_chosen_node(const void > > *fdt, int node, > > >true); > > > if ( rc ) > > > return rc; > > > + > > > +reserved_heap = true; > > > } > > > > > > printk("Checking for initrd in /chosen\n"); > > > diff --git a/xen/arch/arm/include/asm/setup.h > > b/xen/arch/arm/include/asm/setup.h > > > index e80f3d6201..00536a6d55 100644 > > > --- a/xen/arch/arm/include/asm/setup.h > > > +++ b/xen/arch/arm/include/asm/setup.h > > > @@ -92,6 +92,8 @@ extern struct bootinfo bootinfo; > > > > > > extern domid_t max_init_domid; > > > > > > +extern bool reserved_heap; > > > + > > > void copy_from_paddr(void *dst, paddr_t paddr, unsigned long len); > > > > > > size_t estimate_efi_size(unsigned int mem_nr_banks); > > > diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c > > > index 500307edc0..fe76cf6325 100644 > > > --- a/xen/arch/arm/setup.c > > > +++ b/xen/arch/arm/setup.c > > > @@ -73,6 +73,8 @@ integer_param("xenheap_megabytes", > > opt_xenheap_megabytes); > > > > > > domid_t __read_mostly max_init_domid; > > > > > > +bool __read_mostly reserved_heap; > > > + > > > static __used void init_done(void) > > > { > > > /* Must be done past setting system_state. 
*/ > > > @@ -699,8 +701,10 @@ static void __init populate_boot_allocator(void) > > > #ifdef CONFIG_ARM_32 > > > static void __init setup_mm(void) > > > { > > > -paddr_t ram_start, ram_end, ram_size, e; > > > -unsigned long ram_pages; > > > +paddr_t ram_start, ram_end, ram_size, e, bank_start, bank_end, > > bank_size; > > > +paddr_t reserved_heap_start = ~0, reserved_heap_end = 0, > > > +reserved_heap_size = 0; > > > +unsigned long ram_pages, reserved_heap_pages = 0; > > > unsigned long heap_pages, xenheap_pages, domheap_pages; > > > unsigned int i; > > > const uint32_t ctr = READ_CP32(CTR); > > > @@ -720,9 +724,9 @@ static void __init setup_mm(void) > > > > > > for ( i = 1; i < bootinfo.mem.nr_banks; i++ ) > > > { > > > -paddr_t bank_start = bootinfo.mem.bank[i].start; > > > -paddr_t bank_size = bootinfo.mem.bank[i].size; > > > -paddr_t bank_end = bank_start + bank_size; > > > +bank_start = bootinfo.mem.bank[i].start; > > > +bank_size = bootinfo.mem.bank[i].size; > > > +bank_end = bank_start + bank_size; > > > > > > ram_size = ram_size + bank_size; > > > ram_start = min(ram_start,bank_start); > > > @@ -731,6 +735,25 @@ static void __init setup_mm(void) > > > > > > total_pages = ram_pages = ram_size >> PAGE_SHIFT; > > > > > > +if ( reserved_heap ) > > > +{ > > > +for ( i = 0 ; i < bootinfo.reserved_mem.nr_banks; i++ ) > > > +{ > > > +if ( bootinfo.reserved_mem.bank[i].xen_heap ) > > > +{ > > > +bank_start = bootinfo.reserved_mem.bank[i].start; > > > +bank_size = bootinfo.reserved_mem.bank
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Thu, Sep 01, 2022 at 12:05:01PM +0100, Mel Gorman wrote:
> As pointed out elsewhere, attaching to the tracepoint and recording relevant
> state is an option other than trying to parse a raw ftrace feed. For memory
> leaks, there are already tracepoints for page allocation and free that could
> be used to track allocations that are not freed at a given point in time.

Page allocation tracepoints are not sufficient for what we're trying to do here, and a substantial amount of effort in this patchset has gone into just getting the hooking locations right - our memory allocation interfaces are not trivial.

That's something people should keep in mind when commenting on the size of this patchset, since that's effort that would have to be spent for /any/ complete solution, be it tracepoint based or not.

Additionally, we need to be able to write assertions that verify that our hook locations are correct, that allocations or frees aren't getting double counted or missed - highly necessary given the maze of nested memory allocation interfaces we have (i.e. slab.h), and it's something a tracepoint based implementation would have to account for - otherwise, a tool isn't very useful if you can't trust the numbers it's giving you.

And then you have to correlate the allocate and free events, so that you know which allocate callsite to decrement the amount freed from. How would you plan on doing that with tracepoints?

> There is also the kernel memory leak detector although I never had reason
> to use it (https://www.kernel.org/doc/html/v6.0-rc3/dev-tools/kmemleak.html)
> and it sounds like it would be expensive.

Kmemleak is indeed expensive, and in the past I've had issues with it not catching everything (I've noticed the kmemleak annotations growing, so maybe this is less of an issue than it was).
And this is a more complete solution (though not something that could strictly replace kmemleak): strict memory leaks aren't the only issue, it's also drivers unexpectedly consuming more memory than expected.

I'll bet you a beer that when people have had this awhile, we're going to have a bunch of bugs discovered and fixed along the lines of "oh hey, this driver wasn't supposed to be using this 1 MB of memory, I never noticed that before".

> > > It's also unclear *who* would enable this. It looks like it would mostly
> > > have value during the development stage of an embedded platform to track
> > > kernel memory usage on a per-application basis in an environment where it
> > > may be difficult to setup tracing and tracking. Would it ever be enabled
> > > in production? Would a distribution ever enable this? If it's enabled, any
> > > overhead cannot be disabled/enabled at run or boot time so anyone enabling
> > > this would carry the cost without never necessarily consuming the data.
> >
> > The whole point of this is to be cheap enough to enable in production -
> > especially the latency tracing infrastructure. There's a lot of value to
> > always-on system visibility infrastructure, so that when a live machine
> > starts to do something wonky the data is already there.
>
> Sure, there is value but nothing stops the tracepoints being attached as
> a boot-time service where interested. For latencies, there is already
> bpf examples for tracing individual function latency over time e.g.
> https://github.com/iovisor/bcc/blob/master/tools/funclatency.py although
> I haven't used it recently.

So this is cool, I'll check it out today. Tracing of /function/ latency is definitely something you'd want tracing/kprobes for - that's way more practical than any code tagging-based approach. And if the output is reliable and useful I could definitely see myself using this, thank you.
But for data collection where it makes sense to annotate in the source code where the data collection points are, I see the code-tagging based approach as simpler - it cuts out a whole bunch of indirection.

The diffstat on the code tagging time stats patch is:
 8 files changed, 233 insertions(+), 6 deletions(-)

And that includes hooking wait.h - this is really simple, easy stuff.

The memory allocation tracking patches are more complicated because we've got a ton of memory allocation interfaces and we're aiming for strict correctness there - because that tool needs strict correctness in order to be useful.

> Live parsing of ftrace is possible, albeit expensive.
> https://github.com/gormanm/mmtests/blob/master/monitors/watch-highorder.pl
> tracks counts of high-order allocations and dumps a report on interrupt as
> an example of live parsing ftrace and only recording interesting state. It's
> not tracking state you are interested in but it demonstrates it is possible
> to rely on ftrace alone and monitor from userspace. It's bit-rotted but
> can be fixed with

Yeah, if this is as far as people have gotten with ftrace on memory allocations, then I don't think tracing is credible here, sorry.

> The ease of use is a criticism as there
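The correlation problem raised in this message — knowing which allocation callsite to decrement when memory is freed — can be illustrated with a minimal userspace sketch. All names here (`codetag`, `alloc_hdr`, `my_alloc`) are invented for illustration only; they are not the patchset's actual types or hooking mechanism:

```c
#include <stdlib.h>

/* One static tag per allocation callsite, created by the my_alloc()
 * macro below. */
struct codetag {
    const char *file;
    int line;
    long bytes;              /* net bytes currently allocated at this callsite */
};

/* Header stored in front of every allocation, so a free can find the
 * tag that must be decremented without any event correlation. */
struct alloc_hdr {
    struct codetag *tag;
    size_t size;
};

static void *tagged_alloc(size_t size, struct codetag *tag)
{
    struct alloc_hdr *h = malloc(sizeof(*h) + size);
    if (!h)
        return NULL;
    h->tag = tag;
    h->size = size;
    tag->bytes += size;
    return h + 1;            /* caller sees the memory just past the header */
}

static void tagged_free(void *p)
{
    if (!p)
        return;
    struct alloc_hdr *h = (struct alloc_hdr *)p - 1;
    h->tag->bytes -= h->size;   /* decrement the owning callsite, not the caller's */
    free(h);
}

/* GCC/Clang statement expression: each expansion site gets its own
 * static tag, keyed by file and line. */
#define my_alloc(size)                                              \
    ({                                                              \
        static struct codetag __tag = { __FILE__, __LINE__, 0 };    \
        tagged_alloc((size), &__tag);                               \
    })
```

Because each allocation carries a pointer back to its owning tag, `tagged_free()` needs no correlation step at all; a tracepoint-based tool would instead have to join allocate and free events by address to attribute each free to the right callsite.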
RE: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and heap allocator
Hi Julien, Thanks for your review. > -Original Message- > From: Julien Grall > Subject: Re: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and > heap allocator > > Hi Henry, > > On 24/08/2022 08:31, Henry Wang wrote: > > This commit firstly adds a global variable `reserved_heap`. > > This newly introduced global variable is set at the device tree > > parsing time if the reserved heap ranges are defined in the device > > tree chosen node. > > > > For Arm32, In `setup_mm`, if the reserved heap is enabled, we use > > the reserved heap region for both domheap and xenheap allocation. > > > > For Arm64, In `setup_mm`, if the reserved heap is enabled and used, > > we make sure that only these reserved heap pages are added to the > > boot allocator. These reserved heap pages in the boot allocator are > > added to the heap allocator at `end_boot_allocator()`. > > > > If the reserved heap is disabled, we stick to current page allocation > > strategy at boot time. > > > > Also, take the chance to correct a "double not" print in Arm32 > > `setup_mm()`. > > > > Signed-off-by: Henry Wang > > --- > > With reserved heap enabled, for Arm64, naming of global variables such > > as `xenheap_mfn_start` and `xenheap_mfn_end` seems to be ambiguous, > > wondering if we should rename these variables. > > --- > > Changes from RFC to v1: > > - Rebase on top of latest `setup_mm()` changes. > > - Added Arm32 logic in `setup_mm()`. 
> > --- > > xen/arch/arm/bootfdt.c | 2 + > > xen/arch/arm/include/asm/setup.h | 2 + > > xen/arch/arm/setup.c | 79 +--- > > 3 files changed, 67 insertions(+), 16 deletions(-) > > > > diff --git a/xen/arch/arm/bootfdt.c b/xen/arch/arm/bootfdt.c > > index 33704ca487..ab73b6e212 100644 > > --- a/xen/arch/arm/bootfdt.c > > +++ b/xen/arch/arm/bootfdt.c > > @@ -325,6 +325,8 @@ static int __init process_chosen_node(const void > *fdt, int node, > >true); > > if ( rc ) > > return rc; > > + > > +reserved_heap = true; > > } > > > > printk("Checking for initrd in /chosen\n"); > > diff --git a/xen/arch/arm/include/asm/setup.h > b/xen/arch/arm/include/asm/setup.h > > index e80f3d6201..00536a6d55 100644 > > --- a/xen/arch/arm/include/asm/setup.h > > +++ b/xen/arch/arm/include/asm/setup.h > > @@ -92,6 +92,8 @@ extern struct bootinfo bootinfo; > > > > extern domid_t max_init_domid; > > > > +extern bool reserved_heap; > > + > > void copy_from_paddr(void *dst, paddr_t paddr, unsigned long len); > > > > size_t estimate_efi_size(unsigned int mem_nr_banks); > > diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c > > index 500307edc0..fe76cf6325 100644 > > --- a/xen/arch/arm/setup.c > > +++ b/xen/arch/arm/setup.c > > @@ -73,6 +73,8 @@ integer_param("xenheap_megabytes", > opt_xenheap_megabytes); > > > > domid_t __read_mostly max_init_domid; > > > > +bool __read_mostly reserved_heap; > > + > > static __used void init_done(void) > > { > > /* Must be done past setting system_state. 
*/ > > @@ -699,8 +701,10 @@ static void __init populate_boot_allocator(void) > > #ifdef CONFIG_ARM_32 > > static void __init setup_mm(void) > > { > > -paddr_t ram_start, ram_end, ram_size, e; > > -unsigned long ram_pages; > > +paddr_t ram_start, ram_end, ram_size, e, bank_start, bank_end, > bank_size; > > +paddr_t reserved_heap_start = ~0, reserved_heap_end = 0, > > +reserved_heap_size = 0; > > +unsigned long ram_pages, reserved_heap_pages = 0; > > unsigned long heap_pages, xenheap_pages, domheap_pages; > > unsigned int i; > > const uint32_t ctr = READ_CP32(CTR); > > @@ -720,9 +724,9 @@ static void __init setup_mm(void) > > > > for ( i = 1; i < bootinfo.mem.nr_banks; i++ ) > > { > > -paddr_t bank_start = bootinfo.mem.bank[i].start; > > -paddr_t bank_size = bootinfo.mem.bank[i].size; > > -paddr_t bank_end = bank_start + bank_size; > > +bank_start = bootinfo.mem.bank[i].start; > > +bank_size = bootinfo.mem.bank[i].size; > > +bank_end = bank_start + bank_size; > > > > ram_size = ram_size + bank_size; > > ram_start = min(ram_start,bank_start); > > @@ -731,6 +735,25 @@ static void __init setup_mm(void) > > > > total_pages = ram_pages = ram_size >> PAGE_SHIFT; > > > > +if ( reserved_heap ) > > +{ > > +for ( i = 0 ; i < bootinfo.reserved_mem.nr_banks; i++ ) > > +{ > > +if ( bootinfo.reserved_mem.bank[i].xen_heap ) > > +{ > > +bank_start = bootinfo.reserved_mem.bank[i].start; > > +bank_size = bootinfo.reserved_mem.bank[i].size; > > +bank_end = bank_start + bank_size; > > + > > +reserved_heap_size += bank_size; > > +reserved_heap_start = min(reserved_heap_start, bank_start); > > +reserved_heap_end =
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Thu, Sep 01, 2022 at 05:07:06PM +0200, David Hildenbrand wrote:
> Skimming over the patches (that I was CCed on) and skimming over the
> cover letter, I got the impression that everything after patch 7 is
> introducing something new instead of refactoring something out.

You skimmed over the dynamic debug patch then...
RE: Push 4.17 Feature freeze and Code freeze one week later
Hi Bertrand,

> -----Original Message-----
> From: Bertrand Marquis
> Subject: Push 4.17 Feature freeze and Code freeze one week later
>
> Hi,
>
> Seeing that we have lots of series quite advanced in review but not already
> merged, could we push both Feature freeze and Code freeze deadlines one week
> later to have a chance to finish those?

I am good with that. The updated schedule would now be:

- Feature freeze:   Fri Sep 9, 2022
- Code freeze:      Fri Sep 30, 2022 (+3 weeks from Feature freeze)
- Hard code freeze: Fri Oct 14, 2022 (+2 weeks from Code freeze)
- Final commits:    Fri Oct 28, 2022 (+2 weeks from Hard code freeze)
- Release:          Wed Nov 2, 2022

The planned RCs would then be (one per week once the code freeze starts):

- RC1: Fri Oct 7, 2022
- RC2: Fri Oct 14, 2022
- RC3: Fri Oct 21, 2022
- RC4 (Final): Fri Oct 28, 2022

Please raise concerns and questions about the above schedule and we can have a discussion.

Kind regards,
Henry

> Cheers,
> Bertrand
Re: [PATCH v6 1/9] xen/arm: introduce static shared memory
Hi, > On 1 Sep 2022, at 16:40, Julien Grall wrote: > > Hi Penny, > > On 29/08/2022 07:57, Penny Zheng wrote: >>> -Original Message- >>> From: Julien Grall >>> Sent: Friday, August 26, 2022 9:17 PM >>> To: Penny Zheng ; xen-devel@lists.xenproject.org >>> Cc: Stefano Stabellini ; Bertrand Marquis >>> ; Volodymyr Babchuk >>> >>> Subject: Re: [PATCH v6 1/9] xen/arm: introduce static shared memory >>> >>> Hi Penny, >>> >> Hi Julien >> >>> On 21/07/2022 14:21, Penny Zheng wrote: From: Penny Zheng This patch series introduces a new feature: setting up static shared memory on a dom0less system, through device tree configuration. This commit parses shared memory node at boot-time, and reserve it in bootinfo.reserved_mem to avoid other use. This commits proposes a new Kconfig CONFIG_STATIC_SHM to wrap static-shm-related codes, and this option depends on static memory( CONFIG_STATIC_MEMORY). That's because that later we want to reuse a few helpers, guarded with CONFIG_STATIC_MEMORY, like acquire_staticmem_pages, etc, on static shared memory. Signed-off-by: Penny Zheng --- v6 change: - when host physical address is ommited, output the error message since xen doesn't support it at the moment - add the following check: 1) The shm ID matches and the region exactly match 2) The shm ID doesn't match and the region doesn't overlap - change it to "unsigned int" to be aligned with nr_banks - check the len of the property to confirm is it big enough to contain "paddr", "size", and "gaddr" - shm_id defined before nr_shm_domain, so we could re-use the existing hole and avoid increasing the size of the structure. 
- change "nr_shm_domain" to "nr_shm_borrowers", to not increment if the role is owner in parsing code - make "xen,shm_id" property as arbitrary string, with a strict limit on the number of characters, MAX_SHM_ID_LENGTH --- v5 change: - no change --- v4 change: - nit fix on doc --- v3 change: - make nr_shm_domain unsigned int --- v2 change: - document refinement - remove bitmap and use the iteration to check - add a new field nr_shm_domain to keep the number of shared domain --- docs/misc/arm/device-tree/booting.txt | 124 xen/arch/arm/Kconfig | 6 + xen/arch/arm/bootfdt.c| 157 ++ xen/arch/arm/include/asm/setup.h | 9 ++ 4 files changed, 296 insertions(+) diff --git a/docs/misc/arm/device-tree/booting.txt b/docs/misc/arm/device-tree/booting.txt index 98253414b8..8013fb98fe 100644 --- a/docs/misc/arm/device-tree/booting.txt +++ b/docs/misc/arm/device-tree/booting.txt @@ -378,3 +378,127 @@ device-tree: This will reserve a 512MB region starting at the host physical address 0x3000 to be exclusively used by DomU1. + +Static Shared Memory + + +The static shared memory device tree nodes allow users to statically +set up shared memory on dom0less system, enabling domains to do +shm-based communication. + +- compatible + +"xen,domain-shared-memory-v1" + +- xen,shm-id + +An arbitrary string that represents the unique identifier of the shared +memory region, with a strict limit on the number of characters(\0 >>> included), +`MAX_SHM_ID_LENGTH(16)`. e.g. "xen,shm-id = "my-shared-mem-1"". + +- xen,shared-mem + +An array takes a physical address, which is the base address of the +shared memory region in host physical address space, a size, and a >>> guest +physical address, as the target address of the mapping. +e.g. xen,shared-mem = < [host physical address] [size] [guest + address] > >>> >>> Your implementation below is checking for overlap and also have some >>> restriction. Can they be documented in the binding? 
>>> + +The number of cells for the host address (and size) is the same as the +guest pseudo-physical address and they are inherited from the parent >>> node. >>> >>> In v5, we discussed to have the host address optional. However, the binding >>> has not been updated to reflect that. Note that I am not asking to >>> implement, >>> but instead request that the binding can be used for such setup. >>> >> How about: >> " >> Host physical address could be omitted by users, and let Xen decide where it >> locates. >> " > > I am fine with that. > >> Do you think I shall further point out that right now, this part feature is >> not implemented >> in codes? > > I have made a couple of suggestion for the code. But I think the binding > would look a bit odd without the host physical address. We woul
Re: [RFC PATCH 03/30] Lazy percpu counters
On Thu, Sep 01, 2022 at 10:48:39AM -0400, Steven Rostedt wrote:
> On Thu, 1 Sep 2022 10:32:19 -0400
> Kent Overstreet wrote:
>
> > On Thu, Sep 01, 2022 at 08:51:31AM +0200, Peter Zijlstra wrote:
> > > On Tue, Aug 30, 2022 at 02:48:52PM -0700, Suren Baghdasaryan wrote:
> > > > +static void lazy_percpu_counter_switch_to_pcpu(struct raw_lazy_percpu_counter *c)
> > > > +{
> > > > +	u64 __percpu *pcpu_v = alloc_percpu_gfp(u64, GFP_ATOMIC|__GFP_NOWARN);
> > >
> > > Realize that this is incorrect when used under a raw_spinlock_t.
> >
> > Can you elaborate?
>
> All allocations (including GFP_ATOMIC) grab normal spin_locks. When
> PREEMPT_RT is configured, normal spin_locks turn into a mutex, whereas
> raw_spinlocks do not.
>
> Thus, if this is done within a raw_spinlock with PREEMPT_RT configured, it
> can cause a schedule while holding a spinlock.

Thanks, I think we should be good here but I'll document it anyways.
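As context for the exchange above, the shape of a lazy percpu counter can be sketched in userspace C11: a single shared atomic until the count crosses a threshold, after which it is promoted to per-CPU slots. This is a simplified, hypothetical model (the names, `NCPUS`, and `PROMOTE_THRESHOLD` are invented, and per-slot exclusivity is assumed rather than enforced); the point is that the promotion step allocates memory in the update path — in the kernel patch that allocation is `alloc_percpu_gfp(GFP_ATOMIC)`, which is exactly what must not run under a `raw_spinlock_t` on PREEMPT_RT:

```c
#include <stdatomic.h>
#include <stdlib.h>

#define NCPUS 4
#define PROMOTE_THRESHOLD 64   /* arbitrary value for the sketch */

struct lazy_counter {
    _Atomic long fast;          /* single shared counter until promotion */
    _Atomic(long *) pcpu;       /* per-CPU slots once the counter is hot */
};

/* Add to the counter from a given "cpu".  Early updates share one
 * atomic; once the count crosses the threshold, per-CPU slots are
 * allocated.  This allocation inside the update path is why the
 * calling context (raw_spinlock_t on PREEMPT_RT) matters. */
static void lazy_add(struct lazy_counter *c, int cpu, long n)
{
    long *p = atomic_load(&c->pcpu);
    if (p) {
        p[cpu] += n;            /* sketch: assumes exclusive use of the slot */
        return;
    }
    long v = atomic_fetch_add(&c->fast, n) + n;
    if (v >= PROMOTE_THRESHOLD) {
        long *np = calloc(NCPUS, sizeof(long));
        long *expected = NULL;
        if (np && !atomic_compare_exchange_strong(&c->pcpu, &expected, np))
            free(np);           /* another thread won the promotion race */
    }
}

/* Read: pre-promotion contributions plus every per-CPU slot. */
static long lazy_read(struct lazy_counter *c)
{
    long sum = atomic_load(&c->fast);
    long *p = atomic_load(&c->pcpu);
    if (p)
        for (int i = 0; i < NCPUS; i++)
            sum += p[i];
    return sum;
}
```

The design keeps cold counters to one word of memory and one atomic add, paying the per-CPU allocation only for counters that turn out to be hot.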
Re: [PATCH v6 1/9] xen/arm: introduce static shared memory
Hi Penny, On 29/08/2022 07:57, Penny Zheng wrote: -Original Message- From: Julien Grall Sent: Friday, August 26, 2022 9:17 PM To: Penny Zheng ; xen-devel@lists.xenproject.org Cc: Stefano Stabellini ; Bertrand Marquis ; Volodymyr Babchuk Subject: Re: [PATCH v6 1/9] xen/arm: introduce static shared memory Hi Penny, Hi Julien On 21/07/2022 14:21, Penny Zheng wrote: From: Penny Zheng This patch series introduces a new feature: setting up static shared memory on a dom0less system, through device tree configuration. This commit parses shared memory node at boot-time, and reserve it in bootinfo.reserved_mem to avoid other use. This commits proposes a new Kconfig CONFIG_STATIC_SHM to wrap static-shm-related codes, and this option depends on static memory( CONFIG_STATIC_MEMORY). That's because that later we want to reuse a few helpers, guarded with CONFIG_STATIC_MEMORY, like acquire_staticmem_pages, etc, on static shared memory. Signed-off-by: Penny Zheng --- v6 change: - when host physical address is ommited, output the error message since xen doesn't support it at the moment - add the following check: 1) The shm ID matches and the region exactly match 2) The shm ID doesn't match and the region doesn't overlap - change it to "unsigned int" to be aligned with nr_banks - check the len of the property to confirm is it big enough to contain "paddr", "size", and "gaddr" - shm_id defined before nr_shm_domain, so we could re-use the existing hole and avoid increasing the size of the structure. 
- change "nr_shm_domain" to "nr_shm_borrowers", to not increment if the role is owner in parsing code - make "xen,shm_id" property as arbitrary string, with a strict limit on the number of characters, MAX_SHM_ID_LENGTH --- v5 change: - no change --- v4 change: - nit fix on doc --- v3 change: - make nr_shm_domain unsigned int --- v2 change: - document refinement - remove bitmap and use the iteration to check - add a new field nr_shm_domain to keep the number of shared domain --- docs/misc/arm/device-tree/booting.txt | 124 xen/arch/arm/Kconfig | 6 + xen/arch/arm/bootfdt.c| 157 ++ xen/arch/arm/include/asm/setup.h | 9 ++ 4 files changed, 296 insertions(+) diff --git a/docs/misc/arm/device-tree/booting.txt b/docs/misc/arm/device-tree/booting.txt index 98253414b8..8013fb98fe 100644 --- a/docs/misc/arm/device-tree/booting.txt +++ b/docs/misc/arm/device-tree/booting.txt @@ -378,3 +378,127 @@ device-tree: This will reserve a 512MB region starting at the host physical address 0x3000 to be exclusively used by DomU1. + +Static Shared Memory + + +The static shared memory device tree nodes allow users to statically +set up shared memory on dom0less system, enabling domains to do +shm-based communication. + +- compatible + +"xen,domain-shared-memory-v1" + +- xen,shm-id + +An arbitrary string that represents the unique identifier of the shared +memory region, with a strict limit on the number of characters(\0 included), +`MAX_SHM_ID_LENGTH(16)`. e.g. "xen,shm-id = "my-shared-mem-1"". + +- xen,shared-mem + +An array takes a physical address, which is the base address of the +shared memory region in host physical address space, a size, and a guest +physical address, as the target address of the mapping. +e.g. xen,shared-mem = < [host physical address] [size] [guest + address] > Your implementation below is checking for overlap and also have some restriction. Can they be documented in the binding? 
+ +The number of cells for the host address (and size) is the same as the +guest pseudo-physical address and they are inherited from the parent node. In v5, we discussed to have the host address optional. However, the binding has not been updated to reflect that. Note that I am not asking to implement, but instead request that the binding can be used for such setup. How about: " Host physical address could be omitted by users, and let Xen decide where it locates. " I am fine with that. Do you think I shall further point out that right now, this part feature is not implemented in codes? I have made a couple of suggestion for the code. But I think the binding would look a bit odd without the host physical address. We would now have: < [size] [guest address]> I think it would be more natural if we had <[guest address] [size]> And <[guest address] [size] [host physical address]> a/xen/arch/arm/include/asm/setup.h b/xen/arch/arm/include/asm/setup.h index 2bb01ecfa8..39d4e93b8b 100644 --- a/xen/arch/arm/include/asm/setup.h +++ b/xen/arch/arm/include/asm/setup.h @@ -23,10 +23,19 @@ typedef enum { } bootmodule_kind; +#ifdef CONFIG_STATIC_SHM +/* Indicates the maximum number of characters(\0 included) for shm_id +*/ #define MAX_SHM_ID_LENGTH 16 #endif Is the #ifdef really needed? + struct membank { paddr_t start;
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Thu, Sep 1, 2022 at 8:07 AM David Hildenbrand wrote:
>
> On 01.09.22 16:23, Kent Overstreet wrote:
> > On Thu, Sep 01, 2022 at 10:05:03AM +0200, David Hildenbrand wrote:
> >> On 31.08.22 21:01, Kent Overstreet wrote:
> >>> On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote:
> >>>> On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> >>>>> Whatever asking for an explanation as to why equivalent functionality
> >>>>> cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable.
> >>>>
> >>>> Fully agreed and this is especially true for a change this size
> >>>> 77 files changed, 3406 insertions(+), 703 deletions(-)
> >>>
> >>> In the case of memory allocation accounting, you flat cannot do this with
> >>> ftrace - you could maybe do a janky version that isn't fully accurate, much
> >>> slower, more complicated for the developer to understand and debug and more
> >>> complicated for the end user.
> >>>
> >>> But please, I invite anyone who's actually been doing this with ftrace to
> >>> demonstrate otherwise.
> >>>
> >>> Ftrace just isn't the right tool for the job here - we're talking about
> >>> adding per callsite accounting to some of the fastest fast paths in the
> >>> kernel.
> >>>
> >>> And the size of the changes for memory allocation accounting are much more
> >>> reasonable:
> >>> 33 files changed, 623 insertions(+), 99 deletions(-)
> >>>
> >>> The code tagging library should exist anyways, it's been open coded half
> >>> a dozen times in the kernel already.
> >>
> >> Hi Kent,
> >>
> >> independent of the other discussions, if it's open coded already, does
> >> it make sense to factor that already-open-coded part out independently
> >> of the remainder of the full series here?
> >
> > It's discussed in the cover letter, that is exactly how the patch series is
> > structured.
>
> Skimming over the patches (that I was CCed on) and skimming over the
> cover letter, I got the impression that everything after patch 7 is
> introducing something new instead of refactoring something out.

Hi David,

Yes, you are right, the RFC does incorporate lots of parts which can be considered separately. They are sent together to present the overall scope of the proposal, but I do intend to send them separately once we decide if it's worth working on.

Thanks,
Suren.

> >
> >> [I didn't immediately spot if this series also attempts already to
> >> replace that open-coded part]
> >
> > Uh huh.
> >
> > Honestly, some days it feels like lkml is just as bad as slashdot, with
> > people wanting to get in their two cents without actually reading...
>
> ... and of course you had to reply like that. I should just have learned
> from my last upstream experience with you and kept you on my spam list.
>
> Thanks, bye
>
> --
> Thanks,
>
> David / dhildenb
[PATCH] xen/grants: prevent integer overflow in gnttab_dma_alloc_pages()
The change from kcalloc() to kvcalloc() means that args->nr_pages might now be large enough that "args->nr_pages << PAGE_SHIFT" can result in an integer overflow.

Fixes: b3f7931f5c61 ("xen/gntdev: switch from kcalloc() to kvcalloc()")
Signed-off-by: Dan Carpenter
---
 drivers/xen/grant-table.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
index 738029de3c67..e1ec725c2819 100644
--- a/drivers/xen/grant-table.c
+++ b/drivers/xen/grant-table.c
@@ -1047,6 +1047,9 @@ int gnttab_dma_alloc_pages(struct gnttab_dma_alloc_args *args)
 	size_t size;
 	int i, ret;
 
+	if (args->nr_pages < 0 || args->nr_pages > (INT_MAX >> PAGE_SHIFT))
+		return -ENOMEM;
+
 	size = args->nr_pages << PAGE_SHIFT;
 	if (args->coherent)
 		args->vaddr = dma_alloc_coherent(args->dev, size,
-- 
2.35.1
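The class of overflow the patch guards against can be reproduced in isolation. This is a userspace sketch with `PAGE_SHIFT` fixed at 12; `nr_pages_valid()` and `pages_to_bytes()` are invented helpers that mirror the patch's check, not kernel API:

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SHIFT 12   /* 4 KiB pages, as on x86 */

/* Mirror of the patch's bounds check: with a plain `int` page count,
 * `nr_pages << PAGE_SHIFT` is evaluated in int, so any count above
 * INT_MAX >> PAGE_SHIFT overflows before the result is widened to
 * size_t.  Negative counts are rejected as well. */
static bool nr_pages_valid(int nr_pages)
{
    return nr_pages >= 0 && nr_pages <= (INT_MAX >> PAGE_SHIFT);
}

/* Safe byte count: validate first, then widen before shifting. */
static size_t pages_to_bytes(int nr_pages)
{
    return (size_t)nr_pages << PAGE_SHIFT;
}
```

With `PAGE_SHIFT` at 12, any `nr_pages` above `INT_MAX >> 12` would make the `int` shift overflow before assignment to `size`, which is why the patch rejects such values up front rather than widening the operand.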
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Thu, Sep 1, 2022 at 12:18 AM Michal Hocko wrote: > > On Wed 31-08-22 15:01:54, Kent Overstreet wrote: > > On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote: > > > On Wed 31-08-22 11:19:48, Mel Gorman wrote: > > > > Whatever asking for an explanation as to why equivalent functionality > > > > cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable. > > > > > > Fully agreed and this is especially true for a change this size > > > 77 files changed, 3406 insertions(+), 703 deletions(-) > > > > In the case of memory allocation accounting, you flat cannot do this with > > ftrace > > - you could maybe do a janky version that isn't fully accurate, much slower, > > more complicated for the developer to understand and debug and more > > complicated > > for the end user. > > > > But please, I invite anyone who's actually been doing this with ftrace to > > demonstrate otherwise. > > > > Ftrace just isn't the right tool for the job here - we're talking about > > adding > > per callsite accounting to some of the fastest fast paths in the kernel. > > > > And the size of the changes for memory allocation accounting are much more > > reasonable: > > 33 files changed, 623 insertions(+), 99 deletions(-) > > > > The code tagging library should exist anyways, it's been open coded half a > > dozen > > times in the kernel already. > > > > And once we've got that, the time stats code is _also_ far simpler than > > doing it > > with ftrace would be. If anyone here has successfully debugged latency > > issues > > with ftrace, I'd really like to hear it. Again, for debugging latency > > issues you > > want something that can always be on, and that's not cheap with ftrace - and > > never mind the hassle of correlating start and end wait trace events, > > builting > > up histograms, etc. - that's all handled here. > > > > Cheap, simple, easy to use. What more could you want? > > A big ad on a banner. But more seriously. 
> This patchset is _huge_ and touching a lot of different areas. It will
> be not only hard to review but even harder to maintain longterm. So
> it is completely reasonable to ask for potential alternatives with a
> smaller code footprint. I am pretty sure you are aware of that workflow.

The patchset is huge because it introduces a reusable part (the first 6 patches introducing code tagging) and 6 different applications in very different areas of the kernel. We wanted to present all of them in the RFC to show the variety of cases this mechanism can be reused for. If the code tagging is accepted, each application can be posted separately to the appropriate group of people. Hopefully that makes it easier to review. Those first 6 patches are not that big and are quite isolated IMHO:

 include/linux/codetag.h             |  83 ++
 include/linux/lazy-percpu-counter.h |  67
 include/linux/module.h              |   1 +
 kernel/module/internal.h            |   1 -
 kernel/module/main.c                |   4 +
 lib/Kconfig                         |   3 +
 lib/Kconfig.debug                   |   4 +
 lib/Makefile                        |   3 +
 lib/codetag.c                       | 248
 lib/lazy-percpu-counter.c           | 141
 lib/string_helpers.c                |   3 +-
 scripts/kallsyms.c                  |  13 ++

> So I find Peter's question completely appropriate while your response to
> that not so much! Maybe ftrace is not the right tool for the intented
> job. Maybe there are other ways and it would be really great to show
> that those have been evaluated and they are not suitable for a), b) and
> c) reasons.

That's fair. For memory tracking I looked into using kmemleak and page_owner, which can't match the required functionality at an overhead acceptable for production and pre-production testing environments. traces + BPF I haven't evaluated myself but heard from other members of my team who tried using that in production environment with poor results. I'll try to get more specific information on that.

> E.g. Oscar has been working on extending page_ext to track number of
> allocations for specific calltrace[1]. Is this 1:1 replacement? No!
> But it can help in environments where page_ext can be enabled and it is
> completely non-intrusive to the MM code.

Thanks for pointing out this work. I'll need to review and maybe profile it before making any claims.

> If the page_ext overhead is not desirable/acceptable then I am sure
> there are other options. E.g. kprobes/LivePatching framework can hook
> into functions and alter their behavior. So why not use that for data
> collection? Has this been evaluated at all?

I'm not sure how I can hook into, say, alloc_pages() to find out where it was called from without capturing the call stack (which would introduce an overhead at every allocation). Would love to discuss this or other alternatives if they can be done with low enough overhead.

Thanks,
Suren.

> And please
Re: Push 4.17 Feature freeze and Code freeze one week later
On Thu, 1 Sep 2022, Bertrand Marquis wrote:
> Hi,
>
> Seeing that we have lots of series quite advanced in review but not already
> merged, could we push both Feature freeze and Code freeze deadlines one week
> later to have a chance to finish those?

+1

On the ARM side we have a higher-than-usual number of patch series still to commit. This is different from past times, when people rushed many completely new and unreviewed patch series in the last week. These series went through the review process properly, they are in good condition, and are typically at v4 or above. They only miss one last round of review. I think we should give them a chance to finish, so it would be great to have 1 more week.
Re: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and heap allocator
Hi Henry,

On 24/08/2022 08:31, Henry Wang wrote:
This commit firstly adds a global variable `reserved_heap`. This newly introduced global variable is set at device tree parsing time if the reserved heap ranges are defined in the device tree chosen node.

For Arm32, in `setup_mm`, if the reserved heap is enabled, we use the reserved heap region for both domheap and xenheap allocation.

For Arm64, in `setup_mm`, if the reserved heap is enabled and used, we make sure that only these reserved heap pages are added to the boot allocator. These reserved heap pages in the boot allocator are added to the heap allocator at `end_boot_allocator()`.

If the reserved heap is disabled, we stick to the current page allocation strategy at boot time.

Also, take the chance to correct a "double not" print in Arm32 `setup_mm()`.

Signed-off-by: Henry Wang
---
With reserved heap enabled, for Arm64, the naming of global variables such as `xenheap_mfn_start` and `xenheap_mfn_end` seems ambiguous; wondering if we should rename these variables.
---
Changes from RFC to v1:
- Rebase on top of latest `setup_mm()` changes.
- Added Arm32 logic in `setup_mm()`.
--- xen/arch/arm/bootfdt.c | 2 + xen/arch/arm/include/asm/setup.h | 2 + xen/arch/arm/setup.c | 79 +--- 3 files changed, 67 insertions(+), 16 deletions(-) diff --git a/xen/arch/arm/bootfdt.c b/xen/arch/arm/bootfdt.c index 33704ca487..ab73b6e212 100644 --- a/xen/arch/arm/bootfdt.c +++ b/xen/arch/arm/bootfdt.c @@ -325,6 +325,8 @@ static int __init process_chosen_node(const void *fdt, int node, true); if ( rc ) return rc; + +reserved_heap = true; } printk("Checking for initrd in /chosen\n"); diff --git a/xen/arch/arm/include/asm/setup.h b/xen/arch/arm/include/asm/setup.h index e80f3d6201..00536a6d55 100644 --- a/xen/arch/arm/include/asm/setup.h +++ b/xen/arch/arm/include/asm/setup.h @@ -92,6 +92,8 @@ extern struct bootinfo bootinfo; extern domid_t max_init_domid; +extern bool reserved_heap; + void copy_from_paddr(void *dst, paddr_t paddr, unsigned long len); size_t estimate_efi_size(unsigned int mem_nr_banks); diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c index 500307edc0..fe76cf6325 100644 --- a/xen/arch/arm/setup.c +++ b/xen/arch/arm/setup.c @@ -73,6 +73,8 @@ integer_param("xenheap_megabytes", opt_xenheap_megabytes); domid_t __read_mostly max_init_domid; +bool __read_mostly reserved_heap; + static __used void init_done(void) { /* Must be done past setting system_state. 
*/ @@ -699,8 +701,10 @@ static void __init populate_boot_allocator(void) #ifdef CONFIG_ARM_32 static void __init setup_mm(void) { -paddr_t ram_start, ram_end, ram_size, e; -unsigned long ram_pages; +paddr_t ram_start, ram_end, ram_size, e, bank_start, bank_end, bank_size; +paddr_t reserved_heap_start = ~0, reserved_heap_end = 0, +reserved_heap_size = 0; +unsigned long ram_pages, reserved_heap_pages = 0; unsigned long heap_pages, xenheap_pages, domheap_pages; unsigned int i; const uint32_t ctr = READ_CP32(CTR); @@ -720,9 +724,9 @@ static void __init setup_mm(void) for ( i = 1; i < bootinfo.mem.nr_banks; i++ ) { -paddr_t bank_start = bootinfo.mem.bank[i].start; -paddr_t bank_size = bootinfo.mem.bank[i].size; -paddr_t bank_end = bank_start + bank_size; +bank_start = bootinfo.mem.bank[i].start; +bank_size = bootinfo.mem.bank[i].size; +bank_end = bank_start + bank_size; ram_size = ram_size + bank_size; ram_start = min(ram_start,bank_start); @@ -731,6 +735,25 @@ static void __init setup_mm(void) total_pages = ram_pages = ram_size >> PAGE_SHIFT; +if ( reserved_heap ) +{ +for ( i = 0 ; i < bootinfo.reserved_mem.nr_banks; i++ ) +{ +if ( bootinfo.reserved_mem.bank[i].xen_heap ) +{ +bank_start = bootinfo.reserved_mem.bank[i].start; +bank_size = bootinfo.reserved_mem.bank[i].size; +bank_end = bank_start + bank_size; + +reserved_heap_size += bank_size; +reserved_heap_start = min(reserved_heap_start, bank_start); +reserved_heap_end = max(reserved_heap_end, bank_end); +} +} + +reserved_heap_pages = reserved_heap_size >> PAGE_SHIFT; +} + /* * If the user has not requested otherwise via the command line * then locate the xenheap using these constraints: @@ -743,7 +766,8 @@ static void __init setup_mm(void) * We try to allocate the largest xenheap possible within these * constraints. */ -heap_pages = ram_pages; +heap_pages = !reserved_heap ? ram_pages : reserved_heap_pages; + if ( opt_xenheap_megabytes ) xenheap_pages = opt_xenheap_megabytes <<
[ovmf test] 172912: regressions - FAIL
flight 172912 ovmf real [real] http://logs.test-lab.xenproject.org/osstest/logs/172912/ Regressions :-( Tests which did not succeed and are blocking, including tests which could not be run: build-i386-libvirt6 libvirt-buildfail REGR. vs. 172136 build-amd64-libvirt 6 libvirt-buildfail REGR. vs. 172136 version targeted for testing: ovmf 033ba8bb2976629fdb664d7131f44f8b0b8f6777 baseline version: ovmf 444260d45ec2a84e8f8c192b3539a3cd5591d009 Last test of basis 172136 2022-08-04 06:43:42 Z 28 days Failing since172151 2022-08-05 02:40:28 Z 27 days 219 attempts Testing same since 172912 2022-09-01 11:42:11 Z0 days1 attempts People who touched revisions under test: Abdul Lateef Attar Abner Chang Ard Biesheuvel Bob Feng Chasel Chiu Chen, Xiao X Czajkowski, Maciej Dimitrije Pavlov Dun Tan Edward Pickup Foster Nong Gregx Yeh Guo Dong Igor Kulchytskyy James Lu Jiaxin Wu Jose Marinho KasimX Liu Kavya Konstantin Aladyshev Kun Qin Liming Gao Liu, Zhiguang Maciej Czajkowski Michael D Kinney Michael Kubacki Ray Ni Rebecca Cran Sainadh Nagolu Sami Mujawar Shengfengx Xue Wu, Jiaxin Xiao X Chen Yuanhao Xie Zhiguang Liu jobs: build-amd64-xsm pass build-i386-xsm pass build-amd64 pass build-i386 pass build-amd64-libvirt fail build-i386-libvirt fail build-amd64-pvopspass build-i386-pvops pass test-amd64-amd64-xl-qemuu-ovmf-amd64 pass test-amd64-i386-xl-qemuu-ovmf-amd64 pass sg-report-flight on osstest.test-lab.xenproject.org logs: /home/logs/logs images: /home/logs/images Logs, config files, etc. are available at http://logs.test-lab.xenproject.org/osstest/logs Explanation of these reports, and of osstest in general, is at http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master Test harness code can be found at http://xenbits.xen.org/gitweb?p=osstest.git;a=summary Not pushing. (No revision log; it would be 1360 lines long.)
[qemu-mainline test] 172905: regressions - FAIL
flight 172905 qemu-mainline real [real] http://logs.test-lab.xenproject.org/osstest/logs/172905/ Regressions :-( Tests which did not succeed and are blocking, including tests which could not be run: build-amd64-libvirt 6 libvirt-buildfail REGR. vs. 172123 build-i386-libvirt6 libvirt-buildfail REGR. vs. 172123 build-arm64-libvirt 6 libvirt-buildfail REGR. vs. 172123 build-armhf-libvirt 6 libvirt-buildfail REGR. vs. 172123 test-amd64-i386-xl-vhd 22 guest-start.2fail REGR. vs. 172123 Regressions which are regarded as allowable (not blocking): test-armhf-armhf-xl-rtds18 guest-start/debian.repeat fail REGR. vs. 172123 Tests which did not succeed, but are not blocking: test-arm64-arm64-libvirt-raw 1 build-check(1) blocked n/a test-arm64-arm64-libvirt-xsm 1 build-check(1) blocked n/a test-armhf-armhf-libvirt 1 build-check(1) blocked n/a test-armhf-armhf-libvirt-qcow2 1 build-check(1) blocked n/a test-armhf-armhf-libvirt-raw 1 build-check(1) blocked n/a test-amd64-i386-libvirt-xsm 1 build-check(1) blocked n/a test-amd64-i386-libvirt-raw 1 build-check(1) blocked n/a test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a test-amd64-i386-libvirt-pair 1 build-check(1) blocked n/a test-amd64-i386-libvirt 1 build-check(1) blocked n/a test-amd64-amd64-libvirt 1 build-check(1) blocked n/a test-amd64-amd64-libvirt-pair 1 build-check(1) blocked n/a test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a test-amd64-amd64-libvirt-vhd 1 build-check(1) blocked n/a test-amd64-amd64-libvirt-xsm 1 build-check(1) blocked n/a test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 172123 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 172123 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 172123 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 172123 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 172123 test-amd64-i386-xl-pvshim14 guest-start fail never pass 
test-arm64-arm64-xl-seattle 15 migrate-support-checkfail never pass test-arm64-arm64-xl-seattle 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail never pass test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-xsm 15 migrate-support-checkfail never pass test-arm64-arm64-xl-xsm 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-credit2 15 migrate-support-checkfail never pass test-arm64-arm64-xl-credit1 15 migrate-support-checkfail never pass test-arm64-arm64-xl-credit1 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-credit2 16 saverestore-support-checkfail never pass test-arm64-arm64-xl 15 migrate-support-checkfail never pass test-arm64-arm64-xl 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-arndale 15 migrate-support-checkfail never pass test-armhf-armhf-xl-arndale 16 saverestore-support-checkfail never pass test-arm64-arm64-xl-vhd 14 migrate-support-checkfail never pass test-arm64-arm64-xl-vhd 15 saverestore-support-checkfail never pass test-armhf-armhf-xl-credit2 15 migrate-support-checkfail never pass test-armhf-armhf-xl-credit2 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail never pass test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-rtds 15 migrate-support-checkfail never pass test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-cubietruck 15 migrate-support-checkfail never pass test-armhf-armhf-xl-cubietruck 16 saverestore-support-checkfail never pass test-armhf-armhf-xl 15 migrate-support-checkfail never pass test-armhf-armhf-xl 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-credit1 15 migrate-support-checkfail never pass test-armhf-armhf-xl-credit1 16 saverestore-support-checkfail never pass test-armhf-armhf-xl-vhd 14 migrate-support-checkfail never pass 
test-armhf-armhf-xl-vhd 15 saverestore-support-checkfail never pass version targeted for testing: qemuue93ded1bf6c94ab95015b33e188bc8b0b0c32670 baseline version: qemuu2480f3bbd03814b0651a1f74959f5c6631ee5819 Last test
Re: [PATCH v2 0/3] xen-blk{front,back}: Fix the broken semantic and flow of feature-persistent
On 31.08.22 18:58, SeongJae Park wrote:
> Changes from v1
> (https://lore.kernel.org/xen-devel/20220825161511.94922-1...@kernel.org/)
> - Fix the wrong feature_persistent caching position of blkfront
> - Set blkfront's feature_persistent field setting with simple '&&'
>   instead of 'if' (Pratyush Yadav)
>
> This patchset fixes misuse of the 'feature-persistent' advertisement
> semantic (patches 1 and 2), and the wrong timing of the
> 'feature_persistent' value caching, which made persistent grants feature
> always disabled.
>
> SeongJae Park (3):
>   xen-blkback: Advertise feature-persistent as user requested
>   xen-blkfront: Advertise feature-persistent as user requested
>   xen-blkfront: Cache feature_persistent value before advertisement
>
>  drivers/block/xen-blkback/common.h |  3 +++
>  drivers/block/xen-blkback/xenbus.c |  6 --
>  drivers/block/xen-blkfront.c       | 20 
>  3 files changed, 19 insertions(+), 10 deletions(-)

For the whole series:

Reviewed-by: Juergen Gross

Juergen
Re: [PATCH v2 0/3] xen-blk{front,back}: Fix the broken semantic and flow of feature-persistent
On Wed, Aug 31, 2022 at 05:08:17PM +, SeongJae Park wrote:
> On Wed, 31 Aug 2022 16:58:21 + SeongJae Park wrote:
> > Changes from v1
> > (https://lore.kernel.org/xen-devel/20220825161511.94922-1...@kernel.org/)
> > - Fix the wrong feature_persistent caching position of blkfront
> > - Set blkfront's feature_persistent field setting with simple '&&'
> >   instead of 'if' (Pratyush Yadav)
> >
> > This patchset fixes misuse of the 'feature-persistent' advertisement
> > semantic (patches 1 and 2), and the wrong timing of the
> > 'feature_persistent' value caching, which made persistent grants feature
> > always disabled.
>
> Please note that I have a problem in my test setup and therefore was unable
> to fully test this patchset. I am posting this though, as the impact of the
> bug is not trivial (always disabling persistent grants), and to make testing
> of my proposed fix easier for others. Hope to get someone's test results or
> code review of this patchset even before I fix my test setup problem.

I can confirm it fixes the issue, thanks!

Tested-by: Marek Marczykowski-Górecki

> Juergen, I didn't add your 'Reviewed-by:'s to the first two patches of this
> series because I changed some of the description to make it clear which bug
> and commit it is really fixing. Specifically, I wordsmithed the wording and
> changed the 'Fixes:' tag. The code change is almost the same, though.
>
> Thanks,
> SJ
>
> > SeongJae Park (3):
> >   xen-blkback: Advertise feature-persistent as user requested
> >   xen-blkfront: Advertise feature-persistent as user requested
> >   xen-blkfront: Cache feature_persistent value before advertisement
> >
> >  drivers/block/xen-blkback/common.h |  3 +++
> >  drivers/block/xen-blkback/xenbus.c |  6 --
> >  drivers/block/xen-blkfront.c       | 20 
> >  3 files changed, 19 insertions(+), 10 deletions(-)

--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
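For readers following along, the shape of the fix under test — cache the value the user requested *before* advertising it, and only later AND in the backend's answer — is roughly the following. This is a simplified sketch of the idea, not the actual patch:

```
/* Cache what the user asked for before talking to the backend. */
info->feature_persistent = feature_persistent;

/* Advertise the cached request to the backend. */
xenbus_printf(xbt, dev->nodename, "feature-persistent", "%u",
              info->feature_persistent);

/* Once the backend has answered: enable only if both sides agree. */
info->feature_persistent = info->feature_persistent &&
        xenbus_read_unsigned(info->xbdev->otherend,
                             "feature-persistent", 0);
```

The bug being fixed was the reverse ordering: reading the module parameter only after the advertisement had already happened, so the negotiated value never reflected the user's request.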
Push 4.17 Feature freeze and Code freeze one week later
Hi, Seeing that we have lots of series quite advanced in review but not already merged, could we push both Feature freeze and Code freeze deadline one week later to have a chance to finish those ? Cheers Bertrand
Re: [RFC PATCH 00/30] Code tagging framework and applications
On 01.09.22 16:23, Kent Overstreet wrote: > On Thu, Sep 01, 2022 at 10:05:03AM +0200, David Hildenbrand wrote: >> On 31.08.22 21:01, Kent Overstreet wrote: >>> On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote: On Wed 31-08-22 11:19:48, Mel Gorman wrote: > Whatever asking for an explanation as to why equivalent functionality > cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable. Fully agreed and this is especially true for a change this size 77 files changed, 3406 insertions(+), 703 deletions(-) >>> >>> In the case of memory allocation accounting, you flat cannot do this with >>> ftrace >>> - you could maybe do a janky version that isn't fully accurate, much slower, >>> more complicated for the developer to understand and debug and more >>> complicated >>> for the end user. >>> >>> But please, I invite anyone who's actually been doing this with ftrace to >>> demonstrate otherwise. >>> >>> Ftrace just isn't the right tool for the job here - we're talking about >>> adding >>> per callsite accounting to some of the fastest fast paths in the kernel. >>> >>> And the size of the changes for memory allocation accounting are much more >>> reasonable: >>> 33 files changed, 623 insertions(+), 99 deletions(-) >>> >>> The code tagging library should exist anyways, it's been open coded half a >>> dozen >>> times in the kernel already. >> >> Hi Kent, >> >> independent of the other discussions, if it's open coded already, does >> it make sense to factor that already-open-coded part out independently >> of the remainder of the full series here? > > It's discussed in the cover letter, that is exactly how the patch series is > structured. Skimming over the patches (that I was CCed on) and skimming over the cover letter, I got the impression that everything after patch 7 is introducing something new instead of refactoring something out. > >> [I didn't immediately spot if this series also attempts already to >> replace that open-coded part] > > Uh huh. 
> > Honestly, some days it feels like lkml is just as bad as slashdot, with people > wanting to get in their two cents without actually reading... ... and of course you had to reply like that. I should just have learned from my last upstream experience with you and kept you on my spam list. Thanks, bye -- Thanks, David / dhildenb
Re: [RFC PATCH 03/30] Lazy percpu counters
On Thu, 1 Sep 2022 10:32:19 -0400 Kent Overstreet wrote:
> On Thu, Sep 01, 2022 at 08:51:31AM +0200, Peter Zijlstra wrote:
> > On Tue, Aug 30, 2022 at 02:48:52PM -0700, Suren Baghdasaryan wrote:
> > > +static void lazy_percpu_counter_switch_to_pcpu(struct raw_lazy_percpu_counter *c)
> > > +{
> > > +	u64 __percpu *pcpu_v = alloc_percpu_gfp(u64, GFP_ATOMIC|__GFP_NOWARN);
> >
> > Realize that this is incorrect when used under a raw_spinlock_t.
>
> Can you elaborate?

All allocations (including GFP_ATOMIC) grab normal spin_locks. When PREEMPT_RT is configured, normal spin_locks turn into mutexes, whereas raw_spinlocks do not. Thus, if this is done within a raw_spinlock with PREEMPT_RT configured, it can cause a schedule while holding a spinlock.

-- Steve
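In other words, the problematic nesting Steve describes looks like this (illustrative kernel pseudocode, not code from the series):

```
raw_spin_lock_irqsave(&c->lock, flags);  /* stays a true spinlock on PREEMPT_RT */

/*
 * alloc_percpu_gfp() internally takes normal spin_locks; on PREEMPT_RT
 * those become sleeping mutexes, so this can schedule while atomic:
 */
pcpu_v = alloc_percpu_gfp(u64, GFP_ATOMIC | __GFP_NOWARN);

raw_spin_unlock_irqrestore(&c->lock, flags);
```

This is why code that may run under a raw_spinlock_t cannot allocate memory at all on PREEMPT_RT, regardless of the GFP flags used.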
Re: [RFC PATCH 27/30] Code tagging based latency tracking
On Thu, Sep 01, 2022 at 09:11:17AM +0200, Peter Zijlstra wrote:
> On Tue, Aug 30, 2022 at 02:49:16PM -0700, Suren Baghdasaryan wrote:
> > From: Kent Overstreet
> >
> > This adds the ability to easily instrument code for measuring latency.
> > To use, add the following to calls to your code, at the start and end of
> > the event you wish to measure:
> >
> > code_tag_time_stats_start(start_time);
> > code_tag_time_stats_finish(start_time);
> >
> > Statistics will then show up in debugfs under
> > /sys/kernel/debug/time_stats, listed by file and line number.
> >
> > Statistics measured include weighted averages of frequency, duration, max
> > duration, as well as quantiles.
> >
> > This patch also instruments all calls to init_wait and finish_wait,
> > which includes all calls to wait_event. Example debugfs output:
>
> How can't you do this with a simple eBPF script on top of
> trace_sched_stat_* and friends?

I know about those tracepoints, and I've never found them to be usable. I've never successfully used them for debugging latency issues, or known anyone who has. And an eBPF script to do everything this does wouldn't be simple at all.

Honestly, the time stats stuff looks _far_ simpler to me than anything involving tracing - and with tracing you have to correlate the start and end events after the fact.
Re: [PATCH 1/2] docs, xen/arm: Introduce reserved heap memory
Hi Henry, On 24/08/2022 08:31, Henry Wang wrote: This commit introduces the reserved heap memory, which is parts of RAM reserved in the beginning of the boot time for heap. A new boolean field `xen_heap` in `struct membank` is added to store the configuration telling if the memory bank is reserved as heap through `xen,static-mem` property in device tree `chosen` node. Also, this commit introduces the logic to parse the reserved heap configuation in device tree by reusing the device tree entry definition typo: s/configuation/configuration/ of the static memory allocation feature: - Add a boolean parameter `xen_heap` to `device_tree_get_meminfo` to reflect whether the memory bank is reserved as heap. - Use `device_tree_get_meminfo` to parse the reserved heap configuation in `chosen` node of the device tree. - In order to reuse the function `device_tree_get_meminfo`, the return type of `process_chosen_node` is changed from void to int. A documentation section is added, describing the definition of reserved heap memory and the method of enabling the reserved heap memory through device tree at boot time. Signed-off-by: Henry Wang Signed-off-by: Penny Zheng --- The name of the device tree property was chosen because we want to reuse as much as the device tree parsing helpers from the static memory allocation feature, but we would like to hear the upstream reviewers' opinion about if using "xen,static-heap" is better. --- Changes from RFC to v1: - Rename the terminology to reserved heap. --- docs/misc/arm/device-tree/booting.txt | 46 + I have skipped the documentation and looked at the code instead. 
xen/arch/arm/bootfdt.c| 49 +-- xen/arch/arm/domain_build.c | 5 +-- xen/arch/arm/include/asm/setup.h | 1 + 4 files changed, 89 insertions(+), 12 deletions(-) diff --git a/docs/misc/arm/device-tree/booting.txt b/docs/misc/arm/device-tree/booting.txt index 98253414b8..e064f64d9a 100644 --- a/docs/misc/arm/device-tree/booting.txt +++ b/docs/misc/arm/device-tree/booting.txt @@ -378,3 +378,49 @@ device-tree: This will reserve a 512MB region starting at the host physical address 0x3000 to be exclusively used by DomU1. + + +Reserved Heap Memory + + +The reserved heap memory (also known as the statically-configured heap) refers +to parts of RAM reserved in the beginning for heap. The memory is reserved by +configuration in the device tree using physical address ranges. + +The reserved heap memory declared in the device tree defines the memory areas +that will be reserved to be used exclusively as heap. + +- For Arm32, since there can be seperated heaps, the reserved heap will be used +for both domheap and xenheap. +- For Arm64, since domheap and xenheap are the same, the defined reserved heap +areas shall always go to the heap allocator. + +The reserved heap memory is an optional feature and can be enabled by adding a +device tree property in the `chosen` node. Currently, this feature reuses the +static memory allocation device tree configuration. + +The dtb property should look like as follows: + +- property name + +"xen,static-mem" (Should be used in the `chosen` node) + +- cells + +Specify the start address and the length of the reserved heap memory. +The number of cells for the address and the size should be defined +using the properties `#xen,static-mem-address-cells` and +`#xen,static-mem-size-cells` respectively. 
+ +Below is an example on how to specify the reserved heap in device tree: + +/ { +chosen { +#xen,static-mem-address-cells = <0x2>; +#xen,static-mem-size-cells = <0x2>; +xen,static-mem = <0x0 0x3000 0x0 0x4000>; +}; +}; + +RAM at 0x3000 of 1G size will be reserved as heap. + diff --git a/xen/arch/arm/bootfdt.c b/xen/arch/arm/bootfdt.c index ec81a45de9..33704ca487 100644 --- a/xen/arch/arm/bootfdt.c +++ b/xen/arch/arm/bootfdt.c @@ -64,7 +64,8 @@ void __init device_tree_get_reg(const __be32 **cell, u32 address_cells, static int __init device_tree_get_meminfo(const void *fdt, int node, const char *prop_name, u32 address_cells, u32 size_cells, - void *data, bool xen_domain) + void *data, bool xen_domain, + bool xen_heap) It sounds like to me, we want to have an enum rather than multiple boolean. This would also make easier... { const struct fdt_property *prop; unsigned int i, banks; @@ -96,6 +97,7 @@ static int __init device_tree_get_meminfo(const void *fdt, int node, mem->bank[mem->nr_banks].start = start; mem->bank[mem->nr_banks].size = size; mem->bank[mem->nr_banks].xen_domain = xen_domain; +mem->bank[mem->nr_ba
Re: [RFC PATCH 03/30] Lazy percpu counters
On Thu, Sep 01, 2022 at 08:51:31AM +0200, Peter Zijlstra wrote: > On Tue, Aug 30, 2022 at 02:48:52PM -0700, Suren Baghdasaryan wrote: > > +static void lazy_percpu_counter_switch_to_pcpu(struct > > raw_lazy_percpu_counter *c) > > +{ > > + u64 __percpu *pcpu_v = alloc_percpu_gfp(u64, GFP_ATOMIC|__GFP_NOWARN); > > Realize that this is incorrect when used under a raw_spinlock_t. Can you elaborate?
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Thu, Sep 01, 2022 at 09:00:17AM +0200, Peter Zijlstra wrote: > On Wed, Aug 31, 2022 at 11:19:48AM +0100, Mel Gorman wrote: > > > It's also unclear *who* would enable this. It looks like it would mostly > > have value during the development stage of an embedded platform to track > > kernel memory usage on a per-application basis in an environment where it > > may be difficult to setup tracing and tracking. Would it ever be enabled > > in production? > > Afaict this is developer only; it is all unconditional code. > > > Would a distribution ever enable this? > > I would sincerely hope not. Because: > > > If it's enabled, any overhead cannot be disabled/enabled at run or > > boot time so anyone enabling this would carry the cost without never > > necessarily consuming the data. > > this. We could make it a boot parameter, with the alternatives infrastructure - with a bit of refactoring there'd be a single function call to nop out, and then we could also drop the elf sections as well, so that when built in but disabled the overhead would be practically nil.
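The boot-parameter idea Kent sketches would presumably follow the kernel's usual static-key pattern on top of the alternatives infrastructure, something like this (illustrative sketch; not part of the posted series):

```
static DEFINE_STATIC_KEY_FALSE(codetag_enabled);

static int __init codetag_param(char *str)
{
        static_branch_enable(&codetag_enabled);
        return 0;
}
early_param("codetag", codetag_param);

/* In the allocation fast path: a single nop when disabled. */
if (static_branch_unlikely(&codetag_enabled))
        codetag_account(tag, size);
```

With this shape, the disabled-case overhead really is close to nil: the static branch is patched to a nop at boot, and only the ELF-section footprint of the tags themselves remains.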
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Thu, Sep 01, 2022 at 10:05:03AM +0200, David Hildenbrand wrote: > On 31.08.22 21:01, Kent Overstreet wrote: > > On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote: > >> On Wed 31-08-22 11:19:48, Mel Gorman wrote: > >>> Whatever asking for an explanation as to why equivalent functionality > >>> cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable. > >> > >> Fully agreed and this is especially true for a change this size > >> 77 files changed, 3406 insertions(+), 703 deletions(-) > > > > In the case of memory allocation accounting, you flat cannot do this with > > ftrace > > - you could maybe do a janky version that isn't fully accurate, much slower, > > more complicated for the developer to understand and debug and more > > complicated > > for the end user. > > > > But please, I invite anyone who's actually been doing this with ftrace to > > demonstrate otherwise. > > > > Ftrace just isn't the right tool for the job here - we're talking about > > adding > > per callsite accounting to some of the fastest fast paths in the kernel. > > > > And the size of the changes for memory allocation accounting are much more > > reasonable: > > 33 files changed, 623 insertions(+), 99 deletions(-) > > > > The code tagging library should exist anyways, it's been open coded half a > > dozen > > times in the kernel already. > > Hi Kent, > > independent of the other discussions, if it's open coded already, does > it make sense to factor that already-open-coded part out independently > of the remainder of the full series here? It's discussed in the cover letter, that is exactly how the patch series is structured. > [I didn't immediately spot if this series also attempts already to > replace that open-coded part] Uh huh. Honestly, some days it feels like lkml is just as bad as slashdot, with people wanting to get in their two cents without actually reading...
Re: [PATCH] xen/arm: Add xen/arch/arm/efi/stub.c in .gitignore
Hi Julien,

This one passed through my filtering, sorry for that.

> On 12 Aug 2022, at 20:19, Julien Grall wrote:
>
> From: Julien Grall
>
> The Xen build system creates the symbolic link xen/arch/arm/efi/stub.c,
> so we want to ignore it.
>
> Fixes: 7f96859b0d00 ("xen: reuse x86 EFI stub functions for Arm")
> Signed-off-by: Julien Grall

Reviewed-by: Bertrand Marquis

Cheers
Bertrand

> ---
> .gitignore | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/.gitignore b/.gitignore
> index ed7bd8bdc76c..0d53eb304993 100644
> --- a/.gitignore
> +++ b/.gitignore
> @@ -295,6 +295,7 @@ xen/.config
> xen/.config.old
> xen/.xen.elf32
> xen/System.map
> +xen/arch/arm/efi/stub.c
> xen/arch/x86/boot/mkelf32
> xen/arch/x86/boot/cmdline.S
> xen/arch/x86/boot/reloc.S
> --
> 2.37.1
>
Re: [PATCH 01/10] xen/arm: smmuv3: Fix l1 stream table size in the error message
On 01/09/2022 11:27, Rahul Singh wrote:
Hi Julien,

Hi Rahul,

On 24 Aug 2022, at 3:58 pm, Julien Grall wrote:
Hi Rahul,

On 24/08/2022 14:53, Rahul Singh wrote:
Backport Linux commit dc898eb84b25c39ea46f28c48a169bdbd0e2c7e0 iommu/arm-smmu-v3: Fix l1 stream table size in the error message

We have a tag for this (see Origin). If you use it, then…

Ok.

Original commit message:

You don't need to add "original commit message" here and the content is exactly the same.

Ok.

iommu/arm-smmu-v3: Fix l1 stream table size in the error message The actual size of level-1 stream table is l1size. This looks like an oversight on commit d2e88e7c081ef ("iommu/arm-smmu: Fix LOG2SIZE setting for 2-level stream tables") which forgot to update the @size in error message as well. As memory allocation failure is already bad enough, nothing worse would happen. But let's be careful. Signed-off-by: Zenghui Yu

AFAICT, you didn't make any change to this patch. So the "From:" should still be from Zenghui Yu. For an example of how to do a backport, see 9c432b876bf518866d431bda73f2be1250f688eb "x86/mwait-idle: add SPR support".

Ok.

Also, it would be good to clarify whether they are clean backports or required some changes (other than context changes). I assume they are clean backports?

Yes, all patches are clean backports. Here is the commit message, please have a look. Once you confirm, I will modify the other patches and send them for review.

It looks fine to me.

Cheers,

--
Julien Grall
Re: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and heap allocator
On 01/09/2022 15:01, Henry Wang wrote: Hi Julien, -Original Message- From: Julien Grall +paddr_t reserved_heap_start = ~0, reserved_heap_end = 0, INVALID_PADDR or ~0ULL I would strongly prefer the former. It is more difficult to understand what's the value means in the latter. Thanks for the input. You mean the INVALID_PADDR correct? That's correct. Cheers, -- Julien Grall
RE: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and heap allocator
Hi Julien, > -Original Message- > From: Julien Grall > >> +paddr_t reserved_heap_start = ~0, reserved_heap_end = 0, > > > > INVALID_PADDR or ~0ULL > > I would strongly prefer the former. It is more difficult to understand > what's the value means in the latter. Thanks for the input. You mean the INVALID_PADDR correct? Yeah this is the one that I used in my local v2, will send it tomorrow after doing the bootinfo change. Kind regards, Henry > > Cheers, > > -- > Julien Grall
Re: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and heap allocator
On 30/08/2022 02:04, Stefano Stabellini wrote: size_t estimate_efi_size(unsigned int mem_nr_banks); diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c index 500307edc0..fe76cf6325 100644 --- a/xen/arch/arm/setup.c +++ b/xen/arch/arm/setup.c @@ -73,6 +73,8 @@ integer_param("xenheap_megabytes", opt_xenheap_megabytes); domid_t __read_mostly max_init_domid; +bool __read_mostly reserved_heap; + static __used void init_done(void) { /* Must be done past setting system_state. */ @@ -699,8 +701,10 @@ static void __init populate_boot_allocator(void) #ifdef CONFIG_ARM_32 static void __init setup_mm(void) { -paddr_t ram_start, ram_end, ram_size, e; -unsigned long ram_pages; +paddr_t ram_start, ram_end, ram_size, e, bank_start, bank_end, bank_size; +paddr_t reserved_heap_start = ~0, reserved_heap_end = 0, INVALID_PADDR or ~0ULL I would strongly prefer the former. It is more difficult to understand what's the value means in the latter. Cheers, -- Julien Grall
Re: [PATCH v3 3/7] xen/evtchn: restrict the maximum number of evtchn supported for domUs
Hi Rahul,

On 01/09/2022 11:13, Rahul Singh wrote:
> Restrict the maximum number of evtchn supported for domUs to avoid
> allocating a large amount of memory in Xen.
>
> Set the default value of max_evtchn_port to 1023. The value of 1023
> should be sufficient for domU guests because on ARM we don't bind
> physical interrupts to event channels. The only use of the evtchn port
> is inter-domain communications.

Following the previous discussion, I think the only missing piece is an explanation that 1023 was chosen to follow the default behavior of libxl. Apart from that:

Reviewed-by: Michal Orzel

~Michal
Re: [PATCH v11 6/6] xen: retrieve reserved pages on populate_physmap
Hi Penny,

On 31/08/2022 03:40, Penny Zheng wrote:
> +/*
> + * Acquire a page from reserved page list (resv_page_list), when populating
> + * memory for a static domain at runtime.
> + */
> +mfn_t acquire_reserved_page(struct domain *d, unsigned int memflags)
> +{
> +    struct page_info *page;
> +
> +    ASSERT_ALLOC_CONTEXT();
> +
> +    /* Acquire a page from reserved page list (resv_page_list). */
> +    spin_lock(&d->page_alloc_lock);
> +    page = page_list_remove_head(&d->resv_page_list);
> +    spin_unlock(&d->page_alloc_lock);
> +    if ( unlikely(!page) )
> +        return INVALID_MFN;
> +
> +    if ( !prepare_staticmem_pages(page, 1, memflags) )
> +        goto fail;
> +
> +    if ( assign_domstatic_pages(d, page, 1, memflags) )
> +        goto fail_assign;
> +
> +    return page_to_mfn(page);
> +
> + fail_assign:
> +    unprepare_staticmem_pages(page, 1, false);

Looking at assign_domstatic_pages(), it will already call unprepare_staticmem_pages() in one of its error paths. It doesn't look like the latter can be called twice on a page.

To be honest, I find it a bit odd that assign_domstatic_pages() is calling unprepare_staticmem_pages(), because the former doesn't call the "prepare" function. AFAICT, this is an issue introduced in this patch. So I would remove the call from assign_domstatic_pages() and then let the caller call unprepare_staticmem_pages() (this would need to be added in acquire_domstatic_pages()).

Also, I think it would be good to explain why we don't need to scrub. Something like: "The page was never accessible by the domain, so scrubbing can be skipped."

Cheers,

--
Julien Grall
RE: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and heap allocator
Hi Bertrand,

> -----Original Message-----
> From: Bertrand Marquis
>
>> On 1 Sep 2022, at 02:03, Henry Wang wrote:
>>
>> Hi Arm maintainers,
>>
>>> -----Original Message-----
>>> Hi Henry,
>>>
>>> On 30/08/2022 08:11, Henry Wang wrote:
>>>>
>>>> Hi Michal,
>>>>
>>>> Sorry about the late reply - I had a couple of days off. Thank you very
>>>> much for the review! I will add my reply and answer some of your
>>>> questions below.
>>>>
>>>>> -----Original Message-----
>>>>> From: Michal Orzel
>>>>> Subject: Re: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot
>>>>> and heap allocator
>>>>>
>>>>>> This commit firstly adds a global variable `reserved_heap`.
>>>>>> This newly introduced global variable is set at the device tree
>>>>>> parsing time if the reserved heap ranges are defined in the device
>>>>>> tree chosen node.
>>>>>
>>>>> Did you consider putting reserved_heap into bootinfo structure?
>>>>
>>>> Actually I did, but I saw current bootinfo only contains some structs so
>>>> I was not sure if this is the preferred way, but since you are raising this
>>>> question, I will follow this method in v2.
>>>
>>> This is what I think would be better but maintainers will have a decisive
>>> vote.
>>
>> I think this is the only uncertain comment that I received during the latest
>> review of this series. I agree that Michal is making a good point (Thanks!)
>> but we are curious about what maintainers think. Could you please kindly
>> share your opinion on the more preferred approach? I will do that in v2.
>> Thanks very much!
>
> I think it does make sense to put this in bootinfo.

I am good with that; then I think I will move this to bootinfo in v2 unless there are other objections. Thank you for the input.

Kind regards,
Henry

> Cheers
> Bertrand
Re: [PATCH 0/7] Fix MISRA C 2012 Rule 20.7 violations
Hi Xenia,

> On 1 Sep 2022, at 10:27, Xenia Ragiadakou wrote:
>
> On 9/1/22 01:35, Stefano Stabellini wrote:
>> Patches 1, 4, and 6 are already committed. I plan to commit patches 2, 3
>> and 5 in the next couple of days.
>> Patch 7 needs further discussions and it is best addressed during the
>> next MISRA C sync-up.
>
> I would like to share here, before the next MISRA C sync, my understanding,
> which will hopefully resolve a wrong impression of mine, that I may have
> spread around, regarding this rule.
>
> There was a misunderstanding of rule 20.7 on my part, and I think that Jan
> is absolutely right that parenthesizing macro parameters used as function
> arguments is not required by the rule.
>
> Rule 20.7 states "Expressions resulting from the expansion of macro
> parameters shall be enclosed in parentheses", and the rationale of the rule
> states "If a macro parameter is not being used as an expression then the
> parentheses are not necessary because no operators are involved."
>
> Initially, based on the title, my understanding was that it requires the
> expression resulting from the expansion of the macro to be enclosed in
> parentheses. Then, based on the rule explanation and the examples given, my
> understanding was that it requires the macro parameters that are used as
> expressions to be enclosed in parentheses.
>
> But, after re-thinking about it, what is most probable and makes more
> sense is that it requires parentheses around the macro parameters that are
> part of an expression, and not around those that are used as expressions.
>
> Therefore, macro parameters being used as function arguments are not
> required to be enclosed in parentheses, because the function arguments are
> part of an expression list, not of an expression (the comma is evaluated
> as a separator, not as an operator).
> While macro parameters used as rhs and lhs expressions of the assignment
> operator are required to be enclosed in parentheses, because they are part
> of an assignment expression.
>
> I verified that the violation reported by cppcheck is not due to missing
> parentheses around the function argument (though I still have not
> understood the origin of the warning). Also, Eclair does not report it.
>
> Hence, it was a misunderstanding of mine and there is no inconsistency,
> with respect to this rule, in adding parentheses around macro parameters
> used as rhs of assignments. The rule does not require adding parentheses
> around macro parameters used as function arguments, and neither cppcheck
> nor Eclair report a violation for missing parentheses around macro
> parameters used as function arguments.

Thanks a lot for the detailed explanation :-)

What you say does make sense and I agree with your analysis here: only protect a macro parameter when it is part of an expression, not when it is used as a subsequent parameter (of a function or of another macro).

Regards
Bertrand

>
>> On Fri, 19 Aug 2022, Xenia Ragiadakou wrote:
>>> Xenia Ragiadakou (7):
>>>   xen/arm: gic_v3_its: Fix MISRA C 2012 Rule 20.7 violations
>>>   xsm/flask: sidtab: Fix MISRA C 2012 Rule 20.7 violations
>>>   xen/elf: Fix MISRA C 2012 Rule 20.7 violations
>>>   xen/vgic: Fix MISRA C 2012 Rule 20.7 violation
>>>   xen/rbtree: Fix MISRA C 2012 Rule 20.7 violation
>>>   xen/arm: processor: Fix MISRA C 2012 Rule 20.7 violations
>>>   xen/device_tree: Fix MISRA C 2012 Rule 20.7 violations
>>>
>>>  xen/arch/arm/include/asm/gic_v3_its.h | 10 +-
>>>  xen/arch/arm/include/asm/new_vgic.h   |  2 +-
>>>  xen/arch/arm/include/asm/processor.h  |  4 ++--
>>>  xen/include/xen/device_tree.h         |  6 +++---
>>>  xen/include/xen/elfstructs.h          |  4 ++--
>>>  xen/lib/rbtree.c                      |  2 +-
>>>  xen/xsm/flask/ss/sidtab.c             |  8
>>>  7 files changed, 18 insertions(+), 18 deletions(-)
>>>
>>> --
>>> 2.34.1
>
> --
> Xenia
Re: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot and heap allocator
Hi,

> On 1 Sep 2022, at 02:03, Henry Wang wrote:
>
> Hi Arm maintainers,
>
>> -----Original Message-----
>> Hi Henry,
>>
>> On 30/08/2022 08:11, Henry Wang wrote:
>>>
>>> Hi Michal,
>>>
>>> Sorry about the late reply - I had a couple of days off. Thank you very
>>> much for the review! I will add my reply and answer some of your
>>> questions below.
>>>
>>>> -----Original Message-----
>>>> From: Michal Orzel
>>>> Subject: Re: [PATCH 2/2] xen/arm: Handle reserved heap pages in boot
>>>> and heap allocator
>>>>
>>>>> This commit firstly adds a global variable `reserved_heap`.
>>>>> This newly introduced global variable is set at the device tree
>>>>> parsing time if the reserved heap ranges are defined in the device
>>>>> tree chosen node.
>>>>
>>>> Did you consider putting reserved_heap into bootinfo structure?
>>>
>>> Actually I did, but I saw current bootinfo only contains some structs so
>>> I was not sure if this is the preferred way, but since you are raising this
>>> question, I will follow this method in v2.
>>
>> This is what I think would be better but maintainers will have a decisive
>> vote.
>
> I think this is the only uncertain comment that I received during the latest
> review of this series. I agree that Michal is making a good point (Thanks!)
> but we are curious about what maintainers think. Could you please kindly
> share your opinion on the more preferred approach? I will do that in v2.
> Thanks very much!

I think it does make sense to put this in bootinfo.

Cheers
Bertrand
[linux-5.4 test] 172904: regressions - FAIL
flight 172904 linux-5.4 real [real]
http://logs.test-lab.xenproject.org/osstest/logs/172904/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-i386-libvirt            6 libvirt-build             fail REGR. vs. 172128
 build-amd64-libvirt           6 libvirt-build             fail REGR. vs. 172128
 build-arm64-libvirt           6 libvirt-build             fail REGR. vs. 172128
 build-armhf-libvirt           6 libvirt-build             fail REGR. vs. 172128

Tests which are failing intermittently (not blocking):
 test-armhf-armhf-xl-rtds     14 guest-start          fail in 172894 pass in 172904
 test-amd64-amd64-qemuu-freebsd12-amd64 18 guest-saverestore.2 fail pass in 172894

Tests which did not succeed, but are not blocking:
 test-arm64-arm64-libvirt-raw  1 build-check(1)            blocked  n/a
 test-arm64-arm64-libvirt-xsm  1 build-check(1)            blocked  n/a
 test-amd64-amd64-libvirt      1 build-check(1)            blocked  n/a
 test-amd64-amd64-libvirt-pair 1 build-check(1)            blocked  n/a
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a
 test-amd64-amd64-libvirt-vhd  1 build-check(1)            blocked  n/a
 test-amd64-amd64-libvirt-xsm  1 build-check(1)            blocked  n/a
 test-amd64-i386-libvirt       1 build-check(1)            blocked  n/a
 test-amd64-i386-libvirt-pair  1 build-check(1)            blocked  n/a
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 1 build-check(1) blocked n/a
 test-amd64-i386-libvirt-raw   1 build-check(1)            blocked  n/a
 test-amd64-i386-libvirt-xsm   1 build-check(1)            blocked  n/a
 test-armhf-armhf-libvirt      1 build-check(1)            blocked  n/a
 test-armhf-armhf-libvirt-qcow2 1 build-check(1)           blocked  n/a
 test-armhf-armhf-libvirt-raw  1 build-check(1)            blocked  n/a
 test-armhf-armhf-xl-credit1  18 guest-start/debian.repeat fail blocked in 172128
 test-armhf-armhf-xl-credit2  18 guest-start/debian.repeat fail in 172894 like 172108
 test-armhf-armhf-xl-credit1  14 guest-start          fail in 172894 like 172128
 test-armhf-armhf-xl-multivcpu 14 guest-start         fail in 172894 like 172128
 test-armhf-armhf-xl-credit2  15 migrate-support-check fail in 172894 never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-check fail in 172894 never pass
 test-armhf-armhf-xl-multivcpu 18 guest-start/debian.repeat fail like 172108
 test-armhf-armhf-xl-vhd      13 guest-start               fail like 172108
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stop        fail like 172108
 test-armhf-armhf-xl-credit2  14 guest-start               fail like 172128
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop        fail like 172128
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stop        fail like 172128
 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop         fail like 172128
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop         fail like 172128
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop        fail like 172128
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 172128
 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop         fail like 172128
 test-armhf-armhf-xl-rtds     18 guest-start/debian.repeat fail like 172128
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop         fail like 172128
 test-amd64-i386-xl-pvshim    14 guest-start               fail never pass
 test-arm64-arm64-xl-seattle  15 migrate-support-check     fail never pass
 test-arm64-arm64-xl-seattle  16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-check    fail never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl          15 migrate-support-check     fail never pass
 test-arm64-arm64-xl          16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-xsm      15 migrate-support-check     fail never pass
 test-arm64-arm64-xl-xsm      16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-check     fail never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-check     fail never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-check     fail never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-check     fail never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-check     fail never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-check fail never pass
 test-armhf-armhf-x
Re: [PATCH v11 5/6] xen: rename free_staticmem_pages to unprepare_staticmem_pages
Hi Penny,

On 31/08/2022 03:40, Penny Zheng wrote:
> The name of free_staticmem_pages is inappropriate, considering it is the
> opposite of function prepare_staticmem_pages.
>
> Rename free_staticmem_pages to unprepare_staticmem_pages.
>
> Signed-off-by: Penny Zheng
> Acked-by: Jan Beulich

Reviewed-by: Julien Grall

Cheers,

--
Julien Grall
Re: [PATCH] xen-pcifront: Handle missed Connected state
On Wed, Aug 31, 2022 at 10:35 PM Rich Persaud wrote:
>
>> On Aug 29, 2022, at 11:16 AM, Jason Andryuk wrote:
>>
>> An HVM guest with linux stubdom and 2 PCI devices failed to start as
>> libxl timed out waiting for the PCI devices to be added. It happens
>> intermittently but with some regularity. libxl wrote the two xenstore
>> entries for the devices, but then timed out waiting for backend state 4
>> (Connected) - the state stayed at 7 (Reconfiguring). (PCI passthrough
>> to an HVM with stubdomain is PV passthrough to the stubdomain and then
>> HVM passthrough with the QEMU inside the stubdomain.)
>>
>> The stubdom kernel never printed "pcifront pci-0: Installing PCI
>> frontend", so it seems to have missed state 4 which would have
>> called pcifront_try_connect -> pcifront_connect_and_init_dma
>
> Is there a state machine doc/flowchart for LibXL and Xen PCI device
> passthrough to Linux? This would be a valuable addition to Xen's developer
> docs, even as a whiteboard photo in this thread.

I am not aware of one.

-Jason
Re: [PATCH] acpi: Add TPM2 interface definition and make the TPM version configurable.
On Tue, Aug 30, 2022 at 4:30 PM Jennifer Herbert wrote:
>
> This patch introduces an optional TPM 2 interface definition to the ACPI
> table, which is to be used as part of a vTPM 2 implementation.
>
> To enable the new interface, I have made the TPM interface version
> configurable in the acpi_config, with the default being the existing
> 1.2 (TCPA). I have also added to hvmloader an option to utilise this new
> config, which can be triggered by setting the platform/tpm_version
> xenstore key.
>
> Signed-off-by: Jennifer Herbert

Reviewed-by: Jason Andryuk

Thanks. Is there a particular reason why CRB (Command Response Buffer) was chosen over TIS (TPM Interface Specification)? I think of CRB as more of an embedded-device TPM interface, and TIS is what is usually used with physical TPMs. My experiences have only been with TIS devices, so that is influencing my outlook.

Hmm, this patch seems to reference the Intel Platform Trust Technology (PTT) fTPM (firmware TPM) as using the CRB interface:
https://patchwork.kernel.org/project/tpmdd-devel/patch/1417672167-3489-8-git-send-email-jarkko.sakki...@linux.intel.com/

If PTT fTPMs are using CRB, then it's more than just embedded devices.

Regards,
Jason
Re: [PATCH v3 1/7] xen/evtchn: Make sure all buckets below d->valid_evtchns are allocated
Hi Michal,

> On 1 Sep 2022, at 1:47 pm, Michal Orzel wrote:
>
> Hi Rahul,
>
> On 01/09/2022 11:13, Rahul Singh wrote:
>>
>> From: Julien Grall
>>
>> Since commit 01280dc19cf3 "evtchn: simplify port_is_valid()", the event
>> channels code assumes that all the buckets below d->valid_evtchns are
>> always allocated.
>>
>> This assumption holds in most situations because a guest is not
>> allowed to choose the port. Instead, it will be given the first free
>> port from 0.
>>
>> When static event channel support is added for dom0less domains, the
>> user can request to allocate evtchn port numbers that are scattered in
>> nature.
>>
>> The existing implementation of evtchn_allocate_port() is not able to
>> deal with such a situation and will end up overriding a bucket and/or
>> leaving some buckets unallocated. The latter will result in a droplet
>> crash if the event channel belongs to an unallocated bucket.
>>
>> This can be solved by making sure that all the buckets below
>> d->valid_evtchns are allocated. There should be no impact in most
>> situations other than LM/LU, as only one bucket would be allocated. For
>> LM/LU, we may end up allocating multiple buckets if the ports in use
>> are sparse.
>>
>> A potential alternative is to check that the bucket is valid in
>> is_port_valid(). This should still be possible to do without taking the
>> per-domain lock, but would result in a couple more memory accesses.
>>
>> Signed-off-by: Julien Grall
>> Signed-off-by: Rahul Singh
>> ---
>> Changes in v3:
>>  - fix comments in commit msg.
>>  - modify code related to d->valid_evtchns and {read,write}_atomic()
>> Changes in v2:
>>  - new patch in this version to avoid the security issue
>> ---
>>  xen/common/event_channel.c | 55 --
>>  1 file changed, 35 insertions(+), 20 deletions(-)
>>
>> diff --git a/xen/common/event_channel.c b/xen/common/event_channel.c
>> index c2c6f8c151..80b06d9743 100644
>> --- a/xen/common/event_channel.c
>> +++ b/xen/common/event_channel.c
>> @@ -193,6 +193,15 @@ static struct evtchn *alloc_evtchn_bucket(struct domain *d, unsigned int port)
>>      return NULL;
>>  }
>>
>> +/*
>> + * Allocate a given port and ensure all the buckets up to that port
>> + * have been allocated.
>> + *
>> + * The last part is important because the rest of the event channel code
>> + * relies on all the buckets up to d->valid_evtchns to be valid. However,
>> + * event channels may be sparse when restoring a domain during Guest
>> + * Transparent Migration and Live Update.
>
> You got rid of mentioning these non-existing features from the commit msg,
> but you still mention them here.

I missed that. I will fix it in the next version.

Regards,
Rahul
Re: [PATCH v3 1/7] xen/evtchn: Make sure all buckets below d->valid_evtchns are allocated
Hi Rahul,

On 01/09/2022 11:13, Rahul Singh wrote:
>
> From: Julien Grall
>
> Since commit 01280dc19cf3 "evtchn: simplify port_is_valid()", the event
> channels code assumes that all the buckets below d->valid_evtchns are
> always allocated.
>
> This assumption holds in most situations because a guest is not
> allowed to choose the port. Instead, it will be given the first free
> port from 0.
>
> When static event channel support is added for dom0less domains, the
> user can request to allocate evtchn port numbers that are scattered in
> nature.
>
> The existing implementation of evtchn_allocate_port() is not able to
> deal with such a situation and will end up overriding a bucket and/or
> leaving some buckets unallocated. The latter will result in a droplet
> crash if the event channel belongs to an unallocated bucket.
>
> This can be solved by making sure that all the buckets below
> d->valid_evtchns are allocated. There should be no impact in most
> situations other than LM/LU, as only one bucket would be allocated. For
> LM/LU, we may end up allocating multiple buckets if the ports in use
> are sparse.
>
> A potential alternative is to check that the bucket is valid in
> is_port_valid(). This should still be possible to do without taking the
> per-domain lock, but would result in a couple more memory accesses.
>
> Signed-off-by: Julien Grall
> Signed-off-by: Rahul Singh
> ---
> Changes in v3:
>  - fix comments in commit msg.
>  - modify code related to d->valid_evtchns and {read,write}_atomic()
> Changes in v2:
>  - new patch in this version to avoid the security issue
> ---
>  xen/common/event_channel.c | 55 --
>  1 file changed, 35 insertions(+), 20 deletions(-)
>
> diff --git a/xen/common/event_channel.c b/xen/common/event_channel.c
> index c2c6f8c151..80b06d9743 100644
> --- a/xen/common/event_channel.c
> +++ b/xen/common/event_channel.c
> @@ -193,6 +193,15 @@ static struct evtchn *alloc_evtchn_bucket(struct domain *d, unsigned int port)
>      return NULL;
>  }
>
> +/*
> + * Allocate a given port and ensure all the buckets up to that port
> + * have been allocated.
> + *
> + * The last part is important because the rest of the event channel code
> + * relies on all the buckets up to d->valid_evtchns to be valid. However,
> + * event channels may be sparse when restoring a domain during Guest
> + * Transparent Migration and Live Update.

You got rid of mentioning these non-existing features from the commit msg, but you still mention them here.

Apart from that, all of Jan's comments were addressed, so:
Reviewed-by: Michal Orzel

~Michal