Re: [PATCH] Doc: admin-guide: Add entry for kvm_cma_resv_ratio kernel param
Hi Randy, Thanks for the comments, will send a V2 fixing them. On Tue, Sep 15, 2020 at 11:18:52PM -0700, Randy Dunlap wrote: > On 9/15/20 11:11 PM, sathn...@linux.vnet.ibm.com wrote: > > From: Satheesh Rajendran > > > > Add document entry for kvm_cma_resv_ratio kernel param which > > is used to alter the KVM contiguous memory allocation percentage > > for hash pagetable allocation used by hash mode PowerPC KVM guests. > > > > Cc: linux-kernel@vger.kernel.org > > Cc: kvm-...@vger.kernel.org > > Cc: linuxppc-...@lists.ozlabs.org > > Cc: Paul Mackerras > > Cc: Michael Ellerman > > Cc: Jonathan Corbet > > Signed-off-by: Satheesh Rajendran > > --- > > Documentation/admin-guide/kernel-parameters.txt | 9 + > > 1 file changed, 9 insertions(+) > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt > > b/Documentation/admin-guide/kernel-parameters.txt > > index a1068742a6df..9cb126573c71 100644 > > --- a/Documentation/admin-guide/kernel-parameters.txt > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > @@ -599,6 +599,15 @@ > > altogether. For more information, see > > include/linux/dma-contiguous.h > > > > +kvm_cma_resv_ratio=n > > +[PPC] > > You can put [PPC] on the line above. > sure > > +Reserves given percentage from system memory area > > for > > +contiguous memory allocation for KVM hash pagetable > > +allocation. > > +Bydefault it reserves 5% of total system memory. > > By default > > > +Format: > > +Default: 5 > > + > > and please use tabs for indentation, not all spaces. > sure > > cmo_free_hint= [PPC] Format: { yes | no } > > Specify whether pages are marked as being inactive > > when they are freed. This is used in CMO environments > > > > Entries in kernel-parameters.txt should be sorted into dictionary order, > so please put that with the other kvm parameters. > sure. > thanks. > -- > ~Randy > Thanks! -Satheesh.
Re: [PATCH v5 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling
On Fri, Sep 11, 2020 at 09:55:23PM +1000, Michael Ellerman wrote: > Srikar Dronamraju writes: > > Current code assumes that cpumask of cpus sharing a l2-cache mask will > > always be a superset of cpu_sibling_mask. > > > > Lets stop that assumption. cpu_l2_cache_mask is a superset of > > cpu_sibling_mask if and only if shared_caches is set. > > I'm seeing oopses with this: > > [0.117392][T1] smp: Bringing up secondary CPUs ... > [0.156515][T1] smp: Brought up 2 nodes, 2 CPUs > [0.158265][T1] numa: Node 0 CPUs: 0 > [0.158520][T1] numa: Node 1 CPUs: 1 > [0.167453][T1] BUG: Unable to handle kernel data access on read at > 0x800041228298 > [0.167992][T1] Faulting instruction address: 0xc018c128 > [0.168817][T1] Oops: Kernel access of bad area, sig: 11 [#1] > [0.168964][T1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries > [0.169417][T1] Modules linked in: > [0.170047][T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted > 5.9.0-rc2-00095-g7430ad5aa700 #209 > [0.170305][T1] NIP: c018c128 LR: c018c0cc CTR: > c004dce0 > [0.170498][T1] REGS: c0007e343880 TRAP: 0380 Not tainted > (5.9.0-rc2-00095-g7430ad5aa700) > [0.170602][T1] MSR: 82009033 CR: > 4400 XER: > [0.170985][T1] CFAR: c018c288 IRQMASK: 0 > [0.170985][T1] GPR00: c0007e343b10 > c173e400 4000 > [0.170985][T1] GPR04: 0800 > 0800 > [0.170985][T1] GPR08: c122c298 > c0003fffc000 c0007fd05ce8 > [0.170985][T1] GPR12: c0007e0119f8 c193 > 8ade > [0.170985][T1] GPR16: c0007e3c0640 0917 > c0007e3c0658 0008 > [0.170985][T1] GPR20: c15d0bb8 8ade > c0f57400 c1817c28 > [0.170985][T1] GPR24: c176dc80 c0007e3c0890 > c0007e3cfe00 > [0.170985][T1] GPR28: c1772310 c0007e011900 > c0007e3c0800 0001 > [0.172750][T1] NIP [c018c128] build_sched_domains+0x808/0x14b0 > [0.172900][T1] LR [c018c0cc] build_sched_domains+0x7ac/0x14b0 > [0.173186][T1] Call Trace: > [0.173484][T1] [c0007e343b10] [c018bfe8] > build_sched_domains+0x6c8/0x14b0 (unreliable) > [0.173821][T1] [c0007e343c50] [c018dcdc] > sched_init_domains+0xec/0x130 > [0.174037][T1] 
[c0007e343ca0] [c10d59d8] > sched_init_smp+0x50/0xc4 > [0.174207][T1] [c0007e343cd0] [c10b45c4] > kernel_init_freeable+0x1b4/0x378 > [0.174378][T1] [c0007e343db0] [c00129fc] > kernel_init+0x24/0x158 > [0.174740][T1] [c0007e343e20] [c000d9d0] > ret_from_kernel_thread+0x5c/0x6c > [0.175050][T1] Instruction dump: > [0.175626][T1] 554905ee 71480040 7d2907b4 4182016c 2c29 3920006e > 913e002c 41820034 > [0.175841][T1] 7c6307b4 e9300020 78631f24 7d58182a <7d2a482a> > f93e0080 7d404828 314a0001 > [0.178340][T1] ---[ end trace 6876b88dd1d4b3bb ]--- > [0.178512][T1] > [1.180458][T1] Kernel panic - not syncing: Attempted to kill init! > exitcode=0x000b > > That's qemu: > > qemu-system-ppc64 -nographic -vga none -M pseries -cpu POWER8 \ > -kernel build~/vmlinux \ > -m 2G,slots=2,maxmem=4G \ > -object memory-backend-ram,size=1G,id=m0 \ > -object memory-backend-ram,size=1G,id=m1 \ > -numa node,nodeid=0,memdev=m0 \ > -numa node,nodeid=1,memdev=m1 \ > -smp 2,sockets=2,maxcpus=2 \ PowerKVM guest vCPUs does not yet have L2 and L3 cache elements I had this bug raised some time ago, probably related? https://bugs.launchpad.net/qemu/+bug/1774605 Regards, -Satheesh. > > > On mambo I get: > > [0.005069][T1] smp: Bringing up secondary CPUs ... > [0.011656][T1] smp: Brought up 2 nodes, 8 CPUs > [0.011682][T1] numa: Node 0 CPUs: 0-3 > [0.011709][T1] numa: Node 1 CPUs: 4-7 > [0.012015][T1] BUG: arch topology borken > [0.012040][T1] the SMT domain not a subset of the CACHE domain > [0.012107][T1] BUG: Unable to handle kernel data access on read at > 0x8001012e7398 > [0.012142][T1] Faulting instruction address: 0xc01aa4f0 > [0.012174][T1] Oops: Kernel access of bad area, sig: 11 [#1] > [0.012206][T1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV > [0.012236][T1] Modules linked in: > [0.012264][T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted > 5.9.0-rc2-00095-g7430ad5aa700 #1 > [0.012304][T1] NIP: c01aa4f0 LR: c01aa498 CTR: > > [0.012341][T1] REGS: c000ef583880 TRAP: 0380 Not tainted >
Re: [PATCH 0/7] Optimization to improve cpu online/offline on Powerpc
On Mon, Jul 27, 2020 at 01:25:25PM +0530, Srikar Dronamraju wrote: > Anton reported that his 4096 cpu (1024 cores in a socket) was taking too > long to boot. He also analyzed that most of the time was being spent on > updating cpu_core_mask. > > Here are some optimizations and fixes to make ppc64_cpu --smt=8/ppc64_cpu > --smt=1 run faster and hence boot the kernel also faster. > > Its based on top of my v4 coregroup support patchset. > http://lore.kernel.org/lkml/20200727053230.19753-1-sri...@linux.vnet.ibm.com/t/#u > > The first two patches should solve Anton's immediate problem. > On the unofficial patches, Anton reported that the boot time came from 30 > mins to 6 seconds. (Basically a high core count in a single socket > configuration). Satheesh also reported similar numbers. > > The rest are simple cleanups/optimizations. > > Since cpu_core_mask is an exported symbol for a long duration, lets retain > as a snapshot of cpumask_of_node. boot tested on P9 KVM guest. without this series: # dmesg|grep smp [0.066624] smp: Bringing up secondary CPUs ... [ 347.521264] smp: Brought up 1 node, 2048 CPUs with this series: # dmesg|grep smp [0.067744] smp: Bringing up secondary CPUs ... [ 5.416910] smp: Brought up 1 node, 2048 CPUs Tested-by: Satheesh Rajendran Regards, -Satheesh > > Architecture:ppc64le > Byte Order: Little Endian > CPU(s): 160 > On-line CPU(s) list: 0-159 > Thread(s) per core: 4 > Core(s) per socket: 20 > Socket(s): 2 > NUMA node(s):2 > Model: 2.2 (pvr 004e 1202) > Model name: POWER9, altivec supported > CPU max MHz: 3800. > CPU min MHz: 2166. > L1d cache: 32K > L1i cache: 32K > L2 cache:512K > L3 cache:10240K > NUMA node0 CPU(s): 0-79 > NUMA node8 CPU(s): 80-159 > > without patch (powerpc/next) > [0.099347] smp: Bringing up secondary CPUs ... > [0.832513] smp: Brought up 2 nodes, 160 CPUs > > with powerpc/next + coregroup support patchset > [0.099241] smp: Bringing up secondary CPUs ... 
> [0.835627] smp: Brought up 2 nodes, 160 CPUs > > with powerpc/next + coregroup + this patchset > [0.097232] smp: Bringing up secondary CPUs ... > [0.528457] smp: Brought up 2 nodes, 160 CPUs > > x ppc64_cpu --smt=1 > + ppc64_cpu --smt=4 > > without patch > N Min MaxMedian AvgStddev > x 100 11.82 17.06 14.01 14.05 1.2665247 > + 100 12.25 16.59 13.86 14.1143 1.164293 > > with patch > N Min MaxMedian AvgStddev > x 100 12.68 16.15 14.2414.2380.75489246 > + 100 12.93 15.85 14.35 14.28970.60041813 > > Cc: linuxppc-dev > Cc: LKML > Cc: Michael Ellerman > Cc: Nicholas Piggin > Cc: Anton Blanchard > Cc: Oliver O'Halloran > Cc: Nathan Lynch > Cc: Michael Neuling > Cc: Gautham R Shenoy > Cc: Satheesh Rajendran > Cc: Ingo Molnar > Cc: Peter Zijlstra > Cc: Valentin Schneider > > Srikar Dronamraju (7): > powerpc/topology: Update topology_core_cpumask > powerpc/smp: Stop updating cpu_core_mask > powerpc/smp: Remove get_physical_package_id > powerpc/smp: Optimize remove_cpu_from_masks > powerpc/smp: Limit cpus traversed to within a node. > powerpc/smp: Stop passing mask to update_mask_by_l2 > powerpc/smp: Depend on cpu_l1_cache_map when adding cpus > > arch/powerpc/include/asm/smp.h | 5 -- > arch/powerpc/include/asm/topology.h | 7 +-- > arch/powerpc/kernel/smp.c | 79 + > 3 files changed, 24 insertions(+), 67 deletions(-) > > -- > 2.17.1 >
[PATCH V2] powerpc/pseries/svm: Remove unwanted check for shared_lppaca_size
Early secure guest boot hits the below crash when booting with a vCPU count aligned to the page boundary (i.e. 64, 128, etc., for a PAGE size of 64k and an LPPACA size of 1k), due to the BUG_ON assert that shared_lppaca_total_size equals shared_lppaca_size, [0.00] Partition configured for 64 cpus. [0.00] CPU maps initialized for 1 thread per core [0.00] [ cut here ] [0.00] kernel BUG at arch/powerpc/kernel/paca.c:89! [0.00] Oops: Exception in kernel mode, sig: 5 [#1] [0.00] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries This check is not necessary; let's remove it. Fixes: bd104e6db6f0 ("powerpc/pseries/svm: Use shared memory for LPPACA structures") Cc: linux-kernel@vger.kernel.org Cc: Michael Ellerman Cc: Thiago Jung Bauermann Cc: Ram Pai Cc: Sukadev Bhattiprolu Cc: Laurent Dufour Reviewed-by: Laurent Dufour Reviewed-by: Thiago Jung Bauermann Signed-off-by: Satheesh Rajendran --- V2: Added Reviewed by Thiago and Laurent. Added Fixes tag as per Thiago's suggestion. V1: https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200609105731.14032-1-sathn...@linux.vnet.ibm.com/ --- arch/powerpc/kernel/paca.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c index 2168372b792d..74da65aacbc9 100644 --- a/arch/powerpc/kernel/paca.c +++ b/arch/powerpc/kernel/paca.c @@ -87,7 +87,7 @@ static void *__init alloc_shared_lppaca(unsigned long size, unsigned long align, * This is very early in boot, so no harm done if the kernel crashes at * this point. */ - BUG_ON(shared_lppaca_size >= shared_lppaca_total_size); + BUG_ON(shared_lppaca_size > shared_lppaca_total_size); return ptr; } -- 2.26.2
[PATCH V2] powerpc/pseries/svm: Drop unused align argument in alloc_shared_lppaca() function
Argument "align" in alloc_shared_lppaca() was unused inside the function. Let's drop it and update code comment for page alignment. Cc: linux-kernel@vger.kernel.org Cc: Thiago Jung Bauermann Cc: Ram Pai Cc: Sukadev Bhattiprolu Cc: Laurent Dufour Cc: Michael Ellerman Reviewed-by: Thiago Jung Bauermann Signed-off-by: Satheesh Rajendran --- V2: Added reviewed by Thiago. Dropped align argument as per Michael suggest. Modified commit msg. V1: http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200609113909.17236-1-sathn...@linux.vnet.ibm.com/ --- arch/powerpc/kernel/paca.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c index 8d96169c597e..a174d64d9b4d 100644 --- a/arch/powerpc/kernel/paca.c +++ b/arch/powerpc/kernel/paca.c @@ -57,8 +57,8 @@ static void *__init alloc_paca_data(unsigned long size, unsigned long align, #define LPPACA_SIZE 0x400 -static void *__init alloc_shared_lppaca(unsigned long size, unsigned long align, - unsigned long limit, int cpu) +static void *__init alloc_shared_lppaca(unsigned long size, unsigned long limit, + int cpu) { size_t shared_lppaca_total_size = PAGE_ALIGN(nr_cpu_ids * LPPACA_SIZE); static unsigned long shared_lppaca_size; @@ -68,6 +68,12 @@ static void *__init alloc_shared_lppaca(unsigned long size, unsigned long align, if (!shared_lppaca) { memblock_set_bottom_up(true); + /* See Documentation/powerpc/ultravisor.rst for mode details +* +* UV/HV data share is in PAGE granularity, In order to +* minimize the number of pages shared and maximize the +* use of a page, let's use page align. 
+*/ shared_lppaca = memblock_alloc_try_nid(shared_lppaca_total_size, PAGE_SIZE, MEMBLOCK_LOW_LIMIT, @@ -122,7 +128,7 @@ static struct lppaca * __init new_lppaca(int cpu, unsigned long limit) return NULL; if (is_secure_guest()) - lp = alloc_shared_lppaca(LPPACA_SIZE, 0x400, limit, cpu); + lp = alloc_shared_lppaca(LPPACA_SIZE, limit, cpu); else lp = alloc_paca_data(LPPACA_SIZE, 0x400, limit, cpu); -- 2.26.2
[PATCH] powerpc/pseries/svm: Fixup align argument in alloc_shared_lppaca() function
Argument "align" in alloc_shared_lppaca() function was unused inside the function. Let's fix it and update code comment. Cc: linux-kernel@vger.kernel.org Cc: Thiago Jung Bauermann Cc: Ram Pai Cc: Sukadev Bhattiprolu Cc: Laurent Dufour Signed-off-by: Satheesh Rajendran --- arch/powerpc/kernel/paca.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c index 8d96169c597e..9088e107fb43 100644 --- a/arch/powerpc/kernel/paca.c +++ b/arch/powerpc/kernel/paca.c @@ -70,7 +70,7 @@ static void *__init alloc_shared_lppaca(unsigned long size, unsigned long align, shared_lppaca = memblock_alloc_try_nid(shared_lppaca_total_size, - PAGE_SIZE, MEMBLOCK_LOW_LIMIT, + align, MEMBLOCK_LOW_LIMIT, limit, NUMA_NO_NODE); if (!shared_lppaca) panic("cannot allocate shared data"); @@ -122,7 +122,14 @@ static struct lppaca * __init new_lppaca(int cpu, unsigned long limit) return NULL; if (is_secure_guest()) - lp = alloc_shared_lppaca(LPPACA_SIZE, 0x400, limit, cpu); + /* +* See Documentation/powerpc/ultravisor.rst for mode details +* +* UV/HV data share is in PAGE granularity, In order to minimize +* the number of pages shared and maximize the use of a page, +* let's use page align. +*/ + lp = alloc_shared_lppaca(LPPACA_SIZE, PAGE_SIZE, limit, cpu); else lp = alloc_paca_data(LPPACA_SIZE, 0x400, limit, cpu); -- 2.26.2
[PATCH] powerpc/pseries/svm: Remove unwanted check for shared_lppaca_size
Early secure guest boot hits the below crash when booting with a vCPU count aligned to the page boundary (i.e. 64, 128, etc., for a PAGE size of 64k and an LPPACA size of 1k), due to the BUG_ON assert that shared_lppaca_total_size equals shared_lppaca_size, [0.00] Partition configured for 64 cpus. [0.00] CPU maps initialized for 1 thread per core [0.00] [ cut here ] [0.00] kernel BUG at arch/powerpc/kernel/paca.c:89! [0.00] Oops: Exception in kernel mode, sig: 5 [#1] [0.00] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries This check is not necessary; let's remove it. Cc: linux-kernel@vger.kernel.org Cc: Thiago Jung Bauermann Cc: Ram Pai Cc: Sukadev Bhattiprolu Cc: Laurent Dufour Signed-off-by: Satheesh Rajendran --- arch/powerpc/kernel/paca.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c index 949eceb25..10b7c54a7 100644 --- a/arch/powerpc/kernel/paca.c +++ b/arch/powerpc/kernel/paca.c @@ -86,7 +86,7 @@ static void *__init alloc_shared_lppaca(unsigned long size, unsigned long align, * This is very early in boot, so no harm done if the kernel crashes at * this point. */ - BUG_ON(shared_lppaca_size >= shared_lppaca_total_size); + BUG_ON(shared_lppaca_size > shared_lppaca_total_size); return ptr; } -- 2.26.2
Re: [mainline][Oops][bisected 2ba3e6 ] 5.7.0 boot fails with kernel panic on powerpc
On Wed, Jun 03, 2020 at 03:32:57PM +0200, Joerg Roedel wrote: > On Wed, Jun 03, 2020 at 04:20:57PM +0530, Abdul Haleem wrote: > > @Joerg, Could you please have a look? > > Can you please try the attached patch? Hi Joerg, I did hit the similar boot failue on a Power9 baremetal box(mentioned in Note) and your below patch helped solving that for my environment and am able to boot the system fine. ... Fedora 31 (Thirty One) Kernel 5.7.0-gd6f9469a0-dirty on an ppc64le (hvc0) login: Tested-by: Satheesh Rajendran Note: for the record, here is the boot failure call trace. [0.023555] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl [0.023582] pid_max: default: 163840 minimum: 1280 [0.035014] BUG: Unable to handle kernel data access on read at 0xc060 [0.035058] Faulting instruction address: 0xc0382304 [0.035074] Oops: Kernel access of bad area, sig: 11 [#1] [0.035097] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV [0.035113] Modules linked in: [0.035136] CPU: 24 PID: 0 Comm: swapper/24 Not tainted 5.7.0-gd6f9469a0 #1 [0.035161] NIP: c0382304 LR: c038407c CTR: [0.035197] REGS: c167f930 TRAP: 0300 Not tainted (5.7.0-gd6f9469a0) [0.035241] MSR: 92009033 CR: 42022422 XER: [0.035294] CFAR: c03822fc DAR: c060 DSISR: 4000 IRQMASK: 0 [0.035294] GPR00: c038407c c167fbc0 c168090[ 150.252645597,5] OPAL: Reboot request... [ 150.252928266,5] RESET: Initiating fast reboot 1... 
0 c008 [0.035294] GPR04: 01ff c008001f 0060 [0.035294] GPR08: 6000 0005 c060 c0080020 [0.035294] GPR12: 22022422 c187 c000 c008 [0.035294] GPR16: c00807ff c0080020 c060 [0.035294] GPR20: c0080800 c0080800 c00807ff c00807ff [0.035294] GPR24: c163f7c8 c172d0c0 0001 0001 [0.035294] GPR28: c1708000 c172d0c8 c0080800 [0.035622] NIP [c0382304] map_kernel_range_noflush+0x274/0x510 [0.035657] LR [c038407c] __vmalloc_node_range+0x2ec/0x3a0 [0.035690] Call Trace: [0.035709] [c167fbc0] [c038d848] __alloc_pages_nodemask+0x158/0x3f0 (unreliable) [0.035750] [c167fc90] [c038407c] __vmalloc_node_range+0x2ec/0x3a0 [0.035787] [c167fd40] [c0384268] __vmalloc+0x58/0x70 [0.035823] [c167fdb0] [c1056db8] alloc_large_system_hash+0x204/0x304 [0.035870] [c167fe60] [c105c1f0] vfs_caches_init+0xd8/0x138 [0.035916] [c167fee0] [c10242a0] start_kernel+0x644/0x6ec [0.035960] [c167ff90] [c000ca9c] start_here_common+0x1c/0x400 [0.036004] Instruction dump: [0.036016] 3af4 6000 6000 38c90010 7f663036 7d667a14 7cc600d0 7d713038 [0.036038] 38d1 7c373040 41810008 7e91a378 2c25 418201b4 7f464830 [0.036083] ---[ end trace c7e72029dfacc217 ]--- [0.036114] [1.036223] Kernel panic - not syncing: Attempted to kill the idle task! [1.036858] Rebooting in 10 seconds.. Regards, -Satheesh. > > diff --git a/include/asm-generic/5level-fixup.h > b/include/asm-generic/5level-fixup.h > index 58046ddc08d0..afbab31fbd7e 100644 > --- a/include/asm-generic/5level-fixup.h > +++ b/include/asm-generic/5level-fixup.h > @@ -17,6 +17,11 @@ > ((unlikely(pgd_none(*(p4d))) && __pud_alloc(mm, p4d, address)) ? \ > NULL : pud_offset(p4d, address)) > > +#define pud_alloc_track(mm, p4d, address, mask) > \ > + ((unlikely(pgd_none(*(p4d))) && > \ > + (__pud_alloc(mm, p4d, address) || > ({*(mask)|=PGTBL_P4D_MODIFIED;0;})))? 
\ > + NULL : pud_offset(p4d, address)) > + > #define p4d_alloc(mm, pgd, address) (pgd) > #define p4d_alloc_track(mm, pgd, address, mask) (pgd) > #define p4d_offset(pgd, start) (pgd) > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 7e07f4f490cb..d46bf03b804f 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -2088,35 +2088,35 @@ static inline pud_t *pud_alloc(struct mm_struct *mm, > p4d_t *p4d, > NULL : pud_offset(p4d, address); > } > > -static inline p4d_t *p4d_alloc_track(struct mm_struct *mm, pgd_t *pgd, > +static inline pud_t *pud_alloc_track(struct mm_struct *mm, p4d_t *p4d, >
Re: [PATCH] recordmcount: Fix spurious mcount entries on powerpc
On Thu, Jun 27, 2019 at 12:08:01AM +0530, Naveen N. Rao wrote: > The recent change enabling HAVE_C_RECORDMCOUNT on powerpc started > showing the following issue: > > # modprobe kprobe_example >ftrace-powerpc: Not expected bl: opcode is 3c4c0001 >WARNING: CPU: 0 PID: 227 at kernel/trace/ftrace.c:2001 > ftrace_bug+0x90/0x318 >Modules linked in: >CPU: 0 PID: 227 Comm: modprobe Not tainted 5.2.0-rc6-00678-g1c329100b942 #2 >NIP: c0264318 LR: c025d694 CTR: c0f5cd30 >REGS: c1f2b7b0 TRAP: 0700 Not tainted > (5.2.0-rc6-00678-g1c329100b942) >MSR: 90010282b033 CR: > 28228222 XER: >CFAR: c02642fc IRQMASK: 0 > >NIP [c0264318] ftrace_bug+0x90/0x318 >LR [c025d694] ftrace_process_locs+0x4f4/0x5e0 >Call Trace: >[c1f2ba40] [0004] 0x4 (unreliable) >[c1f2bad0] [c025d694] ftrace_process_locs+0x4f4/0x5e0 >[c1f2bb90] [c020ff10] load_module+0x25b0/0x30c0 >[c1f2bd00] [c0210cb0] sys_finit_module+0xc0/0x130 >[c1f2be20] [c000bda4] system_call+0x5c/0x70 >Instruction dump: >419e0018 2f83 419e00bc 2f83ffea 409e00cc 481c 0fe0 3c62ff96 >3901 3940 386386d0 48c4 <0fe0> 3ce20003 3901 3c62ff96 >---[ end trace 4c438d5cebf78381 ]--- >ftrace failed to modify >[] 0xc008012a0008 > actual: 01:00:4c:3c >Initializing ftrace call sites >ftrace record flags: 200 > (0) > expected tramp: c006af4c > > Looking at the relocation records in __mcount_loc showed a few spurious > entries: > RELOCATION RECORDS FOR [__mcount_loc]: > OFFSET TYPE VALUE > R_PPC64_ADDR64.text.unlikely+0x0008 > 0008 R_PPC64_ADDR64.text.unlikely+0x0014 > 0010 R_PPC64_ADDR64.text.unlikely+0x0060 > 0018 R_PPC64_ADDR64.text.unlikely+0x00b4 > 0020 R_PPC64_ADDR64.init.text+0x0008 > 0028 R_PPC64_ADDR64.init.text+0x0014 > > The first entry in each section is incorrect. 
Looking at the relocation > records, the spurious entries correspond to the R_PPC64_ENTRY records: > RELOCATION RECORDS FOR [.text.unlikely]: > OFFSET TYPE VALUE > R_PPC64_REL64 .TOC.-0x0008 > 0008 R_PPC64_ENTRY *ABS* > 0014 R_PPC64_REL24 _mcount > > > The problem is that we are not validating the return value from > get_mcountsym() in sift_rel_mcount(). With this entry, mcountsym is 0, > but Elf_r_sym(relp) also ends up being 0. Fix this by ensuring mcountsym > is valid before processing the entry. > > Fixes: c7d64b560ce80 ("powerpc/ftrace: Enable C Version of recordmcount") > Signed-off-by: Naveen N. Rao > --- > scripts/recordmcount.h | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) Hi, linuxppc/merge branch latest commit 850f6274c5b5 was failing to boot IBM Power8 Box, and the failure got resolved with this patch. https://github.com/linuxppc/issues/issues/254 # git log -2 --oneline 0e0f55b31ea8 (HEAD -> issue_254) recordmcount: Fix spurious mcount entries on powerpc 850f6274c5b5 (origin/merge, merge) Automatic merge of branches 'master', 'next' and 'fixes' into merge # uname -r 5.2.0-rc6-00123-g0e0f55b31ea8 Tested-by: Satheesh Rajendran Regards, -Satheesh > > diff --git a/scripts/recordmcount.h b/scripts/recordmcount.h > index 13c5e6c8829c..47fca2c69a73 100644 > --- a/scripts/recordmcount.h > +++ b/scripts/recordmcount.h > @@ -325,7 +325,8 @@ static uint_t *sift_rel_mcount(uint_t *mlocp, > if (!mcountsym) > mcountsym = get_mcountsym(sym0, relp, str0); > > - if (mcountsym == Elf_r_sym(relp) && !is_fake_mcount(relp)) { > + if (mcountsym && mcountsym == Elf_r_sym(relp) && > + !is_fake_mcount(relp)) { > uint_t const addend = > _w(_w(relp->r_offset) - recval + mcount_adjust); > mrelp->r_offset = _w(offbase > -- > 2.22.0 >
Re: [PATCH v5 0/5] powerpc: system call table generation support
Hi Firoz, On Thu, Dec 13, 2018 at 02:32:45PM +0530, Firoz Khan wrote: > The purpose of this patch series is, we can easily > add/modify/delete system call table support by cha- > nging entry in syscall.tbl file instead of manually > changing many files. The other goal is to unify the > system call table generation support implementation > across all the architectures. > > The system call tables are in different format in > all architecture. It will be difficult to manually > add, modify or delete the system calls in the resp- > ective files manually. To make it easy by keeping a > script and which'll generate uapi header file and > syscall table file. > > syscall.tbl contains the list of available system > calls along with system call number and correspond- > ing entry point. Add a new system call in this arch- > itecture will be possible by adding new entry in > the syscall.tbl file. > > Adding a new table entry consisting of: > - System call number. > - ABI. > - System call name. > - Entry point name. > - Compat entry name, if required. > - spu entry name, if required. > > ARM, s390 and x86 architecuture does exist the sim- > ilar support. I leverage their implementation to > come up with a generic solution. > > I have done the same support for work for alpha, > ia64, m68k, microblaze, mips, parisc, sh, sparc, > and xtensa. Below mentioned git repository contains > more details about the workflow. > > https://github.com/frzkhn/system_call_table_generator/ > > Finally, this is the ground work to solve the Y2038 > issue. We need to add two dozen of system calls to > solve Y2038 issue. So this patch series will help to > add new system calls easily by adding new entry in the > syscall.tbl. > > Changes since v4: > - DOTSYM macro removed for ppc32, which was causing >the compilation error. > > Changes since v3: > - split compat syscall table out from native table. > - modified the script to add new line in the generated >file. 
> > Changes since v2: > - modified/optimized the syscall.tbl to avoid duplicate >for the spu entries. > - updated the syscalltbl.sh to meet the above point. > > Changes since v1: > - optimized/updated the syscall table generation >scripts. > - fixed all mixed indentation issues in syscall.tbl. > - added "comments" in syscall_*.tbl. > - changed from generic-y to generated-y in Kbuild. > > Firoz Khan (5): > powerpc: add __NR_syscalls along with NR_syscalls > powerpc: move macro definition from asm/systbl.h > powerpc: add system call table generation support > powerpc: split compat syscall table out from native table > powerpc: generate uapi header and system call table files Tried to apply on linus "master" and linuxppc-dev(https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git) "merge" branch, both failed to apply series. # git am mbox Applying: powerpc: add __NR_syscalls along with NR_syscalls Applying: powerpc: move macro definition from asm/systbl.h Applying: powerpc: add system call table generation support Applying: powerpc: split compat syscall table out from native table Applying: powerpc: generate uapi header and system call table files error: patch failed: arch/powerpc/include/uapi/asm/Kbuild:1 error: arch/powerpc/include/uapi/asm/Kbuild: patch does not apply Patch failed at 0005 powerpc: generate uapi header and system call table files Use 'git am --show-current-patch' to see the failed patch When you have resolved this problem, run "git am --continue". If you prefer to skip this patch, run "git am --skip" instead. To restore the original branch and stop patching, run "git am --abort". Then, tried with linuxppc-dev(https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git) "next" branch, patch got applied, compiled with ppc64le_defconfig and booted on IBM Power8 box. # uname -r 4.20.0-rc2-gdd2690d2c Looks like patch series needs a rebase against the latest kernel versions. Thanks, -Satheesh. 
> > arch/powerpc/Makefile | 3 + > arch/powerpc/include/asm/Kbuild | 4 + > arch/powerpc/include/asm/syscall.h | 3 +- > arch/powerpc/include/asm/systbl.h | 396 -- > arch/powerpc/include/asm/unistd.h | 3 +- > arch/powerpc/include/uapi/asm/Kbuild| 2 + > arch/powerpc/include/uapi/asm/unistd.h | 389 + > arch/powerpc/kernel/Makefile| 10 - > arch/powerpc/kernel/entry_64.S | 7 +- > arch/powerpc/kernel/syscalls/Makefile | 63 > arch/powerpc/kernel/syscalls/syscall.tbl| 427 > > arch/powerpc/kernel/syscalls/syscallhdr.sh | 37 +++ > arch/powerpc/kernel/syscalls/syscalltbl.sh | 36 +++ > arch/powerpc/kernel/systbl.S| 40 ++- > arch/powerpc/kernel/systbl_chk.c| 60 > arch/powerpc/kernel/vdso.c |
Re: [PATCH v4 0/5] powerpc: system call table generation support
On Fri, Dec 07, 2018 at 11:41:35AM +0530, Firoz Khan wrote:
> The purpose of this patch series is, we can easily
> add/modify/delete system call table support by changing
> an entry in the syscall.tbl file instead of manually
> changing many files. The other goal is to unify the
> system call table generation support implementation
> across all the architectures.
>
> The system call tables are in different formats across
> architectures, which makes it difficult to manually add,
> modify or delete system calls in the respective files.
> To make this easy, keep a script which will generate the
> uapi header file and the syscall table file.
>
> syscall.tbl contains the list of available system calls
> along with the system call number and corresponding
> entry point. Adding a new system call in this
> architecture is then possible by adding a new entry in
> the syscall.tbl file.
>
> Adding a new table entry consists of:
> - System call number.
> - ABI.
> - System call name.
> - Entry point name.
> - Compat entry name, if required.
> - spu entry name, if required.
>
> ARM, s390 and x86 architectures have similar support.
> I leveraged their implementation to come up with a
> generic solution.
>
> I have done the same work for alpha, ia64, m68k,
> microblaze, mips, parisc, sh, sparc, and xtensa. The
> git repository below contains more details about the
> workflow.
>
> https://github.com/frzkhn/system_call_table_generator/
>
> Finally, this is the groundwork to solve the Y2038
> issue. We need to add about two dozen system calls to
> solve the Y2038 issue, so this patch series will help to
> add new system calls easily by adding new entries in
> syscall.tbl.
>
> Changes since v3:
> - split compat syscall table out from native table.
> - modified the script to add a new line in the generated
>   file.

Hi Firoz,

This version (v4) booted fine on an IBM Power8 box.
Compiled with 'ppc64le_defconfig' against
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=merge=a26b21082959cee3075b3edb7ef23071c7e0b09a

Reference failure with the v3 version:
https://lists.ozlabs.org/pipermail/linuxppc-dev/2018-November/182110.html

Regards,
-Satheesh.

> Changes since v2:
> - modified/optimized the syscall.tbl to avoid duplicates
>   for the spu entries.
> - updated syscalltbl.sh to match the above point.
>
> Changes since v1:
> - optimized/updated the syscall table generation scripts.
> - fixed all mixed indentation issues in syscall.tbl.
> - added "comments" in syscall_*.tbl.
> - changed from generic-y to generated-y in Kbuild.
>
> Firoz Khan (5):
>   powerpc: add __NR_syscalls along with NR_syscalls
>   powerpc: move macro definition from asm/systbl.h
>   powerpc: add system call table generation support
>   powerpc: split compat syscall table out from native table
>   powerpc: generate uapi header and system call table files
>
>  arch/powerpc/Makefile                       |   3 +
>  arch/powerpc/include/asm/Kbuild             |   4 +
>  arch/powerpc/include/asm/syscall.h          |   3 +-
>  arch/powerpc/include/asm/systbl.h           | 396 --
>  arch/powerpc/include/asm/unistd.h           |   3 +-
>  arch/powerpc/include/uapi/asm/Kbuild        |   2 +
>  arch/powerpc/include/uapi/asm/unistd.h      | 389 +
>  arch/powerpc/kernel/Makefile                |  10 -
>  arch/powerpc/kernel/entry_64.S              |   7 +-
>  arch/powerpc/kernel/syscalls/Makefile       |  63
>  arch/powerpc/kernel/syscalls/syscall.tbl    | 427
>  arch/powerpc/kernel/syscalls/syscallhdr.sh  |  37 +++
>  arch/powerpc/kernel/syscalls/syscalltbl.sh  |  36 +++
>  arch/powerpc/kernel/systbl.S                |  40 ++-
>  arch/powerpc/kernel/systbl_chk.c            |  60
>  arch/powerpc/kernel/vdso.c                  |   7 +-
>  arch/powerpc/platforms/cell/spu_callbacks.c |  17 +-
>  17 files changed, 606 insertions(+), 898 deletions(-)
>  delete mode 100644 arch/powerpc/include/asm/systbl.h
>  create mode 100644 arch/powerpc/kernel/syscalls/Makefile
>  create mode 100644 arch/powerpc/kernel/syscalls/syscall.tbl
>  create mode 100644 arch/powerpc/kernel/syscalls/syscallhdr.sh
>  create mode 100644 arch/powerpc/kernel/syscalls/syscalltbl.sh
>  delete mode 100644 arch/powerpc/kernel/systbl_chk.c
>
> --
> 1.9.1
>
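To illustrate the table-driven idea the cover letter describes, here is a minimal sketch in Python (the kernel's real generator, syscallhdr.sh, is a shell script; the table fragment and field layout below follow the entry format listed above, but the concrete syscall lines are just examples):

```python
# Illustrative sketch of syscall.tbl-driven header generation.
# Fields per line: <number> <abi> <name> <entry point> [<compat entry>].
# NOT the kernel's actual script -- that is arch/*/kernel/syscalls/syscallhdr.sh.
def generate_unistd_header(tbl_text, abis):
    defines = []
    for line in tbl_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        number, abi, name = fields[0], fields[1], fields[2]
        if abi in abis:
            defines.append(f"#define __NR_{name} {number}")
    return "\n".join(defines)

# Hypothetical table fragment in the format described in the cover letter.
SAMPLE_TBL = """
# <number> <abi> <name> <entry point> [<compat entry>]
0   common  restart_syscall  sys_restart_syscall
1   common  exit             sys_exit
3   common  read             sys_read  compat_sys_read
"""

print(generate_unistd_header(SAMPLE_TBL, {"common"}))
```

Adding a syscall then means adding one table line and regenerating, which is the whole point of the series.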
Re: System not booting since dm changes? (was Linux 4.20-rc1)
On Mon, Nov 05, 2018 at 08:51:57AM -0500, Mike Snitzer wrote:
> On Mon, Nov 05 2018 at 5:25am -0500,
> Michael Ellerman wrote:
>
> > Linus Torvalds writes:
> > ...
> > > Mike Snitzer (1):
> > >       device mapper updates
> >
> > Hi Mike,
> >
> > Replying here because I can't find the device-mapper pull or the patch
> > in question on LKML. I guess I should be subscribed to dm-devel.
> >
> > We have a box that doesn't boot any more, bisect points at one of:
> >
> >   cef6f55a9fb4 Mike Snitzer  dm table: require that request-based DM
> >                              be layered on blk-mq devices
> >   953923c09fe8 Mike Snitzer  dm: rename DM_TYPE_MQ_REQUEST_BASED to
> >                              DM_TYPE_REQUEST_BASED
> >   6a23e05c2fe3 Jens Axboe    dm: remove legacy request-based IO path
> >
> > It's a Power8 system running Rawhide, it does have multipath, but I'm
> > told it was setup by the Fedora installer, ie. nothing fancy.
> >
> > The symptom is the system can't find its root filesystem and drops into
> > the initramfs shell. The dmesg includes a bunch of errors like below:
> >
> >   [   43.263460] localhost multipathd[1344]: sdb: fail to get serial
> >   [   43.268762] localhost multipathd[1344]: mpatha: failed in domap for
> >                  addition of new path sdb
> >   [   43.268762] localhost multipathd[1344]: uevent trigger error
> >   [   43.282065] localhost kernel: device-mapper: table: table load
> >                  rejected: not all devices are blk-mq request-stackable
> ...
> >
> > Any ideas what's going wrong here?
>
> "table load rejected: not all devices are blk-mq request-stackable"
> speaks to the fact that you aren't using blk-mq for scsi (aka scsi-mq).
>
> You need to use scsi_mod.use_blk_mq=Y on the kernel commandline (or set
> CONFIG_SCSI_MQ_DEFAULT in your kernel config)

Thanks Mike! The above solution worked and the system boots fine now :-)

# uname -r
4.20.0-rc1+

# cat /proc/cmdline
root=/dev/mapper/fedora_ltc--test--ci2-root ro rd.lvm.lv=fedora_ltc-test-ci2/root rd.lvm.lv=fedora_ltc-test-ci2/swap scsi_mod.use_blk_mq=Y

CONFIG_SCSI_MQ_DEFAULT was not set in my kernel config; I will set it in future runs.

Thanks Michael!

Regards,
-Satheesh.

>
> Mike
>
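As a trivial sanity-check sketch (plain Python, not anything from the kernel tree), one can verify that the command line actually carries the parameter Mike suggests; the sample string below is the /proc/cmdline quoted above:

```python
# Minimal sketch: check a kernel command line (e.g. the contents of
# /proc/cmdline) for the scsi_mod.use_blk_mq boolean module parameter.
# Kernel boolean params accept Y/N/1/0, so both spellings are checked.
def uses_scsi_mq(cmdline: str) -> bool:
    params = cmdline.split()
    return "scsi_mod.use_blk_mq=Y" in params or "scsi_mod.use_blk_mq=1" in params

# The /proc/cmdline from this report.
CMDLINE = ("root=/dev/mapper/fedora_ltc--test--ci2-root ro "
           "rd.lvm.lv=fedora_ltc-test-ci2/root "
           "rd.lvm.lv=fedora_ltc-test-ci2/swap scsi_mod.use_blk_mq=Y")

print(uses_scsi_mq(CMDLINE))
```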
[tip:perf/core] perf bench numa: Fixup discontiguous/sparse numa nodes
Commit-ID:  321a7c35c90cc834851ceda18a8ee18f1d032b92
Gitweb:     https://git.kernel.org/tip/321a7c35c90cc834851ceda18a8ee18f1d032b92
Author:     Satheesh Rajendran <sathn...@linux.vnet.ibm.com>
AuthorDate: Wed, 22 Nov 2017 22:13:53 +0530
Committer:  Arnaldo Carvalho de Melo <a...@redhat.com>
CommitDate: Tue, 28 Nov 2017 14:28:10 -0300

perf bench numa: Fixup discontiguous/sparse numa nodes

Certain systems are designed to have sparse/discontiguous nodes. On such
systems, 'perf bench numa' hangs, shows a wrong number of nodes and shows
values for non-existent nodes. Handle this by only taking nodes that are
exposed by the kernel to userspace.

Signed-off-by: Satheesh Rajendran <sathn...@linux.vnet.ibm.com>
Reviewed-by: Srikar Dronamraju <sri...@linux.vnet.ibm.com>
Acked-by: Naveen N. Rao <naveen.n@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1edbcd353c009e109e93d78f2f46381930c340fe.1511368645.git.sathn...@linux.vnet.ibm.com
Signed-off-by: Balamuruhan S <bal...@linux.vnet.ibm.com>
Signed-off-by: Arnaldo Carvalho de Melo <a...@redhat.com>
---
 tools/perf/bench/numa.c | 56 -
 1 file changed, 51 insertions(+), 5 deletions(-)

diff --git a/tools/perf/bench/numa.c b/tools/perf/bench/numa.c
index d95fdcc..944070e 100644
--- a/tools/perf/bench/numa.c
+++ b/tools/perf/bench/numa.c
@@ -216,6 +216,47 @@ static const char * const numa_usage[] = {
 	NULL
 };
 
+/*
+ * To get number of numa nodes present.
+ */
+static int nr_numa_nodes(void)
+{
+	int i, nr_nodes = 0;
+
+	for (i = 0; i < g->p.nr_nodes; i++) {
+		if (numa_bitmask_isbitset(numa_nodes_ptr, i))
+			nr_nodes++;
+	}
+
+	return nr_nodes;
+}
+
+/*
+ * To check if given numa node is present.
+ */
+static int is_node_present(int node)
+{
+	return numa_bitmask_isbitset(numa_nodes_ptr, node);
+}
+
+/*
+ * To check given numa node has cpus.
+ */
+static bool node_has_cpus(int node)
+{
+	struct bitmask *cpu = numa_allocate_cpumask();
+	unsigned int i;
+
+	if (cpu && !numa_node_to_cpus(node, cpu)) {
+		for (i = 0; i < cpu->size; i++) {
+			if (numa_bitmask_isbitset(cpu, i))
+				return true;
+		}
+	}
+
+	return false; /* lets fall back to nocpus safely */
+}
+
 static cpu_set_t bind_to_cpu(int target_cpu)
 {
 	cpu_set_t orig_mask, mask;
@@ -244,12 +285,12 @@ static cpu_set_t bind_to_cpu(int target_cpu)
 
 static cpu_set_t bind_to_node(int target_node)
 {
-	int cpus_per_node = g->p.nr_cpus/g->p.nr_nodes;
+	int cpus_per_node = g->p.nr_cpus / nr_numa_nodes();
 	cpu_set_t orig_mask, mask;
 	int cpu;
 	int ret;
 
-	BUG_ON(cpus_per_node*g->p.nr_nodes != g->p.nr_cpus);
+	BUG_ON(cpus_per_node * nr_numa_nodes() != g->p.nr_cpus);
 	BUG_ON(!cpus_per_node);
 
 	ret = sched_getaffinity(0, sizeof(orig_mask), &orig_mask);
@@ -649,7 +690,7 @@ static int parse_setup_node_list(void)
 			int i;
 
 			for (i = 0; i < mul; i++) {
-				if (t >= g->p.nr_tasks) {
+				if (t >= g->p.nr_tasks || !node_has_cpus(bind_node)) {
 					printf("\n# NOTE: ignoring bind NODEs starting at NODE#%d\n", bind_node);
 					goto out;
 				}
@@ -964,6 +1005,8 @@ static void calc_convergence(double runtime_ns_max, double *convergence)
 	sum = 0;
 
 	for (node = 0; node < g->p.nr_nodes; node++) {
+		if (!is_node_present(node))
+			continue;
 		nr = nodes[node];
 		nr_min = min(nr, nr_min);
 		nr_max = max(nr, nr_max);
@@ -984,8 +1027,11 @@ static void calc_convergence(double runtime_ns_max, double *convergence)
 	process_groups = 0;
 
 	for (node = 0; node < g->p.nr_nodes; node++) {
-		int processes = count_node_processes(node);
+		int processes;
 
+		if (!is_node_present(node))
+			continue;
+		processes = count_node_processes(node);
 		nr = nodes[node];
 		tprintf(" %2d/%-2d", nr, processes);
 
@@ -1291,7 +1337,7 @@ static void print_summary(void)
 	printf("\n ###\n");
 	printf(" # %d %s will execute (on %d nodes, %d CPUs):\n",
-		g->p.nr_tasks, g->p.nr_tasks == 1 ? "task" : "tasks", g->p.nr_nodes, g->p.nr_cpus);
+		g->p.nr_tasks, g->p.nr_tasks == 1 ? "task" : "tasks", nr_numa_nodes(), g->p.nr_cpus);
 	printf(" # %5dx %5ldMB global shared mem operations\n", g->p.nr_loops, g->p.bytes_global/1024/1024);
 	printf(" # %5dx %5ldMB process shared mem operations\n",
Re: [PATCH v4 1/1] perf/bench/numa: Fixup discontiguous/sparse numa nodes
On Wed, 2017-11-22 at 21:39 +0530, Naveen N. Rao wrote:
> >  		sum += nr;
> >  	}
> >  	BUG_ON(nr_min > nr_max);
> > -
> >  	BUG_ON(sum > g->p.nr_tasks);
>
> Looks like that change to remove a blank line did slip in, but that's a
> small nit. Apart from that, the patch looks good to me.
> Acked-by: Naveen N. Rao

Hi Naveen,

Thanks, have sent v5 with that addressed :-)

Regards,
-Satheesh

> - Naveen
Re: [PATCH v3 2/2] perf/bench/numa: Handle discontiguous/sparse numa nodes
Hi Arnaldo,

Please find my reply inline.

On Tue, 2017-10-31 at 12:26 -0300, Arnaldo Carvalho de Melo wrote:
> Em Tue, Oct 31, 2017 at 08:46:58PM +0530, Naveen N. Rao escreveu:
> > On 2017/08/21 10:17AM, sathn...@linux.vnet.ibm.com wrote:
> > > From: Satheesh Rajendran <sathn...@linux.vnet.ibm.com>
> > >
> > > Certain systems are designed to have sparse/discontiguous nodes.
> > > On such systems, perf bench numa hangs, shows wrong number of nodes
> > > and shows values for non-existent nodes. Handle this by only
> > > taking nodes that are exposed by kernel to userspace.
> > >
> > > Cc: Arnaldo Carvalho de Melo <a...@kernel.org>
> > > Reviewed-by: Srikar Dronamraju <sri...@linux.vnet.ibm.com>
> > > Signed-off-by: Satheesh Rajendran <sathn...@linux.vnet.ibm.com>
> > > Signed-off-by: Balamuruhan S <bal...@linux.vnet.ibm.com>
> > > ---
> > >  tools/perf/bench/numa.c | 17 ++---
> > >  1 file changed, 10 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/tools/perf/bench/numa.c b/tools/perf/bench/numa.c
> > > index 2483174..d44 100644
> > > --- a/tools/perf/bench/numa.c
> > > +++ b/tools/perf/bench/numa.c
> > > @@ -287,12 +287,12 @@ static cpu_set_t bind_to_cpu(int target_cpu)
> > >
> > >  static cpu_set_t bind_to_node(int target_node)
> > >  {
> > > -	int cpus_per_node = g->p.nr_cpus/g->p.nr_nodes;
> > > +	int cpus_per_node = g->p.nr_cpus/nr_numa_nodes();
> > >  	cpu_set_t orig_mask, mask;
> > >  	int cpu;
> > >  	int ret;
> > >
> > > -	BUG_ON(cpus_per_node*g->p.nr_nodes != g->p.nr_cpus);
> > > +	BUG_ON(cpus_per_node*nr_numa_nodes() != g->p.nr_cpus);
> > >  	BUG_ON(!cpus_per_node);
> > >
> > >  	ret = sched_getaffinity(0, sizeof(orig_mask), &orig_mask);
> > > @@ -692,7 +692,7 @@ static int parse_setup_node_list(void)
> > >  		int i;
> > >
> > >  		for (i = 0; i < mul; i++) {
> > > -			if (t >= g->p.nr_tasks) {
> > > +			if (t >= g->p.nr_tasks || !node_has_cpus(bind_node)) {
> > >  				printf("\n# NOTE: ignoring bind NODEs starting at NODE#%d\n", bind_node);
> > >  				goto out;
> > >  			}
> > > @@ -973,6 +973,7 @@ static void calc_convergence(double runtime_ns_max, double *convergence)
> > >  	int node;
> > >  	int cpu;
> > >  	int t;
> > > +	int processes;
> > >
> > >  	if (!g->p.show_convergence && !g->p.measure_convergence)
> > >  		return;
> > > @@ -1007,13 +1008,14 @@ static void calc_convergence(double runtime_ns_max, double *convergence)
> > >  	sum = 0;
> > >
> > >  	for (node = 0; node < g->p.nr_nodes; node++) {
> > > +		if (!is_node_present(node))
> > > +			continue;
> > >  		nr = nodes[node];
> > >  		nr_min = min(nr, nr_min);
> > >  		nr_max = max(nr, nr_max);
> > >  		sum += nr;
> > >  	}
> > >  	BUG_ON(nr_min > nr_max);
> > > -
> >
> > Looks like an un-necessary change there.
>
> Right, and I would leave the 'int processes' declaration where it is, as
> it is not used outside that loop.

I had hit this compilation error, so had to move the declaration up:

  CC       bench/numa.o
bench/numa.c: In function ‘calc_convergence’:
bench/numa.c:1035:3: error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement]
   int processes = count_node_processes(node);
   ^
cc1: all warnings being treated as errors
mv: cannot stat ‘bench/.numa.o.tmp’: No such file or directory
make[4]: *** [bench/numa.o] Error 1
make[3]: *** [bench] Error 2
make[2]: *** [perf-in.o] Error 2
make[1]: *** [sub-make] Error 2
make: *** [all] Error 2

> The move of that declaration to the top of the calc_convergence()
> function made me spend some cycles trying to figure out why that was
> done, only to realize that it was an unnecessary change :-\

Agree, I would have kept it in the same scope; will keep it as below:

@@ -984,8 +1026,11 @@ static void calc_convergence(double runtime_ns_max, double *convergence)
 	process_groups = 0;
 
 	for (node = 0; node < g->p.nr_nodes; node++) {
-		int p
Re: [PATCH v3 1/2] perf/bench/numa: Add functions to detect sparse numa nodes
Hi Naveen,

Thanks for the detailed review; my comments are inline.

On Tue, 2017-10-31 at 20:44 +0530, Naveen N. Rao wrote:
> Hi Satheesh,
>
> On 2017/08/21 10:15AM, sathn...@linux.vnet.ibm.com wrote:
> > From: Satheesh Rajendran <sathn...@linux.vnet.ibm.com>
> >
> > Added functions 1) to get a count of all nodes that are exposed to
> > userspace. These nodes could be memoryless cpu nodes or cpuless memory
> > nodes, 2) to check given node is present and 3) to check given
> > node has cpus
> >
> > This information can be used to handle sparse/discontiguous nodes.
> >
> > Cc: Arnaldo Carvalho de Melo <a...@kernel.org>
> > Reviewed-by: Srikar Dronamraju <sri...@linux.vnet.ibm.com>
> > Signed-off-by: Satheesh Rajendran <sathn...@linux.vnet.ibm.com>
> > Signed-off-by: Balamuruhan S <bal...@linux.vnet.ibm.com>
> > ---
> >  tools/perf/bench/numa.c | 44
> >  1 file changed, 44 insertions(+)
> >
> > diff --git a/tools/perf/bench/numa.c b/tools/perf/bench/numa.c
> > index 469d65b..2483174 100644
> > --- a/tools/perf/bench/numa.c
> > +++ b/tools/perf/bench/numa.c
> > @@ -215,6 +215,50 @@ static const char * const numa_usage[] = {
> >  	NULL
> >  };
> >
> > +/*
> > + * To get number of numa nodes present.
> > + */
> > +static int nr_numa_nodes(void)
> > +{
> > +	int i, nr_nodes = 0;
> > +
> > +	for (i = 0; i < g->p.nr_nodes; i++) {
> > +		if (numa_bitmask_isbitset(numa_nodes_ptr, i))
> > +			nr_nodes++;
> > +	}
> > +
> > +	return nr_nodes;
> > +}
> > +
> > +/* 
>
> Please run patches through scripts/checkpatch.pl. There is a trailing
> whitespace above...

Sure.

> > + * To check if given numa node is present.
> > + */
> > +static int is_node_present(int node)
> > +{
> > +	return numa_bitmask_isbitset(numa_nodes_ptr, node);
> > +}
> > +
> > +/*
> > + * To check given numa node has cpus.
> > + */
> > +static bool node_has_cpus(int node)
> > +{
> > +	struct bitmask *cpu = numa_allocate_cpumask();
> > +	unsigned int i;
> > +
> > +	if (cpu == NULL)
> > +		return false; /* lets fall back to nocpus safely */
> > +
> > +	if (numa_node_to_cpus(node, cpu) == 0) {
>
> This can be simplified to:
> 	if (cpu && !numa_node_to_cpus(node, cpu)) {
>
> > +		for (i = 0; i < cpu->size; i++) {
> > +			if (numa_bitmask_isbitset(cpu, i))
> > +				return true;
> > +		}
> > +	}
>
> The indentation on those brackets look to be wrong.

Sure.

> > +
> > +	return false;
> > +}
> > +
>
> More importantly, you've introduced few functions in this patch, but
> none of those are being used. This is not a useful way to split your
> patches. In fact, this hurts bisect since trying to build perf with just
> this patch applied throws errors.

Sure, this can be merged into a single patch; I will do it in the next version.

> You seem to be addressing a few different issues related to perf bench
> numa. You might want to split your patch based on the specific issue(s)
> each change fixes.
>
> - Naveen

Regards,
-Satheesh.

> >  static cpu_set_t bind_to_cpu(int target_cpu)
> >  {
> >  	cpu_set_t orig_mask, mask;
Re: [PATCH v3 2/2] perf/bench/numa: Handle discontiguous/sparse numa nodes
Hi Naveen,

On Tue, 2017-10-31 at 20:46 +0530, Naveen N. Rao wrote:
> >  	}
> >  	BUG_ON(nr_min > nr_max);
> > -
>
> Looks like an un-necessary change there.
>
> - Naveen

I had hit this compilation error, so had to move the declaration up. Please advise. Thanks.

  CC       bench/numa.o
bench/numa.c: In function ‘calc_convergence’:
bench/numa.c:1035:3: error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement]
   int processes = count_node_processes(node);
   ^
cc1: all warnings being treated as errors
mv: cannot stat ‘bench/.numa.o.tmp’: No such file or directory
make[4]: *** [bench/numa.o] Error 1
make[3]: *** [bench] Error 2
make[2]: *** [perf-in.o] Error 2
make[1]: *** [sub-make] Error 2
make: *** [all] Error 2

Regards,
-Satheesh.

> >  	BUG_ON(sum > g->p.nr_tasks);
> >
> >  	if (0 && (sum < g->p.nr_tasks))
@@ -1027,8 +1029,9 @@ static void calc_convergence(double
Re: [PATCH v3 0/2] Fixup for discontiguous/sparse numa nodes
Hi Arnaldo,

Please let me know if there are any further comments. Thanks in advance :-)

Regards,
-Satheesh.

On Mon, 2017-08-21 at 15:45 +0530, sathn...@linux.vnet.ibm.com wrote:
> From: Satheesh Rajendran <sathn...@linux.vnet.ibm.com>
>
> Certain systems have sparse/discontiguous numa nodes.
> perf bench numa doesn't work well on such nodes:
> 1. It shows wrong values.
> 2. It can hang.
> 3. It can show redundant information for non-existent nodes.
>
> #numactl -H
> available: 2 nodes (0,8)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 61352 MB
> node 0 free: 57168 MB
> node 8 cpus: 8 9 10 11 12 13 14 15
> node 8 size: 65416 MB
> node 8 free: 36593 MB
> node distances:
> node   0   8
>   0:  10  40
>   8:  40  10
>
> Scenario 1:
>
> Before Fix:
> # perf bench numa mem --no-data_rand_walk -p 2 -t 20 -G 0 -P 3072 -T 0 -l 50 -c -s 1000
> ...
> # 40 tasks will execute (on 9 nodes, 16 CPUs):    <- wrong number of nodes
> ...
> #  2.0%  [0.2 mins]  1/1   0/0   0/0   0/0   0/0   0/0   0/0   0/0   4/1  [ 4/2 ] l:  0-0   (  0)    <- shows info on non-existent nodes
>
> After Fix:
> # ./perf bench numa mem --no-data_rand_walk -p 2 -t 20 -G 0 -P 3072 -T 0 -l 50 -c -s 1000
> ...
> # 40 tasks will execute (on 2 nodes, 16 CPUs):
> ...
> #  2.0%  [0.2 mins]  9/1   0/0  [ 9/1 ] l:  0-0   (  0)
> #  4.0%  [0.4 mins] 21/2  19/1  [ 2/3 ] l:  0-1   (  1) {1-2}
>
> Scenario 2:
>
> Before Fix:
> # perf bench numa all
> # Running numa/mem benchmark...
> ...
> # Running RAM-bw-remote, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 1 -s 20 -zZq --thp 1 --no-data_rand_walk"
> perf: bench/numa.c:306: bind_to_memnode: Assertion `!(ret)' failed.    <- got hung
>
> After Fix:
> # ./perf bench numa all
> # Running numa/mem benchmark...
> ...
> # Running RAM-bw-remote, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 1 -s 20 -zZq --thp 1 --no-data_rand_walk"
>
> # NOTE: ignoring bind NODEs starting at NODE#1
> # NOTE: 0 tasks mem-bound, 1 tasks unbound
> 20.017 secs slowest (max) thread-runtime
> 20.000 secs fastest (min) thread-runtime
> 20.006 secs average thread-runtime
>  0.043 % difference between max/avg runtime
> 413.794 GB data processed, per thread
> 413.794 GB data processed, total
>  0.048 nsecs/byte/thread runtime
> 20.672 GB/sec/thread speed
> 20.672 GB/sec total speed
>
> Changes in v2:
> Fixed review comments for function names and alloc failure handling.
>
> Changes in v3:
> Coding style fixes.
>
> Satheesh Rajendran (2):
>   perf/bench/numa: Add functions to detect sparse numa nodes
>   perf/bench/numa: Handle discontiguous/sparse numa nodes
>
>  tools/perf/bench/numa.c | 61 +++--
>  1 file changed, 54 insertions(+), 7 deletions(-)
>
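The `numactl -H` output quoted in this cover letter is the easiest way to see the sparse numbering. As a hedged illustration (plain Python, not part of perf), the set of present node ids can be recovered from that output rather than assumed to be 0..N-1; the sample below is the output from this thread:

```python
# Sketch: recover present node ids from `numactl -H` output, instead of
# assuming nodes are numbered 0..N-1. Sample output is from this report.
import re

NUMACTL_H = """\
available: 2 nodes (0,8)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 61352 MB
node 8 cpus: 8 9 10 11 12 13 14 15
node 8 size: 65416 MB
"""

def present_nodes(numactl_output: str):
    nodes = set()
    for m in re.finditer(r"^node (\d+) cpus:", numactl_output, re.MULTILINE):
        nodes.add(int(m.group(1)))
    return sorted(nodes)

print(present_nodes(NUMACTL_H))
```

With node ids {0, 8}, any code that iterates 0..8 inclusive (nine nodes) reports the "9 nodes" shown in Scenario 1 above, which is exactly what the series fixes.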
Re: [PATCH v3 0/2] Fixup for discontiguous/sparse numa nodes
Hi Arnaldo, Please let me know if any further comments. Thanks in advance :-) Regards, -Satheesh.On Mon, 2017-08-21 at 15:45 +0530, sathn...@linux.vnet.ibm.com wrote: > From: Satheesh Rajendran > > Certain systems would have sparse/discontinguous > numa nodes. > perf bench numa doesnt work well on such nodes. > 1. It shows wrong values. > 2. It can hang. > 3. It can show redundant information for non-existant nodes. > > #numactl -H > available: 2 nodes (0,8) > node 0 cpus: 0 1 2 3 4 5 6 7 > node 0 size: 61352 MB > node 0 free: 57168 MB > node 8 cpus: 8 9 10 11 12 13 14 15 > node 8 size: 65416 MB > node 8 free: 36593 MB > node distances: > node 0 8 > 0: 10 40 > 8: 40 10 > > Scenario 1: > > Before Fix: > # perf bench numa mem --no-data_rand_walk -p 2 -t 20 -G 0 -P 3072 -T > 0 -l 50 -c -s 1000 > ... > ... > # 40 tasks will execute (on 9 nodes, 16 CPUs): > Wrong number of > nodes > ... > #2.0% [0.2 > mins] 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 4/1 [ 4/2 ] > l: 0-0 ( 0) > Shows info on non-existant nodes. > > After Fix: > # ./perf bench numa mem --no-data_rand_walk -p 2 -t 20 -G 0 -P 3072 > -T 0 -l 50 -c -s 1000 > ... > ... > # 40 tasks will execute (on 2 nodes, 16 CPUs): > ... > #2.0% [0.2 mins] 9/1 0/0 [ 9/1 ] l: 0-0 ( 0) > #4.0% [0.4 mins] 21/2 19/1 [ 2/3 ] l: 0-1 ( 1) {1-2} > > Scenario 2: > > Before Fix: > # perf bench numa all > # Running numa/mem benchmark... > > ... > # Running RAM-bw-remote, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 > -M 1 -s 20 -zZq --thp 1 --no-data_rand_walk" > perf: bench/numa.c:306: bind_to_memnode: Assertion `!(ret)' failed. > > Got hung > > After Fix: > # ./perf bench numa all > # Running numa/mem benchmark... > > ... 
> # Running RAM-bw-remote, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 1 -s 20 -zZq --thp 1 --no-data_rand_walk"
>
> # NOTE: ignoring bind NODEs starting at NODE#1
> # NOTE: 0 tasks mem-bound, 1 tasks unbound
> 20.017 secs slowest (max) thread-runtime
> 20.000 secs fastest (min) thread-runtime
> 20.006 secs average thread-runtime
> 0.043 % difference between max/avg runtime
> 413.794 GB data processed, per thread
> 413.794 GB data processed, total
> 0.048 nsecs/byte/thread runtime
> 20.672 GB/sec/thread speed
> 20.672 GB/sec total speed
>
> Changes in v2:
> Fixed review comments for function names and alloc failure handling
>
> Changes in v3:
> Coding style fixes.
>
> Satheesh Rajendran (2):
>   perf/bench/numa: Add functions to detect sparse numa nodes
>   perf/bench/numa: Handle discontiguous/sparse numa nodes
>
>  tools/perf/bench/numa.c | 61 +++--
>  1 file changed, 54 insertions(+), 7 deletions(-)
Re: [PATCH 1/2] perf/bench/numa: Add functions to detect sparse numa nodes
Thanks Arnaldo for the detailed review :-)
Will address them and send across v2.

On Thu, 2017-08-10 at 16:22 -0300, Arnaldo Carvalho de Melo wrote:
> Em Thu, Aug 10, 2017 at 12:58:49PM +0530, sathn...@linux.vnet.ibm.com escreveu:
> > From: Satheesh Rajendran <sathn...@linux.vnet.ibm.com>
> >
> > Added functions 1) to get a count of all nodes that are exposed to
> > userspace. These nodes could be memoryless cpu nodes or cpuless memory
> > nodes, 2) to check a given node is present and 3) to check a given
> > node has cpus.
> >
> > This information can be used to handle sparse/discontiguous nodes.
> >
> > Reviewed-by: Srikar Dronamraju <sri...@linux.vnet.ibm.com>
> > Signed-off-by: Satheesh Rajendran <sathn...@linux.vnet.ibm.com>
> > Signed-off-by: Balamuruhan S <bal...@linux.vnet.ibm.com>
> > ---
> >  tools/perf/bench/numa.c | 35 +++
> >  1 file changed, 35 insertions(+)
> >
> > diff --git a/tools/perf/bench/numa.c b/tools/perf/bench/numa.c
> > index 469d65b..efd7595 100644
> > --- a/tools/perf/bench/numa.c
> > +++ b/tools/perf/bench/numa.c
> > @@ -215,6 +215,41 @@ static const char * const numa_usage[] = {
> >  	NULL
> >  };
> >
> > +static int nr_numa_nodes(void)
> > +{
> > +	int node = 0, i;
> > +
> > +	for (i = 0; i < g->p.nr_nodes; i++) {
> > +		if (numa_bitmask_isbitset(numa_nodes_ptr, i))
> > +			node++;
> > +	}
> > +	return node;
>
> Humm, can you rename 'node' to 'nr_nodes'?

Sure, will change it.

> > +}
> > +
> > +static bool is_node_present(int node)
> > +{
> > +	if (numa_bitmask_isbitset(numa_nodes_ptr, node))
> > +		return true;
> > +	else
> > +		return false;
> > +}
>
> Why four lines instead of just one? Isn't this equivalent:
>
> static bool is_node_present(int node)
> {
> 	return numa_bitmask_isbitset(numa_nodes_ptr, node);
> }
>
> ?

Sure.

> > +
> > +static bool is_node_hascpu(int node)
>
> Can you rename this function, the name is confusing :-\
>
> Based on the documentation for this function, that you left only in the
> changelog (please put it just before the function, as a comment), I think
> it should be named node_has_cpus()?

Makes sense, will change it.

> > +{
> > +	struct bitmask *cpu;
> > +	unsigned int i;
> > +
> > +	cpu = numa_allocate_cpumask();
>
> Please put the line with the initialization together with the
> declaration, making it:
>
> 	struct bitmask *cpu = numa_allocate_cpumask();
>
> Also, this is an "alloc" function, I bet it can fail? If so, check it and
> return something useful if it fails, which probably will be difficult
> since this function returns bool?

Sure, will check and return false as failsafe.

> > +	if (numa_node_to_cpus(node, cpu) == 0) {
> > +		for (i = 0; i < cpu->size; i++) {
> > +			if (numa_bitmask_isbitset(cpu, i))
> > +				return true;
> > +		}
> > +	} else
> > +		return false; /* lets fall back to nocpus safely */
> > +	return false;
> > +}
> > +
> >  static cpu_set_t bind_to_cpu(int target_cpu)
> >  {
> >  	cpu_set_t orig_mask, mask;