On Mon, 7 Dec 2020 at 10:59, Song Bao Hua (Barry Song) <[email protected]> wrote:
>
> > -----Original Message-----
> > From: Vincent Guittot [mailto:[email protected]]
> > Sent: Thursday, December 3, 2020 10:39 PM
> > Subject: Re: [RFC PATCH v2 2/2] scheduler: add scheduler level for clusters
> >
> > On Thu, 3 Dec 2020 at 10:11, Song Bao Hua (Barry Song)
> > <[email protected]> wrote:
> > >
> > > > -----Original Message-----
> > > > From: Vincent Guittot [mailto:[email protected]]
> > > > Sent: Thursday, December 3, 2020 10:04 PM
> > > > Subject: Re: [RFC PATCH v2 2/2] scheduler: add scheduler level for clusters
> > > >
> > > > On Wed, 2 Dec 2020 at 21:58, Song Bao Hua (Barry Song)
> > > > <[email protected]> wrote:
> > > > >
> > > > > > Sorry. Please ignore this. I added some printk here while testing
> > > > > > one numa. Will update you the data in another email.
> > > > >
> > > > > Re-tested in one NUMA node (cpu0-cpu23):
> > > > >
> > > > > g=1
> > > > > Running in threaded mode with 1 groups using 40 file descriptors
> > > > > Each sender will pass 100000 messages of 100 bytes
> > > > > w/o: 7.689 7.485 7.485 7.458 7.524 7.539 7.738 7.693 7.568 7.674=7.5853
> > > > > w/ : 7.516 7.941 7.374 7.963 7.881 7.910 7.420 7.556 7.695 7.441=7.6697
> > > > > w/ but dropped select_idle_cluster:
> > > > > 7.752 7.739 7.739 7.571 7.545 7.685 7.407 7.580 7.605 7.487=7.611
> > > > >
> > > > > g=2
> > > > > Running in threaded mode with 2 groups using 40 file descriptors
> > > > > Each sender will pass 100000 messages of 100 bytes
> > > > > w/o: 10.127 10.119 10.070 10.196 10.057 10.111 10.045 10.164 10.162 9.955=10.1006
> > > > > w/ : 9.694 9.654 9.612 9.649 9.686 9.734 9.607 9.842 9.690 9.710=9.6878
> > > > > w/ but dropped select_idle_cluster:
> > > > > 9.877 10.069 9.951 9.918 9.947 9.790 9.906 9.820 9.863 9.906=9.9047
> > > > >
> > > > > g=3
> > > > > Running in threaded mode with 3 groups using 40 file descriptors
> > > > > Each sender will pass 100000 messages of 100 bytes
> > > > > w/o: 15.885 15.254 15.932 15.647 16.120 15.878 15.857 15.759 15.674 15.721=15.7727
> > > > > w/ : 14.974 14.657 13.969 14.985 14.728 15.665 15.191 14.995 14.946 14.895=14.9005
> > > > > w/ but dropped select_idle_cluster:
> > > > > 15.405 15.177 15.373 15.187 15.450 15.540 15.278 15.628 15.228 15.325=15.3591
> > > > >
> > > > > g=4
> > > > > Running in threaded mode with 4 groups using 40 file descriptors
> > > > > Each sender will pass 100000 messages of 100 bytes
> > > > > w/o: 20.014 21.025 21.119 21.235 19.767 20.971 20.962 20.914 21.090 21.090=20.8187
> > > > > w/ : 20.331 20.608 20.338 20.445 20.456 20.146 20.693 20.797 21.381 20.452=20.5647
> > > > > w/ but dropped select_idle_cluster:
> > > > > 19.814 20.126 20.229 20.350 20.750 20.404 19.957 19.888 20.226 20.562=20.2306
> > > > >
> > > > I assume that you have run this on v5.9 as previous tests.
> > >
> > > Yep
> > >
> > > > The results don't show any real benefit of select_idle_cluster()
> > > > inside a node, whereas this is where we could expect most of the
> > > > benefit. We have to understand why we have such an impact on numa
> > > > tests only.
> > >
> > > There is a 4-5.5% increase for g=2 and g=3.
> >
> > my point was with vs without select_idle_cluster() but still having a
> > cluster domain level
> > In this case, the diff is -0.8% for g=1, +2.2% for g=2, +3% for g=3 and
> > -1.7% for g=4
> >
> > >
> > > Regarding the huge increase in the NUMA case, at the beginning I
> > > suspected we had a wrong llc domain. For example, if cpu0's llc domain
> > > spans cpu0-cpu47, then select_idle_cpu() is running in the wrong range
> > > while it should run in cpu0-cpu23.
> > >
> > > But after printing the llc domain's span, I find it is completely right.
> > > Cpu0's llc span: cpu0-cpu23
> > > Cpu24's llc span: cpu24-cpu47
> >
> > Have you checked that the cluster mask was also correct?
> >
> > >
> > > Maybe I need more trace data to figure out if select_idle_cpu() is running
> > > correctly. For example, maybe I can figure out if it is always returning -1,
> > > or if it returns -1 very often?
> >
> > yes, it could be interesting to check how often select_idle_cpu() returns -1
> >
> > >
> > > Or do you have any idea?
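For reference, the -0.8% / +2.2% / +3% / -1.7% figures above can be reproduced from the averaged times with a small standalone sketch like the one below. It is not part of the patch set; the struct and field names are made up, and taking the "w/ but dropped select_idle_cluster" runs as the baseline is an assumption (lower time is better):

/*
 * Standalone sketch, not from the patch set: derive the quoted deltas
 * ("w/" vs "w/ but dropped select_idle_cluster") from the averages above.
 * Baseline choice (the "dropped" run) is an assumption; lower is better.
 */
#include <stdio.h>

int main(void)
{
	static const struct {
		int groups;
		double with_sic;	/* "w/" average, seconds */
		double without_sic;	/* "w/ but dropped select_idle_cluster" average */
	} runs[] = {
		{ 1,  7.6697,  7.6110 },
		{ 2,  9.6878,  9.9047 },
		{ 3, 14.9005, 15.3591 },
		{ 4, 20.5647, 20.2306 },
	};

	for (unsigned int i = 0; i < sizeof(runs) / sizeof(runs[0]); i++) {
		/* positive == keeping select_idle_cluster() saves time */
		double delta = (runs[i].without_sic - runs[i].with_sic) /
			       runs[i].without_sic * 100.0;
		printf("g=%d: %+.1f%%\n", runs[i].groups, delta);
	}
	return 0;
}

Compiled and run, it prints g=1: -0.8%, g=2: +2.2%, g=3: +3.0%, g=4: -1.7%, matching the numbers quoted above.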
> > tracking migration across node could help to understand too
>
> I set a bootarg mem=4G to do a swapping test before working on the cluster
> scheduler issue, but I forgot to remove the parameter.
>
> The huge increase in the across-numa case can only be reproduced while
> I use this mem=4G cmdline, which means numa1 has no memory.
> After removing the limitation, I can't reproduce the huge increase
> for two NUMAs any more.
Ok. Makes more sense.

>
> Guess select_idle_cluster() somehow works around a scheduler issue
> for numa without memory.
>
> >
> > Vincent
> >
>
> Thanks
> Barry
>
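As a side note, the memoryless-node situation described above (node 1 losing all its RAM when booting with mem=4G) can be double-checked from userspace with something like the sketch below. It is only illustrative and assumes libnuma is installed (build with gcc ... -lnuma):

/*
 * Illustrative only: print per-node memory so a memoryless node (e.g. numa1
 * after booting with mem=4G) should show up with 0 bytes. Assumes libnuma.
 */
#include <stdio.h>
#include <numa.h>

int main(void)
{
	if (numa_available() < 0) {
		fprintf(stderr, "NUMA is not available on this system\n");
		return 1;
	}

	for (int node = 0; node <= numa_max_node(); node++) {
		long long free_bytes = 0;
		long long size = numa_node_size64(node, &free_bytes);

		/* a size of 0 (or an error) should indicate a memoryless node */
		printf("node %d: total %lld MiB, free %lld MiB%s\n",
		       node, size >> 20, free_bytes >> 20,
		       size <= 0 ? "  <-- no memory" : "");
	}
	return 0;
}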

