Re: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% regression
On 01/30/2019 07:06 PM, Tim Chen wrote:
> On 12/27/18 6:55 PM, Waiman Long wrote:
>> On 12/27/2018 08:31 PM, Wang, Kemi wrote:
>>> Hi, Waiman
>>> Did you post that patch? Let's see if it helps.
>>
>> I did post the patch a while ago. I will need to rebase it to a new
>> baseline. Will do that in a week or 2.
>>
>> -Longman
>
> Waiman,
>
> In a large scale OLTP benchmark, we are also seeing a 1% loss in performance
> from patch 9bc8039e71. We'll be interested to try out your patch to
> restore some optimistic spinning on reader-held sem.
>
> Thanks.
>
> Tim

I am reworking the rwsem code, and the reader-optimistic-spinning code isn't ready with the reworked code yet. I need a bit more time to work on that and will certainly keep you posted once it is ready for review.

Cheers,
Longman
Re: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% regression
On 12/27/18 6:55 PM, Waiman Long wrote:
> On 12/27/2018 08:31 PM, Wang, Kemi wrote:
>> Hi, Waiman
>> Did you post that patch? Let's see if it helps.
>
> I did post the patch a while ago. I will need to rebase it to a new
> baseline. Will do that in a week or 2.
>
> -Longman

Waiman,

In a large scale OLTP benchmark, we are also seeing a 1% loss in performance from patch 9bc8039e71. We'll be interested to try out your patch to restore some optimistic spinning on reader-held sem.

Thanks.

Tim
Re: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% regression
On 2018/12/28 10:55 AM, Waiman Long wrote:
> On 12/27/2018 08:31 PM, Wang, Kemi wrote:
>> Hi, Waiman
>> Did you post that patch? Let's see if it helps.
>
> I did post the patch a while ago. I will need to rebase it to a new
> baseline. Will do that in a week or 2.
>

OK. I will take a look at it and try to rebase it on Shi's patch to see if the regression can be fixed. May I know where I can get that patch? I didn't find it in my inbox.

Thanks

> -Longman
>
>>
>> -----Original Message-----
>> From: LKP [mailto:lkp-boun...@lists.01.org] On Behalf Of Waiman Long
>> Sent: Tuesday, November 6, 2018 6:40 AM
>> To: Linus Torvalds ; vba...@suse.cz; Davidlohr Bueso
>> Cc: yang@linux.alibaba.com; Linux Kernel Mailing List ; Matthew Wilcox ;
>> mho...@kernel.org; Colin King ; Andrew Morton ; lduf...@linux.vnet.ibm.com;
>> l...@01.org; kirill.shute...@linux.intel.com
>> Subject: Re: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% regression
>>
>> On 11/05/2018 05:14 PM, Linus Torvalds wrote:
>>> On Mon, Nov 5, 2018 at 12:12 PM Vlastimil Babka wrote:
>>>> I didn't spot an obvious mistake in the patch itself, so it looks
>>>> like some bad interaction between scheduler and the mmap downgrade?
>>> I'm thinking it's RWSEM_SPIN_ON_OWNER that ends up being confused by
>>> the downgrade.
>>>
>>> It looks like the benchmark used to be basically CPU-bound, at about
>>> 800% CPU, and now it's somewhere in the 200% CPU region:
>>>
>>> will-it-scale.time.percent_of_cpu_this_job_got
>>>
>>> 800 +-+---+
>>> |.+.+.+.+.+.+.+. .+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+..+.+.+.+. .+.+.+.|
>>> 700 +-+ +.+ |
>>> | |
>>> 600 +-+ |
>>> | |
>>> 500 +-+ |
>>> | |
>>> 400 +-+ |
>>> | |
>>> 300 +-+ |
>>> | |
>>> 200 O-O O O O OO |
>>> | O O O O O O O O O O O O O O O O O O|
>>> 100 +-+---+
>>>
>>> which sounds like the downgrade really messes with the "spin waiting
>>> for lock" logic.
>>>
>>> I'm thinking it's the "wake up waiter" logic that has some bad
>>> interaction with spinning, and breaks that whole optimization.
>>>
>>> Adding Waiman and Davidlohr to the participants, because they seem to
>>> be the obvious experts in this area.
>>>
>>> Linus
>> Optimistic spinning on rwsem is done only by writers spinning on a
>> writer-owned rwsem. If a write-lock is downgraded to a read-lock, all
>> the spinning waiters will quit. That may explain the drop in cpu
>> utilization. I do have an old patch that enables a certain amount of
>> reader spinning which may help the situation. I can rebase that and send
>> it out for review if people have interest.
>>
>> Cheers,
>> Longman
>>
>> _______________________________________________
>> LKP mailing list
>> l...@lists.01.org
>> https://lists.01.org/mailman/listinfo/lkp
Re: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% regression
On 12/27/2018 08:31 PM, Wang, Kemi wrote:
> Hi, Waiman
> Did you post that patch? Let's see if it helps.

I did post the patch a while ago. I will need to rebase it to a new baseline. Will do that in a week or 2.

-Longman

>
> -----Original Message-----
> From: LKP [mailto:lkp-boun...@lists.01.org] On Behalf Of Waiman Long
> Sent: Tuesday, November 6, 2018 6:40 AM
> To: Linus Torvalds ; vba...@suse.cz; Davidlohr Bueso
> Cc: yang@linux.alibaba.com; Linux Kernel Mailing List ; Matthew Wilcox ;
> mho...@kernel.org; Colin King ; Andrew Morton ; lduf...@linux.vnet.ibm.com;
> l...@01.org; kirill.shute...@linux.intel.com
> Subject: Re: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% regression
>
> On 11/05/2018 05:14 PM, Linus Torvalds wrote:
>> On Mon, Nov 5, 2018 at 12:12 PM Vlastimil Babka wrote:
>>> I didn't spot an obvious mistake in the patch itself, so it looks
>>> like some bad interaction between scheduler and the mmap downgrade?
>> I'm thinking it's RWSEM_SPIN_ON_OWNER that ends up being confused by
>> the downgrade.
>>
>> It looks like the benchmark used to be basically CPU-bound, at about
>> 800% CPU, and now it's somewhere in the 200% CPU region:
>>
>> will-it-scale.time.percent_of_cpu_this_job_got
>>
>> 800 +-+---+
>> |.+.+.+.+.+.+.+. .+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+..+.+.+.+. .+.+.+.|
>> 700 +-+ +.+ |
>> | |
>> 600 +-+ |
>> | |
>> 500 +-+ |
>> | |
>> 400 +-+ |
>> | |
>> 300 +-+ |
>> | |
>> 200 O-O O O O OO |
>> | O O O O O O O O O O O O O O O O O O|
>> 100 +-+---+
>>
>> which sounds like the downgrade really messes with the "spin waiting
>> for lock" logic.
>>
>> I'm thinking it's the "wake up waiter" logic that has some bad
>> interaction with spinning, and breaks that whole optimization.
>>
>> Adding Waiman and Davidlohr to the participants, because they seem to
>> be the obvious experts in this area.
>>
>> Linus
> Optimistic spinning on rwsem is done only by writers spinning on a
> writer-owned rwsem. If a write-lock is downgraded to a read-lock, all
> the spinning waiters will quit. That may explain the drop in cpu
> utilization. I do have an old patch that enables a certain amount of
> reader spinning which may help the situation. I can rebase that and send
> it out for review if people have interest.
>
> Cheers,
> Longman
RE: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% regression
Hi, Waiman

Did you post that patch? Let's see if it helps.

-----Original Message-----
From: LKP [mailto:lkp-boun...@lists.01.org] On Behalf Of Waiman Long
Sent: Tuesday, November 6, 2018 6:40 AM
To: Linus Torvalds ; vba...@suse.cz; Davidlohr Bueso
Cc: yang@linux.alibaba.com; Linux Kernel Mailing List ; Matthew Wilcox ;
mho...@kernel.org; Colin King ; Andrew Morton ; lduf...@linux.vnet.ibm.com;
l...@01.org; kirill.shute...@linux.intel.com
Subject: Re: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% regression

On 11/05/2018 05:14 PM, Linus Torvalds wrote:
> On Mon, Nov 5, 2018 at 12:12 PM Vlastimil Babka wrote:
>> I didn't spot an obvious mistake in the patch itself, so it looks
>> like some bad interaction between scheduler and the mmap downgrade?
> I'm thinking it's RWSEM_SPIN_ON_OWNER that ends up being confused by
> the downgrade.
>
> It looks like the benchmark used to be basically CPU-bound, at about
> 800% CPU, and now it's somewhere in the 200% CPU region:
>
> will-it-scale.time.percent_of_cpu_this_job_got
>
> 800 +-+---+
> |.+.+.+.+.+.+.+. .+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+..+.+.+.+. .+.+.+.|
> 700 +-+ +.+ |
> | |
> 600 +-+ |
> | |
> 500 +-+ |
> | |
> 400 +-+ |
> | |
> 300 +-+ |
> | |
> 200 O-O O O O OO |
> | O O O O O O O O O O O O O O O O O O|
> 100 +-+---+
>
> which sounds like the downgrade really messes with the "spin waiting
> for lock" logic.
>
> I'm thinking it's the "wake up waiter" logic that has some bad
> interaction with spinning, and breaks that whole optimization.
>
> Adding Waiman and Davidlohr to the participants, because they seem to
> be the obvious experts in this area.
>
> Linus

Optimistic spinning on rwsem is done only by writers spinning on a writer-owned rwsem. If a write-lock is downgraded to a read-lock, all the spinning waiters will quit. That may explain the drop in cpu utilization. I do have an old patch that enables a certain amount of reader spinning which may help the situation. I can rebase that and send it out for review if people have interest.

Cheers,
Longman
Re: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% regression
On 11/05/2018 05:14 PM, Linus Torvalds wrote:
> On Mon, Nov 5, 2018 at 12:12 PM Vlastimil Babka wrote:
>> I didn't spot an obvious mistake in the patch itself, so it looks
>> like some bad interaction between scheduler and the mmap downgrade?
> I'm thinking it's RWSEM_SPIN_ON_OWNER that ends up being confused by
> the downgrade.
>
> It looks like the benchmark used to be basically CPU-bound, at about
> 800% CPU, and now it's somewhere in the 200% CPU region:
>
> will-it-scale.time.percent_of_cpu_this_job_got
>
> 800 +-+---+
> |.+.+.+.+.+.+.+. .+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+..+.+.+.+. .+.+.+.|
> 700 +-+ +.+ |
> | |
> 600 +-+ |
> | |
> 500 +-+ |
> | |
> 400 +-+ |
> | |
> 300 +-+ |
> | |
> 200 O-O O O O OO |
> | O O O O O O O O O O O O O O O O O O|
> 100 +-+---+
>
> which sounds like the downgrade really messes with the "spin waiting
> for lock" logic.
>
> I'm thinking it's the "wake up waiter" logic that has some bad
> interaction with spinning, and breaks that whole optimization.
>
> Adding Waiman and Davidlohr to the participants, because they seem to
> be the obvious experts in this area.
>
> Linus

Optimistic spinning on rwsem is done only by writers spinning on a writer-owned rwsem. If a write-lock is downgraded to a read-lock, all the spinning waiters will quit. That may explain the drop in cpu utilization. I do have an old patch that enables a certain amount of reader spinning which may help the situation. I can rebase that and send it out for review if people have interest.

Cheers,
Longman
Re: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% regression
On Mon, Nov 5, 2018 at 12:12 PM Vlastimil Babka wrote:
>
> I didn't spot an obvious mistake in the patch itself, so it looks
> like some bad interaction between scheduler and the mmap downgrade?

I'm thinking it's RWSEM_SPIN_ON_OWNER that ends up being confused by the downgrade.

It looks like the benchmark used to be basically CPU-bound, at about 800% CPU, and now it's somewhere in the 200% CPU region:

will-it-scale.time.percent_of_cpu_this_job_got

800 +-+---+
|.+.+.+.+.+.+.+. .+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+..+.+.+.+. .+.+.+.|
700 +-+ +.+ |
| |
600 +-+ |
| |
500 +-+ |
| |
400 +-+ |
| |
300 +-+ |
| |
200 O-O O O O OO |
| O O O O O O O O O O O O O O O O O O|
100 +-+---+

which sounds like the downgrade really messes with the "spin waiting for lock" logic.

I'm thinking it's the "wake up waiter" logic that has some bad interaction with spinning, and breaks that whole optimization.

Adding Waiman and Davidlohr to the participants, because they seem to be the obvious experts in this area.

Linus
Re: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% regression
On 11/5/18 10:35 AM, Linus Torvalds wrote:
> On Mon, Nov 5, 2018 at 10:28 AM Yang Shi wrote:
>> Actually, the commit is mainly for optimizing the long stall time caused
>> by holding mmap_sem for write when unmapping or shrinking a large mapping.
>> It downgrades the write mmap_sem to read when zapping pages. So, it looks
>> like the downgrade incurs more context switches. This is kind of expected.
>>
>> However, the test looks like it just shrinks the mapping by one normal
>> 4K page. It sounds like the overhead of context switches outpaces the
>> gain in this case at first glance.
>
> I'm not seeing why there should be a context switch in the first place.
>
> Even if you have lots of concurrent brk() users, they should all block
> exactly the same way as before (a write lock blocks against a write
> lock, but it *also* blocks against a downgraded read lock).

Yes, it is true. The brk() users will not get woken up. What I can think of for now is that there might be other helper processes and/or kernel threads waiting for read mmap_sem. They might get woken up by the downgrade.

But I also saw a huge increase in cpu idle time and sched_goidle events. No clue yet for why idle goes up.

20610709 ± 15%  +2376.0%  5.103e+08 ± 34%  cpuidle.C1.time
28753819 ± 39%  +1054.5%  3.319e+08 ± 49%  cpuidle.C3.time
  175049 ± 72%   +840.7%    1646720 ± 72%  sched_debug.cpu.sched_goidle.stddev

Thanks,
Yang

> So no, I don't want just some limit to hide this problem for that
> particular test. There's something else going on.
>
> Linus
Re: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% regression
On 11/5/18 6:50 PM, Linus Torvalds wrote:
> On Sun, Nov 4, 2018 at 9:08 PM kernel test robot wrote:
>>
>> FYI, we noticed a -64.1% regression of will-it-scale.per_thread_ops
>> due to commit 9bc8039e715d ("mm: brk: downgrade mmap_sem to read when
>> shrinking")
>
> Ugh. That looks pretty bad.
>
>> in testcase: will-it-scale
>> on test machine: 8 threads Ivy Bridge with 16G memory
>> with following parameters:
>>
>> nr_task: 100%
>> mode: thread
>> test: brk1
>> ucode: 0x20
>> cpufreq_governor: performance
>
> The reason seems to be way more scheduler time due to lots more
> context switches:
>
>> 34925294 ± 18%  +270.3%  1.293e+08 ± 4%  will-it-scale.time.voluntary_context_switches

And what about this:

0.83 ± 27%  +25.9  26.75 ± 11%  perf-profile.calltrace.cycles-pp.intel_idle.cpuidle_enter_state.do_idle.cpu_startup_entry.start_secondary
1.09 ± 32%  +30.9  31.97 ± 10%  perf-profile.calltrace.cycles-pp.cpuidle_enter_state.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64
1.62 ± 36%  +44.4  46.01 ±  9%  perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64
1.63 ± 36%  +44.5  46.18 ±  9%  perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.secondary_startup_64
1.63 ± 36%  +44.6  46.21 ±  9%  perf-profile.calltrace.cycles-pp.start_secondary.secondary_startup_64
1.73 ± 29%  +51.1  52.86 ±  2%  perf-profile.calltrace.cycles-pp.secondary_startup_64

And the graphs showing less user/kernel time and less "percent_of_cpu_this_job_got"...

I didn't spot an obvious mistake in the patch itself, so it looks like some bad interaction between scheduler and the mmap downgrade?

> Yang Shi, would you mind taking a look at what's going on?
>
> Linus
Re: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% regression
On Mon, Nov 5, 2018 at 10:28 AM Yang Shi wrote:
>
> Actually, the commit is mainly for optimizing the long stall time caused
> by holding mmap_sem for write when unmapping or shrinking a large mapping.
> It downgrades the write mmap_sem to read when zapping pages. So, it looks
> like the downgrade incurs more context switches. This is kind of expected.
>
> However, the test looks like it just shrinks the mapping by one normal
> 4K page. It sounds like the overhead of context switches outpaces the
> gain in this case at first glance.

I'm not seeing why there should be a context switch in the first place.

Even if you have lots of concurrent brk() users, they should all block exactly the same way as before (a write lock blocks against a write lock, but it *also* blocks against a downgraded read lock).

So no, I don't want just some limit to hide this problem for that particular test. There's something else going on.

Linus
Re: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% regression
On 11/5/18 9:50 AM, Linus Torvalds wrote:
> On Sun, Nov 4, 2018 at 9:08 PM kernel test robot wrote:
>>
>> FYI, we noticed a -64.1% regression of will-it-scale.per_thread_ops
>> due to commit 9bc8039e715d ("mm: brk: downgrade mmap_sem to read when
>> shrinking")
>
> Ugh. That looks pretty bad.
>
>> in testcase: will-it-scale
>> on test machine: 8 threads Ivy Bridge with 16G memory
>> with following parameters:
>>
>> nr_task: 100%
>> mode: thread
>> test: brk1
>> ucode: 0x20
>> cpufreq_governor: performance
>
> The reason seems to be way more scheduler time due to lots more
> context switches:
>
>> 34925294 ± 18%  +270.3%  1.293e+08 ± 4%  will-it-scale.time.voluntary_context_switches
>
> Yang Shi, would you mind taking a look at what's going on?

No problem.

Actually, the commit is mainly for optimizing the long stall time caused by holding mmap_sem for write when unmapping or shrinking a large mapping. It downgrades the write mmap_sem to read when zapping pages. So, it looks like the downgrade incurs more context switches. This is kind of expected.

However, the test looks like it just shrinks the mapping by one normal 4K page. It sounds like the overhead of context switches outpaces the gain in this case at first glance.

Since the optimization makes more sense for large mappings, how about restoring the mapping size limit, e.g. just downgrade mmap_sem for >= 1G (PUD_SIZE) mappings?

Thanks,
Yang

> Linus
Re: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% regression
On Sun, Nov 4, 2018 at 9:08 PM kernel test robot wrote:
>
> FYI, we noticed a -64.1% regression of will-it-scale.per_thread_ops
> due to commit 9bc8039e715d ("mm: brk: downgrade mmap_sem to read when
> shrinking")

Ugh. That looks pretty bad.

> in testcase: will-it-scale
> on test machine: 8 threads Ivy Bridge with 16G memory
> with following parameters:
>
> nr_task: 100%
> mode: thread
> test: brk1
> ucode: 0x20
> cpufreq_governor: performance

The reason seems to be way more scheduler time due to lots more context switches:

> 34925294 ± 18%  +270.3%  1.293e+08 ± 4%  will-it-scale.time.voluntary_context_switches

Yang Shi, would you mind taking a look at what's going on?

Linus