Re: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% regression

2019-01-30 Thread Waiman Long
On 01/30/2019 07:06 PM, Tim Chen wrote:
> On 12/27/18 6:55 PM, Waiman Long wrote:
>> On 12/27/2018 08:31 PM, Wang, Kemi wrote:
>>> Hi, Waiman
>>> Did you post that patch? Let's see if it helps.
>> I did post the patch a while ago. I will need to rebase it to a new
>> baseline. Will do that in a week or 2.
>>
>> -Longman
>>
> Waiman,
>
> In a large-scale OLTP benchmark, we are also seeing a 1% loss in performance
> from patch 9bc8039e71. We'd be interested in trying out your patch to
> restore some optimistic spinning on a reader-held sem.
>
> Thanks.
>
> Tim

I am reworking the rwsem code, and the reader optimistic-spinning code isn't
ready on top of the reworked code yet. I need a bit more time to work on that
and will certainly keep you posted once it is ready for review.

Cheers,
Longman



Re: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% regression

2019-01-30 Thread Tim Chen
On 12/27/18 6:55 PM, Waiman Long wrote:
> On 12/27/2018 08:31 PM, Wang, Kemi wrote:
>> Hi, Waiman
>> Did you post that patch? Let's see if it helps.
> 
> I did post the patch a while ago. I will need to rebase it to a new
> baseline. Will do that in a week or 2.
> 
> -Longman
> 

Waiman,

In a large-scale OLTP benchmark, we are also seeing a 1% loss in performance
from patch 9bc8039e71. We'd be interested in trying out your patch to
restore some optimistic spinning on a reader-held sem.

Thanks.

Tim


Re: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% regression

2018-12-27 Thread kemi



On 2018/12/28 10:55 AM, Waiman Long wrote:
> On 12/27/2018 08:31 PM, Wang, Kemi wrote:
>> Hi, Waiman
>> Did you post that patch? Let's see if it helps.
> 
> I did post the patch a while ago. I will need to rebase it to a new
> baseline. Will do that in a week or 2.
> 

OK. I will take a look at it and try to rebase it on Shi's patch to see if
the regression can be fixed.
May I know where I can get that patch? I didn't find it in my inbox. Thanks.

> -Longman
> 
>>
>> -Original Message-
>> From: LKP [mailto:lkp-boun...@lists.01.org] On Behalf Of Waiman Long
>> Sent: Tuesday, November 6, 2018 6:40 AM
>> To: Linus Torvalds ; vba...@suse.cz; 
>> Davidlohr Bueso 
>> Cc: yang@linux.alibaba.com; Linux Kernel Mailing List 
>> ; Matthew Wilcox ; 
>> mho...@kernel.org; Colin King ; Andrew Morton 
>> ; lduf...@linux.vnet.ibm.com; l...@01.org; 
>> kirill.shute...@linux.intel.com
>> Subject: Re: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% 
>> regression
>>
>> On 11/05/2018 05:14 PM, Linus Torvalds wrote:
>>> On Mon, Nov 5, 2018 at 12:12 PM Vlastimil Babka  wrote:
>>>> I didn't spot an obvious mistake in the patch itself, so it looks
>>>> like some bad interaction between scheduler and the mmap downgrade?
>>> I'm thinking it's RWSEM_SPIN_ON_OWNER that ends up being confused by
>>> the downgrade.
>>>
>>> It looks like the benchmark used to be basically CPU-bound, at about
>>> 800% CPU, and now it's somewhere in the 200% CPU region:
>>>
>>>   will-it-scale.time.percent_of_cpu_this_job_got
>>>
>>>   800 +-+---+
>>>   |.+.+.+.+.+.+.+.  .+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+..+.+.+.+. .+.+.+.|
>>>   700 +-+ +.+   |
>>>   | |
>>>   600 +-+   |
>>>   | |
>>>   500 +-+   |
>>>   | |
>>>   400 +-+   |
>>>   | |
>>>   300 +-+   |
>>>   | |
>>>   200 O-O O O O OO  |
>>>   |   O O O  O O O O   O O O O O O O O O O O|
>>>   100 +-+---+
>>>
>>> which sounds like the downgrade really messes with the "spin waiting
>>> for lock" logic.
>>>
>>> I'm thinking it's the "wake up waiter" logic that has some bad
>>> interaction with spinning, and breaks that whole optimization.
>>>
>>> Adding Waiman and Davidlohr to the participants, because they seem to
>>> be the obvious experts in this area.
>>>
>>> Linus
>> Optimistic spinning on rwsem is done only by writers spinning on a
>> writer-owned rwsem. If a write-lock is downgraded to a read-lock, all
>> the spinning waiters will quit. That may explain the drop in CPU
>> utilization. I do have an old patch that enables a certain amount of
>> reader spinning, which may help the situation. I can rebase that and send
>> it out for review if people have interest.
>>
>> Cheers,
>> Longman
>>
>>
>> ___
>> LKP mailing list
>> l...@lists.01.org
>> https://lists.01.org/mailman/listinfo/lkp
> 
> 


Re: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% regression

2018-12-27 Thread Waiman Long
On 12/27/2018 08:31 PM, Wang, Kemi wrote:
> Hi, Waiman
> Did you post that patch? Let's see if it helps.

I did post the patch a while ago. I will need to rebase it to a new
baseline. Will do that in a week or 2.

-Longman

>
> -Original Message-
> From: LKP [mailto:lkp-boun...@lists.01.org] On Behalf Of Waiman Long
> Sent: Tuesday, November 6, 2018 6:40 AM
> To: Linus Torvalds ; vba...@suse.cz; Davidlohr 
> Bueso 
> Cc: yang@linux.alibaba.com; Linux Kernel Mailing List 
> ; Matthew Wilcox ; 
> mho...@kernel.org; Colin King ; Andrew Morton 
> ; lduf...@linux.vnet.ibm.com; l...@01.org; 
> kirill.shute...@linux.intel.com
> Subject: Re: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% 
> regression
>
> On 11/05/2018 05:14 PM, Linus Torvalds wrote:
>> On Mon, Nov 5, 2018 at 12:12 PM Vlastimil Babka  wrote:
>>> I didn't spot an obvious mistake in the patch itself, so it looks
>>> like some bad interaction between scheduler and the mmap downgrade?
>> I'm thinking it's RWSEM_SPIN_ON_OWNER that ends up being confused by
>> the downgrade.
>>
>> It looks like the benchmark used to be basically CPU-bound, at about
>> 800% CPU, and now it's somewhere in the 200% CPU region:
>>
>>   will-it-scale.time.percent_of_cpu_this_job_got
>>
>>   800 +-+---+
>>   |.+.+.+.+.+.+.+.  .+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+..+.+.+.+. .+.+.+.|
>>   700 +-+ +.+   |
>>   | |
>>   600 +-+   |
>>   | |
>>   500 +-+   |
>>   | |
>>   400 +-+   |
>>   | |
>>   300 +-+   |
>>   | |
>>   200 O-O O O O OO  |
>>   |   O O O  O O O O   O O O O O O O O O O O|
>>   100 +-+---+
>>
>> which sounds like the downgrade really messes with the "spin waiting
>> for lock" logic.
>>
>> I'm thinking it's the "wake up waiter" logic that has some bad
>> interaction with spinning, and breaks that whole optimization.
>>
>> Adding Waiman and Davidlohr to the participants, because they seem to
>> be the obvious experts in this area.
>>
>> Linus
> Optimistic spinning on rwsem is done only by writers spinning on a
> writer-owned rwsem. If a write-lock is downgraded to a read-lock, all
> the spinning waiters will quit. That may explain the drop in CPU
> utilization. I do have an old patch that enables a certain amount of
> reader spinning, which may help the situation. I can rebase that and send
> it out for review if people have interest.
>
> Cheers,
> Longman
>
>
> ___
> LKP mailing list
> l...@lists.01.org
> https://lists.01.org/mailman/listinfo/lkp




RE: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% regression

2018-12-27 Thread Wang, Kemi
Hi, Waiman
   Did you post that patch? Let's see if it helps.

-Original Message-
From: LKP [mailto:lkp-boun...@lists.01.org] On Behalf Of Waiman Long
Sent: Tuesday, November 6, 2018 6:40 AM
To: Linus Torvalds ; vba...@suse.cz; Davidlohr 
Bueso 
Cc: yang@linux.alibaba.com; Linux Kernel Mailing List 
; Matthew Wilcox ; 
mho...@kernel.org; Colin King ; Andrew Morton 
; lduf...@linux.vnet.ibm.com; l...@01.org; 
kirill.shute...@linux.intel.com
Subject: Re: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% 
regression

On 11/05/2018 05:14 PM, Linus Torvalds wrote:
> On Mon, Nov 5, 2018 at 12:12 PM Vlastimil Babka  wrote:
>> I didn't spot an obvious mistake in the patch itself, so it looks
>> like some bad interaction between scheduler and the mmap downgrade?
> I'm thinking it's RWSEM_SPIN_ON_OWNER that ends up being confused by
> the downgrade.
>
> It looks like the benchmark used to be basically CPU-bound, at about
> 800% CPU, and now it's somewhere in the 200% CPU region:
>
>   will-it-scale.time.percent_of_cpu_this_job_got
>
>   800 +-+---+
>   |.+.+.+.+.+.+.+.  .+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+..+.+.+.+. .+.+.+.|
>   700 +-+ +.+   |
>   | |
>   600 +-+   |
>   | |
>   500 +-+   |
>   | |
>   400 +-+   |
>   | |
>   300 +-+   |
>   | |
>   200 O-O O O O OO  |
>   |   O O O  O O O O   O O O O O O O O O O O|
>   100 +-+---+
>
> which sounds like the downgrade really messes with the "spin waiting
> for lock" logic.
>
> I'm thinking it's the "wake up waiter" logic that has some bad
> interaction with spinning, and breaks that whole optimization.
>
> Adding Waiman and Davidlohr to the participants, because they seem to
> be the obvious experts in this area.
>
> Linus

Optimistic spinning on rwsem is done only by writers spinning on a
writer-owned rwsem. If a write-lock is downgraded to a read-lock, all
the spinning waiters will quit. That may explain the drop in CPU
utilization. I do have an old patch that enables a certain amount of
reader spinning, which may help the situation. I can rebase that and send
it out for review if people have interest.

Cheers,
Longman


___
LKP mailing list
l...@lists.01.org
https://lists.01.org/mailman/listinfo/lkp


Re: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% regression

2018-11-05 Thread Waiman Long
On 11/05/2018 05:14 PM, Linus Torvalds wrote:
> On Mon, Nov 5, 2018 at 12:12 PM Vlastimil Babka  wrote:
>> I didn't spot an obvious mistake in the patch itself, so it looks
>> like some bad interaction between scheduler and the mmap downgrade?
> I'm thinking it's RWSEM_SPIN_ON_OWNER that ends up being confused by
> the downgrade.
>
> It looks like the benchmark used to be basically CPU-bound, at about
> 800% CPU, and now it's somewhere in the 200% CPU region:
>
>   will-it-scale.time.percent_of_cpu_this_job_got
>
>   800 +-+---+
>   |.+.+.+.+.+.+.+.  .+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+..+.+.+.+. .+.+.+.|
>   700 +-+ +.+   |
>   | |
>   600 +-+   |
>   | |
>   500 +-+   |
>   | |
>   400 +-+   |
>   | |
>   300 +-+   |
>   | |
>   200 O-O O O O OO  |
>   |   O O O  O O O O   O O O O O O O O O O O|
>   100 +-+---+
>
> which sounds like the downgrade really messes with the "spin waiting
> for lock" logic.
>
> I'm thinking it's the "wake up waiter" logic that has some bad
> interaction with spinning, and breaks that whole optimization.
>
> Adding Waiman and Davidlohr to the participants, because they seem to
> be the obvious experts in this area.
>
> Linus

Optimistic spinning on rwsem is done only by writers spinning on a
writer-owned rwsem. If a write-lock is downgraded to a read-lock, all
the spinning waiters will quit. That may explain the drop in CPU
utilization. I do have an old patch that enables a certain amount of
reader spinning, which may help the situation. I can rebase that and send
it out for review if people have interest.

Cheers,
Longman
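
A rough userspace sketch of the heuristic described above (this is not the
kernel's rwsem code; the lock_owner variable, spin_while_writer_owned(), and
the thread functions are made up for illustration): a would-be writer keeps
spinning only while the lock stays owned by the same writer, and bails out to
sleep as soon as the lock stops being writer-owned, which is the effect of a
write-to-read downgrade.

/* gcc -O2 -pthread spin_sketch.c -o spin_sketch */
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static _Atomic long lock_owner;          /* nonzero: writer-owned, 0: not writer-owned */

/* Keep spinning only while the same writer owns the lock; 0 means "stop spinning". */
static int spin_while_writer_owned(void)
{
    long w = atomic_load(&lock_owner);

    while (w && atomic_load(&lock_owner) == w)
        sched_yield();                   /* stand-in for cpu_relax() */

    /* The writer released or downgraded the lock: nothing left worth spinning on. */
    return 0;
}

static void *lock_holder(void *arg)
{
    (void)arg;
    atomic_store(&lock_owner, 1);        /* "down_write()": now writer-owned */
    usleep(100 * 1000);
    atomic_store(&lock_owner, 0);        /* "downgrade_write()": no longer writer-owned */
    usleep(100 * 1000);                  /* still held, but only for read */
    return NULL;
}

static void *waiting_writer(void *arg)
{
    (void)arg;
    if (!spin_while_writer_owned())
        puts("spinner quit: lock no longer writer-owned, going to sleep in the wait queue");
    return NULL;
}

int main(void)
{
    pthread_t holder, waiter;

    pthread_create(&holder, NULL, lock_holder, NULL);
    usleep(10 * 1000);                   /* let the holder take the "lock" first */
    pthread_create(&waiter, NULL, waiting_writer, NULL);
    pthread_join(holder, NULL);
    pthread_join(waiter, NULL);
    return 0;
}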




Re: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% regression

2018-11-05 Thread Linus Torvalds
On Mon, Nov 5, 2018 at 12:12 PM Vlastimil Babka  wrote:
>
> I didn't spot an obvious mistake in the patch itself, so it looks
> like some bad interaction between scheduler and the mmap downgrade?

I'm thinking it's RWSEM_SPIN_ON_OWNER that ends up being confused by
the downgrade.

It looks like the benchmark used to be basically CPU-bound, at about
800% CPU, and now it's somewhere in the 200% CPU region:

  will-it-scale.time.percent_of_cpu_this_job_got

  800 +-+---+
  |.+.+.+.+.+.+.+.  .+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+.+..+.+.+.+. .+.+.+.|
  700 +-+ +.+   |
  | |
  600 +-+   |
  | |
  500 +-+   |
  | |
  400 +-+   |
  | |
  300 +-+   |
  | |
  200 O-O O O O OO  |
  |   O O O  O O O O   O O O O O O O O O O O|
  100 +-+---+

which sounds like the downgrade really messes with the "spin waiting
for lock" logic.

I'm thinking it's the "wake up waiter" logic that has some bad
interaction with spinning, and breaks that whole optimization.

Adding Waiman and Davidlohr to the participants, because they seem to
be the obvious experts in this area.

Linus


Re: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% regression

2018-11-05 Thread Yang Shi




On 11/5/18 10:35 AM, Linus Torvalds wrote:
> On Mon, Nov 5, 2018 at 10:28 AM Yang Shi  wrote:
>> Actually, the commit is mainly for optimizing the long stall time caused
>> by holding mmap_sem for write when unmapping or shrinking a large mapping.
>> It downgrades the write mmap_sem to read when zapping pages. So it looks
>> like the downgrade incurs more context switches. This is kind of expected.
>>
>> However, the test just shrinks the mapping by one normal 4K page. It
>> sounds like the overhead of the context switches outpaces the gain in
>> this case, at first glance.
>
> I'm not seeing why there should be a context switch in the first place.
>
> Even if you have lots of concurrent brk() users, they should all block
> exactly the same way as before (a write lock blocks against a write
> lock, but it *also* blocks against a downgraded read lock).

Yes, that is true. The brk() users will not get woken up. What I can think
of for now is that there might be other helper processes and/or kernel
threads waiting for read mmap_sem. They might get woken up by the
downgrade.

But I also saw a huge increase in CPU idle time and sched_goidle events.
I have no clue yet why idle goes up.

20610709 ± 15%   +2376.0%  5.103e+08 ± 34%  cpuidle.C1.time
28753819 ± 39%   +1054.5%  3.319e+08 ± 49%  cpuidle.C3.time

175049 ± 72%+840.7%1646720 ± 72%  sched_debug.cpu.sched_goidle.stddev

Thanks,
Yang

> So no, I don't want just some limit to hide this problem for that
> particular test. There's something else going on.
>
>   Linus




Re: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% regression

2018-11-05 Thread Vlastimil Babka
On 11/5/18 6:50 PM, Linus Torvalds wrote:
> On Sun, Nov 4, 2018 at 9:08 PM kernel test robot  
> wrote:
>>
>> FYI, we noticed a -64.1% regression of will-it-scale.per_thread_ops
>> due to commit 9bc8039e715d ("mm: brk: downgrade mmap_sem to read when
>> shrinking")
> 
> Ugh. That looks pretty bad.
> 
>> in testcase: will-it-scale
>> on test machine: 8 threads Ivy Bridge with 16G memory
>> with following parameters:
>>
>> nr_task: 100%
>> mode: thread
>> test: brk1
>> ucode: 0x20
>> cpufreq_governor: performance
> 
> The reason seems to be way more scheduler time due to lots more
> context switches:
> 
>>   34925294 ± 18%+270.3%  1.293e+08 ±  4%  
>> will-it-scale.time.voluntary_context_switches

And what about this:

  0.83 ± 27% +25.9   26.75 ± 11%  
perf-profile.calltrace.cycles-pp.intel_idle.cpuidle_enter_state.do_idle.cpu_startup_entry.start_secondary
  1.09 ± 32% +30.9   31.97 ± 10%  
perf-profile.calltrace.cycles-pp.cpuidle_enter_state.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64
  1.62 ± 36% +44.4   46.01 ±  9%  
perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64
  1.63 ± 36% +44.5   46.18 ±  9%  
perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.secondary_startup_64
  1.63 ± 36% +44.6   46.21 ±  9%  
perf-profile.calltrace.cycles-pp.start_secondary.secondary_startup_64
  1.73 ± 29% +51.1   52.86 ±  2%  
perf-profile.calltrace.cycles-pp.secondary_startup_64

And the graphs showing less user/kernel time and less 
"percent_of_cpu_this_job_got"...

I didn't spot an obvious mistake in the patch itself, so it looks
like some bad interaction between scheduler and the mmap downgrade?
 
> Yang Shi, would you mind taking a look at what's going on?
> 
>   Linus
> 



Re: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% regression

2018-11-05 Thread Linus Torvalds
On Mon, Nov 5, 2018 at 10:28 AM Yang Shi  wrote:
>
> Actually, the commit is mainly for optimizing the long stall time caused
> by holding mmap_sem for write when unmapping or shrinking a large mapping.
> It downgrades the write mmap_sem to read when zapping pages. So it looks
> like the downgrade incurs more context switches. This is kind of expected.
>
> However, the test just shrinks the mapping by one normal 4K page. It
> sounds like the overhead of the context switches outpaces the gain in
> this case, at first glance.

I'm not seeing why there should be a context switch in the first place.

Even if you have lots of concurrent brk() users, they should all block
exactly the same way as before (a write lock blocks against a write
lock, but it *also* blocks against a downgraded read lock).

So no, I don't want just some limit to hide this problem for that
particular test. There's something else going on.

 Linus
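
A small userspace illustration of that point, using a POSIX rwlock as a
stand-in for the kernel rwsem (an assumption made purely for illustration;
the file name and function names are invented): a lock held for read still
blocks any writer, so a downgraded holder does not let the waiting brk()
writers through any earlier than an undowngraded one would.

/* gcc -O2 -pthread rwlock_block.c -o rwlock_block */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;

static void *writer(void *arg)
{
    (void)arg;
    puts("writer: waiting for the lock (a read holder still blocks us)");
    pthread_rwlock_wrlock(&lock);        /* blocks until the reader drops the lock */
    puts("writer: got the write lock");
    pthread_rwlock_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t w;

    pthread_rwlock_rdlock(&lock);        /* like mmap_sem after a write-to-read downgrade */
    pthread_create(&w, NULL, writer, NULL);
    sleep(1);                            /* the writer stays stuck behind the read lock */
    puts("reader: dropping the read lock");
    pthread_rwlock_unlock(&lock);
    pthread_join(w, NULL);
    return 0;
}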


Re: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% regression

2018-11-05 Thread Yang Shi




On 11/5/18 9:50 AM, Linus Torvalds wrote:
> On Sun, Nov 4, 2018 at 9:08 PM kernel test robot  wrote:
>> FYI, we noticed a -64.1% regression of will-it-scale.per_thread_ops
>> due to commit 9bc8039e715d ("mm: brk: downgrade mmap_sem to read when
>> shrinking")
>
> Ugh. That looks pretty bad.
>
>> in testcase: will-it-scale
>> on test machine: 8 threads Ivy Bridge with 16G memory
>> with following parameters:
>>
>>  nr_task: 100%
>>  mode: thread
>>  test: brk1
>>  ucode: 0x20
>>  cpufreq_governor: performance
>
> The reason seems to be way more scheduler time due to lots more
> context switches:
>
>>    34925294 ± 18%+270.3%  1.293e+08 ±  4%  
>> will-it-scale.time.voluntary_context_switches
>
> Yang Shi, would you mind taking a look at what's going on?

No problem.

Actually, the commit is mainly for optimizing the long stall time caused
by holding mmap_sem for write when unmapping or shrinking a large mapping.
It downgrades the write mmap_sem to read when zapping pages. So it looks
like the downgrade incurs more context switches. This is kind of expected.

However, the test just shrinks the mapping by one normal 4K page. It
sounds like the overhead of the context switches outpaces the gain in
this case, at first glance.

Since the optimization makes more sense for large mappings, how about
restoring the mapping size limit, e.g. just downgrade mmap_sem for >= 1G
(PUD_SIZE) mappings?

Thanks,
Yang

>    Linus
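
For reference, the locking pattern that commit 9bc8039e715d introduces looks
roughly like the sketch below. This is a simplified approximation, not the
literal mm/mmap.c code; shrink_mapping(), detach_vmas(), and zap_freed_pages()
are hypothetical stand-ins for the real brk()/munmap() shrink path. The point
is that the expensive page zapping now runs with mmap_sem held for read
instead of write.

#include <linux/mm_types.h>
#include <linux/rwsem.h>

/* Hypothetical helpers standing in for the real VMA-unlink and zap work. */
static void detach_vmas(struct mm_struct *mm, unsigned long start, unsigned long len);
static void zap_freed_pages(struct mm_struct *mm, unsigned long start, unsigned long len);

/* Simplified sketch of the shrink path after the commit (not the real code). */
static int shrink_mapping(struct mm_struct *mm, unsigned long start, unsigned long len)
{
	down_write(&mm->mmap_sem);	/* VMA tree changes still need the write lock */
	detach_vmas(mm, start, len);

	downgrade_write(&mm->mmap_sem);	/* write -> read; other readers may now run */

	zap_freed_pages(mm, start, len);/* the long page-zapping work, under read lock */
	up_read(&mm->mmap_sem);		/* note: up_read(), not up_write() */
	return 0;
}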




Re: [LKP] [mm] 9bc8039e71: will-it-scale.per_thread_ops -64.1% regression

2018-11-05 Thread Linus Torvalds
On Sun, Nov 4, 2018 at 9:08 PM kernel test robot  wrote:
>
> FYI, we noticed a -64.1% regression of will-it-scale.per_thread_ops
> due to commit 9bc8039e715d ("mm: brk: downgrade mmap_sem to read when
> shrinking")

Ugh. That looks pretty bad.

> in testcase: will-it-scale
> on test machine: 8 threads Ivy Bridge with 16G memory
> with following parameters:
>
> nr_task: 100%
> mode: thread
> test: brk1
> ucode: 0x20
> cpufreq_governor: performance

The reason seems to be way more scheduler time due to lots more
context switches:

>   34925294 ± 18%+270.3%  1.293e+08 ±  4%  
> will-it-scale.time.voluntary_context_switches

Yang Shi, would you mind taking a look at what's going on?

  Linus
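
For context, the will-it-scale brk1 workload boils down to something like the
sketch below (a rough approximation, not the benchmark's actual source;
NTHREADS, LOOPS, and the worker structure are illustrative): each thread grows
and shrinks the shared program break by one page in a tight loop, so every
shrink goes through the mmap_sem downgrade path added by the commit.

/* gcc -O2 -pthread brk1_sketch.c -o brk1_sketch */
#define _DEFAULT_SOURCE                  /* for brk()/sbrk() declarations */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NTHREADS 8
#define LOOPS    1000000

static void *worker(void *arg)
{
    unsigned long *ops = arg;
    char *base = sbrk(0);                /* current program break */

    for (int i = 0; i < LOOPS; i++) {
        brk(base + 4096);                /* grow the heap by one page */
        brk(base);                       /* shrink it back: the downgraded path */
        *ops += 2;
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    unsigned long ops[NTHREADS] = { 0 }, total = 0;

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, &ops[i]);
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(tid[i], NULL);
        total += ops[i];
    }
    printf("total brk() ops: %lu\n", total);
    return 0;
}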

