Re: [x86] d55564cfc2: will-it-scale.per_thread_ops -5.8% regression

Linus Torvalds Thu, 07 Jan 2021 09:45:01 -0800

On Thu, Jan 7, 2021 at 5:32 AM kernel test robot <oliver.s...@intel.com> wrote:
>
> FYI, we noticed a -5.8% regression of will-it-scale.per_thread_ops due to 
> commit:


Ok, that's noticeable.

And:

> commit: d55564cfc222326e944893eff0c4118353e349ec ("x86: Make __put_user() 
> generate an out-of-line call")

Yeah, that wasn't supposed to cause any performance regressions. No
core code should use __put_user() so much.

But:

> | testcase: change | will-it-scale: will-it-scale.per_process_ops -7.3% 
> regression             |
> | test machine     | 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz 
> with 192G memory |
> | test parameters  | cpufreq_governor=performance                             
>                  |
> |                  | mode=process                                             
>                  |
> |                  | nr_task=100%                                             
>                  |
> |                  | test=poll2                                               
>                  |
> |                  | ucode=0x16                                               
>                  |

Ok, it's poll(), and it's definitely the __put_user() there:

>       0.00            +1.8        1.77 ą  3%  
> perf-profile.children.cycles-pp.__put_user_nocheck_2
>       0.00            +1.6        1.64 ą  3%  
> perf-profile.self.cycles-pp.__put_user_nocheck_2

And in fact, it's that final "write back the 16-bit revents field" at the end.

Which must have sucked before too, because it used to do a "stac/clac"
for every word - but now it does it out of line.

The fix is to convert that loop to use "unsafe_put_user()" with the
necessary accoutrements around it, and that should speed things up
quite nicely. The (double) loop itself is actually just 14
instructions, it's ridiculous how bad the code used to be, and how
much better it is with the nice unsafe_put_user(). The whole double
loop ends up being just

        lea    0x68(%rsp),%rsi
        mov    %rcx,%rax
  1:    mov    0x8(%rsi),%ecx
        lea    0xc(%rsi),%rdx
        test   %ecx,%ecx
        je     3f
        lea    (%rax,%rcx,8),%rdi
  2:    movzwl 0x6(%rdx),%ecx
        mov    %cx,0x6(%rax)
        add    $0x8,%rax
        add    $0x8,%rdx
        cmp    %rdi,%rax
        jne    2b
  3:    mov    (%rsi),%rsi
        test   %rsi,%rsi
        jne    1b

with the attached patch.

Before, it would do the whole CLAC/STAC dance inside that loop for
every entry (and with that commit d55564cfc22 it would be a function
call, of course).

Can you verify that this fixes the regression (and in fact I'd expect
it to improve that test-case)?

NOTE! The patch is entirely untested. I verified that the code
generation now looks sane, and it all looks ObviouslyCorrect(tm) to
me, but mistakes happen and maybe I missed some detail..

               Linus

patch
Description: Binary data

Re: [x86] d55564cfc2: will-it-scale.per_thread_ops -5.8% regression

Reply via email to