* Kyle Huey <[email protected]> wrote:

> rr (http://rr-project.org/), a userspace record-and-replay reverse-
> execution debugger, would like to trap and emulate the CPUID instruction.
> This would allow us to a) mask away certain hardware features that rr does
> not support (e.g. RDRAND) and b) enable trace portability across machines
> by providing constant results.
> 
> Newer Intel CPUs (Ivy Bridge and later) can fault when CPUID is executed at
> CPL > 0. Expose this capability to userspace as a new pair of arch_prctls,
> ARCH_GET_CPUID and ARCH_SET_CPUID.
> 
> Since v12:
> Patch 4: x86/syscalls/32: Wire up arch_prctl on x86-32
> - compat_sys_arch_prctl prototype has argument names.

So while I am fine with the feature, I'm still unconvinced about the 
implementation:

1)

As I pointed out before, the arbitrary 'code' argument name is an x86-ism and 
should be changed to 'option', matching the canonical argument name the core 
kernel uses for prctl().

This is still unfixed.
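
For illustration, the rename amounts to something like this (a sketch only; 
the exact prototypes in the patch may differ):

        /* current patch, with the x86-ism: */
        long do_arch_prctl(struct task_struct *task, int code, unsigned long arg2);

        /* suggested, matching the core kernel's prctl() argument naming: */
        long do_arch_prctl(struct task_struct *task, int option, unsigned long arg2);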

2)

As I complained about in my first review, the TIF_NOCPUID flag is too far 
removed from the value that will be written into the MSR.

The result is poor code generation on 64-bit defconfig+CONFIG_PREEMPT=y:

        if (test_tsk_thread_flag(prev_p, TIF_NOCPUID) ^
            test_tsk_thread_flag(next_p, TIF_NOCPUID)) {
                set_cpuid_faulting(test_tsk_thread_flag(next_p, TIF_NOCPUID));
        }

is compiled as:

  476:   49 8b 06                mov    (%r14),%rax
  479:   49 8b 55 00             mov    0x0(%r13),%rdx
  47d:   48 c1 e8 0f             shr    $0xf,%rax
  481:   48 c1 ea 0f             shr    $0xf,%rdx
  485:   83 e2 01                and    $0x1,%edx
  488:   83 e0 01                and    $0x1,%eax
  48b:   38 c2                   cmp    %al,%dl
  48d:   74 10                   je     49f <__switch_to_xtra+0x9f>
  48f:   49 8b 7d 00             mov    0x0(%r13),%rdi
  493:   48 c1 ef 0f             shr    $0xf,%rdi
  497:   83 e7 01                and    $0x1,%edi
  49a:   e8 61 fb ff ff          callq  0 <set_cpuid_faulting>

... the first 7 instructions burden all __switch_to_xtra() users, not just the 
CPUID-faulting users.

The set_cpuid_faulting() call is also unnecessary, and set_cpuid_faulting() 
itself compiles into an obscene sequence:

0000000000000000 <set_cpuid_faulting>:
   0:   8b 15 00 00 00 00       mov    0x0(%rip),%edx        # 6 <set_cpuid_faulting+0x6>
   6:   55                      push   %rbp
   7:   48 89 e5                mov    %rsp,%rbp
   a:   53                      push   %rbx
   b:   40 0f b6 df             movzbl %dil,%ebx
   f:   85 d2                   test   %edx,%edx
  11:   75 07                   jne    1a <set_cpuid_faulting+0x1a>
  13:   9c                      pushfq 
  14:   58                      pop    %rax
  15:   f6 c4 02                test   $0x2,%ah
  18:   75 48                   jne    62 <set_cpuid_faulting+0x62>
  1a:   65 48 8b 05 00 00 00    mov    %gs:0x0(%rip),%rax        # 22 <set_cpuid_faulting+0x22>
  21:   00 
  22:   48 83 e0 fe             and    $0xfffffffffffffffe,%rax
  26:   b9 40 01 00 00          mov    $0x140,%ecx
  2b:   48 09 d8                or     %rbx,%rax
  2e:   48 89 c2                mov    %rax,%rdx
  31:   65 48 89 05 00 00 00    mov    %rax,%gs:0x0(%rip)        # 39 <set_cpuid_faulting+0x39>
  38:   00 
  39:   48 c1 ea 20             shr    $0x20,%rdx
  3d:   0f 30                   wrmsr  
  3f:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  44:   5b                      pop    %rbx
  45:   5d                      pop    %rbp
  46:   c3                      retq   
  47:   48 c1 e2 20             shl    $0x20,%rdx
  4b:   89 c0                   mov    %eax,%eax
  4d:   bf 40 01 00 00          mov    $0x140,%edi
  52:   48 09 d0                or     %rdx,%rax
  55:   31 d2                   xor    %edx,%edx
  57:   48 89 c6                mov    %rax,%rsi
  5a:   e8 00 00 00 00          callq  5f <set_cpuid_faulting+0x5f>
  5f:   5b                      pop    %rbx
  60:   5d                      pop    %rbp
  61:   c3                      retq   
  62:   e8 00 00 00 00          callq  67 <set_cpuid_faulting+0x67>
  67:   85 c0                   test   %eax,%eax
  69:   74 af                   je     1a <set_cpuid_faulting+0x1a>
  6b:   8b 05 00 00 00 00       mov    0x0(%rip),%eax        # 71 <set_cpuid_faulting+0x71>
  71:   85 c0                   test   %eax,%eax
  73:   75 a5                   jne    1a <set_cpuid_faulting+0x1a>
  75:   48 c7 c1 00 00 00 00    mov    $0x0,%rcx
  7c:   48 c7 c2 00 00 00 00    mov    $0x0,%rdx
  83:   be b9 00 00 00          mov    $0xb9,%esi
  88:   48 c7 c7 00 00 00 00    mov    $0x0,%rdi
  8f:   e8 00 00 00 00          callq  94 <set_cpuid_faulting+0x94>
  94:   eb 84                   jmp    1a <set_cpuid_faulting+0x1a>
  96:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  9d:   00 00 00 

The affected object file code size blows up as well, by 17%:

 arch/x86/kernel/process.o:
    text    data     bss     dec     hex filename
    3325    8577      32   11934    2e9e process.o.before
    3889    8609      32   12530    30f2 process.o.after

A good deal of this overhead and complexity comes from the implementation 
inefficiency I pointed out, and all of it can be avoided with the method I 
suggested in my previous review: caching the per-task MSR value in the 
thread struct.
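
A minimal sketch of that approach (the field name is illustrative; 0x140 is 
the MSR index visible in the disassembly above):

        /* in struct thread_struct: per-task shadow of the MSR value */
        u64 msr_misc_features;

        /*
         * in __switch_to_xtra(): a single compare of the cached values,
         * with the WRMSR done only when they actually differ:
         */
        if (prev_p->thread.msr_misc_features != next_p->thread.msr_misc_features)
                wrmsrl(0x00000140, next_p->thread.msr_misc_features);

ARCH_SET_CPUID would then update the cached value (plus the MSR on the current 
CPU), and the context switch hot path is reduced to a load, a compare and a 
not-taken branch for everyone not using CPUID faulting.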

So sorry, NAK for this implementation - especially considering how relatively 
straightforward the changes I suggested are to implement.

Thanks,

        Ingo
