Re: [PATCH 1/6] x86: Introduce x86_decode_lite()

Jan Beulich Tue, 23 Apr 2024 03:46:40 -0700

On 23.04.2024 12:27, Andrew Cooper wrote:
> On 23/04/2024 10:17 am, Jan Beulich wrote:
>> On 22.04.2024 20:14, Andrew Cooper wrote:
>>> --- /dev/null
>>> +++ b/xen/arch/x86/x86_emulate/decode-lite.c
>>> @@ -0,0 +1,245 @@
>>> +/* SPDX-License-Identifier: GPL-2.0-only */
>>> +
>>> +#include "private.h"
>>> +
>>> +#define Imm8   (1 << 0)
>>> +#define Imm    (1 << 1)
>>> +#define Branch (1 << 5) /* ... that we care about */
>>> +/*      ModRM  (1 << 6) */
>>> +#define Known  (1 << 7)
>>> +
>>> +#define ALU_OPS                                 \
>>> +    (Known|ModRM),                              \
>>> +    (Known|ModRM),                              \
>>> +    (Known|ModRM),                              \
>>> +    (Known|ModRM),                              \
>>> +    (Known|Imm8),                               \
>>> +    (Known|Imm)
>>> +
>>> +static const uint8_t init_or_livepatch_const onebyte[256] = {
>>> +    [0x00] = ALU_OPS, /* ADD */ [0x08] = ALU_OPS, /* OR  */
>>> +    [0x10] = ALU_OPS, /* ADC */ [0x18] = ALU_OPS, /* SBB */
>>> +    [0x20] = ALU_OPS, /* AND */ [0x28] = ALU_OPS, /* SUB */
>>> +    [0x30] = ALU_OPS, /* XOR */ [0x38] = ALU_OPS, /* CMP */
>>> +
>>> +    [0x50 ... 0x5f] = (Known),             /* PUSH/POP %reg */
>>> +
>>> +    [0x70 ... 0x7f] = (Known|Branch|Imm8), /* Jcc disp8 */
>>> +    [0x80]          = (Known|ModRM|Imm8),
>>> +    [0x81]          = (Known|ModRM|Imm),
>>> +
>>> +    [0x83]          = (Known|ModRM|Imm8),
>>> +    [0x84 ... 0x8e] = (Known|ModRM),       /* TEST/XCHG/MOV/MOV-SREG/LEA r/rm */
>>> +
>>> +    [0xb0 ... 0xb7] = (Known|Imm8),        /* MOV $imm8, %reg */
>> I'm surprised you get away without at least NOP also marked as known.
>> Imo the whole 0x90 row should be marked so.
>>
>> Similarly I'm not convinced that leaving the 0xA0 row unpopulated is a
>> good idea. It's at best a trap set up for somebody to fall into rather
>> sooner than later.
>>
>>> +    [0xb8 ... 0xbf] = (Known|Imm),         /* MOV $imm32, %reg */
>>> +
>>> +    [0xcc]          = (Known),             /* INT3 */
>>> +    [0xcd]          = (Known|Imm8),        /* INT $imm8 */
>> Like above, what about in particular any of the shifts/rotates and the
>> MOV that's in the 0xC0 row?
>>
>> While the last sentence in the description is likely meant to cover
>> that, I think the description wants to go further as to any pretty
>> common but omitted insns. Already "x86: re-work memset()" and "x86: re-
>> work memcpy()" (v2 pending for, soon, 3 years) would make it necessary
>> to touch this table, thus increasing complexity of those changes to an
>> area they shouldn't be concerned about at all.
>>
>>> +    [0xe8 ... 0xe9] = (Known|Branch|Imm),  /* CALL/JMP disp32 */
>>> +    [0xeb]          = (Known|Branch|Imm8), /* JMP disp8 */
>> 0xe0 ... 0xe7 and 0xec ... 0xef would imo also better be covered, as
>> they easily can be (much like you also cover e.g. CMC despite it
>> appearing pretty unlikely that we use that insn anywhere).
>>
>>> +    [0xf1]          = (Known),             /* ICEBP */
>>> +    [0xf4]          = (Known),             /* HLT */
>>> +    [0xf5]          = (Known),             /* CMC */
>>> +    [0xf6 ... 0xf7] = (Known|ModRM),       /* Grp3 */
>>> +    [0xf8 ... 0xfd] = (Known),             /* CLC ... STD */
>>> +    [0xfe ... 0xff] = (Known|ModRM),       /* Grp4 */
>>> +};
>>> +static const uint8_t init_or_livepatch_const twobyte[256] = {
>>> +    [0x00 ... 0x01] = (Known|ModRM),       /* Grp6/Grp7 */
>> LAR / LSL? CLTS? WBINVD? UD2?
>>
>>> +    [0x18 ... 0x1f] = (Known|ModRM),       /* Grp16 (Hint Nop) */
>>> +
>>> +    [0x20 ... 0x23] = (Known|ModRM),       /* MOV CR/DR */
>>> +
>>> +    [0x30 ... 0x34] = (Known),             /* WRMSR ... RDPMC */
>> 0x34 is SYSENTER, isn't it?
>>
>>> +    [0x40 ... 0x4f] = (Known|ModRM),       /* CMOVcc */
>>> +
>>> +    [0x80 ... 0x8f] = (Known|Branch|Imm),  /* Jcc disp32 */
>> What about things like VMREAD/VMWRITE?
>>
>>> +    [0x90 ... 0x9f] = (Known|ModRM),       /* SETcc */
>> PUSH/POP fs/gs? CPUID?
>>
>>> +    [0xab]          = (Known|ModRM),       /* BTS */
>> BTS (and BTC below) but not BT and BTR?
>>
>>> +    [0xac]          = (Known|ModRM|Imm8),  /* SHRD $imm8 */
>>> +    [0xad ... 0xaf] = (Known|ModRM),       /* SHRD %cl / Grp15 / IMUL */
>> SHRD but not SHLD?
>>
>> CMPXCHG?
>>
>>> +    [0xb8 ... 0xb9] = (Known|ModRM),       /* POPCNT/Grp10 (UD1) */
>>> +    [0xba]          = (Known|ModRM|Imm8),  /* Grp8 */
>>> +    [0xbb ... 0xbf] = (Known|ModRM),       /* BSR/BSF/BSR/MOVSX */
>> Nit (comment only): 0xbb is BTC.
>>
>> MOVSX but not MOVZX and also not MOVSXD (in the 1-byte table)?
>>
>>> +};
>> XADD, MOVNTI, and the whole 0xc7-based group?
> 
> As you may have guessed, I filled in the opcode table until I could
> parse all replacements.
> 
> When, at the end of this, I didn't need the osize=8 movs, I took the
> decoding complexity back out.


While I can see that this requiring extra logic makes it desirable to
leave out, it'll easily be a surprise when, eventually, someone adds an
alternative using such. Please may I ask that any "simple" integer
insn left out be clearly mentioned in or ahead of the tables?
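
For illustration only, a minimal sketch of the extra logic the osize=8
MOVs would need. The helper name and parameters are hypothetical and not
from the patch; it assumes a one-byte opcode flagged Imm (branches
excluded) and a REX byte of 0 when no REX prefix is in effect.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: size in bytes of the
 * immediate for a one-byte opcode flagged Imm in 64-bit mode.
 */
static unsigned int imm_bytes(uint8_t opcode, uint8_t rex, bool opsize_66)
{
    /* rex is either 0 (absent) or a byte in the 0x40 ... 0x4f range. */
    assert(rex == 0 || (rex & 0xf0) == 0x40);

    if ( rex & 0x08 ) /* REX.W takes precedence over the 0x66 prefix. */
    {
        /* Only MOV $imm, %reg (0xb8 ... 0xbf) takes a full imm64. */
        if ( opcode >= 0xb8 && opcode <= 0xbf )
            return 8;

        return 4; /* Other insns use a sign-extended imm32. */
    }

    return opsize_66 ? 2 : 4;
}
```

With something of this shape, the Imm handling would gain a `case 8:`
fetch rather than reaching the bad_osize path.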

>>> + *  - The 67 prefix is not implemented, so the address size is only 64bit.
>>> + *
>>> + * Inputs:
>>> + *  @ip  The position to start decoding from.
>>> + *  @end End of the replacement block.  Exceeding this is considered an error.
>>> + *
>>> + * Returns: x86_decode_lite_t
>>> + *  - On failure, length of -1.
>>> + *  - On success, length > 0 and REL_TYPE_*.  For REL_TYPE != NONE, rel points
>>> + *    at the relative field in the instruction stream.
>>> + */
>>> +x86_decode_lite_t init_or_livepatch x86_decode_lite(void *ip, void *end)
>>> +{
>>> +    void *start = ip, *rel = NULL;
>>> +    unsigned int opc, type = REL_TYPE_NONE;
>>> +    uint8_t b, d, osize = 4;
>>> +
>>> +#define OPC_TWOBYTE (1 << 8)
>>> +
>>> +    /* Mutates IP, uses END. */
>>> +#define FETCH(ty)                                       \
>>> +    ({                                                  \
>>> +        ty _val;                                        \
>>> +                                                        \
>>> +        if ( (ip + sizeof(ty)) > end )                  \
>>> +            goto overrun;                               \
>>> +        _val = *(ty *)ip;                               \
>>> +        ip += sizeof(ty);                               \
>>> +        _val;                                           \
>>> +    })
>>> +
>>> +    for ( ;; ) /* Prefixes */
>>> +    {
>>> +        switch ( b = FETCH(uint8_t) )
>>> +        {
>>> +        case 0x26: /* ES override */
>>> +        case 0x2e: /* CS override */
>>> +        case 0x36: /* DS override */
>>> +        case 0x3e: /* SS override */
>>> +        case 0x64: /* FS override */
>>> +        case 0x65: /* GS override */
>>> +        case 0xf0: /* Lock */
>>> +        case 0xf2: /* REPNE */
>>> +        case 0xf3: /* REP */
>>> +            break;
>>> +
>>> +        case 0x66: /* Operand size override */
>>> +            osize = 2;
>>> +            break;
>>> +
>>> +        /* case 0x67: Address size override, not implemented */
>>> +
>>> +        case 0x40 ... 0x4f: /* REX */
>>> +            continue;
>> Imo at least a comment is needed as to osize here: We don't use 0x66
>> followed by REX.W, I suppose, when 0x66 determines operand size. It
>> may also be an opcode extension, though, in which case osize set to
>> 2 is (latently) wrong. "Latently" because all you need osize for is
>> to determine Imm's length.
>>
>> However, what I again think need covering right away are opcodes
>> 0xb8 ... 0xbc in combination with REX.W (osize needing to be 8 there).
>>
>> Finally - why "continue" here, but "break" further up? Both (right
>> now) have exactly the same effect.
> 
> They're not the same when ...
> 
>>
>>> +        default:
>>> +            goto prefixes_done;
>>> +        }
> 
> 
> ... this has "cancel the REX prefix" in it.

Of course.

> I started by decoding REX, only to find I didn't need it, so took it
> back out.

At which point imo they want to all be "break". Once the cancellation
needs adding, where necessary "break" can be switched to "continue".
(Interestingly REX2 is different from REX in this regard, and hence
wouldn't need such cancellation, if ever we end up patching APX insns.)
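
For illustration only, a minimal sketch of how "break" and "continue"
would differ once REX cancellation is present. The helper name and its
interface are hypothetical and not from the patch; the only fact relied
on is that a REX prefix is ignored unless it immediately precedes the
opcode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: scan prefixes and report
 * the REX byte still in effect at the opcode (0 if none).  Returns the
 * number of prefix bytes consumed.
 */
static size_t scan_prefixes(const uint8_t *ip, const uint8_t *end,
                            uint8_t *rex_out)
{
    const uint8_t *start = ip;
    uint8_t rex = 0;

    assert(ip <= end);

    for ( ; ip < end; ip++ )
    {
        switch ( *ip )
        {
        case 0x26: case 0x2e: case 0x36: case 0x3e: /* ES/CS/SS/DS */
        case 0x64: case 0x65:                       /* FS/GS */
        case 0x66:                                  /* Operand size */
        case 0xf0: case 0xf2: case 0xf3:            /* LOCK/REPNE/REP */
            break;    /* Leaves the switch, reaching the cancellation. */

        default:
            if ( (*ip & 0xf0) == 0x40 )             /* REX */
            {
                rex = *ip;
                continue; /* Skips the cancellation; REX stays latched. */
            }
            goto done;
        }

        rex = 0; /* A legacy prefix after REX cancels the REX prefix. */
    }

 done:
    *rex_out = rex;
    return ip - start;
}
```

Without the `rex = 0` statement after the switch, both keywords simply
proceed to the next loop iteration, which is the situation in the patch
as posted.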

>>> +    }
>>> + prefixes_done:
>>> +
>>> +    /* Fetch the main opcode byte(s) */
>>> +    if ( b == 0x0f )
>>> +    {
>>> +        b = FETCH(uint8_t);
>>> +        opc = OPC_TWOBYTE | b;
>>> +
>>> +        d = twobyte[b];
>>> +    }
>>> +    else
>>> +    {
>>> +        opc = b;
>>> +        d = onebyte[b];
>>> +    }
>> IOW GPR insns in 0f38 and 0f3a spaces are left out, too. That's perhaps
>> less of an issue than some of the other omissions (and would be more
>> involved to cover when considering that some of them are VEX-encoded),
>> but still not ideal.
> 
> They can all be added if needed, but right now they're not.
> 
> This decoder only needs to cover instructions likely to be used in
> alternatives, and that pretty much limits us to simple integer operations.
> 
> Any extra complexity here makes the function less and less "lite".

I understand this. At the same time to me anything pending and anything
previously submitted but disliked for whatever reason wants at least
considering for coverage. In that context please recall that once there was
BMI2 patching (that you didn't like) ...

>>> +    }
>>> +
>>> +    if ( d & (Imm|Imm8) )
>>> +    {
>>> +        if ( d & Imm8 )
>>> +            osize = 1;
>>> +
>>> +        switch ( osize )
>>> +        {
>>> +        case 1: FETCH(uint8_t);  break;
>>> +        case 2: FETCH(uint16_t); break;
>>> +        case 4: FETCH(uint32_t); break;
>>> +        default: goto bad_osize;
>>> +        }
>>> +    }
>>> +
>>> +    return (x86_decode_lite_t){ ip - start, type, rel };
>>> +
>>> + bad_osize:
>>> +    printk(XENLOG_ERR "%s() Bad osize %u in %*ph\n",
>>> +           __func__, osize,
>>> +           (int)(unsigned long)(end - start), start);
>>> +    return (x86_decode_lite_t){ -1, REL_TYPE_NONE, NULL };
>> Maybe limit opcode quoting to ip - start here?
> 
> In the case that we've taken the bad_osize path, we've not decoded the
> full instruction.  The bytes beyond ip are useful for diagnostics.

Hmm. The reason for the reported failure lies within the [start,ip)
range. Yet I agree that would not be a complete insn. Otoh what is
being patched may be a meaningfully long series of insns, not all of
which are useful to log. If you think including the immediate (the value of
which is of no interest) is relevant, may I then please ask that you
bound logging at the insn size limit of 15?
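
For illustration only, a minimal sketch of such a bound. The helper and
macro names are hypothetical and not from the patch; 15 is the
architectural x86 instruction length limit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_INSN_LEN 15 /* Architectural x86 instruction length limit. */

/* Hypothetical helper: number of bytes to hex-dump from [start, end). */
static int dump_len(const uint8_t *start, const uint8_t *end)
{
    ptrdiff_t avail = end - start;

    assert(avail >= 0);

    /* Never log more than one maximum-length instruction. */
    return avail < MAX_INSN_LEN ? (int)avail : MAX_INSN_LEN;
}
```

The result would take the place of `(int)(unsigned long)(end - start)`
as the length argument for `%*ph` in the bad_osize path.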

>>> --- a/xen/arch/x86/x86_emulate/private.h
>>> +++ b/xen/arch/x86/x86_emulate/private.h
>>> @@ -9,7 +9,9 @@
>>>  #ifdef __XEN__
>>>  
>>>  # include <xen/bug.h>
>>> +# include <xen/init.h>
>>>  # include <xen/kernel.h>
>>> +# include <xen/livepatch.h>
>> Are both of these really needed here, rather than just in decode-lite.c?
> 
> Yes, for the userspace harness.

The user space harness shouldn't include any Xen headers. Patch context
(the #ifdef at the top) even shows that this is Xen-only code.

>>> +    void *rel;
>> While I understand the goal of omitting const here and ...
>>
>>> +} x86_decode_lite_t;
>>> +
>>> +x86_decode_lite_t x86_decode_lite(void *ip, void *end);
>> ... here, I still find this fragile / misleading (the function itself, after
>> all, only ever fetches from memory). Even with the goal in mind, surely at
>> least "end" can be pointer-to-const?
>>
>> The (struct) return type would also be easier for the compiler to deal
>> with if it didn't have a pointer field (and hence needs to be 128-bit). How
>> about returning an offset relative to "start"? That would then allow proper
>> constifying of both function parameters as well.
> 
> Quite the contrary.
> 
> I did initially pack it all into a single GPR, but both the written C
> and code generation of this form are better.
> 
> This is what the code generation looks like:
> 
> <xdl>:
> 55                      push   %rbp
> 48 89 e5                mov    %rsp,%rbp
> e8 77 fd ff ff          callq  ffff82d04033c580 <x86_decode_lite>
> 5d                      pop    %rbp
> 88 05 01 77 2f 00       mov    %al,0x2f7701(%rip)        # <xdl_len>
> 0f b6 c4                movzbl %ah,%eax
> 88 05 f7 76 2f 00       mov    %al,0x2f76f7(%rip)        # <xdl_type>
> 48 89 15 e8 76 2f 00    mov    %rdx,0x2f76e8(%rip)       # <xdl_rel>
> c3                      retq   

Well, "easier" in my earlier reply wasn't referring to the complexity
of generated code. Instead I'm slightly wary of compiler issues with
the calling convention for 128-bit struct returns.

> and keeping rel as a full pointer simplifies both sides of the function.

Hmm, I can see that being the case. I wonder whether const-correctness
doesn't weigh higher, though. But I'm also not going to insist.
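
For illustration only, a minimal sketch of the offset-based alternative.
The type, field, and function names are hypothetical and not from the
patch, and the sketch assumes REL_TYPE_NONE is 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout; small enough to be returned in one register. */
typedef struct {
    int8_t  len;      /* Instruction length, or -1 on failure. */
    uint8_t rel_type; /* REL_TYPE_* value; 0 assumed to mean NONE. */
    uint8_t rel_off;  /* Offset of the relative field from the start. */
} decode_result_t;

_Static_assert(sizeof(decode_result_t) <= sizeof(uint32_t),
               "result must fit in 32 bits");

/* Recover the pointer to the relative field on the caller's side. */
static inline const uint8_t *result_rel(const uint8_t *start,
                                        decode_result_t r)
{
    if ( !r.rel_type )
        return NULL;

    /* The relative field has to lie within the decoded instruction. */
    assert(r.len > 0 && r.rel_off < r.len);

    return start + r.rel_off;
}
```

A caller needing to rewrite the displacement would add the offset to its
own non-const pointer, leaving the decoder's parameters free to be
pointer-to-const.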

Jan
