RE: [EXTERNAL] Re: speed up verifying UTF-8

Godfrin, Philippe E Fri, 10 Dec 2021 10:49:35 -0800

>-----Original Message-----
>From: Heikki Linnakangas <hlinn...@iki.fi> 
>Sent: Friday, December 10, 2021 12:34 PM
>To: John Naylor <john.nay...@enterprisedb.com>; Vladimir Sitnikov 
><sitnikov.vladi...@gmail.com>
>Cc: pgsql-hackers <pgsql-hack...@postgresql.org>; Amit Khandekar 
><amitdkhan...@gmail.com>; Thomas Munro <thomas.mu...@gmail.com>; Greg Stark 
><st...@mit.edu>
>Subject: [EXTERNAL] Re: speed up verifying UTF-8
>
>On 20/10/2021 00:42, John Naylor wrote:
>> I've decided I'm not quite comfortable with the additional complexity 
>> in the build system introduced by the SIMD portion of the previous patches.
>> It would make more sense if the pure C portion were unchanged, but 
>> with the shift-based DFA plus the bitwise ASCII check, we have a 
>> portable implementation that's still a substantial improvement over 
>> the current validator. In v24, I've included only that much, and the 
>> diff is only about 1/3 as many lines. If future improvements to COPY 
>> FROM put additional pressure on this path, we can always add SIMD support 
>> later.
>
>+1.
>
>I had another look at this now. Looks good, just a few minor comments below:
>
>> +/*
>> + * Verify a chunk of bytes for valid ASCII, including a zero-byte check.
>> + */
>> +static inline bool
>> +is_valid_ascii(const unsigned char *s, int len) {
>> +    uint64          chunk,
>> +                            highbit_cum = UINT64CONST(0),
>> +                            zero_cum = UINT64CONST(0x8080808080808080);
>> +
>> +    Assert(len % sizeof(chunk) == 0);
>> +
>> +    while (len >= sizeof(chunk))
>> +    {
>> +            memcpy(&chunk, s, sizeof(chunk));
>> +
>> +            /*
>> +             * Capture any zero bytes in this chunk.
>> +             *
>> +             * First, add 0x7f to each byte. This sets the high bit in each 
>> byte,
>> +             * unless it was a zero. We will check later that none of the 
>> bytes in
>> +             * the chunk had the high bit set, in which case the max value 
>> each
>> +             * byte can have after the addition is 0x7f + 0x7f = 0xfe, and 
>> we
>> +             * don't need to worry about carrying over to the next byte.
>> +             *
>> +             * If any resulting high bits are zero, the corresponding high 
>> bits in
>> +             * the zero accumulator will be cleared.
>> +             */
>> +            zero_cum &= (chunk + UINT64CONST(0x7f7f7f7f7f7f7f7f));
>> +
>> +            /* Capture any set bits in this chunk. */
>> +            highbit_cum |= chunk;
>> +
>> +            s += sizeof(chunk);
>> +            len -= sizeof(chunk);
>> +    }
>
>This function assumes that the input len is a multiple of 8. There's an 
>assertion for that, but it would be good to also mention it in the function 
>comment. I took me a moment to realize that.
>
>Given that assumption, I wonder if "while (len >= 0)" would marginally faster. 
>Or compute "s_end = s + len" first, and check for "while (s < s_end)", so that 
>you don't need to update 'len' in the loop.
>
>Also would be good to mention what exactly the return value means. I.e 
>"returns false if the input contains any bytes with the high-bit set, or 
>zeros".
>
>> +    /*
>> +     * Check if any high bits in the zero accumulator got cleared.
>> +     *
>> +     * XXX: As noted above, the zero check is only valid if the chunk had no
>> +     * high bits set. However, the compiler may perform these two checks in
>> +     * any order. That's okay because if any high bits were set, we would
>> +     * return false regardless, so invalid results from the zero check don't
>> +     * matter.
>> +     */
>> +    if (zero_cum != UINT64CONST(0x8080808080808080))
>> +            return false;
>
>I don't understand the "the compiler may perform these checks in any order" 
>comment. We trust the compiler to do the right thing, and only reorder things 
>when it's safe to do so. What is special here, why is it worth mentioning here?
>
>> @@ -1721,7 +1777,7 @@ pg_gb18030_verifystr(const unsigned char *s, int len)
>>      return s - start;
>>  }
>>  
>> -static int
>> +static pg_noinline int
>>  pg_utf8_verifychar(const unsigned char *s, int len)  {
>>      int                     l;
>
>Why force it to not be inlined?
>
>> + * In a shift-based DFA, the input byte is an index into array of 
>> + integers
>> + * whose bit pattern encodes the state transitions. To compute the 
>> + current
>> + * state, we simply right-shift the integer by the current state and 
>> + apply a
>> + * mask. In this scheme, the address of the transition only depends 
>> + on the
>> + * input byte, so there is better pipelining.
>
>Should be "To compute the *next* state, ...", I think.
>
>The way the state transition table works is pretty inscrutable. That's 
>understandable, because the values were found by an SMT solver, so I'm not 
>sure if anything can be done about it.
>
>- Heikki
>


If I remember correctly the shift instruction is very fast...

RE: [EXTERNAL] Re: speed up verifying UTF-8

Reply via email to