>-----Original Message----- >From: Heikki Linnakangas <hlinn...@iki.fi> >Sent: Friday, December 10, 2021 12:34 PM >To: John Naylor <john.nay...@enterprisedb.com>; Vladimir Sitnikov ><sitnikov.vladi...@gmail.com> >Cc: pgsql-hackers <pgsql-hack...@postgresql.org>; Amit Khandekar ><amitdkhan...@gmail.com>; Thomas Munro <thomas.mu...@gmail.com>; Greg Stark ><st...@mit.edu> >Subject: [EXTERNAL] Re: speed up verifying UTF-8 > >On 20/10/2021 00:42, John Naylor wrote: >> I've decided I'm not quite comfortable with the additional complexity >> in the build system introduced by the SIMD portion of the previous patches. >> It would make more sense if the pure C portion were unchanged, but >> with the shift-based DFA plus the bitwise ASCII check, we have a >> portable implementation that's still a substantial improvement over >> the current validator. In v24, I've included only that much, and the >> diff is only about 1/3 as many lines. If future improvements to COPY >> FROM put additional pressure on this path, we can always add SIMD support >> later. > >+1. > >I had another look at this now. Looks good, just a few minor comments below: > >> +/* >> + * Verify a chunk of bytes for valid ASCII, including a zero-byte check. >> + */ >> +static inline bool >> +is_valid_ascii(const unsigned char *s, int len) { >> + uint64 chunk, >> + highbit_cum = UINT64CONST(0), >> + zero_cum = UINT64CONST(0x8080808080808080); >> + >> + Assert(len % sizeof(chunk) == 0); >> + >> + while (len >= sizeof(chunk)) >> + { >> + memcpy(&chunk, s, sizeof(chunk)); >> + >> + /* >> + * Capture any zero bytes in this chunk. >> + * >> + * First, add 0x7f to each byte. This sets the high bit in each >> byte, >> + * unless it was a zero. We will check later that none of the >> bytes in >> + * the chunk had the high bit set, in which case the max value >> each >> + * byte can have after the addition is 0x7f + 0x7f = 0xfe, and >> we >> + * don't need to worry about carrying over to the next byte. >> + * >> + * If any resulting high bits are zero, the corresponding high >> bits in >> + * the zero accumulator will be cleared. >> + */ >> + zero_cum &= (chunk + UINT64CONST(0x7f7f7f7f7f7f7f7f)); >> + >> + /* Capture any set bits in this chunk. */ >> + highbit_cum |= chunk; >> + >> + s += sizeof(chunk); >> + len -= sizeof(chunk); >> + } > >This function assumes that the input len is a multiple of 8. There's an >assertion for that, but it would be good to also mention it in the function >comment. I took me a moment to realize that. > >Given that assumption, I wonder if "while (len >= 0)" would marginally faster. >Or compute "s_end = s + len" first, and check for "while (s < s_end)", so that >you don't need to update 'len' in the loop. > >Also would be good to mention what exactly the return value means. I.e >"returns false if the input contains any bytes with the high-bit set, or >zeros". > >> + /* >> + * Check if any high bits in the zero accumulator got cleared. >> + * >> + * XXX: As noted above, the zero check is only valid if the chunk had no >> + * high bits set. However, the compiler may perform these two checks in >> + * any order. That's okay because if any high bits were set, we would >> + * return false regardless, so invalid results from the zero check don't >> + * matter. >> + */ >> + if (zero_cum != UINT64CONST(0x8080808080808080)) >> + return false; > >I don't understand the "the compiler may perform these checks in any order" >comment. We trust the compiler to do the right thing, and only reorder things >when it's safe to do so. What is special here, why is it worth mentioning here? > >> @@ -1721,7 +1777,7 @@ pg_gb18030_verifystr(const unsigned char *s, int len) >> return s - start; >> } >> >> -static int >> +static pg_noinline int >> pg_utf8_verifychar(const unsigned char *s, int len) { >> int l; > >Why force it to not be inlined? > >> + * In a shift-based DFA, the input byte is an index into array of >> + integers >> + * whose bit pattern encodes the state transitions. To compute the >> + current >> + * state, we simply right-shift the integer by the current state and >> + apply a >> + * mask. In this scheme, the address of the transition only depends >> + on the >> + * input byte, so there is better pipelining. > >Should be "To compute the *next* state, ...", I think. > >The way the state transition table works is pretty inscrutable. That's >understandable, because the values were found by an SMT solver, so I'm not >sure if anything can be done about it. > >- Heikki >
If I remember correctly the shift instruction is very fast...