On Wed, Jun 9, 2021 at 7:02 AM Heikki Linnakangas <hlinn...@iki.fi> wrote:

> What is the worst case scenario for this algorithm? Something where the
> new fast ASCII check never helps, but is as fast as possible with the
> old code. For that, I added a repeating pattern of '123456789012345ä' to
> the test set (these results are from my Intel laptop, not the raspberry pi):
>
> Master:
>
>  chinese | mixed | ascii | mixed2
> ---------+-------+-------+--------
>     1333 |   757 |   410 |    573
> (1 row)
>
> v11-0001-Rewrite-pg_utf8_verifystr-for-speed.patch:
>
>  chinese | mixed | ascii | mixed2
> ---------+-------+-------+--------
>      942 |   470 |    66 |   1249
> (1 row)
I get a much smaller regression on my laptop with clang 12:

master:

 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
     978 |   685 |   370 |    452

v11-0001:

 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
     686 |   438 |    64 |    595

> So there's a regression with that input. Maybe that's acceptable, this
> is the worst case, after all. Or you could tweak check_ascii for a
> different performance tradeoff, by checking the two 64-bit words
> separately and returning "8" if the failure happens in the second word.

For v12 (unformatted and without 0002 rebased) I tried the following:

--
	highbits_set = (half1) & UINT64CONST(0x8080808080808080);
	if (highbits_set)
		return 0;

	x1 = half1 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
	x1 &= UINT64CONST(0x8080808080808080);
	if (x1 != UINT64CONST(0x8080808080808080))
		return 0;

	/*
	 * Now we know we have at least 8 bytes of valid ascii, so if any of
	 * the following tests fails, return that.
	 */
	highbits_set = (half2) & UINT64CONST(0x8080808080808080);
	if (highbits_set)
		return sizeof(uint64);

	x2 = half2 + UINT64CONST(0x7f7f7f7f7f7f7f7f);
	x2 &= UINT64CONST(0x8080808080808080);
	if (x2 != UINT64CONST(0x8080808080808080))
		return sizeof(uint64);

	return 2 * sizeof(uint64);
--

and got this:

 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
     674 |   499 |   170 |    421

Pure ascii is significantly slower, but the regression is gone. I used the
string repeat('123456789012345ä', 3647) to match the ~62000 bytes in the
other strings (62000 / 17 = 3647).

> And I haven't tried the SSE patch yet, maybe that compensates for this.

I would expect that this case is identical to all-multibyte. The worst case
for SSE might be alternating 16-byte chunks of ascii-only and chunks of
multibyte, since that's one of the few places it branches.

In simdjson, they check ascii on 64-byte blocks at a time
((c1 | c2) | (c3 | c4)) and check only the previous block's "chunk 4" for
incomplete sequences at the end.
That simdjson approach is a bit messier, so I haven't done it, but it's an
option. Also, if SSE is accepted into the tree, the C fallback will only
matter on platforms like PowerPC64 and Arm64, so we can make that tradeoff
by testing those more carefully. I'll test on PowerPC soon.

-- 
John Naylor
EDB: http://www.enterprisedb.com
v12-Rewrite-pg_utf8_verifystr-for-speed.patch
Description: Binary data