I decided to do some experiments with how we use Flex. The main takeaway is that backtracking, which we removed in 2005, doesn't seem to matter anymore for the core scanner. Also, state table size is of marginal importance.
Using the information_schema Flex+Bison microbenchmark from Tom [1], I tested removing most of the "fail" rules designed to avoid backtracking ("decimalfail" is needed by PL/pgSQL). Below are the best times (most runs within 1%), followed by postgres binary size. The numbers are with Flex 2.5.35 on MacOS, no asserts or debugging symbols. HEAD: 1.53s 7139132 bytes HEAD minus "fail" rules (patch attached): 1.53s 6971204 bytes Surprisingly, it has the same performance and a much smaller binary. The size difference is because the size of the elements of the yy_transition array is constrained by the number of elements in the array. Since there are now fewer than INT16_MAX state transitions, the struct members go from 32 bit: struct yy_trans_info { flex_int32_t yy_verify; flex_int32_t yy_nxt; }; static yyconst struct yy_trans_info yy_transition[37045] = ... to 16 bit: struct yy_trans_info { flex_int16_t yy_verify; flex_int16_t yy_nxt; }; static yyconst struct yy_trans_info yy_transition[31763] = ... To test if array size was the deciding factor, I tried bloating it by essentially undoing commit a5ff502fcea. Doing so produced an array with 62583 elements and 32-bit members, so nearly quadruple in size, and it was still not much slower than HEAD: HEAD minus "fail" rules, minus %xusend/%xuiend: 1.56s 7343932 bytes While at it, I repeated the benchmark with different Flex flags: HEAD, plus -Cf: 1.60s 6995788 bytes HEAD, minus "fail" rules, plus -Cf: 1.59s 6979396 bytes HEAD, plus -Cfe: 1.65s 6868804 bytes So this recommendation of the Flex manual (-CF) still holds true. It's worth noting that using perfect hashing for keyword lookup (20% faster) had a much bigger effect than switching from -Cfe to -CF (7% faster). It would be nice to have confirmation to make sure I didn't err somewhere, and to try a more real-world benchmark. (Also for the moment I only have Linux on a virtual machine.) The regression tests pass, but some comments are now wrong. If it's confirmed that backtracking doesn't matter for recent Flex/hardware, disregarding it would make maintenance of our scanners a bit easier. [1] https://www.postgresql.org/message-id/14616.1558560331%40sss.pgh.pa.us -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
remove-scanner-fail-rules.patch
Description: Binary data