It occurred to me that the DFA + ascii quick check approach could also be adapted to speed up some cases where we currently walk a string counting characters, like this snippet in text_position_get_match_pos():
/* Convert the byte position to char position. */
while (state->refpoint < state->last_match)
{
    state->refpoint += pg_mblen(state->refpoint);
    state->refpos++;
}

This coding changed in commit 9556aa01c69 (Use single-byte Boyer-Moore-Horspool search even with multibyte encodings), where I found that the majority of cases got faster, but some got slower. It would be nice to regain the speed lost there and do even better.

In the case of UTF-8, we could just run the bytes through the DFA, incrementing a counter each time we reach the END state. The number of END states reached should equal the number of characters. The ASCII quick check would still be applicable as well.

I think all that is needed is to export some symbols and add the counting function. That wouldn't materially affect the current patch for input verification, and would be separate, but it would be nice to get the symbol visibility right up front. I've set this to Waiting on Author while I experiment with that.

--
John Naylor
EDB: http://www.enterprisedb.com
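P.S. For concreteness, here is a rough standalone sketch of the counting idea (hypothetical code, not from the patch, and not the full validating DFA): a tiny state machine whose state is the number of continuation bytes still expected, counting each return to the END state, plus an 8-bytes-at-a-time ASCII quick check between sequences. It assumes the input has already been verified as valid UTF-8; the real version would reuse the exported DFA tables.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical sketch: count characters in already-validated UTF-8.
 * The DFA state is the number of continuation bytes still expected;
 * every return to state 0 (the END state) means one more character.
 */
static size_t
utf8_count_chars(const unsigned char *s, size_t len)
{
	size_t		count = 0;
	size_t		i = 0;
	int			state = 0;		/* continuation bytes still expected */

	while (i < len)
	{
		unsigned char b;

		if (state == 0)
		{
			/* ASCII quick check: 8 bytes with no high bit set = 8 chars */
			while (len - i >= 8)
			{
				uint64_t	chunk;

				memcpy(&chunk, s + i, sizeof(chunk));
				if (chunk & UINT64_C(0x8080808080808080))
					break;
				count += 8;
				i += 8;
			}
			if (i >= len)
				break;
		}

		b = s[i++];
		if (state == 0)
		{
			if (b < 0x80)
				state = 0;		/* 1-byte char, already at END */
			else if (b < 0xE0)
				state = 1;		/* 2-byte lead */
			else if (b < 0xF0)
				state = 2;		/* 3-byte lead */
			else
				state = 3;		/* 4-byte lead */
		}
		else
			state--;			/* consume one continuation byte */

		if (state == 0)
			count++;			/* reached END: one more character */
	}
	return count;
}
```

The ASCII fast path only fires when the DFA is at the END state, so it can never split a multibyte sequence; on mostly-ASCII data it should recover most of what the byte-at-a-time pg_mblen() loop loses.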