On Sat, Jan 22, 2022 at 12:41:56AM +0000, Colin Watson wrote:
>> Technically, UTF-8 validation can be done at a few gigabytes per second
>> per core:
>>
>> https://lemire.me/blog/2018/05/16/validating-utf-8-strings-using-as-little-as-0-7-cycles-per-byte/
>>
>> but that is probably overkill. :-)
> Quite :-)
It struck me that it can probably be folded into the lexer for free: if
you add symbols for all the invalid UTF-8 sequences, validation should
simply fall out of the state machine. But I'm fine with those 20%; the
perfect need not be the enemy of the good here.

In general, I don't think I need to look at it again now, unless there
are any special questions. Thanks for taking care of this! Looking
forward to bookworm being faster (and of course sid before that), and
then I'll happily live with this on bullseye, knowing that it's
transient.

/* Steinar */
-- 
Homepage: https://www.sesse.net/
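
P.S. For the archive, a rough illustration of the folding idea: rejecting
invalid sequences is itself just a small byte-driven state machine, so a
lexer's DFA can absorb it by treating the malformed byte patterns as their
own (error) symbols. This is a minimal standalone sketch in Python, not
the actual lexer (which would presumably be table-driven in C); the state
numbering is arbitrary.

```python
def utf8_valid(data: bytes) -> bool:
    """Validate UTF-8 with a hand-rolled state machine.

    State 0 expects a lead byte; the other states encode how many
    continuation bytes remain, plus the tighter range checks on the
    second byte of 3- and 4-byte sequences that rule out overlong
    encodings and UTF-16 surrogates.
    """
    state = 0
    for b in data:
        if state == 0:
            if b < 0x80:
                continue                      # ASCII
            elif 0xC2 <= b <= 0xDF:
                state = 1                     # 1 continuation byte left
            elif b == 0xE0:
                state = 2                     # next must be A0..BF (no overlongs)
            elif 0xE1 <= b <= 0xEC or b in (0xEE, 0xEF):
                state = 3                     # 2 generic continuations left
            elif b == 0xED:
                state = 4                     # next must be 80..9F (no surrogates)
            elif b == 0xF0:
                state = 5                     # next must be 90..BF (no overlongs)
            elif 0xF1 <= b <= 0xF3:
                state = 6
            elif b == 0xF4:
                state = 7                     # next must be 80..8F (<= U+10FFFF)
            else:
                return False                  # 80..C1 and F5..FF never start a char
        elif state == 1:
            if 0x80 <= b <= 0xBF: state = 0
            else: return False
        elif state == 2:
            if 0xA0 <= b <= 0xBF: state = 1
            else: return False
        elif state == 3:
            if 0x80 <= b <= 0xBF: state = 1
            else: return False
        elif state == 4:
            if 0x80 <= b <= 0x9F: state = 1
            else: return False
        elif state == 5:
            if 0x90 <= b <= 0xBF: state = 3
            else: return False
        elif state == 6:
            if 0x80 <= b <= 0xBF: state = 3
            else: return False
        elif state == 7:
            if 0x80 <= b <= 0x8F: state = 3
            else: return False
    return state == 0                         # reject truncated sequences
```

Since every transition is keyed on one input byte, the same transitions can
be merged into a lexer's existing DFA tables, which is why the check comes
essentially for free.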