Control: tag -1 fixed-upstream

On Sun, Jan 23, 2022 at 06:11:19PM +0100, Steinar H. Gunderson wrote:
> On Sat, Jan 22, 2022 at 12:41:56AM +0000, Colin Watson wrote:
> >> Technically, UTF-8 validation can be done at a few gigabytes per second
> >> per core:
> >> 
> >>   https://lemire.me/blog/2018/05/16/validating-utf-8-strings-using-as-little-as-0-7-cycles-per-byte/
> >> 
> >> but that is probably overkill. :-)
> > Quite :-)
> 
> It struck me that it can probably be folded for free into the lexer.
> If you add symbols for all invalid UTF-8 sequences, I believe it should just
> go into the state machine. But I'm fine with those 20%; the perfect need not
> be the enemy of the good here.

Mm.  I'd somewhat prefer not to put it in the lexer though, because in
general the next stage after encoding conversion can be something other
than the lexer, and I don't want to store up too much confusion for my
future self.

I grabbed glib's UTF-8 validator, on the basis that it was a simple,
portable, and compatibly-licensed one that I could verify by eye, and
dropped it in, replacing the trial conversion pass if source_encoding !=
UTF-8 and target_encoding == UTF-8.  This saves about 8% on my test
system on top of the previous optimizations (10.589s → 9.791s, median of
nine runs).  It might be possible to do better with a faster validator,
but this seems likely to be good enough and we're probably approaching
diminishing returns.
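
For illustration only, here is a minimal sketch of the kind of simple
scalar validator that can be checked by eye.  It is not glib's code and
not what was merged; the name utf8_is_valid and its signature are
hypothetical, and it simply follows the well-formedness rules of
RFC 3629 (no overlong forms, no surrogates, nothing above U+10FFFF).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not glib's or man-db's code): returns true if
 * buf[0..len) is well-formed UTF-8 as defined by RFC 3629. */
bool utf8_is_valid(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    assert(buf != NULL || len == 0);

    while (i < len) {
        unsigned char c = buf[i];
        size_t need;        /* continuation bytes expected after c */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point valid at this length */

        if (c < 0x80) {     /* ASCII: one byte, nothing more to check */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return false;   /* stray continuation byte or 0xF8..0xFF */
        }

        if (len - i - 1 < need)
            return false;   /* sequence truncated at end of buffer */

        for (size_t k = 1; k <= need; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80)
                return false;   /* not a 10xxxxxx continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min)
            return false;   /* overlong encoding */
        if (cp > 0x10FFFF)
            return false;   /* beyond the Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;   /* UTF-16 surrogate, not allowed in UTF-8 */

        i += need + 1;
    }
    return true;
}
```

A check like this only reads the input once and writes nothing, which is
presumably why it is cheaper than a trial conversion that has to produce
output just to find out whether the input was valid.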

> In general, I don't think I need to look at it again now, unless there are
> any special questions. Thanks for taking care of this! Looking forward to
> bookworm being faster (and of course sid before that), and then I'll happily
> live with this on bullseye, knowing that it's transient.

Thanks a lot for the initial prod and the review comments - they've
definitely improved things.  I've gone ahead and merged all this.  I'll
need to do a call for translations before releasing since there are some
other changes that will necessitate that, but I expect to produce a new
upstream release in a couple of weeks.

-- 
Colin Watson (he/him)                              [cjwat...@debian.org]
