On Wed, May 22, 2024 at 11:33:08AM -0700, Ivan Raikov wrote:
> Hello Peter,
> 
> Thanks a lot for the patch! Overall it looks ok, but it has been quite
> a while since I have had to deal with UTF-8 at this level of detail,
> so I don't really understand all the bitwise operations and range
> comparisons.

Yeah, it's rather tricky business, because UTF-8 is a variable-width
encoding in two ways:

- It encodes codepoints using 1, 2, 3 or 4 bytes.
- It encodes how many bytes to use in the first byte's leading bit,
  leading three bits, leading four bits or leading five bits depending
  on the length.

This latter property is extra annoying because there is no fixed-width
length field you can mask off the first byte - you have to test the
first bit to decide what to do next.  Then you test the second and
third bit, etc.  It has some irregularity to it, which has good
reasons (it's quite well designed for the constraints) but is a pain
to deal with nonetheless.
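
To make that concrete, here is a rough sketch in C of the prefix tests.
It is not taken from the patch or from utf.c; the names
utf8_sequence_length and utf8_lead_payload are made up for
illustration, and it only classifies the prefix (it does not reject
overlong lead bytes such as #xC0).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: total length of a UTF-8 sequence given its
   first byte, or 0 if the byte cannot start a sequence. */
size_t utf8_sequence_length(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1;  /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;  /* 10xxxxxx (continuation byte) or 11111xxx */
}

/* Hypothetical helper: the codepoint bits carried by the first byte.
   The number of payload bits shrinks as the sequence gets longer:
   7 bits for length 1, then 5, 4 and 3 bits for lengths 2, 3 and 4. */
unsigned utf8_lead_payload(unsigned char lead, size_t len)
{
    assert(len >= 1 && len <= 4);
    unsigned mask = (len == 1) ? 0x7Fu : (0xFFu >> (len + 1));
    return lead & mask;
}
```

So the first byte of "%E2%82%AC" (#xE2) says "3 bytes" and carries
only four payload bits; the rest comes from the continuation bytes.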

> I am wondering if it is possible to factor out the
> UTF-8-specific logic into a separate module and let it be invoked by
> the uri-generic parsing routines.

The coding itself must be done in one pass, so the percent-encoding
and decoding functions need to be fully encoding-aware.  They're a bit
monolithic in that sense, especially the decoder - it eats bytes
according to the encoding, and if the resulting character doesn't
fit the char-set, the percent-encoded bytes that were consumed are
put back onto the list.
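
Roughly, the decoder's "eat bytes, then maybe put them back" behaviour
looks like the following C sketch.  This is an illustration only, not
the uri-generic code: pct_decode_utf8, pct_byte, hex_value and the
example char-set predicate allow_non_ascii are all hypothetical names,
and "putting back" is modelled by returning 0 so the caller leaves the
input untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Reads one "%XY" triplet at s; returns 0..255, or -1 if malformed. */
static int pct_byte(const char *s)
{
    if (s[0] != '%') return -1;
    int hi = hex_value(s[1]);
    if (hi < 0) return -1;
    int lo = hex_value(s[2]);
    if (lo < 0) return -1;
    return hi * 16 + lo;
}

/* Example char-set (an assumption): only decode non-ASCII codepoints. */
int allow_non_ascii(uint32_t cp) { return cp >= 0x80; }

/* Tries to decode one codepoint from the percent-encoded bytes at s.
   Returns the number of input characters consumed (3, 6, 9 or 12) and
   stores the codepoint in *cp.  Returns 0 and consumes nothing if the
   bytes are not valid UTF-8 or the codepoint is not in the char-set. */
size_t pct_decode_utf8(const char *s, int (*allowed)(uint32_t),
                       uint32_t *cp)
{
    assert(s != NULL && allowed != NULL && cp != NULL);
    int lead = pct_byte(s);
    if (lead < 0) return 0;

    size_t len;
    uint32_t value, min;  /* min rejects overlong encodings */
    if ((lead & 0x80) == 0x00)      { len = 1; value = (uint32_t)lead; min = 0; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; value = lead & 0x1F; min = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; value = lead & 0x0F; min = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; value = lead & 0x07; min = 0x10000; }
    else return 0;

    /* Each continuation byte must look like 10xxxxxx. */
    for (size_t i = 1; i < len; i++) {
        int b = pct_byte(s + 3 * i);
        if (b < 0 || (b & 0xC0) != 0x80) return 0;
        value = (value << 6) | (uint32_t)(b & 0x3F);
    }

    if (value < min || value > 0x10FFFF) return 0;
    if (value >= 0xD800 && value <= 0xDFFF) return 0;  /* surrogates */
    if (!allowed(value)) return 0;  /* not in the char-set: "put back" */

    *cp = value;
    return 3 * len;
}
```

With this, "%C3%A9" decodes in one step to #xE9, while "%41" is left
percent-encoded under the example char-set.  It also shows why the
char-set check can't be split off easily: it can only happen after
all the bytes of the sequence have been consumed.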

I suppose one could SRFI-39 parameterize the pct-decode and pct-encode
functions, to make them pluggable, and provide utf8 by default?
We might even include latin1 or other decoders.

> Also, I think it would be tremendously helpful to use named constants,
> as I don't quite know the significance of #x800 or #x10000.

These values don't have much meaning of their own: they are the
boundaries between encoding lengths.  Codepoints below #x800 fit in
at most 2 bytes of UTF-8 and codepoints below #x10000 in at most 3;
anything from #x10000 up takes 4.  They're lifted straight from the
table in RFC 3629.
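
If we did want names for them, it could look something like this C
sketch.  The constant names and utf8_encode_sketch are hypothetical
(neither my patch nor utf.c uses them); the boundary values are the
ones from the RFC 3629 table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for the boundaries in the RFC 3629 table. */
enum {
    UTF8_MIN_2_BYTE = 0x80,     /* first codepoint needing 2 bytes */
    UTF8_MIN_3_BYTE = 0x800,    /* first codepoint needing 3 bytes */
    UTF8_MIN_4_BYTE = 0x10000,  /* first codepoint needing 4 bytes */
    UNICODE_LIMIT   = 0x110000  /* one past the last codepoint */
};

/* Encodes cp into out and returns the number of bytes written,
   or 0 if cp is a surrogate or outside the Unicode range. */
size_t utf8_encode_sketch(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp >= UNICODE_LIMIT || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < UTF8_MIN_2_BYTE) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < UTF8_MIN_3_BYTE) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < UTF8_MIN_4_BYTE) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

Whether the names read better than the bare numbers next to a comment
citing the RFC is a matter of taste.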

> Perhaps CHICKEN 6 already offers the definitions and routines to make
> this code more readable?

It has no named constants for these either; it uses bare literals
just like my code.  All this code is isolated in utf.c, see
utf8_encode() and utf8_decode():
https://code.call-cc.org/cgi-bin/gitweb.cgi?p=chicken-core.git;a=blob;f=utf.c;h=cf7eafa9701850877542f8bfbf6f5517655f8d18;hb=refs/heads/utf%2Br7rs#l3179

> I will try to install CHICKEN 6 and actually run the code with your patch 
> soon.

Thanks, let me know if you see any further improvements we could make.

Note, for getting the utf+r7rs branch (that's the CHICKEN 6 to be) to
work, you must first build from the utf+r7rs-bootstrap branch.

Cheers,
Peter
