Re: Using libutf8proc for svn_utf_cstring_utf8_width()

Timofei Zhakov Wed, 20 May 2026 02:22:48 -0700

On Tue, May 19, 2026 at 10:18 PM Timofei Zhakov <[email protected]> wrote:
>
> On Tue, May 19, 2026 at 9:42 PM Evgeny Kotkov
> <[email protected]> wrote:
> >
> > Timofei Zhakov <[email protected]> writes:
> >
> > > Committed the test in r1934405 and the rest of the patch in r1934407.
> > > Removed svn_utf__cstring_is_valid() check in r1934408.
> >
> > Thanks for the change!  I like the idea, and the implementation looks
> > good to me overall.
> >
> > I'm a bit late to the party, but here are a few minor comments and
> > suggestions regarding the code:
> >
> > +  /* Convert the UTF-8 string to UTF-32 (UCS4) which is the format
> > +   * utf8proc_charwidth() expects, and get the width of each character.
> > +   * We don't need much error checking since the input is valid UTF-8. */
> > +  while (*cstr)
> >
> > After r1934408, the comment stating that "the input is valid UTF-8" seems
> > outdated, because we no longer check for valid UTF-8 in advance.
>
> Oh, this's a good catch, thanks. I'll fix the comment.


Committed in r1934427.

>
> > +      int nbytes = utf8proc_iterate((apr_byte_t*)cstr, -1, &ucs);
> >
> > Since `cstr` is a `const char *`, I think this cast unnecessarily discards
> > constness.
> >
> > Also, because utf8proc_iterate() expects a `const utf8proc_uint8_t *`,
> > I'd say that casting directly to that target type is safer than using
> > `apr_byte_t *`, because the latter could theoretically denote a
> > different underlying type.

Committed in r1934429.

> >
> > A couple of similar observations:
> >
> > - utf8proc_iterate() returns a utf8proc_ssize_t.  Assigning this to
> >   `int nbytes` risks truncation in environments where int is smaller
> >   than utf8proc_ssize_t:
> >
> > +      int nbytes = utf8proc_iterate((apr_byte_t*)cstr, -1, &ucs);
> >

Well technically it only returns 1,2,3,4, or -3, but I agree that it
should be better to use this type anyway.

Committed in r1934430.

> > - utf8proc_iterate() puts the codepoint result in a utf8proc_int32_t,
> >   whereas we currently type it as an apr_int32_t:

Committed in r1934428.

> >
> > +      apr_int32_t ucs;
> >
> >   While those types are almost certainly compatible, I think that changing
> >   it to `utf8proc_int32_t` would guarantee that we use a correct type.
> >   It would also ensure that `utf8proc_charwidth(utf8proc_int32_t c)` gets
> >   called with an argument of the exactly expected type.
>
> I'm pretty sure I copied this code from another function in the same
> file, so I guess this is an API misuse that needs to be fixed in all
> those places (svn_utf__fuzzy_escape is the only other place).

-- 
Timofei Zhakov

Re: Using libutf8proc for svn_utf_cstring_utf8_width()

Reply via email to