On Sun, May 17, 2026 at 12:16 PM Ivan Zhakov <[email protected]> wrote:
> On Sat, 16 May 2026 at 17:08, Timofei Zhakov <[email protected]> wrote: > >> On Thu, May 14, 2026 at 7:40 PM Ivan Zhakov <[email protected]> wrote: >> >>> On Thu, 14 May 2026 at 14:13, Timofei Zhakov <[email protected]> wrote: >>> >>>> This function counts real printable UTF characters in a string. It >>>> currently contains a table of all patterns that is manually checked. I >>>> believe it was stolen from elsewhere a long time ago. Before we had >>>> utf8proc as a required dependency. >>>> >>>> I have a few reasons to rewrite it to use the library instead; >>>> >>>> 1. I'm pretty sure nobody would ever care to update the dataset. On >>>> the other hand, utf8proc bundles all available information about the >>>> latest Unicode version that is supported on the current platform. >>>> >>>> 2. There is also a property that defines *display* width, that >>>> basically makes symbols like emojis wider than normal characters even >>>> on monospace fonts. >>>> >>>> (For context I want to fix indentation in places throughout our >>>> cmdline like the authors in 'svn list -v' that mess up the tables. >>>> This is where a function like that will be useful.) >>>> >>>> 3. Cleanup redundant code. >>>> >>>> 4. It might be slightly faster to use their dataset because utf8proc >>>> only accesses a table in static memory twice (for address and then >>>> retrieves properties) instead of binary searching and checking all >>>> ranges. Maybe it's slower though idk. >>>> >>>> Thoughts? >>>> >>>> Sounds good to me. >>> >>> Regarding potential performance regression: is it something we can >>> measure? As far as I understand svn_utf_cstring_utf8_width() is not used >>> for performance critical code, but it would be nice to know if there is >>> significant performance regression anyway. >>> >>> >> Well, I guess it was a bad argument because the utf8proc version looks >> well optimised anyway and we already use in multiple places. >> >> I'm done with the first version of a patch to make this switch. Please >> find it in the attachments. I also added a unit test to confirm. >> >> > The patch looks good to me. > > A few minor suggestions: > 1. I suggest to split patch in separate commits: one with test, another > with actual code change and adjusting test expectation. This would help to > compare the difference. > > Okay. > 2. I suggest to add test case for Byte Order Mark (BOM): > > const char *bom = "\xEF\xBB\xBF" "abc"; > > That's an interesting idea. Done in r1934406. > > 3. I agree that call to svn_utf__cstring_is_valid() can be removed since > utf8proc_iterate() checks for invalid UTF-8 sequence. > > Thanks! > > I think since this function depends on utf8proc it should be implemented >> in utf8proc.c file. Hence utf_width.c can be removed entirely. I guess the >> main reason it was in a separate file was to keep the dataset away from >> other code. >> >> By the way, I also found that before traversing the string, this function >> calls svn_utf__cstring_is_valid() to check validity. I >> think utf8proc_iterate() does the same check, but I really am not >> sure. svn_utf__cstring_is_valid() skips all first octets with values >0x80. >> Then it basically does the same work utf8proc_iterate() would. >> >> Committed the test in r1934405 and the rest of the patch in r1934407. Removed svn_utf__cstring_is_valid() check in r1934408. Thanks to everyone! -- Timofei Zhakov

