On Sun, May 17, 2026 at 12:16 PM Ivan Zhakov <[email protected]> wrote:

> On Sat, 16 May 2026 at 17:08, Timofei Zhakov <[email protected]> wrote:
>
>> On Thu, May 14, 2026 at 7:40 PM Ivan Zhakov <[email protected]> wrote:
>>
>>> On Thu, 14 May 2026 at 14:13, Timofei Zhakov <[email protected]> wrote:
>>>
>>>> This function counts real printable UTF characters in a string. It
>>>> currently contains a table of all patterns that is manually checked. I
>>>> believe it was stolen from elsewhere a long time ago. Before we had
>>>> utf8proc as a required dependency.
>>>>
>>>> I have a few reasons to rewrite it to use the library instead;
>>>>
>>>> 1. I'm pretty sure nobody would ever care to update the dataset. On
>>>> the other hand, utf8proc bundles all available information about the
>>>> latest Unicode version that is supported on the current platform.
>>>>
>>>> 2. There is also a property that defines *display* width, that
>>>> basically makes symbols like emojis wider than normal characters even
>>>> on monospace fonts.
>>>>
>>>> (For context I want to fix indentation in places throughout our
>>>> cmdline like the authors in 'svn list -v' that mess up the tables.
>>>> This is where a function like that will be useful.)
>>>>
>>>> 3. Cleanup redundant code.
>>>>
>>>> 4. It might be slightly faster to use their dataset because utf8proc
>>>> only accesses a table in static memory twice (for address and then
>>>> retrieves properties) instead of binary searching and checking all
>>>> ranges. Maybe it's slower though idk.
>>>>
>>>> Thoughts?
>>>>
>>>> Sounds good to me.
>>>
>>> Regarding potential performance regression: is it something we can
>>> measure? As far as I understand svn_utf_cstring_utf8_width() is not used
>>> for performance critical code, but it would be nice to know if there is
>>> significant performance regression anyway.
>>>
>>>
>> Well, I guess it was a bad argument because the utf8proc version looks
>> well optimised anyway and we already use in multiple places.
>>
>> I'm done with the first version of a patch to make this switch. Please
>> find it in the attachments. I also added a unit test to confirm.
>>
>>
> The patch looks good to me.
>
> A few minor suggestions:
> 1. I suggest to split patch in separate commits: one with test, another
> with actual code change and adjusting test expectation. This would help to
> compare the difference.
>
>
Okay.


> 2. I suggest to add test case for Byte Order Mark (BOM):
>
>   const char *bom = "\xEF\xBB\xBF" "abc";
>
>
That's an interesting idea. Done in r1934406.

>
> 3. I agree that call to  svn_utf__cstring_is_valid() can be removed since
> utf8proc_iterate() checks for invalid UTF-8 sequence.
>
>
Thanks!
>
> I think since this function depends on utf8proc it should be implemented
>> in utf8proc.c file. Hence utf_width.c can be removed entirely. I guess the
>> main reason it was in a separate file was to keep the dataset away from
>> other code.
>>
>> By the way, I also found that before traversing the string, this function
>> calls svn_utf__cstring_is_valid() to check validity. I
>> think utf8proc_iterate() does the same check, but I really am not
>> sure. svn_utf__cstring_is_valid() skips all first octets with values >0x80.
>> Then it basically does the same work utf8proc_iterate() would.
>>
>>
Committed the test in r1934405 and the rest of the patch in r1934407.
Removed svn_utf__cstring_is_valid() check in r1934408.

Thanks to everyone!

-- 
Timofei Zhakov

Reply via email to