Re: [vim/vim] Not able to convert between byte index and UTF indices (PR #12216)

Bram Moolenaar Thu, 20 Apr 2023 11:38:55 -0700


Yegappan wrote:


> @yegappan pushed 2 commits.
> 
> 87c7f0f888bd61604659930276973374dc408e92  Add the utf16idx() function
> and add UTF-16 flag to the byteidx() and byteidxcomp() functions
> 84147e31e7f05403bfaab20ccb7689c74a87befb  Add support for converting
> from byte or character index in a string to UTF-16 index and vice
> versa

This looks like the right way to do this, but I find the help a bit
difficult to interpret.  I hope others, especially those who want to use
the functionality, have a good look and make comments if something is
missing or unclear.

For byteidx() there is an extra argument which, when TRUE, changes how
the {nr} argument is interpreted:

                When {utf16} is TRUE, {nr} is used as the UTF-16 index in the
                String {expr} instead of as the character index.

The first thing that is unclear: what is "the UTF-16 index"?  From the
discussion we had I understand it to be the index in the string when it
is encoded as UTF-16, thus counted in 16-bit words.  This should be
explained better.  I do not expect many to understand what UTF-16
encoding means.
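
To make "UTF-16 index" concrete, here is a small sketch in Python (not
Vim script, and not part of the patch) that counts, for each character
of the example string, how many UTF-8 bytes and how many 16-bit UTF-16
words it occupies.  The variable names are only for illustration.

```python
# Illustration only: per-character sizes of the example string.
s = "a😊😊"

per_char = [
    (
        ch,
        len(ch.encode("utf-8")),           # bytes in UTF-8
        len(ch.encode("utf-16-le")) // 2,  # 16-bit words in UTF-16
    )
    for ch in s
]
total_units = sum(units for _, _, units in per_char)

for ch, nbytes, units in per_char:
    print(ch, nbytes, units)
print("total UTF-16 words:", total_units)
```

The "a" takes one byte and one word, each emoji takes four bytes and two
words, so the UTF-16 indexes of this string run from 0 to 4.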

The examples are supposed to help understand this:

                        echo byteidx('a😊😊', 2)          returns 5
                        echo byteidx('a😊😊', 2, 1)       returns 1

However, this raises questions: why does the second call return 1?

For the first call I can compute the result: when {nr} is 2 then the
index of the third character is returned, thus the bytes of the first
two characters are added together.  These are 1 and 4, total 5.  You can
see the second character is 4 bytes by using "g8" on it.

With the second call the second character takes two UTF-16 words.
With {nr} being 2 we refer to the third UTF-16 word, thus halfway
through the second character.  This is apparently rounded down and only
the one byte for "a" is counted.

This rounding down is new and should be explained.  Perhaps adding this
explanation of how the two examples work is sufficient.  But it would be
good to add a third call that is more likely to occur in practice:

                        echo byteidx('a😊😊', 3, 1)       returns 5

This refers to the same character as the first call, thus has the same
return value.  This also makes clear (esp. for those who don't know
UTF-16 well) that a character can consist of two words.
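
The arithmetic behind these three calls can be sketched as follows.
This is Python, not the patch's C code, and the function name is
hypothetical.  It assumes the rounding down described above (an index
that falls inside a character resolves to the start of that character)
and assumes -1 for an index past the end of the string.

```python
def byte_index_from_utf16(s, utf16_idx):
    """Byte offset of the character containing UTF-16 word utf16_idx.

    Assumption: an index halfway through a character is rounded down
    to the start of that character; -1 means the index is out of range.
    """
    byte_off = 0
    unit_off = 0
    for ch in s:
        units = len(ch.encode("utf-16-le")) // 2
        if unit_off + units > utf16_idx:
            # utf16_idx falls on or inside this character
            return byte_off
        unit_off += units
        byte_off += len(ch.encode("utf-8"))
    # Index exactly at the end of the string, or beyond it
    return byte_off if unit_off == utf16_idx else -1


s = "a😊😊"
print(len(s[:2].encode("utf-8")))        # character index 2: 1 + 4 bytes
print(byte_index_from_utf16(s, 2))       # halfway through the 2nd character
print(byte_index_from_utf16(s, 3))       # start of the 3rd character
```

Under these assumptions the three results are 5, 1 and 5, matching the
values in the examples.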

For charidx() there is this example:

                        echo charidx('a😊😊', 4, 0, 1)    returns 3

I would think that index 4 is halfway through the third character, thus
I would expect a return value of 2.  Am I wrong?
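
A sketch of the reasoning, again in Python with a hypothetical function
name.  It only shows what the same rounding-down rule would give for a
character index; it says nothing about what the patch actually returns.

```python
def char_index_from_utf16(s, utf16_idx):
    """Index of the character containing UTF-16 word utf16_idx.

    Assumption: same rounding down as for the byte index; -1 means the
    index is out of range.
    """
    unit_off = 0
    for char_idx, ch in enumerate(s):
        units = len(ch.encode("utf-16-le")) // 2
        if unit_off + units > utf16_idx:
            return char_idx
        unit_off += units
    return len(s) if unit_off == utf16_idx else -1


# Words: 'a' is word 0, the first emoji is words 1-2, the second is 3-4.
print(char_index_from_utf16("a😊😊", 4))
```

With this rule UTF-16 index 4 lands in the third character, which has
character index 2.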

-- 
"I know that there are people who don't love their fellow man,
and I hate those people!" - Tom Lehrer

 /// Bram Moolenaar -- b...@moolenaar.net -- http://www.Moolenaar.net   \\\
///                                                                      \\\
\\\        sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ ///
 \\\            help me help AIDS victims -- http://ICCF-Holland.org    ///
