Re: [vim/vim] Not able to convert between byte index and UTF indices (PR #12216)

Yegappan Lakshmanan Thu, 20 Apr 2023 21:28:43 -0700

Hi Bram,

On Thu, Apr 20, 2023 at 11:38 AM Bram Moolenaar <[email protected]> wrote:
>
>
> Yegappan wrote:
>
> > @yegappan pushed 2 commits.
> >
> > 87c7f0f888bd61604659930276973374dc408e92  Add the utf16idx() function
> > and add UTF-16 flag to the byteidx() and byteidxcomp() functions
> > 84147e31e7f05403bfaab20ccb7689c74a87befb  Add support for converting
> > from byte or character index in a string to UTF-16 index and vice
> > versa
>
> This looks like the right way to do this, but I find the help a bit
> difficult to interpret.  I hope others, especially those who want to use
> the functionality, have a good look and make comments if something is
> missing or unclear.
>


These functions are mostly useful for LSP plugin developers.  I am going to
use them in the Vim9 LSP plugin.  Hopefully other LSP authors can comment
on these functions.

>
> For byteidx() there is an extra argument, which, when TRUE, makes the
> {nr} argument used differently:
>
>                 When {utf16} is TRUE, {nr} is used as the UTF-16 index in the
>                 String {expr} instead of as the character index.
>
> The first thing that is unclear: what is "the UTF-16 index"?  In the
> context of the discussion we had I can understand it is the index in the
> string when it is encoded with UTF-16, thus with 16 bit words.  This
> should be explained better.  I do not expect many to understand what
> UTF-16 encoding means.
>

I have updated the help text.  Let me know if this needs to be expanded further.

>
> The examples are supposed to help understand this:
>
>                         echo byteidx('a😊😊', 2)        returns 5
>                         echo byteidx('a😊😊', 2, 1)     returns 1
>
> However, this raises questions: why does the second call return 1?
>

The byteidx() function returns the index of the first byte of a character
(as you have mentioned below).  In the second call, the specified UTF-16
index refers to the second UTF-16 code unit of the second character in
the string, so the byte index of that character (1) is returned.

>
> For the first call I can compute the result: when {nr} is 2 then the
> index of the third character is returned, thus the bytes of the first
> two characters are added together.  These are 1 and 4, total 5.  You can
> see the second character is 4 bytes by using "g8" on it.
>
> With the second call the second character would take two UTF-16 words.
> With {nr} being 2 we refer to the third UTF-16 word, thus halfway through
> the second character.  This is apparently rounded down and only the one
> byte for "a" is counted.
>

Yes.

>
> This rounding down is new, it should be explained.  Perhaps adding this
> explanation of how the two examples work is sufficient.  But it would be
> good to add a third call that is more likely to happen:
>
>                         echo byteidx('a😊😊', 3, 1)     returns 5
>
> This refers to the same character as the first call, thus has the same
> return value.  This also makes clear (esp. for those who don't know
> UTF-16 well) that a character can consist of two words.
>

I have updated the help with this example and added a note about the
round-down.
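
For readers who want to check the arithmetic behind the three byteidx()
calls, here is a rough model in Python.  It is only a sketch of the
round-down behaviour discussed above, not the Vim implementation: the
function name byteidx_utf16 is made up for illustration, composing
characters are ignored, and the handling of an index at or past the end of
the string is an assumption of this sketch.

```python
def byteidx_utf16(s: str, nr: int) -> int:
    """Hypothetical model: UTF-8 byte offset of the character that
    contains UTF-16 code unit number nr (zero-based)."""
    byte_off = 0   # UTF-8 bytes consumed so far
    unit = 0       # UTF-16 code units consumed so far
    for ch in s:
        # Characters above U+FFFF take two UTF-16 code units.
        width = 2 if ord(ch) > 0xFFFF else 1
        if nr < unit + width:
            # nr falls inside this character: round down to its first byte.
            return byte_off
        unit += width
        byte_off += len(ch.encode("utf-8"))
    # Assumption of this sketch: an index exactly at the end gives the
    # byte length, anything beyond gives -1.
    return byte_off if nr == unit else -1


text = "a😊😊"
# Character index 2: bytes of the first two characters, 1 + 4.
print(len(text[:2].encode("utf-8")))   # 5
# UTF-16 index 2 is the second code unit of the first emoji.
print(byteidx_utf16(text, 2))          # 1
# UTF-16 index 3 is the first code unit of the second emoji.
print(byteidx_utf16(text, 3))          # 5
```

The model walks the string once and keeps two running counters, which is
enough to show why indexes 1 and 2 both map to byte 1 while index 3 maps to
byte 5.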

>
> For charidx() there is this example:
>
>                         echo charidx('a😊😊', 4, 0, 1)  returns 3
>
> I would think that index 4 is halfway through the third character, thus I
> would expect a return value of 2.  Am I wrong?
>

Good catch.  The example is wrong.  It does return 2.  I have updated the help.
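
The corrected value can be checked with a small Python model of the UTF-16
to character index mapping.  This is a sketch under the same reading of
"UTF-16 index" used in this thread, not the Vim code: the name
charidx_utf16 is hypothetical, composing characters are ignored, and what
happens at or past the end of the string is a choice of this sketch rather
than something taken from the help.

```python
def charidx_utf16(s: str, idx: int) -> int:
    """Hypothetical model: index of the character that contains
    UTF-16 code unit number idx (zero-based)."""
    unit = 0  # UTF-16 code units consumed so far
    for char_index, ch in enumerate(s):
        # Characters above U+FFFF take two UTF-16 code units.
        width = 2 if ord(ch) > 0xFFFF else 1
        if idx < unit + width:
            return char_index
        unit += width
    # Choice of this sketch: exactly at the end gives the character
    # count, anything beyond gives -1.
    return len(s) if idx == unit else -1


text = "a😊😊"
# Code units: 'a' is unit 0, the first emoji is units 1-2, the second is 3-4.
for idx in range(5):
    print(idx, charidx_utf16(text, idx))
# UTF-16 index 4 is the second unit of the third character, so this prints
# "4 2" on the last line.
```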

Regards,
Yegappan
