Hi Bram,
On Thu, Apr 20, 2023 at 11:38 AM Bram Moolenaar <[email protected]> wrote:
>
>
> Yegappan wrote:
>
> > @yegappan pushed 2 commits.
> >
> > 87c7f0f888bd61604659930276973374dc408e92 Add the utf16idx() function
> > and add UTF-16 flag to the byteidx() and byteidxcomp() functions
> > 84147e31e7f05403bfaab20ccb7689c74a87befb Add support for converting
> > from byte or character index in a string to UTF-16 index and vice
> > versa
>
> This looks like the right way to do this, but I find the help a bit
> difficult to interpret. I hope others, especially those who want to use
> the functionality, have a good look and make comments if something is
> missing or unclear.
>
These functions are mostly useful for LSP plugin developers. I am going to
use them in the Vim9 LSP plugin. Hopefully other LSP plugin authors can
comment on these functions.
>
> For byteidx() there is an extra argument, which, when TRUE, makes the
> {nr} argument used differently:
>
> When {utf16} is TRUE, {nr} is used as the UTF-16 index in the
> String {expr} instead of as the character index.
>
> The first thing that is unclear: what is "the UTF-16 index"? In the
> context of the discussion we had I can understand it is the index in the
> string when it is encoded with UTF-16, thus with 16 bit words. This
> should be explained better. I do not expect many to understand what
> UTF-16 encoding means.
>
I have updated the help text. Let me know if this needs to be expanded further.
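
To make "UTF-16 index" concrete, here is a rough illustration in plain
Python (not Vim script, and not how Vim implements it). The helper name
unit_sizes is made up for this mail. It lists, for each character of the
example string, how many bytes it takes in UTF-8 and how many 16-bit code
units it takes in UTF-16:

```python
def unit_sizes(s):
    """Return (utf8_bytes, utf16_code_units) for each character of s.

    Illustration only: Python iterates over code points, so composing
    characters are listed separately from their base character.
    """
    sizes = []
    for ch in s:
        utf8_bytes = len(ch.encode("utf-8"))
        # "utf-16-le" adds no byte-order mark; each code unit is 2 bytes.
        utf16_units = len(ch.encode("utf-16-le")) // 2
        sizes.append((utf8_bytes, utf16_units))
    return sizes


for utf8_bytes, utf16_units in unit_sizes("a😊😊"):
    print(utf8_bytes, "UTF-8 byte(s),", utf16_units, "UTF-16 code unit(s)")
```

The emoji takes 4 bytes in UTF-8 but 2 code units in UTF-16, so byte
indexes, character indexes and UTF-16 indexes into this string all differ.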
>
> The examples are supposed to help understand this:
>
> echo byteidx('a😊😊', 2) returns 5
> echo byteidx('a😊😊', 2, 1) returns 1
>
> However, this raises questions: why does the second call return 1?
>
The byteidx() function returns the index of the first byte of a character
(as you have mentioned below). In the second call, the specified UTF-16
index refers to the second UTF-16 code unit of the second character in
the string, so the byte index of the start of that character (1) is
returned.
>
> For the first call I can compute the result: when {nr} is 2 then the
> index of the third character is returned, thus the bytes of the first
> two characters are added together. These are 1 and 4, total 5. You can
> see the second character is 4 bytes by using "g8" on it.
>
> With the second call the second character would take two UTF-16 words.
> With {nr} being 2 we refer to the third UTF-16 word, thus halfway the
> second character. This is apparently rounded down and only the one byte
> for "a" is counted.
>
Yes.
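
The round-down can be modelled with a small Python sketch. This is only a
model of the behavior discussed here, not the Vim source; the function name
byteidx_utf16 is hypothetical, and the handling of a negative or
past-the-end index is my assumption, mirroring what byteidx() documents for
character indexes:

```python
def byteidx_utf16(s, nr):
    """Model of byteidx(s, nr, 1): byte index of the character that
    contains UTF-16 code unit number nr (rounded down to its first byte)."""
    if nr < 0:
        return -1  # assumption: invalid index
    byte_pos = 0   # UTF-8 byte index of the current character
    unit_pos = 0   # UTF-16 index of the current character's first unit
    for ch in s:
        units = len(ch.encode("utf-16-le")) // 2
        if nr < unit_pos + units:
            # nr falls on this character, possibly on its second code
            # unit: report where the character starts.
            return byte_pos
        unit_pos += units
        byte_pos += len(ch.encode("utf-8"))
    # assumption: an index exactly at the end gives the byte length,
    # anything beyond that gives -1
    return byte_pos if nr == unit_pos else -1


print(byteidx_utf16("a😊😊", 2))  # 1: halfway the second character
print(byteidx_utf16("a😊😊", 3))  # 5: start of the third character
```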
>
> This rounding down is new, it should be explained. Perhaps adding this
> explanation of how the two examples work is sufficient. But it would be
> good to add a third call that is more likely to happen:
>
> echo byteidx('a😊😊', 3, 1) returns 5
>
> This refers to the same character as the first call, thus has the same
> return value. This also makes clear (esp. for those who don't know
> UTF-16 well) that a character can consist of two words.
>
I have updated the help with this example and added a note about the
round-down.
>
> For charidx() there is this example:
>
> echo charidx('a😊😊', 4, 0, 1) returns 3
>
> I would think that index 4 is halfway the third character, thus I would
> expect a return value of 2. Am I wrong?
>
Good catch. The example is wrong. It does return 2. I have updated the help.
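
The same kind of Python model shows why the answer is 2. Again this is a
sketch and not the Vim implementation: charidx_utf16 is a hypothetical
name, it counts code points (so it ignores the {countcc} distinction, which
does not matter for this string), and the end-of-string handling is an
assumption:

```python
def charidx_utf16(s, idx):
    """Model of charidx(s, idx, 0, 1): index of the character that
    contains UTF-16 code unit number idx."""
    if idx < 0:
        return -1  # assumption: invalid index
    unit_pos = 0   # UTF-16 index of the current character's first unit
    for char_index, ch in enumerate(s):
        units = len(ch.encode("utf-16-le")) // 2
        if idx < unit_pos + units:
            return char_index
        unit_pos += units
    # assumption: an index exactly at the end gives the character count,
    # anything beyond that gives -1
    return len(s) if idx == unit_pos else -1


# Code units: 'a' is 0, the first emoji is 1-2, the second emoji is 3-4.
print(charidx_utf16("a😊😊", 4))  # 2
```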

Regards,
Yegappan