Re: [vim/vim] Not able to convert between byte index and UTF indices (PR #12216)

Yegappan Lakshmanan Wed, 12 Apr 2023 21:52:11 -0700

Hi Bram,

On Wed, Apr 12, 2023 at 10:36 AM Bram Moolenaar <[email protected]> wrote:


>
> Yegappan wrote:
>
> > The language server protocol supports specifying offsets in text
> > documents using UTF-8 or UTF-16 or UTF-32 code units.
> > The UTF-16 code unit is the default.
> >
> > https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocuments
> >
> > Different language servers have different levels of support for using
> > the different code units. Vim uses the UTF-32 code units for the
> > offsets. This makes it difficult to support different language
> > servers from a Vim LSP plugin.
> >
> > Port the strutfindex() and strbyteindex() functions from Neovim to
> > support this.
>
> I find the function names hard to read and confusing. We might be able
> to think of better names when the exact functionality is described.
>
> The terminology is confusing. "UTF-32 byte index" contradicts itself,
> since each character is four bytes. I think what is meant is "UTF-32
> encoded character index", which is equal to "character index", since
> there is no Unicode character that takes more than one UTF-32 code
> unit.
>
> In Vim all Unicode characters are internally encoded with UTF-8. Thus
> the "{string}" argument of strbyteindex() will be UTF-8 encoded. This
> is also confusing. The help should be clearer about what this means
> exactly. I'm not sure how: saying something like "the character index
> of {string} if it were encoded with UTF-32" makes it complex. I
> think that instead of using "UTF-32 index" we can just use "character
> index", and somewhere mention that "UTF-32" can be considered the same
> (if we need to mention this at all, since the term "UTF-32" isn't widely
> used).
>
> For "UTF-16" it gets more complicated, we can't avoid mentioning that
> the index applies to "{string}" encoded as UTF-16. Looking back UTF-16
> should have never been made a standard IMHO, but it exists and it is
> used (especially on MS-Windows), thus we need to support it.
>
> Conversion between UTF-8 and character index already exists, you can use
> charidx() and byteidx()/byteidxcomp(). Possibly we only need to add
> functions to convert between UTF-8 and UTF-16 indexes? Or between
> character (UTF-32) and UTF-16 indexes? The latter makes more sense.
>

What about introducing a function that converts a character index in a
string to a UTF-16 index?

utf16idx({string}, {idx} [, {countcc}])

This is similar to the existing charidx() function.  Here {idx} specifies
the character index in {string}, and the function returns the corresponding
UTF-16 index.
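
To illustrate the arithmetic only, here is a rough Python sketch of the
character index to UTF-16 index mapping.  It is not Vim code: the name
utf16idx_sketch() is hypothetical, returning -1 for an out-of-range index
is an assumption, and composing characters are counted as separate
characters (Python indexes strings by code point).

```python
def utf16idx_sketch(s: str, char_idx: int) -> int:
    """Return the UTF-16 code unit offset of the character at char_idx.

    Hypothetical helper for illustration; returns -1 when char_idx is
    out of range (an assumption, not a documented Vim behaviour).
    """
    if char_idx < 0 or char_idx > len(s):
        return -1
    # A code point above U+FFFF is encoded as a surrogate pair,
    # so it takes two UTF-16 code units; everything else takes one.
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in s[:char_idx])


if __name__ == "__main__":
    text = "a\U0001F60Ab"  # 'a', an emoji outside the BMP, 'b'
    for i in range(len(text) + 1):
        print(i, utf16idx_sketch(text, i))
```

The character after the emoji has character index 2 but UTF-16 index 3,
which is the mismatch an LSP plugin has to account for.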

To convert from a UTF-16 index to a character index, we can either introduce
a new function or modify the existing charidx() function to accept an
additional boolean argument.  If this argument is specified, then {idx} is a
UTF-16 index instead of a byte index.  If we go with a new function, what do
you think of the name utf16tocharidx()?
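
For the reverse direction, a rough Python sketch of the mapping (again not
Vim code; the name utf16_to_charidx_sketch() is hypothetical, and both the
-1 return value and the treatment of an index that falls inside a surrogate
pair are assumptions):

```python
def utf16_to_charidx_sketch(s: str, utf16_idx: int) -> int:
    """Return the character index containing UTF-16 code unit utf16_idx.

    Hypothetical helper for illustration. An index pointing at the second
    half of a surrogate pair maps to the character that owns the pair.
    Returns -1 when utf16_idx is out of range (an assumption).
    """
    if utf16_idx < 0:
        return -1
    units = 0  # UTF-16 code units consumed so far
    for char_idx, ch in enumerate(s):
        width = 2 if ord(ch) > 0xFFFF else 1
        if utf16_idx < units + width:
            return char_idx
        units += width
    # An index exactly at the end of the string maps to len(s).
    return len(s) if utf16_idx == units else -1


if __name__ == "__main__":
    text = "a\U0001F60Ab"  # 'a', an emoji outside the BMP, 'b'
    for i in range(6):
        print(i, utf16_to_charidx_sketch(text, i))
```

Whether charidx() grows an extra argument or a separate function is added,
the underlying mapping would be the same; only the call site differs.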

- Yegappan


>
> It should also be possible to specify the handling of composing
> characters. Either as an argument, like with charidx(), or using
> separate functions, as with byteidx()/byteidxcomp().