Re: [vim/vim] Not able to convert between byte index and UTF indices (PR #12216)

Bram Moolenaar Tue, 02 May 2023 16:39:13 -0700


[resend, picky postmaster refused the message]



Yegappan wrote:

> > > The language server protocol supports specifying offsets in text
> > > documents using UTF-8 or UTF-16 or UTF-32 code units.
> > > The UTF-16 code unit is the default.
> > >
> > >
> > > https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocuments
> > >
> > > Different language servers have different levels of support for using
> > > the different code units. Vim uses the UTF-32 code units for the
> > > offsets. This makes it difficult to support different language
> > > servers from a Vim LSP plugin.
> > >
> > > Port the strutfindex() and strbyteindex() functions from Neovim to
> > > support this.
> >
> > I find the function names hard to read and confusing. We might be able
> > to think of better names when the exact functionality is described.
> >
> > The terminology is confusing. "UTF-32 byte index" contradicts itself,
> > since each character is four bytes. I think what is meant is "UTF-32
> > encoded character index", which is equal to "character index", since
> > there is no Unicode character that takes more than one UTF-32 code
> > unit.
> >
> > In Vim all Unicode characters are internally encoded with UTF-8. Thus
> > the "{string}" argument of strbyteindex() will be UTF-8 encoded. This
> > is also confusing. The help should be clearer about what this means
> > exactly. I'm not sure how, saying something like "the character index
> > of "{string}" if it would be encoded with UTF-32" makes it complex. I
> > think that instead of using "UTF-32 index" we can just use "character
> > index", and somewhere mention that "UTF-32" can be considered the same
> > (if we need to mention this at all, since the term "UTF-32" isn't widely
> > used).
> >
> > For "UTF-16" it gets more complicated, we can't avoid mentioning that
> > the index applies to "{string}" encoded as UTF-16. Looking back UTF-16
> > should have never been made a standard IMHO, but it exists and it is
> > used (especially on MS-Windows), thus we need to support it.
> >
> > Conversion between UTF-8 and character index already exists, you can use
> > charidx() and byteidx()/byteidxcomp(). Possibly we only need to add
> > functions to convert between UTF-8 and UTF-16 indexes? Or between
> > character (UTF-32) and UTF-16 indexes? The latter makes more sense.
> 
> What about introducing a function that converts a character index in a
> string to a UTF-16 index?
> 
> utf16idx({string}, {idx} [, {countcc}])
> 
> This is similar to the existing charidx() function.  The "idx" here
> specifies the character index in {string} and this function returns
> the corresponding UTF-16 index.

charidx() converts a byte index of a UTF-8 encoded string to a
character index.  This can't simply be changed to UTF-16, since we don't
support UTF-16 encoded strings.  We could (pretend to) convert the
string to UTF-16 and then apply {idx}.  But that is doing the opposite
of what you suggested.
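
To make the byte index versus character index distinction concrete, here
is a small sketch in Python (not Vim script, and not how Vim implements
it).  The helper name byte_to_char_index() and the sample string are made
up for illustration; the sketch only approximates what charidx() does and
ignores composing characters.

```python
def byte_to_char_index(s: str, byte_idx: int) -> int:
    """Map a byte index into the UTF-8 encoding of s to a character index.

    Hypothetical helper for illustration only. Returns -1 when byte_idx
    does not fall inside the string.
    """
    if byte_idx < 0:
        return -1
    offset = 0  # byte offset of the current character's first byte
    for char_idx, ch in enumerate(s):
        width = len(ch.encode("utf-8"))  # 1 to 4 bytes per character
        if byte_idx < offset + width:
            return char_idx
        offset += width
    return -1


# 'a' is 1 byte, U+00E9 is 2 bytes, U+1F60A is 4 bytes in UTF-8.
sample = "a\u00e9\U0001F60A"
for i in range(len(sample.encode("utf-8"))):
    print(i, byte_to_char_index(sample, i))
```

Any byte inside a multi-byte character maps to the same character index,
which is why the byte index and the character index diverge as soon as
the string contains non-ASCII text.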

> To convert from a UTF-16 index to a character index, we can either introduce
> a new function or modify the existing charidx() function to accept an
> additional boolean argument.  If this argument is specified, then
> {idx} is a UTF-16 index instead of a byte index.  If we are going with
> a new function for this, what do you think about naming the function
> as utf16tocharidx()?

The function still returns a character index, thus using "charidx" with
something appended works better.  At least then they sort next to each
other.

For the other direction we need an equivalent of byteidx().  That could
be utf16idx() perhaps.
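
The two directions between a character index and a UTF-16 index could be
sketched as follows, again in Python rather than Vim script.  The function
names char_to_utf16_index() and utf16_to_char_index() are hypothetical and
do not reflect the final Vim API; the only fact relied on is that a
character above U+FFFF takes two UTF-16 code units (a surrogate pair) and
every other character takes one.

```python
def utf16_units(ch: str) -> int:
    """Number of UTF-16 code units needed for one character."""
    return 2 if ord(ch) > 0xFFFF else 1


def char_to_utf16_index(s: str, char_idx: int) -> int:
    """Hypothetical helper: character index -> UTF-16 code unit index."""
    if not 0 <= char_idx <= len(s):
        return -1
    return sum(utf16_units(ch) for ch in s[:char_idx])


def utf16_to_char_index(s: str, utf16_idx: int) -> int:
    """Hypothetical helper: UTF-16 code unit index -> character index.

    An index pointing at the second half of a surrogate pair maps to the
    character that the pair encodes.
    """
    if utf16_idx < 0:
        return -1
    offset = 0  # UTF-16 offset just past the current character
    for char_idx, ch in enumerate(s):
        offset += utf16_units(ch)
        if utf16_idx < offset:
            return char_idx
    return -1


# 'a' and 'b' take one UTF-16 unit each, U+1F60A takes two.
text = "a\U0001F60Ab"
print([char_to_utf16_index(text, i) for i in range(len(text) + 1)])
print([utf16_to_char_index(text, i) for i in range(4)])
```

For text limited to the Basic Multilingual Plane both indexes are
identical; they only differ once a character outside it appears earlier
in the string.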

-- 
ARTHUR:          What does it say?
BROTHER MAYNARD: It reads ... "Here may be found the last words of Joseph of
                 Aramathea." "He who is valorous and pure of heart may find
                 the Holy Grail in the aaaaarrrrrrggghhh..."
ARTHUR:          What?
BROTHER MAYNARD: "The Aaaaarrrrrrggghhh..."
                 "Monty Python and the Holy Grail" PYTHON (MONTY) PICTURES LTD

 /// Bram Moolenaar -- [email protected] -- http://www.Moolenaar.net   \\\
///                                                                      \\\
\\\        sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ ///
 \\\            help me help AIDS victims -- http://ICCF-Holland.org    ///
