Re: Getting the byte index (column) given the character column number

Dominique Pellé Mon, 21 Nov 2022 23:05:23 -0800

Yegappan Lakshmanan <[email protected]> wrote:
> Hi Bram,
>
> On Mon, Nov 21, 2022 at 2:17 PM Bram Moolenaar <[email protected]> wrote:
> >
> >
> > Yegappan wrote:
> >
> > > > > > > The language server protocol messages use character column number
> > > > > > > whereas many of the built-in Vim functions (e.g. matchaddpos())
> > > > > > > deal with byte column number.
> > > > > > >
> > > > > > > Several built-in functions were added to convert between the
> > > > > > > character and byte column numbers (byteidx(), charcol(), charidx(),
> > > > > > > getcharpos(), getcursorcharpos(), etc.).
> > > > > > > But these functions deal with strings, current cursor position or
> > > > > > > the position of a mark.
> > > > > > >
> > > > > > > We currently don't have a function to return the byte number
> > > > > > > given the character number in a line in a buffer.  The workaround
> > > > > > > is to use getbufline() to get the entire buffer line and then use
> > > > > > > byteidx() to get the byte number from the character number.
> > > > > > >
> > > > > > > I am thinking of introducing a new function named
> > > > > > > charcol2bytecol() that accepts a buffer number, line number and the
> > > > > > > character number in the line and returns the corresponding byte
> > > > > > > number.  Any suggestions/comments on this?
> > > > > > >
> > > > > > > We should also modify the matchaddpos() function to accept
> > > > > > > character numbers in a line in addition to the byte numbers.
> > > > > >
> > > > > > Just to make sure we understand what we are talking about: This is
> > > > > > always about text in a buffer?  Thus the buffer text is somehow
> > > > > > passed through the LSP to a server, which then returns information
> > > > > > with character indexes.
> > > > >
> > > > > Yes.  The location information returned by the LSP server is about the
> > > > > text in the buffer.
> > > > >
> > > > > > One detail that matters: Are composing characters counted
> > > > > > separately, or not counted (part of the base character)?
> > > > >
> > > > > I think composing characters are not counted.  But I couldn't find this
> > > > > mentioned in the LSP specification:
> > > > >
> > > > > https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#position
> > > >
> > > > Disappointing to not mention such an important part of the interface.
> > > > Since I do not see any mention of composing characters, I would guess
> > > > that each utf-8 character is counted separately.
> > > >
> > > > > > Also, I assume a Tab is counted as just one character, not the
> > > > > > number of display cells it occupies.
> > > > >
> > > > > Yes. Tab is counted as one character.
> > > > >
> > > > > > I wonder if it's really helpful to add a new function if it can
> > > > > > currently be done with two.  You already mention that the text can
> > > > > > be obtained with getbufline(), and then get the byte index from the
> > > > > > character index with byteidx().  What is the problem with doing it
> > > > > > that way?
> > > > >
> > > > > If the conversion has to be done too many times then it is not 
> > > > > efficient.
> > > >
> > > > How can you say that without trying?
> > >
> > > I used the attached Vim9 script to measure the performance of
> > > getbufline() + byteidx() compared to calling the col() function.  I see
> > > that the first one takes three times longer to get the column number
> > > compared to the second one.
> >
> > This must be because getbufline() always returns a list of strings.
> > Creating the list, adding a list item and then making a copy of the text
> > takes longer.  Using getline() (just to try it out, wouldn't work in
> > your actual code) brings the difference down to less than two times.
> >
> > Not storing the result of getbufline() in a variable, but passing it to
> > byteidx() with "->" also helps make it faster.
> >
> > The range should be bigger, I used 10x to get more stable results.  As a
> > rule of thumb: the profiling time should be at least 100 msec to avoid
> > too much fluctuation.
> >
> > After making some adjustments it is now only about 16% slower.
> > I'll make a patch to get getbufoneline(), since just getting the string
> > for one line would be very common and it is about twice as fast.
> >
> > The name getbufoneline() isn't nice, couldn't come up with something
> > better.  Should have called the existing function getbuflines() instead
> > of getbufline(), but we can't change that now.
> >
> > The resulting essential line in ProfByteIdxFunction():
> >
> >     idx = getbufoneline('', 5344)->byteidx(77)
> >
> > > > Getting the buffer line means making a copy of the text, that's
> > > > quite cheap.  The only added overhead is two function calls instead
> > > > of one, which has really minimal impact in the context of all the
> > > > other things being done.  Also, if there are multiple positions in
> > > > one line then getbufline() only needs to be called once, thus
> > > > performance should be very close to whatever function we would use
> > > > instead.
> > > >
> > > > > > Other message:
> > > > > >
> > > > > > > Another alternative is to extend the col() function.  The col()
> > > > > > > function currently accepts a list with two numbers (a line number
> > > > > > > and a byte number or "$") and returns the byte number.
> > > > > > > This can be modified to also accept a list with three numbers
> > > > > > > (line number, column number and a boolean indicating character
> > > > > > > column or byte column) and return the byte number.
> > > > > >
> > > > > > I don't like this, the first line for the col() help is:
> > > > > >
> > > > > >         The result is a Number, which is the byte index of the column
> > > > > >
> > > > > > When the boolean is true this would be the character index, that is
> > > > > > hard to explain.  A user would have to look really hard to find this
> > > > > > functionality.
> > > > >
> > > > > The boolean doesn't change the return value of the col() function.
> > > > > It just changes how the col() function interprets the column number
> > > > > in the list.  If it is true, then the col() function will use the
> > > > > column number as the character number.  If it is false or not
> > > > > specified, then the col() function will use it as the byte number.
> > > > > In both cases the col() function will always return the byte index
> > > > > of the column.
> > > >
> > > > I was confused.  Currently in the [lnum, col] value of {expr} the column
> > > > is the character offset.
> > >
> > > Currently in the [lnum, col] value of {expr}, the column is the byte
> > > offset.
> > > For example, if you use multibyte characters in a line and get the column
> > > number:
> > >
> > > =====================================================
> > > new
> > > call setline(1, "\u2345\u2346\u2347\u2348")
> > > echo col([1, 3])
> > > =====================================================
> > >
> > > The above script echoes 3 instead of 7.  The byte index of the third
> > > character is 7.
> >
> > Should really update the help to avoid the term "column number", it is
> > confusing.  The remark "Most useful when the column is "$"" is a hint
> > that is easily missed.
> >
> > OK, I finally see your point, sorry it took so long.
> >
> > Unfortunately, adding a third argument that is a flag, indicating whether
> > the second argument means bytes or characters, conflicts with other
> > places where the third argument is "coloff".  This is used with
> > virtcol() for example.
> >
> > You also still have the limitation that col() only works for the current
> > buffer.
> >
> > Making matchaddpos() accept a character index instead of a byte index is
> > going to trigger doing this in many more places.  And internally the
> > conversion will have to be done anyway.  Therefore sticking to using a
> > byte index in most places that deal with text avoids a lot of complexity
> > in the arguments of the functions.
> >
> > So let's go back to making the character index to byte index conversion
> > fast.  That is a generic solution and avoids changes all over the place.
> > Please try out the new getbufoneline() function, as mentioned above.
> >
>
> I tested the new getbufoneline() function and the performance is much
> better.  Thanks for adding this function.
>
> >
> > If the performance is indeed quite bad, adding a function that converts
> > a text location in a buffer specified by character index to a byte index
> > could be a solution.  Perhaps:
> >
> >    bufcol({buf}, {expr})             {expr} a string like with col()
> >    bufcol({buf}, {lnum}, {expr})     {expr} a string like with col()
> >    bufcol({buf}, {lnum}, {charidx})
> >
>
> For now, I think we can use the getbufoneline() and byteidx() functions.
> If another use case for this comes up in the future, we can add this.
>
> Regards,
> Yegappan


Related to this thread, the grammar checker LanguageTool has
changed its API [1] and now defines the position of errors as:
- an offset in Unicode characters from the beginning of the
  document (not from the beginning of the line! newlines \n are
  counted as 1 character)
- and length in Unicode characters.

This API change breaks my LanguageTool grammar checker
plugin [2] with the latest LanguageTool.

The LanguageTool API is poorly documented, but experimenting
with it, I see that a combining sequence such as
U+0065 + U+0301 (e-acute) is counted as 2 characters.
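
For comparison, Vim can count such a sequence either way. A minimal
sketch, assuming 'encoding' is utf-8: strchars() counts the composing
character separately by default, and skips it when the optional
{skipcc} argument is 1.

```vim
" Assumes 'encoding' is utf-8.
" 'e' followed by U+0301 COMBINING ACUTE ACCENT:
let s = "e\u0301"
" Echoes 2: the composing character is counted separately.
echo strchars(s)
" Echoes 1: with {skipcc} set, the composing character is skipped.
echo strchars(s, 1)
" Echoes 3: the length in bytes ('e' is 1 byte, U+0301 is 2 bytes).
echo strlen(s)
```

So the default strchars() counting matches what I observe from
LanguageTool for this sequence.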

I wonder whether Vim has suitable functions to find the
corresponding byte offset of a line/column with such input
data (i.e. Unicode character offset from start of file + Unicode
character length). At first glance, I did not see any suitable
Vim function.
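
One possible approach is to combine existing functions. The sketch
below is only that, a sketch: OffsetToPos() is a hypothetical helper
name, and it assumes that the offset is 0-based, that the text sent to
LanguageTool is the buffer lines joined with "\n", and that
LanguageTool counts one unit per Unicode code point (I have not
verified this for characters outside the BMP). It relies on
strchars() and byteidxcomp(), which both count composing characters
separately.

```vim
" Hypothetical helper (not a built-in): map a 0-based character offset
" from the start of the text to [lnum, bytecol], where bytecol is a
" 1-based byte column.  Each "\n" between lines counts as 1 character.
" Returns [0, 0] when the offset lies beyond the end of the text.
function! OffsetToPos(lines, offset) abort
  let l:remaining = a:offset
  let l:lnum = 1
  for l:text in a:lines
    " Without {skipcc}, strchars() counts composing characters separately.
    let l:nchars = strchars(l:text)
    if l:remaining <= l:nchars
      " byteidxcomp() also counts composing characters separately.
      return [l:lnum, byteidxcomp(l:text, l:remaining) + 1]
    endif
    " Skip this line and the newline that follows it.
    let l:remaining -= l:nchars + 1
    let l:lnum += 1
  endfor
  return [0, 0]
endfunction

" With lines ['ab', 'cd'], offset 3 is the 'c': echoes [2, 1].
echo OffsetToPos(['ab', 'cd'], 3)
```

For a real buffer the lines could come from getline(1, '$'), and the
end of an error could be found by calling the helper again with
offset + length.  Walking all lines for every error is linear in the
size of the buffer, so a plugin would probably want to cache the
per-line character counts.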

Regards
Dominique

[1] https://languagetool.org/http-api/#!/default/post_check
[2] https://github.com/dpelle/vim-LanguageTool
