Re: Getting the byte index (column) given the character column number

Bram Moolenaar Mon, 21 Nov 2022 14:17:59 -0800


Yegappan wrote:


> > > > > The language server protocol messages use character column number
> > > > > whereas many of the built-in Vim functions (e.g. matchaddpos()) deal
> > > > > with byte column number.
> > > > >
> > > > > Several built-in functions were added to convert between the character
> > > > > and byte column numbers (byteidx(), charcol(), charidx(),
> > > > > getcharpos(), getcursorcharpos(), etc.).
> > > > > But these functions deal with strings, current cursor position or the
> > > > > position of a mark.
> > > > >
> > > > > We currently don't have a function to return the byte number given the
> > > > > character number in a line in a buffer.  The workaround is to use
> > > > > getbufline() to get the entire buffer line and then use byteidx() to
> > > > > get the byte number from the character number.
> > > > >
> > > > > I am thinking of introducing a new function named charcol2bytecol()
> > > > > that accepts a buffer number, line number and the character number in
> > > > > the line and returns the corresponding byte number.  Any
> > > > > suggestions/comments on this?
> > > > >
> > > > > We should also modify the matchaddpos() function to accept
> > > > > character numbers in a line in addition to the byte numbers.
> > > >
> > > > Just to make sure we understand what we are talking about: This is
> > > > always about text in a buffer?  Thus the buffer text is somehow passed
> > > > through the LSP to a server, which then returns information with
> > > > character indexes.
> > >
> > > Yes.  The location information returned by the LSP server is about the
> > > text in the buffer.
> > >
> > > > One detail that matters: Are composing characters counted separately, or
> > > > not counted (part of the base character)?
> > >
> > > I think composing characters are not counted.  But I couldn't find this
> > > mentioned in the LSP specification:
> > >
> > > https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#position
> >
> > Disappointing to not mention such an important part of the interface.
> > Since I do not see any mention of composing characters, I would guess
> > that each utf-8 character is counted separately.
> >
> > > > Also, I assume a Tab is counted as just one character, not the number of
> > > > display cells it occupies.
> > >
> > > Yes. Tab is counted as one character.
> > >
> > > > I wonder if it's really helpful to add a new function if it can
> > > > currently be done with two.  You already mention that the text can be
> > > > obtained with getbufline(), and then get the byte index from the
> > > > character index with byteidx().  What is the problem with doing it that
> > > > way?
> > >
> > > If the conversion has to be done too many times then it is not efficient.
> >
> > How can you say that without trying?
> 
> I used the attached Vim9 script to measure the performance of
> getbufline() + byteidx() compared to calling the col() function.  I see
> that the first one takes three times longer to get the column number
> than the second one.

This must be because getbufline() always returns a list of strings.
Creating the list, adding a list item and then making a copy of the text
takes longer.  Using getline() (just to try it out, wouldn't work in
your actual code) brings the difference down to less than two times.

Not storing the result of getbufline() in a variable, but passing it to
byteidx() with "->" also helps make it faster.

The range should be bigger, I used 10x to get more stable results.  As a
rule of thumb: the profiling time should be at least 100 msec to avoid
too much fluctuation.
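
To illustrate the measurement, here is a minimal sketch of a timing loop.
The function name TimeConversion() and its arguments are made up for this
example, and it assumes Vim was built with the +reltime feature.  The idea
is to raise the repeat count until the reported time is at least 100 msec.

```vim
" Hypothetical helper, not part of Vim: repeat the character index to
" byte index conversion a:n times and return the elapsed milliseconds.
function! TimeConversion(buf, lnum, charidx, n) abort
  let start = reltime()
  for i in range(a:n)
    " Pass the line straight to byteidx() with "->" instead of storing
    " it in a variable first.
    let idx = getbufline(a:buf, a:lnum)->get(0, '')->byteidx(a:charidx)
  endfor
  return reltimefloat(reltime(start)) * 1000.0
endfunction
```

For example, ":echo TimeConversion(bufnr('%'), 1, 2, 100000)" shows the
time for 100000 conversions on the first line of the current buffer.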

After making some adjustments it is now only about 16% slower.
I'll make a patch to add getbufoneline(), since just getting the string
for one line would be very common and it is about twice as fast.

The name getbufoneline() isn't nice, couldn't come up with something
better.  Should have called the existing function getbuflines() instead
of getbufline(), but we can't change that now.

The resulting essential line in ProfByteIdxFunction():

    idx = getbufoneline('', 5344)->byteidx(77)
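
For reference, a minimal sketch of how that line could be wrapped in a
reusable function.  The name CharIdx2ByteIdx() is made up for this example
and is not an existing Vim function.  It falls back to getbufline() when
getbufoneline() is not available, and it assumes the buffer is loaded.

```vim
" Hypothetical helper, not part of Vim: convert a 0-based character
" index in line a:lnum of buffer a:buf to a 0-based byte index.
" byteidx() does not count composing characters separately and returns
" -1 when the character index is past the end of the text.
" A line that does not exist is treated as an empty string.
function! CharIdx2ByteIdx(buf, lnum, charidx) abort
  if exists('*getbufoneline')
    let text = getbufoneline(a:buf, a:lnum)
  else
    let text = get(getbufline(a:buf, a:lnum), 0, '')
  endif
  return byteidx(text, a:charidx)
endfunction
```

To get a 1-based byte column, as used by matchaddpos(), add one to a
non-negative result.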

> > Getting the buffer line means making a copy of the text, that's
> > quite cheap.  The only added overhead is two function calls instead
> > of one, which has really minimal impact in the context of all the
> > other things being done.  Also, if there are multiple positions in
> > one line then getbufline() only needs to be called once, thus
> > performance should be very close to whatever function we would use
> > instead.
> >
> > > > Other message:
> > > >
> > > > > Another alternative is to extend the col() function.  The col()
> > > > > function currently accepts a list with two numbers (a line number and
> > > > > a byte number or "$") and returns the byte number.
> > > > > This can be modified to also accept a list with three numbers (line
> > > > > number, column number and a boolean indicating character column or
> > > > > byte column) and return the byte number.
> > > >
> > > > I don't like this, the first line for the col() help is:
> > > >
> > > >         The result is a Number, which is the byte index of the column
> > > >
> > > > When the boolean is true this would be the character index, that is hard
> > > > to explain.  A user would have to look really hard to find this
> > > > functionality.
> > >
> > > The boolean doesn't change the return value of the col() function.  It
> > > just changes how the col() function interprets the column number in the
> > > list.  If it is true, then the col() function will use the column number
> > > as the character number.  If it is false or not specified, then the col()
> > > function will use it as the byte number.  In both cases the col()
> > > function will always return the byte index of the column.
> >
> > I was confused.  Currently in the [lnum, col] value of {expr} the column
> > is the character offset.
> 
> Currently in the [lnum, col] value of {expr}, the column is the byte offset.
> For example, if you use multibyte characters in a line and get the column
> number:
> 
> =====================================================
> new
> call setline(1, "\u2345\u2346\u2347\u2348")
> echo col([1, 3])
> =====================================================
> 
> The above script echoes 3 instead of 7.  The byte index of the third
> character is 7.
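
The arithmetic behind the 7 can be checked with a short sketch, assuming
'encoding' is "utf-8" so that each of the four characters takes three
bytes.  The first two characters occupy bytes 1 to 6, so the third one
starts at byte column 7.

```vim
new
call setline(1, "\u2345\u2346\u2347\u2348")
" col() takes the 3 as a byte column and returns it unchanged.
echo col([1, 3])
" byteidx() is 0-based: character index 2 is the third character.
" Adding one gives the 1-based byte column, which is 7.
echo byteidx(getline(1), 2) + 1
```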

Should really update the help to avoid the term "column number", it is
confusing.  The remark "Most useful when the column is "$"" is a hint
that is easily missed. 

OK, I finally see your point, sorry it took so long.

Unfortunately, adding a third argument that is a flag, indicating whether
the second argument means bytes or characters, conflicts with other
places where the third argument is "coloff".  This is used with
virtcol() for example.

You also still have the limitation that col() only works for the current
buffer.

Making matchaddpos() accept a character index instead of a byte index is
going to trigger doing this in many more places.  And internally the
conversion will have to be done anyway.  Therefore sticking to using a
byte index in most places that deal with text avoids a lot of complexity
in the arguments of the functions.

So let's go back to making the character index to byte index conversion
fast.  That is a generic solution and avoids changes all over the place.
Please try out the new getbufoneline() function, as mentioned above.
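
When there are several positions in one line, the text only needs to be
fetched once.  A minimal sketch of that idea, where the function name
CharIdxs2ByteIdxs() is made up for this example and getbufline() is used so
that it also works without getbufoneline():

```vim
" Hypothetical helper, not part of Vim: convert a list of 0-based
" character indexes in one buffer line to 0-based byte indexes.
" The line is fetched once and reused for every index.  An index past
" the end of the line gives -1.
function! CharIdxs2ByteIdxs(buf, lnum, charidxs) abort
  let text = get(getbufline(a:buf, a:lnum), 0, '')
  return map(copy(a:charidxs), {_, c -> byteidx(text, c)})
endfunction
```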

If the performance is indeed quite bad, adding a function that converts
a text location in a buffer specified by character index to a byte index
could be a solution.  Perhaps:

   bufcol({buf}, {expr})             {expr} a string like with col()
   bufcol({buf}, {lnum}, {expr})     {expr} a string like with col()
   bufcol({buf}, {lnum}, {charidx})


-- 
hundred-and-one symptoms of being an internet addict:
100. The most exciting sporting events you noticed during summer 1996
    was Netscape vs. Microsoft.

 /// Bram Moolenaar -- [email protected] -- http://www.Moolenaar.net   \\\
///                                                                      \\\
\\\        sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ ///
 \\\            help me help AIDS victims -- http://ICCF-Holland.org    ///

