Re: Getting the byte index (column) given the character column number

Bram Moolenaar Mon, 21 Nov 2022 03:23:54 -0800


Yegappan wrote:


> > > The language server protocol messages use character column number
> > > whereas many of the built-in Vim functions (e.g. matchaddpos()) deal
> > > with byte column number.
> > >
> > > Several built-in functions were added to convert between the character
> > > and byte column numbers (byteidx(), charcol(), charidx(),
> > > getcharpos(), getcursorcharpos(), etc.).
> > > But these functions deal with strings, current cursor position or the
> > > position of a mark.
> > >
> > > We currently don't have a function to return the byte number given the
> > > character number in a line in a buffer.  The workaround is to use
> > > getbufline() to get the entire buffer line and then use byteidx() to
> > > get the byte number from the character number.
> > >
> > > I am thinking of introducing a new function named charcol2bytecol()
> > > that accepts a buffer number, line number and the character number in
> > > the line and returns the corresponding byte number.  Any
> > > suggestions/comments on this?
> > >
> > > We should also modify the matchaddpos() function to accept
> > > character numbers in a line in addition to the byte numbers.
> >
> > Just to make sure we understand what we are talking about: This is
> > always about text in a buffer?  Thus the buffer text is somehow passed
> > through the LSP to a server, which then returns information with
> > character indexes.
> 
> Yes.  The location information returned by the LSP server is about the
> text in the buffer.
> 
> > One detail that matters: Are composing characters counted separately, or
> > not counted (part of the base character)?
> 
> I think composing characters are not counted.  But I couldn't find this
> mentioned in the LSP specification:
> 
> https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#position

Disappointing to not mention such an important part of the interface.
Since I do not see any mention of composing characters, I would guess
that each UTF-8 character is counted separately.
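
For reference, Vim's own string functions make the same distinction.
A minimal sketch, assuming 'encoding' is utf-8: byteidx() treats a
composing character as part of its base character, while byteidxcomp()
counts it separately.

```vim
" 'e' followed by U+0301 (combining acute accent): 1 + 2 = 3 bytes.
let s = 'e' .. nr2char(0x301)

" Result is 3: the composing character belongs to 'e'.
echo byteidx(s, 1)

" Result is 1: the composing character is counted on its own.
echo byteidxcomp(s, 1)

" Result is 3: the index just past the composing character.
echo byteidxcomp(s, 2)
```

Which of the two matches what a given LSP server sends is left open
here.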

> > Also, I assume a Tab is counted as just one character, not the number of
> > display cells it occupies.
> 
> Yes. Tab is counted as one character.
> 
> > I wonder if it's really helpful to add a new function if it can
> > currently be done with two.  You already mention that the text can be
> > obtained with getbufline(), and then get the byte index from the
> > character index with byteidx().  What is the problem with doing it that
> > way?
> 
> If the conversion has to be done too many times then it is not efficient.

How can you say that without trying?  Getting the buffer line means
making a copy of the text, that's quite cheap.  The only added overhead
is two function calls instead of one, which has really minimal impact in
the context of all the other things being done.  Also, if there are
multiple positions in one line then getbufline() only needs to be called
once, thus performance should be very close to whatever function we
would use instead.
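
The two-call approach could be sketched as follows.  The function names
CharIdx2ByteIdx() and CharIdxs2ByteIdxs() are hypothetical helpers, not
Vim built-ins, and the sketch assumes 0-based character indexes (as in
LSP positions) and a loaded buffer.

```vim
" Hypothetical helper: convert a 0-based character index in a buffer
" line to a 0-based byte index.  Returns -1 when the line cannot be
" read or the index is past the end of the line.
function! CharIdx2ByteIdx(bufnr, lnum, charidx) abort
  let lines = getbufline(a:bufnr, a:lnum)
  if empty(lines)
    return -1
  endif
  return byteidx(lines[0], a:charidx)
endfunction

" Hypothetical helper: convert several character indexes in the same
" line, fetching the line text only once.
function! CharIdxs2ByteIdxs(bufnr, lnum, charidxs) abort
  let lines = getbufline(a:bufnr, a:lnum)
  if empty(lines)
    return []
  endif
  let text = lines[0]
  return map(copy(a:charidxs), {_, idx -> byteidx(text, idx)})
endfunction

" Example: for a line containing 'aäb', character index 2 ('b') maps
" to byte index 3, because 'ä' takes two bytes in UTF-8.
"   :echo CharIdx2ByteIdx(bufnr('%'), 1, 2)
```

Note that matchaddpos() expects 1-based byte columns, so 1 would have
to be added to the result before passing it on.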

> > Other message:
> >
> > > Another alternative is to extend the col() function.  The col()
> > > function currently accepts a list with two numbers (a line number and
> > > a byte number or "$") and returns the byte number.
> > > This can be modified to also accept a list with three numbers (line
> > > number, column number and a boolean indicating character column or
> > > byte column) and return the byte number.
> >
> > I don't like this, the first line for the col() help is:
> >
> >         The result is a Number, which is the byte index of the column
> >
> > When the boolean is true this would be the character index, that is hard
> > to explain.  A user would have to look really hard to find this
> > functionality.
> 
> The boolean doesn't change the return value of the col() function.  It just
> changes how the col() function interprets the column number in the list.
> If it is true, then the col() function will use the column number as the
> character number.  If it is false or not specified, then the col() function
> will use it as the byte number.  In both cases the col() function will always
> return the byte index of the column.

I was confused.  Currently in the [lnum, col] value of {expr} the column
is the character offset.  Since you are converting from character offset
to byte index, I don't see how you would pass the byte index here, since
you'll get the same byte index back.  What would be the point in passing
[lnum, col, false]?  BTW, leaving out the flag must mean using the
column number (for backwards compatibility).


> > There is also charcol(), it appears to be doing what you want already.
> 
> The charcol() function returns the character number in a line.  This
> function cannot be used to get the byte index given the character
> index.

But then using col() would already work without any changes...


-- 
hundred-and-one symptoms of being an internet addict:
95. Only communication in your household is through email.

 /// Bram Moolenaar -- [email protected] -- http://www.Moolenaar.net   \\\
///                                                                      \\\
\\\        sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ ///
 \\\            help me help AIDS victims -- http://ICCF-Holland.org    ///
