Yegappan Lakshmanan <[email protected]> wrote:

> Hi Bram,
>
> On Mon, Nov 21, 2022 at 2:17 PM Bram Moolenaar <[email protected]> wrote:
> >
> >
> > Yegappan wrote:
> >
> > > > > > > The language server protocol messages use character column
> > > > > > > number whereas many of the built-in Vim functions (e.g.
> > > > > > > matchaddpos()) deal with byte column number.
> > > > > > >
> > > > > > > Several built-in functions were added to convert between the
> > > > > > > character and byte column numbers (byteidx(), charcol(),
> > > > > > > charidx(), getcharpos(), getcursorcharpos(), etc.).
> > > > > > > But these functions deal with strings, current cursor
> > > > > > > position or the position of a mark.
> > > > > > >
> > > > > > > We currently don't have a function to return the byte number
> > > > > > > given the character number in a line in a buffer. The
> > > > > > > workaround is to use getbufline() to get the entire buffer
> > > > > > > line and then use byteidx() to get the byte number from the
> > > > > > > character number.
> > > > > > >
> > > > > > > I am thinking of introducing a new function named
> > > > > > > charcol2bytecol() that accepts a buffer number, line number
> > > > > > > and the character number in the line and returns the
> > > > > > > corresponding byte number. Any suggestions/comments on this?
> > > > > > >
> > > > > > > We should also modify the matchaddpos() function to accept
> > > > > > > character numbers in a line in addition to the byte numbers.
> > > > > >
> > > > > > Just to make sure we understand what we are talking about: This
> > > > > > is always about text in a buffer? Thus the buffer text is
> > > > > > somehow passed through the LSP to a server, which then returns
> > > > > > information with character indexes.
> > > > >
> > > > > Yes. The location information returned by the LSP server is about
> > > > > the text in the buffer.
> > > > >
> > > > > > One detail that matters: Are composing characters counted
> > > > > > separately, or not counted (part of the base character)?
> > > > >
> > > > > I think composing characters are not counted. But I couldn't find
> > > > > this mentioned in the LSP specification:
> > > > >
> > > > > https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#position
> > > >
> > > > Disappointing to not mention such an important part of the
> > > > interface. Since I do not see any mention of composing characters,
> > > > I would guess that each utf-8 character is counted separately.
> > > >
> > > > > > Also, I assume a Tab is counted as just one character, not the
> > > > > > number of display cells it occupies.
> > > > >
> > > > > Yes. Tab is counted as one character.
> > > > >
> > > > > > I wonder if it's really helpful to add a new function if it can
> > > > > > currently be done with two. You already mention that the text
> > > > > > can be obtained with getbufline(), and then get the byte index
> > > > > > from the character index with byteidx(). What is the problem
> > > > > > with doing it that way?
> > > > >
> > > > > If the conversion has to be done too many times then it is not
> > > > > efficient.
> > > >
> > > > How can you say that without trying?
> > >
> > > I used the attached Vim9 script to measure the performance of
> > > getbufline() + byteidx() compared to calling the col() function.
> > > I see that the first one takes three times longer to get the column
> > > number compared to the second one.
> >
> > This must be because getbufline() always returns a list of strings.
> > Creating the list, adding a list item and then making a copy of the
> > text takes longer. Using getline() (just to try it out, wouldn't work
> > in your actual code) brings the difference down to less than two times.
> >
> > Not storing the result of getbufline() in a variable, but passing it to
> > byteidx() with "->" also helps make it faster.
> >
> > The range should be bigger, I used 10x to get more stable results. As a
> > rule of thumb: the profiling time should be at least 100 msec to avoid
> > too much fluctuation.
> >
> > After making some adjustments it is now only about 16% slower.
> > I'll make a patch to get getbufoneline(), since just getting the string
> > for one line would be very common and it is about twice as fast.
> >
> > The name getbufoneline() isn't nice, couldn't come up with something
> > better. Should have called the existing function getbuflines() instead
> > of getbufline(), but we can't change that now.
> >
> > The resulting essential line in ProfByteIdxFunction():
> >
> >     idx = getbufoneline('', 5344)->byteidx(77)
> >
> > > > Getting the buffer line means making a copy of the text, that's
> > > > quite cheap. The only added overhead is two function calls instead
> > > > of one, which has really minimal impact in the context of all the
> > > > other things being done. Also, if there are multiple positions in
> > > > one line then getbufline() only needs to be called once, thus
> > > > performance should be very close to whatever function we would use
> > > > instead.
> > >
> > > > > > Other message:
> > > > > >
> > > > > > > Another alternative is to extend the col() function. The
> > > > > > > col() function currently accepts a list with two numbers (a
> > > > > > > line number and a byte number or "$") and returns the byte
> > > > > > > number.
> > > > > > > This can be modified to also accept a list with three numbers
> > > > > > > (line number, column number and a boolean indicating
> > > > > > > character column or byte column) and return the byte number.
> > > > > >
> > > > > > I don't like this, the first line for the col() help is:
> > > > > >
> > > > > >     The result is a Number, which is the byte index of the
> > > > > >     column
> > > > > >
> > > > > > When the boolean is true this would be the character index,
> > > > > > that is hard to explain. A user would have to look really hard
> > > > > > to find this functionality.
> > > > >
> > > > > The boolean doesn't change the return value of the col() function.
> > > > > It just changes how the col() function interprets the column
> > > > > number in the list. If it is true, then the col() function will
> > > > > use the column number as the character number. If it is false or
> > > > > not specified, then the col() function will use it as the byte
> > > > > number. In both cases the col() function will always return the
> > > > > byte index of the column.
> > > >
> > > > I was confused. Currently in the [lnum, col] value of {expr} the
> > > > column is the character offset.
> > >
> > > Currently in the [lnum, col] value of {expr}, the column is the byte
> > > offset. For example, if you use multibyte characters in a line and
> > > get the column number:
> > >
> > > =====================================================
> > > new
> > > call setline(1, "\u2345\u2346\u2347\u2348")
> > > echo col([1, 3])
> > > =====================================================
> > >
> > > The above script echoes 3 instead of 7. The byte index of the third
> > > character is 7.
> >
> > Should really update the help to avoid the term "column number", it is
> > confusing. The remark "Most useful when the column is "$"" is a hint
> > that is easily missed.
> >
> > OK, I finally see your point, sorry it took so long.
> >
> > Unfortunately, adding a third argument that is a flag, indicating
> > whether the second argument means bytes or characters, conflicts with
> > other places where the third argument is "coloff". This is used with
> > virtcol() for example.
> >
> > You also still have the limitation that col() only works for the
> > current buffer.
> >
> > Making matchaddpos() accept a character index instead of a byte index
> > is going to trigger doing this in many more places. And internally the
> > conversion will have to be done anyway. Therefore sticking to using a
> > byte index in most places that deal with text avoids a lot of
> > complexity in the arguments of the functions.
> >
> > So let's go back to making the character index to byte index conversion
> > fast. That is a generic solution and avoids changes all over the place.
> > Please try out the new getbufoneline() function, as mentioned above.
> >
>
> I tested the new getbufoneline() function and the performance is much
> better. Thanks for adding this function.
>
> >
> > If the performance is indeed quite bad, adding a function that converts
> > a text location in a buffer specified by character index to a byte
> > index could be a solution. Perhaps:
> >
> >     bufcol({buf}, {expr})            {expr} a string like with col()
> >     bufcol({buf}, {lnum}, {expr})    {expr} a string like with col()
> >     bufcol({buf}, {lnum}, {charidx})
> >
>
> For now, I think we can use the getbufoneline() and byteidx() functions.
> If another use case for this comes up in the future, we can add this.
>
> Regards,
> Yegappan
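
For reference, the getbufoneline() plus byteidx() combination that the quoted thread settles on could be wrapped roughly as follows. This is only a sketch under assumptions: CharIdxToByteCol() is a hypothetical name (not an existing Vim function), the character index is taken to be 0-based and counted the way byteidx() counts (composing characters are not counted separately), 'encoding' is utf-8, and Vim is new enough to have getbufoneline().

```vim
" Hypothetical helper, not part of Vim: convert a 0-based character index
" in line {lnum} of buffer {buf} to a 1-based byte column.
" Returns -1 when the line has fewer characters than {charidx}.
function! CharIdxToByteCol(buf, lnum, charidx) abort
  " getbufoneline() returns the line as a String (no List is created)
  let text = getbufoneline(a:buf, a:lnum)
  " byteidx() gives the 0-based byte index of the {charidx}'th character
  let idx = byteidx(text, a:charidx)
  return idx < 0 ? -1 : idx + 1
endfunction
```

With the line from the quoted example ("\u2345\u2346\u2347\u2348", three bytes per character in utf-8), CharIdxToByteCol(bufnr('%'), 1, 2) should give 7, the byte column of the third character. If several positions fall on the same line, fetching the line once and calling byteidx() repeatedly would avoid the repeated lookup, as Bram points out.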
Related to this thread, the grammar checker LanguageTool has changed its
API [1] and now defines the position of errors as:

- an offset in Unicode characters from the beginning of the document
  (not from the beginning of the line! newlines \n are counted as
  1 character)
- and a length in Unicode characters.

This API change breaks my LanguageTool grammar checker plugin [2] with
the latest LanguageTool.

The LanguageTool API is poorly documented, but experimenting with it, I
see that combining Unicode characters such as U+0065 + U+0301 for
e-acute are counted as 2 characters.

I wonder whether Vim has suitable functions to find the corresponding
byte offset of a line/column with such input data (i.e. Unicode
character offset from start of file + Unicode character length). At
first glance, I did not see any suitable Vim function.

Regards
Dominique

[1] https://languagetool.org/http-api/#!/default/post_check
[2] https://github.com/dpelle/vim-LanguageTool
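
One way such a document-wide offset might be mapped to a line and byte column with existing functions is sketched below. It is a sketch, not a tested solution for LanguageTool: DocCharOffsetToPos() is a hypothetical name, and it assumes that the offset is 0-based, that every code point (including combining characters) counts as one character as observed above, that lines were joined with a single "\n" when the text was sent, that the buffer has not changed since, and that 'encoding' is utf-8.

```vim
" Hypothetical helper, not part of Vim: map a 0-based character offset
" from the start of buffer {buf} to [lnum, bytecol], both 1-based.
" Returns [] when the offset lies beyond the end of the buffer.
function! DocCharOffsetToPos(buf, offset) abort
  let remaining = a:offset
  let lnum = 1
  for text in getbufline(a:buf, 1, '$')
    " strchars() without {skipcc} counts composing characters separately
    let nchars = strchars(text)
    if remaining <= nchars
      " byteidxcomp() also counts composing characters separately;
      " remaining == nchars addresses the newline, just past the last byte
      return [lnum, byteidxcomp(text, remaining) + 1]
    endif
    " Skip the characters of this line plus its newline
    let remaining -= nchars + 1
    let lnum += 1
  endfor
  return []
endfunction
```

The end of an error could be located the same way by calling the helper with offset + length. Note that this walks the buffer from the first line on every call, so for many positions it would probably be worth caching the per-line character counts.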
