Re: Issue in match() function with multi-byte characters

Nikolay Pavlov Sun, 30 Mar 2014 00:41:26 -0700

On Mar 30, 2014 5:54 AM, "Andre Sihera" <andre.sih...@hotmail.co.jp> wrote:
>
>
> On 30/03/14 09:03, Nikolay Pavlov wrote:
>>
>>
>> On Mar 30, 2014 3:35 AM, "Dmitry Frank" <dimon.fr...@gmail.com> wrote:
>> >
>> > Hello all.
>> >
>> > The match() function returns the index of the first match, but if there
>> > are multi-byte characters before the first match, each multi-byte
>> > character is counted as several characters, so the index becomes wrong.
>> >
>> > Say, match("foobar", "bar") returns 3, which is correct.  But
>> > match("яfoobar", "bar") returns 5, which is wrong (should be 4).
>>
>> This is completely correct. What are you going to do with 4?
>> "яfoobar"[4] is "o" (specifically, the second one).
>
>
> This is only marginally correct, even according to my documentation
> (7.3.475), which *starts* by talking about characters and *ends* by talking
> about bytes, even when referring to the same notions. stridx(), strpart(),
> and most other functions talk about bytes from the outset, with no mention
> of characters. At minimum, the OP was probably misled by match()'s
> description.
>
>>
>> > But we surely need to make match() work as expected when &encoding is
>> > "utf-8" too.
>>
>> >
>>
>> Also col(), string indexing, /\%Nc and so on? Not going to happen; this
>> is an incompatible change.
>
>
> This kind of flat-refusal mentality gets nobody anywhere.
>
> You can't go touting ViM around as a multilingual editor and fill it with
> lots of features and settings that handle multi-byte encodings and
> ISO-10646 support if this kind of English-only support prevails in the
> script language and prevents you from processing what the user has input
> in the first place.
>
> There are so many easy real-life examples I could cherry-pick as to why
> the OP's thinking is correct that it isn't funny.
>
> For example, say in Japanese (the input language I use) I'm processing
> buffer lines or user input where the first 20 characters are not useful.
> So you think I can go and just do this?
>
>     match(szUserInput, szSearchString, 20)
>
> In 8-bit *legacy* encodings, maybe. But in UTF-8? You must be kidding!
> Here's what I have as my input:
>
>     "今日 時間 日本語 勉強 思      今日は２時間ぐらい日本語を勉強したいと思います。",
>
> I am looking for "勉強" in the right-hand portion (character 33). Just how
> on earth do I specify the position *in bytes*, as match() expects, of the
> 20th *character*? By forcing me, the user, to *binary dump* every string I
> want to use in order to extract the byte index? What if that position has
> to be calculated dynamically based on previous user/file input? (This is
> typically necessary, as even whitespace can vary in width in Japanese,
> meaning an isspace()-like whitespace test succeeds but the number of bytes
> occupied varies.)
>
> Incidentally, in the above example, character 20 is the first character
> of "今日", the word after the larger whitespace portion in the middle.
> However, *byte* 20 is the "語" of the 3rd word "日本語". Thus, the ViM script:
>
>     let szLine = "今日 時間 日本語 勉強 思      今日は２時間ぐらい日本語を勉強したいと思います。"
>     let szSearch = input(...)
>     ...
>     match(szLine, szSearch, 20)
>
> comes back with 24 (byte 24). At minimum, I want it to come back with 79
> (the byte index of what I'm looking for), except that there was no easy
> way to dynamically compute 40, the byte position where the search
> actually needs to start.


Usually match(str, '.\{20}') is used in this case. I would ask, though,
where you obtained the number 20.
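
A rough sketch of one way to read the '.\{20}' idea above: matchend() with a
pattern that consumes N characters gives the byte index just past them, and
that value can then be passed as the {start} argument of match(). The helper
name CharToByteIdx() is hypothetical (it is not a Vim built-in), and the
example assumes &encoding is "utf-8".

```vim
" Sketch only: CharToByteIdx() is a hypothetical helper, not a Vim built-in.
" Assumes &encoding is "utf-8", so '.' in a pattern matches one character.
function! CharToByteIdx(str, nchars)
  if a:nchars <= 0
    return 0
  endif
  " matchend() returns the byte index just past the first a:nchars
  " characters, or -1 when the string is shorter than that.
  return matchend(a:str, '^.\{' . a:nchars . '}')
endfunction

let text = 'яfoobar'

" Skip the first 4 characters ("яfoo"); "я" takes 2 bytes, so this echoes 5.
let start = CharToByteIdx(text, 4)
echo start

" Search from that byte offset; the result is still a byte index (5).
echo match(text, 'bar', start)
```

Note that the value returned by match() remains a byte index, so this only
helps with choosing where the search starts.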

>
> This basic lack of support in the script language for multilingual
> features needs to be addressed, either through new functions or by fixing
> the existing ones so they match the behaviour that the user expects when
> modifying *related* settings like encoding, fileencoding, etc.

Indexing a string to get a character would be a good idea for most use cases
and would fix a number of plugins. Unfortunately, there is a whole *class*
of plugins that would be *broken* by this change: any plugin implementing a
hash calculation function. You might have expected this in neovim (not as
long as I am responsible for the new VimL implementation), but Bram hates
including incompatible changes (and I do not like them either). So you
cannot expect existing functions to be fixed.

About adding new functions: I do not know. Maybe if somebody writes a patch
to add mbstrlen() (an alias for the existing strchars(), for consistency),
mbmatch() (plus the end, str and list variants), mbstrpart(), mbstridx(),
mbstrridx(), mbcol() and /\%NC, they will be included.
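
For comparison, a few character-aware pieces already exist alongside the
byte-oriented ones: strchars() counts characters, and byteidx() converts a
character index into a byte index. A small illustration, assuming &encoding
is "utf-8" (the variable name is only for the example):

```vim
" Assumes &encoding is "utf-8".
let text = 'яfoobar'

" Length in bytes: 8, because "я" occupies 2 bytes in UTF-8.
echo strlen(text)

" Length in characters: 7.
echo strchars(text)

" Byte index of the character at character index 4 (the "b"): 5.
echo byteidx(text, 4)
```

The comments sit on their own lines because :echo would treat a trailing
double quote as the start of a string rather than a comment.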

>
>
>
>
>> >
>> > --
>> > Regards,
>> > Dmitry
>> >
>> > --
>> > --
>> > You received this message from the "vim_dev" maillist.
>> > Do not top-post! Type your reply below the text you are replying to.
>> > For more information, visit http://www.vim.org/maillist.php
>> >
>> > ---
>> > You received this message because you are subscribed to the Google
Groups "vim_dev" group.
>> > To unsubscribe from this group and stop receiving emails from it, send
an email to vim_dev+unsubscr...@googlegroups.com.
>> > For more options, visit https://groups.google.com/d/optout.
