Re: [vim/vim] Not all regexp classes [:...:] were not tested. (#1560)

Michal Grochmal Thu, 16 Mar 2017 20:15:52 -0700

On Thu, Mar 16, 2017 at 11:23:20PM +0900, Kazunobu Kuriyama wrote:
>  2017-03-16 8:40 GMT+09:00 Michal Grochmal <groch...@member.fsf.org>:
>
>  On Wed, Mar 15, 2017 at 11:16:49PM +0100, Bram Moolenaar wrote:
>  >
>  > Kazunobu Kuriyama wrote:
>>  >
>>  > > > But it seems strange that we need to restrict [:cntrl:] and
>>  [:graph:] to ASCII only.
>>  > >
>>  > > Quite understandable.  But otherwise, we will have to either
>>  rely
>>  > > entirely on the is*() functions provided by the OS in use or
>>  define
>>  > > our own character classes independently of any of it.
>>  > >
>>  > > The former case implies that the behavior of Vim scripts using
>>  > > [:class:] depends on the OS in use.   Surely, the latter case is
>>  > > expected to resolve the flaw of the former, but I'm not sure we
>>  can
>>  > > specify character classes in such a way that almost all users on
>>  > > various platforms are satisfied with them.
>>  > >
>>  > > So, I think at the moment that the ASCII restriction is a
>>  reasonable
>>  > > compromise.  But I'm still quite open to other better solutions.
>>  >
>>  > It's a difficult choice.  Either we say the regexp should be
>>  >  portable,
>>  > and we let Vim define exactly what those classes mean, or we say
>>  > we must
>>  > follow how the current system considers characters to be
>>  > classified.  I
>>  > wonder when the system knows better, perhaps when something in the
>>  > system configuration, e.g. the country or language, changes what
>>  > characters mean?
>>  Yes, it does.  At least on glibc (i.e. GNU).  iswcntrl(3) is defined
>>  as any
>>  character that is *not* part of "print", "alpha", "upper", "lower",
>>  "digit",
>>  "xdigit", "punct".
>>  The problem is that "alpha" and "punct" are affected by locale
>>  settings,
>>  therefore "cntrl" is affected too.  In other words, with a simple
>>  regex Vim
>>  would likely either: classify all >0x80 characters as [:cntrl:] or
>>  none of
>>  them, which may be erroneous since some UTF-8 characters are not
>>  printable in
>>  the higher ranges.
>>  So, the restriction to ASCII values, or better to 0x00-0x00ff, makes
>>  sense.
>>  For example (using some UTF-8 aware terminal emulator and a UTF-8
>>  locale):
>>  1.   printf "\x00\xc0"  # will print an À
>>  2.   printf "\x00\x9f"  # will give the same [:cntrl:] character
>>  that 0x9f
>>                          # gives under LC_CTYPE=latin1
>>  iswgraph(3) also has a note that it depends on LC_CTYPE but the
>>  definition of
>>  how this happens seems more convoluted.
>>  For non-UTF things should be simpler to regex I guess.
>>  Yet, still for UTF-8, different versions of glibc do have different
>>  UTF-8
>>  tables.  And other systems may as well be more or less updated to
>>  the unicode
>>  consortium.  Other OSes may be more or less often updated too.
>>  I'd make a regex for all the ISO8859, KOI8 and EUC locales and leave
>>  the system to deal with the others.  Then, on *nix LC_CTYPE=C should
>>  work like
>>  latin1 (iso8859-1) and on MS windows *I believe* that you can set
>>  the locale to
>>  latin1 on any version of it.
>>  Will not test all locales but there will be some tests at least.
>
>
>    Hi Michal,
>    Thank you for the comment.
>    After I read Bram's comment and yours, I spent a few hours looking into
>    the issue and skimming http://www.unicode.org/reports/tr18/, in
>    particular,
>    http://www.unicode.org/reports/tr18/#Compatibility_Properties .
>    According to the table of Compatibility Property Names, it looks to me
>    that, as far as :cntrl: and :graph: are concerned, we can get around
>    the difficulties regarding LC_CTYPE and can implement ctype-like
>    functions, say, vim_iscntrl() and vim_isgraph(), for unicode code
>    points in a way closer to Bram's suggestion.
>    But I may be missing something.  So I'm anxious to hear your view on
>    that.  Do you think it is feasible to implement ctype-like functions
>    for :cntrl: and :graph: if we follow the assignment recommendation of
>    Annex C?


Allow me to borrow some Python experience for this.  And sorry for the long
read, I've just learned one or two things while looking into it.

For a start using something like (for unicode locales):

    #define vim_iscntrl iswcntrl
    #define vim_isgraph iswgraph

Would work all right on *nix (even OS X, i.e. MacVim, as it is POSIX 2001
compatible to a good extent).

On the other hand, Annex C is nice, but I'd be worried about its support in
the real world.  I did a couple of simple tests with Python's `re` (as far as
I'm aware it simply links into the OS, glibc in my case) and found that the

    NEGATIVE_SPEC := ("\P{" PROP_SPEC "}") | ("[:^" PROP_SPEC ":]")

Does not really work in glibc.  For example:

    None != re.compile('[:alpha:]').match('a')  # True
    None != re.compile('[:alpha:]').match('1')  # False

All good until we use the negative unicode class spec:

    None != re.compile('[:^alpha:]').match('1')  # False
    None != re.compile('[:^alpha:]').match('=')  # False
    None != re.compile('[:^alpha:]').match('a')  # True

Basically the ^ is ignored.  So glibc fails to implement some of that spec.
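
A caveat on reading these results: the same outputs would also appear if `re`
parsed `[:alpha:]` as an ordinary bracket set of the literal characters `:`,
`a`, `l`, `p`, `h` rather than as a POSIX class.  A minimal sketch to tell the
two readings apart, using only `re.compile` and `match` (the helper name
`matches` and the variable `probes` are made up for this illustration):

```python
import re


def matches(pattern, text):
    """Return True if `pattern` matches at the start of `text`."""
    return re.compile(pattern).match(text) is not None


# The two readings predict different answers for these probes:
#   POSIX class reading: 'z' is alphabetic (match), ':' is not (no match).
#   Literal set reading: 'z' is not in the set (no match), ':' is (match).
probes = {ch: matches('[:alpha:]', ch) for ch in ('z', ':')}
print(probes)
```

If `:` matches and `z` does not, the pattern was read as a literal set, in
which case the tests above show how `re` parses the pattern rather than how
glibc classifies characters.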

And, unfortunately :graph: is defined with a negation:

    [^
    \p{space}
    \p{gc=Control}
    \p{gc=Surrogate}
    \p{gc=Unassigned}]
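
As a rough illustration of what Annex C-style `vim_iscntrl()` and
`vim_isgraph()` would have to compute, here is a sketch in Python built on
`unicodedata.category`.  The names `is_cntrl`, `is_graph` and `WHITE_SPACE` are
hypothetical, and the White_Space list is hard-coded on the assumption that it
matches the Unicode version in use; the results also depend on the Unicode
tables shipped with the Python build:

```python
import unicodedata

# Code points with the Unicode White_Space property (hard-coded assumption;
# check against the property list of the Unicode version you target).
WHITE_SPACE = frozenset(
    list(range(0x09, 0x0E))
    + [0x20, 0x85, 0xA0, 0x1680]
    + list(range(0x2000, 0x200B))
    + [0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
)


def is_cntrl(ch):
    """cntrl per Annex C: general category Control (Cc)."""
    return unicodedata.category(ch) == "Cc"


def is_graph(ch):
    """graph per Annex C: not space, Control, Surrogate or Unassigned."""
    if ord(ch) in WHITE_SPACE:
        return False
    return unicodedata.category(ch) not in ("Cc", "Cs", "Cn")
```

The negation is handled by the host language here, not by a regex engine,
which sidesteps the question of whether `[:^...:]` is supported.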

Python of course has a very similar problem to this one.  Since `re` links into
the OS, the matching is inconsistent between OSes.  They solved that by
making the `regex` library, which implements its own matching and is therefore
consistent across OSes.  The `regex` library also has its unicode tables
updated very often, so it is quite popular for working with esoteric scripts.

On the other hand, I do not believe that Vim would have enough resources to pull
off something like that and maintain its own unicode treatment in the regex
engine.  To get an idea, here is the source of how the `regex` Python module
deals with unicode:

https://bitbucket.org/mrabarnett/mrab-regex/src/4600a157989dc1671e4415ebe57aac53cfda2d8a/regex_3/regex/_regex_unicode.c?at=default&fileviewer=file-view-default

It has almost 15k lines of code and needs to be updated with every release of
the unicode spec (which is not *that* bad, since a new spec comes out about
once a year).

------

Anyhow, trying to make regexes to *define* unicode classes seems like a fool's
errand.  A lot of work for little result.  I'd stick to the `isw*()` functions
on *nix, find the Windows equivalents (I'm pretty clueless on Windows
development, sorry), and use that for unicode locales.  Ensure that tests work
under LC_CTYPE=C and LC_CTYPE=latin1 (on Windows); and do not venture too far
into esoteric parts of the unicode tables with the tests.

After all, tests in Vim are there to test Vim, not the unicode coverage of
glibc (or Windows' libc).

Other locales (non-unicode) can certainly use regexes.
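
For a single-byte locale the class can be written as fixed byte ranges.  A
minimal sketch for [:cntrl:] under latin1, assuming the IANA ISO-8859-1 layout
(C0 controls at 0x00-0x1f, DEL at 0x7f, C1 controls at 0x80-0x9f); the names
`LATIN1_CNTRL` and `is_latin1_cntrl` are hypothetical:

```python
import re

# Assumption: bytes are interpreted as IANA ISO-8859-1 (latin1), where
# 0x00-0x1f and 0x7f are the C0 controls/DEL and 0x80-0x9f are the C1 controls.
LATIN1_CNTRL = re.compile(rb"[\x00-\x1f\x7f-\x9f]")


def is_latin1_cntrl(byte_value):
    """Return True if the single byte value (0-255) is a latin1 control."""
    return LATIN1_CNTRL.fullmatch(bytes([byte_value])) is not None
```

Because the ranges are spelled out, the answer does not change with the OS or
the libc version, which is the point of using a regex for these locales.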

-- 
Mike Grochmal
GPG key ID 0xC840C4F6
