I've verified the behavior you witnessed, and it does indeed differ from
MacOS at least.  The root cause is a problem with the localedef database
(if it is indeed a problem; I'm not sure about that):

Here's the relevant data from the en_US.UTF-8 data file from Unicode.org:

space   <tab>;/
        <newline>;/
        <vertical-tab>;/
        <form-feed>;/
        <carriage-return>;/
        <space>

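The space class above lists only the six ASCII whitespace characters, so
iswspace() has no reason to accept U+2003, U+205F, or the other Unicode
spaces.  It's unclear to me whether this is an error or not.  It's also
possible that the definitions for those extra space characters were added
in later Unicode standards.  I believe the data we have is based on a
somewhat older version of the CLDR.  The last import seems to have
happened back in 2012, based on CLDR 2.0.1.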




On Thu, Jul 24, 2014 at 7:42 PM, Garrett D'Amore <[email protected]> wrote:

> I will check later this weekend.  I have to investigate.
>
> > On Jul 24, 2014, at 4:09 PM, "Alexander Pyhalov via illumos-discuss"
> > <[email protected]> wrote:
> >
> > Hello.
> > During the GNU grep update I found that one test fails; specifically,
> > the output of gawk 'BEGIN { printf "\xe2\x80\x80\n" }' is not matched
> > by grep '\s'.
> >
> > The GNU grep test suite checks that the following UTF-8 characters are
> > spaces:
> >
> > utf8_space_characters=$(sed 's/.*://;s/  */\\x/g' <<\EOF
> > U+0009 Horizontal Tab:            09
> > U+000B Vertical Tab:              0b
> > U+000C Form feed:                 0c
> > U+000D Carriage return:           0d
> > U+0020 SPACE:                     20
> > U+1680 OGHAM SPACE MARK:          e1 9a 80
> > U+2000 EN QUAD:                   e2 80 80
> > U+2001 EM QUAD:                   e2 80 81
> > U+2002 EN SPACE:                  e2 80 82
> > U+2003 EM SPACE:                  e2 80 83
> > U+2004 THREE-PER-EM SPACE:        e2 80 84
> > U+2005 FOUR-PER-EM SPACE:         e2 80 85
> > U+2006 SIX-PER-EM SPACE:          e2 80 86
> > U+2008 PUNCTUATION SPACE:         e2 80 88
> > U+2009 THIN SPACE:                e2 80 89
> > U+200A HAIR SPACE:                e2 80 8a
> > U+205F MEDIUM MATHEMATICAL SPACE: e2 81 9f
> > U+3000 IDEOGRAPHIC SPACE:         e3 80 80
> > EOF
> > )
> >
> > Checks for
> > e1 9a 80
> > e2 80 80 - e2 80 8a
> > e2 81 9f, e3 80 80
> > fail.
> >
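> > I've verified with the following C99 program
> > #include <locale.h>
> > #include <stdio.h>
> > #include <wchar.h>
> > #include <wctype.h>
> >
> > /* Switch to locale loc and report what iswspace() says about c. */
> > static void try_with(wchar_t c, const char *loc)
> > {
> >    if (setlocale(LC_ALL, loc) == NULL) {
> >       printf("locale %s is not available\n", loc);
> >       return;
> >    }
> >    printf("in locale %s iswspace(U+%04lX) returned %d\n",
> >       loc, (unsigned long)c, iswspace((wint_t)c));
> > }
> >
> > int main(void)
> > {
> >    wchar_t em_space = L'\u2003';   /* U+2003 EM SPACE */
> >    wchar_t math_space = L'\u205f'; /* U+205F MEDIUM MATHEMATICAL SPACE */
> >
> >    try_with(em_space, "C");
> >    try_with(em_space, "en_US.UTF-8");
> >    try_with(math_space, "C");
> >    try_with(math_space, "en_US.UTF-8");
> >    return 0;
> > }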
> >
> > that iswspace considers \u2003 (as I understand it, this corresponds to
> > e2 80 83) and \u205f (e2 81 9f) non-spaces.
> > I've run the same test program on FreeBSD.  It considers both characters
> > spaces in the en_US.UTF-8 locale.
> > Is it a bug, or am I missing something?
> >
> > --
> > System Administrator of Southern Federal University Computer Center
> >
> >
> > -------------------------------------------
> > illumos-discuss
> > Archives: https://www.listbox.com/member/archive/182180/=now
> > RSS Feed:
> https://www.listbox.com/member/archive/rss/182180/22003744-9012f59c
> > Modify Your Subscription:
> https://www.listbox.com/member/?&;
> > Powered by Listbox: http://www.listbox.com
>



