Re: Uppercase RE matching problems in FreeBSD 11

2016-11-06 Thread Baptiste Daroussin

On Sat, Nov 05, 2016 at 08:23:25PM -0500, Greg Rivers wrote:
> I happened to run an old script today that uses sed(1) to extract the system
> boot time from the kern.boottime sysctl MIB. On 11.0 this no longer works as
> expected:
> 
> $ sysctl kern.boottime
> kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov  5 16:18:34 2016
> $ sysctl kern.boottime | sed -e 's/.*\([A-Z].*\)$/\1/'
> v  5 16:18:34 2016
> 
> sed passes over 'S' and 'N' until it hits 'v', which it considers uppercase
> apparently. This is with LANG=en_US.UTF-8. If I set LANG=C, it works as
> expected:
> 
> $ sysctl kern.boottime | LANG=C sed -e 's/.*\([A-Z].*\)$/\1/'
> Nov  5 16:18:34 2016
> 
> Testing every lowercase character separately gives even more inconsistent
> results:
> 
> $ cat < > a
> > b
> > c
> > d
> > e
> > f
> > g
> > h
> > i
> > j
> > k
> > l
> > m
> > n
> > o
> > p
> > q
> > r
> > s
> > t
> > u
> > v
> > w
> > x
> > y
> > z
> > !
> b
> c
> d
> e
> f
> g
> h
> i
> j
> k
> l
> m
> n
> o
> p
> q
> r
> s
> t
> u
> v
> w
> x
> y
> z
> 
> Here sed thinks every lowercase character except for 'a' is uppercase! This
> differs from the first test where sed did not think 'o' is uppercase. Again,
> the above behaves as expected with LANG=C.
> 
> Does anyone have any insight into this? This is likely to break a lot of
> existing code.
> 

Yes, A-Z only means uppercase in an ASCII-only world. In a Unicode world it
means AaBb... Z, because there are way more characters than simple A-Z. In
FreeBSD 11 we have a Unicode collation instead of falling back on
LC_COLLATE=C, which means ASCII only.

For regexps, for example, one should use the character classes [:upper:] or
[:lower:].

Best regards,
Bapt


signature.asc
Description: PGP signature
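
For the boot-time script that started this thread, the character-class form
might look like the sketch below. It assumes the sysctl output format quoted
in the report; boot_date is a hypothetical helper name, not something taken
from the original script.

```sh
# Hypothetical helper: keep everything from the last uppercase letter onward.
# [[:upper:]] is a character class, so it does not depend on collation order.
boot_date() {
    sed -e 's/.*\([[:upper:]].*\)$/\1/'
}

# Sample line copied from the report; on a live system one would pipe
# `sysctl kern.boottime` into boot_date instead.
line='kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov  5 16:18:34 2016'
printf '%s\n' "$line" | boot_date
# prints: Nov  5 16:18:34 2016
```

Because the leading .* is greedy, the expression keeps the text starting at
the last character on the line that the bracket expression matches, here the
N of Nov. The same greediness is why the [A-Z] version stopped at the v.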

Re: Uppercase RE matching problems in FreeBSD 11

2016-11-06 Thread Mark Martinec

2016-11-06 12:07, Baptiste Daroussin wrote:
> Yes A-Z only means uppercase in an ASCII only world in a unicode world
> it means AaBb... Z because there are way more characters that simple A-Z.
> In FreeBSD 11 we have a unicode collation instead of falling back in on
> LC_COLLATE=C which means ascii only
>
> For regrexp for example one should use the classes: :upper: or :lower:.

It is a good idea to keep LC_COLLATE and LC_NUMERIC (and LC_MONETARY?)
at "C" when LANG or LC_CTYPE is set to something else, otherwise
unexpected things may happen.

  Mark

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"

Re: Uppercase RE matching problems in FreeBSD 11

2016-11-06 Thread Baptiste Daroussin

On Sun, Nov 06, 2016 at 01:26:51PM +0100, Mark Martinec wrote:
> 2016-11-06 12:07, Baptiste Daroussin wrote:
> > Yes A-Z only means uppercase in an ASCII only world in a unicode world
> > it means
> > AaBb... Z because there are way more characters that simple A-Z. In
> > FreeBSD 11
> > we have a unicode collation instead of falling back in on LC_COLLATE=C
> > which
> > means ascii only
> > 
> > For regrexp for example one should use the classes: :upper: or :lower:.
> 
> It is a good idea to keep LC_COLLATE and LC_NUMERIC (and LC_MONETARY?) at
> "C"
> when LANG or LC_CTYPE is set to something else, otherwise unexpected
> things may happen.
> 

In scripts, clearly: the collation rules, numeric rules and monetary rules may
vary depending on the locale.

Best regards,
Bapt
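
In a script, one way to act on this is to pin the locale only for the command
that uses a range expression. A minimal sketch, assuming that pipeline only
needs to handle ASCII input (LC_ALL=C also switches LC_CTYPE to C, so
multibyte characters are treated as bytes); upper_to_x is a made-up name:

```sh
# Hypothetical helper: run sed in the C locale, where [A-Z] is the
# ASCII uppercase range, regardless of the caller's LANG/LC_* settings.
upper_to_x() {
    LC_ALL=C sed 's/[A-Z]/X/g'
}

echo 'abcdABCD' | upper_to_x
# prints: abcdXXXX
```

The override applies to that one sed invocation only, so the rest of the
script keeps whatever locale the user has configured.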


Re: Uppercase RE matching problems in FreeBSD 11

2016-11-06 Thread Stefan Bethke


> On 06.11.2016 at 12:07, Baptiste Daroussin wrote:
> 
> On Sat, Nov 05, 2016 at 08:23:25PM -0500, Greg Rivers wrote:
>> I happened to run an old script today that uses sed(1) to extract the system
>> boot time from the kern.boottime sysctl MIB. On 11.0 this no longer works as
>> expected:
>> 
>> $ sysctl kern.boottime
>> kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov  5 16:18:34 2016
>> $ sysctl kern.boottime | sed -e 's/.*\([A-Z].*\)$/\1/'
>> v  5 16:18:34 2016
>> 
>> sed passes over 'S' and 'N' until it hits 'v', which it considers uppercase
>> apparently. This is with LANG=en_US.UTF-8. If I set LANG=C, it works as
>> expected:
>> 
>> $ sysctl kern.boottime | LANG=C sed -e 's/.*\([A-Z].*\)$/\1/'
>> Nov  5 16:18:34 2016
>> 
>> Testing every lowercase character separately gives even more inconsistent
>> results:
>> 
>> $ cat <
>> Here sed thinks every lowercase character except for 'a' is uppercase! This
>> differs from the first test where sed did not think 'o' is uppercase. Again,
>> the above behaves as expected with LANG=C.
>> 
>> Does anyone have any insight into this? This is likely to break a lot of
>> existing code.
>> 
> 
> Yes A-Z only means uppercase in an ASCII only world in a unicode world it 
> means
> AaBb... Z because there are way more characters that simple A-Z. In FreeBSD 11
> we have a unicode collation instead of falling back in on LC_COLLATE=C which
> means ascii only
> 
> For regrexp for example one should use the classes: :upper: or :lower:.

That is rather surprising.  Is there a normative reference for the treatment of 
bracket expressions and character classes when using locales other than C 
and/or encodings like UTF-8?


Stefan

-- 
Stefan Bethke   Fon +49 151 14070811





Re: Uppercase RE matching problems in FreeBSD 11

2016-11-06 Thread Baptiste Daroussin

On Sun, Nov 06, 2016 at 09:57:00PM +0100, Stefan Bethke wrote:
> 
> > On 06.11.2016 at 12:07, Baptiste Daroussin wrote:
> > 
> > On Sat, Nov 05, 2016 at 08:23:25PM -0500, Greg Rivers wrote:
> >> I happened to run an old script today that uses sed(1) to extract the 
> >> system
> >> boot time from the kern.boottime sysctl MIB. On 11.0 this no longer works 
> >> as
> >> expected:
> >> 
> >> $ sysctl kern.boottime
> >> kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov  5 16:18:34 2016
> >> $ sysctl kern.boottime | sed -e 's/.*\([A-Z].*\)$/\1/'
> >> v  5 16:18:34 2016
> >> 
> >> sed passes over 'S' and 'N' until it hits 'v', which it considers uppercase
> >> apparently. This is with LANG=en_US.UTF-8. If I set LANG=C, it works as
> >> expected:
> >> 
> >> $ sysctl kern.boottime | LANG=C sed -e 's/.*\([A-Z].*\)$/\1/'
> >> Nov  5 16:18:34 2016
> >> 
> >> Testing every lowercase character separately gives even more inconsistent
> >> results:
> >> 
> >> $ cat < 
> >> Here sed thinks every lowercase character except for 'a' is uppercase! This
> >> differs from the first test where sed did not think 'o' is uppercase. 
> >> Again,
> >> the above behaves as expected with LANG=C.
> >> 
> >> Does anyone have any insight into this? This is likely to break a lot of
> >> existing code.
> >> 
> > 
> > Yes A-Z only means uppercase in an ASCII only world in a unicode world it 
> > means
> > AaBb... Z because there are way more characters that simple A-Z. In FreeBSD 
> > 11
> > we have a unicode collation instead of falling back in on LC_COLLATE=C which
> > means ascii only
> > 
> > For regrexp for example one should use the classes: :upper: or :lower:.
> 
> That is rather surprising.  Is there a normative reference for the treatment 
> of bracket expressions and character classes when using locales other than C 
> and/or encodings like UTF-8?

http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html

For example:

"Regular expressions are a context-independent syntax that can represent a wide
variety of character sets and character set orderings, where these character
sets are interpreted according to the current locale. While many regular
expressions can be interpreted differently depending on the current locale, many
features, such as character class expressions, provide for contextual invariance
across locales."

Best regards,
Bapt



Re: Uppercase RE matching problems in FreeBSD 11

2016-11-06 Thread Stefan Ehmann

On 06.11.2016 21:57, Stefan Bethke wrote:
> 
>> On 06.11.2016 at 12:07, Baptiste Daroussin wrote:
>> 
>> On Sat, Nov 05, 2016 at 08:23:25PM -0500, Greg Rivers wrote:
>>> I happened to run an old script today that uses sed(1) to extract
>>> the system boot time from the kern.boottime sysctl MIB. On 11.0
>>> this no longer works as expected:
..
>>> Here sed thinks every lowercase character except for 'a' is
>>> uppercase! This differs from the first test where sed did not
>>> think 'o' is uppercase. Again, the above behaves as expected with
>>> LANG=C.
>>> 
>>> Does anyone have any insight into this? This is likely to break a
>>> lot of existing code.
>>> 
>> 
>> Yes A-Z only means uppercase in an ASCII only world in a unicode
>> world it means AaBb... Z because there are way more characters that
>> simple A-Z. In FreeBSD 11 we have a unicode collation instead of
>> falling back in on LC_COLLATE=C which means ascii only
>> 
>> For regrexp for example one should use the classes: :upper: or
>> :lower:.
> 
> That is rather surprising.  Is there a normative reference for the
> treatment of bracket expressions and character classes when using
> locales other than C and/or encodings like UTF-8?

I found an interesting article about this issue in gawk:
https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html

Apparently the meaning of ranges is unspecified outside the "C" locale.

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05
says:

"In the POSIX locale, a range expression represents the set of collating
elements that fall between two elements in the collation sequence,
inclusive. In other locales, a range expression has unspecified
behavior: strictly conforming applications shall not rely on whether the
range expression is valid, or on the set of collating elements matched"
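
Since the behavior is unspecified outside the POSIX locale, a script could
probe what the current environment does before relying on a range. The
following is only a sketch; range_is_ascii_upper is a hypothetical helper
name, and the probe tests grep, which may not share a regex implementation
with every other tool on the system.

```sh
# Returns 0 when [A-Z] matches no lowercase ASCII letter in the
# current locale, non-zero when it matches at least one.
range_is_ascii_upper() {
    ! printf 'abcdefghijklmnopqrstuvwxyz\n' | grep -q '[A-Z]'
}

if range_is_ascii_upper; then
    echo 'range [A-Z] is ASCII-only here'
else
    echo 'range [A-Z] also matches lowercase letters here'
fi
```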

Re: Uppercase RE matching problems in FreeBSD 11

2016-11-06 Thread Stefan Bethke


> On 06.11.2016 at 22:06, Baptiste Daroussin wrote:
> 
> On Sun, Nov 06, 2016 at 09:57:00PM +0100, Stefan Bethke wrote:
>> 
>>> On 06.11.2016 at 12:07, Baptiste Daroussin wrote:
>>> 
>>> 
>>> Yes A-Z only means uppercase in an ASCII only world in a unicode world it 
>>> means
>>> AaBb... Z because there are way more characters that simple A-Z. In FreeBSD 
>>> 11
>>> we have a unicode collation instead of falling back in on LC_COLLATE=C which
>>> means ascii only
>>> 
>>> For regrexp for example one should use the classes: :upper: or :lower:.
>> 
>> That is rather surprising.  Is there a normative reference for the treatment 
>> of bracket expressions and character classes when using locales other than C 
>> and/or encodings like UTF-8?
> 
> http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html
> 
> For example:
> 
> "Regular expressions are a context-independent syntax that can represent a 
> wide
> variety of character sets and character set orderings, where these character
> sets are interpreted according to the current locale. While many regular
> expressions can be interpreted differently depending on the current locale, 
> many
> features, such as character class expressions, provide for contextual 
> invariance
> across locales."

Sorry, maybe I wasn’t clear enough with my question.  When a character class 
fits the problem, it is clearly advantageous.

But under what circumstances would [A-Z] mean anything other than a character 
whose Unicode codepoint is between U+0041 and U+005A, inclusive?  Especially 
given the locale in the example is en_US.UTF-8.  Or, put another way, why would 
an implementation interpret [A-Z] as anything other than [ABCDE…XYZ]?

From reading your reference, I can see in 9.3.5.7:
> In the POSIX locale, a range expression represents the set of collating 
> elements that fall between two elements in the collation sequence, inclusive. 
> In other locales, a range expression has unspecified behavior[…]

So even if the observed behaviour is conforming, I’d think it’s still highly 
undesirable.


Stefan

-- 
Stefan Bethke   Fon +49 151 14070811





Re: Uppercase RE matching problems in FreeBSD 11

2016-11-06 Thread Baptiste Daroussin

On Sun, Nov 06, 2016 at 10:20:54PM +0100, Stefan Bethke wrote:
> 
> > On 06.11.2016 at 22:06, Baptiste Daroussin wrote:
> > 
> > On Sun, Nov 06, 2016 at 09:57:00PM +0100, Stefan Bethke wrote:
> >> 
> >>> On 06.11.2016 at 12:07, Baptiste Daroussin wrote:
> >>> 
> >>> 
> >>> Yes A-Z only means uppercase in an ASCII only world in a unicode world it 
> >>> means
> >>> AaBb... Z because there are way more characters that simple A-Z. In 
> >>> FreeBSD 11
> >>> we have a unicode collation instead of falling back in on LC_COLLATE=C 
> >>> which
> >>> means ascii only
> >>> 
> >>> For regrexp for example one should use the classes: :upper: or :lower:.
> >> 
> >> That is rather surprising.  Is there a normative reference for the 
> >> treatment of bracket expressions and character classes when using locales 
> >> other than C and/or encodings like UTF-8?
> > 
> > http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html
> > 
> > For example:
> > 
> > "Regular expressions are a context-independent syntax that can represent a 
> > wide
> > variety of character sets and character set orderings, where these character
> > sets are interpreted according to the current locale. While many regular
> > expressions can be interpreted differently depending on the current locale, 
> > many
> > features, such as character class expressions, provide for contextual 
> > invariance
> > across locales."
> 
> Sorry, maybe I wasn’t clear enough with my question.  When a character class 
> fits the problem, it is clearly advantageous.
> 
> But under what circumstances would [A-Z] mean anything other than a character 
> whose Unicode codepoint is between U+0041 and U+005A, inclusive?  Especially 
> given the locale in the example is en_US.UTF-8.  Or, put another way, why 
> would an implementation interpret [A-Z] as anything other than [ABCDE…XYZ]?

The collation rules for Unicode come from http://cldr.unicode.org/, and they
match the ones on Linux, for example, and the ones on illumos.

Some GNU tools explicitly decide to be non-locale-aware to avoid that kind of
"surprise".
> 
> From reading your reference, I can see in 9.3.5.7:
> > In the POSIX locale, a range expression represents the set of collating 
> > elements that fall between two elements in the collation sequence, 
> > inclusive. In other locales, a range expression has unspecified behavior[…]
> 
> So even if the observed behaviour is conforming, I’d think it’s still highly 
> undesirable.
> 
That works for the POSIX locale, aka C, aka the ASCII-only world.

Best regards,
Bapt



Re: Uppercase RE matching problems in FreeBSD 11

2016-11-06 Thread Stefan Bethke


> On 06.11.2016 at 22:14, Stefan Ehmann wrote:
> 
>> That is rather surprising.  Is there a normative reference for the
>> treatment of bracket expressions and character classes when using
>> locales other than C and/or encodings like UTF-8?
> 
> I found an interesting article about this issue in gawk:
> https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html

OK, I give up.  Back to jwz: "now you have two problems."

Although with en_US.UTF-8 on other systems, I have not had that experience.  A 
quick check on stuff I have immediate access to:

macOS 10.12:
$ echo 'abcdABCD' | sed 's/[A-Z]/X/g'
abcdXXXX

Ubuntu 14.04.5
$ echo 'abcdABCD' | sed 's/[A-Z]/X/g'
abcdXXXX

FreeBSD 10-stable
$ echo 'abcdABCD' | sed 's/[A-Z]/X/g'
abcdXXXX


Stefan

-- 
Stefan Bethke   Fon +49 151 14070811





Re: Uppercase RE matching problems in FreeBSD 11

2016-11-06 Thread Andriy Gapon

On 06/11/2016 23:30, Stefan Bethke wrote:
> Although with en_US.UTF-8 on other systems, I have not had that experience.  
> A quick check on stuff I have immediate access to:
> 
> macOS 10.12:
> $ echo 'abcdABCD' | sed 's/[A-Z]/X/g'
> abcdXXXX
> 
> Ubuntu 14.04.5
> $ echo 'abcdABCD' | sed 's/[A-Z]/X/g'
> abcdXXXX
> 
> FreeBSD 10-stable
> $ echo 'abcdABCD' | sed 's/[A-Z]/X/g'
> abcdXXXX

Latest Gentoo:
$ echo 'abcdABCD' | sed 's/[A-Z]/X/g'
aXXXXXXX

Recent OpenIndiana (an illumos based OS):
$ echo 'abcdABCD' | sed 's/[A-Z]/X/g'
aXXXXXXX

-- 
Andriy Gapon
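
The per-system checks in this and the previous message can be repeated on one
machine by running the same substitution under several locales. This is a
rough sketch; probe is a hypothetical helper, and the locale names in the loop
are assumed to be installed.

```sh
# Hypothetical helper: run the test substitution under the locale
# named in $1 and print the locale name followed by the result.
probe() {
    printf '%s: ' "$1"
    echo 'abcdABCD' | LC_ALL="$1" sed 's/[A-Z]/X/g'
}

for loc in C en_US.UTF-8; do
    probe "$loc"
done
```

The C line should read "C: abcdXXXX"; the en_US.UTF-8 line depends on the
system's collation data, as the results in this thread show. A locale name
that is not installed usually falls back to C behavior, so it is worth
checking `locale -a` first.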

Re: Uppercase RE matching problems in FreeBSD 11

2016-11-06 Thread Stefan Bethke

> On 06.11.2016 at 22:27, Baptiste Daroussin wrote:
> 
>> But under what circumstances would [A-Z] mean anything other than a 
>> character whose Unicode codepoint is between U+0041 and U+005A, inclusive?  
>> Especially given the locale in the example is en_US.UTF-8.  Or, put another 
>> way, why would an implementation interpret [A-Z] as anything other than 
>> [ABCDE…XYZ]?
> 
> The collation rules for unicode comes from: http://cldr.unicode.org/ and they 
> do
> match the one on linux for example and the one on illumos.
> 
> On some gnu tool they explicitly decide to be non locale aware to avoid that
> kind of "surprises"
>> 
>> From reading your reference, I can see in 9.3.5.7:
>>> In the POSIX locale, a range expression represents the set of collating 
>>> elements that fall between two elements in the collation sequence, 
>>> inclusive. In other locales, a range expression has unspecified behavior[…]
>> 
>> So even if the observed behaviour is conforming, I’d think it’s still highly 
>> undesirable.
>> 
> That works for POSIX locale aka C aka ASCII only world

So what do I set my LANG and LC variables to?  I do want UTF-8, but I do also 
want my scripts to continue to work.  Clearly, en_US.UTF-8 is not what I want.  
Is it C.UTF-8?  Or do I set LANG=en_US.UTF-8 and LC_COLLATE=C?


Stefan

-- 
Stefan Bethke   Fon +49 151 14070811





Re: Uppercase RE matching problems in FreeBSD 11

2016-11-07 Thread Charles Swiger

On Nov 6, 2016, at 1:49 PM, Stefan Bethke wrote:
> On 06.11.2016 at 22:27, Baptiste Daroussin wrote:
>> That works for POSIX locale aka C aka ASCII only world
> 
> So what do I set my LANG and LC variables to?  I do want UTF-8, but I do also 
> want my scripts to continue to work.  Clearly, en_US.UTF-8 is not what I 
> want.  Is it C.UTF-8?  Or do I set LANG=en_US.UTF-8 and LC_COLLATE=C?

If you want to use a UTF8 locale, then you must start using character classes 
like '[:upper:]' and '[:lower:]' because those will-- or at least "should", 
modulo bugs-- properly handle the collation issues including for languages 
which do not possess a 1-1 mapping between upper and lower case letters.

Someone with a German email address is presumably familiar with ß / Eszett...?  
:-)

Regards,
-- 
-Chuck


Re: Uppercase RE matching problems in FreeBSD 11

2016-11-07 Thread Mark Martinec


2016-11-06 22:49, Stefan Bethke wrote:
> So what do I set my LANG and LC variables to?  I do want UTF-8, but I
> do also want my scripts to continue to work.  Clearly, en_US.UTF-8 is
> not what I want.  Is it C.UTF-8?
> Or do I set LANG=en_US.UTF-8 and LC_COLLATE=C?


Yes, that is the safest bet. The LANG sets a default, but the
LC_COLLATE, LC_TIME, LC_NUMERIC and LC_MONETARY should better
be set to "C" to overrule the LANG in their domains.

Leave the LC_ALL undefined or empty, as this one overrules
every other locale setting (unless you really want everything
to be set to "C").

  Mark
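
As a concrete illustration of the settings described above, a login-shell
profile fragment might look like this. It is a sketch only: the file it goes
in (for example ~/.profile) and the choice of en_US.UTF-8 are assumptions, and
only LC_COLLATE is overridden here; LC_TIME, LC_NUMERIC and LC_MONETARY could
be set to C the same way.

```sh
# Sketch of a profile fragment; adjust the locale name to taste.
LANG=en_US.UTF-8      # default for every category not set below
LC_COLLATE=C          # C collation, so ranges like [A-Z] follow ASCII order
export LANG LC_COLLATE
unset LC_ALL          # LC_ALL would override both settings above

# Quick check that the range now behaves as in the C locale.
echo 'abcdABCD' | sed 's/[A-Z]/X/g'
# prints: abcdXXXX
```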

Re: Uppercase RE matching problems in FreeBSD 11

2016-11-08 Thread Stefan Ehmann

On 07.11.2016 22:13, Charles Swiger wrote:
> On Nov 6, 2016, at 1:49 PM, Stefan Bethke wrote:
>> On 06.11.2016 at 22:27, Baptiste Daroussin wrote:
>>> That works for POSIX locale aka C aka ASCII only world
>> 
>> So what do I set my LANG and LC variables to?  I do want UTF-8, but
>> I do also want my scripts to continue to work.  Clearly,
>> en_US.UTF-8 is not what I want.  Is it C.UTF-8?  Or do I set
>> LANG=en_US.UTF-8 and LC_COLLATE=C?
> 
> If you want to use a UTF8 locale, then you must start using character
> classes like '[:upper:]' and '[:lower:]' because those will-- or at
> least "should", modulo bugs-- properly handle the collation issues
> including for languages which do not possess a 1-1 mapping between
> upper and lower case letters.
> 
> Someone with a German email address is presumably familiar with ß /
> Eszett...?  :-)

Character classes work fine for [a-z], but I don't know of a simple way
to match a range like [a-k].

Personally, I prefer the "Rational Range Interpretation" because it
doesn't break backward compatibility and is still standard compliant.

Re: Uppercase RE matching problems in FreeBSD 11

2016-11-08 Thread Chuck Swiger

On Nov 8, 2016, at 11:54 AM, Stefan Ehmann wrote:
> On 07.11.2016 22:13, Charles Swiger wrote:
>> On Nov 6, 2016, at 1:49 PM, Stefan Bethke wrote:
>>> On 06.11.2016 at 22:27, Baptiste Daroussin wrote:
>>>> That works for POSIX locale aka C aka ASCII only world
>>> 
>>> So what do I set my LANG and LC variables to?  I do want UTF-8, but
>>> I do also want my scripts to continue to work.  Clearly,
>>> en_US.UTF-8 is not what I want.  Is it C.UTF-8?  Or do I set
>>> LANG=en_US.UTF-8 and LC_COLLATE=C?
>> 
>> If you want to use a UTF8 locale, then you must start using character
>> classes like '[:upper:]' and '[:lower:]' because those will-- or at
>> least "should", modulo bugs-- properly handle the collation issues
>> including for languages which do not possess a 1-1 mapping between
>> upper and lower case letters.
>> 
>> Someone with a German email address is presumably familiar with ß /
>> Eszett...?  :-)
> 
> Character classes work fine for [a-z], but I don't know of a simple way
> to match a range like [a-k].

True.  If you need smaller ranges, I don't see a portable way of doing
so in a non-POSIX / "C" locale beyond listing them out.  Or:

> Personally, I prefer the "Rational Range Interpretation" because it
> doesn't break backward compatibility and is still standard compliant.

...yes, +1.  Many of the GNU tools like grep and gawk have adopted this,
but they are replacing the system regex routines with their own code.

However, you can't rely on RRI without testing whether you've got a gawk
in the $PATH or whether /usr/bin/awk or whichever is really GNU awk.

Regards,
-- 
-Chuck
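
For the "listing them out" option, a partial range such as [a-k] can be
written as an explicit bracket list. A small sketch; the variable name a_to_k
and the sample text are made up for illustration.

```sh
# [a-k] spelled out letter by letter, so the match does not depend
# on the collation order of the current locale.
a_to_k='[abcdefghijk]'

echo 'alphabet soup' | sed "s/$a_to_k/_/g"
# prints: _lp____t soup
```

This is more verbose than a range, but every member of the set is stated
explicitly, so there is nothing left for the locale to reinterpret.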
