Re: tr A-Z a-z in locales other than C

2011-06-07 Thread Jilles Tjoelker
On Tue, Jun 07, 2011 at 04:24:43AM +0400, Andrey Chernov wrote:
 On Tue, Jun 07, 2011 at 12:41:05AM +0200, Jilles Tjoelker wrote:

  There is a related issue with ranges in regular expressions, glob and
  fnmatch (likewise unspecified by POSIX outside the POSIX locale), but
  this is less likely to cause problems.

 You care about ports, but the suggested change is americano-centrism, which 
 kills tr usage for national-language documents because it becomes impossible 
 to specify a whole national alphabet easily, with just two letters.

Hmm, so that's with translation to a constant, or with the -d and/or -s
options. In such cases, there may be a range for all letters with
collation order, but not with codeset order (mainly if all letters
includes letters with diacritical marks).

In FreeBSD, upper case sorts before lower case, so cases can be
distinguished this way but all letters may require two ranges. In most
other operating systems the cases go together so a single range is
sufficient, but cases cannot be distinguished. Making such things work
on multiple operating systems requires careful testing.
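
One way to do that kind of testing is to print what a range actually
expands to under a given locale. The following is only a sketch:
range_members is a hypothetical helper name, the sample string is an
arbitrary assumption, and only the C locale is exercised here because the
result in any other locale depends on the system.

```shell
#!/bin/sh
# Hypothetical helper (not part of tr): show which characters of a fixed
# ASCII sample fall inside a tr range under a given locale.
range_members() {
    locale_name=$1   # e.g. C, or any locale name installed on the system
    range=$2         # e.g. 'A-Z'
    # tr -cd deletes every character NOT in the range, so what survives
    # is exactly the part of the sample that the range covers.
    printf '%s' 'ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_abcdefghijklmnopqrstuvwxyz' |
        LC_ALL=$locale_name tr -cd "$range"
    echo
}

# In the C locale ranges follow character codes, so X-c also picks up
# the punctuation that sits between 'Z' and 'a'.
range_members C 'A-Z'
range_members C 'X-c'
```

Running the same calls with another locale name on each target system shows
whether a script's ranges mean the same thing there.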

 Moreover, having differently treated regex ranges in tr vs other places 
 you mention will produce additional chaos.

I think this is already inconsistent because some programs do not enable
locale or use different locale code.

With UTF-8 or other multibyte character sets, this is even more so
because functions like isalpha work very poorly by definition and there
is no collation support for such character sets in FreeBSD.

 Back to the ports: it is not hard to run _any_ port's make or configure 
 with LANG=C directly by the ports Mk system to eliminate that problem.

True, but some ports install scripts with problematic tr calls.

-- 
Jilles Tjoelker
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: tr A-Z a-z in locales other than C

2011-06-07 Thread Atom Smasher

the man page makes it clear...

 Translate the contents of file1 to upper-case.

   tr "[:lower:]" "[:upper:]" < file1

 (This should be preferred over the traditional UNIX idiom of ``tr a-z
A-Z'', since it works correctly in all locales.)


for any other uses, either build the port with locale specified as C as 
mentioned, or patch the port so:

tr '[a-z]' '[A-Z]'
 becomes:
env LC_ALL=C tr '[a-z]' '[A-Z]'

the only change that would be appropriate to the tr utility would be a 
command-line option to select a locale... something like:

tr -l C '[a-z]' '[A-Z]'

i don't think anyone would object to that, but it would still require 
patching some ports under some locales...


maybe another option would be modifying tr to recognize other [new] 
environment variables... TR_LANG, TR_LC_ALL, TR_LC_CTYPE and 
TR_LC_COLLATE. done that way, things could be set in /etc/make.conf (or 
sys.mk), not need any patching, and not interfere with other uses of 
locale.
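
As a rough illustration of that idea only: tr itself has no such variables,
so the sketch below emulates the proposal with a shell function that shadows
tr. TR_LC_ALL is the hypothetical variable name from the paragraph above,
and a function like this would only affect the shell that defines it, not
scripts run as separate processes.

```shell
#!/bin/sh
# Emulation of the proposed behaviour; TR_LC_ALL is a hypothetical
# variable that real tr implementations do not read.
tr() {
    if [ -n "${TR_LC_ALL:-}" ]; then
        # Override the locale for this one tr invocation only.
        LC_ALL=$TR_LC_ALL command tr "$@"
    else
        # No override requested: behave exactly like the real tr.
        command tr "$@"
    fi
}

TR_LC_ALL=C
printf 'MAKEFILE\n' | tr A-Z a-z
```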



--
...atom

 
 http://atom.smasher.org/
 762A 3B98 A3C3 96C9 C6B7 582A B88D 52E4 D9F5 7808



Re: tr A-Z a-z in locales other than C

2011-06-07 Thread Jilles Tjoelker
On Wed, Jun 08, 2011 at 09:56:39AM +1200, Atom Smasher wrote:
 the man page makes it clear...

   Translate the contents of file1 to upper-case.

 tr "[:lower:]" "[:upper:]" < file1

   (This should be preferred over the traditional UNIX idiom of ``tr a-z
   A-Z'', since it works correctly in all locales.)

 for any other uses, either build the port with locale specified as C as 
 mentioned, or patch the port so:
   tr '[a-z]' '[A-Z]'
   becomes:
   env LC_ALL=C tr '[a-z]' '[A-Z]'

 the only change that would be appropriate to the tr utility would be a 
 command-line option to select a locale... something like:
   tr -l C '[a-z]' '[A-Z]'

 i don't think anyone would object to that, but it would still require 
 patching some ports under some locales...

That new option would provide zero benefit. If things are going to be
patched anyway then patch them to be standards compliant.

 maybe another option would be modifying tr to recognize other [new] 
 environment variables... TR_LANG, TR_LC_ALL, TR_LC_CTYPE and 
 TR_LC_COLLATE. done that way, things could be set in /etc/make.conf (or 
 sys.mk), not need any patching, and not interfere with other uses of 
 locale.

That would be rather ugly.

If  tr a-z A-Z  is supposed to be deceiving in some locales, then let it
remain so unconditionally.

-- 
Jilles Tjoelker


Re: tr A-Z a-z in locales other than C

2011-06-07 Thread Atom Smasher

On Wed, 8 Jun 2011, Jilles Tjoelker wrote:

  maybe another option would be modifying tr to recognize other [new] 
  environment variables... TR_LANG, TR_LC_ALL, TR_LC_CTYPE and 
  TR_LC_COLLATE. done that way, things could be set in /etc/make.conf (or 
  sys.mk), not need any patching, and not interfere with other uses of 
  locale.

 That would be rather ugly.

 If tr a-z A-Z is supposed to be deceiving in some locales, then let it 
 remain so unconditionally.

=

it can still be as ugly as one wants it to be, and in some ports that 
might be fine. but this would provide a very simple way to rein in how 
ugly it is.



--
...atom

 


Re: tr A-Z a-z in locales other than C

2011-06-07 Thread Andrey Chernov
On Tue, Jun 07, 2011 at 11:17:12PM +0200, Jilles Tjoelker wrote:
 In FreeBSD, upper case sorts before lower case, so cases can be
 distinguished this way but all letters may require two ranges. In most
 other operating systems the cases go together so a single range is
 sufficient, but cases cannot be distinguished. Making such things work
 on multiple operating systems requires careful testing.

Such a thing can't work consistently on multiple operating systems by 
definition, because POSIX leaves it undefined here. So the best we can do is 
to concentrate on our system. No program should rely on that until POSIX 
defines it somehow.

  Moreover, having differently treated regex ranges in tr vs other places 
  you mention will produce additional chaos.
 
 I think this is already inconsistent because some programs do not enable
 locale or use different locale code.

I am saying the same thing: producing additional chaos does not mean 
bringing chaos from nowhere.
AFAIK nobody uses different locale code, but a different regex 
implementation is common.

  Back to the ports: it is not hard to run _any_ port's make or configure 
  with LANG=C directly by the ports Mk system to eliminate that problem.
 
 True, but some ports install scripts with problematic tr calls.

What does the count say? How many ports do that?

To summarize, I suggest considering two models:
1) Developer/programmer use: tr ranges in code order serve it well.
2) Working with national-language documents as an end user: tr ranges in 
code order serve it badly.

Sacrificing model 2) for 1) is not what we need if the number of such ports 
is low. If the number is significant, we can consider additional options, 
such as automatically searching for and replacing such tr calls via pkg-plist
(we already do similar scanning for security reasons).
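
A crude version of such a scan could look like the sketch below. It is an
assumption-laden illustration: find_tr_ranges is a hypothetical name, the
directory argument stands in for wherever a port's files are staged, and the
regular expression is a heuristic that will produce both false positives and
false negatives.

```shell
#!/bin/sh
# Hypothetical scan: list files under a directory that appear to call tr
# with a literal range such as a-z (heuristic, not a shell parser).
find_tr_ranges() {
    # -r: recurse, -l: print matching file names only, -E: extended regex.
    # Match "tr" at a command boundary, followed on the same command by
    # something that looks like X-Y.
    grep -rlE '(^|[|;&([:space:]])tr[[:space:]]+[^|;]*[[:alnum:]]-[[:alnum:]]' "$1"
}
```

Calling it as `find_tr_ranges /path/to/staged/files` prints one file name per
line, which gives the count asked about above.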

-- 
http://ache.vniz.net/


Re: tr A-Z a-z in locales other than C

2011-06-07 Thread perryh
Jilles Tjoelker jil...@stack.nl wrote:

 On Tue, Jun 07, 2011 at 04:24:43AM +0400, Andrey Chernov wrote:
...
  Back to the ports: it is not hard to run _any_ port's make
  or configure with LANG=C directly by the ports Mk system to
  eliminate that problem.

 True, but some ports install scripts with problematic tr calls.

So part of the porting effort may be to provide a patch that
prepends something along the lines of env LANG=C to tr calls in
those scripts.  It would surely not be the only kind of situation
in which a port needed to patch the ported code to get it to run
correctly on FreeBSD :)
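
Such a patch could be generated mechanically. The sketch below is only a
heuristic under stated assumptions: patch_tr_calls is a hypothetical name, it
uses LC_ALL rather than LANG so that other locale variables cannot override
it, and the pattern only recognises tr calls with a literal X-Y range on the
same command.

```shell
#!/bin/sh
# Hypothetical filter: read a script on stdin and prefix tr calls that use
# a literal range with "env LC_ALL=C". Lines that already set LC_ALL=C for
# tr are left alone. Only the first tr call on each line is rewritten.
patch_tr_calls() {
    sed -E '/LC_ALL=C[[:space:]]+tr[[:space:]]/!s/(^|[|;&([:space:]])tr([[:space:]]+[^|;]*[[:alnum:]]-[[:alnum:]])/\1env LC_ALL=C tr\2/'
}

printf '%s\n' 'cat f | tr a-z A-Z' | patch_tr_calls
```

Calls that use character classes such as '[:lower:]' pass through unchanged,
since those are meant to follow the user's locale.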


tr A-Z a-z in locales other than C

2011-06-06 Thread Jilles Tjoelker
A few years ago, when locale support was added to the tr utility,
character ranges (except ones containing one or two octal escapes) were
changed to use the collation order instead of the character code order.
At the time, this matched other implementations of tr and was apparently
somewhat generally accepted.

However, this behaviour is not intuitive, is not portable as it depends
heavily on the collation order, and it is very hard to find a practical use
for it. Perhaps there is a use case in EBCDIC locales that only contain
the 2*26 basic Latin letters, but that is rather exotic.

The command tr A-Z a-z may do something unexpected even if there is a
1:1 mapping between upper and lower case, since it also assumes that 'z'
is the last letter.

This is not a POSIX issue as POSIX leaves character ranges in tr
unspecified for locales other than the POSIX locale (except for ranges
containing octal escapes).

If there is no reason to keep using the collation order, I would like to
change tr's character ranges back to character codes. GNU tr does this
and many ports wrongly take advantage of it, so following it will reduce
the need to patch ports.
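
For illustration, this is what character-code ranges look like in the C
locale, where the behaviour is already specified. It is a small sketch:
ascii_lower is a hypothetical name, and the sample input is written with
octal escapes for the UTF-8 bytes of a capital E with acute accent so that
the example does not depend on the encoding of this message.

```shell
#!/bin/sh
# With code-order ranges, A-Z covers exactly the 26 codes from 'A' to 'Z'.
# Bytes outside that range, such as the two UTF-8 bytes \303\211, pass
# through untouched.
ascii_lower() {
    LC_ALL=C tr A-Z a-z
}

printf '\303\211COLE Z\n' | ascii_lower
```

The accented capital is left as it was while the basic Latin letters are
lowercased, which is the trade-off between the two models discussed in this
thread.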

The below patch demonstrates the new behaviour. The code could be
simplified more as the flags for octal escapes are no longer needed.

The man page may need some additional change as well. In particular, the
command
  tr '[:upper:]' '[:lower:]'
in a user's locale is a good choice for text specified by the user, but
a poor choice for doing case-insensitive comparisons of constant
strings, because in Turkish locales the upper case version of 'i' is a
capital I with dot and the lower case version of 'I' is a lower case i
without dot. In such cases,
  LC_ALL=C tr '[:upper:]' '[:lower:]'
may be a better option (A-Z a-z could be used at the cost of breaking
EBCDIC support).
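
A minimal sketch of such a comparison of constant strings follows.
same_keyword is a hypothetical helper name; the character classes are quoted
so that the shell does not treat them as glob patterns.

```shell
#!/bin/sh
# Compare two fixed keywords case-insensitively, independent of the user's
# locale: folding is done in the C locale, so 'I' always maps to 'i' even
# when the user's locale is Turkish.
same_keyword() {
    a=$(printf '%s' "$1" | LC_ALL=C tr '[:upper:]' '[:lower:]')
    b=$(printf '%s' "$2" | LC_ALL=C tr '[:upper:]' '[:lower:]')
    [ "$a" = "$b" ]
}

if same_keyword "INCLUDE" "include"; then
    echo "keywords match"
fi
```

Text supplied by the user is a different matter and is better folded in the
user's own locale, as described above.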

There is a related issue with ranges in regular expressions, glob and
fnmatch (likewise unspecified by POSIX outside the POSIX locale), but
this is less likely to cause problems.


Index: usr.bin/tr/tr.1
===
--- usr.bin/tr/tr.1 (revision 222648)
+++ usr.bin/tr/tr.1 (working copy)
@@ -31,7 +31,7 @@
 .\"	@(#)tr.1	8.1 (Berkeley) 6/6/93
 .\" $FreeBSD$
 .\"
-.Dd October 13, 2006
+.Dd June 6, 2011
 .Dt TR 1
 .Os
 .Sh NAME
@@ -158,12 +158,7 @@
 .Pp
 A backslash followed by any other character maps to that character.
 .It c-c
-For non-octal range endpoints
-represents the range of characters between the range endpoints, inclusive,
-in ascending order,
-as defined by the collation sequence.
-If either or both of the range endpoints are octal sequences, it
-represents the range of specific coded values between the
+A range represents the range of specific coded values between the
 range endpoints, inclusive.
 .Pp
 .Bf Em
@@ -309,20 +304,18 @@
 .Pp
 .Dl tr \*q[=e=]\*q \*qe\*q
 .Sh COMPATIBILITY
-Previous
-.Fx
-implementations of
-.Nm
-did not order characters in range expressions according to the current
-locale's collation order, making it possible to convert unaccented Latin
+Some implementations of
+.Nm ,
+including the ones in previous versions of
+.Fx ,
+order characters in range expressions according to the current
+locale's collation order, making it impossible to convert unaccented Latin
 characters (esp.\ as found in English text) from upper to lower case using
 the traditional
 .Ux
 idiom of
 .Dq Li tr A-Z a-z .
-Since
-.Nm
-now obeys the locale's collation order, this idiom may not produce
+In such implementations, this idiom may not produce
 correct results when there is not a 1:1 mapping between lower and
 upper case, or when the order of characters within the two cases differs.
 As noted in the
Index: usr.bin/tr/str.c
===
--- usr.bin/tr/str.c	(revision 222648)
+++ usr.bin/tr/str.c	(working copy)
@@ -260,37 +260,13 @@
 		stopval = wc;
 		s->str += clen;
 	}
-	/*
-	 * XXX Characters are not ordered according to collating sequence in
-	 * multibyte locales.
-	 */
-	if (octal || was_octal || MB_CUR_MAX > 1) {
-		if (stopval < s->lastch) {
-			s->str = savestart;
-			return (0);
-		}
-		s->cnt = stopval - s->lastch + 1;
-		s->state = RANGE;
-		--s->lastch;
-		return (1);
-	}
-	if (charcoll((const void *)&stopval, (const void *)&(s->lastch)) < 0) {
+	if (stopval < s->lastch) {
 		s->str = savestart;
 		return (0);
 	}
-	if ((s->set = p = malloc((NCHARS_SB + 1) * sizeof(int))) == NULL)
-		err(1, "genrange() malloc");
-	for (cnt = 0; cnt < NCHARS_SB; cnt++)
-		if (charcoll((const void *)&cnt, (const void *)&(s->lastch)) >= 0 &&
-		    charcoll((const void *)&cnt, (const void *)&stopval) <= 0)
-			*p++ = 

Re: tr A-Z a-z in locales other than C

2011-06-06 Thread Andrey Chernov
On Tue, Jun 07, 2011 at 12:41:05AM +0200, Jilles Tjoelker wrote:
 
 There is a related issue with ranges in regular expressions, glob and
 fnmatch (likewise unspecified by POSIX outside the POSIX locale), but
 this is less likely to cause problems.
 

You care about ports, but the suggested change is americano-centrism, which 
kills tr usage for national-language documents because it becomes impossible 
to specify a whole national alphabet easily, with just two letters.

Moreover, having differently treated regex ranges in tr vs other places 
you mention will produce additional chaos.

Back to the ports: it is not hard to run _any_ port's make or configure 
with LANG=C directly by the ports Mk system to eliminate that problem.

-- 
http://ache.vniz.net/