Re: tr A-Z a-z in locales other than C
On Tue, Jun 07, 2011 at 04:24:43AM +0400, Andrey Chernov wrote:
> On Tue, Jun 07, 2011 at 12:41:05AM +0200, Jilles Tjoelker wrote:
> > There is a related issue with ranges in regular expressions, glob
> > and fnmatch (likewise unspecified by POSIX outside the POSIX
> > locale), but this is less likely to cause problems.

> You care about ports, but suggested change is americano-centrism
> which kills tr usage for national language documents due to
> impossibility to specify whole national alphabet easily, just by two
> letters.

Hmm, so that's with translation to a constant, or with the -d and/or -s
options. In such cases, there may be a range for all letters with
collation order, but not with codeset order (mainly if all letters
includes letters with diacritical marks).

In FreeBSD, upper case sorts before lower case, so cases can be
distinguished this way but all letters may require two ranges. In most
other operating systems the cases go together so a single range is
sufficient, but cases cannot be distinguished. Making such things work
on multiple operating systems requires careful testing.

> Moreover, having differently treated regex ranges in tr vs other
> places you mention will produce additional chaos.

I think this is already inconsistent because some programs do not
enable locale or use different locale code. With UTF-8 or other
multibyte character sets, this is even more so because functions like
isalpha work very poorly by definition and there is no collation
support for such character sets in FreeBSD.

> Back to the ports: it is not hard to run _any_ port's make or
> configure with LANG=C directly by the ports Mk system to eliminate
> that problem.

True, but some ports install scripts with problematic tr calls.

-- 
Jilles Tjoelker
_______________________________________________
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: tr A-Z a-z in locales other than C
the man page makes it clear...

     Translate the contents of file1 to upper-case.

           tr "[:lower:]" "[:upper:]" < file1

     (This should be preferred over the traditional UNIX idiom of
     ``tr a-z A-Z'', since it works correctly in all locales.)

for any other uses, either build the port with locale specified as C as
mentioned, or patch the port so:
	tr '[a-z]' '[A-Z]'
becomes:
	env LC_ALL=C tr '[a-z]' '[A-Z]'

the only change that would be appropriate to the tr utility would be a
command-line option to select a locale... something like:
	tr -l C '[a-z]' '[A-Z]'

i don't think anyone would object to that, but it would still require
patching some ports under some locales...

maybe another option would be modifying tr to recognize other [new]
environment variables... TR_LANG, TR_LC_ALL, TR_LC_CTYPE and
TR_LC_COLLATE. done that way, things could be set in /etc/make.conf (or
sys.mk), not need any patching, and not interfere with other uses of
locale.

-- 
...atom
http://atom.smasher.org/
762A 3B98 A3C3 96C9 C6B7 582A B88D 52E4 D9F5 7808

	"We in the West must bear in mind that the poor countries are
	 poor primarily because we have exploited them through
	 political or economic colonialism."
		-- Martin Luther King, Jr
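A minimal sketch of the "env LC_ALL=C tr" patching approach described above, assuming a POSIX shell. The function name ascii_upper is hypothetical and is not part of tr or of any port. Setting LC_ALL=C for the single tr invocation gives the range a-z its character-code meaning, which POSIX does specify for the POSIX locale, while the rest of the script keeps the caller's locale.

```shell
#!/bin/sh
# ascii_upper: hypothetical helper name, used here only for illustration.
# Upper-cases the 26 basic Latin letters read from stdin; every other
# byte passes through unchanged.
# LC_ALL=C applies to this one tr process only, so the caller's locale
# stays in effect for everything else in the script.
ascii_upper() {
	LC_ALL=C tr 'a-z' 'A-Z'
}

printf 'freebsd-hackers\n' | ascii_upper
```

Wrapping the call in one place like this keeps a port patch small: the script's other tr calls can be switched to the helper without touching the user's environment.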
Re: tr A-Z a-z in locales other than C
On Wed, Jun 08, 2011 at 09:56:39AM +1200, Atom Smasher wrote:
> the man page makes it clear...

>      Translate the contents of file1 to upper-case.

>            tr "[:lower:]" "[:upper:]" < file1

>      (This should be preferred over the traditional UNIX idiom of
>      ``tr a-z A-Z'', since it works correctly in all locales.)

> for any other uses, either build the port with locale specified as C
> as mentioned, or patch the port so:
> 	tr '[a-z]' '[A-Z]'
> becomes:
> 	env LC_ALL=C tr '[a-z]' '[A-Z]'

> the only change that would be appropriate to the tr utility would be
> a command-line option to select a locale... something like:
> 	tr -l C '[a-z]' '[A-Z]'

> i don't think anyone would object to that, but it would still require
> patching some ports under some locales...

That new option would provide zero benefit. If things are going to be
patched anyway then patch them to be standards compliant.

> maybe another option would be modifying tr to recognize other [new]
> environment variables... TR_LANG, TR_LC_ALL, TR_LC_CTYPE and
> TR_LC_COLLATE. done that way, things could be set in /etc/make.conf
> (or sys.mk), not need any patching, and not interfere with other uses
> of locale.

That would be rather ugly. If tr a-z A-Z is supposed to be deceiving in
some locales, then let it remain so unconditionally.

-- 
Jilles Tjoelker
Re: tr A-Z a-z in locales other than C
On Wed, 8 Jun 2011, Jilles Tjoelker wrote:

>> maybe another option would be modifying tr to recognize other [new]
>> environment variables... TR_LANG, TR_LC_ALL, TR_LC_CTYPE and
>> TR_LC_COLLATE. done that way, things could be set in /etc/make.conf
>> (or sys.mk), not need any patching, and not interfere with other
>> uses of locale.
>
> That would be rather ugly. If tr a-z A-Z is supposed to be deceiving
> in some locales, then let it remain so unconditionally.

it can still be as ugly as one wants it to be, and in some ports that
might be fine. but this option would provide a very simple way to rein
in how ugly it is.

-- 
...atom
http://atom.smasher.org/
762A 3B98 A3C3 96C9 C6B7 582A B88D 52E4 D9F5 7808

	"The livestock sector is a major player [in climate change],
	 responsible for 18% of greenhouse gas emissions measured in
	 CO2 equivalent. This is a higher share than transport."
		-- Livestock's long shadow, 2006
		   UN report sponsored by WTO, EU, AS-AID, FAO, et al
Re: tr A-Z a-z in locales other than C
On Tue, Jun 07, 2011 at 11:17:12PM +0200, Jilles Tjoelker wrote:
> In FreeBSD, upper case sorts before lower case, so cases can be
> distinguished this way but all letters may require two ranges. In
> most other operating systems the cases go together so a single range
> is sufficient, but cases cannot be distinguished. Making such things
> work on multiple operating systems requires careful testing.

Such a thing can't work consistently on multiple operating systems by
definition, because POSIX leaves it undefined here. So the best we can
do is to concentrate on our own system. No program should rely on that
until POSIX defines it somehow.

> > Moreover, having differently treated regex ranges in tr vs other
> > places you mention will produce additional chaos.

> I think this is already inconsistent because some programs do not
> enable locale or use different locale code.

I say the same: producing additional chaos is not bringing chaos from
nowhere. AFAIK nobody uses different locale code, but different regex
implementations are common.

> > Back to the ports: it is not hard to run _any_ port's make or
> > configure with LANG=C directly by the ports Mk system to eliminate
> > that problem.

> True, but some ports install scripts with problematic tr calls.

What does the count say: how many ports do that?

Summarizing, I suggest considering two models:

1) Developer/programmer etc.: tr coderange does good for it.
2) Working with national language docs/end user: tr coderange does bad
   for it.

Sacrificing model 2) for 1) is not what we need if the number of such
ports is low. If the number of such ports is significant, we can
consider additional options, like automatically searching for and
replacing such tr calls through pkg-plist (we already do similar
scanning for security reasons).

-- 
http://ache.vniz.net/
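The scanning idea at the end of the message above could be prototyped roughly as follows. This is only a sketch under stated assumptions: find_tr_ranges is a hypothetical helper name, the pattern is a heuristic that can miss or over-report calls, and it is not an existing part of the ports infrastructure. It reads script text on stdin and reports lines that appear to call tr with a letter or digit range.

```shell
#!/bin/sh
# find_tr_ranges: hypothetical helper, a heuristic and not an existing
# ports tool. Reads script text on stdin and prints, with line numbers,
# the lines that appear to call tr with a range such as a-z or 0-9.
# The pattern uses character classes rather than ranges, so the scan
# itself does not depend on the collation order; LC_ALL=C pins it further.
find_tr_ranges() {
	LC_ALL=C grep -nE '(^|[^[:alnum:]_])tr[[:space:]]+[^|;]*[[:alnum:]]-[[:alnum:]]'
}

# Example input: the first line uses a range, the second uses classes,
# so only the first line is reported.
printf '%s\n' \
	"echo \$name | tr A-Z a-z" \
	"echo \$name | tr '[:upper:]' '[:lower:]'" |
	find_tr_ranges
```

A count of the reported lines across installed scripts would give a first answer to the "how many ports do that?" question before deciding between the two models.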
Re: tr A-Z a-z in locales other than C
Jilles Tjoelker <jil...@stack.nl> wrote:
> On Tue, Jun 07, 2011 at 04:24:43AM +0400, Andrey Chernov wrote:
> ...
> > Back to the ports: it is not hard to run _any_ port's make or
> > configure with LANG=C directly by the ports Mk system to eliminate
> > that problem.
>
> True, but some ports install scripts with problematic tr calls.

So part of the porting effort may be to provide a patch that prepends
something along the lines of "env LANG=C" to tr calls in those
scripts. It would surely not be the only kind of situation in which a
port needed to patch the ported code to get it to run correctly on
FreeBSD :)
tr A-Z a-z in locales other than C
A few years ago, when locale support was added to the tr utility,
character ranges (except ones containing one or two octal escapes) were
changed to use the collation order instead of the character code order.
At the time, this matched other implementations of tr and was
apparently somewhat generally accepted.

However, this behaviour is not intuitive, not portable as it deeply
depends on the collation order and it is very hard to find a useful use
for it. Perhaps there is a use case in EBCDIC locales that only contain
the 2*26 basic Latin letters, but that is rather exotic. The command
tr A-Z a-z may do something unexpected even if there is a 1:1 mapping
between upper and lower case, since it also assumes that 'z' is the
last letter.

This is not a POSIX issue as POSIX leaves character ranges in tr
unspecified for locales other than the POSIX locale (except for ranges
containing octal escapes).

If there is no reason to keep using the collation order, I would like
to change tr's character ranges back to character codes. GNU tr does
this and many ports wrongly take advantage of it, so following it will
reduce the need to patch ports.

The below patch demonstrates the new behaviour. The code could be
simplified more as the flags for octal escapes are no longer needed.

The man page may need some additional change as well. In particular,
the command tr [:upper:] [:lower:] in a user's locale is a good choice
for text specified by the user, but a poor choice for doing
case-insensitive comparisons of constant strings, because in Turkish
locales the upper case version of 'i' is a capital I with dot and the
lower case version of 'I' is a lower case i without dot. In such cases,
LC_ALL=C tr [:upper:] [:lower:] may be a better option (A-Z a-z could
be used at the cost of breaking EBCDIC support).

There is a related issue with ranges in regular expressions, glob and
fnmatch (likewise unspecified by POSIX outside the POSIX locale), but
this is less likely to cause problems.

Index: usr.bin/tr/tr.1
===================================================================
--- usr.bin/tr/tr.1	(revision 222648)
+++ usr.bin/tr/tr.1	(working copy)
@@ -31,7 +31,7 @@
 .\"     @(#)tr.1	8.1 (Berkeley) 6/6/93
 .\" $FreeBSD$
 .\"
-.Dd October 13, 2006
+.Dd June 6, 2011
 .Dt TR 1
 .Os
 .Sh NAME
@@ -158,12 +158,7 @@
 .Pp
 A backslash followed by any other character maps to that character.
 .It c-c
-For non-octal range endpoints
-represents the range of characters between the range endpoints, inclusive,
-in ascending order,
-as defined by the collation sequence.
-If either or both of the range endpoints are octal sequences, it
-represents the range of specific coded values between the
+A range represents the range of specific coded values between the
 range endpoints, inclusive.
 .Pp
 .Bf Em
@@ -309,20 +304,18 @@
 .Pp
 .Dl tr \*q[=e=]\*q \*qe\*q
 .Sh COMPATIBILITY
-Previous
-.Fx
-implementations of
-.Nm
-did not order characters in range expressions according to the current
-locale's collation order, making it possible to convert unaccented Latin
+Some implementations of
+.Nm ,
+including the ones in previous versions of
+.Fx ,
+order characters in range expressions according to the current
+locale's collation order, making it impossible to convert unaccented Latin
 characters (esp.\& as found in English text) from upper to lower case using
 the traditional
 .Ux
 idiom of
 .Dq Li tr A-Z a-z .
-Since
-.Nm
-now obeys the locale's collation order, this idiom may not produce
+In such implementations, this idiom may not produce
 correct results when there is not a 1:1 mapping between lower and
 upper case, or when the order of characters within the two cases differs.
 As noted in the
Index: usr.bin/tr/str.c
===================================================================
--- usr.bin/tr/str.c	(revision 222648)
+++ usr.bin/tr/str.c	(working copy)
@@ -260,37 +260,13 @@
 		stopval = wc;
 		s->str += clen;
 	}
-	/*
-	 * XXX Characters are not ordered according to collating sequence in
-	 * multibyte locales.
-	 */
-	if (octal || was_octal || MB_CUR_MAX > 1) {
-		if (stopval < s->lastch) {
-			s->str = savestart;
-			return (0);
-		}
-		s->cnt = stopval - s->lastch + 1;
-		s->state = RANGE;
-		--s->lastch;
-		return (1);
-	}
-	if (charcoll((const void *)&stopval, (const void *)&(s->lastch)) < 0) {
+	if (stopval < s->lastch) {
 		s->str = savestart;
 		return (0);
 	}
-	if ((s->set = p = malloc((NCHARS_SB + 1) * sizeof(int))) == NULL)
-		err(1, "genrange() malloc");
-	for (cnt = 0; cnt < NCHARS_SB; cnt++)
-		if (charcoll((const void *)&cnt, (const void *)&(s->lastch)) >= 0 &&
-		    charcoll((const void *)&cnt, (const void *)&stopval) <= 0)
-			*p++ =
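
The point in the message above about case-insensitive comparison of constant strings can be sketched as follows, assuming a POSIX shell. The helper name same_keyword is hypothetical and exists only for this illustration. Both operands are folded with the C locale's [:upper:] to [:lower:] mapping, so the comparison does not change under a Turkish locale, where the user-locale lower case of 'I' is a dotless i.

```shell
#!/bin/sh
# same_keyword: hypothetical helper name, for illustration only.
# Compares two constant ASCII strings case-insensitively.
# LC_ALL=C is set for each tr process, so the fold always uses the C
# locale's case mapping regardless of the user's LANG/LC_* settings.
same_keyword() {
	a=$(printf '%s' "$1" | LC_ALL=C tr '[:upper:]' '[:lower:]')
	b=$(printf '%s' "$2" | LC_ALL=C tr '[:upper:]' '[:lower:]')
	[ "$a" = "$b" ]
}

if same_keyword 'INCLUDE' 'include'; then
	echo 'match'
fi
```

Text supplied by the user is a different case: there the user's own locale and plain tr '[:upper:]' '[:lower:]' remain the better fit, as the message notes.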
Re: tr A-Z a-z in locales other than C
On Tue, Jun 07, 2011 at 12:41:05AM +0200, Jilles Tjoelker wrote:
> There is a related issue with ranges in regular expressions, glob and
> fnmatch (likewise unspecified by POSIX outside the POSIX locale), but
> this is less likely to cause problems.

You care about ports, but suggested change is americano-centrism which
kills tr usage for national language documents due to impossibility to
specify whole national alphabet easily, just by two letters.

Moreover, having differently treated regex ranges in tr vs other places
you mention will produce additional chaos.

Back to the ports: it is not hard to run _any_ port's make or configure
with LANG=C directly by the ports Mk system to eliminate that problem.

-- 
http://ache.vniz.net/