Re: [coreutils] tr: case mapping anomaly
Pádraig Brady wrote: On 25/09/10 07:53, Jim Meyering wrote: Eric Blake wrote: On 09/24/2010 04:47 PM, Pádraig Brady wrote: I was just looking at a bug reported to fedora there where this abort()s $ LC_ALL=en_US tr '[:upper:] ' '[:lower:]' Ouch! Thanks for reporting it here. How many more bugs lurk in tr... Consolation: this one is a failure to diagnose invalid inputs. I found a few more issues: Nice catches. Thanks for all the work. Please mention the above abort-inducing case in the log along with the other three. While you're at it, please use LC_ALL= there, not LANG=, since the latter is a glibc-ism. This valid translation spec aborted: LANG=en_US tr '[:upper:]- ' '[:lower:]_' The above does not abort w/glibc when LC_ALL=C happens to be set. This misaligned conversion spec was allowed: LANG=C tr 'A-Y[:lower:]' 'a-z[:upper:]' This misaligned spec was allowed by extending the class: LANG=C tr '[:upper:] ' '[:lower:]' ... + tr: fix various issues with case conversion classes. In some locales + valid conversion specifications caused tr to abort, while some invalid + specifications were undiagnosed. + [bugs introduced in coreutils 6.9.90 and 6.9.92] Two separate bugs! Thanks for tracking them down. I'm surprised they are so recent. I suppose that means I introduced both. Let's see... These two seem like the only candidates: 6efd10462d8103208f4575f0b5edddf841c7d87c af5d0c363a52e787a4311a11f035209eecdc4115 Whichever they are, please list them in the commit log. ** New features cp now accepts the --attributes-only option to not copy file data, diff --git a/src/tr.c b/src/tr.c index a5b6810..3adc85f 100644 --- a/src/tr.c +++ b/src/tr.c @@ -1177,6 +1177,77 @@ card_of_complement (struct Spec_list *s) return cardinality; } +/* Discard the lengths associated with a case conversion, + as using the actual number of upper or lower case characters + is problematic when they don't match in some locales. + Also ensure the case conversion classes in string2 are + aligned correctly with those in string1. + Note POSIX says the behavior of `tr [:upper:] [:upper:]' + is undefined. Therefore we allow it (unlike Solaris) + and treat it as a no-op. */ + +static void +validate_case_classes (struct Spec_list *s1, struct Spec_list *s2) +{ + size_t upper_chars = 0; + size_t lower_chars = 0; Please name these n_upper and n_lower, so each is obviously a counter. Otherwise, seeing the name out of context might evoke an array. + int i; Please make this unsigned int, since it never needs negative values. + int c1=0, c2=0; These days, I prefer one per line, and with spaces: int c1 = 0; int c2 = 0; Hmm.. I see you're following existing style that did int c1, c2;, though in existing code, those didn't have initializers. And the int i; was essentially moved. And there are others. I blame the author ;-) (me, 18 years ago) + count old_s1_len = s1-length; + count old_s2_len = s2-length; + struct List_element *s1_tail = s1-tail; + struct List_element *s2_tail = s2-tail; + bool s1_new_element = true; + bool s2_new_element = true; ... + +# Before coreutils 8.6 the disparat number of upper and lower s/disparat/disparate/ +# characters in some locales, triggered abort()s and invalid behavior +if test $(LC_ALL=en_US.ISO-8859-1 locale charmap 2/dev/null) = ISO-8859-1; +then + ( No big deal, but why run this in a subshell? It doesn't seem necessary here. +export LC_ALL=en_US.ISO-8859-1 +tr '[:upper:] ' '[:lower:]' /dev/null 2out _fail=1 +echo 'tr: when translating with string1 longer than string2, +the latter string must not end with a character class' exp +compare out exp || _fail=1 + +# Ensure when there are a different number of elements +# in each string, we validate the case mapping correctly +tr 'ab[:lower:]' '0-1[:upper:]' /dev/null || _fail=1 It might be nice to ensure that e.g., abc.xyz maps to ABC.XYZ. +# Ensure we extend string2 appropriately +tr '[:upper:]- ' '[:lower:]_' /dev/null || _fail=1 Similarly, that ABC-XYZ maps to abc_xyz Thanks again! +# Ensure the size of the case classes are accounted +# for as a unit. +echo 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' | +tr '[:upper:]A-B' '[:lower:]0' out || _fail=1 +echo '00cdefghijklmnopqrstuvwxyz' exp +compare out exp || _fail=1 ...
Re: [coreutils] tr: case mapping anomaly
On 29/09/10 07:45, Jim Meyering wrote: Pádraig Brady wrote: +# characters in some locales, triggered abort()s and invalid behavior +if test $(LC_ALL=en_US.ISO-8859-1 locale charmap 2/dev/null) = ISO-8859-1; +then + ( No big deal, but why run this in a subshell? It doesn't seem necessary here. Yes that is overkill in this case. I had copied the sub-shell bit from my locally improved join-i18n test. I've just pushed that. cheers, Pádraig.
Re: [coreutils] tr: case mapping anomaly
On 09/29/2010 06:40 AM, Pádraig Brady wrote: +# Ensure the size of the case classes are accounted +# for as a unit. +echo 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' | +tr '[:upper:]A-B' '[:lower:]0'out || _fail=1 +echo '00cdefghijklmnopqrstuvwxyz' exp Huh? A and B are both in [:upper:]; when a character is listed more than once in string1, it is only transliterated according to the first listing. I think this should be 'abc...' not '00c...' for the expected results. Does POSIX specify that? Hmm - POSIX appears to be silent for both tr and m4's translit (but whereas 'tr long short' is unspecified and we extend the last byte of short to match, m4's translit(data,long,short) is explicitly documented as deleting bytes from long with no match in short). That's not what we do, nor what I would expect. $ echo 'A' | LANG=C tr 'AA' '01' 1 And Solaris' tr does this as well. But m4 behaves in the way I specified: $ echo 'translit(a,aa,01)' | m4 0 $ echo 'translit(a,aa,01)' | /usr/ccs/bin/m4 0 Time for me to ask for clarification from the Austin Group, I suppose. But given existing practice, I guess the argument should be that tr is explicitly different than m4. -- Eric Blake ebl...@redhat.com+1-801-349-2682 Libvirt virtualization library http://libvirt.org
Re: [coreutils] tr: case mapping anomaly
On 25/09/10 00:22, Eric Blake wrote: On 09/24/2010 04:47 PM, Pádraig Brady wrote: I was just looking at a bug reported to fedora there where this abort()s $ LC_ALL=en_US tr '[:upper:] ' '[:lower:]' Behavior is already unspecified by POSIX when string1 is longer than string2. But given what POSIX does say: When both the -d and -s options are specified, any of the character class names shall be accepted in string2. Otherwise, only character class names lower or upper are valid in string2 and then only if the corresponding character class ( upper and lower, respectively) is specified in the same relative position in string1. Such a specification shall be interpreted as a request for case conversion. When [: lower:] appears in string1 and [: upper:] appears in string2, the arrays shall contain the characters from the toupper mapping in the LC_CTYPE category of the current locale. When [: upper:] appears in string1 and [: lower:] appears in string2, the arrays shall contain the characters from the tolower mapping in the LC_CTYPE category of the current locale. The first character from each mapping pair shall be in the array for string1 and the second character from each mapping pair shall be in the array for string2 in the same relative position. Except for case conversion, the characters specified by a character class expression shall be placed in the array in an unspecified order. ... However, in a case conversion, as described previously, such as: tr -s '[:upper:]' '[:lower:]' the last operand's array shall contain only those characters defined as the second characters in each of the toupper or tolower character pairs, as appropriate. I interpret this to mean that even though there are 59 lower and 56 upper in en_US, there are a fixed number of toupper case-mapping pairs, and there are likewise a fixed number of tolower case-mapping pairs. Therefore, [:upper:] and [:lower:] expand to the same number of array entries (whether that is 59 pairs or 56 pairs is irrelevant), and mappings like tr '[:lower:] ' '[:upper:]_' must unambiguously convert space to underscore and also guarantee that no lower-case letter becomes an underscore. Thanks for digging up the relevant POSIX bits. Yes I agree that '[:lower:]' '[:upper:]' should be treated as a unit and not leak into adjacent elements. Your question is basically what should we do on the unspecified behavior of '[:lower:] ' '[:upper:]', where string1 is longer than string2, since that falls outside the bounds of POSIX. I.E. 0xDE (the last upper char) is output from: $ echo _ _ | LC_ALL=en_US ./src/tr '[:lower:] ' '[:upper:]' That matches the behavior we choose in all other instances where string1 is longer than string2, where GNU tr follows BSD behavior of padding the last character of string2 to meet the length of string1. But, since POSIX is clear that the order of [:upper:] mappings is unspecified, I agree that it is not a good guarantee to the user of which byte gets duplicated to fill out the conversion, and we are better off rejecting that attempted usage. That seems quite inconsistent given that other classes are not allowed in string 2 when translating: $ echo ab . | LANG=en_US tr '[:digit:]' '[:alpha:]' tr: when translating, the only character classes that may appear in string2 are `upper' and `lower' For consistency I think it better to keep the classes in string 2 just for case mapping, and do something like: $ tr '[:upper:] ' '[:lower:]' tr: when not truncating set1, a character class can't be the last entity in string2 I'd rather see it phrased: When string2 is shorter than string1, a character class can't be the last entity in string2. OK. That is a bit clearer. Note BSD allows extending the above, but that's at least consistent with any class being allowed in string2. I.E. this is disallowed by coreutils but Ok on BSD: $ echo 1 2 | LC_ALL=en_US.iso-8859-1 tr ' ' '[:alpha:]' 1A2 The BSD behavior violates an explicit POSIX wording; we can't do an extension like that without either turning on a POSIXLY_CORRECT check or adding a command line option, neither of which I think is necessary. So I see no reason to copy the BSD behavior of allowing any character class. Yes I agree. I was just pointing out what BSD does here. Is it OK to change tr like this? I can't see anything depending on that. Seems reasonable to me, once we decide on the error message wording. Great, I'll change it as above. cheers, Pádraig.
Re: [coreutils] tr: case mapping anomaly
Eric Blake wrote: On 09/24/2010 04:47 PM, Pádraig Brady wrote: I was just looking at a bug reported to fedora there where this abort()s $ LC_ALL=en_US tr '[:upper:] ' '[:lower:]' Ouch! Thanks for reporting it here. How many more bugs lurk in tr... Consolation: this one is a failure to diagnose invalid inputs. ... I interpret this to mean that even though there are 59 lower and 56 upper in en_US, there are a fixed number of toupper case-mapping pairs, and there are likewise a fixed number of tolower case-mapping pairs. Therefore, [:upper:] and [:lower:] expand to the same number of array entries (whether that is 59 pairs or 56 pairs is irrelevant), and mappings like tr '[:lower:] ' '[:upper:]_' must unambiguously convert space to underscore and also guarantee that no lower-case letter becomes an underscore. Your question is basically what should we do on the unspecified behavior of '[:lower:] ' '[:upper:]', where string1 is longer than string2, since that falls outside the bounds of POSIX. Right. I.E. 0xDE (the last upper char) is output from: $ echo _ _ | LC_ALL=en_US ./src/tr '[:lower:] ' '[:upper:]' That matches the behavior we choose in all other instances where string1 is longer than string2, where GNU tr follows BSD behavior of padding the last character of string2 to meet the length of string1. But, since POSIX is clear that the order of [:upper:] mappings is unspecified, I agree that it is not a good guarantee to the user of which byte gets duplicated to fill out the conversion, and we are better off rejecting that attempted usage. That seems quite inconsistent given that other classes are not allowed in string 2 when translating: $ echo ab . | LANG=en_US tr '[:digit:]' '[:alpha:]' tr: when translating, the only character classes that may appear in string2 are `upper' and `lower' For consistency I think it better to keep the classes in string 2 just for case mapping, and do something like: $ tr '[:upper:] ' '[:lower:]' tr: when not truncating set1, a character class can't be the last entity in string2 I'd rather see it phrased: When string2 is shorter than string1, a character class can't be the last entity in string2. Thanks, I find it easier to read when string1 and string2 are listed in order -- and this applies only when translating. How about this? When translating with string1 longer than string2, the latter string must not end with a character class. Note BSD allows extending the above, but that's at least consistent with any class being allowed in string2. I.E. this is disallowed by coreutils but Ok on BSD: $ echo 1 2 | LC_ALL=en_US.iso-8859-1 tr ' ' '[:alpha:]' 1A2 The BSD behavior violates an explicit POSIX wording; we can't do an extension like that without either turning on a POSIXLY_CORRECT check or adding a command line option, neither of which I think is necessary. So I see no reason to copy the BSD behavior of allowing any character class. Yes. I deliberately opted not to provide the BSD behavior, because it cannot be portable. Is it OK to change tr like this? I can't see anything depending on that. Seems reasonable to me, once we decide on the error message wording. Yes. Thanks for bringing this up and dealing with it.