On 25/09/10 00:22, Eric Blake wrote:
> On 09/24/2010 04:47 PM, Pádraig Brady wrote:
>> I was just looking at a bug reported to fedora there where this abort()s
>>
>> $ LC_ALL=en_US tr '[:upper:] ' '[:lower:]'
>
> Behavior is already unspecified by POSIX when string1 is longer than
> string2. But given what POSIX does say:
>
> "When both the -d and -s options are specified, any of the character
> class names shall be accepted in string2. Otherwise, only character
> class names lower or upper are valid in string2 and then only if the
> corresponding character class (upper and lower, respectively) is
> specified in the same relative position in string1. Such a specification
> shall be interpreted as a request for case conversion. When [:lower:]
> appears in string1 and [:upper:] appears in string2, the arrays shall
> contain the characters from the toupper mapping in the LC_CTYPE category
> of the current locale. When [:upper:] appears in string1 and [:lower:]
> appears in string2, the arrays shall contain the characters from the
> tolower mapping in the LC_CTYPE category of the current locale. The
> first character from each mapping pair shall be in the array for string1
> and the second character from each mapping pair shall be in the array
> for string2 in the same relative position.
>
> Except for case conversion, the characters specified by a character
> class expression shall be placed in the array in an unspecified order.
> ...
>
> However, in a case conversion, as described previously, such as:
>
> tr -s '[:upper:]' '[:lower:]'
>
> the last operand's array shall contain only those characters defined as
> the second characters in each of the toupper or tolower character pairs,
> as appropriate."
>
> I interpret this to mean that even though there are 59 lower and 56
> upper in en_US, there are a fixed number of toupper case-mapping pairs,
> and there are likewise a fixed number of tolower case-mapping pairs.
> Therefore, [:upper:] and [:lower:] expand to the same number of array
> entries (whether that is 59 pairs or 56 pairs is irrelevant), and
> mappings like "tr '[:lower:] ' '[:upper:]_'" must unambiguously convert
> space to underscore and also guarantee that no lower-case letter becomes
> an underscore.
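
To make the pairing argument concrete, here is a minimal Python sketch. It
is not how GNU tr is implemented. It assumes Latin-1 code points and uses
Python's own Unicode case tables as a stand-in for the locale's toupper and
tolower mappings; case_pairs and tr are hypothetical helper names. Because
both arrays are built one mapping pair at a time, they have the same length
by construction, so a literal that follows the class in string1 lines up
with the literal that follows the class in string2.

```python
def case_pairs(to_upper=True):
    """Build (string1, string2) from single-character case-mapping pairs.

    Assumption: Latin-1 code points, with Python's str.upper()/str.lower()
    standing in for the locale's toupper/tolower tables.
    """
    src, dst = [], []
    for code in range(256):
        ch = chr(code)
        mapped = ch.upper() if to_upper else ch.lower()
        # Keep only real one-to-one mappings that stay inside Latin-1.
        if len(mapped) == 1 and mapped != ch and ord(mapped) < 256:
            src.append(ch)    # first character of the pair -> string1 array
            dst.append(mapped)  # second character of the pair -> string2 array
    return "".join(src), "".join(dst)


def tr(text, string1, string2):
    """Translate text, mapping string1[i] to string2[i]."""
    if len(string1) != len(string2):
        raise ValueError("string1 and string2 must have the same length")
    return text.translate(str.maketrans(string1, string2))


lower, upper = case_pairs(to_upper=True)
# Equivalent in spirit to: tr '[:lower:] ' '[:upper:]_'
print(tr("a b", lower + " ", upper + "_"))  # A_B
```

Characters with no one-to-one mapping are simply left out of both arrays
in this sketch, which is why the count of lower or upper characters in the
locale does not matter to the alignment.
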

Thanks for digging up the relevant POSIX bits.
Yes I agree that '[:lower:]' '[:upper:]' should be treated as a unit
and not leak into adjacent elements.

>
> Your question is basically what should we do on the unspecified behavior
> of '[:lower:] ' '[:upper:]', where string1 is longer than string2, since
> that falls outside the bounds of POSIX.
>
>> I.E. 0xDE (the last upper char) is output from:
>>
>> $ echo "_ _" | LC_ALL=en_US ./src/tr '[:lower:] ' '[:upper:]'
>
> That matches the behavior we choose in all other instances where string1
> is longer than string2, where GNU tr follows BSD behavior of padding the
> last character of string2 to meet the length of string1.
>
> But, since POSIX is clear that the order of [:upper:] mappings is
> unspecified, I agree that it is not a good guarantee to the user of
> which byte gets duplicated to fill out the conversion, and we are better
> off rejecting that attempted usage.
>
>>
>> That seems quite inconsistent given that other classes
>> are not allowed in string 2 when translating:
>>
>> $ echo "ab ." | LANG=en_US tr '[:digit:]' '[:alpha:]'
>> tr: when translating, the only character classes that may appear in
>> string2 are `upper' and `lower'
>>
>> For consistency I think it better to keep the classes
>> in string 2 just for case mapping, and do something like:
>>
>> $ tr '[:upper:] ' '[:lower:]'
>> tr: when not truncating set1, a character class can't be
>> the last entity in string2
>
> I'd rather see it phrased:
>
> When string2 is shorter than string1, a character class can't be the
> last entity in string2.

OK. That is a bit clearer.

>> Note BSD allows extending the above, but that's at least
>> consistent with any class being allowed in string2.
>> I.E. this is disallowed by coreutils but Ok on BSD:
>>
>> $ echo "1 2" | LC_ALL=en_US.iso-8859-1 tr ' ' '[:alpha:]'
>> 1A2
>
> The BSD behavior violates an explicit POSIX wording; we can't do an
> extension like that without either turning on a POSIXLY_CORRECT check or
> adding a command line option, neither of which I think is necessary. So
> I see no reason to copy the BSD behavior of allowing any character class.

Yes I agree. I was just pointing out what BSD does here.

>> Is it OK to change tr like this?
>> I can't see anything depending on that.
>
> Seems reasonable to me, once we decide on the error message wording.

Great, I'll change it as above.

cheers,
Pádraig.
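
For reference, the padding behaviour and the proposed rejection discussed
in this thread can be sketched roughly as below. This is an illustration
in Python, not the coreutils code. pad_string2 and check_and_pad are
hypothetical names, and the string2_ends_with_class flag is an assumption
of the sketch: it stands in for whatever the real parser knows about the
last entity of string2, since the sketch only sees already expanded arrays.

```python
def pad_string2(string1, string2):
    """Extend string2 to the length of string1 by repeating its last character."""
    if not string2:
        raise ValueError("string2 must not be empty")
    if len(string2) < len(string1):
        string2 += string2[-1] * (len(string1) - len(string2))
    return string2


def check_and_pad(string1, string2, string2_ends_with_class):
    """Reject padding when it would duplicate a character from a class.

    The order of a class expansion is unspecified, so the character that
    would be repeated is not something the user can rely on.
    """
    if len(string2) < len(string1) and string2_ends_with_class:
        raise ValueError(
            "when string2 is shorter than string1, a character class "
            "can't be the last entity in string2"
        )
    return pad_string2(string1, string2)


# Toy stand-ins for '[:lower:] ' and '[:upper:]' with a three-letter alphabet.
string1, string2 = "abc ", "ABC"

# Padding alone: the space ends up mapped to the last "upper" character.
padded = pad_string2(string1, string2)
print("_ _".translate(str.maketrans(string1, padded)))  # _C_
```

With the check in place, the same operands raise an error instead of
silently mapping the space to whichever character happened to come last.
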