Eric Blake wrote:
> On 09/24/2010 04:47 PM, Pádraig Brady wrote:
>> I was just looking at a bug reported to fedora there where this abort()s
>>
>> $ LC_ALL=en_US tr '[:upper:] ' '[:lower:]'
Ouch!  Thanks for reporting it here.
How many more bugs lurk in tr...
Consolation: this one is a failure to diagnose invalid inputs.

...
> I interpret this to mean that even though there are 59 lower and 56
> upper in en_US, there are a fixed number of toupper case-mapping
> pairs, and there are likewise a fixed number of tolower case-mapping
> pairs.  Therefore, [:upper:] and [:lower:] expand to the same number of
> array entries (whether that is 59 pairs or 56 pairs is irrelevant),
> and mappings like "tr '[:lower:] ' '[:upper:]_'" must unambiguously
> convert space to underscore and also guarantee that no lower-case
> letter becomes an underscore.
>
> Your question is basically what should we do on the unspecified
> behavior of '[:lower:] ' '[:upper:]', where string1 is longer than
> string2, since that falls outside the bounds of POSIX.

Right.

>> I.E. 0xDE (the last upper char) is output from:
>>
>> $ echo "_ _" | LC_ALL=en_US ./src/tr '[:lower:] ' '[:upper:]'
>
> That matches the behavior we choose in all other instances where
> string1 is longer than string2, where GNU tr follows BSD behavior of
> padding the last character of string2 to meet the length of string1.
>
> But, since POSIX is clear that the order of [:upper:] mappings is
> unspecified, I agree that it is not a good guarantee to the user of
> which byte gets duplicated to fill out the conversion, and we are
> better off rejecting that attempted usage.
>
>>
>> That seems quite inconsistent given that other classes
>> are not allowed in string 2 when translating:
>>
>> $ echo "ab ." | LANG=en_US tr '[:digit:]' '[:alpha:]'
>> tr: when translating, the only character classes that may appear in
>> string2 are `upper' and `lower'
>>
>> For consistency I think it better to keep the classes
>> in string 2 just for case mapping, and do something like:
>>
>> $ tr '[:upper:] ' '[:lower:]'
>> tr: when not truncating set1, a character class can't be
>> the last entity in string2
>
> I'd rather see it phrased:
>
> When string2 is shorter than string1, a character class can't be the
> last entity in string2.

Thanks, I find it easier to read when string1 and string2
are listed in order -- and this applies only when translating.
How about this?

  When translating with string1 longer than string2,
  the latter string must not end with a character class.

>> Note BSD allows extending the above, but that's at least
>> consistent with any class being allowed in string2.
>> I.E. this is disallowed by coreutils but Ok on BSD:
>>
>> $ echo "1 2" | LC_ALL=en_US.iso-8859-1 tr ' ' '[:alpha:]'
>> 1A2
>
> The BSD behavior violates an explicit POSIX wording; we can't do an
> extension like that without either turning on a POSIXLY_CORRECT check
> or adding a command line option, neither of which I think is
> necessary.  So I see no reason to copy the BSD behavior of allowing
> any character class.

Yes.  I deliberately opted not to provide the BSD behavior,
because it cannot be portable.

>> Is it OK to change tr like this?
>> I can't see anything depending on that.
>
> Seems reasonable to me, once we decide on the error message wording.

Yes.  Thanks for bringing this up and dealing with it.
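
For anyone following along, here is a rough model of the check being proposed, written as a Python sketch rather than as the actual tr.c change. The names (parse, expanded_length, validate_translation) are made up for illustration, and the class table is an ASCII-only stand-in, so the counts will not match en_US.

```python
import re
import string

# Matches a tr-style character class such as [:upper:].
CLASS_RE = re.compile(r"\[:(\w+):\]")

# ASSUMPTION: ASCII-only stand-ins for the two classes allowed in string2.
# A real locale can have different counts for upper and lower.
CLASSES = {"upper": string.ascii_uppercase, "lower": string.ascii_lowercase}


def parse(spec):
    """Split a set into ('class', name) and ('char', c) entities.

    Hypothetical helper: ranges, escapes and [c*n] repeats are not handled.
    """
    entities, i = [], 0
    while i < len(spec):
        m = CLASS_RE.match(spec, i)
        if m:
            entities.append(("class", m.group(1)))
            i = m.end()
        else:
            entities.append(("char", spec[i]))
            i += 1
    return entities


def expanded_length(entities):
    """Number of array entries the set expands to (unknown classes raise KeyError)."""
    return sum(len(CLASSES[v]) if k == "class" else 1 for k, v in entities)


def validate_translation(string1, string2):
    """Reject the case where padding string2 would repeat an unspecified byte."""
    s1, s2 = parse(string1), parse(string2)
    if s2 and expanded_length(s1) > expanded_length(s2) and s2[-1][0] == "class":
        raise ValueError(
            "when translating with string1 longer than string2,\n"
            "the latter string must not end with a character class"
        )


if __name__ == "__main__":
    validate_translation("[:lower:] ", "[:upper:]_")  # same length: accepted
    validate_translation("abc", "x")                  # pads with 'x': accepted
    try:
        validate_translation("[:upper:] ", "[:lower:]")
    except ValueError as e:
        print("tr:", e)
```

Padding with a literal last character stays allowed in this sketch, since the duplicated byte is then well defined; only a trailing class is rejected.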