Re: [coreutils] tr: case mapping anomaly

2010-09-29 Thread Jim Meyering
Pádraig Brady wrote:
 On 25/09/10 07:53, Jim Meyering wrote:
 Eric Blake wrote:
 On 09/24/2010 04:47 PM, Pádraig Brady wrote:
 I was just looking at a bug reported to fedora there where this abort()s

   $ LC_ALL=en_US tr '[:upper:] ' '[:lower:]'

 Ouch!  Thanks for reporting it here.
 How many more bugs lurk in tr...
 Consolation: this one is a failure to diagnose invalid inputs.

 I found a few more issues:

Nice catches.  Thanks for all the work.

Please mention the above abort-inducing case in the log along with the
other three.  While you're at it, please use LC_ALL= there, not LANG=,
since the latter is a glibc-ism.

 This valid translation spec aborted:
   LANG=en_US tr '[:upper:]- ' '[:lower:]_'

The above does not abort w/glibc when LC_ALL=C happens to be set.

 This misaligned conversion spec was allowed:
   LANG=C tr 'A-Y[:lower:]' 'a-z[:upper:]'
 This misaligned spec was allowed by extending the class:
   LANG=C tr '[:upper:] ' '[:lower:]'

...
 +  tr: fix various issues with case conversion classes.  In some locales
 +  valid conversion specifications caused tr to abort, while some invalid
 +  specifications were undiagnosed.
 +  [bugs introduced in coreutils 6.9.90 and 6.9.92]

Two separate bugs!  Thanks for tracking them down.
I'm surprised they are so recent.
I suppose that means I introduced both.  Let's see...
These two seem like the only candidates:

  6efd10462d8103208f4575f0b5edddf841c7d87c
  af5d0c363a52e787a4311a11f035209eecdc4115

Whichever they are, please list them in the commit log.

  ** New features

cp now accepts the --attributes-only option to not copy file data,
 diff --git a/src/tr.c b/src/tr.c
 index a5b6810..3adc85f 100644
 --- a/src/tr.c
 +++ b/src/tr.c
 @@ -1177,6 +1177,77 @@ card_of_complement (struct Spec_list *s)
return cardinality;
  }

 +/* Discard the lengths associated with a case conversion,
 +   as using the actual number of upper or lower case characters
 +   is problematic when they don't match in some locales.
 +   Also ensure the case conversion classes in string2 are
 +   aligned correctly with those in string1.
 +   Note POSIX says the behavior of `tr [:upper:] [:upper:]'
 +   is undefined.  Therefore we allow it (unlike Solaris)
 +   and treat it as a no-op.  */
 +
 +static void
 +validate_case_classes (struct Spec_list *s1, struct Spec_list *s2)
 +{
 +  size_t upper_chars = 0;
 +  size_t lower_chars = 0;

Please name these n_upper and n_lower, so each is obviously a counter.
Otherwise, seeing the name out of context might evoke an array.

 +  int i;

Please make this unsigned int, since it never needs negative values.

 +  int c1=0, c2=0;

These days, I prefer one per line, and with spaces:

int c1 = 0;
int c2 = 0;

Hmm..  I see you're following existing style that did int c1, c2;,
though in existing code, those didn't have initializers.
And the int i; was essentially moved.  And there are others.
I blame the author ;-) (me, 18 years ago)

 +  count old_s1_len = s1-length;
 +  count old_s2_len = s2-length;
 +  struct List_element *s1_tail = s1-tail;
 +  struct List_element *s2_tail = s2-tail;
 +  bool s1_new_element = true;
 +  bool s2_new_element = true;
...
 +
 +# Before coreutils 8.6 the disparat number of upper and lower

s/disparat/disparate/

 +# characters in some locales, triggered abort()s and invalid behavior
 +if test $(LC_ALL=en_US.ISO-8859-1 locale charmap 2/dev/null) = ISO-8859-1;
 +then
 +  (

No big deal, but why run this in a subshell?
It doesn't seem necessary here.

 +export LC_ALL=en_US.ISO-8859-1
 +tr '[:upper:] ' '[:lower:]'  /dev/null 2out  _fail=1
 +echo 'tr: when translating with string1 longer than string2,
 +the latter string must not end with a character class'  exp
 +compare out exp || _fail=1
 +
 +# Ensure when there are a different number of elements
 +# in each string, we validate the case mapping correctly
 +tr 'ab[:lower:]' '0-1[:upper:]'  /dev/null || _fail=1

It might be nice to ensure that e.g., abc.xyz maps to ABC.XYZ.

 +# Ensure we extend string2 appropriately
 +tr '[:upper:]- ' '[:lower:]_'  /dev/null || _fail=1

Similarly, that ABC-XYZ maps to abc_xyz

Thanks again!

 +# Ensure the size of the case classes are accounted
 +# for as a unit.
 +echo 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' |
 +tr '[:upper:]A-B' '[:lower:]0' out ||  _fail=1
 +echo '00cdefghijklmnopqrstuvwxyz'  exp
 +compare out exp || _fail=1
...



Re: [coreutils] tr: case mapping anomaly

2010-09-29 Thread Pádraig Brady
On 29/09/10 07:45, Jim Meyering wrote:
 Pádraig Brady wrote:
 +# characters in some locales, triggered abort()s and invalid behavior
 +if test $(LC_ALL=en_US.ISO-8859-1 locale charmap 2/dev/null) = 
 ISO-8859-1;
 +then
 +  (
 
 No big deal, but why run this in a subshell?
 It doesn't seem necessary here.

Yes that is overkill in this case.
I had copied the sub-shell bit from my
locally improved join-i18n test.

I've just pushed that.

cheers,
Pádraig.



Re: [coreutils] tr: case mapping anomaly

2010-09-29 Thread Eric Blake

On 09/29/2010 06:40 AM, Pádraig Brady wrote:

+# Ensure the size of the case classes are accounted
+# for as a unit.
+echo 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' |
+tr '[:upper:]A-B' '[:lower:]0'out ||  _fail=1
+echo '00cdefghijklmnopqrstuvwxyz'  exp


Huh?  A and B are both in [:upper:]; when a character is listed more
than once in string1, it is only transliterated according to the first
listing.  I think this should be 'abc...' not '00c...' for the expected
results.


Does POSIX specify that?


Hmm - POSIX appears to be silent for both tr and m4's translit (but 
whereas 'tr long short' is unspecified and we extend the last byte of 
short to match, m4's translit(data,long,short) is explicitly documented 
as deleting bytes from long with no match in short).



That's not what we do, nor what I would expect.

$ echo 'A' | LANG=C tr 'AA' '01'
1


And Solaris' tr does this as well.

But m4 behaves in the way I specified:

$ echo 'translit(a,aa,01)' | m4
0
$ echo 'translit(a,aa,01)' | /usr/ccs/bin/m4
0

Time for me to ask for clarification from the Austin Group, I suppose. 
But given existing practice, I guess the argument should be that tr is 
explicitly different than m4.


--
Eric Blake   ebl...@redhat.com+1-801-349-2682
Libvirt virtualization library http://libvirt.org



Re: [coreutils] tr: case mapping anomaly

2010-09-25 Thread Pádraig Brady
On 25/09/10 00:22, Eric Blake wrote:
 On 09/24/2010 04:47 PM, Pádraig Brady wrote:
 I was just looking at a bug reported to fedora there where this abort()s

   $ LC_ALL=en_US tr '[:upper:] ' '[:lower:]'
 
 Behavior is already unspecified by POSIX when string1 is longer than
 string2.  But given what POSIX does say:
 
 When both the -d and -s options are specified, any of the character
 class names shall be accepted in string2. Otherwise, only character
 class names lower or upper are valid in string2 and then only if the
 corresponding character class ( upper and lower, respectively) is
 specified in the same relative position in string1. Such a specification
 shall be interpreted as a request for case conversion. When [: lower:]
 appears in string1 and [: upper:] appears in string2, the arrays shall
 contain the characters from the toupper mapping in the LC_CTYPE category
 of the current locale. When [: upper:] appears in string1 and [: lower:]
 appears in string2, the arrays shall contain the characters from the
 tolower mapping in the LC_CTYPE category of the current locale. The
 first character from each mapping pair shall be in the array for string1
 and the second character from each mapping pair shall be in the array
 for string2 in the same relative position.
 
 Except for case conversion, the characters specified by a character
 class expression shall be placed in the array in an unspecified order.
 ...
 
 However, in a case conversion, as described previously, such as:
 
 tr -s '[:upper:]' '[:lower:]'
 
 the last operand's array shall contain only those characters defined as
 the second characters in each of the toupper or tolower character pairs,
 as appropriate.
 
 
 
 I interpret this to mean that even though there are 59 lower and 56
 upper in en_US, there are a fixed number of toupper case-mapping pairs,
 and there are likewise a fixed number of tolower case-mapping pairs.
 Therefore, [:upper:] and [:lower:] expand to the same number of array
 entries (whether that is 59 pairs or 56 pairs is irrelevant), and
 mappings like tr '[:lower:] ' '[:upper:]_' must unambiguously convert
 space to underscore and also guarantee that no lower-case letter becomes
 an underscore.

Thanks for digging up the relevant POSIX bits.
Yes I agree that '[:lower:]' '[:upper:]' should
be treated as a unit and not leak into adjacent elements.

 
 Your question is basically what should we do on the unspecified behavior
 of '[:lower:] ' '[:upper:]', where string1 is longer than string2, since
 that falls outside the bounds of POSIX.
 
 I.E. 0xDE (the last upper char) is output from:

   $ echo _ _ | LC_ALL=en_US ./src/tr '[:lower:] ' '[:upper:]'
 
 That matches the behavior we choose in all other instances where string1
 is longer than string2, where GNU tr follows BSD behavior of padding the
 last character of string2 to meet the length of string1.
 
 But, since POSIX is clear that the order of [:upper:] mappings is
 unspecified, I agree that it is not a good guarantee to the user of
 which byte gets duplicated to fill out the conversion, and we are better
 off rejecting that attempted usage.
 

 That seems quite inconsistent given that other classes
 are not allowed in string 2 when translating:

   $ echo ab . | LANG=en_US tr '[:digit:]' '[:alpha:]'
   tr: when translating, the only character classes that may appear in
   string2 are `upper' and `lower'

 For consistency I think it better to keep the classes
 in string 2 just for case mapping, and do something like:

   $ tr '[:upper:] ' '[:lower:]'
   tr: when not truncating set1, a character class can't be
   the last entity in string2
 
 I'd rather see it phrased:
 
 When string2 is shorter than string1, a character class can't be the
 last entity in string2.

OK. That is a bit clearer.

 Note BSD allows extending the above, but that's at least
 consistent with any class being allowed in string2.
 I.E. this is disallowed by coreutils but Ok on BSD:

   $ echo 1 2 | LC_ALL=en_US.iso-8859-1 tr ' ' '[:alpha:]'
   1A2
 
 The BSD behavior violates an explicit POSIX wording; we can't do an
 extension like that without either turning on a POSIXLY_CORRECT check or
 adding a command line option, neither of which I think is necessary.  So
 I see no reason to copy the BSD behavior of allowing any character class.

Yes I agree. I was just pointing out what BSD does here.

 Is it OK to change tr like this?
 I can't see anything depending on that.
 
 Seems reasonable to me, once we decide on the error message wording.

Great, I'll change it as above.

cheers,
Pádraig.



Re: [coreutils] tr: case mapping anomaly

2010-09-25 Thread Jim Meyering
Eric Blake wrote:
 On 09/24/2010 04:47 PM, Pádraig Brady wrote:
 I was just looking at a bug reported to fedora there where this abort()s

   $ LC_ALL=en_US tr '[:upper:] ' '[:lower:]'

Ouch!  Thanks for reporting it here.
How many more bugs lurk in tr...
Consolation: this one is a failure to diagnose invalid inputs.

...
 I interpret this to mean that even though there are 59 lower and 56
 upper in en_US, there are a fixed number of toupper case-mapping
 pairs, and there are likewise a fixed number of tolower case-mapping
 pairs. Therefore, [:upper:] and [:lower:] expand to the same number of
 array entries (whether that is 59 pairs or 56 pairs is irrelevant),
 and mappings like tr '[:lower:] ' '[:upper:]_' must unambiguously
 convert space to underscore and also guarantee that no lower-case
 letter becomes an underscore.

 Your question is basically what should we do on the unspecified
 behavior of '[:lower:] ' '[:upper:]', where string1 is longer than
 string2, since that falls outside the bounds of POSIX.

Right.

 I.E. 0xDE (the last upper char) is output from:

   $ echo _ _ | LC_ALL=en_US ./src/tr '[:lower:] ' '[:upper:]'

 That matches the behavior we choose in all other instances where
 string1 is longer than string2, where GNU tr follows BSD behavior of
 padding the last character of string2 to meet the length of string1.

 But, since POSIX is clear that the order of [:upper:] mappings is
 unspecified, I agree that it is not a good guarantee to the user of
 which byte gets duplicated to fill out the conversion, and we are
 better off rejecting that attempted usage.


 That seems quite inconsistent given that other classes
 are not allowed in string 2 when translating:

   $ echo ab . | LANG=en_US tr '[:digit:]' '[:alpha:]'
   tr: when translating, the only character classes that may appear in
   string2 are `upper' and `lower'

 For consistency I think it better to keep the classes
 in string 2 just for case mapping, and do something like:

   $ tr '[:upper:] ' '[:lower:]'
   tr: when not truncating set1, a character class can't be
   the last entity in string2

 I'd rather see it phrased:

 When string2 is shorter than string1, a character class can't be the
 last entity in string2.

Thanks, I find it easier to read when string1 and string2 are
listed in order -- and this applies only when translating.
How about this?

When translating with string1 longer than string2,
the latter string must not end with a character class.

 Note BSD allows extending the above, but that's at least
 consistent with any class being allowed in string2.
 I.E. this is disallowed by coreutils but Ok on BSD:

   $ echo 1 2 | LC_ALL=en_US.iso-8859-1 tr ' ' '[:alpha:]'
   1A2

 The BSD behavior violates an explicit POSIX wording; we can't do an
 extension like that without either turning on a POSIXLY_CORRECT check
 or adding a command line option, neither of which I think is
 necessary.  So I see no reason to copy the BSD behavior of allowing
 any character class.

Yes.  I deliberately opted not to provide the BSD behavior,
because it cannot be portable.

 Is it OK to change tr like this?
 I can't see anything depending on that.

 Seems reasonable to me, once we decide on the error message wording.

Yes.  Thanks for bringing this up and dealing with it.