On 07/06/11 13:45, Chet Ramey wrote:
[...]
I'm not going to add much to this discussion except to note that I believe
`sorts' is correct.  Consider the following script:

unset LANG LC_ALL LC_COLLATE

export LC_COLLATE=de_DE.UTF-8
printf "%s\n" {A..Z} {a..z} | sort | tr $'\n' ' '
echo

That's really interesting, and not just for your intended point: look at what happens with those ranges if you take 'sort' out of the pipe. The brace expansion syntax {A..Z} doesn't obey the locale! Observe:

$ printf "%s\n" {A..Z} {a..z} | sort | tr $'\n' ' '
a A b B c C d D e E f F g G h H i I j J k K l L m M n N o O p P q Q r R s S t T u U v V w W x X y Y z Z

(as you expect, but...)

$ printf "%s\n" {A..Z} {a..z} | tr $'\n' ' '
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z

So, if I want C-like behaviour out of "[a-z]*", I can write it as "{a..z}*"? Is that a bug or a feature?

It's not quite the same, of course, because brace expansion produces a separate copy of the whole word for each element of the range, and each copy is then globbed on its own, e.g.:

$ ls {a..f}*
ls: cannot access a*: No such file or directory
ls: cannot access b*: No such file or directory
ls: cannot access c*: No such file or directory
ls: cannot access d*: No such file or directory
ls: cannot access f*: No such file or directory
example.txt

But interesting nonetheless.
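
To make the difference between the two forms concrete, here is a minimal sketch. The file names and the temporary directory are invented for illustration, and LC_ALL=C is set only so that the bracket range has a predictable meaning; the brace expansion would produce the same words in any locale.

```bash
#!/usr/bin/env bash
# Sketch only: the file names and the temporary directory are made up.
export LC_ALL=C                         # make [a-c] a plain byte range for this demo
shopt -u nullglob failglob nocaseglob   # rely on bash's default globbing options

tmp=$(mktemp -d) || exit 1
cd "$tmp" || exit 1
touch apple Banana cherry

# Brace expansion is purely textual: {a..c}* becomes the three words
# a* b* c*, and each word is then globbed on its own. With default
# options, a pattern that matches nothing (b* here) is left as-is.
brace=( {a..c}* )

# A bracket range is a single pattern, so only real matches come back.
glob=( [a-c]* )

echo "brace: ${brace[*]}"
echo "glob:  ${glob[*]}"

cd "$OLDPWD" && rm -rf "$tmp"
```

With these three files, the brace form yields "apple b* cherry" (the unmatched b* survives as a literal word, which is what produces the "cannot access" errors from ls above), while the bracket form yields "apple cherry".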


[...]

That sure looks like `C' doesn't sort between `a' and `c' in de_DE.UTF-8
and en_GB.UTF-8.

Not in a case like that, with single-character strings. But my point was that it's possible for 'C' to sort between 'a' and 'c' in longer strings. Try sorting this:

aa
cc
Ca

Because 'c' and 'C' have equal sort weights but 'a' comes before 'c', this list will sort as:

aa
Ca
cc

...which has 'C' between 'a' and 'c'.
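
For anyone who wants to try this, a small sketch follows. The locale name en_GB.UTF-8 is an assumption about what is installed; the script skips the locale-aware sort when it is missing, and the exact order in that branch depends on the system's collation tables. Only the C-locale order is guaranteed.

```bash
#!/usr/bin/env bash
# Sketch only: en_GB.UTF-8 is an assumed locale name; it may not be installed.
words=(aa cc Ca)

# Byte-order comparison: every uppercase letter precedes every lowercase one.
c_sorted=$(printf '%s\n' "${words[@]}" | LC_ALL=C sort | paste -sd ' ' -)
echo "C locale:   $c_sorted"

# Dictionary-style collation, only attempted if the locale is available.
loc=en_GB.UTF-8
if locale -a 2>/dev/null | grep -qiE '^en_GB\.utf-?8$'; then
  loc_sorted=$(printf '%s\n' "${words[@]}" | LC_ALL=$loc sort | paste -sd ' ' -)
  echo "$loc: $loc_sorted"
else
  echo "$loc is not installed here; skipping the locale-aware sort"
fi
```

In the C locale the result is "Ca aa cc", since 'C' has a lower byte value than any lowercase letter.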

I realize it's pedantic, but documentation should be pedantically accurate :) I would be OK with changing the man page so it says, "sorts between those two characters in a list of single-character strings", as that would also describe the current behaviour.

I believe it would also be helpful for the documentation to then go on to
say something like this:

[...]
You might like the text in item 13 of the COMPAT file included in the bash
distribution.  It doesn't take quite so cautionary a tone, but the basic
information is there.

Actually, I discovered that the grep man page says something quite similar too, even giving almost the same example I did:

"Within a bracket expression, a range expression consists of two characters separated by a hyphen. It matches any single character that sorts between the two characters, inclusive, using the locale's collating sequence and character set. For example, in the default C locale, [a-d] is equivalent to [abcd]. Many locales sort characters in dictionary order, and in these locales [a-d] is typically not equivalent to [abcd]; it might be equivalent to [aBbCcDd], for example. To obtain the traditional interpretation of bracket expressions, you can use the C locale by setting the LC_ALL environment variable to the value C."

I would be very happy if the bash man page said that. (It also says "sorts between", but because it adds "using the locale's collating sequence and character set" it squeaks by as being technically correct. In the sorting example I gave above, where 'C' landed between 'a' and 'c', only collation weights were in play, not sequence numbers. If one is forced to sort by comparing 'c' directly to 'C', the weights tie and the collation sequence numbers come into play as the tie-breaker. So saying "using the collating sequence" implies that kind of direct comparison, which would force 'C' to come after 'c'.)
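
The C-locale half of the grep man page's statement quoted above is easy to check. This is only a sketch with arbitrary sample letters; it deliberately avoids any other locale, since what [a-d] matches there depends on the system's collation data.

```bash
#!/usr/bin/env bash
# Sketch only: the sample letters are arbitrary.
# Under LC_ALL=C a range is a byte range, so [a-d] means exactly [abcd].
range=$(printf '%s\n' a B c D e | LC_ALL=C grep '^[a-d]$' | paste -sd ' ' -)
list=$(printf '%s\n' a B c D e | LC_ALL=C grep '^[abcd]$' | paste -sd ' ' -)

echo "[a-d]:  $range"
echo "[abcd]: $list"
```

Both lines print "a c": the uppercase 'B' and 'D' fall outside the byte range, and 'e' is past its end.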

~Felix.
