On 07/06/11 13:45, Chet Ramey wrote:
[...]
> I'm not going to add much to this discussion except to note that I believe
> `sorts' is correct. Consider the following script:
> unset LANG LC_ALL LC_COLLATE
> export LC_COLLATE=de_DE.UTF-8
> printf "%s\n" {A..Z} {a..z} | sort | tr $'\n' ' '
> echo
That's really interesting -- and not just your intended point, but what
happens with those ranges if you take 'sort' out of the pipe. The curly
brace {A..Z} syntax doesn't obey the locale! Observe:
$ printf "%s\n" {A..Z} {a..z} | sort | tr $'\n' ' '
a A b B c C d D e E f F g G h H i I j J k K l L m M n N o O p P q Q r R
s S t T u U v V w W x X y Y z Z
(as you expect, but...)
$ printf "%s\n" {A..Z} {a..z} | tr $'\n' ' '
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j
k l m n o p q r s t u v w x y z
So, if I want C-like behaviour out of "[a-z]*", I can write it as
"{a..z}*"? Is that a bug or a feature?
It's not quite the same, of course, because the {} syntax expands to
copies of the entire expression for everything in the range, e.g.:
$ ls {a..f}*
ls: cannot access a*: No such file or directory
ls: cannot access b*: No such file or directory
ls: cannot access c*: No such file or directory
ls: cannot access d*: No such file or directory
ls: cannot access f*: No such file or directory
example.txt
But interesting nonetheless.
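To make the difference concrete, here is a minimal sketch that runs both forms in a scratch directory holding only example.txt. The helper name compare_expansions is made up for illustration, and the sketch assumes mktemp and bash are on the PATH. It starts a fresh bash under LC_ALL=C so that the bracket range has its traditional meaning.

```shell
# Sketch only: compare_expansions is a hypothetical helper, not a bash builtin.
compare_expansions() (
    # Work in a throwaway directory that contains a single file.
    dir=$(mktemp -d) || exit 1
    cd "$dir" || exit 1
    touch example.txt
    # Brace expansion yields six separate patterns (a* b* c* d* e* f*); each
    # one that matches nothing is passed through literally. The bracket
    # expression is one pattern, so it expands only to the files that match.
    LC_ALL=C bash -c 'echo {a..f}*; echo [a-f]*'
    cd / && rm -rf "$dir"
)

compare_expansions
# prints:
#   a* b* c* d* example.txt f*
#   example.txt
```

The first line of output is why ls complained about five of the six patterns: each unmatched word reaches the command as a literal string.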
[...]
> That sure looks like `C' doesn't sort between `a' and `c' in de_DE.UTF-8
> and en_GB.UTF-8.
Not in a case like that, with single-character strings. But my point was
that it's possible for 'C' to sort between 'a' and 'c' in longer
strings. Try sorting this:
aa
cc
Ca
Because 'c' and 'C' have equal sort weights but 'a' comes before 'c',
this list will sort as:
aa
Ca
cc
...which has 'C' between 'a' and 'c'.
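A quick way to try this is the sketch below. The helper name sort_demo is made up for illustration, and the UTF-8 locale passed to it is an assumption: it has to be installed (check with `locale -a`), otherwise sort quietly falls back to C ordering.

```shell
# Sketch only: sort_demo is a hypothetical helper. Its argument is a locale
# name, which must be installed for the result to differ from the C locale.
sort_demo() {
    printf '%s\n' aa cc Ca | LC_ALL="$1" sort | tr '\n' ' '
    echo
}

sort_demo C              # byte order, uppercase first: Ca aa cc
sort_demo en_GB.UTF-8    # if installed, compare with the order given above
```

LC_ALL is used rather than LC_COLLATE because it overrides every other locale variable, so the result does not depend on what the caller's environment already sets.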
I realize it's pedantic, but documentation should be pedantically
accurate :) I would be OK with changing the man page so it says, "sorts
between those two characters in a list of single-character strings", as
that would also describe the current behaviour.
>> I believe it would also be helpful for the documentation to then go on to
>> say something like this:
[...]
> You might like the text in item 13 of the COMPAT file included in the bash
> distribution. It doesn't take quite so cautionary a tone, but the basic
> information is there.
Actually, I discovered that the grep man page says something quite
similar too, even giving almost the same example I did:
"Within a bracket expression, a range expression consists of two
characters separated by a hyphen. It matches any single character that
sorts between the two characters, inclusive, using the locale's
collating sequence and character set. For example, in the default C
locale, [a-d] is equivalent to [abcd]. Many locales sort characters in
dictionary order, and in these locales [a-d] is typically not equivalent
to [abcd]; it might be equivalent to [aBbCcDd], for example. To obtain
the traditional interpretation of bracket expressions, you can use the C
locale by setting the LC_ALL environment variable to the value C."
I would be very happy if the bash man page said that. (It also says
"sorts between", but because it adds "using the locale's collating
sequence and character set" it squeaks by as being technically correct.
In the sorting example I gave above, where 'C' ended up between 'a' and
'c', only collation weights were in play, not sequence numbers. If one
is forced to sort by comparing 'c' directly to 'C', collation sequence
numbers come into play as the tie-breaker. So saying "using the
collating sequence" implies that situation, and would force 'C' to
come after 'c'.)
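The last sentence of the grep text is easy to check. Here is a minimal sketch; the helper name match_range is made up for illustration. Under LC_ALL=C the range [a-d] should match only the four lowercase letters.

```shell
# Sketch only: match_range is a hypothetical helper name.
match_range() {
    # Feed grep one letter per line and keep only lines matching the range.
    printf '%s\n' a B b C c D d | LC_ALL=C grep '[a-d]' | tr '\n' ' '
    echo
}

match_range    # prints: a b c d
```

Swapping C for a dictionary-order locale may or may not let the uppercase letters through, depending on the grep version and the locale definition, which is the ambiguity the man page text warns about.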
~Felix.