Re: documentation bug re character range expressions

Marcel (Felix) Giannelia Wed, 08 Jun 2011 15:03:28 -0700

On 07/06/11 13:45, Chet Ramey wrote:

[...]
I'm not going to add much to this discussion except to note that I believe
`sorts' is correct.  Consider the following script:


unset LANG LC_ALL LC_COLLATE

export LC_COLLATE=de_DE.UTF-8
printf "%s\n" {A..Z} {a..z} | sort | tr $'\n' ' '
echo

That's really interesting -- and not just your intended point, but what happens with those ranges if you take 'sort' out of the pipe. The curly-brace {A..Z} syntax doesn't obey the locale! Observe:


$ printf "%s\n" {A..Z} {a..z} | sort | tr $'\n' ' '

a A b B c C d D e E f F g G h H i I j J k K l L m M n N o O p P q Q r R s S t T u U v V w W x X y Y z Z


(as you expect, but...)

$ printf "%s\n" {A..Z} {a..z} | tr $'\n' ' '

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z

So, if I want C-like behaviour out of "[a-z]*", I can write it as "{a..z}*"? Is that a bug or a feature?

It's not quite the same, of course, because the {} syntax expands to copies of the entire expression for everything in the range, e.g.:


$ ls {a..f}*
ls: cannot access a*: No such file or directory
ls: cannot access b*: No such file or directory
ls: cannot access c*: No such file or directory
ls: cannot access d*: No such file or directory
ls: cannot access f*: No such file or directory
example.txt

But interesting nonetheless.
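
For anyone who wants to reproduce the difference, here is a rough sketch. It assumes bash is on the PATH and works in a scratch directory holding two made-up files, example.txt and Echo.txt (Echo.txt is added here purely for illustration). Brace expansion produces one word per letter whether or not anything matches, while the bracket range under LC_ALL=C is a real glob:

```sh
# Work in an empty scratch directory so the results are predictable.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
touch example.txt Echo.txt   # hypothetical file names, for illustration only

# Brace expansion is a bash feature (not POSIX sh), so bash is invoked
# explicitly. {a..f}* becomes the six words a* b* c* d* e* f*; only e*
# matches a file, and the rest are passed through unchanged.
brace_words=$(bash -c 'echo {a..f}*')

# With the C locale forced, [a-f] is the traditional byte range, so the
# glob matches example.txt but not Echo.txt.
glob_words=$(LC_ALL=C bash -c 'echo [a-f]*')

echo "brace: $brace_words"
echo "glob:  $glob_words"
```

The unmatched words are the same ones that show up as "cannot access" errors when ls is the command instead of echo.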

[...]

That sure looks like `C' doesn't sort between `a' and `c' in de_DE.UTF-8
and en_GB.UTF-8.

Not in a case like that, with single-character strings. But my point was that it's possible for 'C' to sort between 'a' and 'c' in longer strings. Try sorting this:


aa
cc
Ca

Because 'c' and 'C' have equal sort weights but 'a' comes before 'c', this list will sort as:


aa
Ca
cc

...which has 'C' between 'a' and 'c'.
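
A quick way to check this, sketched on the assumption that an en_US.UTF-8 locale is installed (that name is picked here for illustration and is not from the thread; any dictionary-order locale should behave similarly):

```sh
# C locale: plain byte order, so every uppercase letter sorts before
# every lowercase one.
c_order=$(printf '%s\n' aa cc Ca | LC_ALL=C sort | paste -s -d ' ' -)
echo "C locale:    $c_order"      # Ca aa cc

# Dictionary-style locale (assumed to be installed). If it is missing,
# the result may simply be C order again, possibly with a warning.
dict_order=$(printf '%s\n' aa cc Ca | LC_ALL=en_US.UTF-8 sort | paste -s -d ' ' -)
echo "en_US.UTF-8: $dict_order"
```

If the second line prints the three strings in the order aa, Ca, cc, that is the case described above, with 'C' landing between 'a' and 'c'.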

I realize it's pedantic, but documentation should be pedantically accurate :) I would be OK with changing the man page so it says, "sorts between those two characters in a list of single-character strings", as that would also describe the current behaviour.

I believe it would also be helpful for the documentation to then go on to
say something like this:

[...]

You might like the text in item 13 of the COMPAT file included in the bash
distribution.  It doesn't take quite so cautionary a tone, but the basic
information is there.

Actually, I discovered that the grep man page says something quite similar too, even giving almost the same example I did:

"Within a bracket expression, a range expression consists of two characters separated by a hyphen. It matches any single character that sorts between the two characters, inclusive, using the locale's collating sequence and character set. For example, in the default C locale, [a-d] is equivalent to [abcd]. Many locales sort characters in dictionary order, and in these locales [a-d] is typically not equivalent to [abcd]; it might be equivalent to [aBbCcDd], for example. To obtain the traditional interpretation of bracket expressions, you can use the C locale by setting the LC_ALL environment variable to the value C."

I would be very happy if the bash man page said that. (It also says "sorts between", but because it also says "using the locale's collating sequence and character set" it squeaks by as being technically correct -- in the sorting example I gave above, where 'C' was between 'a' and 'c', only collation weights were in play, not sequence numbers, whereas if one is forced to sort by comparing 'c' to 'C', collation sequence numbers come into play as the tie-breaker. So saying "using the collating sequence" implies a similar situation, and would force 'C' to come after 'c'.)
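
The last sentence of that grep passage is easy to try. A minimal sketch, with a made-up five-line input:

```sh
# With LC_ALL=C, [a-d] means exactly [abcd]: uppercase B and D are not
# in the range, and e is past the end of it.
c_matches=$(printf '%s\n' a B c D e | LC_ALL=C grep '^[a-d]$' | paste -s -d ' ' -)
echo "$c_matches"                 # a c
```

Dropping the LC_ALL=C prefix and running the same pipeline under a dictionary-order locale is where, per the man page text, B and possibly D may start matching as well; whether they do depends on the locale and the grep version.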


~Felix.
