Control: retitle -1 sort -u and uniq "lose" non-identical lines with some locales
I was hurt by this bug, too. I had a simple-minded script to check files for dodgy characters before publishing them. How was I to know that em-dash and en-dash would be considered identical in a standard GB locale, as provided by Debian's installer? Spotting inconsistent use of characters that look alike is exactly what my script was supposed to achieve.

LANG=en_GB.UTF-8

$ printf "\xe2\x80\x93\n\xe2\x80\x94\n"
–
—
$ printf "\xe2\x80\x93\n\xe2\x80\x94\n" | od -An -tx1
 e2 80 93 0a e2 80 94 0a
$ printf "\xe2\x80\x93\n\xe2\x80\x94\n" | uniq | od -An -tx1
 e2 80 93 0a

It's true that the man page for "uniq" mentions LC_COLLATE, though I don't consider that adequate warning.

However, it's also true that the official-looking spec at http://pubs.opengroup.org/onlinepubs/9699919799/utilities/uniq.html says:

> To remove duplicate lines based on whether they collate equally
> instead of whether they are identical, applications should use:
>
> sort -u
>
> instead of:
>
> sort | uniq

Also, the spec does not mention LC_COLLATE in the ENVIRONMENT VARIABLES section.

Does coreutils attempt to follow that spec?

The work-around, of course, is to set LC_COLLATE to C when uniq is invoked:

$ printf "\xe2\x80\x93\n\xe2\x80\x94\n" | uniq | od -An -tx1
 e2 80 93 0a
$ printf "\xe2\x80\x93\n\xe2\x80\x94\n" | LC_COLLATE=C uniq | od -An -tx1
 e2 80 93 0a e2 80 94 0a
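
For anyone else with a checking script like mine, here is a minimal sketch of a wrapper that forces byte-wise comparison. The function name bytewise_uniq is made up for illustration and is not part of coreutils. It sets LC_ALL=C rather than LC_COLLATE=C, on the assumption that LC_ALL might already be exported in the caller's environment, in which case it would take precedence over LC_COLLATE. The printf escapes are octal so that the snippet does not rely on the shell's printf understanding \x.

```shell
#!/bin/sh
# Hypothetical helper (name invented for this example): drop adjacent
# duplicate lines by comparing raw bytes, whatever locale the caller uses.
# LC_ALL overrides LC_COLLATE, so this also works when LC_ALL is exported.
bytewise_uniq() {
    LC_ALL=C uniq "$@"
}

# En dash (U+2013) and em dash (U+2014) as UTF-8 bytes, one per line.
# With byte-wise comparison the two lines differ, so both should survive.
printf '\342\200\223\n\342\200\224\n' | bytewise_uniq | od -An -tx1
```

The same LC_ALL=C prefix should apply equally to "sort -u", which otherwise also treats lines that collate equally as duplicates.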