bug#21916: sort -u drops unique lines with some locales

Christoph Anton Mitterer Sun, 15 Nov 2015 22:48:07 -0800

Hey Pádraig

On Sat, 2015-11-14 at 11:06 +0000, Pádraig Brady wrote:
> Unfortunately the roman numeral code points compare equal:
> 
>   $ printf '%s\n' Ⅱ Ⅰ | ltrace -e strcoll sort
>   sort->strcoll("\342\205\241", "\342\205\240") = 0
>   Ⅱ
>   Ⅰ
> 
> If you compare at the byte level you'll get appropriate grouping:
> 
>   $ printf '%s\n' Ⅱ Ⅰ | LC_ALL=C sort
>   Ⅰ
>   Ⅱ
> 
> The same goes for other similar representations,
> like full width forms of latin numbers:
> 
>   $ printf '%s\n' ２ １ | ltrace -e strcoll sort
>   sort->strcoll("\357\274\222", "\357\274\221") = 0
>   ２
>   １
So the bug's basically in the locales?



> That's a bit surprising, though maybe since only a limited
> number of these representations are provided, it was
> not thought appropriate to provide collation orders for them.
Really strange...


> One thing we might do immediately, is maybe with the sort --debug
> option,
> to provide some indication when strcoll() and memcmp() differ in
> direction.
Well, I think the main problem here is that -u then doesn't actually do
what most people would expect from it.
AFAIU, it removes any lines that *collation would consider
duplicates* ... rather than only lines which *actually are duplicates*.
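
To illustrate the distinction, here is a minimal sketch, assuming a POSIX shell and a sort that honours LC_ALL. It only shows the byte-wise case, where -u drops nothing but the true duplicate; what happens without LC_ALL=C depends on the collation data of the locale in use, so no output is claimed for that.

```shell
# Three input lines: Ⅱ, Ⅰ and a second Ⅰ (a real, byte-identical duplicate).
# With LC_ALL=C the comparison is byte-wise, so only the real duplicate goes.
out=$(printf '%s\n' Ⅱ Ⅰ Ⅰ | LC_ALL=C sort -u)
printf '%s\n' "$out"

# Count the surviving lines; tr strips padding that some wc versions add.
count=$(printf '%s\n' "$out" | wc -l | tr -d ' ')
```

Here the two distinct numerals both survive, and only the repeated Ⅰ is removed.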

God knows how many scripts and other stuff this already breaks... and I
wonder whether any other tools may be badly affected by that collation
stuff, too...
Imagine you do a cp -a ... or diff -qr and these would leave out any
files they consider duplicates :-(
That could really result in data loss.

Actually that's how I stumbled over it... I made some lists with find,
of files which are then to be binary compared on a source and copy
filesystem... over the find result I once used just sort and once sort
-u and was quite shocked then.

If I had taken the sort -u sorted list, then I might have lost some
files to copy / compare.


The semantics of -u are IMHO even more problematic, as it (AFAIU) won't
happen with LANG=C.
But normally people wouldn't expect that different locales lead to
completely different behaviour, especially with respect to collation -
they would only expect that things are ordered differently.

Does it seem possible for sort -u to print a warning on stderr
when such a case occurs, i.e. when -u drops lines which are considered
identical in terms of collation but which aren't really identical?
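
Until something like that exists, a rough check can be done from the outside. The following is only a sketch under assumptions: sort_u_checked is a hypothetical helper name, not part of coreutils, and it reads all of its input into a shell variable, so it is only suitable for small lists. It compares how many lines survive a locale-aware sort -u with how many survive a byte-wise LC_ALL=C sort -u and warns on stderr if the counts differ.

```shell
# Hypothetical wrapper, not a coreutils feature.
sort_u_checked() {
    # Slurp stdin (trailing blank lines are lost by command substitution).
    input=$(cat)

    # Lines kept when uniqueness is decided by the current locale's collation.
    collated=$(printf '%s\n' "$input" | sort -u | wc -l | tr -d ' ')
    # Lines kept when uniqueness is decided byte by byte.
    bytewise=$(printf '%s\n' "$input" | LC_ALL=C sort -u | wc -l | tr -d ' ')

    if [ "$collated" -ne "$bytewise" ]; then
        echo "warning: sort -u dropped $((bytewise - collated)) line(s) that are not byte-identical duplicates" >&2
    fi

    # Emit the locale-aware result, as plain sort -u would.
    printf '%s\n' "$input" | sort -u
}
```

Usage would be something like `find . -type f | sort_u_checked > list`, with any warning showing up on stderr.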

Cheers,
Chris.


btw: Is that bugtracker accessible somewhere? Cause I'd like to mark
the Debian bug as forwarded to this one here.
