Hey Pádraig,

On Sat, 2015-11-14 at 11:06 +0000, Pádraig Brady wrote:
> Unfortunately the roman numeral code points compare equal:
>
> $ printf '%s\n' Ⅱ Ⅰ | ltrace -e strcoll sort
> sort->strcoll("\342\205\241", "\342\205\240") = 0
> Ⅱ
> Ⅰ
>
> If you compare at the byte level you'll get appropriate grouping:
>
> $ printf '%s\n' Ⅱ Ⅰ | LC_ALL=C sort
> Ⅰ
> Ⅱ
>
> The same goes for other similar representations,
> like full width forms of latin numbers:
>
> $ printf '%s\n' ２ １ | ltrace -e strcoll sort
> sort->strcoll("\357\274\222", "\357\274\221") = 0
> ２
> １

So the bug's basically in the locales?

> That's a bit surprising, though maybe since only a limited
> number of these representations are provided, it was
> not thought appropriate to provide collation orders for them.

Really strange...

> One thing we might do immediately, is maybe with the sort --debug
> option, to provide some indication when strcoll() and memcmp()
> differ in direction.

Well, I think the main problem here is that -u then doesn't actually do
what most people would expect from it.
AFAIU, it removes any lines that *collation would consider duplicates*
... and not any lines which *actually are duplicates*.

God knows how many scripts and other stuff this already breaks... and I
wonder whether any other tools may be badly affected by that collation
stuff, too...
Imagine you do a cp -a ... or diff -qr and these would leave out any
such files they consider duplicates :-(
That could really result in data loss.

Actually that's how I stumbled over it... I made some lists with find,
of files which are then to be binary compared on a source and a copy
filesystem... on the find result I once used just sort and once
sort -u, and was quite shocked then.
If I had taken the sort -u list, I might have lost some files to
copy / compare.

The semantics of -u are IMHO even more problematic, as this (AFAIU)
won't happen with LANG=C. But normally people wouldn't expect that
different locales lead to completely different behaviour, especially
with respect to collation - they would only expect that things are
ordered differently.

Would it be possible for sort -u to print a warning on stderr when such
a case occurs, where -u drops lines which are considered identical in
terms of collation but which aren't really identical?

Cheers,
Chris.

btw: Is that bugtracker accessible somewhere? I'd like to update the
Debian bug as having been forwarded to this one here.
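
PS: Until something like that warning exists, a rough sketch of how the dropped lines could be spotted by hand. It assumes sort, comm and mktemp as found on a typical GNU system; the function name dropped_by_sort_u is made up for illustration and is not part of coreutils. It also assumes one entry per line, so file names containing newlines are not handled.

```shell
# Hypothetical helper (not part of coreutils): print every line of the
# input file that is byte-wise distinct from all other lines but is
# missing from the output of a locale-aware `sort -u`.
dropped_by_sort_u() {
    # $1: input file, one entry per line
    tmp_c=$(mktemp) && tmp_l=$(mktemp) || return 1

    # Lines that are unique when compared byte by byte.
    LC_ALL=C sort -u -- "$1" > "$tmp_c"

    # Lines that survive -u in the current locale, re-sorted byte-wise
    # so that comm sees both inputs in the same order.
    sort -u -- "$1" | LC_ALL=C sort > "$tmp_l"

    # Column 1 of comm: lines only present in the byte-wise list.
    LC_ALL=C comm -23 "$tmp_c" "$tmp_l"

    rm -f -- "$tmp_c" "$tmp_l"
}
```

Running `dropped_by_sort_u filelist` prints nothing if the locale's collation treats no two distinct lines as equal; any line it does print is one that plain `sort -u` would have silently swallowed in the current locale.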