Bug#649729: uniq: merges obscure Cyrillic characters

Alex Shinn Thu, 02 Feb 2012 22:33:29 -0800

The problem is in strcoll/strxfrm as described in:

http://unix.stackexchange.com/questions/17198/where-has-my-uniq-or-sort-u-line-gone-with-some-unicode-characters


$ LANG=en_US.UTF-8 perl -C255 -MPOSIX -le 'print "$_ ", unpack("h*", strxfrm($_)) foreach @ARGV' a b c А В Г Ѯ Ѻ Ѳ
a c010801020
b d010801020
c e010801020
А 2cbb10801090
В 2cdb10801090
Г 2ceb10801090
Ѯ 101010102c6b102c6b
Ѻ 101010102c6b102c6b
Ѳ 101010102c6b102c6b

The Latin and common Cyrillic characters all have different values,
but the rare characters all transform to the same strxfrm key, so
strcoll reports them as equal and uniq treats them as duplicates.
It also does this for Japanese kana, but not kanji.
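
For anyone who wants to check other character ranges, here is a rough
Python sketch of the same test. It groups strings by their
locale.strxfrm() key and reports the groups where different strings
share one key. The function names are made up for illustration, and
whether any collisions show up depends on the locale data installed.

```python
import locale
from collections import defaultdict


def group_by_collation_key(strings):
    """Map each strxfrm() key to the distinct strings that produce it."""
    groups = defaultdict(list)
    for s in dict.fromkeys(strings):  # drop exact duplicates, keep order
        groups[locale.strxfrm(s)].append(s)
    return dict(groups)


def collisions(strings):
    """Return only the groups where different strings share one key."""
    return [g for g in group_by_collation_key(strings).values() if len(g) > 1]


if __name__ == "__main__":
    try:
        locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")
    except locale.Error:
        pass  # locale not installed; the current collation locale is used
    print(collisions(["a", "b", "c", "А", "В", "Г", "Ѯ", "Ѻ", "Ѳ"]))
```

With affected locale data, the three rare letters should come back as
one group; with locale data that gives them distinct weights, the list
is empty.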

As the link states, it's pretty clearly a bug - the correct behavior
would be to sort the unknown characters after all known characters
and consider them distinct.  As a workaround, adding values for
all characters to every locale file in /usr/share/i18n/locales/ should
work.
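
Until the locale data is fixed, a program can also protect itself by
never trusting collation equality on its own. The following is only a
sketch of that idea in Python, not what uniq or sort do: it uses the
strxfrm() key for ordering and the raw code points as a tie-breaker, so
strings with identical keys still count as distinct. The names
collation_key and unique_sorted are hypothetical.

```python
import locale


def collation_key(s):
    """Sort key: locale collation first, raw code points as a tie-breaker."""
    return (locale.strxfrm(s), s)


def unique_sorted(strings):
    """Sort in locale order, dropping only exact (code point) duplicates."""
    return sorted(set(strings), key=collation_key)


if __name__ == "__main__":
    try:
        locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")
    except locale.Error:
        pass  # locale not installed; the current collation locale is used
    for line in unique_sorted(["Ѯ", "Ѻ", "Ѳ", "Ѯ", "a", "А"]):
        print(line)
```

This does not place the unknown characters after all known ones; it
only guarantees that none of them are merged.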

-- 
Alex


