I stopped short of actually building uniq.c from source (bootstrap, prerequisites, ...), but looking at the code, it looks like the call chain is:

    different() -> xmemcoll() -> memcoll() -> strcoll()

so I tried a little test at the strcoll() level:

    #include <stdio.h>
    #include <unistd.h>
    #include <string.h>

    int main (int argc, char **argv) {
        unsigned char null[] = { 0342, 0201, 0277, 0341, 0265, 0230, 0313, 0241, 0313, 0241, 0 };
        unsigned char iraq[] = { 0334, 0245, 0334, 0235, 0334, 0252, 0334, 0220, 0334, 0251, 0};

        printf("%s\n", null);
        printf("%s\n", iraq);
        int m = strcoll(null, iraq);
        printf("m = %d\n", m);
    }

That correctly says the strings are different:

    $ LANG=en_US.UTF-8 ./a.out
    ⁿᵘˡˡ
    ܥܝܪܐܩ
    m = 6

> On Dec 16, 2019, at 7:46 PM, Roy Smith <r...@panix.com> wrote:
>
> Yup, this does depend on the locale.  In my original example, I had
> LANG=en_US.UTF-8.  Setting it to C.UTF-8 gets me the right result:
>
>> $ LANG=C.UTF-8 uniq -c x
>>       1 "ⁿᵘˡˡ"
>>       1 "ܥܝܪܐܩ"
>
> But, that doesn't fully explain what's going on.  I find it difficult to
> believe that there's any collation sequence in the world where those two
> strings should compare the same.  I've been playing around with the ICU
> string compare demo
> <http://demo.icu-project.org/icu-bin/locexp?_=en_US&d_=en&x=col> and can't
> reproduce this there.  Possibly I just haven't hit upon the right combination
> of options to set, but I think it's far-fetched that there's any such
> combination for which those two strings comparing equal is legitimate.
>
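
One caveat about the strcoll() test above: it never calls setlocale(), and a C program starts in the "C" locale no matter what LANG is set to. So that run most likely did not exercise the en_US.UTF-8 collation tables at all. The result m = 6 is consistent with that, since 6 is exactly 0342 - 0334, the difference between the first bytes of the two strings, which is what a plain byte-wise comparison would give. Below is a minimal sketch of a helper that selects the locale before comparing. The name collate_in_locale() is hypothetical (it is not part of coreutils or gnulib), and checking errno after strcoll() is an assumption about how careful one wants to be.

```c
#include <errno.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper, not part of coreutils: compare a and b with strcoll()
   after selecting the named collation locale.  Passing "" as locale_name
   means "take the locale from LANG / LC_* in the environment".
   Returns 0 and stores the strcoll() result in *result, or returns -1 if
   the locale cannot be selected or strcoll() reports an error via errno.  */
int
collate_in_locale (const char *locale_name, const char *a, const char *b,
                   int *result)
{
  /* Without this call the process stays in the "C" locale, whatever the
     environment says.  */
  if (setlocale (LC_COLLATE, locale_name) == NULL)
    return -1;

  /* strcoll() has no error return value, so clear errno first and check it
     afterwards.  */
  errno = 0;
  int r = strcoll (a, b);
  if (errno != 0)
    return -1;

  *result = r;
  return 0;
}
```

In the test program, the direct strcoll() call could then become collate_in_locale("", (const char *) null, (const char *) iraq, &m). I have not rerun the test this way, so I can't say what it prints under en_US.UTF-8, but that is the comparison that would actually correspond to what uniq does.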