bug#38627: uniq -c gets wrong count with non-ascii strings

2019-12-17 Thread Bruno Haible
> However, the 2016 edition of POSIX removed mention of LC_COLLATE from 'uniq'

Indeed. The change was done in . Quote:

"On Page: 3309 Line: 111067 Section: uniq
In the ENVIRONMENT VARIABLES section, delete:
LC_COLLATE Determine the locale for ord

bug#38627: uniq -c gets wrong count with non-ascii strings

2019-12-17 Thread Jim Meyering
On Mon, Dec 16, 2019 at 1:41 AM Paul Eggert wrote:
> On 12/15/19 11:40 AM, Roy Smith wrote:
> > With the following input:
> >
> >> $ cat x
> >> "ⁿᵘˡˡ"
> >> "ܥܝܪܐܩ"
> >
> >
> > Running "uniq -c" says there's two copies of the same line!
> >
> >> $ uniq -c x
> >> 2 "ⁿᵘˡˡ"
>
> Thanks for the bu

bug#38627: uniq -c gets wrong count with non-ascii strings

2019-12-17 Thread Roy Smith
I stopped short of actually building uniq.c from source (bootstrap, prerequisites, ...), but looking at the code, it looks like the call chain is:

  different()
    xmemcoll()
      memcoll()
        strcoll()

so I tried a little test at the strcoll() level:

#include
#include
#include
int main (int argc, char