On 12/15/19 11:40 AM, Roy Smith wrote:
> With the following input:
>
>> $ cat x
>> "ⁿᵘˡˡ"
>> "ܥܝܪܐܩ"
>
> Running "uniq -c" says there's two copies of the same line!
>
>> $ uniq -c x
>> 2 "ⁿᵘˡˡ"

Thanks for the bug report. I expect this is because GNU 'uniq' uses the
equivalent of strcoll (locale-dependent comparison) to compare lines,
whereas macOS 'uniq' uses the equivalent of strcmp (byte comparison).
Since the two lines compare equal in your locale, GNU 'uniq' says
there's just one line.

The GNU 'uniq' behavior appears to be a consequence of this commit:

  commit 545c2323d493c7ed9c770d9b8e45a15db6f615bc
  Author: Jim Meyering <j...@meyering.net>
  Date:   Fri Aug 2 14:42:37 2002 +0000

with a change noted this way in NEWS:

  * uniq now obeys the LC_COLLATE locale, as per POSIX 1003.1-2001 TC1.

However, the 2016 edition of POSIX removed mention of LC_COLLATE from
'uniq', and I expect this means that the 2002 commit should be reverted
so that GNU 'uniq' behaves like macOS 'uniq' (a behavior that I think
makes more sense anyway). I'll CC: this email to Jim Meyering to see
whether he has an opinion about this.

In the meantime you can work around the problem by using 'LC_ALL=C uniq'
instead of plain 'uniq' in your shell script.
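
The workaround can be sketched as follows. This is only a minimal
illustration, assuming a POSIX shell with 'uniq', 'printf' and 'mktemp'
available; the scratch file and the variable names 'input' and 'out' are
placeholders and not part of the original report. In the C locale lines
are compared byte by byte, so the two lines should be reported
separately, each with a count of 1.

```shell
# Hypothetical scratch file standing in for the reporter's file "x".
input=$(mktemp)

# Recreate the two-line input from the report.
printf '%s\n' '"ⁿᵘˡˡ"' '"ܥܝܪܐܩ"' > "$input"

# LC_ALL=C is set only for this one command and forces byte-wise
# comparison, regardless of the locale of the surrounding script.
out=$(LC_ALL=C uniq -c "$input")
printf '%s\n' "$out"

rm -f "$input"
```

Setting LC_ALL on the command itself, rather than exporting it, leaves
the locale of the rest of the script unchanged.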