Re: How to sort unicode properly?

Eric Fischer Wed, 25 Sep 2019 09:37:55 -0700

Unfortunately, multibyte collation is simply unimplemented in Mac OS X, so
there is no alternate locale definition that will fix it. As far as I can
tell, this is documented only in the BUGS section of `man wcscoll`:


BUGS
     The current implementation of wcscoll() only works in single-byte
     LC_CTYPE locales, and falls back to using wcscmp() in locales with
     extended character sets.

(https://opensource.apple.com/source/Libc/Libc-1272.250.1/string/FreeBSD/wcscoll.3.auto.html)
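
If the goal is an order that does not depend on the platform's libc, one
option is to do the collation outside of `sort` entirely. Below is a rough
sketch in Python using only the standard library. The helper name
`approx_collation_key` is made up for illustration, and the key is only a
crude two-level approximation (base letters first, accents as a
tie-breaker), not the Unicode Collation Algorithm itself. A real UCA
implementation such as pyuca (linked in the quoted message) or ICU would be
the thing to use for anything beyond simple Latin text.

```python
import unicodedata


def approx_collation_key(s):
    """Hypothetical helper: crude two-level key, NOT a full UCA implementation."""
    # Decompose so that "é" becomes "e" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", s)
    # Primary level: base letters only, case-folded, combining marks dropped.
    base = "".join(
        c for c in decomposed if not unicodedata.combining(c)
    ).casefold()
    # Tie-breaker: the decomposed string, so "cafe" sorts before "café".
    return (base, decomposed)


# "caf\u00e9" is "café" with a precomposed é.
words = ["cafe", "caff", "caf\u00e9"]

# Plain code point order, which is what a wcscmp() fallback amounts to.
codepoint_order = sorted(words)

# Accent-aware approximation: "café" now sorts before "caff".
approx_order = sorted(words, key=approx_collation_key)

print(codepoint_order)
print(approx_order)
```

Because the key depends only on Unicode normalization data and not on the
locale, it should give the same order on Mac OS X and Linux, though the
exact results can differ from a true UCA sort for scripts and edge cases
this sketch ignores.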

Eric

On Wed, Sep 25, 2019 at 8:59 AM Peng Yu <[email protected]> wrote:

> I want to make my `sort` machine-independent and have it always use the
> correct Unicode sort order. Is there a way to do so?
>
> I don't know how to check where en_US.UTF-8 comes from. Do you know
> how to check it? (I use Mac OS X.)
>
> On 9/25/19, Eric Blake <[email protected]> wrote:
> > On 9/25/19 10:20 AM, Peng Yu wrote:
> >> Hi,
> >>
> >> It seems that "café" should be sorted before "caff" in Unicode.
> >>
> >> https://github.com/jtauber/pyuca
> >>
> >> But `sort` does not do so.
> >>
> >> $ printf '%s\n' cafe caff café | LC_ALL=UTF8  sort
> >> cafe
> >> caff
> >> café
> >> $ printf '%s\n' cafe caff café | LC_ALL=en_US.UTF-8  sort
> >> cafe
> >> caff
> >> café
> >>
> >> How to make `sort` sort according to Unicode order? Thanks.
> >
> > You'll have to write a locale definition where strcoll() sorts in the
> > order you want.  Coreutils sort is calling strcoll(), and if it doesn't
> > sort the way you think it should, the bug is in your locale and not in
> > coreutils.  You'll want to report this issue to whoever provided your
> > en_US.UTF-8 locale (perhaps glibc?)
> >
> > --
> > Eric Blake, Principal Software Engineer
> > Red Hat, Inc.           +1-919-301-3226
> > Virtualization:  qemu.org | libvirt.org
> >
>
>
> --
> Regards,
> Peng
>
>
