Re: Unexpected behavior of sort -hu

Pascal Stumpf Mon, 31 Mar 2025 15:38:16 -0700

On Sun, 30 Mar 2025 16:02:04 +0200, Ingo Schwarze wrote:
> Hello Pascal,
> 
> Pascal Stumpf wrote on Thu, Mar 27, 2025 at 07:33:27PM +0100:
> 
> [...]
> > I probably should have explained myself a little better.  The problem
> > with your explanation is that the terms "upper case" and "lower case"
> > letters are too broad and are not limited to ASCII.  A Greek upper case
> > alpha is an upper case letter, and is certainly not sorted before a
> > lower case ASCII 'a', even if LC_COLLATE were implemented (I think).
> 
> I don't see a problem here.  Our sort(1) manual page already says:
> 
>   STANDARDS
>      The sort utility is compliant with the IEEE Std 1003.1-2008 (POSIX.1)
>      specification, except that it ignores the user's locale(1) and always
>      assumes LC_ALL=C.
> 
> So it's clear that we are talking about ASCII characters only and not
> about Greek letters.


I disagree; that statement is quite hidden, and even then, it's a bit of
a leap to conclude that LC_ALL=C also applies to the man page's
terminology of what counts as a 'character'.  However ...

> > So I would avoid using these classifications entirely.
> 
> That would be possible with option 2 below.

I very much agree with this direction and your diff below.  This option
makes it abundantly clear that the comparison order is only defined for
ascii(7), and anything else is unspecified.  OK for that.

But regarding sort -V, I think the reality is even more ugly ...

> > On Thu, 27 Mar 2025 13:55:29 +0100, Ingo Schwarze wrote:
> 
> >> Do you have an idea of what we might say to achieve a reasonable
> >> level of vagueness?
> 
> > The first paragraph of DESCRIPTION uses the word 'lexicographically' to
> > describe the default comparison mode,
> 
> That default sorting order is selected by get_sort_func() in coll.c
> as wstrcoll(), which defers to bwscoll() in bwstring.c, which compares
> by memcmp(3):
> 
>    $ printf "|\na\n_\nA\n=\n" | sort | perl -ne 'chomp;print'
>   =A_a|
> 
> That behaviour is definitely right because POSIX says in
> https://pubs.opengroup.org/onlinepubs/9799919799/utilities/sort.html
> 
>   Comparisons ... shall be performed using the collating sequence
>   of the current locale.
> 
> and
> https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap07.html#tag_07_03_02_06
> requires the collating order of the POSIX locale to follow ASCII.
> 
> Calling the collating order of the current locale "lexicographically"
> is maybe OK, too.  Maybe.  Or maybe it is slightly confusing because
> POSIX uses the term "lexicographic" only for tsort(1), ctags(1) and cflow(1)
> but not in relation to anything involving locales.

Small note: I don't think the usage of the term "lexicographic" in POSIX
should have any impact on the sort(1) man page.  They are two different
documents, with different conventions and different terminology.

> Using the same term for -V seems problematic to me because -V does *not*
> use the same order, *not* the collating sequence of the POSIX locale:
> 
>    $ printf "|\na\n_\nA\n=\n" | sort -V | perl -ne 'chomp;print'
>   Aa=_|
> 
> Arguably, that is even more lexicographic than the POSIX collating sequence.
> What a mess.  Either way, using the same word for two different orderings
> is not good.

This difference in behaviour is due to the way sort -V attempts to find
"suffix strings", I believe in vsort.c:find_suffix(), not due to
different considerations about lexical order of characters.  Skimming
over the code and comments, the chief design pattern here seems to have
been to replicate "whatever GNU sort does".  Oh god.

And here are some more samples:

~ $ cat t
sort-1.23.tar.gz
sort-1.23.tar.bz|
sort-1.23.tar.bz_
sort-1.23.tar.bz=
sort-1.23.tar.bza
sort-1.23.tar.bz2
sort-1.23.tar.bz~
~ $ sort -V t
sort-1.23.tar.bz2
sort-1.23.tar.bza
sort-1.23.tar.bz~
sort-1.23.tar.gz
sort-1.23.tar.bz=
sort-1.23.tar.bz_
sort-1.23.tar.bz|
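
For comparison, here is a minimal sketch of what plain ascii(7) order
gives for the same seven names.  The variable name ascii_sorted is only
for illustration, and LC_ALL=C is set explicitly in case the local
sort(1) honors locales:

```shell
# Byte-wise (ascii(7)) order of the sample names, without -V.
# ascii_sorted is a hypothetical name used only in this sketch.
ascii_sorted=$(LC_ALL=C sort <<'EOF'
sort-1.23.tar.gz
sort-1.23.tar.bz|
sort-1.23.tar.bz_
sort-1.23.tar.bz=
sort-1.23.tar.bza
sort-1.23.tar.bz2
sort-1.23.tar.bz~
EOF
)
printf '%s\n' "$ascii_sorted"
# Order by the last distinguishing byte:
#   bz2  bz=  bz_  bza  bz|  bz~  gz
```

In that order '=' and '_' come before 'a', and all bz* names come before
gz, which is not what the -V output above shows.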

> > perhaps intentionally not going into the details anywhere in the page.
> 
> I doubt that whoever wrote our sort(1) manual - or the associated code,
> for that matter - did anything out of wisdom or restraint.  The much
> more likely explanation seems to be thoughtlessness and sloppiness.
> 
> I think we should improve the initial paragraph of the DESCRIPTION
> to avoid the term "lexicographically".  It is vague and confusing
> in so far as POSIX does not define it.  Introducing the proper term
> "collation sequence" would be over the top given that the concepts
> involved are very complicated and we deliberately do not support
> any of them.
> 
> I think from the user perspective, it is most helpful to clearly state
> what our implementation actually does - ascii(7) ordering.  In particular
> since that coincides with what POSIX requires as the default.
> We should not be vague given that POSIX requires specific behaviour.
> 
> While here, let's also fix the first sentence: talking about
> sorting "by lines" only to talk about sorting "by keys" right afterwards
> is confusing.  I guess what is meant is sorting "the lines".  Also,
> the "and" is dubious; sorting text and binary files together isn't
> really such a great idea.  Let's better regard all the files as either
> text files *or* binary files.
> 
> If we put this in (OK?), after that, i see three options for -V:
> 
>  1. Leave the -V text as is; it is accurate and easy to understand.
>  2. Say something like
>     in ascii(7) order, except that all letters are sorted before all
>     other characters

"in ascii(7) order" is correct; however, the "except" clause is wrong.
See above how "bz2" is sorted before "bza".  Documenting this properly is
complicated because of how substring and suffix separation is done,
and the original authors' decision to just leave a regex there is
telling.
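
As a rough sketch of the suffix idea, the following checks which of the
sample names end in something that looks like a file name suffix.  The
pattern is an assumption: it is modelled on the suffix regex given in
the GNU coreutils documentation, with the outer '*' changed to '+' so
that an empty suffix does not match every line.  I have not checked that
find_suffix() matches exactly this.

```shell
# Assumed suffix pattern, modelled on the GNU version-sort documentation;
# not copied from vsort.c.  suffix_re and with_suffix are hypothetical names.
suffix_re='(\.[A-Za-z~][A-Za-z0-9~]*)+$'

# Keep only the names whose tail matches the assumed suffix pattern.
with_suffix=$(LC_ALL=C grep -E "$suffix_re" <<'EOF'
sort-1.23.tar.gz
sort-1.23.tar.bz|
sort-1.23.tar.bz_
sort-1.23.tar.bz=
sort-1.23.tar.bza
sort-1.23.tar.bz2
sort-1.23.tar.bz~
EOF
)
printf '%s\n' "$with_suffix"
# Names ending in '|', '_' or '=' are filtered out; gz, bza, bz2 and bz~ remain.
```

The four names that pass this filter are the ones sort -V lists first in
the sample above, which would be consistent with the suffix handling
being responsible, but it does not prove it.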

I think simply "in ascii(7) order" is sufficient.

>  3. Say something like
>     for non-digits, the sorting order is unspecified
> 
> I'd be fine with both 1. and 2. and i like 3. less.
> Saying "lexicographically" seems even worse to me than 3. because it feels
> misleading.  It sounds as if it would say something of substance, but it's
> unclear what that is, and however you define "lexicographically", it's
> likely not what -V does.  For example, it certainly does not match
> how we use the term "lexicographically" in strcoll(3) or strcmp(3).
> 
> Yours,
>   Ingo
> 
> 
> Index: sort.1
> ===================================================================
> RCS file: /cvs/src/usr.bin/sort/sort.1,v
> diff -u -r1.68 sort.1
> --- sort.1    28 Mar 2025 14:35:50 -0000      1.68
> +++ sort.1    30 Mar 2025 13:19:15 -0000
> @@ -50,7 +50,7 @@
>  .Sh DESCRIPTION
>  The
>  .Nm
> -utility sorts text and binary files by lines.
> +utility sorts the lines of text or binary files.
>  A line is a record separated from the subsequent record by a
>  newline (default) or NUL
>  .Ql \e0
> @@ -61,12 +61,12 @@
>  .Pc .
>  A record can contain any printable or unprintable characters.
>  Comparisons are based on one or more sort keys extracted from
> -each line of input, and are performed lexicographically,
> -according to the specified command-line options
> -that can tune the actual sorting behavior.
> -By default, if keys are not given,
> +each line according to the specified command line options.
> +By default,
>  .Nm
> -uses entire lines for comparison.
> +uses entire lines for comparison and sorts in
> +.Xr ascii 7
> +order.
>  .Pp
>  If no
>  .Ar file
