Re: Unexpected behavior of sort -hu

Ingo Schwarze Sun, 30 Mar 2025 07:02:47 -0700

Hello Pascal,

Pascal Stumpf wrote on Thu, Mar 27, 2025 at 07:33:27PM +0100:

[...]
> I probably should have explained myself a little better.  The problem
> with your explanation is that the terms "upper case" and "lower case"
> letters are too broad and are not limited to ASCII.  A Greek upper case
> alpha is an upper case letter, and is certainly not sorted before a
> lower case ASCII 'a', even if LC_COLLATE were implemented (I think).

I don't see a problem here.  Our sort(1) manual page already says:

  STANDARDS
     The sort utility is compliant with the IEEE Std 1003.1-2008 (POSIX.1)
     specification, except that it ignores the user's locale(1) and always
     assumes LC_ALL=C.

So it's clear that we are talking about ASCII characters only and not
about Greek letters.

> So I would avoid using these classifications entirely.

That would be possible with option 2 below.

> On Thu, 27 Mar 2025 13:55:29 +0100, Ingo Schwarze wrote:

>> Do you have an idea of what we might say to achieve a reasonable
>> level of vagueness?

> The first paragraph of DESCRIPTION uses the word 'lexicographically' to
> describe the default comparison mode,

That default sorting order is selected by get_sort_func() in coll.c
as wstrcoll(), which defers to bwscoll() in bwstring.c, which compares
by memcmp(3):

   $ printf "|\na\n_\nA\n=\n" | sort | perl -ne 'chomp;print'
  =A_a|

That behaviour is definitely right because POSIX says in
https://pubs.opengroup.org/onlinepubs/9799919799/utilities/sort.html

  Comparisons ... shall be performed using the collating sequence
  of the current locale.

and
https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap07.html#tag_07_03_02_06
requires the collating order of the POSIX locale to follow ASCII.
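
To see why the output above comes out as =A_a|, it helps to look at the
code values from ascii(7).  The following is only a small sketch, assuming
a POSIX sh and printf(1); char_codes is a made-up helper name, not anything
that exists in sort(1) or in the tree.

```shell
# Print each argument (a single ASCII character) together with its
# decimal ascii(7) value, lowest value first.
# NOTE: char_codes is a hypothetical helper used for illustration only.
char_codes() {
    for c in "$@"; do
        # A leading single quote makes printf(1) print the numeric
        # value of the character that follows it.
        printf '%d %s\n' "'$c" "$c"
    done | LC_ALL=C sort -n
}

char_codes '|' a _ A '='
```

The values are 61 for =, 65 for A, 95 for _, 97 for a and 124 for |,
so a plain byte comparison puts them in exactly the order shown above.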

Calling the collating order of the current locale "lexicographically"
is maybe OK, too.  Maybe.  Or maybe it is slightly confusing because
POSIX uses the term "lexicographic" only for tsort(1), ctags(1) and cflow(1)
but not in relation to anything involving locales.

Using the same term for -V seems problematic to me because -V does *not*
use the same order, that is, *not* the collating sequence of the POSIX locale:

   $ printf "|\na\n_\nA\n=\n" | sort -V | perl -ne 'chomp;print'
  Aa=_|

Arguably, that is even more lexicographic than the POSIX collating sequence.
What a mess.  Either way, using the same word for two different orderings
is not good.

> perhaps intentionally not going into the details anywhere in the page.

I doubt that whoever wrote our sort(1) manual - or the associated code,
for that matter - did anything out of wisdom or restraint.  The much
more likely explanation seems to be thoughtlessness and sloppiness.

I think we should improve the initial paragraph of the DESCRIPTION
to avoid the term "lexicographically".  It is vague and confusing
in so far as POSIX does not define it.  Introducing the proper term
"collation sequence" would be over the top given that the concepts
involved are very complicated and we deliberately do not support
any of them.

I think from the user's perspective, it is most helpful to clearly state
what our implementation actually does - ascii(7) ordering - in particular
since that coincides with what POSIX requires as the default.
We should not be vague given that POSIX requires specific behaviour.

While here, let's also fix the first sentence: talking about
sorting "by lines" only to talk about sorting "by keys" right afterwards
is confusing.  I guess what is meant is sorting "the lines".  Also,
the "and" is dubious; sorting text and binary files together isn't
really such a great idea.  Let's better regard all the files as either
text files *or* binary files.

If we put this in (OK?), after that, i see three options for -V:

 1. Leave the -V text as is; it is accurate and easy to understand.
 2. Say something like
    in ascii(7) order, except that all letters are sorted before all
    other characters
 3. Say something like
    for non-digits, the sorting order is unspecified
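
To make the wording of option 2 concrete, here is a rough emulation of
"ascii(7) order, except that all letters are sorted before all other
characters".  It is a sketch under stated assumptions, not what the -V
code does internally: it only looks at the first character of each line,
it ignores the digit handling of -V entirely, and letters_first_sort is
a made-up name.

```shell
# Emulate "letters before all other characters, otherwise ascii(7) order"
# for lines whose ordering is decided by their first character.
# NOTE: letters_first_sort is a hypothetical helper for illustration only.
letters_first_sort() {
    # Decorate: key 0 for lines starting with an ASCII letter, 1 otherwise.
    LC_ALL=C awk '{ k = ($0 ~ /^[A-Za-z]/) ? 0 : 1; print k "\t" $0 }' |
        # Sort in plain byte order; the key groups the letters first.
        LC_ALL=C sort |
        # Undecorate: drop the key and the tab again.
        cut -f2-
}

printf '|\na\n_\nA\n=\n' | letters_first_sort | tr -d '\n'; echo
```

For the five test characters used earlier in this mail, this prints Aa=_|,
the same order that sort -V produced.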

I'd be fine with either 1. or 2., and i like 3. less.
Saying "lexicographically" seems even worse to me than 3. because it feels
misleading.  It sounds as if it would say something of substance, but it's
unclear what that is, and however you define "lexicographically", it's
likely not what -V does.  For example, it certainly does not match
how we use the term "lexicographically" in strcoll(3) or strcmp(3).

Yours,
  Ingo

Index: sort.1
===================================================================
RCS file: /cvs/src/usr.bin/sort/sort.1,v
diff -u -r1.68 sort.1
--- sort.1      28 Mar 2025 14:35:50 -0000      1.68
+++ sort.1      30 Mar 2025 13:19:15 -0000
@@ -50,7 +50,7 @@
 .Sh DESCRIPTION
 The
 .Nm
-utility sorts text and binary files by lines.
+utility sorts the lines of text or binary files.
 A line is a record separated from the subsequent record by a
 newline (default) or NUL
 .Ql \e0
@@ -61,12 +61,12 @@
 .Pc .
 A record can contain any printable or unprintable characters.
 Comparisons are based on one or more sort keys extracted from
-each line of input, and are performed lexicographically,
-according to the specified command-line options
-that can tune the actual sorting behavior.
-By default, if keys are not given,
+each line according to the specified command line options.
+By default,
 .Nm
-uses entire lines for comparison.
+uses entire lines for comparison and sorts in
+.Xr ascii 7
+order.
 .Pp
 If no
 .Ar file
