Re: Unexpected behavior of sort -hu

Ingo Schwarze Mon, 31 Mar 2025 17:55:52 -0700

Hello Pascal,

Pascal Stumpf wrote on Tue, Apr 01, 2025 at 12:37:44AM +0200:
> On Sun, 30 Mar 2025 16:02:04 +0200, Ingo Schwarze wrote:
>> Pascal Stumpf wrote on Thu, Mar 27, 2025 at 07:33:27PM +0100:


>>> I probably should have explained myself a little better.  The problem
>>> with your explanation is that the terms "upper case" and "lower case"
>>> letters are too broad and are not limited to ASCII.  A Greek upper case
>>> alpha is an upper case letter, and is certainly not sorted before a
>>> lower case ASCII 'a', even if LC_COLLATE were implemented (I think).

>> I don't see a problem here.  Our sort(1) manual page already says:
>> 
>>   STANDARDS
>>      The sort utility is compliant with the IEEE Std 1003.1-2008 (POSIX.1)
>>      specification, except that it ignores the user's locale(1) and always
>>      assumes LC_ALL=C.
>> 
>> So it's clear that we are talking about ASCII characters only and not
>> about Greek letters.

> I disagree; that statement is quite hidden, and even then, it's a bit of
> a leap to conclude that LC_ALL=C also applies to the man page's
> terminology of what counts as a 'character'.  However ...

Thanks for explaining; i now understand better why you dislike talking in
this context about character attributes that are generally locale-dependent.
Yes, you are right that the statement is "hidden" in the STANDARDS section.
To me, it is blatantly obvious that a manual page of a facility that is
in any way related to locales uses the term "character" in the sense
defined by LC_CTYPE.  That's such a general principle that there are large
numbers of manual pages making that explicit by saying something like

  ENVIRONMENT
    LC_CTYPE  The character encoding locale(1).  It decides which byte
              sequences form characters and [...]

for example colrm(1), column(1), cut(1), fmt(1), fold(1), less(1), ls(1),
nl(1), ps(1), rev(1), rs(1), ul(1), uniq(1), wc(1), printf(3), ...

Then again, you are probably right that this convention is not obvious
to users less experienced with how locales work.

So let's see how we can avoid the kind of potential confusion you point out.

>>> So I would avoid using these classifications entirely.

>> That would be possible with option 2 below.

> I very much agree with this direction and your diff below.  This option
> makes it abundantly clear that the comparison order is only defined for
> ascii(7), and anything else is unspecified.  OK for that.

Committed, thanks for checking.

> But regarding sort -V, I think the reality is even more ugly ...

[...]
> Small note: I don't think the usage of the term "lexicographic" in POSIX
> should have any impact on the sort(1) man page.  They are two different
> documents, with different conventions and different terminology.

You have a point, it is not completely unheard of that our manual pages
deliberately use terminology that does not match the terminology of the
standards.  Often, we opt for simpler, less formal terminology, for the
sake of making our text easier to understand.

Also, our strcoll(3) and strcmp(3) manual pages already use the
term "lexicographically" in a loose manner that poorly aligns with
the usage in POSIX, and consistency within our own manual page corpus
matters more than consistency with POSIX.

Then again, in a field as full of traps and surprises as the field
of locales, i'd still hope to avoid as many terminological conflicts
and confusing and misleading choices of terminology as we can.

>> Using the same term for -V seems problematic to me because -V does *not*
>> use the same order, *not* the collating sequence of the POSIX locale:
>> 
>>    $ printf "|\na\n_\nA\n=\n" | sort -V | perl -ne 'chomp;print'
>>   Aa=_|
>> 
>> Arguably, that is even more lexicographic than the POSIX collating sequence.
>> What a mess.  Either way, using the same word for two different orderings
>> is not good.

> This difference in behaviour is due to the way sort -V attempts to find
> "suffix strings", I believe in vsort.c:find_suffix(), not due to
> different considerations about lexical order of characters.

Are you really sure about that?

According to the regex, the suffix either starts with a literal dot,
or it is empty.  The code in find_suffix() appears to agree.
The "while" loop always iterates the full string until the end,
including any suffixes.  There is no early breaking out of the loop.
When exiting the loop, clen is always the full length of the string.
Unless the sfx flag is set, that full length is returned, i.e.  there
is only a non-empty suffix if sfx == true.  But the only condition
that causes sfx = true to be set is when finding a literal dot.

There is no literal dot in my example, so find_suffix() is
completely irrelevant for what we are talking about.

The actual code governing my above example is cmpversions()
calling cmp_chars() - which sorts in ascii(7) order, except that
letters are put before non-letters.
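
As a rough illustration of that ordering, here is a minimal sketch in
Python.  It is not the actual cmp_chars() code from vsort.c, and the
helper name char_rank is made up for this example.  It ranks each
character by "letter or not" first and by its ascii(7) code second,
which is enough to reproduce the output quoted above:

```python
def char_rank(c):
    """Hypothetical helper: ASCII letters rank before everything else;
    within each group, characters keep their ascii(7) order."""
    is_letter = c.isascii() and c.isalpha()
    return (0 if is_letter else 1, ord(c))

lines = ["|", "a", "_", "A", "="]
result = "".join(sorted(lines, key=char_rank))
print(result)  # Aa=_|
```

Plain ascii(7) order would instead put '=' first and interleave the
letters with '_', which is why "in ascii(7) order" alone does not
describe this behaviour.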

The way we disagree about what this code does after we have both
inspected it exemplifies just how bad this code really is.  Very
hard to audit for a human being.  I mean, neither of us is a newbie
at code auditing...

> Skimming over the code and comments, the chief design pattern here
> seems to have been to replicate "whatever GNU sort does".  Oh god.

Sometimes, GNU compatibility is not all bad; i am aiming for GNU roff
compatibility in mandoc(1) as well.  While i admit that rarely results
in stellar design at the end of the day, gratuitous incompatibility
can be even worse.

> And here are some more samples:
> 
> ~ $ cat t
> sort-1.23.tar.gz
> sort-1.23.tar.bz|
> sort-1.23.tar.bz_
> sort-1.23.tar.bz=
> sort-1.23.tar.bza
> sort-1.23.tar.bz2
> sort-1.23.tar.bz~
> ~ $ sort -V t
> sort-1.23.tar.bz2
> sort-1.23.tar.bza
> sort-1.23.tar.bz~
> sort-1.23.tar.gz

The above are all suffixes of the form .tar.<alnum>, so these four
sort equal according to -V.  That activates the fallback of sorting
the whole lines, resulting in ascii(7) sorting bz2 < bza < bz~ < gz.

> sort-1.23.tar.bz=
> sort-1.23.tar.bz_
> sort-1.23.tar.bz|

These three experience the following field splitting:

  sort-  1  .  23  .tar.bz<final_byte>

The first four fields are equal in all three lines, and equal to the
four fields in the first four lines.  The last three lines all sort
after the first four lines (in cmpversions()) because they have
a fifth, additional non-empty field, and the sorting among these
three final lines is determined by the cmp_chars() call on
the final character.  Since none of these three final characters
are letters, we get ascii(7) order  = < _ < |.
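
The steps described above (suffix stripping, field splitting,
field-wise comparison, whole-line tiebreak) can be modelled with the
following Python sketch.  It is a simplified model under stated
assumptions, not a port of vsort.c: the suffix pattern is assumed to be
"a dot, then a letter or tilde, then letters, digits, or tildes", a
missing digit field is assumed to count as 0, and all function names
are hypothetical.

```python
import re
from functools import cmp_to_key

# Assumed suffix shape: a dot, then a letter or tilde, then letters,
# digits, or tildes; zero or more of these at the end of the line.
SUFFIX = re.compile(r"(\.[A-Za-z~][A-Za-z0-9~]*)*$")

def fields(line):
    """Strip trailing suffixes, then split the rest into alternating
    non-digit and digit substrings; only the first may be empty."""
    stem = line[:SUFFIX.search(line).start()]
    parts = re.findall(r"[0-9]+|[^0-9]+", stem)
    if parts and parts[0][0] in "0123456789":
        parts.insert(0, "")
    return parts

def char_rank(c):
    # Letters before non-letters, ascii(7) order within each group.
    return (0 if c.isascii() and c.isalpha() else 1, ord(c))

def cmp(a, b):
    return (a > b) - (a < b)

def version_cmp(x, y):
    fx, fy = fields(x), fields(y)
    for i in range(max(len(fx), len(fy))):
        a = fx[i] if i < len(fx) else ""
        b = fy[i] if i < len(fy) else ""
        if i % 2:
            # Odd index: digit substrings, compared as integers.
            r = cmp(int(a or "0"), int(b or "0"))
        else:
            # Even index: non-digit substrings, compared per character.
            r = cmp([char_rank(c) for c in a], [char_rank(c) for c in b])
        if r:
            return r
    return cmp(x, y)  # all fields equal: fall back to the whole line

t = [
    "sort-1.23.tar.gz", "sort-1.23.tar.bz|", "sort-1.23.tar.bz_",
    "sort-1.23.tar.bz=", "sort-1.23.tar.bza", "sort-1.23.tar.bz2",
    "sort-1.23.tar.bz~",
]
ordered = sorted(t, key=cmp_to_key(version_cmp))
print("\n".join(ordered))
```

Under these assumptions, the four lines ending in bz2, bza, bz~, and gz
all reduce to the fields "sort-", "1", ".", "23" and are ordered by the
tiebreak alone, while the other three keep a fifth field and sort
after them.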

So frankly, i fear your example does not tell us anything about the
question we are trying to investigate.  :-(

[...]
>> i see three options for -V:
>> 
>>  1. Leave the -V text as is; it is accurate and easy to understand.
>>  2. Say something like
>>     in ascii(7) order, except that all letters are sorted before all
>>     other characters

> "in ascii(7) order" is correct, however the "except" sentence is wrong.
> See above how 2 is sorted above a.

That isn't contradicting my wording.

sort-1.23.tar.bz2 and sort-1.23.tar.bza compare equal according to -V
because of the .tar.foo suffix, and then you get the whole-line tiebreak
sorting 2 before a according to ascii(7).

> Documenting this properly is complicated because of how substring
> and suffix separation is done, 

True, but i already documented substring separation in the second paragraph
and suffix separation in the fourth paragraph, so these problems are both
solved.

> and the original authors' decision to just leave a regex there is
> telling.

Indeed.  I opted to leave the regexp there, at the end of my new
fourth paragraph, because of how complicated the rules are, hoping
that some readers may understand my description in plain English,
some may understand the regular expression, and some may feel
enlightened by comparing both.  :-/

> I think simply "in ascii(7) order" is sufficient.

But that would be blatantly incorrect.  Consider my above example.
The following is *not* ascii(7) order:  Aa=_|

I believe that the following patch is correct and might address
your legitimate concerns.  Do you agree?

Yours,
  Ingo


Index: sort.1
===================================================================
RCS file: /cvs/src/usr.bin/sort/sort.1,v
diff -u -r1.69 sort.1
--- sort.1      1 Apr 2025 00:18:28 -0000       1.69
+++ sort.1      1 Apr 2025 00:38:40 -0000
@@ -208,11 +208,9 @@
 until a difference is found.
 The first substring can be empty; all others cannot.
 .Pp
-Non-digit substrings are compared alphabetically, with upper case
-letters sorting before lower case letters, letters sorting before
-non-letters, and non-letters sorting in
-.Xr ascii 7
-order.
+Non-digit substrings are compared according to
+.Xr ascii 7 ,
+except that all letters are sorted before all other characters.
 Substrings consisting of digits are compared as integer numbers.
 .Pp
 At the end of each string, zero or more suffixes that start with a dot,
