Re: Unexpected behavior of sort -hu

Pascal Stumpf Wed, 02 Apr 2025 12:05:30 -0700

Hi Ingo,

On Tue, 1 Apr 2025 02:55:17 +0200, Ingo Schwarze wrote:
> Hello Pascal,
> 
> Pascal Stumpf wrote on Tue, Apr 01, 2025 at 12:37:44AM +0200:
> > On Sun, 30 Mar 2025 16:02:04 +0200, Ingo Schwarze wrote:
> >> Pascal Stumpf wrote on Thu, Mar 27, 2025 at 07:33:27PM +0100:
> 
> >>> I probably should have explained myself a little better.  The problem
> >>> with your explanation is that the terms "upper case" and "lower case"
> >>> letters are too broad and are not limited to ASCII.  A Greek upper case
> >>> alpha is an upper case letter, and is certainly not sorted before a
> >>> lower case ASCII 'a', even if LC_COLLATE were implemented (I think).
> 
> >> I don't see a problem here.  Our sort(1) manual page already says:
> >> 
> >>   STANDARDS
> >>      The sort utility is compliant with the IEEE Std 1003.1-2008 (POSIX.1)
> >>      specification, except that it ignores the user's locale(1) and always
> >>      assumes LC_ALL=C.
> >> 
> >> So it's clear that we are talking about ASCII characters only and not
> >> about Greek letters.
> 
> > I disagree; that statement is quite hidden, and even then, it's a bit of
> > a leap to conclude that LC_ALL=C also applies to the man page's
> > terminology of what counts as a 'character'.  However ...
> 
> Thanks for explaining, i now understand better why you dislike talking in
> this context about character attributes that are generally locale-dependent.
> Yes, you are right that statement is "hidden" in the STANDARDS section.
> To me, it is blatantly obvious that a manual page of a facility that is
> in any way related to locales uses the term "character" in the sense
> defined by LC_CTYPE.  That's such a general principle that there are large
> numbers of manual pages making that explicit by saying something like
> 
>   ENVIRONMENT
>     LC_CTYPE  The character encoding locale(1).  It decides which byte
>               sequences form characters and [...]
> 
> for example colrm(1), column(1), cut(1), fmt(1), fold(1), less(1), ls(1),
> nl(1), ps(1), rev(1), rs(1), ul(1), uniq(1), wc(1), printf(3), ...
> 
> Then again, you are probably right this convention is not obvious
> to users less experienced with how locales work.
> 
> So let's see how we can avoid the kind of potential confusion you point out.
> 
> >>> So I would avoid using these classifications entirely.
> 
> >> That would be possible with option 2 below.
> 
> > I very much agree with this direction and your diff below.  This option
> > makes it abundantly clear that the comparison order is only defined for
> > ascii(7), and anything else is unspecified.  OK for that.
> 
> Committed, thanks for checking.
> 
> > But regarding sort -V, I think the reality is even more ugly ...
> 
> [...]
> > Small note: I don't think the usage of the term "lexicographic" in POSIX
> > should have any impact on the sort(1) man page.  They are two different
> > documents, with different conventions and different terminology.
> 
> You have a point, it is not completely unheard of that our manual pages
> deliberately use terminology that does not match the terminology of the
> standards.  Often, we opt for simpler, less formal terminology, for the
> sake of making our text easier to understand.
> 
> Also, our strcoll(3) and strcmp(3) manual pages already use the
> term "lexicographically" in a loose manner that poorly aligns with
> the usage in POSIX, and consistency within our own manual page corpus
> matters more than consistency with POSIX.
> 
> Then again, in a field as full of traps and surprises as the field
> of locales, i'd still hope to avoid as many terminological conflicts
> and confusing and misleading choices of terminology as we can.
> 
> >> Using the same term for -V seems problematic to me because -V does *not*
> >> use the same order, *not* the collating sequence of the POSIX locale:
> >> 
> >>    $ printf "|\na\n_\nA\n=\n" | sort -V | perl -ne 'chomp;print'
> >>   Aa=_|
> >> 
> >> Arguably, that is even more lexicographic than the POSIX collating
> >> sequence.  What a mess.  Either way, using the same word for two
> >> different orderings is not good.
> 
> > This difference in behaviour is due to the way sort -V attempts to find
> > "suffix strings", I believe in vsort.c:find_suffix(), not due to
> > different considerations about lexical order of characters.
> 
> Are you really sure about that?
> 
> According to the regex, the suffix either starts with a literal dot,
> or it is empty.  The code in find_suffix() appears to agree.
> The "while" loop always iterates the full string until the end,
> including any suffixes.  There is no early breaking out of the loop.
> When exiting the loop, clen is always the full length of the string.
> Unless the sfx flag is set, that full length is returned, i.e.  there
> is only a non-empty suffix if sfx == true.  But the only condition
> that causes sfx = true to be set is when finding a literal dot.
> 
> There is no literal dot in my example, so find_suffix() is
> completely irrelevant for what we are talking about.
> 
> The actual code governing my above example is cmpversions()
> calling cmp_chars() - which sorts in ascii(7) order, except that
> letters are put before non-letters.


Ah, yes, now I see.  You're correct.
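
For the archives, here is a minimal sketch of the rule as I now
understand it: ascii(7) order, except that letters sort before
non-letters.  It is only a Python model of the ordering described
above, not the code from vsort.c, and the helper name cmp_key is made
up for illustration.  It only looks at single characters, so it says
nothing about substring or suffix handling.

```python
def cmp_key(ch):
    # Hypothetical helper: ASCII letters get rank 0, all other
    # characters rank 1; ties are broken by the ascii(7) code.
    is_letter = ch.isascii() and ch.isalpha()
    return (0 if is_letter else 1, ord(ch))

# The five one-character lines from the printf example quoted above.
chars = ["|", "a", "_", "A", "="]
ordered = "".join(sorted(chars, key=cmp_key))
print(ordered)  # prints Aa=_|
```

Plain ascii(7) order would give =A_a| instead, which is the
difference the new manual page wording has to capture.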

> The way we disagree about what this code does after we have both
> inspected it exemplifies just how bad this code really is.  Very
> hard to audit for a human being.  I mean, neither of us is a newbie
> at code auditing...
> 
> > Skimming over the code and comments, the chief design pattern here
> > seems to have been to replicate "whatever GNU sort does".  Oh god.
> 
> Sometimes, GNU compatibility is not all bad; i am aiming for GNU roff
> compatibility in mandoc(1) as well.  While i admit that rarely results
> in stellar design at the end of the day, gratuitous incompatibility
> can be even worse.

What I meant was the comment beginning at line 92; modeling -V after an
existing implementation is fine, but apparently, there's a disagreement
between documentation and code even inside GNU sort, so the authors
chose bug compatibility.

> > And here are some more samples:
> > 
> > ~ $ cat t
> > sort-1.23.tar.gz
> > sort-1.23.tar.bz|
> > sort-1.23.tar.bz_
> > sort-1.23.tar.bz=
> > sort-1.23.tar.bza
> > sort-1.23.tar.bz2
> > sort-1.23.tar.bz~
> > ~ $ sort -V t
> > sort-1.23.tar.bz2
> > sort-1.23.tar.bza
> > sort-1.23.tar.bz~
> > sort-1.23.tar.gz
> 
> The above are all suffixes of the form .tar.<alnum>, so these four
> sort equal according to -V.  That activates the fallback of sorting
> the whole lines, resulting in ascii(7) sorting bz2 < bza < bz~ < gz.
> 
> > sort-1.23.tar.bz=
> > sort-1.23.tar.bz_
> > sort-1.23.tar.bz|
> 
> These three experience the following field splitting:
> 
>   sort-  1  .  23  .tar.bz<final_byte>
> 
> The first four fields are equal in all three lines, and equal to the
> four fields in the first four lines.  The last three lines all sort
> after the first four lines (in cmpversions()) because they have
> a fifth, additional non-empty field, and the sorting among these
> three final lines is determined by the cmp_chars() call on
> the final character.  Since none of these three final characters
> are letters, we get ascii(7) order  = < _ < |.
> 
> So frankly, i fear your example does not tell us anything about the
> question we are trying to investigate.  :-(

You're right again; however, I just considered another case that calls
into question the validity of the regex used: there are some (even
reasonably widely used) filename suffixes that begin with a digit.

sort-1.23.7z
sort-1.23.tar.gz
sort-1.23.7.gz

Is sorted as:

sort-1.23.tar.gz
sort-1.23.7.gz
sort-1.23.7z

which is arguably a bug in GNU sort's regex.  There's nothing we can
really do about this, if we want to keep compatibility.
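
A rough sketch of why this happens, again as a Python model and not
the vsort.c code.  The suffix pattern below is an assumption: it is
the regex as I recall it from the GNU filevercmp comments, and it is
not quoted anywhere in this thread.  The names SUFFIX, strip_suffix
and split_fields are made up for illustration.

```python
import re

# Assumed suffix pattern: zero or more ".<letter or ~><alnum or ~>*"
# groups anchored at the end of the string.
SUFFIX = re.compile(r'(\.[A-Za-z~][A-Za-z0-9~]*)*$')

def strip_suffix(name):
    # Drop the (possibly empty) suffix before the field comparison.
    return name[:SUFFIX.search(name).start()]

def split_fields(name):
    # Alternating non-digit and digit substrings of what is left.
    return re.findall(r'\d+|\D+', strip_suffix(name))

for name in ("sort-1.23.tar.gz", "sort-1.23.7.gz", "sort-1.23.7z"):
    print(name, split_fields(name))
```

Under that assumption, ".7z" is not recognized as a suffix because
it starts with a digit, so "sort-1.23.7z" keeps an extra "7" field
and a trailing "z", whereas "sort-1.23.tar.gz" is reduced to
"sort-1.23" and "sort-1.23.7.gz" to "sort-1.23.7".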

> [...]
> >> i see three options for -V:
> >> 
> >>  1. Leave the -V text as is; it is accurate and easy to understand.
> >>  2. Say something like
> >>     in ascii(7) order, except that all letters are sorted before all
> >>     other characters
> 
> > "in ascii(7) order" is correct, however the "except" sentence is wrong.
> > See above how 2 is sorted above a.
> 
> That isn't contradicting my wording.
> 
> sort-1.23.tar.bz2 and sort-1.23.tar.bza compare equal according to -V
> because of the .tar.foo suffix, and then you get the whole-line tiebreak
> sorting 2 before a according to ascii(7).

Ah, yes.  As soon as you wrap your head around the idea that the suffix
is not a substring ...

> > Documenting this properly is complicated because of how substring
> > and suffix separation is done, 
> 
> True, but i already documented substring separation in the second paragraph
> and suffix separation in the fourth paragraph, so these problems are both
> solved.
> 
> > and the original authors' decision to just leave a regex there is
> > telling.
> 
> Indeed.  I opted to leave the regexp there, at the end of my new
> fourth paragraph, because of how complicated the rules are, hoping
> that some readers may understand my description in plain English,
> some may understand the regular expression, and some may feel
> enlightened by comparing both.  :-/
> 
> > I think simply "in ascii(7) order" is sufficient.
> 
> But that would be blatantly incorrect.  Consider my above example.
> The following is *not* ascii(7) order:  Aa=_|
> 
> I believe that the following patch is correct and might address
> your legitimate concerns.  Do you agree?

Yes.  OK.

> Yours,
>   Ingo
> 
> 
> Index: sort.1
> ===================================================================
> RCS file: /cvs/src/usr.bin/sort/sort.1,v
> diff -u -r1.69 sort.1
> --- sort.1    1 Apr 2025 00:18:28 -0000       1.69
> +++ sort.1    1 Apr 2025 00:38:40 -0000
> @@ -208,11 +208,9 @@
>  until a difference is found.
>  The first substring can be empty; all others cannot.
>  .Pp
> -Non-digit substrings are compared alphabetically, with upper case
> -letters sorting before lower case letters, letters sorting before
> -non-letters, and non-letters sorting in
> -.Xr ascii 7
> -order.
> +Non-digit substrings are compared according to
> +.Xr ascii 7 ,
> +except that all letters are sorted before all other characters.
>  Substrings consisting of digits are compared as integer numbers.
>  .Pp
>  At the end of each string, zero or more suffixes that start with a dot,
