Hi Ingo,

On Tue, 1 Apr 2025 02:55:17 +0200, Ingo Schwarze wrote:
> Hello Pascal,
>
> Pascal Stumpf wrote on Tue, Apr 01, 2025 at 12:37:44AM +0200:
> > On Sun, 30 Mar 2025 16:02:04 +0200, Ingo Schwarze wrote:
> >> Pascal Stumpf wrote on Thu, Mar 27, 2025 at 07:33:27PM +0100:
>
> >>> I probably should have explained myself a little better. The problem
> >>> with your explanation is that the terms "upper case" and "lower case"
> >>> letters are too broad and are not limited to ASCII. A Greek upper case
> >>> alpha is an upper case letter, and is certainly not sorted before a
> >>> lower case ASCII 'a', even if LC_COLLATE were implemented (I think).
>
> >> I don't see a problem here. Our sort(1) manual page already says:
> >>
> >> STANDARDS
> >>   The sort utility is compliant with the IEEE Std 1003.1-2008 (POSIX.1)
> >>   specification, except that it ignores the user's locale(1) and always
> >>   assumes LC_ALL=C.
> >>
> >> So it's clear that we are talking about ASCII characters only and not
> >> about Greek letters.
>
> > I disagree; that statement is quite hidden, and even then, it's a bit of
> > a leap to conclude that LC_ALL=C also applies to the man page's
> > terminology of what counts as a 'character'. However ...
>
> Thanks for explaining, i now understand better why you dislike talking in
> this context about character attributes that are generally locale-dependent.
> Yes, you are right that statement is "hidden" in the STANDARDS section.
> To me, it is blatantly obvious that a manual page of a facility that is
> in any way related to locales uses the term "character" in the sense
> defined by LC_CTYPE. That's such a general principle that there are large
> numbers of manual pages making that explicit by saying something like
>
>   ENVIRONMENT
>     LC_CTYPE  The character encoding locale(1). It decides which byte
>               sequences form characters and [...]
>
> for example colrm(1), column(1), cut(1), fmt(1), fold(1), less(1), ls(1),
> nl(1), ps(1), rev(1), rs(1), ul(1), uniq(1), wc(1), printf(3), ...
>
> Then again, you are probably right this convention is not obvious
> to users less experienced with how locales work.
>
> So let's see how we can avoid the kind of potential confusion you point out.
>
> >>> So I would avoid using these classifications entirely.
>
> >> That would be possible with option 2 below.
>
> > I very much agree with this direction and your diff below. This option
> > makes it abundantly clear that the comparison order is only defined for
> > ascii(7), and anything else is unspecified. OK for that.
>
> Committed, thanks for checking.
>
> > But regarding sort -V, I think the reality is even more ugly ...
> > [...]
> > Small note: I don't think the usage of the term "lexicographic" in POSIX
> > should have any impact on the sort(1) man page. They are two different
> > documents, with different conventions and different terminology.
>
> You have a point, it is not completely unheard of that our manual pages
> deliberately use terminology that does not match the terminology of the
> standards. Often, we opt for simpler, less formal terminology, for the
> sake of making our text easier to understand.
>
> Also, our strcoll(3) and strcmp(3) manual pages already use the
> term "lexicographically" in a loose manner that poorly aligns with
> the usage in POSIX, and consistency within our own manual page corpus
> matters more than consistency with POSIX.
>
> Then again, in a field as full of traps and surprises as the field
> of locales, i'd still hope to avoid as many terminological conflicts
> and confusing and misleading choices of terminology as we can.
>
> >> Using the same term for -V seems problematic to me because -V does *not*
> >> use the same order, *not* the collating sequence of the POSIX locale:
> >>
> >>   $ printf "|\na\n_\nA\n=\n" | sort -V | perl -ne 'chomp;print'
> >>   Aa=_|
> >>
> >> Arguably, that is even more lexicographic than the POSIX collating
> >> sequence.
> >> What a mess. Either way, using the same word for two different orderings
> >> is not good.
>
> > This difference in behaviour is due to the way sort -V attempts to find
> > "suffix strings", I believe in vsort.c:find_suffix(), not due to
> > different considerations about lexical order of characters.
>
> Are you really sure about that?
>
> According to the regex, the suffix either starts with a literal dot,
> or it is empty. The code in find_suffix() appears to agree.
> The "while" loop always iterates the full string until the end,
> including any suffixes. There is no early breaking out of the loop.
> When exiting the loop, clen is always the full length of the string.
> Unless the sfx flag is set, that full length is returned, i.e. there
> is only a non-empty suffix if sfx == true. But the only condition
> that causes sfx = true to be set is when finding a literal dot.
>
> There is no literal dot in my example, so find_suffix() is
> completely irrelevant for what we are talking about.
>
> The actual code governing my above example is cmpversions()
> calling cmp_chars() - which sorts in ascii(7) order, except that
> letters are put before non-letters.
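
For illustration, the ordering described in the quoted paragraph can be
sketched in a few lines of Python. This is only a minimal sketch of the
rule as stated ("ascii(7) order, except that letters are put before
non-letters"), not the actual C code in vsort.c; the name
version_char_key is made up for this example, and any further special
cases that cmp_chars() may handle are ignored.

```python
def version_char_key(ch):
    """Hypothetical sort key for a single character.

    Letters form the first group and everything else the second;
    inside each group, characters keep their ascii(7) order.
    """
    is_letter = ch.isascii() and ch.isalpha()
    return (0 if is_letter else 1, ord(ch))


sample = ["|", "a", "_", "A", "="]

# Letters first, then the remaining bytes in ascii(7) order.
print("".join(sorted(sample, key=version_char_key)))  # Aa=_|

# Plain ascii(7) order, for comparison.
print("".join(sorted(sample)))  # =A_a|
```

The first line of output matches the "Aa=_|" result from the printf
example quoted above, and the second shows how plain ascii(7) order
differs. Digits are treated here like any other non-letter; in sort -V
itself, substrings of digits are compared as integers, so that case is
outside this sketch.
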
Ah, yes, now I see. You're correct.

> The way we disagree about what this code does after we have both
> inspected it exemplifies just how bad this code really is. Very
> hard to audit for a human being. I mean, neither of us is a newbie
> at code auditing...
>
> > Skimming over the code and comments, the chief design pattern here
> > seems to have been to replicate "whatever GNU sort does". Oh god.
>
> Sometimes, GNU compatibility is not all bad, i am aiming for GNU roff
> compatibility in mandoc(1) as well. While i admit that rarely results
> in stellar design at the end of the day, gratuitous incompatibility
> can be even worse.

What I meant was the comment beginning at line 92; modeling -V after an
existing implementation is fine, but apparently, there's a disagreement
between documentation and code even inside GNU sort, so the authors
chose bug compatibility.

> > And here are some more samples:
> >
> > ~ $ cat t
> > sort-1.23.tar.gz
> > sort-1.23.tar.bz|
> > sort-1.23.tar.bz_
> > sort-1.23.tar.bz=
> > sort-1.23.tar.bza
> > sort-1.23.tar.bz2
> > sort-1.23.tar.bz~
> > ~ $ sort -V t
> > sort-1.23.tar.bz2
> > sort-1.23.tar.bza
> > sort-1.23.tar.bz~
> > sort-1.23.tar.gz
>
> The above are all suffixes of the form .tar.<alnum>, so these four
> sort equal according to -V. That activates the fallback of sorting
> the whole lines, resulting in ascii(7) sorting bz2 < bza < bz~ < gz.
>
> > sort-1.23.tar.bz=
> > sort-1.23.tar.bz_
> > sort-1.23.tar.bz|
>
> These three experience the following field splitting:
>
>   sort- 1 . 23 .tar.bz<final_byte>
>
> The first four fields are equal in all three lines, and equal to the
> four fields in the first four lines. The last three lines all sort
> after the first four lines (in cmpversions()) because they have
> a fifth, additional non-empty field, and the sorting among these
> three final lines is determined by the cmp_chars() call on
> the final character.
> Since none of these three final characters
> are letters, we get ascii(7) order = < _ < |.
>
> So frankly, i fear your example does not tell us anything about the
> question we are trying to investigate. :-(

You're right again; however I just considered another case that throws
into question the validity of the regex used: there are some (even
reasonably widely used) filename suffixes that begin with a digit.

sort-1.23.7z
sort-1.23.tar.gz
sort-1.23.7.gz

is sorted as:

sort-1.23.tar.gz
sort-1.23.7.gz
sort-1.23.7z

which is arguably a bug in GNU sort's regex. There's nothing we can
really do about this, if we want to keep compatibility.

> [...]
> >> i see three options for -V:
> >>
> >> 1. Leave the -V text as is; it is accurate and easy to understand.
> >> 2. Say something like
> >>      in ascii(7) order, except that all letters are sorted before all
> >>      other characters
>
> > "in ascii(7) order" is correct, however the "except" sentence is wrong.
> > See above how 2 is sorted above a.
>
> That isn't contradicting my wording.
>
> sort-1.23.tar.bz2 and sort-1.23.tar.bza compare equal according to -V
> because of the .tar.foo suffix, and then you get the whole-line tiebreak
> sorting 2 before a according to ascii(7).

Ah, yes. As soon as you wrap your head around the idea that the suffix
is not a substring ...

> > Documenting this properly is complicated because of how substring
> > and suffix separation is done,
>
> True, but i already documented substring separation in the second paragraph
> and suffix separation in the fourth paragraph, so these problems are both
> solved.
>
> > and the original authors' decision to just leave a regex there is
> > telling.
>
> Indeed. I opted to leave the regexp there, at the end of my new
> fourth paragraph, because of how complicated the rules are, hoping
> that some readers may understand my description in plain English,
> some may understand the regular expression, and some may feel
> enlightened by comparing both.
> :-/
>
> > I think simply "in ascii(7) order" is sufficient.
>
> But that would be blatantly incorrect. Consider my above example.
> The following is *not* ascii(7) order: Aa=_|
>
> I believe that the following patch is correct and might address
> your legitimate concerns. Do you agree?

Yes. OK.

> Yours,
>   Ingo
>
>
> Index: sort.1
> ===================================================================
> RCS file: /cvs/src/usr.bin/sort/sort.1,v
> diff -u -r1.69 sort.1
> --- sort.1 1 Apr 2025 00:18:28 -0000 1.69
> +++ sort.1 1 Apr 2025 00:38:40 -0000
> @@ -208,11 +208,9 @@
>  until a difference is found.
>  The first substring can be empty; all others cannot.
>  .Pp
> -Non-digit substrings are compared alphabetically, with upper case
> -letters sorting before lower case letters, letters sorting before
> -non-letters, and non-letters sorting in
> -.Xr ascii 7
> -order.
> +Non-digit substrings are compared according to
> +.Xr ascii 7 ,
> +except that all letters are sorted before all other characters.
>  Substrings consisting of digits are compared as integer numbers.
>  .Pp
>  At the end of each string, zero or more suffixes that start with a dot,
