On Thu, Mar 27, 2025 at 01:55:29PM +0100, Ingo Schwarze wrote:
> Hello Pascal,
>
> Pascal Stumpf wrote on Wed, Mar 26, 2025 at 08:39:15PM +0100:
> > On Wed, 26 Mar 2025 13:59:23 +0100, Ingo Schwarze wrote:
>
> >> +When comparing two strings, both strings are split into substrings
> >> +such that the first and every odd-numbered substring
> >> +consists of non-digit characters only,
>
> > s/consists/consist/
>
> I applied this correction before committing.
>
> I did not use Pascal's later suggestion of "each consist" because
> i tend to agree with Jason's final conclusion that "consist is fine".
>
> I intended the wording "the first and every odd-numbered" to signal
> 1-based numbering, but now i worry that indication is not unambiguous
> because the wording fails to call the first one "odd-numbered".
>
> The following wording tweak would resolve both issues, both making
> 1-based numbering explicit and avoiding the singular/plural quibble:
>
>   such that every odd-numbered substring including the first one
>   consists of non-digit characters only,
>

or you could just change your original text to "the first and every
other odd-numbered" i suppose.

> >> +while every even-numbered substring consists of digits only.
> >> +These substrings are compared in turn from left to right
> >> +until a difference is found.
> >> +The first substring can be empty; all others cannot.
> >> +.Pp
> >> +Non-digit substrings are compared alphabetically, with upper case
> >> +letters sorting before lower case letters, letters sorting before
> >> +non-letters, and non-letters sorting in
> >> +.Xr ascii 7
> >> +order.
>
> > Hmm. This is wrong as soon as you step foot into Unicode. I don't
> > think it hurts to be a bit more vague here.
>
> I don't think it's realistic or even a desirable goal to ever
> implement LC_COLLATE support in our libc. The whole concept, even
> though standardized in POSIX, is nothing but an instance of horrifically
> complicated overengineering. I talked to bapt@ about it during EuroBSDCon
> in Beograd (shortly after he had implemented that nightmare for FreeBSD)
> and he kept swearing about it like a trooper. Given that FreeBSD is not
> really known for keeping stuff simple or shunning excessive complication,
> his rage was quite telling.
>
> That said, we are talking about this call chain here:
>
>   versioncoll [coll.c]
>   vcmp [vsort.c]
>   cmpversions [vsort.c]
>   cmp_chars [vsort.c]
>
> Unlike much of the other code in our sort(1), which contains unused
> rigging for wchar_t handling in many places, none of this call chain
> contains anything to handle Unicode, not even disabled dummy code.
> Even if you would enable wchar_t support in our sort, ignoring my
> screaming, none of this code chain would do any Unicode handling,
> it would continue to do what i described, explicitly using its own,
> hand-rolled re-implementation of single-byte isalpha(3).
>
> So short of saying something like
>
>   It is unspecified how the non-digit substrings are compared.
>
> i can't think of a way to make this less specific, and i have no
> idea what the intended behaviour of -V would be in the presence
> of LC_COLLATE support.
>
> Do you have an idea of what we might say to achieve a reasonable
> level of vagueness?
>
> >> +Substrings consisting of digits are compared as integer numbers.
> >> +.Pp
> >> +At the end of each string, zero or more suffixes that start with a dot,
> >> +consist only of letters, digits, and tilde characters, and do not
> >> +start with a digit are ignored, equivalent to the regular expression
> >> +"(\e.([A-Za-z~][A-Za-z0-9~]*)?)*".
> >> +This is intended for ignoring filename suffixes such as
> >> +.Dq .tar.bz2 .
>
> > Maybe .tgz for consistency with the example below
>
> I slightly prefer demonstrating here that the suffix can contain digits,
> in particular since the presence of digits in file name extensions can
> result in confusion when people apply the suffix rule and the rule
> about digit/non-digit splitting in the wrong order.
>
> Besides, when you have multiple examples, i don't consider it a goal
> to have all examples demonstrate the same aspects. To the contrary,
> having the examples cover as many different aspects as possible
> feels preferable.
>
> > (and since we don't have bzip2(1) in base)?
>
> I don't think that's a problem. The base system is certainly
> equipped to handle strings containing the substring "bz2", and even
> to store files with a .bz2 file name extension.
>
> Besides, i doubt anyone uses OpenBSD without using ports, and use
> of bzip2(1) is widespread in ports, so mentioning it in an example
> does not feel exotic at all.
>
> >> .Pp
> >> For example:
> >> .Bd -literal -offset indent
>
> > Maybe clarify here that the 'odd-numbered substring' is simply a dot in
> > the typical 'version sort' case.
>
> Like in the patch below?
>
> It feels slightly wordy, any idea how to bring the point across more
> concisely?
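
For illustration only, the rules quoted above (alternating non-digit and
digit substrings, letters before non-letters, digit runs compared as
integers, trailing suffixes ignored) can be sketched in Python. This is
not the code in vsort.c. The names SUFFIXES, char_rank, split_version and
version_key are made up for the sketch and the pkg-* file names are
hypothetical. It assumes that letters compare in plain ASCII order, that
a string sorts before any longer string it is a prefix of, and it leaves
out any special case the quoted text does not mention.

```python
import re

# Suffix rule from the quoted manual text; mdoc's "\e." is a literal "\.".
SUFFIXES = re.compile(r"(\.([A-Za-z~][A-Za-z0-9~]*)?)*\Z")


def char_rank(c):
    """Rank one character: ASCII letters first, then everything else."""
    is_letter = "A" <= c <= "Z" or "a" <= c <= "z"
    # Within each class, order by code point, so "A".."Z" precede "a".."z".
    return (0 if is_letter else 1, ord(c))


def split_version(s):
    """Drop ignored suffixes, then split into alternating substrings.

    Index 0, 2, 4, ... (the odd-numbered substrings when counting from 1)
    hold non-digits; index 1, 3, ... hold digits.
    """
    stem = s[:SUFFIXES.search(s).start()]
    parts = re.split(r"([0-9]+)", stem)
    # re.split leaves an empty string after a trailing digit run; only the
    # first substring is allowed to be empty, so drop it.
    if len(parts) > 1 and parts[-1] == "":
        parts.pop()
    return parts


def version_key(s):
    """Sort key: character ranks for non-digit parts, integers for digits."""
    key = []
    for i, part in enumerate(split_version(s)):
        if i % 2 == 0:
            key.append([char_rank(c) for c in part])
        else:
            key.append(int(part))
    return key


names = ["pkg-1.10.tar.bz2", "pkg-1.9a.tgz", "pkg-1.9.tgz"]
print(split_version("sort-1.022.tgz"))
print(sorted(names, key=version_key))
```

Under these assumptions the first call yields ['sort-', '1', '.', '022'],
so the first substring is "sort-" and the remaining odd-numbered one is a
single dot, and the hypothetical names sort as pkg-1.9.tgz, pkg-1.9a.tgz,
pkg-1.10.tar.bz2. Applying the suffix rule before the digit/non-digit
split is what keeps the "2" in ".bz2" from being treated as a number.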
>

i don't, because the accompanying text for sort -V is so massive that
my eyes begin to glaze over. i would sacrifice precision and detail for
simplicity here i think, but i realise that may be unsatisfactory.

i do have a suggestion for your text though (inline):

> Yours,
>   Ingo
>
>
> Index: sort.1
> ===================================================================
> RCS file: /cvs/src/usr.bin/sort/sort.1,v
> diff -u -r1.67 sort.1
> --- sort.1 27 Mar 2025 11:43:58 -0000 1.67
> +++ sort.1 27 Mar 2025 12:46:22 -0000
> @@ -201,8 +201,8 @@
>  IPv4 addresses in dotted quad notation.
>  .Pp
>  When comparing two strings, both strings are split into substrings
> -such that the first and every odd-numbered substring
> -consist of non-digit characters only,
> +such that every odd-numbered substring including the first one
> +consists of non-digit characters only,
>  while every even-numbered substring consists of digits only.
>  These substrings are compared in turn from left to right
>  until a difference is found.
> @@ -222,7 +222,11 @@
>  This is intended for ignoring filename suffixes such as
>  .Dq .tar.bz2 .
>  .Pp
> -For example:
> +In the following example, the first substring is
> +.Qq sort\-
> +and the other odd-numbered substrings are
> +.Qq \&.
> +each:

maybe: are all ".".
that's starting to look like morse code though...
jmc

>  .Bd -literal -offset indent
>  $ ls sort* | sort -V
>  sort-1.022.tgz
