On Thu, Mar 27, 2025 at 01:55:29PM +0100, Ingo Schwarze wrote:
> Hello Pascal,
> 
> Pascal Stumpf wrote on Wed, Mar 26, 2025 at 08:39:15PM +0100:
> > On Wed, 26 Mar 2025 13:59:23 +0100, Ingo Schwarze wrote:
> 
> >> +When comparing two strings, both strings are split into substrings
> >> +such that the first and every odd-numbered substring
> >> +consists of non-digit characters only,
> 
> > s/consists/consist/
> 
> I applied this correction before committing.
> 
> I did not use Pascal's later suggestion of "each consist" because
> i tend to agree with Jason's final conclusion that "consist is fine".
> 
> I intended the wording "the first and every odd-numbered" to signal
> 1-based numbering, but now i worry that indication is not unambiguous
> because the wording fails to call the first one "odd-numbered".
> 
> The following wording tweak would resolve both issues, both making
> 1-based numbering explicit and avoiding the singular/plural quibble:
> 
>   such that every odd-numbered substring including the first one
>   consists of non-digit characters only,
> 

or you could just change your original text to "the first and every
other odd-numbered" i suppose.

> >> +while every even-numbered substring consists of digits only.
> >> +These substrings are compared in turn from left to right
> >> +until a difference is found.
> >> +The first substring can be empty; all others cannot.
> >> +.Pp
> >> +Non-digit substrings are compared alphabetically, with upper case
> >> +letters sorting before lower case letters, letters sorting before
> >> +non-letters, and non-letters sorting in
> >> +.Xr ascii 7
> >> +order.
> 
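
The splitting rule and the letter ordering quoted above can be sketched
in a few lines of Python. This only illustrates the man page wording and
is not the code in vsort.c: the names split_version and nondigit_key are
made up here, and reading "alphabetically" as plain ASCII order within
the letters is an assumption.

```python
import re

# One chunk is either a run of ASCII digits or a run of anything else.
CHUNK_RE = re.compile(r"[0-9]+|[^0-9]+")


def split_version(s):
    """Split s into alternating non-digit / digit substrings.

    Counting from 1, odd-numbered substrings hold non-digits and
    even-numbered ones hold digits.  Only the first may be empty.
    """
    parts = CHUNK_RE.findall(s)
    if not parts or parts[0][0] in "0123456789":
        # The string starts with a digit (or is empty), so the first
        # non-digit substring is the empty string.
        parts.insert(0, "")
    return parts


def nondigit_key(s):
    """Sort key for a non-digit substring (assumed reading of the text).

    ASCII letters sort before all other characters; within each group
    the byte value decides, which puts upper case before lower case.
    """
    return [(0 if c.isascii() and c.isalpha() else 1, ord(c)) for c in s]


print(split_version("sort-1.022"))
print(sorted(["b", "-", "A", "+", "a", "B"], key=nondigit_key))
```

Under these assumptions "sort-1.022" splits into "sort-", "1", "." and
"022", and a string that starts with a digit gets an empty first
substring.
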
> > Hmm.  This is wrong as soon as you step foot into Unicode.  I don't
> > think it hurts to be a bit more vague here.
> 
> I don't think it's realistic or even a desirable goal to ever
> implement LC_COLLATE support in our libc.  The whole concept, even
> though standardized in POSIX, is nothing but an instance of horrifically
> complicated overengineering.  I talked to bapt@ about it during EuroBSDCon
> in Beograd (shortly after he had implemented that nightmare for FreeBSD)
> and he kept swearing about it like a trooper.  Given that FreeBSD is not
> really known for keeping stuff simple or shunning excessive complication,
> his rage was quite telling.
> 
> That said, we are talking about this call chain here:
> 
>   versioncoll [coll.c]
>   vcmp [vsort.c]
>   cmpversions [vsort.c]
>   cmp_chars [vsort.c]
> 
> Unlike much of the other code in our sort(1), which contains unused
> rigging for wchar_t handling in many places, none of this call chain
> contains anything to handle Unicode, not even disabled dummy code.
> Even if you would enable wchar_t support in our sort, ignoring my
> screaming, none of this code chain would do any Unicode handling,
> it would continue to do what i described, explicitly using its own,
> hand-rolled re-implementation of single-byte isalpha(3).
> 
> So short of saying something like
> 
>   It is unspecified how the non-digit substrings are compared.
> 
> i can't think of a way to make this less specific, and i have no
> idea what the intended behaviour of -V would be in the presence
> of LC_COLLATE support.
> 
> Do you have an idea of what we might say to achieve a reasonable
> level of vagueness?
> 
> >> +Substrings consisting of digits are compared as integer numbers.
> >> +.Pp
> >> +At the end of each string, zero or more suffixes that start with a dot,
> >> +consist only of letters, digits, and tilde characters, and do not
> >> +start with a digit are ignored, equivalent to the regular expression
> >> +"(\e.([A-Za-z~][A-Za-z0-9~]*)?)*".
> >> +This is intended for ignoring filename suffixes such as
> >> +.Dq .tar.bz2 .
> 
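
The suffix rule can be tried out directly, since the quoted expression is
an ordinary regular expression once the mdoc escape \e is read as a
backslash. A small Python sketch follows; the end-of-string anchor and
the name strip_suffixes are assumptions added for illustration, not
anything taken from sort(1) itself.

```python
import re

# The expression from the quoted man page text, anchored at the end of
# the string because the suffixes are only ignored there.
SUFFIX_RE = re.compile(r"(\.([A-Za-z~][A-Za-z0-9~]*)?)*\Z")


def strip_suffixes(name):
    """Return name without its trailing dot-suffixes, if any."""
    return SUFFIX_RE.sub("", name, count=1)


for name in ("sort-1.022.tar.bz2", "sort-1.022.tgz", "sort-1.3"):
    print(name, "->", strip_suffixes(name))
```

In this sketch ".3" survives because a suffix must not start with a
digit, while ".bz2" is dropped as a whole. Stripping before splitting is
what keeps the "2" of ".bz2" from being treated as a digit substring.
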
> > Maybe .tgz for consistency with the example below
> 
> I slightly prefer demonstrating here that the suffix can contain digits,
> in particular since the presence of digits in file name extensions can
> result in confusion when people apply the suffix rule and the rule
> about digit/non-digit splitting in the wrong order.
> 
> Besides, when you have multiple examples, i don't consider it a goal
> to have all examples demonstrate the same aspects.  To the contrary,
> having the examples cover as many different aspects as possible
> feels preferable.
> 
> > (and since we don't have bzip2(1) in base)?
> 
> I don't think that's a problem.  The base system is certainly
> equipped to handle strings containing the substring "bz2", and even
> to store files with a .bz2 file name extension.
> 
> Besides, i doubt anyone uses OpenBSD without using ports, and use
> of bzip2(1) is widespread in ports, so mentioning it in an example
> does not feel exotic at all.
> 
> >>  .Pp
> >>  For example:
> >>  .Bd -literal -offset indent
> 
> > Maybe clarify here that the 'odd-numbered substring' is simply a dot in
> > the typical 'version sort' case.
> 
> Like in the patch below?
> 
> It feels slightly wordy, any idea how to bring the point across more
> concisely?
> 

i don't, because the accompanying text for sort -V is so massive that
my eyes begin to glaze over. i would sacrifice precision and detail for
simplicity here i think, but i realise that may be unsatisfactory.

i do have a suggestion for your text though (inline):

> Yours,
>   Ingo
> 
> 
> Index: sort.1
> ===================================================================
> RCS file: /cvs/src/usr.bin/sort/sort.1,v
> diff -u -r1.67 sort.1
> --- sort.1    27 Mar 2025 11:43:58 -0000      1.67
> +++ sort.1    27 Mar 2025 12:46:22 -0000
> @@ -201,8 +201,8 @@
>  IPv4 addresses in dotted quad notation.
>  .Pp
>  When comparing two strings, both strings are split into substrings
> -such that the first and every odd-numbered substring
> -consist of non-digit characters only,
> +such that every odd-numbered substring including the first one
> +consists of non-digit characters only,
>  while every even-numbered substring consists of digits only.
>  These substrings are compared in turn from left to right
>  until a difference is found.
> @@ -222,7 +222,11 @@
>  This is intended for ignoring filename suffixes such as
>  .Dq .tar.bz2 .
>  .Pp
> -For example:
> +In the following example, the first substring is
> +.Qq sort\-
> +and the other odd-numbered substrings are
> +.Qq \&.
> +each:

maybe: are all ".".

that's starting to look like morse code though...

jmc

>  .Bd -literal -offset indent
>  $ ls sort* | sort -V
>  sort-1.022.tgz
