Hello Pascal,

Pascal Stumpf wrote on Wed, Mar 26, 2025 at 08:39:15PM +0100:
> On Wed, 26 Mar 2025 13:59:23 +0100, Ingo Schwarze wrote:

>> +When comparing two strings, both strings are split into substrings
>> +such that the first and every odd-numbered substring
>> +consists of non-digit characters only,

> s/consists/consist/

I applied this correction before committing.

I did not use Pascal's later suggestion of "each consist" because
i tend to agree with Jason's final conclusion that "consist is fine".

I intended the wording "the first and every odd-numbered" to signal
1-based numbering, but now i worry that indication is not unambiguous
because the wording fails to call the first one "odd-numbered".

The following wording tweak would resolve both issues, both making
1-based numbering explicit and avoiding the singular/plural quibble:

  such that every odd-numbered substring including the first one
  consists of non-digit characters only,
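
For illustration only, here is a minimal Python sketch of the splitting
rule with 1-based numbering.  It is not the code in vsort.c; the
function name split_substrings is hypothetical, and the sketch assumes
that "digit" means the ASCII digits 0-9.

```python
import re

def split_substrings(s):
    """Split s into alternating non-digit / digit runs.

    List index 0 holds substring number 1 (non-digit, possibly empty),
    index 1 holds substring number 2 (digits), and so on.
    """
    parts = re.findall(r"[0-9]+|[^0-9]+", s)
    # If the string is empty or starts with a digit, the first
    # (odd-numbered) substring is the empty string.
    if not parts or parts[0][0] in "0123456789":
        parts.insert(0, "")
    return parts

print(split_substrings("sort-1.022"))
```

With this numbering, "sort-1.022" yields "sort-" as substring 1,
"1" as substring 2, "." as substring 3, and "022" as substring 4.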

>> +while every even-numbered substring consists of digits only.
>> +These substrings are compared in turn from left to right
>> +until a difference is found.
>> +The first substring can be empty; all others cannot.
>> +.Pp
>> +Non-digit substrings are compared alphabetically, with upper case
>> +letters sorting before lower case letters, letters sorting before
>> +non-letters, and non-letters sorting in
>> +.Xr ascii 7
>> +order.

> Hmm.  This is wrong as soon as you step foot into Unicode.  I don't
> think it hurts to be a bit more vague here.

I don't think it's realistic or even a desirable goal to ever
implement LC_COLLATE support in our libc.  The whole concept, even
though standardized in POSIX, is nothing but an instance of horrifically
complicated overengineering.  I talked to bapt@ about it during EuroBSDCon
in Beograd (shortly after he had implemented that nightmare for FreeBSD)
and he kept swearing about it like a trooper.  Given that FreeBSD is not
really known for keeping stuff simple or shunning excessive complication,
his rage was quite telling.

That said, we are talking about this call chain here:

  versioncoll [coll.c]
  vcmp [vsort.c]
  cmpversions [vsort.c]
  cmp_chars [vsort.c]

Unlike much of the other code in our sort(1), which contains unused
rigging for wchar_t handling in many places, none of this call chain
contains anything to handle Unicode, not even disabled dummy code.
Even if you enabled wchar_t support in our sort, ignoring my
screaming, none of this call chain would do any Unicode handling;
it would continue to do what i described, explicitly using its own,
hand-rolled re-implementation of single-byte isalpha(3).
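
As a rough illustration of the ordering the quoted manual text
describes (ASCII letters first, upper case before lower case, then
non-letters in ascii(7) order), here is a small Python sketch.  It is
not a transcription of cmp_chars; the name char_key is hypothetical,
and any special cases the real code may have are ignored.

```python
import string

def char_key(c):
    """Sort key for one single-byte character of a non-digit substring."""
    # Letters sort before non-letters; within each class, ASCII order
    # applies, which puts upper case before lower case.
    if c in string.ascii_letters:
        return (0, ord(c))
    return (1, ord(c))

print(sorted("bA_a-", key=char_key))
```

Only the ASCII letters count as letters here, mirroring the
single-byte isalpha(3) behaviour mentioned above.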

So short of saying something like

  It is unspecified how the non-digit substrings are compared.

i can't think of a way to make this less specific, and i have no
idea what the intended behaviour of -V would be in the presence
of LC_COLLATE support.

Do you have an idea of what we might say to achieve a reasonable
level of vagueness?

>> +Substrings consisting of digits are compared as integer numbers.
>> +.Pp
>> +At the end of each string, zero or more suffixes that start with a dot,
>> +consist only of letters, digits, and tilde characters, and do not
>> +start with a digit are ignored, equivalent to the regular expression
>> +"(\e.([A-Za-z~][A-Za-z0-9~]*)?)*".
>> +This is intended for ignoring filename suffixes such as
>> +.Dq .tar.bz2 .

> Maybe .tgz for consistency with the example below

I slightly prefer demonstrating here that the suffix can contain digits,
in particular since the presence of digits in file name extensions can
result in confusion when people apply the suffix rule and the rule
about digit/non-digit splitting in the wrong order.
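
To make the ordering issue concrete, here is a minimal Python sketch of
the suffix rule alone, using the regular expression from the manual
text (where \e. is the roff spelling of \.).  The name strip_suffixes
is hypothetical, and anchoring the expression at the end of the string
is my reading of "at the end of each string", not something copied from
the sort(1) sources.

```python
import re

# The regular expression from the manual page, anchored at end of string.
SUFFIX_RE = re.compile(r"(\.([A-Za-z~][A-Za-z0-9~]*)?)*\Z")

def strip_suffixes(s):
    """Return s without the trailing suffixes that the rule ignores."""
    # search() always succeeds because the pattern can match the empty
    # string at the very end; the leftmost match is the longest suffix run.
    m = SUFFIX_RE.search(s)
    return s[:m.start()]

print(strip_suffixes("foo-1.2.tar.bz2"))
print(strip_suffixes("sort-1.022.tgz"))
```

In this sketch, ".tar.bz2" is removed as a whole, digit included, while
".2" and ".022" stay because a suffix must not start with a digit.
Splitting into digit and non-digit substrings first would instead cut
"bz2" into "bz" and "2".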

Besides, when you have multiple examples, i don't consider it a goal
to have all examples demonstrate the same aspects.  To the contrary,
having the examples cover as many different aspects as possible
feels preferable.

> (and since we don't have bzip2(1) in base)?

I don't think that's a problem.  The base system is certainly
equipped to handle strings containing the substring "bz2", and even
to store files with a .bz2 file name extension.

Besides, i doubt anyone uses OpenBSD without using ports, and use
of bzip2(1) is widespread in ports, so mentioning it in an example
does not feel exotic at all.

>>  .Pp
>>  For example:
>>  .Bd -literal -offset indent

> Maybe clarify here that the 'odd-numbered substring' is simply a dot in
> the typical 'version sort' case.

Like in the patch below?

It feels slightly wordy; any idea how to get the point across more
concisely?

Yours,
  Ingo


Index: sort.1
===================================================================
RCS file: /cvs/src/usr.bin/sort/sort.1,v
diff -u -r1.67 sort.1
--- sort.1      27 Mar 2025 11:43:58 -0000      1.67
+++ sort.1      27 Mar 2025 12:46:22 -0000
@@ -201,8 +201,8 @@
 IPv4 addresses in dotted quad notation.
 .Pp
 When comparing two strings, both strings are split into substrings
-such that the first and every odd-numbered substring
-consist of non-digit characters only,
+such that every odd-numbered substring including the first one
+consists of non-digit characters only,
 while every even-numbered substring consists of digits only.
 These substrings are compared in turn from left to right
 until a difference is found.
@@ -222,7 +222,11 @@
 This is intended for ignoring filename suffixes such as
 .Dq .tar.bz2 .
 .Pp
-For example:
+In the following example, the first substring is
+.Qq sort\-
+and the other odd-numbered substrings are
+.Qq \&.
+each:
 .Bd -literal -offset indent
 $ ls sort* | sort -V
 sort-1.022.tgz
