Re: Unexpected behavior of sort -hu

Ingo Schwarze Wed, 26 Mar 2025 05:59:55 -0700

Hi Pascal,

Pascal Stumpf wrote on Tue, Mar 18, 2025 at 04:59:40PM +0100:
> On Sat, 15 Mar 2025 20:00:42 +0100, Ingo Schwarze wrote:
>> Van Dung Ha wrote on Wed, Mar 12, 2025 at 02:25:02PM +0100:


>>> Recently, I was parsing a few IP addresses from the snort logs
>>> to populate a pf table and encountered something counter-intuitive.
>>> Here is an example  source list.
>>> [...]
>>> 77.224.14.2
>>> 77.224.14.21
>>> 77.224.14.18
>>> 77.224.14.21

>> The easiest way i found to do what you want is
>> 
>>   $ tr . s < test.in | sort -nuts -k1 -k2 -k3 -k4 | tr s .
>> 
>> nuts in the Karakorum!

> sort -V to the rescue!

Thank you fort pointing that out!

Looking at the description of -V in our sort(1) manual revealed that
that description is completely useless:

 * The pseudo syntax PREFIX VERSION SUFFIX is highly misleading
   because the PREFIX is optional, and additional infixes are
   accepted in the middle of the version numbers.
 * Leading zeros are not ignored.  For example, if you sort
   the strings "1" and "01", the order is reversed, sorting "01" before "1".
 * The phrase "If an input string does not match the pattern"
   is utterly confusing because no pattern is mentioned at all.
 * It is unclear what "the byte compare function" is supposed to be,
   and looking at the code, i conclude that such a function likely
   does not even exist.
 * The description doesn't say at all how the comparison works.

So basically, i wrote a new description from scratch, see the patch below.

I think it is likely that my new description is not entirely accurate.
Very probably, there are some edge cases that might not agree with my
description, or where my description at least leaves ambiguity.

I think that problem is hard to avoid for two reasons.

First, our implementation makes huge numbers of choices where it is hard
to say which of these are intentional features and which are accidental
implementation details.  Given that the authors of the code chose to not
document at all what the intended behaviour of the code is, neither
under- nor over-documenting feels easy to avoid.

Even more importantly, i judge the code quality of sort(1)
as positively atrocious.  Because i have done some documentation
work in libcrypto, i'm quite used to documenting low-quality code,
but i must say the code in /usr/src/usr.bin/sort does not feel better
than typical OpenSSL quality, i.e. it is significantly below OpenBSD
quality standards:

 - The architecture of the code lacks logical structure.
   Instead of structuring the code in a way that reflects the
   purposes and algorithms and makes reasoning about and auditing
   the code easy for human beings, there are interminable cascades
   of special cases and interminable chains of wrappers, hiding
   the places where the actual work is being done.

 - The code does not use POSIX libc APIs but instead uses idiosyncratic
   wrappers throughout, even for the most fundamental <string.h>
   and <ctype.h> stuff, making it very hard to understand on the
   micro just like on the macro level.

Thouhts?  OKs?
  Ingo


Index: sort.1
===================================================================
RCS file: /cvs/src/usr.bin/sort/sort.1,v
diff -u -r1.66 sort.1
--- sort.1      15 Mar 2025 18:20:04 -0000      1.66
+++ sort.1      26 Mar 2025 12:49:11 -0000
@@ -196,14 +196,31 @@
 .It Fl r , Fl Fl reverse
 Sort in reverse order.
 .It Fl V , Fl Fl version-sort
-Sort version numbers.
-The input lines are treated as file names in form
-PREFIX VERSION SUFFIX, where SUFFIX matches the regular expression
-"(\.([A-Za-z~][A-Za-z0-9~]*)?)*".
-The files are compared by their prefixes and versions (leading
-zeros are ignored in version numbers, see example below).
-If an input string does not match the pattern, then it is compared
-using the byte compare function.
+This option is intended to sort strings that contain version numbers,
+but it can be used for other purposes as well, for example to sort
+IPv4 addresses in dotted quad notation.
+.Pp
+When comparing two strings, both strings are split into substrings
+such that the first and every odd-numbered substring
+consists of non-digit characters only,
+while every even-numbered substring consists of digits only.
+These substrings are compared in turn from left to right
+until a difference is found.
+The first substring can be empty; all others cannot.
+.Pp
+Non-digit substrings are compared alphabetically, with upper case
+letters sorting before lower case letters, letters sorting before
+non-letters, and non-letters sorting in
+.Xr ascii 7
+order.
+Substrings consisting of digits are compared as integer numbers.
+.Pp
+At the end of each string, zero or more suffixes that start with a dot,
+consist only of letters, digits, and tilde characters, and do not
+start with a digit are ignored, equivalent to the regular expression
+"(\e.([A-Za-z~][A-Za-z0-9~]*)?)*".
+This is intended for ignoring filename suffixes such as
+.Dq .tar.bz2 .
 .Pp
 For example:
 .Bd -literal -offset indent

Re: Unexpected behavior of sort -hu

Reply via email to