bug#49239: Unexpected results with sort -V

Paul Eggert Sat, 12 Feb 2022 21:32:10 -0800

On 6/28/21 10:54, Kamil Dudka wrote:

You are right.  The matching algorithm was not implemented correctly and
the patch you attached fixes it.

I looked into Bug#49239 and found some more places where thedocumentation disagreed with the code. I installed the attached patchesinto Gnulib and Coreutils, respectively, which should bring the two intoagreement and should fix the bugs that Michael reported albeit in adifferent way than his proposed patch. Briefly:

* The code didn't allow file name suffixes to be the entire file name,but the documentation did. Here I went with the documentation. I couldbe talked into the other way; it shouldn't matter much either way.

* The code did the preliminary test (without suffixes) using strcmp, thedocumentation said it should use version comparison. Here I went withthe documentation.

* As Michael mentioned, sort -V mishandled NUL. I fixed this by adding aGnulib function filenvercmp that treats NUL as just another character.

* As Michael also mentioned, filevercmp fell back on strcmp if versionsort found no difference, which meant sort's --stable flag wasineffective. I fixed this by not having filevercmp fall back on strcmp.

* I fixed the two-consecutive dot and trailing-dot bugs Michaelmentioned, by rewriting the suffix finder to not have that confusingREAD_ALPHA state variable, and to instead implement the regularexpression's nested * operators in the usual way with nested loops.

Thanks, Michael, for reporting the problem. I'm boldly closing theCoreutils bug report as fixed.

From 9f48fb992a3d7e96610c4ce8be969cff2d61a01b Mon Sep 17 00:00:00 2001
From: Paul Eggert <[email protected]>
Date: Sat, 12 Feb 2022 16:27:05 -0800
Subject: [PATCH] filevercmp: fix several unexpected results

Problems reported by Michael Debertol in <https://bugs.gnu.org/49239>.
While looking into this, I spotted some more areas where the
code and documentation did not agree, or where the documentation
was unclear.  The biggest change needed by coreutils is a new
function filenvercmp that can compare byte strings containing NUL.
* lib/filevercmp.c: Do not include sys/types.h, stdlib.h, string.h.
Include idx.h, verify.h.
(match_suffix): Remove, replacing all uses with calls to ...
(file_prefixlen): ... this new function.  Simplify it by
avoiding the need for a confusing READ_ALPHA state variable.
Change its API to something more useful, with a *LEN arg.
it with a new *LEN arg.
(file_prefixlen, verrevcmp):
Prefer idx_t to size_t where either will do.
(order): Change args to S, POS, LEN instead of just S[POS].
This lets us handle NUL bytes correctly.  Callers changed.
Verify that ints are sufficiently wide for its API.
(verrevcmp): Don't assume that S1[S1_LEN] is a non-digit,
and likewise for S2[S2_LEN].  The byte might not be accessible
if filenvercmp is being called.
(filevercmp): Reimplement by calling filenvercmp.
(filenvercmp): New function, rewritten without the assumption
that the inputs are null-terminated.
Remove "easy comparison to see if strings are identical", as the
use of it later (a) was undocumented, and (b) caused sort -V to be
unstable.  When both strings start with ".", do not skip past
the "."s before looking for suffixes, as this disagreed
with the documentation.
* lib/filevercmp.h: Fix comments, which had many mistakes.
(filenvercmp): New decl.
* modules/filevercmp (Depends-on): Add idx, verify.  Remove string.
* tests/test-filevercmp.c: Include string.h.
(examples): Reorder examples ".0" and ".9" that matched the code
but not the documentation.  The code has been fixed to match the
documentation.  Add some examples involving \1 so that they
can be tried with both \1 and \0.  Add some other examples
taken from the bug report.
(equals): New set of test cases.
(sign, test_filevercmp): New functions.
(main): Remove test case where the fixed filevercmp disagrees with
strverscmp.  Use test_filevercmp instead of filevercmp, so that
we also test filenvercmp.  Test the newly-introduced EQUALS cases.
---
 ChangeLog               |  46 ++++++++++
 lib/filevercmp.c        | 187 +++++++++++++++++++++-------------------
 lib/filevercmp.h        |  66 ++++++++++----
 modules/filevercmp      |   3 +-
 tests/test-filevercmp.c |  94 +++++++++++++++++++-
 5 files changed, 284 insertions(+), 112 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 62162cbfce..4bf0cec7f0 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,49 @@
+2022-02-12  Paul Eggert  <[email protected]>
+
+	filevercmp: fix several unexpected results
+	Problems reported by Michael Debertol in <https://bugs.gnu.org/49239>.
+	While looking into this, I spotted some more areas where the
+	code and documentation did not agree, or where the documentation
+	was unclear.  The biggest change needed by coreutils is a new
+	function filenvercmp that can compare byte strings containing NUL.
+	* lib/filevercmp.c: Do not include sys/types.h, stdlib.h, string.h.
+	Include idx.h, verify.h.
+	(match_suffix): Remove, replacing all uses with calls to ...
+	(file_prefixlen): ... this new function.  Simplify it by
+	avoiding the need for a confusing READ_ALPHA state variable.
+	Change its API to something more useful, with a *LEN arg.
+	it with a new *LEN arg.
+	(file_prefixlen, verrevcmp):
+	Prefer idx_t to size_t where either will do.
+	(order): Change args to S, POS, LEN instead of just S[POS].
+	This lets us handle NUL bytes correctly.  Callers changed.
+	Verify that ints are sufficiently wide for its API.
+	(verrevcmp): Don't assume that S1[S1_LEN] is a non-digit,
+	and likewise for S2[S2_LEN].  The byte might not be accessible
+	if filenvercmp is being called.
+	(filevercmp): Reimplement by calling filenvercmp.
+	(filenvercmp): New function, rewritten without the assumption
+	that the inputs are null-terminated.
+	Remove "easy comparison to see if strings are identical", as the
+	use of it later (a) was undocumented, and (b) caused sort -V to be
+	unstable.  When both strings start with ".", do not skip past
+	the "."s before looking for suffixes, as this disagreed
+	with the documentation.
+	* lib/filevercmp.h: Fix comments, which had many mistakes.
+	(filenvercmp): New decl.
+	* modules/filevercmp (Depends-on): Add idx, verify.  Remove string.
+	* tests/test-filevercmp.c: Include string.h.
+	(examples): Reorder examples ".0" and ".9" that matched the code
+	but not the documentation.  The code has been fixed to match the
+	documentation.  Add some examples involving \1 so that they
+	can be tried with both \1 and \0.  Add some other examples
+	taken from the bug report.
+	(equals): New set of test cases.
+	(sign, test_filevercmp): New functions.
+	(main): Remove test case where the fixed filevercmp disagrees with
+	strverscmp.  Use test_filevercmp instead of filevercmp, so that
+	we also test filenvercmp.  Test the newly-introduced EQUALS cases.
+
 2022-02-09  Bruno Haible  <[email protected]>
 
 	string: Fix "mismatched allocation function" warnings regarding strndup.
diff --git a/lib/filevercmp.c b/lib/filevercmp.c
index b7e3ac6167..d546e79054 100644
--- a/lib/filevercmp.c
+++ b/lib/filevercmp.c
@@ -1,4 +1,5 @@
-/*
+/* Compare file names containing version numbers.
+
    Copyright (C) 1995 Ian Jackson <[email protected]>
    Copyright (C) 2001 Anthony Towns <[email protected]>
    Copyright (C) 2008-2022 Free Software Foundation, Inc.
@@ -19,60 +20,65 @@
 #include <config.h>
 #include "filevercmp.h"
 
-#include <sys/types.h>
-#include <stdlib.h>
 #include <stdbool.h>
-#include <string.h>
 #include <c-ctype.h>
 #include <limits.h>
-
-/* Match a file suffix defined by this regular expression:
-   /(\.[A-Za-z~][A-Za-z0-9~]*)*$/
-   Scan the string *STR and return a pointer to the matching suffix, or
-   NULL if not found.  Upon return, *STR points to terminating NUL.  */
-static const char *
-match_suffix (const char **str)
+#include <idx.h>
+#include <verify.h>
+
+/* Return the length of a prefix of S that corresponds to the suffix
+   defined by this extended regular expression in the C locale:
+     (\.[A-Za-z~][A-Za-z0-9~]*)*$
+   If *LEN is -1, S is a string; set *LEN to S's length.
+   Otherwise, *LEN should be nonnegative, S is a char array,
+   and *LEN does not change.  */
+static idx_t
+file_prefixlen (char const *s, ptrdiff_t *len)
 {
-  const char *match = NULL;
-  bool read_alpha = false;
-  while (**str)
+  size_t n = *len;  /* SIZE_MAX if N == -1.  */
+
+  for (idx_t i = 0; ; i++)
     {
-      if (read_alpha)
-        {
-          read_alpha = false;
-          if (!c_isalpha (**str) && '~' != **str)
-            match = NULL;
-        }
-      else if ('.' == **str)
+      idx_t prefixlen = i;
+      while (i + 1 < n && s[i] == '.' && (c_isalpha (s[i + 1])
+                                          || s[i + 1] == '~'))
+        for (i += 2; i < n && (c_isalnum (s[i]) || s[i] == '~'); i++)
+          continue;
+
+      if (*len < 0 ? !s[i] : i == n)
         {
-          read_alpha = true;
-          if (!match)
-            match = *str;
+          *len = i;
+          return prefixlen;
         }
-      else if (!c_isalnum (**str) && '~' != **str)
-        match = NULL;
-      (*str)++;
     }
-  return match;
 }
 
-/* verrevcmp helper function */
+/* Return a version sort comparison value for S's byte at position POS.
+   S has length LEN.  If POS == LEN, sort before all non-'~' bytes.  */
+
 static int
-order (unsigned char c)
+order (char const *s, idx_t pos, idx_t len)
 {
+  if (pos == len)
+    return -1;
+
+  unsigned char c = s[pos];
   if (c_isdigit (c))
     return 0;
   else if (c_isalpha (c))
     return c;
   else if (c == '~')
-    return -1;
+    return -2;
   else
-    return (int) c + UCHAR_MAX + 1;
+    {
+      verify (UCHAR_MAX <= (INT_MAX - 1 - 2) / 2);
+      return c + UCHAR_MAX + 1;
+    }
 }
 
 /* slightly modified verrevcmp function from dpkg
-   S1, S2 - compared string
-   S1_LEN, S2_LEN - length of strings to be scanned
+   S1, S2 - compared char array
+   S1_LEN, S2_LEN - length of arrays to be scanned
 
    This implements the algorithm for comparison of version strings
    specified by Debian and now widely adopted.  The detailed
@@ -81,37 +87,38 @@ order (unsigned char c)
    implements that from s5.6.12 of Debian Policy v3.8.0.1
    https://www.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Version */
 static int _GL_ATTRIBUTE_PURE
-verrevcmp (const char *s1, size_t s1_len, const char *s2, size_t s2_len)
+verrevcmp (const char *s1, idx_t s1_len, const char *s2, idx_t s2_len)
 {
-  size_t s1_pos = 0;
-  size_t s2_pos = 0;
+  idx_t s1_pos = 0;
+  idx_t s2_pos = 0;
   while (s1_pos < s1_len || s2_pos < s2_len)
     {
       int first_diff = 0;
       while ((s1_pos < s1_len && !c_isdigit (s1[s1_pos]))
              || (s2_pos < s2_len && !c_isdigit (s2[s2_pos])))
         {
-          int s1_c = (s1_pos == s1_len) ? 0 : order (s1[s1_pos]);
-          int s2_c = (s2_pos == s2_len) ? 0 : order (s2[s2_pos]);
+          int s1_c = order (s1, s1_pos, s1_len);
+          int s2_c = order (s2, s2_pos, s2_len);
           if (s1_c != s2_c)
             return s1_c - s2_c;
           s1_pos++;
           s2_pos++;
         }
-      while (s1[s1_pos] == '0')
+      while (s1_pos < s1_len && s1[s1_pos] == '0')
         s1_pos++;
-      while (s2[s2_pos] == '0')
+      while (s2_pos < s2_len && s2[s2_pos] == '0')
         s2_pos++;
-      while (c_isdigit (s1[s1_pos]) && c_isdigit (s2[s2_pos]))
+      while (s1_pos < s1_len && s2_pos < s2_len
+             && c_isdigit (s1[s1_pos]) && c_isdigit (s2[s2_pos]))
         {
           if (!first_diff)
             first_diff = s1[s1_pos] - s2[s2_pos];
           s1_pos++;
           s2_pos++;
         }
-      if (c_isdigit (s1[s1_pos]))
+      if (s1_pos < s1_len && c_isdigit (s1[s1_pos]))
         return 1;
-      if (c_isdigit (s2[s2_pos]))
+      if (s2_pos < s2_len && c_isdigit (s2[s2_pos]))
         return -1;
       if (first_diff)
         return first_diff;
@@ -124,58 +131,56 @@ verrevcmp (const char *s1, size_t s1_len, const char *s2, size_t s2_len)
 int
 filevercmp (const char *s1, const char *s2)
 {
-  const char *s1_pos;
-  const char *s2_pos;
-  const char *s1_suffix, *s2_suffix;
-  size_t s1_len, s2_len;
-  int result;
-
-  /* easy comparison to see if strings are identical */
-  int simple_cmp = strcmp (s1, s2);
-  if (simple_cmp == 0)
-    return 0;
+  return filenvercmp (s1, -1, s2, -1);
+}
 
-  /* special handle for "", "." and ".." */
-  if (!*s1)
-    return -1;
-  if (!*s2)
-    return 1;
-  if (0 == strcmp (".", s1))
-    return -1;
-  if (0 == strcmp (".", s2))
-    return 1;
-  if (0 == strcmp ("..", s1))
-    return -1;
-  if (0 == strcmp ("..", s2))
+/* Compare versions A (of length ALEN) and B (of length BLEN).
+   See filevercmp.h for function description.  */
+int
+filenvercmp (char const *a, ptrdiff_t alen, char const *b, ptrdiff_t blen)
+{
+  /* Special case for empty versions.  */
+  bool aempty = alen < 0 ? !a[0] : !alen;
+  bool bempty = blen < 0 ? !b[0] : !blen;
+  if (aempty)
+    return -!bempty;
+  if (bempty)
     return 1;
 
-  /* special handle for other hidden files */
-  if (*s1 == '.' && *s2 != '.')
-    return -1;
-  if (*s1 != '.' && *s2 == '.')
-    return 1;
-  if (*s1 == '.' && *s2 == '.')
+  /* Special cases for leading ".": "." sorts first, then "..", then
+     other names with leading ".", then other names.  */
+  if (a[0] == '.')
     {
-      s1++;
-      s2++;
-    }
+      if (b[0] != '.')
+        return -1;
 
-  /* "cut" file suffixes */
-  s1_pos = s1;
-  s2_pos = s2;
-  s1_suffix = match_suffix (&s1_pos);
-  s2_suffix = match_suffix (&s2_pos);
-  s1_len = (s1_suffix ? s1_suffix : s1_pos) - s1;
-  s2_len = (s2_suffix ? s2_suffix : s2_pos) - s2;
-
-  /* restore file suffixes if strings are identical after "cut" */
-  if ((s1_suffix || s2_suffix) && (s1_len == s2_len)
-      && 0 == strncmp (s1, s2, s1_len))
-    {
-      s1_len = s1_pos - s1;
-      s2_len = s2_pos - s2;
+      bool adot = alen < 0 ? !a[1] : alen == 1;
+      bool bdot = blen < 0 ? !b[1] : blen == 1;
+      if (adot)
+        return -!bdot;
+      if (bdot)
+        return 1;
+
+      bool adotdot = a[1] == '.' && (alen < 0 ? !a[2] : alen == 2);
+      bool bdotdot = b[1] == '.' && (blen < 0 ? !b[2] : blen == 2);
+      if (adotdot)
+        return -!bdotdot;
+      if (bdotdot)
+        return 1;
     }
+  else if (b[0] == '.')
+    return 1;
+
+  /* Cut file suffixes.  */
+  idx_t aprefixlen = file_prefixlen (a, &alen);
+  idx_t bprefixlen = file_prefixlen (b, &blen);
+
+  /* If both suffixes are empty, a second pass would return the same thing.  */
+  bool one_pass_only = aprefixlen == alen && bprefixlen == blen;
+
+  int result = verrevcmp (a, aprefixlen, b, bprefixlen);
 
-  result = verrevcmp (s1, s1_len, s2, s2_len);
-  return result == 0 ? simple_cmp : result;
+  /* Return the initial result if nonzero, or if no second pass is needed.
+     Otherwise, restore the suffixes and try again.  */
+  return result || one_pass_only ? result : verrevcmp (a, alen, b, blen);
 }
diff --git a/lib/filevercmp.h b/lib/filevercmp.h
index d62cc4000e..5a33677671 100644
--- a/lib/filevercmp.h
+++ b/lib/filevercmp.h
@@ -1,4 +1,5 @@
-/*
+/* Compare file names containing version numbers.
+
    Copyright (C) 1995 Ian Jackson <[email protected]>
    Copyright (C) 2001 Anthony Towns <[email protected]>
    Copyright (C) 2008-2022 Free Software Foundation, Inc.
@@ -19,24 +20,57 @@
 #ifndef FILEVERCMP_H
 #define FILEVERCMP_H
 
-/* Compare version strings:
+#include <stddef.h>
+
+/* Compare strings A and B as file names containing version numbers,
+   and return an integer that is negative, zero, or positive depending
+   on whether A compares less than, equal to, or greater than B.
+
+   Use the following version sort algorithm:
+
+     1. Compare the strings' maximal-length non-digit prefixes lexically.
+        If there is a difference return that difference.
+        Otherwise discard the prefixes and continue with the next step.
+
+     2. Compare the strings' maximal-length digit prefixes, using
+        numeric comparison of the numbers represented by each prefix.
+        (Treat an empty prefix as zero; this can happen only at string end.)
+        If there is a difference, return that difference.
+        Otherwise discard the prefixes and continue with the next step.
+
+     3. If both strings are empty, return 0.  Otherwise continue with step 1.
+
+   In version sort, lexical comparison is left to right, byte by byte,
+   using the byte's numeric value (0-255), except that:
+
+     1. ASCII letters sort before other bytes.
+     2. A tilde sorts before anything, even an empty string.
+
+   In addition to the version sort rules, the following strings have
+   special priority and sort before all other strings (listed in order):
 
-   This function compares strings S1 and S2:
-   1) By PREFIX in the same way as strcmp.
-   2) Then by VERSION (most similarly to version compare of Debian's dpkg).
-      Leading zeros in version numbers are ignored.
-   3) If both (PREFIX and  VERSION) are equal, strcmp function is used for
-      comparison. So this function can return 0 if (and only if) strings S1
-      and S2 are identical.
+     1. The empty string.
+     2. ".".
+     3. "..".
+     4. Strings starting with "." sort before other strings.
 
-   It returns number >0 for S1 > S2, 0 for S1 == S2 and number <0 for S1 < S2.
+   Before comparing two strings where both begin with non-".",
+   or where both begin with "." but neither is "." or "..",
+   suffixes matching the C-locale extended regular expression
+   (\.[A-Za-z~][A-Za-z0-9~]*)*$ are removed and the strings compared
+   without them, using version sort without special priority;
+   if they do not compare equal, this comparison result is used and
+   the suffixes are effectively ignored.  Otherwise, the entire
+   strings are compared using version sort.
 
-   This function compares strings, in a way that if VER1 and VER2 are version
-   numbers and PREFIX and SUFFIX (SUFFIX defined as (\.[A-Za-z~][A-Za-z0-9~]*)*)
-   are strings then VER1 < VER2 implies filevercmp (PREFIX VER1 SUFFIX,
-   PREFIX VER2 SUFFIX) < 0.
+   This function is intended to be a replacement for strverscmp.  */
+int filevercmp (char const *a, char const *b) _GL_ATTRIBUTE_PURE;
 
-   This function is intended to be a replacement for strverscmp. */
-int filevercmp (const char *s1, const char *s2) _GL_ATTRIBUTE_PURE;
+/* Like filevercmp, except compare the byte arrays A (of length ALEN)
+   and B (of length BLEN) so that A and B can contain '\0', which
+   sorts just before '\1'.  But if ALEN is -1 treat A as a string
+   terminated by '\0', and similarly for BLEN.  */
+int filenvercmp (char const *a, ptrdiff_t alen, char const *b, ptrdiff_t blen)
+  _GL_ATTRIBUTE_PURE;
 
 #endif /* FILEVERCMP_H */
diff --git a/modules/filevercmp b/modules/filevercmp
index 373dfd00ea..4786ce11b6 100644
--- a/modules/filevercmp
+++ b/modules/filevercmp
@@ -7,8 +7,9 @@ lib/filevercmp.c
 
 Depends-on:
 c-ctype
+idx
 stdbool
-string
+verify
 
 configure.ac:
 
diff --git a/tests/test-filevercmp.c b/tests/test-filevercmp.c
index 79ad4aed8f..b2a7e90f3f 100644
--- a/tests/test-filevercmp.c
+++ b/tests/test-filevercmp.c
@@ -19,6 +19,7 @@
 #include "filevercmp.h"
 
 #include <stddef.h>
+#include <string.h>
 
 #include "macros.h"
 
@@ -28,8 +29,6 @@ static const char *const examples[] =
   "",
   ".",
   "..",
-  ".0",
-  ".9",
   ".A",
   ".Z",
   ".a~",
@@ -40,7 +39,14 @@ static const char *const examples[] =
   ".zz~",
   ".zz",
   ".zz.~1~",
+  ".0",
+  ".9",
   ".zz.0",
+  ".\1",
+  ".\1.txt",
+  ".\1x",
+  ".\1x\1",
+  ".\1.0",
   "0",
   "9",
   "A",
@@ -51,6 +57,10 @@ static const char *const examples[] =
   "a.b",
   "a.bc~",
   "a.bc",
+  "a+",
+  "a.",
+  "a..a",
+  "a.+",
   "b~",
   "b",
   "gcc-c++-10.fc9.tar.gz",
@@ -80,10 +90,79 @@ static const char *const examples[] =
   "zz",
   "zz.~1~",
   "zz.0",
+  "zz.0.txt",
+  "\1",
+  "\1.txt",
+  "\1x",
+  "\1x\1",
+  "\1.0",
+  "#\1.b#",
   "#.b#",
   NULL
 };
 
+/* Sets of examples that should all sort equally.  Each set is
+   terminated by NULL.  */
+static char const *const equals[] =
+  {
+    "a",
+    "a0",
+    "a0000",
+    NULL,
+    "a\1c-27.txt",
+    "a\1c-027.txt",
+    "a\1c-00000000000000000000000000000000000000000000000000000027.txt",
+    NULL,
+    ".a\1c-27.txt",
+    ".a\1c-027.txt",
+    ".a\1c-00000000000000000000000000000000000000000000000000000027.txt",
+    NULL,
+    "a\1c-",
+    "a\1c-0",
+    "a\1c-00",
+    NULL,
+    ".a\1c-",
+    ".a\1c-0",
+    ".a\1c-00",
+    NULL,
+    "a\1c-0.txt",
+    "a\1c-00.txt",
+    NULL,
+    ".a\1c-1\1.txt",
+    ".a\1c-001\1.txt",
+    NULL,
+  };
+
+static int
+sign (int i)
+{
+  return i < 0 ? -1 : 0 < i;
+}
+
+/* Return filevercmp (A, B), checking that a similar result is gotten
+   after replacing all '\1's with '\0's and calling filenvercmp with
+   the embedded '\0's.  */
+static int
+test_filevercmp (char const *a, char const *b)
+{
+  int result = filevercmp (a, b);
+
+  char buffer[1000];
+
+  ptrdiff_t alen = strlen (a), blen = strlen (b);
+  ASSERT (alen + blen <= sizeof buffer);
+  memcpy (buffer, a, alen);
+  memcpy (buffer + alen, b, blen);
+  for (ptrdiff_t i = 0; i < alen + blen; i++)
+    if (buffer[i] == '\1')
+      buffer[i] = '\0';
+
+  int nresult = filenvercmp (buffer, alen, buffer + alen, blen);
+  ASSERT (sign (nresult) == sign (result));
+
+  return result;
+}
+
 int
 main (void)
 {
@@ -94,7 +173,6 @@ main (void)
   ASSERT (filevercmp ("a", "a") == 0);
   ASSERT (filevercmp ("a", "b") < 0);
   ASSERT (filevercmp ("b", "a") > 0);
-  ASSERT (filevercmp ("a0", "a") > 0);
   ASSERT (filevercmp ("00", "01") < 0);
   ASSERT (filevercmp ("01", "010") < 0);
   ASSERT (filevercmp ("9", "10") < 0);
@@ -106,7 +184,7 @@ main (void)
       const char *const *j;
       for (j = examples; *j; j++)
         {
-          int result = filevercmp (*i, *j);
+          int result = test_filevercmp (*i, *j);
           if (result < 0)
             ASSERT (i < j);
           else if (0 < result)
@@ -116,5 +194,13 @@ main (void)
         }
     }
 
+  for (i = equals; i < equals + sizeof equals / sizeof *equals; i++)
+    for (; *i; i++)
+      for (char const *const *j = i; *j; j++)
+        {
+          ASSERT (test_filevercmp (*i, *j) == 0);
+          ASSERT (test_filevercmp (*j, *i) == 0);
+        }
+
   return 0;
 }
-- 
2.32.0

From d8047ae86d5418782db7ec906c10e1af4f129997 Mon Sep 17 00:00:00 2001
From: Paul Eggert <[email protected]>
Date: Sat, 12 Feb 2022 20:14:27 -0800
Subject: [PATCH] sort: fix several version-sort problems

This also affects ls -v in some corner cases.
Problems reported by Michael Debertol <https://bugs.gnu.org/49239>.
While looking into this, I spotted some more areas where the
code and documentation did not agree, or where the documentation
was unclear.  In some cases I changed the code; in others
the documentation.  I hope things are nailed down better now.
* doc/sort-version.texi: Distinguish more carefully between
characters and bytes.  Say that non-identical strings can
compare equal, since they now can.  Improve readability in
various ways.  Make it clearer that a suffix can be the
entire string.
* src/ls.c (cmp_version): Fall back on strcmp if filevercmp
reports equality, since filevercmp is no longer a total order.
* src/sort.c (keycompare): Use filenvercmp, to treat NULs correctly.
* tests/misc/ls-misc.pl (v_files):
Adjust test to match new behavior.
* tests/misc/sort-version.sh: Add tests for stability,
and for sorting with NUL bytes.
---
 NEWS                       |   6 ++
 doc/sort-version.texi      | 174 ++++++++++++++++++++-----------------
 src/ls.c                   |  16 ++--
 src/sort.c                 |   2 +-
 tests/misc/ls-misc.pl      |   8 +-
 tests/misc/sort-version.sh |  10 ++-
 6 files changed, 123 insertions(+), 93 deletions(-)

diff --git a/NEWS b/NEWS
index fcf31fe39..d58be8863 100644
--- a/NEWS
+++ b/NEWS
@@ -24,6 +24,12 @@ GNU coreutils NEWS                                    -*- outline -*-
   'id xyz' now uses the name 'xyz' to determine groups, instead of xyz's uid.
   [bug introduced in coreutils-8.22]
 
+  'ls -v' and 'sort -V' no longer mishandle corner cases like "a..a" vs "a.+"
+  or lines containing NULs.  Their behavior now matches the documentation
+  for file names like ".m4" that consist entirely of an extension,
+  and the documentation has been clarified for unusual cases.
+  [bug introduced in coreutils-7.0]
+
   On macOS, 'mv A B' no longer fails with "Operation not supported"
   when A and B are in the same tmpfs file system.
   [bug introduced in coreutils-9.0]
diff --git a/doc/sort-version.texi b/doc/sort-version.texi
index 7f76ac5bb..aa7b0f9d5 100644
--- a/doc/sort-version.texi
+++ b/doc/sort-version.texi
@@ -161,33 +161,40 @@ The strings are compared from left to right.
 
 @item
 First the initial part of each string consisting entirely of non-digit
-characters is determined.
+bytes is determined.
 
-@enumerate
+@enumerate A
 @item
-These two parts (one of which may be empty) are compared lexically.
+These two parts (either of which may be empty) are compared lexically.
 If a difference is found it is returned.
 
 @item
-The lexical comparison is a comparison of ASCII values modified so that:
+The lexical comparison is a lexicographic comparison of byte strings,
+except that:
 
-@enumerate
+@enumerate a
 @item
-Letters sort before non-letters.
+ASCII letters sort before other bytes.
 @item
-A tilde sorts before anything, even the end of a part.
+A tilde sorts before anything, even an empty string.
 @end enumerate
 @end enumerate
 
 @item
-Then the initial part of the remainder of each string which consists
-entirely of digit characters is determined. The numerical values of
+Then the initial part of the remainder of each string that contains
+all the leading digits is determined. The numerical values represented by
 these two parts are compared, and any difference found is returned as
 the result of the comparison.
-@enumerate
+
+@enumerate A
 @item
 For these purposes an empty string (which can only occur at the end of
 one or both version strings being compared) counts as zero.
+
+@item
+Because the numerical value is used, non-identical strings can compare
+equal.  For example, @samp{123} compares equal to @samp{00123}, and
+the empty string compares equal to @samp{0}.
 @end enumerate
 
 @item
@@ -202,7 +209,7 @@ down to the following parts, and the parts compared respectively from
 each string:
 
 @example
-foo  @r{vs}  foo   @r{(rule 2, non-digit characters)}
+foo  @r{vs}  foo   @r{(rule 2, non-digits)}
 07   @r{vs}  7     @r{(rule 3, digits)}
 .    @r{vs}  a.    @r{(rule 2)}
 7    @r{vs}  7     @r{(rule 3)}
@@ -213,22 +220,22 @@ Comparison flow based on above algorithm:
 
 @enumerate
 @item
-The first parts (@samp{foo}) are identical in both strings.
+The first parts (@samp{foo}) are identical.
 
 @item
 The second parts (@samp{07} and @samp{7}) are compared numerically,
-and are identical.
+and compare equal.
 
 @item
 The third parts (@samp{.} vs @samp{a.}) are compared
-lexically by ASCII value (rule 2.2).
+lexically by ASCII value (rule 2.B).
 
 @item
-The first character of the first string (@samp{.}) is compared
-to the first character of the second string (@samp{a}).
+The first byte of the first string (@samp{.}) is compared
+to the first byte of the second string (@samp{a}).
 
 @item
-Rule 2.2.1 dictates that ``all letters sorts earlier than all non-letters''.
+Rule 2.B.a says letters sorts before non-letters.
 Hence, @samp{a} comes before @samp{.}.
 
 @item
@@ -280,17 +287,17 @@ $ sort -n input4                  $ sort -V input4
 Numeric sort (@samp{sort -n}) treats the entire string as a single numeric
 value, and compares it to other values. For example, @samp{8.1}, @samp{8.10} and
 @samp{8.100} are numerically equivalent, and are ordered together. Similarly,
-@samp{8.49} is numerically smaller than @samp{8.5}, and appears before first.
+@samp{8.49} is numerically less than @samp{8.5}, and appears before first.
 
 Version sort (@samp{sort -V}) first breaks down the string into digit and
 non-digit parts, and only then compares each part (see annotated
-example in Version-sort ordering rules).
+example in @ref{Version-sort ordering rules}).
 
 Comparing the string @samp{8.1} to @samp{8.01}, first the
-@samp{8} characters are compared (and are identical), then the
+@samp{8}s are compared (and are identical), then the
 dots (@samp{.}) are compared and are identical, and lastly the
 remaining digits are compared numerically (@samp{1} and @samp{01}) -
-which are numerically equivalent. Hence, @samp{8.01} and @samp{8.1}
+which are numerically equal.  Hence, @samp{8.01} and @samp{8.1}
 are grouped together.
 
 Similarly, comparing @samp{8.5} to @samp{8.49} -- the @samp{8}
@@ -301,10 +308,10 @@ This sorting order (where @samp{8.5} comes before @samp{8.49}) is common when
 assigning versions to computer programs (while perhaps not intuitive
 or ``natural'' for people).
 
-@node Punctuation characters
-@subsection Punctuation characters
+@node Version sort punctuation
+@subsection Version sort punctuation
 
-Punctuation characters are sorted by ASCII order (rule 2.2).
+Punctuation is sorted by ASCII order (rule 2.B).
 
 @example
 $ touch 1.0.5_src.tar.gz 1.0_src.tar.gz
@@ -319,22 +326,22 @@ Based on the version-sort ordering rules, the strings are broken down
 into the following parts:
 
 @example
-          1   @r{vs}  1               @r{(rule 3, all digit characters)}
-          .   @r{vs}  .               @r{(rule 2, all non-digit characters)}
+          1   @r{vs}  1               @r{(rule 3, all digits)}
+          .   @r{vs}  .               @r{(rule 2, all non-digits)}
           0   @r{vs}  0               @r{(rule 3)}
           .   @r{vs}  _src.tar.gz     @r{(rule 2)}
-          5   @r{vs}  empty string    @r{(no more character in the file name)}
+          5   @r{vs}  empty string    @r{(no more bytes in the file name)}
 _src.tar.gz   @r{vs}  empty string
 @end example
 
 The fourth parts (@samp{.} and @samp{_src.tar.gz}) are compared
-lexically by ASCII order. The character @samp{.} (ASCII value 46) is
-smaller than @samp{_} (ASCII value 95) -- and should be listed before it.
+lexically by ASCII order. The @samp{.} (ASCII value 46) is
+less than @samp{_} (ASCII value 95) -- and should be listed before it.
 
 Hence, @file{1.0.5_src.tar.gz} is listed first.
 
-If a different character appears instead of the underscore (for
-example, percent sign @samp{%} ASCII value 37, which is smaller
+If a different byte appears instead of the underscore (for
+example, percent sign @samp{%} ASCII value 37, which is less
 than dot's ASCII value of 46), that file will be listed first:
 
 @example
@@ -344,7 +351,7 @@ $ touch   1.0.5_src.tar.gz     1.0%zzzzz.gz
 @end example
 
 The same reasoning applies to the following example, as @samp{.} with
-ASCII value 46 is smaller than @samp{/} with ASCII value 47:
+ASCII value 46 is less than @samp{/} with ASCII value 47:
 
 @example
 $ cat input5
@@ -359,7 +366,7 @@ $ sort -V input5
 @node Punctuation vs letters
 @subsection Punctuation vs letters
 
-Rule 2.2.1 dictates that letters sorts earlier than all non-letters
+Rule 2.B.a says letters sort before non-letters
 (after breaking down a string to digit and non-digit parts).
 
 @example
@@ -372,23 +379,23 @@ a%
 @end example
 
 The input strings consist entirely of non-digits, and based on the
-above algorithm have only one part, all non-digit characters
+above algorithm have only one part, all non-digits
 (@samp{a%} vs @samp{az}).
 
 Each part is then compared lexically,
-character-by-character. @samp{a} compares identically in both
+byte-by-byte; @samp{a} compares identically in both
 strings.
 
-Rule 2.2.1 dictates that letters (@samp{z}) sorts earlier than all
-non-letters (@samp{%}) -- hence @samp{az} appears first (despite
+Rule 2.B.a says a letter like @samp{z} sorts before
+a non-letter like @samp{%} -- hence @samp{az} appears first (despite
 @samp{z} having ASCII value of 122, much larger than @samp{%}
 with ASCII value 37).
 
-@node The tilde @samp{~} character
-@subsection The tilde @samp{~} character
+@node The tilde @samp{~}
+@subsection The tilde @samp{~}
 
-Rule 2.2.2 dictates that the tilde character @samp{~} (ASCII 126) sorts
-before all other non-digit characters, including an empty part.
+Rule 2.B.b says the tilde @samp{~} (ASCII 126) sorts
+before other bytes, and before an empty string.
 
 @example
 $ cat input7
@@ -413,13 +420,13 @@ with a non-digit (@samp{~}). This is the first part. All other lines
 in the input file start with a digit -- their first non-digit part is
 empty.
 
-Based on rule 2.2.2, tilde @samp{~} sorts before all other non-digits
-including the empty part -- hence it comes before all other strings,
+Based on rule 2.B.b, tilde @samp{~} sorts before other bytes
+and before the empty string -- hence it comes before all other strings,
 and is listed first in the sorted output.
 
 The remaining lines (@samp{1}, @samp{1%}, @samp{1.2}, @samp{1~})
 follow similar logic: The digit part is extracted (1 for all strings)
-and compares identical. The following extracted parts for the remaining
+and compares equal. The following extracted parts for the remaining
 input lines are: empty part, @samp{%}, @samp{.}, @samp{~}.
 
 Tilde sorts before all others, hence the line @samp{1~} appears next.
@@ -452,8 +459,8 @@ aα
 Ignoring the first letter (@samp{a}) which is identical in all
 strings, the compared values are:
 
-@samp{a} and @samp{z} are letters, and sort earlier than
-all other non-digit characters.
+@samp{a} and @samp{z} are letters, and sort before
+all other non-digits.
 
 Then, percent sign @samp{%} (ASCII value 37) is compared to the
 first byte of the UTF-8 sequence of @samp{α}, which is 0xCE or 206). The
@@ -467,8 +474,8 @@ official Debian algorithm, in order to accommodate more general usage
 and file name listing.
 
 
-@node Hyphen-minus and colon characters
-@subsection Hyphen-minus @samp{-} and colon @samp{:} characters
+@node Hyphen-minus and colon
+@subsection Hyphen-minus @samp{-} and colon @samp{:}
 
 In Debian's version string syntax the version consists of three parts:
 @example
@@ -488,7 +495,7 @@ Example of such version strings:
 @end example
 
 If the @samp{debian_revision part} is not present,
-hyphen characters @samp{-} are not allowed.
+hyphens @samp{-} are not allowed.
 If epoch is not present, colons @samp{:} are not allowed.
 
 If these parts are present, hyphen and/or colons can appear only once
@@ -498,8 +505,8 @@ In GNU Coreutils, such restrictions are not reasonable (a file name can
 have many hyphens, a line of text can have many colons).
 
 As a result, in GNU Coreutils hyphens and colons are treated exactly
-like all other punctuation characters, i.e., they are sorted after
-letters.  @xref{Punctuation characters}.
+like all other punctuation, i.e., they are sorted after
+letters.  @xref{Version sort punctuation}.
 
 In Debian, these characters are treated differently than in Coreutils:
 a version string with hyphen will sort before similar strings without
@@ -522,29 +529,27 @@ out of order
 For further details, see @ref{Comparing two strings using Debian's
 algorithm} and @uref{https://bugs.gnu.org/35939,GNU Bug 35939}.
 
-@node Additional hard-coded priorities in GNU Coreutils version sort
-@subsection Additional hard-coded priorities in GNU Coreutils version sort
+@node Special priority in GNU Coreutils version sort
+@subsection Special priority in GNU Coreutils version sort
 
 In GNU Coreutils version sort, the following items have
-special priority and sort earlier than all other characters (listed in
-order);
+special priority and sort before all other strings (listed in order):
 
 @enumerate
 @item The empty string
 
-@item The string @samp{.} (a single dot character, ASCII 46)
+@item The string @samp{.} (a single dot, ASCII 46)
 
-@item The string @samp{..} (two dot characters)
+@item The string @samp{..} (two dots)
 
-@item Strings start with a dot (@samp{.}) sort earlier than
-strings starting with any other characters.
+@item Strings starting with dot (@samp{.}) sort before
+strings starting with any other byte.
 @end enumerate
 
 Example:
 
 @example
 $ printf '%s\n' a "" b "." c  ".."  ".d20" ".d3"  | sort -V
-
 .
 ..
 .d3
@@ -567,33 +572,35 @@ program, the ordering rules are the same.
 @subsection Special handling of file extensions
 
 GNU Coreutils version sort implements specialized handling
-of file extensions (or strings that look like file names with
-extensions).
-
-This nuanced implementation enables slightly more natural ordering of files.
+of strings that look like file names with extensions.
+This enables slightly more natural ordering of file
+names.
 
-The additional rules are:
+The following additional rules apply when comparing two strings where
+both begin with non-@samp{.}.  They also apply when comparing two
+strings where both begin with @samp{.} but neither is @samp{.} or @samp{..}.
 
 @enumerate
 @item
-A suffix (i.e., a file extension) is defined as: a dot, followed by a
-letter or tilde, followed by one or more letters, digits, or tildes
-(possibly repeated more than once), until the end of the string
-(technically, matching the regular expression
-@code{(\.[A-Za-z~][A-Za-z0-9~]*)*}).
+A suffix (i.e., a file extension) is defined as: a dot, followed by an
+ASCII letter or tilde, followed by zero or more ASCII letters, digits,
+or tildes; all repeated zero or more times, and ending at string end.
+This is equivalent to matching the extended regular expression
+@code{(\.[A-Za-z~][A-Za-z0-9~]*)*$} in the C locale.
 
 @item
-If the strings contains suffixes, the suffixes are temporarily
-removed, and the strings are compared without them (using the
-@ref{Version-sort ordering rules,algorithm,algorithm} above).
+The suffixes are temporarily removed, and the strings are compared
+without them, using version sort (see @ref{Version-sort ordering
+rules}) without special priority (see @ref{Special priority in GNU
+Coreutils version sort}).
 
 @item
-If the suffix-less strings are identical, the suffix is restored and
-the entire strings are compared.
+If the suffix-less strings do not compare equal, this comparison
+result is used and the suffixes are effectively ignored.
 
 @item
-If the non-suffixed strings differ, the result is returned and the
-suffix is effectively ignored.
+If the suffix-less strings compare equal, the suffixes are restored
+and the entire strings are compared using version sort.
 @end enumerate
 
 Examples for rule 1:
@@ -619,6 +626,9 @@ is not included)
 @item
 @samp{gcc-c++-10.8.12-0.7rc2.fc9.tar.bz2}: the suffix is
 @samp{.fc9.tar.bz2} (@samp{.7rc2} is not included as it begins with a digit)
+
+@item
+@samp{.autom4te.cfg}: the suffix is the entire string.
 @end itemize
 
 Examples for rule 2:
@@ -665,8 +675,8 @@ Without the suffix-removal algorithm, the strings will be broken down
 to the following parts:
 
 @example
-hello-  @r{vs}  hello-  @r{(rule 2, all non-digit characters)}
-8       @r{vs}  8       @r{(rule 3, all digit characters)}
+hello-  @r{vs}  hello-  @r{(rule 2, all non-digits)}
+8       @r{vs}  8       @r{(rule 3, all digits)}
 .txt    @r{vs}  .       @r{(rule 2)}
 empty   @r{vs}  2
 empty   @r{vs}  .txt
@@ -685,8 +695,8 @@ The suffixes (@samp{.txt}) are removed, and the remaining strings are
 broken down into the following parts:
 
 @example
-hello-  @r{vs}  hello-  @r{(rule 2, all non-digit characters)}
-8       @r{vs}  8       @r{(rule 3, all digit characters)}
+hello-  @r{vs}  hello-  @r{(rule 2, all non-digits)}
+8       @r{vs}  8       @r{(rule 3, all digits)}
 empty   @r{vs}  .       @r{(rule 2)}
 empty   @r{vs}  2
 @end example
@@ -754,7 +764,7 @@ dpkg: warning: version '3.0/' has bad syntax:
 
 To illustrate the different handling of hyphens between Debian and
 Coreutils algorithms (see
-@ref{Hyphen-minus and colon characters}):
+@ref{Hyphen-minus and colon}):
 
 @example
 $ compver abb ab-cd 2>/dev/null     $ printf 'abb\nab-cd\n' | sort -V
diff --git a/src/ls.c b/src/ls.c
index 53f80076b..1930e4abb 100644
--- a/src/ls.c
+++ b/src/ls.c
@@ -3941,18 +3941,20 @@ DEFINE_SORT_FUNCTIONS (extension, cmp_extension)
 DEFINE_SORT_FUNCTIONS (width, cmp_width)
 
 /* Compare file versions.
-   Unlike all other compare functions above, cmp_version depends only
-   on filevercmp, which does not fail (even for locale reasons), and does not
-   need a secondary sort key. See lib/filevercmp.h for function description.
+   Unlike the other compare functions, cmp_version does not fail
+   because filevercmp and strcmp do not fail; cmp_version uses strcmp
+   instead of xstrcoll because filevercmp is locale-independent so
+   strcmp is its appropriate secondary.
 
-   All the other sort options, in fact, need xstrcoll and strcmp variants,
-   because they all use a string comparison (either as the primary or secondary
+   All the other sort options need xstrcoll and strcmp variants,
+   because they all use xstrcoll (either as the primary or secondary
    sort key), and xstrcoll has the ability to do a longjmp if strcoll fails for
-   locale reasons.  Lastly, filevercmp is ALWAYS available with gnulib.  */
+   locale reasons.  */
 static int
 cmp_version (struct fileinfo const *a, struct fileinfo const *b)
 {
-  return filevercmp (a->name, b->name);
+  int diff = filevercmp (a->name, b->name);
+  return diff ? diff : strcmp (a->name, b->name);
 }
 
 static int
diff --git a/src/sort.c b/src/sort.c
index 1a3bda698..3b775d6bb 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -2703,7 +2703,7 @@ keycompare (struct line const *a, struct line const *b)
           else if (key->random)
             diff = compare_random (ta, tlena, tb, tlenb);
           else if (key->version)
-            diff = filevercmp (ta, tb);
+            diff = filenvercmp (ta, tlena, tb, tlenb);
           else
             {
               /* Locale-dependent string sorting.  This is slower than
diff --git a/tests/misc/ls-misc.pl b/tests/misc/ls-misc.pl
index e246b5c20..a2ac60b23 100755
--- a/tests/misc/ls-misc.pl
+++ b/tests/misc/ls-misc.pl
@@ -113,8 +113,12 @@ sub make_j_d ()
   chmod 0555, 'j/d' or die "making j/d executable: $!\n";
 }
 
-my @v1 = (qw(0 9 A Z a z), 'zz~', 'zz', 'zz.~1~', 'zz.0');
-my @v_files = ((map { ".$_" } @v1), @v1);
+my @v_files = (
+    '.A', '.Z', '.a', '.z', '.zz~', '.zz', '.zz.~1~',
+    '.0', '.9', '.zz.0',
+     '0',  '9',
+     'A',  'Z',  'a',  'z',  'zz~',  'zz',  'zz.~1~',
+     'zz.0');
 my $exe_in_subdir = {PRE => sub { make_j_d (); push_ls_colors('ex=01;32') }};
 my $remove_j = {POST => sub {unlink 'j/d' or die "j/d: $!\n";
                              rmdir 'j' or die "j: $!\n";
diff --git a/tests/misc/sort-version.sh b/tests/misc/sort-version.sh
index 3786605d6..10ea6a56c 100755
--- a/tests/misc/sort-version.sh
+++ b/tests/misc/sort-version.sh
@@ -58,6 +58,7 @@ string start 5.8.0 end of str
 string start 5.80.0 end of str
 string start 5.9.0 end of str
 string start 5.90.0 end of str
+string start 5.04.0 end of str
 _EOF_
 
 cat > exp << _EOF_
@@ -85,6 +86,7 @@ string start 5.1.0 end of str
 string start 5.2.0 end of str
 string start 5.3.0 end of str
 string start 5.4.0 end of str
+string start 5.04.0 end of str
 string start 5.5.0 end of str
 string start 5.6.0 end of str
 string start 5.7.0 end of str
@@ -101,6 +103,12 @@ string start 5.80.0 end of str
 string start 5.90.0 end of str
 _EOF_
 
-sort --sort=version -o out in || fail=1
+sort --stable --sort=version -o out in || fail=1
 compare exp out || fail=1
+
+tr ' ' '\0' <in >in0 || framework_failure_
+sort --stable --sort=version -o out0 in0 || fail=1
+tr '\0' ' ' <out0 >out1 || framework_failure_
+compare exp out1 || fail=1
+
 Exit $fail
-- 
2.32.0

bug#49239: Unexpected results with sort -V

Reply via email to