On 08/07/10 18:50, Christoph Anton Mitterer wrote: > I found a rather strange bug in diff (Debian version 1:3.0-1) when doing > large backups of millions of files. > First I've copied the data to an external HDD, then I've did an "diff -q > -r >out 2> err" in order to find any errors.
A wise precaution! I do that sort of thing a lot. > The strange unicode characters had their character codes inside which read > as follows: > 1CEE 1CAE > 1CAE 1CEE > ... > Any ideas? My idea is that you've found and reported a bug that has been in GNU diff ever since 2.8 was released back in 2002. Thanks very much for reporting this: you've undoubtedly saved others some work and hair-pulling. I've installed the following patch, which I hope fixes things. If you can test it on your data, that'd be nice. I have verified this bug on a much simpler test case, and this patch includes the test case to try to make sure that a similar bug doesn't creep back in later. >From 29ac7f191c444400c6df067647e3158a1e188ddf Mon Sep 17 00:00:00 2001 From: Paul Eggert <[email protected]> Date: Thu, 12 Aug 2010 17:55:05 -0700 Subject: [PATCH] diff: avoid spurious diffs when two distinct dir entries compare equal Problem reported by Christoph Anton Mitterer in: http://lists.gnu.org/archive/html/bug-diffutils/2010-08/msg00000.html * NEWS: Mention this bug fix. * src/dir.c (compare_names_for_qsort): Fall back on file_name_cmp if two distinct entries in the same directory compare equal. (diff_dirs): Prefer a file_name_cmp match when available. * tests/Makefile.am (TESTS): New test colliding-file-names. * tests/colliding-file-names: New file. --- NEWS | 5 +++++ src/dir.c | 41 +++++++++++++++++++++++++++++++++++++++-- tests/Makefile.am | 1 + tests/colliding-file-names | 20 ++++++++++++++++++++ 4 files changed, 65 insertions(+), 2 deletions(-) create mode 100644 tests/colliding-file-names diff --git a/NEWS b/NEWS index 27b9886..1c41e9e 100644 --- a/NEWS +++ b/NEWS @@ -2,6 +2,11 @@ GNU diffutils NEWS -*- outline -*- * Noteworthy changes in release ?.? (????-??-??) [?] +** Bug fixes + + diff no longer reports spurious differences merely because two entries + in the same directory have names that compare equal in the current + locale, or compare equal because --ignore-file-name-case was given. * Noteworthy changes in release 3.0 (2010-05-03) [stable] diff --git a/src/dir.c b/src/dir.c index 5b4eaec..1b078ab 100644 --- a/src/dir.c +++ b/src/dir.c @@ -166,14 +166,16 @@ compare_names (char const *name1, char const *name2) : file_name_cmp (name1, name2)); } -/* A wrapper for compare_names suitable as an argument for qsort. */ +/* Compare names FILE1 and FILE2 when sorting a directory. + Prefer filtered comparison, breaking ties with file_name_cmp. */ static int compare_names_for_qsort (void const *file1, void const *file2) { char const *const *f1 = file1; char const *const *f2 = file2; - return compare_names (*f1, *f2); + int diff = compare_names (*f1, *f2); + return diff ? diff : file_name_cmp (*f1, *f2); } /* Compare the contents of two directories named in CMP. @@ -253,6 +255,41 @@ diff_dirs (struct comparison const *cmp, pretend the "next name" in that dir is very large. */ int nameorder = (!*names[0] ? 1 : !*names[1] ? -1 : compare_names (*names[0], *names[1])); + + /* Prefer a file_name_cmp match if available. This algorithm is + O(N**2), where N is the number of names in a directory + that compare_names says are all equal, but in practice N + is so small it's not worth tuning. */ + if (nameorder == 0) + { + int raw_order = file_name_cmp (*names[0], *names[1]); + if (raw_order != 0) + { + int greater_side = raw_order < 0; + int lesser_side = 1 - greater_side; + char const **lesser = names[lesser_side]; + char const *greater_name = *names[greater_side]; + char const **p; + + for (p = lesser + 1; + *p && compare_names (*p, greater_name) == 0; + p++) + { + int cmp = file_name_cmp (*p, greater_name); + if (0 <= cmp) + { + if (cmp == 0) + { + memmove (lesser + 1, lesser, + (char *) p - (char *) lesser); + *lesser = greater_name; + } + break; + } + } + } + } + int v1 = (*handle_file) (cmp, 0 < nameorder ? 0 : *names[0]++, nameorder < 0 ? 0 : *names[1]++); diff --git a/tests/Makefile.am b/tests/Makefile.am index 6a4858c..242d501 100644 --- a/tests/Makefile.am +++ b/tests/Makefile.am @@ -1,6 +1,7 @@ TESTS = \ basic \ binary \ + colliding-file-names \ help-version \ function-line-vs-leading-space \ label-vs-func \ diff --git a/tests/colliding-file-names b/tests/colliding-file-names new file mode 100644 index 0000000..f14dce7 --- /dev/null +++ b/tests/colliding-file-names @@ -0,0 +1,20 @@ +#!/bin/sh +# Check that diff responds well if a directory has multiple file names +# that compare equal. + +: ${srcdir=.} +. "$srcdir/init.sh"; path_prepend_ ../src + +mkdir d1 d2 || fail=1 + +for i in abc abC aBc aBC Abc AbC ABc ABC; do + echo $i >d1/$i || fail=1 +done + +for i in ABC ABc AbC Abc aBC aBc abC abc; do + echo $i >d2/$i || fail=1 +done + +diff -r --ignore-file-name-case d1 d2 || fail=1 + +Exit $fail -- 1.7.2
