On Nov 4, 2005, at 12:29 PM, Kirk Strauser wrote:
Multigigabyte? Find another approach to solving the problem, a text-base diff is going to require excessive resources and time. A 64-bit platform
with 2 GB of RAM & 3GB of swap requires ~1000 seconds to diff ~400MB.

There really aren't many options. For the patient, here's what's happening:
[ ... ]
And that's why I need a fast diff. Even if it takes as long as the database bulk loads, we can run it on another server and use 20 seconds of CPU for
PostgreSQL instead of 45 minutes.  The practical upshot is that the
database will never get sluggish, even if the other "diff server" is loaded
to the gills.

OK, but even if only one line out of 1000 changes, you still can't make either diff or Colin Percival's bsdiff run on gigabyte sized files and have it fit into MAXDSIZE on 32-bit address space. From the latter's website:

"bsdiff is quite memory-hungry. It requires max(17*n,9*n+m)+O(1) bytes of memory, where n is the size of the old file and m is the size of the new file. bspatch requires n+m+O(1) bytes. bsdiff runs in O((n+m) log n) time; on a 200MHz Pentium Pro, building a binary patch for a 4MB file takes about 90 seconds. bspatch runs in O(n+m) time; on the same machine, applying that patch takes about two seconds."

Some time ago, I wrote a quick test harness for diff here:

   http://www.pkix.net/~chuck/difftest.py

On a 5.4 machine with kern.dfldsiz="1G" set in /boot/loader.conf, you can only manage to run diff on files up to about 120 MB in size:

31-pi% ./difftest.py -v
INFO: beginning diff trial run with ratio = 100
filea_size=10485760 (aka 10.000 MB)
time=1.370 filea_size=10MB diff_size=818KB
filea_size=15728640 (aka 15.000 MB)
time=2.305 filea_size=15MB diff_size=1229KB
filea_size=23592960 (aka 22.500 MB)
time=5.443 filea_size=22MB diff_size=1844KB
filea_size=35389440 (aka 33.750 MB)
time=7.195 filea_size=33MB diff_size=2768KB
filea_size=53084160 (aka 50.625 MB)
time=16.771 filea_size=50MB diff_size=4163KB
filea_size=79626240 (aka 75.938 MB)
time=43.525 filea_size=75MB diff_size=6257KB
filea_size=119439360 (aka 113.906 MB)
time=78.346 filea_size=113MB diff_size=9MB
filea_size=179159040 (aka 170.859 MB)
diff: memory exhausted
NOTICE: diff exitted with errno 2
time=36.896 filea_size=170MB diff_size=0KB
272.58s real  154.73s user  13.23s system  61%

On a 64-bit SPARC box mentioned above, you can get sizes up to ~400 MB:

[ ... ]
filea_size=119439360 (aka 113.906 MB)
time=140.650 filea_size=115MB diff_size=9MB
filea_size=179159040 (aka 170.859 MB)
time=424.586 filea_size=172MB diff_size=15MB
filea_size=268738560 (aka 256.289 MB)
time=546.334 filea_size=258MB diff_size=22MB
filea_size=403107840 (aka 384.434 MB)
time=957.059 filea_size=388MB diff_size=33MB
filea_size=604661760 (aka 576.650 MB)
diff: memory exhausted
NOTICE: diff exitted with errno 2
time=105.728 filea_size=582MB diff_size=0KB
5610.90s real  3268.63s user  1761.90s system  89%

Roughly, you need about an order of magnitude more RAM or virtual memory available then the size of the files you are trying to diff, even if the files are very similar.

--
-Chuck

_______________________________________________
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to "[EMAIL PROTECTED]"

Reply via email to