On Sunday 06 November 2005 07:39, Andrew P. wrote:
> Note, that the difference must be kept in RAM, so it won't work if there
> are multi-gig diffs, but it will work very fast if the diffs are only
> 10-100Mb, it will work at close to I/O speed if the diff is under 10Mb.

Thanks, Andrew! My Python script runs that algorithm in 17 seconds on a
400MB file with 10% CPU.
For anyone interested, here's my implementation. Note that the readline()
method in Python always returns something, even at EOF (at which point you
get an empty string). Also, empty strings evaluate as "false", which is
why the "if not (oldline or newline): break" code exits at the end.

old_records = []
new_records = []
while 1:
    oldline, newline = oldfile.readline(), newfile.readline()
    if not (oldline or newline):
        break
    if oldline == newline:
        continue
    try:
        new_records.remove(oldline)
    except ValueError:
        if oldline:
            old_records.append(oldline)
    try:
        old_records.remove(newline)
    except ValueError:
        if newline:
            new_records.append(newline)

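For anyone who wants to try this without real files, here is a minimal self-contained sketch of the same loop wrapped in a function. The name diff_records and the use of io.StringIO as stand-in files are assumptions made for illustration only; they are not part of the script I actually ran. The logic is meant to match the loop above: lines that appear in both files at different positions cancel out, and whatever remains is unique to one side.

```python
import io


def diff_records(oldfile, newfile):
    """Return (old_only, new_only): lines left unmatched on each side.

    Hypothetical wrapper around the loop above; only the unmatched
    lines are held in memory, so it suits small diffs between large files.
    """
    old_records = []
    new_records = []
    while True:
        # readline() returns '' at EOF, so a shorter file just yields ''.
        oldline, newline = oldfile.readline(), newfile.readline()
        if not (oldline or newline):
            break  # both files exhausted
        if oldline == newline:
            continue  # identical at this position, nothing to track
        if oldline:
            try:
                # Seen earlier in the new file: the two cancel out.
                new_records.remove(oldline)
            except ValueError:
                old_records.append(oldline)
        if newline:
            try:
                # Seen earlier in the old file: the two cancel out.
                old_records.remove(newline)
            except ValueError:
                new_records.append(newline)
    return old_records, new_records


if __name__ == "__main__":
    # Stand-in "files": the new one drops line a and gains line d.
    old = io.StringIO("a\nb\nc\n")
    new = io.StringIO("b\nc\nd\n")
    removed, added = diff_records(old, new)
    print("only in old:", removed)
    print("only in new:", added)
```

Note that list.remove() is a linear scan, so this stays fast only while the pending lists are short, which fits Andrew's point about the diff having to be small enough to sit in RAM.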
> Hope this gives you some idea.
It did. It must've been a long work week, because that all seems so obvious
in retrospect but was completely opaque at the time. Thanks again!
--
Kirk Strauser
