Re: Fast diff command for large files?

Kirk Strauser Mon, 07 Nov 2005 07:49:16 -0800

On Sunday 06 November 2005 07:39, Andrew P. wrote:

> Note, that the difference must be kept in RAM, so it won't work if there 
> are multi-gig diffs, but it will work very fast if the diffs are only 
> 10-100Mb, it will work at close to I/O speed if the diff is under 10Mb.


Thanks, Andrew!  My Python script runs that algorithm in 17 seconds on a 
400MB file with 10% CPU.

For anyone interested, here's my implementation.  Note that Python's 
readline() method always returns a string, even at EOF (at which point you 
get an empty string).  Empty strings evaluate as false, which is why the 
"if not (oldline or newline): break" test exits once both files are 
exhausted.

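    # oldfile and newfile are already-open file objects.
    old_records = []   # lines seen so far only in the old file
    new_records = []   # lines seen so far only in the new file

    while True:
        oldline, newline = oldfile.readline(), newfile.readline()
        # readline() returns '' at EOF; stop once both files are exhausted.
        if not (oldline or newline):
            break
        if oldline == newline:
            continue

        # If this old line already showed up in the new file, the pair
        # cancels out; otherwise hold it as pending (skipping '' at EOF).
        try:
            new_records.remove(oldline)
        except ValueError:
            if oldline:
                old_records.append(oldline)

        # The same check in the other direction for the new line.
        try:
            old_records.remove(newline)
        except ValueError:
            if newline:
                new_records.append(newline)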
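
To try the loop without real files, here is a minimal self-contained 
sketch.  The function name diff_records and the sample data are made up for 
illustration, and io.StringIO objects stand in for the open files; the loop 
body is the same as above.

```python
import io


def diff_records(oldfile, newfile):
    """Return (old_only, new_only): lines with no match in the other file.

    diff_records is a hypothetical wrapper name, not part of the original
    script. Both arguments are file-like objects with a readline() method.
    """
    old_records = []
    new_records = []

    while True:
        oldline, newline = oldfile.readline(), newfile.readline()
        # readline() returns '' at EOF; stop once both files are exhausted.
        if not (oldline or newline):
            break
        if oldline == newline:
            continue

        # Cancel the old line against a pending new line, or hold it.
        try:
            new_records.remove(oldline)
        except ValueError:
            if oldline:
                old_records.append(oldline)

        # Cancel the new line against a pending old line, or hold it.
        try:
            old_records.remove(newline)
        except ValueError:
            if newline:
                new_records.append(newline)

    return old_records, new_records


# Sample data (invented): "b" was dropped and "d" was added.
old = io.StringIO("a\nb\nc\n")
new = io.StringIO("a\nc\nd\n")
removed, added = diff_records(old, new)
print(removed, added)
```

Because list.remove() does a linear search, the cost of each mismatched 
line grows with the number of unmatched records being held at that moment.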

> Hope this gives you some idea.

It did.  It must've been a long work week, because that all seems so obvious 
in retrospect but was completely opaque at the time.  Thanks again!
-- 
Kirk Strauser
