If the lines really are sorted, all you need is a merge: read one line from each source, and if they are equal, read another from each. If one line is less, output the lesser line with an appropriate tag and refresh that line from its source. Stop when either source runs out, then flush the rest of the other source to the output, again with the appropriate tag.

Time is linear in the input size, and memory use is negligible (you only ever hold one line from each source).
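
Something like this minimal sketch, for instance (untested; the merge_diff name and the '<'/'>' tags are just placeholders I picked, and it assumes the last line of each file ends with a newline):

import sys


def merge_diff(fname_a, fname_b, out=sys.stdout):
    # Single-pass merge of two sorted files: common lines are dropped,
    # lines only in A are tagged '<', lines only in B are tagged '>'.
    with open(fname_a) as fa, open(fname_b) as fb:
        a = fa.readline()
        b = fb.readline()
        while a and b:
            if a == b:            # in both: advance both sources
                a = fa.readline()
                b = fb.readline()
            elif a < b:           # only in A: emit it, refresh A
                out.write('< ' + a)
                a = fa.readline()
            else:                 # only in B: emit it, refresh B
                out.write('> ' + b)
                b = fb.readline()
        # One source has run out: flush the rest of the other.
        while a:
            out.write('< ' + a)
            a = fa.readline()
        while b:
            out.write('> ' + b)
            b = fb.readline()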

Marco Mariani wrote:


You can adapt and use this, provided the files are already sorted. Memory usage scales linearly with the size of the difference between the files, and time scales linearly with the file sizes.


#!/usr/bin/env python3

import sys


def run(fname_a, fname_b):
    filea = open(fname_a)
    fileb = open(fname_b)
    a_lines = set()   # lines seen so far only in file A
    b_lines = set()   # lines seen so far only in file B

    while True:
        a = filea.readline()
        b = fileb.readline()
        if not (a or b):
            # both files exhausted
            break

        if a == b:
            # identical lines at the same position: nothing to record
            continue

        # if A's line was seen earlier in B, it is common after all
        if a in b_lines:
            b_lines.remove(a)
        elif a:
            a_lines.add(a)

        # and likewise for B's line
        if b in a_lines:
            a_lines.remove(b)
        elif b:
            b_lines.add(b)

    # lines unique to A, a separator, then lines unique to B
    # (the lines keep their trailing newlines, so print with end='')
    for line in a_lines:
        print(line, end='')

    if a_lines or b_lines:
        print()
        print('***************')
        print()

    for line in b_lines:
        print(line, end='')


if __name__ == '__main__':
    run(sys.argv[1], sys.argv[2])
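
For what it's worth, a sample invocation (the script and file names here are just placeholders; the inputs must already be sorted, e.g. with the Unix sort command):

    sort a.txt > a_sorted.txt
    sort b.txt > b_sorted.txt
    python3 compare_sorted.py a_sorted.txt b_sorted.txt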

