If the lines really are sorted, all you need is a merge: read one line
from each source and, if the two are equal, read another from each. If
one is less than the other, output the lesser line with the appropriate
tag, and refresh that one from its source. Stop when either source has
run out, then flush the rest of the other source to the output, with the
appropriate tag.
Time is linear, and memory use is negligible.
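
A minimal sketch of that merge, in case it helps. The function name
merge_diff and the "<" / ">" tags are my own choices for illustration,
not anything from the original poster's program. It assumes both inputs
are sorted in the same order that Python's string comparison uses (not,
say, a locale-aware sort), and it pairs duplicate lines off one for one.

```python
def merge_diff(lines_a, lines_b, tag_a="<", tag_b=">"):
    """Yield (tag, line) for each line found in only one of two sorted inputs."""
    iter_a, iter_b = iter(lines_a), iter(lines_b)
    # None marks an exhausted source; real lines are never None.
    a = next(iter_a, None)
    b = next(iter_b, None)
    while a is not None and b is not None:
        if a == b:
            # Present in both: skip it and refresh both sides.
            a = next(iter_a, None)
            b = next(iter_b, None)
        elif a < b:
            # a cannot appear later in b, since b is sorted.
            yield tag_a, a
            a = next(iter_a, None)
        else:
            yield tag_b, b
            b = next(iter_b, None)
    # One source has run out: flush whatever is left of the other.
    while a is not None:
        yield tag_a, a
        a = next(iter_a, None)
    while b is not None:
        yield tag_b, b
        b = next(iter_b, None)
```

Open file objects can be passed in directly, since they iterate line by
line, so only one line from each file is held in memory at a time.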
Marco Mariani wrote:
You can adapt and use this, provided the files are already sorted.
Memory usage scales linearly with the size of the file difference, and
time scales linearly with file sizes.
#!/usr/bin/env python
import sys

def run(fname_a, fname_b):
    filea = open(fname_a)
    fileb = open(fname_b)
    # Lines seen so far in only one of the two files.
    a_lines = set()
    b_lines = set()
    while True:
        a = filea.readline()
        b = fileb.readline()
        if not (a or b):
            break  # both files are exhausted
        if a == b:
            continue
        # Cancel each line against an unmatched one from the other
        # file, or remember it as unmatched.
        if a in b_lines:
            b_lines.remove(a)
        elif a:
            a_lines.add(a)
        if b in a_lines:
            a_lines.remove(b)
        elif b:
            b_lines.add(b)
    for line in a_lines:
        print(line)
    if a_lines or b_lines:
        print('')
        print('***************')
        print('')
    for line in b_lines:
        print(line)

if __name__ == '__main__':
    run(sys.argv[1], sys.argv[2])