I just wrote something. I could not run a profiler or analyze the timing, but I felt it was efficient. Have a look and see if it helps:
from itertools import *

def sep_add(line1, line2):
    if line1 and line2:
        val1 = line1.split()
        val2 = line2.split()
        if (val1 and val2) and (len(val1) == len(val2) == 2):
            return (val1[0] == val2[0])

def add_col_files(file1, file2):
    fs1 = open(file1, "r")
    fs2 = open(file2, "r")
    fsn = open("new_sample.txt", "w")
    # Zip the files together and keep the pairs that match on the index;
    # process the tuple accordingly -- the output should be whatever you
    # want to write to the file.
    for k in ifilter(lambda (i, j): sep_add(i, j), izip(fs1, fs2)):
        if k:
            # strip the trailing newlines before joining the two lines
            output = k[0].rstrip() + " " + k[1].rstrip() + "\n"  # sample output
            fsn.write(output)
    fsn.close()
    fs1.close()
    fs2.close()

if __name__ == "__main__":
    import time
    start = time.localtime(time.time())
    print time.asctime(start)
    #add_col_files("sample1.txt", "sample3.txt")
    end = time.localtime(time.time())
    print time.asctime(end)

It took about a minute on my machine for files of about 50-100 MB. As I said, I haven't done much testing on this code.

2011/6/21 Ken Seehart <k...@seehart.com>:
> On 6/20/2011 10:31 PM, Ken Seehart wrote:
>> On 6/20/2011 7:59 PM, king6c...@gmail.com wrote:
>>> Hi,
>>> I have two large files, each of which has more than 200000000 lines,
>>> and each line consists of two fields, one the id and the other a
>>> value; the ids are sorted.
>>>
>>> For example:
>>>
>>> file1
>>> (uin_a y)
>>> 1 10000245
>>> 2 12333
>>> 3 324543
>>> 5 3464565
>>> ....
>>>
>>> file2
>>> (uin_b gift)
>>> 1 34545
>>> 3 6436466
>>> 4 35345646
>>> 5 463626
>>> ....
>>>
>>> I want to merge them and get a file whose lines consist of an id and
>>> the sum of the two values from file1 and file2.
>>> The code is as below:
>>>
>>> uin_y=open('file1')
>>> uin_gift=open('file2')
>>>
>>> y_line=uin_y.next()
>>> gift_line=uin_gift.next()
>>>
>>> while 1:
>>>     try:
>>>         uin_a,y=[int(i) for i in y_line.split()]
>>>         uin_b,gift=[int(i) for i in gift_line.split()]
>>>         if uin_a==uin_b:
>>>             score=y+gift
>>>             print uin_a,score
>>>             y_line=uin_y.next()
>>>             gift_line=uin_gift.next()
>>>         if uin_a<uin_b:
>>>             print uin_a,y
>>>             y_line=uin_y.next()
>>>         if uin_a>uin_b:
>>>             print uin_b,gift
>>>             gift_line=uin_gift.next()
>>>     except StopIteration:
>>>         break
>>>
>>> The question is that this code runs 40+ minutes on a server (16
>>> cores, 32 GB memory). The time complexity is O(n), and there are not
>>> too many operations, so I think it should be faster. I want to ask
>>> which part costs so much. I tried the cProfile module but didn't get
>>> much out of it. I guess maybe it is the int() operation that costs so
>>> much, but I'm not sure and don't know how to solve this.
>>> Is there a way to avoid type conversion in Python, such as scanf in C?
>>> Thanks for your help :)
>>
>> Unfortunately Python does not have a scanf equivalent AFAIK. Most use
>> cases for scanf can be handled by regular expressions, but that would
>> clearly be useless for you, and would just slow you down more, since
>> it does not perform the int conversion for you.
>>
>> Your code appears to have a bug: I would expect that the last entry
>> will be lost unless both files end with the same index value. Be sure
>> to test your code on a few short test files.
>>
>> I recommend psyco to make the whole thing faster.
>>
>> Regards,
>> Ken Seehart
>
> Another thought (a bit of extra work, but you might find it worthwhile
> if psyco doesn't yield a sufficient boost):
>
> Write a couple of filter programs to convert to and from binary data
> (pairs of 32 or 64 bit integers, depending on your requirements).
>
> Modify your program to use the subprocess module to open two instances
> of the binary conversion process with the two input files.
> Then pipe the output of that program into the binary-to-text filter.
>
> This might turn out to be faster since each process would make use of
> a core. Also it gives you other options, such as keeping your data in
> binary form for processing, and converting to text only as needed.
>
> Ken Seehart
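Regarding Ken's binary filter idea, a rough sketch of the two converters might look like the following. Completely untested, and the details are my own assumptions: I'm guessing at unsigned 64-bit ids and values packed little-endian with the struct module, and the pack/unpack command-line switch is just something I made up for illustration.

import struct
import sys

# Filter: read "id value" text lines on stdin, write packed pairs of
# unsigned 64-bit integers ("<QQ", little-endian) on stdout.
def text_to_bin(instream, outstream):
    pack = struct.Struct("<QQ").pack
    for line in instream:
        fields = line.split()
        if len(fields) == 2:
            outstream.write(pack(int(fields[0]), int(fields[1])))

# Filter: the reverse direction, 16-byte records back to text lines.
def bin_to_text(instream, outstream):
    unpack = struct.Struct("<QQ").unpack
    while 1:
        record = instream.read(16)
        if len(record) < 16:
            break
        outstream.write("%d %d\n" % unpack(record))

if __name__ == "__main__":
    if sys.argv[1] == "pack":
        text_to_bin(sys.stdin, sys.stdout)
    else:
        bin_to_text(sys.stdin, sys.stdout)

Each record is then a fixed 16 bytes, so the merge loop can read and unpack pairs without any text parsing.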
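And the subprocess wiring could be roughly like this. Again untested: convert.py is the hypothetical filter above, and merge() stands in for whatever merge loop you run over the two binary streams.

import subprocess
import sys

# Start one text-to-binary converter per input file; each one runs in
# its own process and hands us a pipe of packed 64-bit pairs.
p1 = subprocess.Popen(["python", "convert.py", "pack"],
                      stdin=open("file1"), stdout=subprocess.PIPE)
p2 = subprocess.Popen(["python", "convert.py", "pack"],
                      stdin=open("file2"), stdout=subprocess.PIPE)

# A binary-to-text filter on the way out, so the merge loop itself
# never touches text at all.
out = subprocess.Popen(["python", "convert.py", "unpack"],
                       stdin=subprocess.PIPE, stdout=sys.stdout)

# merge() is your merge loop, reading 16-byte records from the two
# pipes and writing summed records to out.stdin.
merge(p1.stdout, p2.stdout, out.stdin)
out.stdin.close()
p1.wait(); p2.wait(); out.wait()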