Hi, I have two large files, each with more than 200,000,000 lines. Each line consists of two fields: one is an id and the other a value, and the ids are sorted.
For example:

file1 (uin_a y):
1 10000245
2 12333
3 324543
5 3464565
....

file2 (uin_b gift):
1 34545
3 6436466
4 35345646
5 463626
....

I want to merge them into one file, each line of which consists of an id and the sum of the two values from file1 and file2. The code is below:

    uin_y = open('file1')
    uin_gift = open('file2')
    y_line = uin_y.next()
    gift_line = uin_gift.next()
    while 1:
        try:
            # parse both current lines into (id, value) pairs of ints
            uin_a, y = [int(i) for i in y_line.split()]
            uin_b, gift = [int(i) for i in gift_line.split()]
            if uin_a == uin_b:
                score = y + gift
                print uin_a, score
                y_line = uin_y.next()
                gift_line = uin_gift.next()
            elif uin_a < uin_b:
                print uin_a, y
                y_line = uin_y.next()
            elif uin_a > uin_b:
                print uin_b, gift
                gift_line = uin_gift.next()
        # note: once either file is exhausted, StopIteration ends the loop,
        # so any remaining lines in the other file are dropped
        except StopIteration:
            break

The question is that this code runs for 40+ minutes on a server (16 cores, 32 GB of memory). The time complexity is O(n), and there aren't many operations per line, so I think it should be faster. Which part costs so much? I tried the cProfile module but didn't get much out of it. I guess maybe it is the int() conversion that costs so much, but I'm not sure and don't know how to verify this. Is there a way to avoid type conversion in Python, like scanf in C? Thanks for your help :)
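To check whether int() parsing really dominates, one rough first step would be to time a parse-only pass over a single file in isolation. A minimal sketch (Python 2, as in the code above; 'file1' is the same input file):

    import time

    t0 = time.time()
    for line in open('file1'):
        # the same split-and-convert work the merge loop does,
        # with no merging and no printing
        uin, y = [int(i) for i in line.split()]
    print 'parse-only pass over file1:', time.time() - t0, 'seconds'

If this pass alone accounts for a large share of the 40 minutes, parsing is the bottleneck; if not, the 200,000,000 per-line print statements are the more likely suspect.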
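Even without scanf-style parsing, two loop-level changes might help: parse each line only once (the loop above re-splits and re-converts both lines on every iteration, even when only one file advanced), and replace the print statements with preformatted writes to an output file. A minimal sketch under those assumptions (Python 2; merge2 and the output-file name 'merged' are hypothetical, not from the original post):

    import sys

    def merge2(path_a, path_b, out):
        write = out.write  # bind the method once, outside the hot loop
        fa = open(path_a)
        fb = open(path_b)
        try:
            # parse each line exactly once, re-reading only the file
            # whose cursor actually advanced
            uin_a, y = [int(i) for i in fa.next().split()]
            uin_b, gift = [int(i) for i in fb.next().split()]
            while 1:
                if uin_a == uin_b:
                    write('%d %d\n' % (uin_a, y + gift))
                    uin_a, y = [int(i) for i in fa.next().split()]
                    uin_b, gift = [int(i) for i in fb.next().split()]
                elif uin_a < uin_b:
                    write('%d %d\n' % (uin_a, y))
                    uin_a, y = [int(i) for i in fa.next().split()]
                else:
                    write('%d %d\n' % (uin_b, gift))
                    uin_b, gift = [int(i) for i in fb.next().split()]
        except StopIteration:
            # same tail behavior as the original: leftover lines in the
            # longer file are dropped once the other file is exhausted
            pass

    merge2('file1', 'file2', open('merged', 'w'))

Writing preformatted strings through a bound write avoids the per-statement overhead of print and keeps the output fully buffered.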