On 6/20/2011 7:59 PM, king6c...@gmail.com wrote:
> Hi,
> I have two large files, each with more than 200000000 lines. Each
> line consists of two fields, an id and a value, and
> the ids are sorted.
>
> for example:
>
> file1
> (uin_a y)
> 1 10000245
> 2 12333
> 3 324543
> 5 3464565
> ....
>
>
> file2
> (uin_b gift)
> 1 34545
> 3 6436466
> 4 35345646
> 5 463626
> ....
>
> I want to merge them into one file whose lines each consist of an
> id and the sum of the two values from file1 and file2.
> The code is below:
>
> uin_y=open('file1')
> uin_gift=open('file2')
>
> y_line=uin_y.next()
> gift_line=uin_gift.next()
>
> while 1:
>     try:
>         uin_a,y=[int(i) for i in y_line.split()]
>         uin_b,gift=[int(i) for i in gift_line.split()]
>         if uin_a==uin_b:
>             score=y+gift
>             print uin_a,score
>             y_line=uin_y.next()
>             gift_line=uin_gift.next()
>         if uin_a<uin_b:
>             print uin_a,y
>             y_line=uin_y.next()
>         if uin_a>uin_b:
>             print uin_b,gift
>             gift_line=uin_gift.next()
>     except StopIteration:
>         break
>
>
> The problem is that this code runs for 40+ minutes on a server (16
> cores, 32 GB of memory).
> The time complexity is O(n), and there are not many operations, so
> I think it should be faster. I want to ask which part costs so much.
> I tried the cProfile module but didn't learn much from it.
> I guess it may be the int() conversion that costs so much, but I'm not
> sure
> and I don't know how to solve this.
> Is there a way to avoid type conversion in Python, like scanf in C?
> Thanks for your help :)

Unfortunately Python does not have a scanf equivalent, AFAIK. Most use
cases for scanf can be handled by regular expressions, but that would
clearly be useless for you and would just slow you down more, since a
regular expression does not perform the int conversion for you.
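
To illustrate the point about regular expressions, here is a small sketch
(written for Python 3; the pattern PAIR and both function names are made up
for this example). The regex only extracts substrings, so int() is still
needed afterwards, exactly as with a plain split():

```python
import re

# Hypothetical pattern for an "id value" line; not from the original post.
PAIR = re.compile(r"(\d+)\s+(\d+)")

def parse_with_regex(line):
    """Parse 'id value' with a regex; the groups are still strings."""
    match = PAIR.match(line)
    return int(match.group(1)), int(match.group(2))

def parse_with_split(line):
    """Parse 'id value' with str.split(); the same int() calls remain."""
    key, value = line.split()
    return int(key), int(value)
```

Either way the two int() calls are unavoidable, so the regex version only
adds matching work on top of them.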

Your code appears to have a bug: as soon as one file is exhausted, the
StopIteration ends the loop, so I would expect the entries still pending
in the other file to be lost unless both files end with the same index
value. Be sure to test your code on a few short test files.
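
One possible way to avoid losing the tail, as a rough sketch rather than a
drop-in replacement: it is written for Python 3, the name merge_sums is made
up for this example, and it assumes every non-blank line holds exactly two
integers and that ids are sorted and unique within each input.

```python
def merge_sums(lines_a, lines_b):
    """Yield (id, total) pairs from two iterables of 'id value' lines.

    Assumes both inputs are sorted by id with no duplicate ids.
    """
    def parse(lines):
        for line in lines:
            if line.strip():  # skip blank lines
                key, value = line.split()
                yield int(key), int(value)

    it_a, it_b = parse(lines_a), parse(lines_b)
    # next(it, None) returns None at the end instead of raising StopIteration
    a = next(it_a, None)
    b = next(it_b, None)
    while a is not None and b is not None:
        if a[0] == b[0]:
            yield a[0], a[1] + b[1]
            a, b = next(it_a, None), next(it_b, None)
        elif a[0] < b[0]:
            yield a
            a = next(it_a, None)
        else:
            yield b
            b = next(it_b, None)
    # One input is exhausted; flush whatever is left in the other
    while a is not None:
        yield a
        a = next(it_a, None)
    while b is not None:
        yield b
        b = next(it_b, None)
```

With two open file objects it could be used as
"for uin, score in merge_sums(open('file1'), open('file2')): print(uin, score)".
Short lists of strings work just as well as input, which makes it easy to
try on small test cases.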

I recommend psyco to make the whole thing faster.

Regards,
Ken Seehart

-- 
http://mail.python.org/mailman/listinfo/python-list
