Hi all,

I have a large file, 1.5 GB in size, with about 6 million lines of tab-delimited data. I have to filter the data and keep only the good rows; after filtering, about 5.5 million lines remain. As you might have guessed, I have to read the file in batches, which I do with .readlines(100000000). After reading each batch, I split each line (a string) into a list using .split("\t"), then check several conditions; if all conditions are satisfied, I append the list to a matrix.

The code is as follows:
-----Start------
a = open("bigfile")
matrix = []
while True:
    lines = a.readlines(100000000)  # read roughly the next 100 MB worth of lines
    if not lines:
        break
    for line in lines:
        data = line.split("\t")
        if several_conditions_are_satisfied:  # placeholder for the actual filter
            matrix.append(data)
    print "Number of lines read:", len(lines), "matrix.__sizeof__:", matrix.__sizeof__()
-----End-----

Results:
Number of lines read: 461544 matrix.__sizeof__: 1694768
Number of lines read: 449840 matrix.__sizeof__: 3435984
Number of lines read: 455690 matrix.__sizeof__: 5503904
Number of lines read: 451955 matrix.__sizeof__: 6965928
Number of lines read: 452645 matrix.__sizeof__: 8816304
Number of lines read: 448555 matrix.__sizeof__: 9918368

Traceback (most recent call last):
MemoryError

The peak memory usage shown in the task manager exceeds 2 GB, which is when the MemoryError is raised.

However, if I modify the code to store the data as a list of strings rather than a list of lists, by changing the append statement above to matrix.append("\t".join(data)), then I do not run out of memory.
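For clarity, that is the only change; the inner loop of the modified version (still a sketch, with several_conditions_are_satisfied standing in for the actual filter) looks like this:

-----Start------
for line in lines:
    data = line.split("\t")
    if several_conditions_are_satisfied:
        # rejoin the fields into a single string before storing
        matrix.append("\t".join(data))
-----End-----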

Results:
Number of lines read: 461544 matrix.__sizeof__: 1694768
Number of lines read: 449840 matrix.__sizeof__: 3435984
Number of lines read: 455690 matrix.__sizeof__: 5503904
Number of lines read: 451955 matrix.__sizeof__: 6965928
Number of lines read: 452645 matrix.__sizeof__: 8816304
Number of lines read: 448555 matrix.__sizeof__: 9918368
Number of lines read: 453455 matrix.__sizeof__: 12552984
Number of lines read: 432440 matrix.__sizeof__: 14122132
Number of lines read: 432921 matrix.__sizeof__: 15887424
Number of lines read: 464259 matrix.__sizeof__: 17873376
Number of lines read: 450875 matrix.__sizeof__: 20107572
Number of lines read: 458552 matrix.__sizeof__: 20107572
Number of lines read: 453261 matrix.__sizeof__: 22621044
Number of lines read: 413456 matrix.__sizeof__: 22621044
Number of lines read: 166464 matrix.__sizeof__: 25448700
Number of lines read: 0 matrix.__sizeof__: 25448700

In this case, the peak memory according to the task manager is about 1.5 GB.

Does anyone know why there is such a big difference in memory usage between storing the matrix as a list of lists and storing it as a list of strings? According to __sizeof__, the values are the same either way. Is there any way I can store all the info as a list of lists? I have tried creating a matrix of equivalent size directly and it only uses about 35 MB of memory, so I am not sure why, with the code above, the memory usage shoots up so fast and exceeds 2 GB.
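Here is a small demonstration of the measurement I mean (a sketch using sys.getsizeof, which wraps __sizeof__); as far as I can tell, the outer list only accounts for its own header and pointer array, not for the objects it references, so any per-element overhead never shows up in matrix.__sizeof__():

-----Start------
import sys

outer = [["a", "b"], ["c", "d"]]   # list of lists
joined = ["a\tb", "c\td"]          # list of strings

# The two outer containers report the same size, since each just
# holds two pointers:
print sys.getsizeof(outer), sys.getsizeof(joined)

# But each inner list carries its own per-object overhead, which a
# single joined string avoids:
print sum(sys.getsizeof(x) for x in outer)
print sum(sys.getsizeof(x) for x in joined)
-----End-----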

Any advice is greatly appreciated.

Regards,
Jinxiang