OW Ghim Siong <o...@bii.a-star.edu.sg> writes:

> I have a big file 1.5GB in size, with about 6 million lines of
> tab-delimited data. I have to perform some filtration on the data and
> keep the good data. After filtration, I have about 5.5 million data
> left remaining. As you might already guessed, I have to read them in
> batches and I did so using .readlines(100000000).
Why do you need to handle the batching in your code? Perhaps you're
not aware that a file object is already an iterator for the lines of
text in the file.

> After reading each batch, I will split the line (in string format) to
> a list using .split("\t") and then check several conditions, after
> which if all conditions are satisfied, I will store the list into a
> matrix.

As I understand it, you don't need a line any more after moving to the
next one. So there's no need to maintain a manual buffer of lines at
all; please explain if there is something additional requiring a huge
buffer of input lines.

> The code is as follows:
>
> -----Start------
> a=open("bigfile")
> matrix=[]
> while True:
>     lines = a.readlines(100000000)
>     for line in lines:
>         data=line.split("\t")
>         if several_conditions_are_satisfied:
>             matrix.append(data)
>     print "Number of lines read:", len(lines), "matrix.__sizeof__:", matrix.__sizeof__()
>     if len(lines)==0:
>         break
> -----End-----

Using the file's native line iterator::

    infile = open("bigfile")
    matrix = []
    for line in infile:
        record = line.split("\t")
        if several_conditions_are_satisfied:
            matrix.append(record)

> Results:
> Number of lines read: 461544 matrix.__sizeof__: 1694768
> Number of lines read: 449840 matrix.__sizeof__: 3435984
> Number of lines read: 455690 matrix.__sizeof__: 5503904
> Number of lines read: 451955 matrix.__sizeof__: 6965928
> Number of lines read: 452645 matrix.__sizeof__: 8816304
> Number of lines read: 448555 matrix.__sizeof__: 9918368
>
> Traceback (most recent call last):
> MemoryError

If you still get a MemoryError, you can use the ‘pdb’ module
<URL:http://docs.python.org/library/pdb.html> to debug it
interactively.

Another option is to catch the MemoryError and construct a diagnostic
message similar to the one you had above::

    import sys

    infile = open("bigfile")
    matrix = []
    for line in infile:
        record = line.split("\t")
        if several_conditions_are_satisfied:
            try:
                matrix.append(record)
            except MemoryError:
                matrix_len = len(matrix)
                sys.stderr.write(
                    "len(matrix): %(matrix_len)d\n" % vars())
                raise

> I have tried creating such a matrix of equivalent size and it only
> uses 35mb of memory but I am not sure why when using the code above,
> the memory usage shot up so fast and exceeded 2GB.
>
> Any advice is greatly appreciated.

With large data sets, and the kind of manipulation and computation you
will likely want to perform on them, it's probably time to consider
the NumPy library <URL:http://numpy.scipy.org/>, part of the SciPy
project <URL:http://www.scipy.org/>, which has much more powerful
array types.
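A rough sketch of that approach (assuming every field in the file is
numeric; the filter on the third column below is only a stand-in for
the real conditions)::

    import numpy

    # numpy.loadtxt parses the tab-delimited text into a single 2-D
    # array of floats, one row per line of the file.
    data = numpy.loadtxt("bigfile", delimiter="\t")

    # Boolean-mask indexing keeps only the rows that satisfy a
    # condition; "third column positive" is a placeholder for
    # whatever the real filtration conditions are.
    matrix = data[data[:, 2] > 0]

Unlike a list of lists of strings, the array stores its values unboxed
in one contiguous block of memory, so 5.5 million rows of numbers cost
roughly their raw size rather than one Python object per field.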