OW Ghim Siong wrote:
> I have a big file 1.5GB in size, with about 6 million lines of
> tab-delimited data.
How many fields are there on each line?

> I have to perform some filtration on the data and keep the good data.
> After filtration, I have about 5.5 million data left remaining. As you
> might already guessed, I have to read them in batches and I did so
> using .readlines(100000000).

I'd have guessed differently. Typically, I would say that you read one
line, apply whatever operation you want to it, and then write out the
result. At least that is the "typical" operation of filtering.

> a=open("bigfile")

I guess you are on MS Windows. There, textual and non-textual files are
handled differently with regard to line endings. Generally, non-textual
(binary) input is easier, because it doesn't require any translation.
However, textual input is the default, therefore:

  a = open("bigfile", "rb")

Or, even better:

  with open("bigfile", "rb") as a:

to make sure the file is closed correctly and in time.

> matrix=[]
> while True:
>     lines = a.readlines(100000000)
>     for line in lines:

I believe you could simply do

  for line in a:
      # use line here

> data=line.split("\t")

Question here: How many elements does each line contain, and what is
their content? The point is that each object has its own overhead, and
if the content is just e.g. an integral number or a short string, the
ratio of interesting content to overhead is rather bad. Compare this to
storing one longer string with just the overhead of a single string
object instead; the difference should be obvious.

> However, if I modify the code, to store as a list of string rather than
> a list of list by changing the append statement stated above to
> "matrix.append("\t".join(data))", then I do not run out of memory.

You already have the result of that join:

  matrix.append(line)

> Does anyone know why is there such a big difference memory usage when
> storing the matrix as a list of list, and when storing it as a list of
> string?
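You can see the per-object overhead directly with sys.getsizeof(). Here
is a small sketch; the 20-field row of short numeric strings is my own
assumption for illustration, not your actual data:

```python
import sys

# One tab-joined row with 20 short fields (assumed sample data).
line = "\t".join(str(i) for i in range(20))
fields = line.split("\t")  # the list-of-strings-per-row form

# Size of the single joined string.
joined_size = sys.getsizeof(line)

# sys.getsizeof() on a list counts only the list header and its
# pointers, not the strings it refers to, so add those explicitly.
split_size = sys.getsizeof(fields) + sum(sys.getsizeof(f) for f in fields)

print("joined:", joined_size, "split:", split_size)
```

On my interpreter the split form is several times larger than the
joined form, because every short field pays the fixed string-object
overhead again; multiplied by millions of rows, that is the difference
you are seeing.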
> According to __sizeof__ though, the values are the same whether
> storing it as a list of list, or storing it as a list of string.

I can barely believe that. How are you using __sizeof__? Why aren't you
using sys.getsizeof() instead? Are you aware that the reported size of
a list doesn't include the size of its content (even though it grows
with the number of elements), while the size of a string does?

> Is there any methods how I can store all the info into a list of list?
> I have tried creating such a matrix of equivalent size and it only uses
> 35mb of memory but I am not sure why when using the code above, the
> memory usage shot up so fast and exceeded 2GB.

The size of an empty list is 20 here, plus 4 per element (which makes
sense on a 32-bit machine), excluding the elements themselves. That
means you have around 6.4M elements (25448700/4), which take around
25 MB of memory, close to what you are probably seeing. The point is
that your 35 MB don't include any content, probably just a single
interned integer or None, so that all elements of your list are the
same and only require memory once. In your real-world application that
is obviously not so.

My suggestions:

1. Find out what exactly is going on here, in particular why our
   interpretations of the memory usage differ.
2. Redesign your code to really use a filtering design, i.e. don't keep
   the whole data in memory.
3. If you still have memory issues, take a look at the array module,
   which should make storage of large arrays a bit more efficient.

Good luck!

Uli

--
Domino Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932
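P.S.: A minimal sketch of the filtering design from suggestion 2. The
keep() criterion (second field non-empty) and the in-memory file
objects are hypothetical stand-ins for your real test and your real
files:

```python
import io

def filter_rows(src, dst, keep):
    """Stream rows from src to dst, writing only those keep() accepts.

    Only one line is held in memory at a time, so the 1.5 GB input
    never has to fit in RAM.
    """
    for line in src:
        if keep(line.rstrip(b"\r\n").split(b"\t")):
            dst.write(line)

# Tiny in-memory demonstration; with real files you would write
#   with open("bigfile", "rb") as src, open("good", "wb") as dst:
#       filter_rows(src, dst, keep)
src = io.BytesIO(b"a\t1\nb\t\nc\t3\n")
dst = io.BytesIO()
filter_rows(src, dst, lambda fields: len(fields) > 1 and fields[1] != b"")
print(dst.getvalue())  # only the rows "a" and "c" survive
```

Note that nothing accumulates in a list here; the "matrix" never
exists, which is what keeps the memory usage flat.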