Memory issues when storing as List of Strings vs List of List

2010-11-30 Thread OW Ghim Siong

Hi all,

I have a big file, 1.5GB in size, with about 6 million lines of 
tab-delimited data. I have to perform some filtration on the data and 
keep the good data. After filtration, about 5.5 million lines of data 
remain. As you might have already guessed, I have to read them in 
batches, and I did so using .readlines(1). After reading each batch, I 
split each line (a string) into a list using .split("\t") and then 
check several conditions; if all conditions are satisfied, I store the 
list in a matrix.


The code is as follows:
-Start--
a=open(bigfile)
matrix=[]
while True:
    lines = a.readlines(1)
    for line in lines:
        data=line.split("\t")
        if several_conditions_are_satisfied:
            matrix.append(data)
    print "Number of lines read:", len(lines), "matrix.__sizeof__:", matrix.__sizeof__()
    if len(lines)==0:
        break
-End-

Results:
Number of lines read: 461544 matrix.__sizeof__: 1694768
Number of lines read: 449840 matrix.__sizeof__: 3435984
Number of lines read: 455690 matrix.__sizeof__: 5503904
Number of lines read: 451955 matrix.__sizeof__: 6965928
Number of lines read: 452645 matrix.__sizeof__: 8816304
Number of lines read: 448555 matrix.__sizeof__: 9918368

Traceback (most recent call last):
MemoryError

The peak memory usage in the task manager is over 2 GB, which results in 
the memory error.


However, if I modify the code to store a list of strings rather than 
a list of lists, by changing the append statement stated above to 
matrix.append("\t".join(data)), then I do not run out of memory.


Results:
Number of lines read: 461544 matrix.__sizeof__: 1694768
Number of lines read: 449840 matrix.__sizeof__: 3435984
Number of lines read: 455690 matrix.__sizeof__: 5503904
Number of lines read: 451955 matrix.__sizeof__: 6965928
Number of lines read: 452645 matrix.__sizeof__: 8816304
Number of lines read: 448555 matrix.__sizeof__: 9918368
Number of lines read: 453455 matrix.__sizeof__: 12552984
Number of lines read: 432440 matrix.__sizeof__: 14122132
Number of lines read: 432921 matrix.__sizeof__: 15887424
Number of lines read: 464259 matrix.__sizeof__: 17873376
Number of lines read: 450875 matrix.__sizeof__: 20107572
Number of lines read: 458552 matrix.__sizeof__: 20107572
Number of lines read: 453261 matrix.__sizeof__: 22621044
Number of lines read: 413456 matrix.__sizeof__: 22621044
Number of lines read: 166464 matrix.__sizeof__: 25448700
Number of lines read: 0 matrix.__sizeof__: 25448700

In this case, the peak memory according to the task manager is about 1.5 GB.

Does anyone know why there is such a big difference in memory usage when 
storing the matrix as a list of lists, and when storing it as a list of 
strings? According to __sizeof__, though, the values are the same whether 
storing it as a list of lists, or storing it as a list of strings. Is 
there any method by which I can store all the info as a list of lists? I 
have tried creating a matrix of equivalent size and it only uses 
35 MB of memory, but I am not sure why, when using the code above, the 
memory usage shot up so fast and exceeded 2 GB.


Any advice is greatly appreciated.

Regards,
Jinxiang


Re: Memory issues when storing as List of Strings vs List of List

2010-11-30 Thread Ulrich Eckhardt
OW Ghim Siong wrote:
> I have a big file, 1.5GB in size, with about 6 million lines of
> tab-delimited data.

How many fields are there on each line?

> I have to perform some filtration on the data and
> keep the good data. After filtration, about 5.5 million lines of data
> remain. As you might have already guessed, I have to read them in
> batches, and I did so using .readlines(1).

I'd have guessed differently. Typically, I would say that you read one line,
apply whatever operation you want to it and then write out the result. At
least that is the typical operation of filtering.
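
For instance, a minimal sketch of such a streaming filter (the file names
and the condition here are placeholders, not taken from the original post):

    # Only the current line is ever held in memory; rows that pass
    # the filter go straight to the output file.
    with open("input.txt", "rb") as src:
        with open("filtered.txt", "wb") as dst:
            for line in src:
                data = line.split("\t")
                if len(data) > 2:  # placeholder for the real conditions
                    dst.write(line)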

> a=open(bigfile)

I guess you are on MS Windows. There, you have different handling of textual
and non-textual files with regard to the handling of line endings.
Generally, using non-textual input is easier, because it doesn't require
any translations. However, textual input is the default, therefore:

  a = open(bigfile, "rb")

Or, even better:

  with open(bigfile, "rb") as a:

to make sure the file is closed correctly and in time.

> matrix=[]
> while True:
>     lines = a.readlines(1)
>     for line in lines:

I believe you could do

  for line in a:
      # use line here

> data=line.split("\t")

Question here: How many elements does each line contain? And what is their
content? The point is that each object has its overhead, and if the content
is just e.g. an integral number or a short string, the ratio of interesting
content to overhead is rather bad! Compare this to storing a longer string
with just the overhead of a single string object instead, and the difference
should be obvious.
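
A rough way to see that ratio (a sketch only; the absolute numbers vary
with platform and Python version):

    import sys

    fields = ["123", "4.56", "foo"]   # one small object per field, plus the list
    joined = "\t".join(fields)        # the same characters in a single object

    as_list = sys.getsizeof(fields) + sum(sys.getsizeof(f) for f in fields)
    as_string = sys.getsizeof(joined)
    print as_list, as_string          # the list form costs several times more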

> However, if I modify the code to store a list of strings rather than
> a list of lists, by changing the append statement stated above to
> matrix.append("\t".join(data)), then I do not run out of memory.

You already have the result of that join:

  matrix.append(line)

> Does anyone know why there is such a big difference in memory usage when
> storing the matrix as a list of lists, and when storing it as a list of
> strings? According to __sizeof__, though, the values are the same whether
> storing it as a list of lists, or storing it as a list of strings.

I can barely believe that. How are you using __sizeof__? Why aren't you
using sys.getsizeof() instead? Are you aware that the size of a list
doesn't include the size for its content (even though it grows with the
number of elements), while the size of a string does?


> Is there any method by which I can store all the info as a list of lists? I
> have tried creating a matrix of equivalent size and it only uses
> 35 MB of memory, but I am not sure why, when using the code above, the
> memory usage shot up so fast and exceeded 2 GB.

The size of an empty list is 20 bytes here, plus 4 per element (makes sense
on a 32-bit machine), excluding the elements themselves. That means that you
have around 6.4M slots (25448700/4), enough for your 5.5M rows plus the
over-allocation that lists use when they grow. The list itself thus takes
around 25 MB of memory, which is in the region of the 35 MB you are seeing.
The point is that your 35 MB don't include any content; probably you filled
the list with a single interned integer or None, so that all elements of
your list are the same object and its memory is only required once. In your
real-world application that is obviously not so.

My suggestions:
1. Find out what exactly is going on here, in particular why our
interpretations of the memory usage differ.
2. Redesign your code to really use a filtering design, i.e. don't keep the
whole data in memory.
3. If you still have memory issues, take a look at the array module, which
should make storage of large amounts of homogeneous numeric data a bit more
efficient; a sketch follows.
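
For point 3, a sketch of what that might look like if the interesting
columns are numeric (an assumption on my part; the array module only
stores homogeneous C-typed values):

    from array import array

    col1 = array("d")   # each value stored as a C double, not a Python object
    col2 = array("d")
    for line in open(bigfile, "rb"):
        data = line.split("\t")
        if several_conditions_are_satisfied:
            col1.append(float(data[0]))
            col2.append(float(data[1]))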


Good luck!

Uli



Re: Memory issues when storing as List of Strings vs List of List

2010-11-30 Thread Peter Otten
OW Ghim Siong wrote:

> Hi all,
>
> I have a big file, 1.5GB in size, with about 6 million lines of
> tab-delimited data. I have to perform some filtration on the data and
> keep the good data. After filtration, about 5.5 million lines of data
> remain. As you might have already guessed, I have to read them in
> batches, and I did so using .readlines(1). After reading each batch, I
> split each line (a string) into a list using .split("\t") and then
> check several conditions; if all conditions are satisfied, I store the
> list in a matrix.
>
> The code is as follows:
> -Start--
> a=open(bigfile)
> matrix=[]
> while True:
>     lines = a.readlines(1)
>     for line in lines:
>         data=line.split("\t")
>         if several_conditions_are_satisfied:
>             matrix.append(data)
>     print "Number of lines read:", len(lines), "matrix.__sizeof__:", matrix.__sizeof__()
>     if len(lines)==0:
>         break
> -End-

As Ulrich says, don't use readlines(); use

for line in a:
    ...

That way you have only one line in memory at a time instead of the huge
lines list.

> Results:
> Number of lines read: 461544 matrix.__sizeof__: 1694768
> Number of lines read: 449840 matrix.__sizeof__: 3435984
> Number of lines read: 455690 matrix.__sizeof__: 5503904
> Number of lines read: 451955 matrix.__sizeof__: 6965928
> Number of lines read: 452645 matrix.__sizeof__: 8816304
> Number of lines read: 448555 matrix.__sizeof__: 9918368
>
> Traceback (most recent call last):
> MemoryError
>
> The peak memory usage in the task manager is over 2 GB, which results
> in the memory error.
>
> However, if I modify the code to store a list of strings rather than
> a list of lists, by changing the append statement stated above to
> matrix.append("\t".join(data)), then I do not run out of memory.
>
> Results:
> Number of lines read: 461544 matrix.__sizeof__: 1694768
> Number of lines read: 449840 matrix.__sizeof__: 3435984
> Number of lines read: 455690 matrix.__sizeof__: 5503904
> Number of lines read: 451955 matrix.__sizeof__: 6965928
> Number of lines read: 452645 matrix.__sizeof__: 8816304
> Number of lines read: 448555 matrix.__sizeof__: 9918368
> Number of lines read: 453455 matrix.__sizeof__: 12552984
> Number of lines read: 432440 matrix.__sizeof__: 14122132
> Number of lines read: 432921 matrix.__sizeof__: 15887424
> Number of lines read: 464259 matrix.__sizeof__: 17873376
> Number of lines read: 450875 matrix.__sizeof__: 20107572
> Number of lines read: 458552 matrix.__sizeof__: 20107572
> Number of lines read: 453261 matrix.__sizeof__: 22621044
> Number of lines read: 413456 matrix.__sizeof__: 22621044
> Number of lines read: 166464 matrix.__sizeof__: 25448700
> Number of lines read: 0 matrix.__sizeof__: 25448700
>
> In this case, the peak memory according to the task manager is about
> 1.5 GB.
>
> Does anyone know why there is such a big difference in memory usage when
> storing the matrix as a list of lists, and when storing it as a list of
> strings? According to __sizeof__, though, the values are the same whether
> storing it as a list of lists, or storing it as a list of strings. Is

sizeof gives you the shallow size of the list, basically the memory to 
hold C pointers to the items in the list. A better approximation for the 
total size of a list of lists of strings is

>>> from sys import getsizeof as sizeof
>>> matrix = [["alpha", "beta"], ["gamma", "delta"]]
>>> sizeof(matrix), sum(sizeof(row) for row in matrix), sum(sizeof(entry)
... for row in matrix for entry in row)
(88, 176, 179)
>>> sum(_)
443

As you can see, the outer list requires only a small portion of the total 
memory, and its relative size will decrease as the matrix grows.

The above calculation may still be wrong because some of the strings could 
be identical. Collapsing identical strings into a single object is also a 
way to save memory if you have a significant number of repetitions. Try

matrix = []
with open(...) as f:
    for line in f:
        data = line.split("\t")
        if ...:
            matrix.append(map(intern, data))

to see whether it sufficiently reduces the amount of memory needed.
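
The collapsing is easy to observe interactively; intern() returns the one
shared copy of equal strings (the field value below is made up):

>>> a = intern("chr1-forward")
>>> b = intern("chr1" + "-forward")
>>> a is b
True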

> there any method by which I can store all the info as a list of lists? I
> have tried creating a matrix of equivalent size and it only uses
> 35 MB of memory, but I am not sure why, when using the code above, the
> memory usage shot up so fast and exceeded 2 GB.
>
> Any advice is greatly appreciated.
>
> Regards,
> Jinxiang



Re: Memory issues when storing as List of Strings vs List of List

2010-11-30 Thread Tim Chase

On 11/30/2010 04:29 AM, OW Ghim Siong wrote:

> a=open(bigfile)
> matrix=[]
> while True:
>     lines = a.readlines(1)
>     for line in lines:
>         data=line.split("\t")
>         if several_conditions_are_satisfied:
>             matrix.append(data)
>     print "Number of lines read:", len(lines), "matrix.__sizeof__:", matrix.__sizeof__()
>     if len(lines)==0:
>         break


As others have mentioned, don't use .readlines() but use the 
file-object as an iterator instead.  This can even be rewritten 
as a simple list-comprehension:


  from csv import reader
  matrix = [data
            for data
            in reader(file('bigfile.txt', 'rb'), delimiter='\t')
            if several_conditions_are_satisfied(data)
            ]

This assumes that you're throwing away most of the data (i.e. the final 
matrix fits well within memory, even if the source file doesn't).
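
If even the filtered matrix were too large, the same idea works lazily
with a generator expression, producing one row at a time (names reused
from the snippet above; process() is a hypothetical per-row handler):

  rows = (data
          for data
          in reader(file('bigfile.txt', 'rb'), delimiter='\t')
          if several_conditions_are_satisfied(data))
  for data in rows:
      process(data)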


-tkc





Re: Memory issues when storing as List of Strings vs List of List

2010-11-30 Thread Antoine Pitrou
On Tue, 30 Nov 2010 18:29:35 +0800
OW Ghim Siong o...@bii.a-star.edu.sg wrote:
 
> Does anyone know why there is such a big difference in memory usage when
> storing the matrix as a list of lists, and when storing it as a list of
> strings?

That's because any object has a fixed overhead (related to metadata and
allocation), so storing a matrix line as a sequence of several objects
rather than a single string makes the total overhead larger,
especially when the payload of each object is small.

If you want to mitigate the issue, you could store your lines as tuples
rather than lists, since tuples have a smaller memory footprint:

matrix.append(tuple(data))
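
The difference shows up directly in sys.getsizeof (numbers here are from
a 64-bit build and will vary with the platform):

>>> import sys
>>> data = ["a", "b", "c", "d"]
>>> sys.getsizeof(data), sys.getsizeof(tuple(data))
(104, 88)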

> According to __sizeof__, though, the values are the same whether
> storing it as a list of lists, or storing it as a list of strings.

As mentioned by others, __sizeof__ only gives you the size of the
container, not the size of the contained values (which is where the
difference is here).

Regards

Antoine.




Re: Memory issues when storing as List of Strings vs List of List

2010-11-30 Thread Ben Finney
OW Ghim Siong o...@bii.a-star.edu.sg writes:

> I have a big file, 1.5GB in size, with about 6 million lines of
> tab-delimited data. I have to perform some filtration on the data and
> keep the good data. After filtration, about 5.5 million lines of data
> remain. As you might have already guessed, I have to read them in
> batches, and I did so using .readlines(1).

Why do you need to handle the batching in your code? Perhaps you're not
aware that a file object is already an iterator for the lines of text in
the file.

> After reading each batch, I split each line (a string) into a list
> using .split("\t") and then check several conditions; if all
> conditions are satisfied, I store the list in a matrix.

As I understand it, you don't need a line after moving to the next. So
there's no need to maintain a manual buffer of lines at all; please
explain if there is something additional requiring a huge buffer of
input lines.

> The code is as follows:
> -Start--
> a=open(bigfile)
> matrix=[]
> while True:
>     lines = a.readlines(1)
>     for line in lines:
>         data=line.split("\t")
>         if several_conditions_are_satisfied:
>             matrix.append(data)
>     print "Number of lines read:", len(lines), "matrix.__sizeof__:", matrix.__sizeof__()
>     if len(lines)==0:
>         break
> -End-

Using the file's native line iterator::

    infile = open(bigfile)
    matrix = []
    for line in infile:
        record = line.split("\t")
        if several_conditions_are_satisfied:
            matrix.append(record)

> Results:
> Number of lines read: 461544 matrix.__sizeof__: 1694768
> Number of lines read: 449840 matrix.__sizeof__: 3435984
> Number of lines read: 455690 matrix.__sizeof__: 5503904
> Number of lines read: 451955 matrix.__sizeof__: 6965928
> Number of lines read: 452645 matrix.__sizeof__: 8816304
> Number of lines read: 448555 matrix.__sizeof__: 9918368
>
> Traceback (most recent call last):
> MemoryError

If you still get a MemoryError, you can use the ‘pdb’ module
<URL:http://docs.python.org/library/pdb.html> to debug it interactively.

Another option is to catch the MemoryError and construct a diagnostic
message similar to the one you had above::

    import sys

    infile = open(bigfile)
    matrix = []
    for line in infile:
        record = line.split("\t")
        if several_conditions_are_satisfied:
            try:
                matrix.append(record)
            except MemoryError:
                matrix_len = len(matrix)
                sys.stderr.write(
                    "len(matrix): %(matrix_len)d\n" % vars())
                raise

> I have tried creating a matrix of equivalent size and it only
> uses 35 MB of memory, but I am not sure why, when using the code above,
> the memory usage shot up so fast and exceeded 2 GB.
>
> Any advice is greatly appreciated.

With large data sets, and the manipulation and computation you will
likely be wanting to perform, it's probably time to consider the NumPy
library <URL:http://numpy.scipy.org/>, which has much more powerful array
types, as part of the SciPy ecosystem <URL:http://www.scipy.org/>.
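
For example, a sketch of loading an already-filtered, numeric,
tab-delimited file into one compact array (the file name is made up;
numpy.genfromtxt stores the values as C doubles in a single block
rather than as millions of small Python objects)::

    import numpy

    table = numpy.genfromtxt("filtered.txt", delimiter="\t")
    good = table[table[:, 0] > 0]  # stand-in for the real filter conditions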

-- 
 \“[It's] best to confuse only one issue at a time.” —Brian W. |
  `\  Kernighan, Dennis M. Ritchie, _The C programming language_, 1988 |
_o__)  |
Ben Finney