Re: Memory efficient tuple storage
In the end, I used a cStringIO object to store the chromosomes - because there are only 23, I can use one character for each chromosome and represent the whole lot with a giant string, plus a dictionary to say what each character means. Then I used numpy arrays for the data and coordinates. This squeezed each file into under 100MB. Thanks again for the help!

Peter
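For reference, a minimal sketch of that kind of encoding (not Peter's actual code; the 'chrN' names, the character mapping and the assumed file layout of "chromosome position value" per line are all illustrative):

from cStringIO import StringIO
from array import array
import numpy as np

# Illustrative mapping: one character per chromosome name ('chr1' -> 'A', ...).
CHROMO_TO_CHAR = dict(('chr%d' % i, chr(ord('A') + i - 1)) for i in range(1, 24))
CHAR_TO_CHROMO = dict((v, k) for k, v in CHROMO_TO_CHAR.items())

def load(fname):
    chromo_buf = StringIO()    # one byte per record
    positions = array('i')     # packed C ints
    dpoints = array('d')       # packed C doubles
    for line in open(fname):
        chromo, pos, value = line.split()
        chromo_buf.write(CHROMO_TO_CHAR[chromo])
        positions.append(int(pos))
        dpoints.append(float(value))
    # One byte-per-point string plus two numpy arrays built from the packed buffers.
    return (chromo_buf.getvalue(),
            np.array(positions, dtype=np.int32),
            np.array(dpoints, dtype=np.float64))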
Re: Memory efficient tuple storage
On Mar 13, 1:13 pm, "psaff...@googlemail.com" wrote:
> Thanks for all the replies.
>
> First of all, can anybody recommend a good way to show memory usage? I
> tried heapy, but couldn't make much sense of the output and it didn't
> seem to change too much for different usages. Maybe I was just making
> the h.heap() call in the wrong place. I also tried getrusage() in the
> resource module. That seemed to give 0 for the shared and unshared
> memory size no matter what I did. I was calling it after the function
> call that filled up the lists. The memory figures I give in this
> message come from top.
>
> The numpy solution does work, but it uses more than 1GB of memory for
> one of my 130MB files. I'm using
>
> np.dtype({'names': ['chromo', 'position', 'dpoint'], 'formats': ['S6', 'i4', 'f8']})
>
> so shouldn't it use 18 bytes per line? The file has 5832443 lines,
> which by my arithmetic is around 100MB...?
[snip]

Sorry, I did not study your post. But can you use a ctypes.Structure? Or can you use a database or mmap to keep the data out of memory? Or, how would you feel about a mini extension in C?
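For what a ctypes.Structure version might look like (a sketch only, reusing the field names from the dtype quoted above; _pack_ = 1 gives the 18 bytes per record the quoted post expects):

import ctypes

class Record(ctypes.Structure):
    _pack_ = 1                                  # no padding: 6 + 4 + 8 = 18 bytes
    _fields_ = [('chromo', ctypes.c_char * 6),
                ('position', ctypes.c_int32),
                ('dpoint', ctypes.c_double)]

RecordArray = Record * 5832443                  # one contiguous C array for a whole file
data = RecordArray()
data[0].chromo, data[0].position, data[0].dpoint = 'chr1', 120189, 5.34849
print ctypes.sizeof(Record), ctypes.sizeof(data)   # bytes per record, bytes in total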
Re: Memory efficient tuple storage
"psaff...@googlemail.com" writes: > However, I still need the coordinates. If I don't keep them in a list, > where can I keep them? See the docs for the array module: http://docs.python.org/library/array.html -- http://mail.python.org/mailman/listinfo/python-list
Re: Memory efficient tuple storage
On Fri, Mar 13, 2009 at 1:13 PM, psaff...@googlemail.com wrote:
> Thanks for all the replies.
> [snip]
>
> The numpy solution does work, but it uses more than 1GB of memory for
> one of my 130MB files. I'm using
>
> np.dtype({'names': ['chromo', 'position', 'dpoint'], 'formats': ['S6', 'i4', 'f8']})
>
> so shouldn't it use 18 bytes per line? The file has 5832443 lines,
> which by my arithmetic is around 100MB...?

I made a mock-up file with 5832443 lines, each line consisting of "abcdef 100 100.0", and ran the g2arr() function with 'S6' for the string. While running (which took really long), the memory usage spiked on my computer to around 800MB, but once g2arr() returned, the memory usage went down to around 200MB. The number of bytes consumed by the array is 105MB (using arr.nbytes).

From looking at the loadtxt routine in numpy, it looks like there are a zillion objects created (string objects for splitting each line, temporary ints, floats and strings for type conversions, etc.) while in the routine, which are garbage collected upon return. I'm not well versed in Python's internal memory management system, but from what I understand, practically all that memory is either returned to the OS or held onto by Python for future use by other objects after the routine returns. The only memory in use by the array itself is the ~100MB for the raw data.

Making 5 copies of the array (using numpy.copy(arr)) bumps total memory usage (from top) up to 700MB, which is 117MB per array or so. The total memory reported by summing arr.nbytes is 630MB (105MB per array), so there isn't that much memory wasted.

Basically, the numpy solution will pack the data into an array of C structs with the fields as indicated by the dtype parameter. Perhaps a database solution as mentioned in other posts would suit you better; if the temporary spike in memory usage is unacceptable, you could try to roll your own loadtxt function that would be leaner and meaner. I suggest the numpy solution for its ease and efficient use of memory.

Kurt
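One possible shape for such a hand-rolled loader (a sketch, assuming the number of lines is known up front and the same three whitespace-separated fields per line): it preallocates the structured array once and fills it row by row, so the only large allocation is the final buffer.

import numpy as np

def lean_load(fname, nlines):
    dt = np.dtype({'names': ['chromo', 'position', 'dpoint'],
                   'formats': ['S6', 'i4', 'f8']})
    arr = np.empty(nlines, dtype=dt)          # allocate the final 18-byte-per-record buffer
    for i, line in enumerate(open(fname)):
        chromo, pos, value = line.split()
        arr[i] = (chromo, int(pos), float(value))
    return arr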
Re: Memory efficient tuple storage
On Fri, 13 Mar 2009 14:49:51 -0200, Tim Wintle wrote:
> If the same chromosome string is being used multiple times then you may
> find it more efficient to reference the same string, so you don't need
> to have multiple copies of the same string in memory. That may be what
> is taking up the space.
>
> i.e. something like (written verbosely)
>
> reference_dict = {}
> for (chromosome, posn) in my_file:
>     chromosome = reference_dict.setdefault(chromosome, chromosome)

Note that the intern() builtin does exactly that:

chromosome = intern(chromosome)

Gabriel Genellina
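A quick illustration of the effect (a sketch; in Python 3 the function lives at sys.intern):

names = ['chr' + str(i % 23 + 1) for i in range(100)]   # many duplicate strings
shared = [intern(c) for c in names]
# Equal strings now reference the same object, so each distinct
# chromosome name is stored only once.
print shared[0] is shared[23]   # True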
Re: Memory efficient tuple storage
psaffrey <at> googlemail.com writes:
> First of all, can anybody recommend a good way to show memory usage?

Python 2.6 has a function called sys.getsizeof().
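Note that sys.getsizeof() is shallow - it reports the size of the object itself, not of the objects it refers to, so for a list of tuples you have to add up the container, each tuple and each element yourself. A small sketch:

import sys

t = ('chr1', 120189)
print sys.getsizeof(t)       # the tuple object itself
print sys.getsizeof(t[0])    # the string it refers to
print sys.getsizeof(t[1])    # the int it refers to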
Re: Memory efficient tuple storage
Thanks for all the replies.

First of all, can anybody recommend a good way to show memory usage? I tried heapy, but couldn't make much sense of the output and it didn't seem to change too much for different usages. Maybe I was just making the h.heap() call in the wrong place. I also tried getrusage() in the resource module. That seemed to give 0 for the shared and unshared memory size no matter what I did. I was calling it after the function call that filled up the lists. The memory figures I give in this message come from top.

The numpy solution does work, but it uses more than 1GB of memory for one of my 130MB files. I'm using

np.dtype({'names': ['chromo', 'position', 'dpoint'], 'formats': ['S6', 'i4', 'f8']})

so shouldn't it use 18 bytes per line? The file has 5832443 lines, which by my arithmetic is around 100MB...?

My previous solution - using a Python array for the numbers and a list of tuples for the coordinates - uses about 900MB. The dictionary solution suggested by Tim got this down to 650MB. If I just ignore the coordinates, this comes down to less than 100MB. I feel sure the list mechanics for storing the coordinates are what is killing me here.

As to "work smarter", you could be right, but it's tricky. The 28 files are in 4 groups of 7, so given that each file is about 6 million lines, each group of data points contains about 42 million points. First, I need to divide every point by the median of its group. Then I need to z-score the whole group of points. After this preparation, I need to file each point, based on its coordinates, into other data structures - the genome itself is divided up into bins that cover a range of coordinates, and we file each point into the appropriate bin for the coordinate region it overlaps. Then there are operations that combine the values from various bins. The relevant coordinates for these combinations come from more enormous CSV files. I've already done all this analysis on smaller datasets, so I'm hoping I won't have to make huge changes just to fit the data into memory.

Yes, I'm also finding out how much it will cost to upgrade to 32GB of memory :)

Sorry for the long message...

Peter
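For the preparation steps, something along these lines in numpy might be enough once a group's points are in flat arrays (a sketch; the function names, the bin_edges array and the assumption that dpoints and positions are already numpy arrays are mine, not Peter's):

import numpy as np

def prepare_group(dpoints):
    # Divide every point by the group median, then z-score the group.
    scaled = dpoints / np.median(dpoints)
    return (scaled - scaled.mean()) / scaled.std()

def bin_points(positions, zscores, bin_edges):
    # File each z-scored point into the genome bin covering its coordinate.
    bin_index = np.digitize(positions, bin_edges)
    return [zscores[bin_index == b] for b in range(len(bin_edges) + 1)]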
Re: Memory efficient tuple storage
While Kurt gave some excellent ideas for using numpy, there were some missing details in your original post that might help folks come up with a "work smarter, not harder" solution. Clearly, you're not loading it into memory just for giggles - surely you're *doing* something with it once it's in memory. With details of what you're trying to do with that data, some of the smart minds on the list may be able to provide a solution/algorithm that doesn't require having everything in memory concurrently.

Or you may try streaming through your data sources, pumping the data into a sqlite/mysql/postgres database, allowing for more efficient querying of the data. Both mysql & postgres offer the ability to import data directly into the server without the need for (or overhead of) a bajillion INSERT statements, which may also speed up the slurping-in process.

-tkc
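As an illustration of the streaming idea with the standard library (a sketch using sqlite3 rather than a mysql/postgres bulk import; the table name and the three-field file layout are assumed):

import sqlite3

def load_into_db(fname, dbname='genome.db'):
    conn = sqlite3.connect(dbname)
    conn.execute('CREATE TABLE IF NOT EXISTS points '
                 '(chromo TEXT, position INTEGER, dpoint REAL)')
    # Stream the file through generators so only one line is held in memory at a time.
    rows = ((c, int(p), float(v))
            for c, p, v in (line.split() for line in open(fname)))
    conn.executemany('INSERT INTO points VALUES (?, ?, ?)', rows)
    conn.commit()
    return conn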
Re: Memory efficient tuple storage
On Fri, Mar 13, 2009 at 11:33 AM, Kurt Smith wrote:
[snip OP]
> Assuming your data is in a plaintext file something like
> 'genomedata.txt' below, the following will load it into a numpy array
> with a customized dtype. You can access the different fields by name
> ('chromo', 'position', and 'dpoint' -- change to your liking). Don't
> know if this works or not; might give it a try.

To clarify -- I don't know if this will work for your particular problem, but I do know that it will read in the array correctly and cut down on memory usage in the final array size.

Specifically, if you use a dtype with 'S50', 'i4' and 'f8' (see the numpy dtype docs) -- that's 50 bytes for your chromosome string, 4 bytes for the position and 8 bytes for the data point -- each entry will use just 50 + 4 + 8 bytes, and the numpy array will have just enough memory allocated for all of these records. The datatypes stored in the array will be a char array for the string, a C int and a C double; it won't use the corresponding Python datatypes, which have a bunch of other memory usage associated with them.

Hope this helps,

Kurt
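A quick way to check the per-record size of whatever dtype you settle on (a sketch, using the 'S6' field width the original poster mentioned):

import numpy as np

dt = np.dtype({'names': ['chromo', 'position', 'dpoint'],
               'formats': ['S6', 'i4', 'f8']})
print dt.itemsize             # 18 bytes per record
print dt.itemsize * 5832443   # ~105 million bytes of raw data for the whole file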
Re: Memory efficient tuple storage
On Fri, 2009-03-13 at 08:59 -0700, psaff...@googlemail.com wrote:
> I'm reading in some rather large files (28 files, each of 130MB). Each
> file is a genome coordinate (chromosome (string) and position (int))
> and a data point (float). I want to read these into a list of
> coordinates (each a tuple of (chromosome, position)) and a list of
> data points.
>
> This has taught me that Python lists are not memory efficient, because
> if I use lists it gets through 100MB a second until it hits the swap
> space, and I have 8GB physical memory in this machine. I can use
> Python or numpy arrays for the data points, which is much more
> manageable. However, I still need the coordinates. If I don't keep
> them in a list, where can I keep them?

If you just have one list of objects then it's actually relatively efficient; it's if you have lots of lists that it's inefficient. I'm not certain without seeing your code (and my biology isn't good enough to know the answer to my question below).

How many unique chromosome strings do you have (by equivalence)?

If the same chromosome string is being used multiple times then you may find it more efficient to reference the same string, so you don't need to have multiple copies of the same string in memory. That may be what is taking up the space.

i.e. something like (written verbosely):

reference_dict = {}
list_of_coordinates = []
for (chromosome, posn) in my_file:
    chromosome = reference_dict.setdefault(chromosome, chromosome)
    list_of_coordinates.append((chromosome, posn))

(or something like that)

Tim Wintle
Re: Memory efficient tuple storage
On Fri, Mar 13, 2009 at 10:59 AM, psaff...@googlemail.com wrote:
> I'm reading in some rather large files (28 files, each of 130MB). Each
> file is a genome coordinate (chromosome (string) and position (int))
> and a data point (float). I want to read these into a list of
> coordinates (each a tuple of (chromosome, position)) and a list of
> data points.
>
> This has taught me that Python lists are not memory efficient, because
> if I use lists it gets through 100MB a second until it hits the swap
> space, and I have 8GB physical memory in this machine. I can use
> Python or numpy arrays for the data points, which is much more
> manageable. However, I still need the coordinates. If I don't keep
> them in a list, where can I keep them?

Assuming your data is in a plaintext file something like 'genomedata.txt' below, the following will load it into a numpy array with a customized dtype. You can access the different fields by name ('chromo', 'position', and 'dpoint' -- change to your liking). Don't know if this works or not; might give it a try.

===
[186]$ cat genomedata.txt
gene1 120189 5.34849
gene2 84040 903873.1
gene3 300822 -21002.2020

[187]$ cat g2arr.py
import numpy as np

def g2arr(fname):
    # the 'S100' should be modified to be large enough for your string field.
    dt = np.dtype({'names': ['chromo', 'position', 'dpoint'],
                   'formats': ['S100', np.int, np.float]})
    return np.loadtxt(fname, delimiter=' ', dtype=dt)

if __name__ == '__main__':
    arr = g2arr('genomedata.txt')
    print arr
    print arr['chromo']
    print arr['position']
    print arr['dpoint']
===

Take a look at the np.loadtxt and np.dtype documentation.

Kurt
Memory efficient tuple storage
I'm reading in some rather large files (28 files, each of 130MB). Each file is a genome coordinate (chromosome (string) and position (int)) and a data point (float). I want to read these into a list of coordinates (each a tuple of (chromosome, position)) and a list of data points.

This has taught me that Python lists are not memory efficient, because if I use lists it gets through 100MB a second until it hits the swap space, and I have 8GB physical memory in this machine. I can use Python or numpy arrays for the data points, which is much more manageable. However, I still need the coordinates. If I don't keep them in a list, where can I keep them?

Peter