On Mon, Feb 21, 2011 at 7:24 PM, Dan Stromberg <drsali...@gmail.com> wrote:
> On Mon, Feb 21, 2011 at 6:57 PM, Kelson Zawack <zawack...@gis.a-star.edu.sg> wrote:
>
>> I have a large (10 GB) data file for which I want to parse each line
>> into an object and then append this object to a list for sorting and
>> further processing. I have noticed, however, that as the length of the
>> list increases, the rate at which objects are added to it decreases
>> dramatically. My first thought was that I was nearing the memory
>> capacity of the machine and the decrease in performance was due to the
>> OS swapping things in and out of memory. When I looked at the memory
>> usage this was not the case. My process was the only job running and
>> was consuming 40 GB of the total 130 GB, and no swapping processes were
>> running. To make sure there was not some problem with the rest of my
>> code, or the server's file system, I ran my program again as it was but
>> without the line that appends items to the list, and it completed
>> without problem, indicating that the decrease in performance is the
>> result of some part of the process of appending to the list. Since
>> other people have observed this problem as well
>> (http://tek-tips.com/viewthread.cfm?qid=1096178&page=13,
>> http://stackoverflow.com/questions/2473783/is-there-a-way-to-circumvent-python-list-append-becoming-progressively-slower-i)
>> I did not bother to analyze or benchmark it further. Since the answers
>> in the above forums do not seem very definitive, I thought I would
>> inquire here about what the reason for this decrease in performance is,
>> and whether there is a way, or another data structure, that would avoid
>> this problem.
>
> Do you have 130 GB of physical RAM, or 130 GB of virtual memory? That
> makes a big difference. (Yeah, I know, 130 GB of physical RAM is probably
> pretty rare today.)
>
> Disabling garbage collection is a good idea, but if you don't have well
> over 10 GB of physical RAM, you'd probably be better off also using a
> (partially) disk-based sort. To do otherwise would pretty much beg for
> swapping and a large slowdown.
>
> Merge sort works very well for very large datasets:
> http://en.wikipedia.org/wiki/Merge_sort
> Just make your sublists be disk files, not in-memory lists - until you
> get down to a small enough sublist that you can sort it in memory,
> without thrashing. Timsort (list_.sort()) is excellent for in-memory
> sorting.
>
> Actually, GNU sort is very good at sorting huge datasets - you could
> probably just open a subprocess to it, as long as you can make your data
> fit the line-oriented model GNU sort expects and you have enough
> temporary disk space.

Depending on what you're doing after the sort, you might also look at
bsddb.btopen.
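
To make that last suggestion concrete: bsddb.btopen gives you a dict-like
B-tree on disk that keeps its keys in sorted order, so if the "further
processing" just needs to walk the records in key order, you may not need
a separate sort pass at all. A rough, untested sketch (the tab-separated
key/value layout and process() are placeholders for whatever your records
actually look like):

import bsddb

db = bsddb.btopen('records.db', 'c')       # on-disk B-tree; keys stay sorted
with open('data.txt') as f:
    for line in f:
        # placeholder record layout: key<TAB>rest-of-record
        key, value = line.rstrip('\n').split('\t', 1)
        db[key] = value

for key, value in db.iteritems():          # walks the tree in key order
    process(key, value)                    # process() = your downstream step

db.close()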
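
As for the garbage-collection point quoted above, here is roughly what
that looks like in code (untested; parse() stands in for whatever you do
with each line). CPython's cyclic collector periodically rescans every
container object it tracks, so those scans get more and more expensive as
the list of parsed objects grows; turning the collector off while loading
usually gets append() back to roughly constant time:

import gc

records = []
gc.disable()          # stop the cyclic collector from repeatedly rescanning
try:                  # the ever-growing population of container objects
    with open('data.txt') as f:
        for line in f:
            records.append(parse(line))   # parse() = your per-line parser
finally:
    gc.enable()       # re-enable collection once the load is finished

records.sort()        # timsort, in memory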
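
The disk-based merge sort is not much more code, assuming plain text lines
and enough scratch space for the chunk files: sort fixed-size chunks in
memory with timsort, spill each chunk to a temporary file, then stream the
chunks back together with heapq.merge (again, an untested sketch):

import heapq
import itertools
import os
import tempfile

def external_sort(infile, outfile, chunk_lines=1000000):
    # Phase 1: sort chunks that fit comfortably in memory, spill each to disk.
    chunk_paths = []
    with open(infile) as src:
        while True:
            chunk = list(itertools.islice(src, chunk_lines))
            if not chunk:
                break
            chunk.sort()                               # timsort, per chunk
            tmp = tempfile.NamedTemporaryFile('w', suffix='.chunk',
                                              delete=False)
            tmp.writelines(chunk)
            tmp.close()
            chunk_paths.append(tmp.name)
    # Phase 2: k-way merge of the sorted chunk files; memory use stays small.
    readers = [open(p) for p in chunk_paths]
    with open(outfile, 'w') as dst:
        dst.writelines(heapq.merge(*readers))
    for reader, path in zip(readers, chunk_paths):
        reader.close()
        os.remove(path)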
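
And the GNU sort route is the least code of all, if your records really
are one per line and lexicographic (or -k/-n field) ordering is what you
need; -T points sort's temporary spill files at whichever disk has room
(/scratch below is just an example path):

import subprocess

# Assumes line-oriented records and enough free space under /scratch for
# sort's spill files; add -n, -t, or -k options for non-lexicographic keys.
subprocess.check_call(
    ['sort', '-T', '/scratch', '-o', 'data.sorted', 'data.txt'])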