On Mon, Feb 21, 2011 at 6:57 PM, Kelson Zawack <zawack...@gis.a-star.edu.sg>wrote:
> I have a large (10gb) data file for which I want to parse each line into an > object and then append this object to a list for sorting and further > processing. I have noticed however that as the length of the list increases > the rate at which objects are added to it decreases dramatically. My first > thought was that I was nearing the memory capacity of the machine and the > decrease in performance was due to the os swapping things in and out of > memory. When I looked at the memory usage this was not the case. My > process was the only job running and was consuming 40gb of the the total > 130gb and no swapping processes were running. To make sure there was not > some problem with the rest of my code, or the servers file system, I ran my > program again as it was but without the line that was appending items to the > list and it completed without problem indicating that the decrease in > performance is the result of some part of the process of appending to the > list. Since other people have observed this problem as well ( > http://tek-tips.com/viewthread.cfm?qid=1096178&page=13, > http://stackoverflow.com/questions/2473783/is-there-a-way-to-circumvent-python-list-append-becoming-progressively-slower-i) > I did not bother to further analyze or benchmark it. Since the answers in > the above forums do not seem very definitive I thought I would inquire > here about what the reason for this decrease in performance is, and if there > is a way, or another data structure, that would avoid this > problem.<http://mail.python.org/mailman/listinfo/python-list> Do you have 130G of physical RAM, or 130G of virtual memory? That makes a big difference. (Yeah, I know, 130G of physical RAM is probably pretty rare today) Disabling garbage collection is a good idea, but if you don't have well over 10G of physical RAM, you'd probably better also use a (partially) disk-based sort. To do otherwise would pretty much beg for swapping and a large slowdown. Merge sort works very well for very large datasets. http://en.wikipedia.org/wiki/Merge_sort Just make your sublists be disk files, not in-memory lists - until you get down to a small enough sublist that you can sort it in memory, without thrashing. Timsort (list_.sort()) is excellent for in memory sorting. Actually, GNU sort is very good at sorting huge datasets - you could probably just open a subprocess to it, as long as you can make your data fit the line-oriented model GNU sort expects, and you have enough temporary disk space.
-- http://mail.python.org/mailman/listinfo/python-list