Re: Creating Long Lists
On Tue, 2011-02-22, Ben Finney wrote:
> Kelson Zawack <zawack...@gis.a-star.edu.sg> writes:
>> I have a large (10gb) data file for which I want to parse each line
>> into an object and then append this object to a list for sorting and
>> further processing.
>
> What is the nature of the further processing? Does that further
> processing access the items sequentially? If so, they don't all need
> to be in memory at once, and you can produce them with a generator
> <URL:http://docs.python.org/glossary.html#term-generator>.

He mentioned sorting them -- you need all of them for that.

If that's the *only* such use, I'd experiment with writing them as
sortable text to file, and run GNU sort (the Unix utility) on the file.
It seems to have a clever file-backed sort algorithm.

/Jorgen
Re: Creating Long Lists
On Wed, 2011-02-23 at 13:57 +, Jorgen Grahn wrote:
> If that's the *only* such use, I'd experiment with writing them as
> sortable text to file, and run GNU sort (the Unix utility) on the
> file. It seems to have a clever file-backed sort algorithm.

+1 - and experiment with the different flags to sort (compression of
intermediate results, intermediate batch size, etc)

Tim
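(A minimal sketch of driving GNU sort from Python with a couple of such
flags; the flag names are GNU coreutils options and are worth checking
against the local man page, and the file names are placeholders:)

    import subprocess

    # -S bounds sort's in-memory buffer; --compress-program compresses
    # the intermediate runs it spills to disk.
    infile = open('records.txt')
    outfile = open('sorted.txt', 'w')
    subprocess.check_call(
        ['sort', '-S', '2G', '--compress-program=gzip'],
        stdin=infile, stdout=outfile)
    outfile.close()
    infile.close()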
Re: Creating Long Lists
The answer it turns out is the garbage collector. When I disable the
garbage collector before the loop that loads the data into the list and
then enable it after the loop, the program runs without issue.

This raises a question though: can the logic of the garbage collector
be changed so that it is not triggered in instances like this where you
really do want to put lots and lots of stuff in memory? Turning the
garbage collector on and off is not a big deal, but it would obviously
be nicer not to have to.
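(A minimal sketch of that workaround; parse() is a hypothetical
stand-in for the OP's unspecified line parser:)

    import gc

    records = []
    gc.disable()                # suspend cyclic GC during the bulk appends
    try:
        with open('data.txt') as f:
            for line in f:
                records.append(parse(line))  # parse() is hypothetical
    finally:
        gc.enable()             # restore normal collection afterwards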
Re: Creating Long Lists
Kelson Zawack <zawack...@gis.a-star.edu.sg> writes:
> This raises a question though: can the logic of the garbage collector
> be changed so that it is not triggered in instances like this where
> you really do want to put lots and lots of stuff in memory?

Have you considered using a more specialised data type for such large
data sets, such as ‘array.array’ or the NumPy array types?

-- 
“True greatness is measured by how much freedom you give to others,
not by how much you can coerce others to do what you want.” —Larry Wall

Ben Finney
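(array.array holds homogeneous C-typed scalars rather than arbitrary
Python objects -- a minimal sketch of the numeric case:)

    import array

    # 'd' = C double: one 8-byte machine float per element, much more
    # compact than a list of Python float objects
    values = array.array('d')
    values.append(3.14)
    values.extend([2.71, 1.41])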
Re: Creating Long Lists
Kelson Zawack wrote:
> The answer it turns out is the garbage collector. When I disable the
> garbage collector before the loop that loads the data into the list
> and then enable it after the loop, the program runs without issue.
>
> This raises a question though: can the logic of the garbage collector
> be changed so that it is not triggered in instances like this where
> you really do want to put lots and lots of stuff in memory? Turning
> the garbage collector on and off is not a big deal, but it would
> obviously be nicer not to have to.

What Python version are you using? The garbage collection heuristic has
been tweaked in 2.7, see
http://svn.python.org/view/python/trunk/Modules/gcmodule.c?r1=67832&r2=68462
Re: Creating Long Lists
I am using python 2.6.2, so with a newer version it may no longer be a
problem. I am open to using another data type, but the way I read the
documentation, array.array only supports numeric types, not arbitrary
objects.

I also tried playing around with numpy arrays, albeit for only a short
time, and it seems that although they do support arbitrary objects they
are geared toward numbers as well, and I found it cumbersome to
manipulate objects with them. It could be, though, that if I understood
them better they would work fine. Also, do numpy arrays support sorting
arbitrary objects? I only saw a method that sorts numbers.
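(NumPy can sort structured arrays by a named field, which covers
objects that can be flattened into fixed-type fields -- a minimal
sketch, with the field names chosen purely for illustration:)

    import numpy as np

    # Each element is a (key, payload) record rather than a full Python
    # object; np.sort accepts order= for structured dtypes.
    records = np.array([(3, 0.5), (1, 2.5), (2, 1.0)],
                       dtype=[('key', 'i8'), ('payload', 'f8')])
    records_sorted = np.sort(records, order='key')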
Re: Creating Long Lists
On 2/22/2011 4:40 AM, Kelson Zawack wrote:
> The answer it turns out is the garbage collector. When I disable the
> garbage collector before the loop that loads the data into the list
> and then enable it after the loop, the program runs without issue.
>
> This raises a question though: can the logic of the garbage collector
> be changed so that it is not triggered in instances like this where
> you really do want to put lots and lots of stuff in memory? Turning
> the garbage collector on and off is not a big deal, but it would
> obviously be nicer not to have to.

Heuristics, by their very nature, are not correct in all situations.

-- 
Terry Jan Reedy
Creating Long Lists
I have a large (10gb) data file for which I want to parse each line
into an object and then append this object to a list for sorting and
further processing. I have noticed however that as the length of the
list increases, the rate at which objects are added to it decreases
dramatically.

My first thought was that I was nearing the memory capacity of the
machine and the decrease in performance was due to the OS swapping
things in and out of memory. When I looked at the memory usage this was
not the case. My process was the only job running and was consuming
40gb of the total 130gb, and no swapping processes were running.

To make sure there was not some problem with the rest of my code, or
the server's file system, I ran my program again as it was but without
the line that appends items to the list, and it completed without
problem, indicating that the decrease in performance is the result of
some part of the process of appending to the list.

Since other people have observed this problem as well
(http://tek-tips.com/viewthread.cfm?qid=1096178&page=13,
http://stackoverflow.com/questions/2473783/is-there-a-way-to-circumvent-python-list-append-becoming-progressively-slower-i)
I did not bother to further analyze or benchmark it. Since the answers
in the above forums do not seem very definitive, I thought I would
inquire here about what the reason for this decrease in performance is,
and if there is a way, or another data structure, that would avoid this
problem.
Re: Creating Long Lists
On Feb 22, 12:57 pm, Kelson Zawack <zawack...@gis.a-star.edu.sg> wrote:
> I did not bother to further analyze or benchmark it. Since the answers
> in the above forums do not seem very definitive, I thought I would
> inquire here about what the reason for this decrease in performance
> is, and if there is a way, or another data structure, that would avoid
> this problem.

The first link is 6 years old and refers to Python 2.4. Unless you're
using 2.4 you should probably ignore it.

The first answer on the stackoverflow link was accepted by the poster
as resolving his issue. Try disabling garbage collection.
Re: Creating Long Lists
alex23 <wuwe...@gmail.com> writes:
> On Feb 22, 12:57 pm, Kelson Zawack <zawack...@gis.a-star.edu.sg> wrote:
>> I did not bother to further analyze or benchmark it. Since the
>> answers in the above forums do not seem very definitive, I thought I
>> would inquire here about what the reason for this decrease in
>> performance is, and if there is a way, or another data structure,
>> that would avoid this problem.
>
> The first link is 6 years old and refers to Python 2.4. Unless you're
> using 2.4 you should probably ignore it.
>
> The first answer on the stackoverflow link was accepted by the poster
> as resolving his issue. Try disabling garbage collection.

I just read http://bugs.python.org/issue4074 which discusses a patch
that was included 2 years ago. So a recent Python 2.x also doesn't have
this issue?

-- 
John Bokma                                                          j3b
Blog: http://johnbokma.com/
Facebook: http://www.facebook.com/j.j.j.bokma
Freelance Perl & Python Development: http://castleamber.com/
Re: Creating Long Lists
On Mon, Feb 21, 2011 at 6:57 PM, Kelson Zawack
<zawack...@gis.a-star.edu.sg> wrote:
> I have a large (10gb) data file for which I want to parse each line
> into an object and then append this object to a list for sorting and
> further processing. I have noticed however that as the length of the
> list increases, the rate at which objects are added to it decreases
> dramatically.
>
> My first thought was that I was nearing the memory capacity of the
> machine and the decrease in performance was due to the OS swapping
> things in and out of memory. When I looked at the memory usage this
> was not the case. My process was the only job running and was
> consuming 40gb of the total 130gb, and no swapping processes were
> running.
>
> To make sure there was not some problem with the rest of my code, or
> the server's file system, I ran my program again as it was but without
> the line that appends items to the list, and it completed without
> problem, indicating that the decrease in performance is the result of
> some part of the process of appending to the list.
>
> Since other people have observed this problem as well
> (http://tek-tips.com/viewthread.cfm?qid=1096178&page=13,
> http://stackoverflow.com/questions/2473783/is-there-a-way-to-circumvent-python-list-append-becoming-progressively-slower-i)
> I did not bother to further analyze or benchmark it. Since the answers
> in the above forums do not seem very definitive, I thought I would
> inquire here about what the reason for this decrease in performance
> is, and if there is a way, or another data structure, that would avoid
> this problem.

Do you have 130G of physical RAM, or 130G of virtual memory? That makes
a big difference. (Yeah, I know, 130G of physical RAM is probably
pretty rare today.)

Disabling garbage collection is a good idea, but if you don't have well
over 10G of physical RAM, you'd probably better also use a (partially)
disk-based sort. To do otherwise would pretty much beg for swapping and
a large slowdown.

Merge sort works very well for very large datasets:
http://en.wikipedia.org/wiki/Merge_sort
Just make your sublists be disk files, not in-memory lists - until you
get down to a small enough sublist that you can sort it in memory
without thrashing. Timsort (list_.sort()) is excellent for the
in-memory sorting.

Actually, GNU sort is very good at sorting huge datasets - you could
probably just open a subprocess to it, as long as you can make your
data fit the line-oriented model GNU sort expects, and you have enough
temporary disk space.
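(A minimal sketch of that disk-backed merge, assuming records are
newline-terminated strings; heapq.merge is available from Python 2.6,
and batch_size is a tunable guess:)

    import heapq
    import tempfile

    def _spill(sorted_lines):
        # Write one sorted run to a temp file and rewind it for reading.
        f = tempfile.TemporaryFile()
        f.writelines(sorted_lines)
        f.seek(0)
        return f

    def external_sort(lines, batch_size=1000000):
        # Sort fixed-size batches in memory with Timsort, spill each
        # run to disk, then lazily merge the sorted runs.
        runs, batch = [], []
        for line in lines:
            batch.append(line)
            if len(batch) >= batch_size:
                runs.append(_spill(sorted(batch)))
                batch = []
        if batch:
            runs.append(_spill(sorted(batch)))
        return heapq.merge(*runs)

    # Usage: for line in external_sort(open('data.txt')): ...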
Re: Creating Long Lists
Kelson Zawack <zawack...@gis.a-star.edu.sg> writes:
> I have a large (10gb) data file for which I want to parse each line
> into an object and then append this object to a list for sorting and
> further processing.

What is the nature of the further processing?

Does that further processing access the items sequentially? If so, they
don't all need to be in memory at once, and you can produce them with a
generator <URL:http://docs.python.org/glossary.html#term-generator>.
Note that, if you just want lines of text from a file, the file object
itself is a generator for the lines of text within it.

If, on the other hand, you need arbitrary access all over that large
data set, you probably want a data type better suited. The standard
library has the ‘array’ module for this purpose; the third-party NumPy
library provides even more power.

-- 
“Remember: every member of your ‘target audience’ also owns a
broadcasting station. These ‘targets’ can shoot back.” —Michael Rathbun
to advertisers, news.admin.net-abuse.email

Ben Finney
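(A minimal sketch of that generator approach; parse() and process() are
hypothetical stand-ins for the OP's per-line parser and sequential
processing:)

    def parsed_records(path):
        # The file object already yields lines lazily; wrapping it in a
        # generator keeps only one parsed object alive at a time.
        with open(path) as f:
            for line in f:
                yield parse(line)

    for record in parsed_records('data.txt'):
        process(record)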
Re: Creating Long Lists
On Mon, Feb 21, 2011 at 7:24 PM, Dan Stromberg <drsali...@gmail.com> wrote:
> On Mon, Feb 21, 2011 at 6:57 PM, Kelson Zawack
> <zawack...@gis.a-star.edu.sg> wrote:
>> I have a large (10gb) data file for which I want to parse each line
>> into an object and then append this object to a list for sorting and
>> further processing. [...]
>
> Do you have 130G of physical RAM, or 130G of virtual memory? That
> makes a big difference. (Yeah, I know, 130G of physical RAM is
> probably pretty rare today.)
>
> Disabling garbage collection is a good idea, but if you don't have
> well over 10G of physical RAM, you'd probably better also use a
> (partially) disk-based sort. To do otherwise would pretty much beg for
> swapping and a large slowdown.
>
> Merge sort works very well for very large datasets:
> http://en.wikipedia.org/wiki/Merge_sort
> Just make your sublists be disk files, not in-memory lists - until you
> get down to a small enough sublist that you can sort it in memory
> without thrashing. Timsort (list_.sort()) is excellent for the
> in-memory sorting.
>
> Actually, GNU sort is very good at sorting huge datasets - you could
> probably just open a subprocess to it, as long as you can make your
> data fit the line-oriented model GNU sort expects, and you have enough
> temporary disk space.

Depending on what you're doing after the sort, you might also look at
bsddb.btopen.
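(A minimal sketch of the bsddb route; extract_key() and process() are
hypothetical stand-ins. A B-tree keeps its keys in sorted order on
disk, so inserting the records effectively sorts them:)

    import bsddb

    db = bsddb.btopen('records.db', 'c')
    for line in open('data.txt'):
        # Keys and values must be strings in Python 2; duplicate keys
        # overwrite, so append a sequence number if keys can repeat.
        db[extract_key(line)] = line
    for key in db.keys():        # keys() comes back in sorted order
        process(db[key])
    db.close()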