Re: zip() function troubles
On 31 Jul 2007 12:57:13 -0700, Paul Rubin http://phr.cx@nospam.invalid wrote:

> Chris Mellon [EMAIL PROTECTED] writes:
> > Better hand in your computer, then. You're never going to find a
> > situation where the environment won't affect the running time of
> > your algorithms.
>
> The environment may affect the running time by an additive or linear
> multiplicative constant, but it should never turn an O(n) algorithm
> into an O(n**2) one.

Hardly true. When you start to factor real-world issues like memory
allocation (and all the complexities thereof) and cache misses into
algorithmic performance, an algorithm certainly can change from O(n) to
O(n**2). This is one reason why real-world performance is tested and
compared using benchmarks and not Big O notation.

> > For the record, the python GC is generational. This is a case of a
> > default heuristic giving pathological behavior in a corner case, not
> > anything broken about the design of the python GC.
>
> No, it is broken, per discussion on a comp.compilers/comp.lang.functional
> thread this week. The notion of using a generational collector was to
> collect less and less frequently for older and older generations (e.g.
> doubling the amount of allocation between generations), but the usual
> solution is apparently to increase the heap size by some multiplicative
> factor when GC fails to free enough memory.

The issue at hand has nothing to do with the generational nature of the
Python GC. It's caused by a heuristic that aggressively attempts to
reclaim objects when there are large numbers of allocations without
releases. This is a tunable GC parameter.

--
http://mail.python.org/mailman/listinfo/python-list
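The "large numbers of allocations without releases" heuristic described above can be observed directly in CPython's gc module: the generation-0 counter climbs with each tracked allocation until it crosses the collection threshold. A minimal sketch (modern Python 3; exact counts are CPython-specific):

```python
import gc

gc.collect()  # start the generation-0 counter from a clean slate
before = gc.get_count()[0]

# Allocate tracked container objects (tuples, a list) without releasing
# any; the generation-0 counter climbs toward the first threshold
# (700 by default in CPython), at which point a collection runs.
bucket = [(i, i) for i in range(100)]

after = gc.get_count()[0]
print(before, after)
```

With the default thresholds, building ten million tuples in one go crosses that 700-allocation threshold thousands of times, which is exactly the repeated-collection behavior the thread is debugging.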
Re: zip() function troubles
Chris Mellon [EMAIL PROTECTED] writes:
> Better hand in your computer, then. You're never going to find a
> situation where the environment won't affect the running time of your
> algorithms.

The environment may affect the running time by an additive or linear
multiplicative constant, but it should never turn an O(n) algorithm into
an O(n**2) one.

> For the record, the python GC is generational. This is a case of a
> default heuristic giving pathological behavior in a corner case, not
> anything broken about the design of the python GC.

No, it is broken, per discussion on a comp.compilers/comp.lang.functional
thread this week. The notion of using a generational collector was to
collect less and less frequently for older and older generations (e.g.
doubling the amount of allocation between generations), but the usual
solution is apparently to increase the heap size by some multiplicative
factor when GC fails to free enough memory.
Re: zip() function troubles
Istvan Albert [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

| if you try it yourself you'll see that it is very easy to generate 10
| million tuples,

No, it is not, on most machines.

| on my system it takes 3 (!!!) seconds to do the following:
|
| size = 10**7
| data = []
| for i in range(10):
|     x = [ (0,1) ] * size

x has 10**7 references (4 bytes each) to the same tuple. Use id() to
check. 40 megs is manageable.

|     data.append( x )
|
| Now it takes over two minutes to do this:
|
| size = 10**7
| a = [ 0 ] * size
| b = zip(a,a)

b has 40 megs of references to 10**7 *different* tuples. Each tuple is
20 to 40 bytes, so that is 200-400 megs more. Try
[(i,i) for i in xrange(500)] for comparison (it also makes 1000 objects
plus a large list).

| the only explanation I can come up with is that the internal
| implementation of zip must have some flaws

References are not objects.

Terry Jan Reedy
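Terry's distinction between references and objects can be checked directly with id(), as he suggests. A small sketch (modern Python 3, CPython):

```python
# [ (0,1) ] * n repeats a *reference* to one tuple object, while a
# comprehension (or zip) creates n distinct tuple objects.
n = 1000

shared = [(0, 1)] * n
distinct = [(i, i) for i in range(n)]

# Every element of `shared` is the very same object...
assert len({id(t) for t in shared}) == 1

# ...while `distinct` holds n separate tuples, each with its own
# object header and storage.
assert len({id(t) for t in distinct}) == n
```

This is why the first loop in Istvan's example is cheap (one tuple, ten million pointers) while the zip version must allocate ten million tuples.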
Re: zip() function troubles
Peter Otten [EMAIL PROTECTED] writes:
> When you are allocating a lot of objects without releasing them the
> garbage collector kicks in to look for cycles. Try switching it off:

I think that is the answer. The zip took almost 2 minutes without
turning gc off, but takes 1.25 seconds with gc off. It turned a
linear-time algorithm into a quadratic one. I think something is broken
about a design where that can happen. Maybe Pypy will have a
generational GC someday.
Re: zip() function troubles
On Jul 27, 2:16 am, Terry Reedy [EMAIL PROTECTED] wrote:
> References are not objects.

Yes, this is a valid objection, but in all fairness the example in the
original post operates on comparably sized objects and also exhibited
unexpected performance degradation.

As it turns out, the garbage collection is the culprit. I never used to
have to care about gc (and that's great!), but now that I'm shuffling
1Gb chunks I have to be more careful.

best,
Istvan
Re: zip() function troubles
On 26 Jul 2007 23:35:44 -0700, Paul Rubin http://phr.cx@nospam.invalid wrote:

> Peter Otten [EMAIL PROTECTED] writes:
> > When you are allocating a lot of objects without releasing them the
> > garbage collector kicks in to look for cycles. Try switching it off:
>
> I think that is the answer. The zip took almost 2 minutes without
> turning gc off, but takes 1.25 seconds with gc off. It turned a
> linear-time algorithm into a quadratic one. I think something is
> broken about a design where that can happen. Maybe Pypy will have a
> generational GC someday.

Better hand in your computer, then. You're never going to find a
situation where the environment won't affect the running time of your
algorithms.

For the record, the python GC is generational. This is a case of a
default heuristic giving pathological behavior in a corner case, not
anything broken about the design of the python GC.
Re: zip() function troubles
On 7/27/07, Istvan Albert [EMAIL PROTECTED] wrote:
> On Jul 27, 2:16 am, Terry Reedy [EMAIL PROTECTED] wrote:
> > References are not objects.
>
> yes this is a valid objection, but in all fairness the example in the
> original post operates on comparably sized objects and also exhibited
> unexpected performance degradation
>
> as it turns out the garbage collection is the culprit, I never used to
> have to care about gc (and that's great!), but now that I'm shuffling
> 1Gb chunks I have to be more careful.

You might be interested in the gc.set_threshold function.
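For readers who want a middle ground between gc.disable() and the defaults, gc.set_threshold adjusts how many net allocations are tolerated before a collection pass. A minimal sketch (modern Python 3; the 100000 figure is an arbitrary illustration, not a recommendation from this thread):

```python
import gc

# Inspect the current thresholds. CPython's defaults are typically
# (700, 10, 10): a generation-0 collection runs once allocations minus
# deallocations exceed the first value.
g0, g1, g2 = gc.get_threshold()

# Raising the generation-0 threshold makes collection passes far rarer
# during a burst of allocations, at the cost of holding cyclic garbage
# longer.
gc.set_threshold(100000, g1, g2)
assert gc.get_threshold()[0] == 100000

# Restore the original settings when done.
gc.set_threshold(g0, g1, g2)
```

Unlike gc.disable(), this keeps cycle collection alive for long-running programs while taming the pathological case of millions of back-to-back allocations.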
Re: zip() function troubles
On Jul 27, 1:24 am, Peter Otten [EMAIL PROTECTED] wrote:
> When you are allocating a lot of objects without releasing them the
> garbage collector kicks in to look for cycles. Try switching it off:
>
> import gc
> gc.disable()

Yes, this solves the problem I was experiencing. Thanks.

Istvan
Re: zip() function troubles
On Jul 26, 4:25 pm, Istvan Albert [EMAIL PROTECTED] wrote:
> Now I know that zip() wastes lots of memory because it copies the
> content of the lists, I had used zip to try to trade memory for speed
> (heh!), and now that everything was replaced with izip it works just
> fine.
>
> What was really surprising is that it works with no issues up until 1
> million items, but for say 10 million it pretty much goes nuts. Does
> anyone know why?

There's nothing in izip() that holds memory, tracks indices, or is
sensitive to the length of its inputs (it simply calls the next method
on the input iterators and returns the tuple result). So, if you're
seeing behavior changes at 10 million items, the cause almost certainly
lies elsewhere. One possible candidate could be memory consumed by
immortal integer objects.

Raymond Hettinger
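The laziness Raymond describes is easy to see directly. In modern Python 3 the built-in zip behaves like the izip of this thread; a small sketch pairing a finite input against an infinite iterator, which only a lazy zip can do:

```python
from itertools import count

# count() is an infinite iterator. A lazy zip can pair against it
# because it never materializes a list of tuples; it pulls one item
# from each input per step and stops at the shortest.
pairs = zip(count(), "abc")

# Nothing is consumed until items are requested.
result = list(pairs)
assert result == [(0, "a"), (1, "b"), (2, "c")]
```

An eager zip (Python 2's built-in, the subject of this thread) would have to build all the tuples up front, which is exactly where the ten-million-item allocations came from.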
Re: zip() function troubles
On Jul 27, 2:18 pm, Raymond Hettinger [EMAIL PROTECTED] wrote:
> > What was really surprising is that it works
> > with no issues up until 1 million items

Later editing made the sentence more difficult to read. I should have
said: "What was really surprising is that zip works with no issues up
until 1 million items."

It was the zip function (and the garbage collection that it repeatedly
triggers) that caused the problem.

best,
Istvan
Re: zip() function troubles
Raymond Hettinger [EMAIL PROTECTED] writes:
> > What was really surprising is that it works with no issues up until
> > 1 million items, but for say 10 million it pretty much goes nuts.
> > Does anyone know why?
>
> There's nothing in izip() that holds memory, tracks indices, or ...

The issue was with zip, not izip. It was caused by gc running too often.
Re: zip() function troubles
Istvan Albert [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

| On Jul 27, 2:18 pm, Raymond Hettinger [EMAIL PROTECTED] wrote:
|
| > What was really surprising is that it works
| > with no issues up until 1 million items
|
| later editing made the sentence more difficult to read
| I should have said: What was really surprising is that zip works
| with no issues up until 1 million items
|
| It was the zip function (and the garbage collection that it repeatedly
| triggers) that caused the problem

It is not the zip function that caused the problem. It was the creation
of so many tuples without any deletions. The list comp example I gave
does the same thing and has the same problem. Nothing to do with
zipping per se.

tjr
Re: zip() function troubles
On Jul 26, 9:33 pm, Paul Rubin http://[EMAIL PROTECTED] wrote:
> Do a top or vmstat while that is happening and see if you are
> swapping. You are allocating 10 million ints and 10 million tuple
> nodes, = 20 million objects. Although, even at 100 bytes per object
> that would be 2GB, which would fit in your machine easily. Is it a
> 64 bit cpu?

We can safely drop the memory limit as being the cause and think about
something else. If you try it yourself you'll see that it is very easy
to generate 10 million tuples; on my system it takes 3 (!!!) seconds to
do the following:

size = 10**7
data = []
for i in range(10):
    x = [ (0,1) ] * size
    data.append( x )

Now it takes over two minutes to do this:

size = 10**7
a = [ 0 ] * size
b = zip(a,a)

The only explanation I can come up with is that the internal
implementation of zip must have some flaws.
Re: zip() function troubles
Istvan Albert [EMAIL PROTECTED] writes:
> I tested this on a linux server system with 4Gb of RAM
>
> a = [ 0 ] * 10**7
>
> takes milliseconds, but say the b = zip(a,a) will take a very long
> time to finish:

Do a top or vmstat while that is happening and see if you are swapping.
You are allocating 10 million ints and 10 million tuple nodes, = 20
million objects. Although, even at 100 bytes per object that would be
2GB, which would fit in your machine easily. Is it a 64 bit cpu?
Re: zip() function troubles
Istvan Albert [EMAIL PROTECTED] writes:
> Now it takes over two minutes to do this:
>
> size = 10**7
> a = [ 0 ] * size
> b = zip(a,a)

OK, I'm getting similar results under 64 bit Python 2.4.4c1 and also
under 2.5. About 103 seconds for 10**7 and 26 seconds for 5*10**6. So
it looks like zip is using quadratic time. I suggest entering a bug
report.
Re: zip() function troubles
Istvan Albert [EMAIL PROTECTED] writes:
> exceeded 10 million the zip function slowed to a crawl. Note that
> there was memory available to store over 100 million items.

How many bytes is that? Maybe the items (heap-allocated boxed integers
in your code example) are bigger than you expect.
Re: zip() function troubles
On Jul 26, 7:44 pm, Paul Rubin http://[EMAIL PROTECTED] wrote:
> Istvan Albert [EMAIL PROTECTED] writes:
> > exceeded 10 million the zip function slowed to a crawl. Note that
> > there was memory available to store over 100 million items.
>
> How many bytes is that? Maybe the items (heap-allocated boxed
> integers in your code example) are bigger than you expect.

While I don't have an answer to this, the point I was trying to make is
that I'm fairly certain it is not a memory issue (some sort of
swapping), because the overall memory usage with the OS included is
about 1Gb (out of the available 2Gb).

I tested this on a linux server system with 4Gb of RAM:

a = [ 0 ] * 10**7

takes milliseconds, but say the b = zip(a,a) will take a very long time
to finish:

atlas:~$ time python -c "a = [0] * 10**7"

real    0m0.165s
user    0m0.128s
sys     0m0.036s

atlas:~$ time python -c "a = [0] * 10**7; b = zip(a,a)"

real    0m55.150s
user    0m54.919s
sys     0m0.232s

Istvan
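Paul's question about per-item size can be answered with sys.getsizeof. A sketch (modern Python 3; the figures are CPython- and platform-specific, so treat them as lower bounds, not universal values):

```python
import sys

# A boxed integer and a 2-tuple each carry object-header overhead, so
# "10 million items" costs far more than 10 million machine words.
int_size = sys.getsizeof(12345)          # e.g. 28 bytes on 64-bit CPython
tuple_size = sys.getsizeof((12345, 12345))

print(int_size, tuple_size)

# Conservatively, a list of n (int, int) pairs needs roughly
# n * (tuple_size + 2 * int_size) bytes, plus the list's own array of
# n pointers.
n = 10**7
estimate = n * (tuple_size + 2 * int_size)
assert estimate > n * 8  # far more than one machine word per item
```

For n = 10**7 this lands around a gigabyte before the list itself is counted, which is consistent with the memory figures reported in the thread.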
Re: zip() function troubles
Istvan Albert wrote:
> I've been debugging the reason for a major slowdown in a piece of
> code ... and it turns out that it was the zip function. In the past
> the lists that were zipped were reasonably short, but once the size
> exceeded 10 million the zip function slowed to a crawl. Note that
> there was memory available to store over 100 million items.
>
> Now I know that zip() wastes lots of memory because it copies the
> content of the lists, I had used zip to try to trade memory for speed
> (heh!), and now that everything was replaced with izip it works just
> fine.
>
> What was really surprising is that it works with no issues up until 1
> million items, but for say 10 million it pretty much goes nuts. Does
> anyone know why? Is there some limit that it reaches, or is there
> something about the operating system (Vista in this case) that makes
> it behave like so?
>
> I've noticed the same kinds of behavior when trying to create very
> long lists that should easily fit into memory, yet above a given
> threshold I get inexplicable slowdowns. Now that I think about it, is
> this something about the way lists grow when expanding them?
>
> And here is the code:
>
> from itertools import izip
>
> BIGNUM = int(1E7)
>
> # let's make a large list
> data = range(BIGNUM)
>
> # this works fine (uses about 200 MB and 4 seconds)
> s = 0
> for x in data:
>     s += x
> print s
>
> # this works fine, 4 seconds as well
> s = 0
> for x1, x2 in izip(data, data):
>     s += x1
> print s
>
> # this takes over 2 minutes! and uses 600 MB of memory
> # the memory usage slowly ticks upwards
> s = 0
> for x1, x2 in zip(data, data):
>     s += x1
> print s

When you are allocating a lot of objects without releasing them, the
garbage collector kicks in to look for cycles. Try switching it off:

import gc
gc.disable()
try:
    # do the zipping
finally:
    gc.enable()

Peter
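Peter's disable/try/finally pattern maps naturally onto a context manager. A minimal sketch in modern Python (the name gc_disabled is my own, not from this thread, and the restore-previous-state twist is an addition to Peter's version):

```python
import gc
from contextlib import contextmanager

@contextmanager
def gc_disabled():
    """Temporarily switch off cycle collection, restoring the prior state."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()

# Usage: bulk-allocate without the collector repeatedly scanning.
with gc_disabled():
    pairs = list(zip(range(10**5), range(10**5)))

assert gc.isenabled()   # collector state restored afterwards
assert len(pairs) == 10**5
```

Restoring the previous state (rather than unconditionally enabling) makes the helper safe to nest or to use in code that may already run with gc off.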