On Apr 12, 11:11 am, [EMAIL PROTECTED] wrote:
> I should have been more specific about possible fixes.
>
> > > python2.5 -m timeit 'gc.disable();l=[(i,) for i in range(2000000)]'
> > > 10 loops, best of 3: 662 msec per loop
>
> > > python2.5 -m timeit 'gc.enable();l=[(i,) for i in range(2000000)]'
> > > 10 loops, best of 3: 15.2 sec per loop
>
> > In the latter case, the garbage collector apparently consumes about
> > 97% of the overall time.
>
> In a related thread on http://bugs.python.org/issue2607
> Amaury Forgeot d'Arc suggested a setting of the GC
> thresholds that actually solves the problem for me:
>
> > Disabling the gc may not be a good idea in a real application; I suggest
> > you to play with the gc.set_threshold function and set larger values, at
> > least while building the dictionary. (700, 1000, 10) seems to yield good
> > results.
> > python2.5 -m timeit 'gc.set_threshold(700,1000,10);l=[(i,) for i in
> > range(2000000)]'
> > 10 loops, best of 3: 658 msec per loop
>
> which made me suggest to use these as defaults, but then
> Martin v. Löwis wrote that
>
> > No, the defaults are correct for typical applications.
>
> At that point I felt lost, and as the general wish in that thread was
> to move the discussion to comp.lang.python, I brought it up here, in a
> modified and simplified form.
>
> > I would suggest to configure the default behaviour of the garbage
> > collector in such a way that this squared complexity is avoided
> > without requiring specific knowledge and intervention by the user. Not
> > being an expert in these details I would like to ask the gurus how
> > this could be done.
> > I hope this should be at least technically possible; whether it is
> > really desirable or important for a default installation of Python
> > could then be discussed once the disadvantages of such a setting
> > become apparent.
>
> I still don't see what is so good about defaults that lead to O(N*N)
> computation for an O(N) problem, and I like Amaury's suggestion a lot,
> so I would like to see comments on its disadvantages. Please don't
> tell me that O(N*N) is good enough. For N>1E7 it isn't.
>
> About some other remarks made here:
>
> > I think the linguists of the world should write better automated
> > translation systems. Not being an expert in these details I would like
> > to ask the gurus how it could be done.
>
> I fully agree, and that is why I (as someone involved with MT) would
> prefer to focus on MT and not on memory allocation issues, and by the
> way, this is why I normally prefer to work in Python, as it normally
> lets me focus on the problem instead of the tool.
>
> > There are going to be pathological cases in any memory allocation
> > scheme. The fact that you have apparently located one calls for you to
> > change your approach, not for the finely-tuned, well-conceived garbage
> > collection scheme to be abandoned for your convenience.
>
> I do not agree at all. Having to deal with N>1E7 objects is IMHO not
> pathological; it is just daily practice in data-oriented work (we're
> talking about deriving models from Gigawords, not about toy systems).
>
> > If you had made a definite proposal that could have been evaluated,
> > your request would perhaps have seemed a little more approachable.
>
> I feel it is ok to describe a generic problem without having found
> the answer yet.
> (If you disagree: where are your definite proposals wrt. MT? ;-)
> But I admit, I should have brought up Amaury's definite proposal
> right away.
>
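As an aside, the comparison quoted above can be reproduced with a small
stand-alone script along the following lines. This is only a sketch:
absolute numbers vary with the Python version and machine, and the
(700, 1000, 10) thresholds are simply the values Amaury suggested, not a
general recommendation.

import gc
import time

N = 2000000

def build():
    return [(i,) for i in range(N)]

def timed(label, prepare):
    saved = gc.get_threshold()
    gc.enable()
    prepare()                        # tweak the collector for this run
    start = time.time()
    build()
    print("%-34s %6.2f s" % (label, time.time() - start))
    gc.enable()                      # restore the default state
    gc.set_threshold(*saved)

timed("gc enabled (default thresholds)", lambda: None)
timed("gc disabled", gc.disable)
timed("gc.set_threshold(700, 1000, 10)",
      lambda: gc.set_threshold(700, 1000, 10))
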
> > A question often asked ... of people who are creating
> > very large data structures in Python is "Why are you doing that?"
> > That is, you should consider whether some kind of database solution
> > would be better. You mention lots of dicts--it sounds like some
> > balanced B-trees with disk loading on demand could be a good choice.
>
> I do it because I have to count frequencies of pairs of things that
> appear in real data. Extremely simple stuff, only interesting because
> of the size of the data set. After Amaury's hint to switch the GC off
> temporarily, I can count 100M pair tokens (18M pair types) in about
> 15 minutes, which is very cool, and I can do it in only a few lines
> of code, which is even cooler.
> Any approach that would touch the disk would be too slow for me, and
> bringing in complicated data structures by myself would distract me
> too much from what I need to do. That's exactly why I use Python;
> it has lots of highly fine-tuned, complicated stuff behind the scenes
> that is just doing the right thing, so I should not have to (and I
> could not) re-invent this myself.
>
> Thanks for the helpful comments,
> Andreas
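For what it's worth, the "few lines of code" counting job described
above might look roughly like this. It is a sketch, not Andreas's
actual code: `pair_iter` is a hypothetical stand-in for whatever
produces the (word1, word2) pair tokens, and the collector is switched
off only while the big dict is being filled, which is one of the
workarounds discussed in the thread.

import gc
from collections import defaultdict

def count_pairs(pair_iter):
    counts = defaultdict(int)
    gc.disable()                 # avoid repeated collections while
    try:                         # millions of dict entries are being built
        for pair in pair_iter:
            counts[pair] += 1
    finally:
        gc.enable()
    return counts

# Tiny demonstration; the real input would be on the order of 100M pair
# tokens yielding ~18M distinct pair types.
demo = count_pairs([("a", "b"), ("a", "b"), ("b", "c")])
print(demo[("a", "b")])          # -> 2
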
In the issue at bugs.python.org, why do you say "Do I have to switch to
Perl or C to get this done???", as if Perl and C were similar languages?
If we were to look at Python, Perl, and C together, Python and Perl
would be the similar languages, and C would be something quite
different, *especially* regarding the issue being discussed: speed.
So what do you mean by saying "Perl or C"?
--
http://mail.python.org/mailman/listinfo/python-list