2011/10/28 Brian Holt <[email protected]>:
>> Right, but it seems to me that this is exactly what we want to test the
>> hypothesis. Maybe I am being dense, as I'm rushing a bit through my mail,
>> but it seems to me that if you keep a reference to a, then you compensate
>> for the difference that was pointed out in the discussion below, i.e. that
>> the forest gets garbage-collected after each iteration. Thus I think that
>> it would enable us to test the hypothesis.
>
> import cPickle
> from datetime import datetime
>
> b = None
> for i in range(0, 20):
>     with open("forest%d.pkl" % (i), 'r') as f:
>         start = datetime.now()
>         a = cPickle.load(f)
>         print 'loaded ', i, datetime.now() - start
>     b = a  # keep a reference to a
>
> results in
>
> loaded 0 0:00:14.706000
> loaded 1 0:00:22.700545
> loaded 2 0:00:22.609137
> loaded 3 0:00:23.454734
> loaded 4 0:00:24.734567
> loaded 5 0:00:23.774540
> loaded 6 0:00:25.547649
> loaded 7 0:00:26.773837
> loaded 8 0:00:27.114894
> loaded 9 0:00:25.662419
> loaded 10 0:00:21.782435
> loaded 11 0:00:23.872373
> loaded 12 0:00:24.596157
> loaded 13 0:00:26.310549
> loaded 14 0:00:30.219642
> loaded 15 0:00:24.484561
> loaded 16 0:00:26.037760
> loaded 17 0:00:30.347977
> loaded 18 0:00:22.695595
> loaded 19 0:00:27.575407
>
> I think this confirms the hypothesis. By keeping a reference to `a`
> at each stage (except the first iteration), the time taken to load a
> subsequent item is more than if there is no reference, but less than
> if a reference to every item were stored.
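
For anyone who wants to rerun the quoted experiment without the original
forest files, here is a minimal sketch under stated assumptions. It is written
for Python 3, so `pickle` stands in for `cPickle`. The helper names
`make_fake_forest` and `time_loads` are hypothetical, and the synthetic list
of small dicts is only a stand-in for a real fitted forest. Absolute timings
will not match the numbers above.

```python
import os
import pickle
import tempfile
import time


def make_fake_forest(n_nodes=20000):
    # Hypothetical stand-in for a fitted forest: many small Python objects.
    return [{"feature": i % 10, "threshold": float(i), "children": [i, i + 1]}
            for i in range(n_nodes)]


def time_loads(paths, keep_reference):
    """Load each pickle in turn and return the per-file load times in seconds."""
    timings = []
    previous = None
    for path in paths:
        with open(path, "rb") as f:  # pickle files are opened in binary mode
            start = time.perf_counter()
            loaded = pickle.load(f)
            timings.append(time.perf_counter() - start)
        if keep_reference:
            previous = loaded  # the previous object stays alive during the next load
        del loaded  # without keep_reference, nothing survives to the next iteration
    return timings


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(5):
            path = os.path.join(tmp, "forest%d.pkl" % i)
            with open(path, "wb") as f:
                pickle.dump(make_fake_forest(), f)
            paths.append(path)
        for keep in (False, True):
            for i, t in enumerate(time_loads(paths, keep)):
                print("keep_reference=%s loaded %d %.4fs" % (keep, i, t))
```

Comparing the two runs shows whether holding on to the previous object changes
the load time of the next one on a given machine and Python version.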

So if I understand correctly, this is consistent with Giles' hypothesis on
memory fragmentation: the CPython VM does a poor job of handling the memory
for the many, many small objects created by each call to cPickle.load.

In the first iteration there is no previously allocated object in the Python
VM, so the timing is representative of the real unpickling cost. In the
subsequent iterations, the objects still allocated from the previous
iteration induce a Python memory management overhead (maybe related to bad
handling of fragmentation). Maybe this is caused by the garbage collector
(reference counting plus the cyclic collector) not scaling to a reference
graph with hundreds of thousands of edges.

I CC Victor, who might have an idea about this. Victor, here is the beginning
of the thread:

http://sourceforge.net/mailarchive/forum.php?thread_name=CAFvE7K4rrhKOYP9XpSNYs3s2OkVtgbH3AzPGiVjkU2h5Nqd6Eg%40mail.gmail.com&forum_name=scikit-learn-general
http://sourceforge.net/mailarchive/forum.php?thread_name=20111028114022.GA18119%40phare.normalesup.org&forum_name=scikit-learn-general
http://sourceforge.net/mailarchive/forum.php?thread_name=CAFkP-qzo%3DodjQfWTAxd7_Ybn-ar887nGtmCAEv-ZdqZ5qYWcsQ%40mail.gmail.com&forum_name=scikit-learn-general

As usual the sourceforge archive has broken the thread into 3 pieces...

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
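
PS: one way to separate the garbage-collector part of this hypothesis from
the fragmentation part would be to repeat the loads with the cyclic collector
switched off. The sketch below is only an illustration under assumptions: it
targets Python 3 (`pickle` instead of `cPickle`), the helper name
`timed_loads` is hypothetical, and the payload is synthetic rather than a
real forest.

```python
import gc
import pickle
import time


def timed_loads(payload, n_loads, disable_gc):
    """Unpickle `payload` n_loads times, keeping every result alive."""
    kept = []  # holding references means earlier objects are never freed
    timings = []
    was_enabled = gc.isenabled()
    if disable_gc:
        gc.disable()  # turns off the cyclic collector only; refcounting still runs
    try:
        for _ in range(n_loads):
            start = time.perf_counter()
            kept.append(pickle.loads(payload))
            timings.append(time.perf_counter() - start)
    finally:
        if was_enabled:
            gc.enable()  # restore the collector state found on entry
    return timings


if __name__ == "__main__":
    # Synthetic payload: many small container objects, which the cyclic GC tracks.
    payload = pickle.dumps([{"node": i, "children": [i, i + 1]}
                            for i in range(200000)])
    for disable in (False, True):
        times = timed_loads(payload, 5, disable_gc=disable)
        print("gc disabled=%s:" % disable, ["%.3fs" % t for t in times])
```

If the load times stay flat with the collector disabled but grow with it
enabled, that would point towards the cyclic collector rather than allocator
fragmentation. The outcome is likely to depend on the Python version.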
