Re: [Scikit-learn-general] Storing and loading decision tree classifiers

Olivier Grisel Fri, 28 Oct 2011 06:07:07 -0700

2011/10/28 Brian Holt <[email protected]>:
>> Right, but it seems to me that this is exactly what we want to test the
>> hypothesis. Maybe I am being dense, as I'm a bit rushing through my mail,
>> but it seems to me that if you keep a reference to a, then you compensate
>> for the difference that was pointed out in the discussion below, i.e. that
>> the forest gets garbage-collected after each iteration. Thus I think that
>> it would enable us to test the hypothesis.
>
>
>   import cPickle
>   from datetime import datetime
>
>   b = None
>   for i in range(0, 20):
>       # pickles are binary data, so open the file in 'rb' mode
>       with open("forest%d.pkl" % (i), 'rb') as f:
>           start = datetime.now()
>           a = cPickle.load(f)
>           print 'loaded ', i, datetime.now() - start
>           b = a # keep a reference to a
>
> results in
>
> loaded  0 0:00:14.706000
> loaded  1 0:00:22.700545
> loaded  2 0:00:22.609137
> loaded  3 0:00:23.454734
> loaded  4 0:00:24.734567
> loaded  5 0:00:23.774540
> loaded  6 0:00:25.547649
> loaded  7 0:00:26.773837
> loaded  8 0:00:27.114894
> loaded  9 0:00:25.662419
> loaded  10 0:00:21.782435
> loaded  11 0:00:23.872373
> loaded  12 0:00:24.596157
> loaded  13 0:00:26.310549
> loaded  14 0:00:30.219642
> loaded  15 0:00:24.484561
> loaded  16 0:00:26.037760
> loaded  17 0:00:30.347977
> loaded  18 0:00:22.695595
> loaded  19 0:00:27.575407
>
> I think this confirms the hypothesis.  By keeping a reference to `a`
> at each stage (except the first iteration), the time taken to load a
> subsequent item is more than if there is no reference, but less than
> if a reference to every item were stored.


So if I understand correctly, this is consistent with Giles' hypothesis
on memory fragmentation: the CPython VM does a poor job of handling the
memory for the many, many small objects loaded by each call to
cPickle.load.

In the first iteration there is no previously allocated object in the
Python VM and the timing is representative of the real unpickling,
while in the subsequent iterations the allocated objects of the
previous iteration induce a Python memory management overhead (maybe
related to a bad handling of fragmentation).
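
To make the comparison easier to reproduce without the forest files, here
is a minimal sketch that times repeated unpickling under three retention
policies (keep nothing, keep only the last object, keep every object). It
is only an illustration under stated assumptions: it targets Python 3,
where cPickle has been merged into pickle, the payload is a synthetic list
of small dicts standing in for a pickled forest, and the names
make_payload and time_loads are hypothetical helpers, not part of
scikit-learn.

```python
import pickle
import time


def make_payload(n_nodes=50000):
    """Pickle many small objects (a synthetic stand-in for a forest)."""
    nodes = [{"feature": i % 10, "threshold": float(i)} for i in range(n_nodes)]
    return pickle.dumps(nodes)


def time_loads(payload, n_iter=5, keep="none"):
    """Return the time in seconds of each of n_iter pickle.loads calls.

    keep="none": drop each loaded object before the next load
    keep="last": hold a reference to the most recent object only
    keep="all":  hold a reference to every loaded object
    """
    kept = []
    timings = []
    for _ in range(n_iter):
        start = time.perf_counter()
        obj = pickle.loads(payload)
        timings.append(time.perf_counter() - start)
        if keep == "all":
            kept.append(obj)
        elif keep == "last":
            kept = [obj]
        del obj  # make sure only `kept` can keep the object alive
    return timings


if __name__ == "__main__":
    payload = make_payload()
    for policy in ("none", "last", "all"):
        print(policy, ["%.3f" % t for t in time_loads(payload, keep=policy)])
```

Whether the synthetic payload shows the same slowdown as the real forests
depends on the Python version and on the object graph, so the numbers are
only meaningful relative to each other on the same machine.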

Maybe this is caused by the ref counting garbage collector that does
not scale to a reference graph with hundreds of thousands of edges.
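
One way to probe the garbage collector idea would be to switch off
CPython's cyclic collector while unpickling and compare the timings. Note
that the gc module only controls the cyclic collector; reference counting
itself cannot be disabled. The following is a hedged sketch for Python 3
(pickle instead of cPickle), and timed_load is a hypothetical helper
introduced only for this illustration.

```python
import gc
import pickle
import time


def timed_load(payload, disable_gc):
    """Unpickle `payload`; return (object, elapsed seconds).

    If disable_gc is true, the cyclic garbage collector is turned off
    for the duration of the load and restored to its previous state.
    """
    was_enabled = gc.isenabled()
    if disable_gc:
        gc.disable()
    try:
        start = time.perf_counter()
        obj = pickle.loads(payload)
        elapsed = time.perf_counter() - start
    finally:
        if was_enabled:
            gc.enable()
    return obj, elapsed


if __name__ == "__main__":
    # Synthetic payload: many small container objects (an assumption,
    # standing in for a pickled forest).
    payload = pickle.dumps([{"node": i} for i in range(200000)])
    kept = []  # keep references, as in the experiment quoted above
    for i in range(5):
        obj_on, t_on = timed_load(payload, disable_gc=False)
        obj_off, t_off = timed_load(payload, disable_gc=True)
        kept.extend([obj_on, obj_off])
        print("iter %d  gc on: %.3fs  gc off: %.3fs" % (i, t_on, t_off))
```

If the load times stay flat with the collector disabled but grow with it
enabled, that would point at the cyclic collector rather than at
fragmentation in the allocator.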

I CC Victor, who might have an idea about this. Victor, here is the
beginning of the thread:

http://sourceforge.net/mailarchive/forum.php?thread_name=CAFvE7K4rrhKOYP9XpSNYs3s2OkVtgbH3AzPGiVjkU2h5Nqd6Eg%40mail.gmail.com&forum_name=scikit-learn-general

http://sourceforge.net/mailarchive/forum.php?thread_name=20111028114022.GA18119%40phare.normalesup.org&forum_name=scikit-learn-general

http://sourceforge.net/mailarchive/forum.php?thread_name=CAFkP-qzo%3DodjQfWTAxd7_Ybn-ar887nGtmCAEv-ZdqZ5qYWcsQ%40mail.gmail.com&forum_name=scikit-learn-general
As usual the sourceforge archive has broken the thread into 3 pieces...

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

