On Fri, Oct 28, 2011 at 10:30:01AM +0100, Brian Holt wrote:
> cPickle with HIGHEST_PROTOCOL is significantly faster, it averages 15
> seconds to load the 10 tree forest compared to the 5 minutes without.

Good. Thus we do not need to do any modifications to the existing code,
it seems.
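
For reference, a minimal sketch of what "HIGHEST_PROTOCOL" means in
practice. It assumes Python 3, where `pickle` already uses the C
implementation (on Python 2 the import would be `cPickle`). The
`forest` object is a hypothetical stand-in, not a real scikit-learn
estimator.

```python
import pickle  # on Python 2: import cPickle as pickle


def dump_fast(obj):
    """Serialize obj with the most recent (binary) pickle protocol."""
    return pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)


def load(data):
    """Deserialize; the protocol is detected from the data itself."""
    return pickle.loads(data)


# Hypothetical stand-in for a forest: a list of small node records.
forest = [
    {"feature": i, "threshold": i * 0.5, "children": [i + 1, i + 2]}
    for i in range(1000)
]

blob = dump_fast(forest)
restored = load(blob)
```

The protocol only has to be chosen when dumping; loading needs no
matching argument.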

> What still confuses me is why loading the forests and storing them in
> a list should be any slower than loading them individually. 

Technically, I do not think that pickling/unpickling is an O(n)
algorithm, with n the number of objects. I think that its cost grows
faster than that. Having looked at the corresponding code, one of the
reasons is that pickling actually works on a graph of possibly
self-referencing objects (a list can contain a dict that contains the
same list). Thus, to avoid going into infinite loops, the pickler needs
to do loop detection, which it does by checking the 'id' of the
different objects.
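
The self-referencing case above can be reproduced in a few lines. This
is only an illustration of the observable behaviour (cycles and shared
references survive a round trip), written for Python 3; it says nothing
about the cost of the bookkeeping.

```python
import pickle

# A list that contains a dict that contains the same list.
a = []
container = {"self_list": a}
a.append(container)

# Without loop detection this would recurse forever; instead the
# cycle is rebuilt on load.
restored = pickle.loads(pickle.dumps(a, protocol=pickle.HIGHEST_PROTOCOL))
cycle_preserved = restored[0]["self_list"] is restored

# The same bookkeeping keeps shared objects shared within one pickle.
shared = [1, 2, 3]
pair = [shared, shared]
restored_pair = pickle.loads(pickle.dumps(pair))
sharing_preserved = restored_pair[0] is restored_pair[1]
```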

In short, the objects that you are storing have a specific structure
(they are unconnected). By storing them separately, you are benefiting
from your knowledge of the structure, but the pickling/unpickling
algorithm, which solves the general case, does not know that.
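
A rough sketch of the two storage strategies, for Python 3. The helper
names are made up for illustration, and the sketch makes no claim about
which one is faster on a given Python version; that has to be timed on
the real forests.

```python
import pickle


def dump_separately(objects):
    """One dumps call per object: each call starts with a fresh memo."""
    return [
        pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
        for obj in objects
    ]


def load_separately(blobs):
    """Inverse of dump_separately."""
    return [pickle.loads(blob) for blob in blobs]


def dump_together(objects):
    """One dumps call for the whole list: a single memo covers everything."""
    return pickle.dumps(list(objects), protocol=pickle.HIGHEST_PROTOCOL)


# Hypothetical stand-ins for unconnected forests.
forests = [[{"tree": t, "nodes": list(range(50))}] for t in range(10)]

restored_separately = load_separately(dump_separately(forests))
restored_together = pickle.loads(dump_together(forests))
```

Note that pickling separately drops any sharing between the objects,
which is harmless only because they are unconnected.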

Gaël

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
