Once a decision tree (or a forest) has been trained, I almost always
want to save the resulting classifier to disk and then load it again
at a later stage for testing.
My dataset is 5.2GB on disk: (690K * 2K) float32s. I can load this
into memory using `np.load('dataset.npy')` in 20 seconds on our
server.
When a decision tree is trained to depth 20 and pickled, it requires
between 200MB and 300MB on disk, but here is the kicker: it takes
*hours* to load it up. Last time I tried, it took 16 hours to load a
forest of 10 trees.
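One thing that can make a large difference with plain pickle: the default protocol (0) is ASCII-based and very slow for objects that contain large numerical arrays, while an explicit binary protocol is much more compact and faster to read back. A minimal sketch, using a dictionary of float32 arrays as a hypothetical stand-in for a trained tree:

```python
import pickle
import numpy as np

# Hypothetical stand-in for a trained classifier: any object holding
# large float32 arrays exhibits the same protocol-dependent behavior.
model = {"values": np.zeros((1000, 10), dtype=np.float32)}

# Request the highest binary protocol explicitly; pickle.dump defaults
# to the slow ASCII protocol 0 if none is given.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)

with open("model.pkl", "rb") as f:
    restored = pickle.load(f)
```

If the trees were pickled without an explicit protocol, re-saving them once with a binary protocol may already shrink the files and speed up loading.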
My question is: is there a way to improve the performance of loading
classifiers, either by using different pickle options (of which I
don't know any, but there may be), by using a different scheme
(marshalling sounds promising based on [1]), or in some other way?
Perhaps I could implement a pickle loader in Cython?
[1] http://stackoverflow.com/questions/329249/why-is-marshal-so-much-faster-than-pickle
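As an alternative scheme to marshalling, joblib (installable separately, and also shipped with scikit-learn in some releases) serializes the NumPy arrays inside an object as raw binary buffers rather than pushing them through the pickle stream, which is typically much faster for array-heavy estimators. A sketch, again with a hypothetical array-holding object standing in for a trained forest:

```python
import joblib
import numpy as np

# Hypothetical stand-in for a trained forest; joblib.dump handles
# arbitrary picklable objects but special-cases the numpy arrays
# they contain, writing them as raw binary buffers.
model = {"values": np.zeros((1000, 10), dtype=np.float32)}

joblib.dump(model, "model.joblib")
restored = joblib.load("model.joblib")
```

joblib.dump also accepts a compress argument if disk size matters more than load speed.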
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general