2011/11/3 Peter Prettenhofer <[email protected]>:
> Hi everybody,
>
> I created an experimental branch [1] which uses numpy arrays (as Gael
> suggested) instead of the composite structure to represent the tree.
>
> The reason for this was two-fold: first, storage is more compact (no
> structure padding) and writing/reading to disc is more efficient and
> second, traversing the composite structure in cython is inefficient
> compared to pure C. I assume that the reason for the latter is the
> reference counting overhead when we traverse the structure (look at
> the generated c code of the `apply_tree` function in `_tree.c`). I ran
> into this performance problem when I benched my gradient boosting code
> [2] against its R counterpart gbm.
>
> According to our covertype benchmark the new representation is a bit
> slower at training time due to the array re-sizing operations; it's
> about a factor of 4-5 faster at prediction time - competitive with
> liblinear on our benchmark! The graphviz exporter has not been updated
> yet - so one test fails.

Interesting. What is the order of magnitude of the decrease in speed at
fit time?

Do you think we could get the best of both worlds by keeping the cython
struct internally at fit time and converting it to the array
representation at the end of the fit function, to gain the efficient
serialization and the best prediction performance?

Two downsides to this approach:

- probably more code to maintain (unless the fit implementation with the
  cython struct is significantly simpler than its array representation
  counterpart)

- this will prevent us from implementing a tree fit with warm restart,
  where the tree is further grown from an initial tree state (I wonder
  whether there is any use case for that; it might not be an issue if
  not)

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
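
To make the hybrid idea above concrete (grow the tree with a composite
node structure, then convert it to flat arrays for prediction), here is
a minimal pure-Python sketch. It is not the code from the branch: the
`Node` class, the `LEAF` sentinel, and the `flatten` / `predict_one`
helpers are hypothetical names chosen for illustration, and the stdlib
`array` module stands in for numpy arrays so the example runs without
dependencies.

```python
from array import array
from dataclasses import dataclass
from typing import Optional

LEAF = -1  # sentinel child index marking a leaf (an assumption of this sketch)


@dataclass
class Node:
    """Composite (pointer-based) node, as might be used while growing a tree."""
    feature: int = -1
    threshold: float = 0.0
    value: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def flatten(root):
    """Convert a composite tree into parallel arrays, one entry per node."""
    feature = array("i")
    threshold = array("d")
    value = array("d")
    left = array("i")
    right = array("i")

    def visit(node):
        # Reserve this node's slot first, then fill in the child indices.
        idx = len(feature)
        feature.append(node.feature)
        threshold.append(node.threshold)
        value.append(node.value)
        left.append(LEAF)
        right.append(LEAF)
        if node.left is not None and node.right is not None:
            left[idx] = visit(node.left)
            right[idx] = visit(node.right)
        return idx

    visit(root)
    return feature, threshold, value, left, right


def predict_one(x, arrays):
    """Walk the flat arrays from the root (index 0) down to a leaf."""
    feature, threshold, value, left, right = arrays
    node = 0
    while left[node] != LEAF:
        if x[feature[node]] <= threshold[node]:
            node = left[node]
        else:
            node = right[node]
    return value[node]


# A tiny hand-built tree: split on feature 0, then on feature 1.
root = Node(feature=0, threshold=0.5,
            left=Node(value=1.0),
            right=Node(feature=1, threshold=2.0,
                       left=Node(value=2.0), right=Node(value=3.0)))
arrays = flatten(root)
```

Prediction only touches integer indices and flat buffers, which is the
property the array representation relies on. The warm-restart downside
also shows up here: once flattened, growing the tree further means
appending to (and possibly resizing) every array.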
