Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-11-04 Thread Gael Varoquaux
On Thu, Nov 03, 2011 at 11:40:30PM +0100, Peter Prettenhofer wrote: > I created an experimental branch which uses numpy arrays (as Gael > suggested) instead of the composite structure to represent the tree. Great work (as usual)! Thanks heaps. What I really like about this (in addition to the pe

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-11-04 Thread Gilles Louppe
I have just submitted a PR to brian's branch :) On 4 November 2011 11:13, Peter Prettenhofer wrote: > Gilles, > > I was not aware of your work in _tree.pyx. Looks great! Still, I > didn't touch any line in `find_best_split` so the merging/rebase > should be quite straight-forward (though not fast

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-11-04 Thread Peter Prettenhofer
Gilles, I was not aware of your work in _tree.pyx. Looks great! Still, I didn't touch any line in `find_best_split` so the merging/rebase should be quite straight-forward (though not fast-forward). @Gilles: it would be great if you could submit a PR to Brian's enh/ensemble branch. thanks, Peter

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-11-04 Thread Brian Holt
>I have myself made a lot of changes in tree.py and _tree.pyx in a lot of places in the code. Wouldn't it be easier for you to merge your code into my files? As I see in [1, 2] your changes are localized, and hence it would be quicker for you to merge them into my files than for me merging all my c

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-11-04 Thread Gilles Louppe
Peter, I have myself made a lot of changes in tree.py and _tree.pyx in a lot of places in the code. Wouldn't it be easier for you to merge your code into my files? As I see in [1, 2] your changes are localized, and hence it would be quicker for you to merge them into my files than for me merging a

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-11-04 Thread Peter Prettenhofer
2011/11/4 Brian Holt : > @pprett:  Thanks for doing the hard work to change the tree into a numpy > representation.  I have been thinking a lot about it, and I was just about > to implement it, but you've got there first. I have a few suggestions after > looking at your code that I'd like to try ou

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-11-04 Thread Peter Prettenhofer
Hi Gilles, thanks! I'll invest some more time into the array repr (tests, visitor, benchmarks) and ask Brian for feedback - if that's ok I'd suggest we merge the array repr into master and rebase ensemble (and gradient boosting). I can do the rebasing -> it shouldn't be a huge problem for random f

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-11-04 Thread Brian Holt
@pprett: Thanks for doing the hard work to change the tree into a numpy representation. I have been thinking a lot about it, and I was just about to implement it, but you've got there first. I have a few suggestions after looking at your code that I'd like to try out, so I might make a clone. ---

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-11-04 Thread Olivier Grisel
2011/11/4 Peter Prettenhofer : > [..] >> >> Interesting. What is the order of magnitude of the decrease in speed >> at fit time? > > IMHO it's negligible > > here are some timings for:: > >    rs = np.random.RandomState(13) >    X = rs.rand(5, 100) >    y = rs.randint(2, size=5) >    from s

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-11-04 Thread Peter Prettenhofer
[..] > > Interesting. What is the order of magnitude of the decrease in speed > at fit time? IMHO it's negligible here are some timings for:: rs = np.random.RandomState(13) X = rs.rand(5, 100) y = rs.randint(2, size=5) from sklearn.tree import tree clf = tree.Decision

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-11-04 Thread Olivier Grisel
2011/11/3 Peter Prettenhofer : > Hi everybody, > > I created an experimental branch [1] which uses numpy arrays (as Gael > suggested) instead of the composite structure to represent the tree. > > The reason for this was two-fold: first, storage is more compact (no > structure padding) and writing/r

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-11-03 Thread Gilles Louppe
Peter, This looks very good! I will definitely have a look at it later. However, as I warned in pull request #385, I have been making changes [1] to the tree code and to the ensemble branch. I guess our future patches are in conflict :( How should we proceed? [1] https://github.com/glouppe/sciki

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-11-03 Thread Peter Prettenhofer
Hi everybody, I created an experimental branch [1] which uses numpy arrays (as Gael suggested) instead of the composite structure to represent the tree. The reason for this was two-fold: first, storage is more compact (no structure padding) and writing/reading to disc is more efficient and second
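As a rough illustration of what "numpy arrays instead of the composite structure" can look like, here is a minimal sketch. It is not the code from Peter's branch: the array names (`children_left`, `children_right`, `feature`, `threshold`, `value`), the `LEAF` sentinel and the `predict_one` helper are all assumptions made for this example. Each node is a row index into a set of parallel arrays, so the whole tree can be written and read back as a handful of flat buffers.

```python
import io

import numpy as np

LEAF = -1  # assumed sentinel marking "no child" (hypothetical convention)

# A tiny hand-built tree with three nodes:
# node 0 splits on feature 0 at 0.5; nodes 1 and 2 are leaves.
arrays = {
    "children_left": np.array([1, LEAF, LEAF], dtype=np.int32),
    "children_right": np.array([2, LEAF, LEAF], dtype=np.int32),
    "feature": np.array([0, LEAF, LEAF], dtype=np.int32),
    "threshold": np.array([0.5, 0.0, 0.0], dtype=np.float64),
    "value": np.array([0, 0, 1], dtype=np.int32),  # class stored at each node
}


def predict_one(tree, x):
    """Walk from the root to a leaf by following array indices."""
    node = 0
    while tree["children_left"][node] != LEAF:
        if x[tree["feature"][node]] <= tree["threshold"][node]:
            node = tree["children_left"][node]
        else:
            node = tree["children_right"][node]
    return int(tree["value"][node])


# Storing and loading is a plain array dump; an in-memory buffer
# stands in for a file on disk here.
buf = io.BytesIO()
np.savez(buf, **arrays)
buf.seek(0)
restored = {name: data for name, data in np.load(buf).items()}

left_class = predict_one(restored, np.array([0.2]))
right_class = predict_one(restored, np.array([0.9]))
```

With this layout there are no per-node Python objects to reconstruct at load time, which is the property the thread is after.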

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-10-28 Thread Olivier Grisel
Victor replied to me in a private message: it might be caused by this bug http://bugs.python.org/issue12775 . Brian, can you disable the gc and re-run your scripts to check whether this is the case? `import gc; gc.disable()` Also Victor would like to know whether the situation is better in pyt
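The experiment suggested here can be sketched as follows. This is a hedged example, not Brian's script: it uses Python 3's `pickle` in place of `cPickle`, and the `gc_paused` and `load_all` helpers as well as the stand-in "forests" are hypothetical. The idea is to switch the cyclic garbage collector off only while the objects are being unpickled and kept in a list, then restore its previous state.

```python
import gc
import pickle
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Disable the cyclic garbage collector inside the block, then restore it."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()


def load_all(blobs):
    """Unpickle every blob while keeping a reference to each result."""
    loaded = []
    with gc_paused():
        for blob in blobs:
            loaded.append(pickle.loads(blob))
    return loaded


# Hypothetical stand-ins for the pickled forests: lists of small dicts.
blobs = [
    pickle.dumps([{"node": j} for j in range(1000)],
                 protocol=pickle.HIGHEST_PROTOCOL)
    for _ in range(5)
]

state_before = gc.isenabled()
forests = load_all(blobs)
state_after = gc.isenabled()
```

If load times stop growing from one iteration to the next with the collector paused, that would point towards the collector repeatedly traversing the objects that are already loaded.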

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-10-28 Thread Olivier Grisel
2011/10/28 Brian Holt : >> Right, but it seems to me that this is exactly what we want to test the >> hypothesis. Maybe I am being dense, as I'm a bit rushing through my mail, >> but it seems to me that if you keep a reference to a, then you compensate >> for the difference that was pointed ou

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-10-28 Thread Brian Holt
> Right, but it seems to me that this is exactly what we want to test the > hypothesis. Maybe I am being dense, as I'm a bit rushing through my mail, > but it seems to me that if you keep a reference to a, then you compensate > for the difference that was pointed out in the discussion below, i

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-10-28 Thread Gael Varoquaux
On Fri, Oct 28, 2011 at 10:55:08AM +0100, Brian Holt wrote: > > > Interesting. This hypothesis should be testable, for instance by > > > keeping a reference on 'a', appending it to a list. I'd be > > > interested in the results, if you mind trying out Brian. > > I'm not sure I understand. I th

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-10-28 Thread Olivier Grisel
2011/10/28 Gilles Louppe : >> Still, almost 4 minutes just to extend the python heap and reallocate >> a bunch of already allocated objects seems unlikely. Also I don't >> understand why the Python interpreter would need to "move" allocated >> object: it can just grow the heap, reallocate a larger

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-10-28 Thread Gilles Louppe
Here is an interesting link: http://revista.python.org.ar/2/en/html/memory-fragmentation.html On 28 October 2011 12:12, Gilles Louppe wrote: >> Still, almost 4 minutes just to extend the python heap and reallocate >> a bunch of already allocated objects seems unlikely. Also I don't >> understand

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-10-28 Thread Gilles Louppe
> Still, almost 4 minutes just to extend the python heap and reallocate > a bunch of already allocated objects seems unlikely. Also I don't > understand why the Python interpreter would need to "move" allocated > object: it can just grow the heap, reallocate a larger buffer list (if > needed, with

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-10-28 Thread Brian Holt
>Still, almost 4 minutes just to extend the python heap and reallocate >a bunch of already allocated objects seems unlikely. Also I don't >understand why the Python interpreter would need to "move" allocated >object: it can just grow the heap, reallocate a larger buffer list (if >needed, with just

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-10-28 Thread Brian Holt
> Interesting. This hypothesis should be testable, for instance by keeping a > reference on 'a', appending it to a list. I'd be interested in the results, > if you mind trying out Brian. I'm not sure I understand. I thought that by appending to a list I am keeping a reference to the object.

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-10-28 Thread Olivier Grisel
2011/10/28 Gilles Louppe : >> >> loaded  19 0:03:55.910640 > > In contrast, in this case, forests can no longer be garbage-collected > and new memory need to be allocated at each iteration, the private heap > need to be extended and so on. In the process, I suspect that objects > are moved to one p

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-10-28 Thread Gael Varoquaux
On Fri, Oct 28, 2011 at 11:47:48AM +0200, Gilles Louppe wrote: > >    import cPickle > >    for i in range(0, 20): > >        with open("forest%d.pkl" % (i), 'r') as f: > >            start = datetime.now() > >            a = cPickle.load(f) > >            print 'loaded ', i, datetime.now() - star

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-10-28 Thread Gilles Louppe
>    import cPickle > >    for i in range(0, 20): >        with open("forest%d.pkl" % (i), 'r') as f: >            start = datetime.now() >            a = cPickle.load(f) >            print 'loaded ', i, datetime.now() - start > > produce these run-time results > > loaded  0 0:00:14.952436 > loaded

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-10-28 Thread Gael Varoquaux
On Fri, Oct 28, 2011 at 10:30:01AM +0100, Brian Holt wrote: > cPickle with HIGHEST_PROTOCOL is significantly faster, it averages 15 > seconds to load the 10 tree forest compared to the 5 minutes without. Good. Thus we do not need to do any modifications to the existing code, it seems. > What stil

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-10-28 Thread Brian Holt
cPickle with HIGHEST_PROTOCOL is significantly faster, it averages 15 seconds to load the 10 tree forest compared to the 5 minutes without. What still confuses me is why loading the forests and storing them in a list should be any slower than loading them individually. In other words, why should

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-10-27 Thread bdholt1
100K nodes is not much larger than my test (60K)... have you checked the memory consumption during the load operation? I suspect that you run out of memory and the huge overhead is due to thrashing. 2011/10/27 Brian Holt : >

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-10-27 Thread Peter Prettenhofer
100K nodes is not much larger than my test (60K)... have you checked the memory consumption during the load operation? I suspect that you run out of memory and the huge overhead is due to thrashing. 2011/10/27 Brian Holt : > Firstly, thanks for all the helpful comments.  I didn't know that the > p

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-10-27 Thread Brian Holt
Firstly, thanks for all the helpful comments. I didn't know that the protocol made such a big difference, so until now in ignorance I've been using the default. That said, I left a test running last night on one of our centre's servers and it took 8hrs to load 20 forests ( each with 10 trees, dep

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-10-26 Thread Peter Prettenhofer
I just dumped and loaded a fairly large tree (~4 nodes; from bench_sgd_covertype.py) with cPickle, both operations performed in less than 1 sec (w/ and w/o HIGHEST_PROTOCOL). Brian: how large are your trees (are they complete binary trees?) best, Peter 2011/10/26 Peter Prettenhofer : > br

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-10-26 Thread Peter Prettenhofer
brian, try to save the tree using:: cPickle.dump(tree, f, cPickle.HIGHEST_PROTOCOL) if this doesn't solve the issue we should reconsider Gael's array representation. best, peter On 26.10.2011 14:37, "Andreas Mueller" wrote: > > > My question is; is there a way to improve the performance of loa
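For readers on Python 3, where `cPickle` has been folded into `pickle`, the same suggestion might look like the sketch below. The `model` dictionary is only a stand-in for a fitted tree, and `dump_highest` is a hypothetical helper name. The point is the explicit `protocol` argument, which selects the newest binary format instead of the default.

```python
import io
import pickle


def dump_highest(obj, fileobj):
    """Serialize obj with the newest (binary) pickle protocol available."""
    pickle.dump(obj, fileobj, protocol=pickle.HIGHEST_PROTOCOL)


# Stand-in for a fitted classifier (hypothetical data, not a real tree).
model = {"feature": list(range(1000)), "threshold": [0.5] * 1000}

# An in-memory buffer plays the role of the file opened in binary mode.
buf = io.BytesIO()
dump_highest(model, buf)

# Binary pickles start with the PROTO opcode followed by the protocol number.
header = buf.getvalue()[:2]

buf.seek(0)
restored = pickle.load(buf)
```

When writing to a real file, it needs to be opened in binary mode (`'wb'` for dumping, `'rb'` for loading), since binary protocols are not text-safe.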

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-10-26 Thread Gael Varoquaux
On Wed, Oct 26, 2011 at 01:35:07PM +0100, Brian Holt wrote: > My question is; is there a way to improve the performance of loading > classifiers, either using different pickle options (of which I don't > know any, but there may be), or by using a different scheme > (marshalling sounds promising bas

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-10-26 Thread Andreas Mueller
> My question is; is there a way to improve the performance of loading > classifiers, either using different pickle options (of which I don't > know any, but there may be) > > Just to be sure, you used the latest pickling format, right? cPickle uses the oldest one by default afaik. --

[Scikit-learn-general] Storing and loading decision tree classifiers

2011-10-26 Thread Brian Holt
Once a Decision Tree (or a forest) has been trained, I almost always want to save the resulting classifier to disk and then load the classifier at a later stage for testing. My dataset is 5.2GB on disk: (690K * 2K) float32s. I can load this into memory using `np.load('dataset.npy')` in 20 secon