On Thu, Nov 03, 2011 at 11:40:30PM +0100, Peter Prettenhofer wrote:
> I created an experimental branch which uses numpy arrays (as Gael
> suggested) instead of the composite structure to represent the tree.
Great work (as usual)! Thanks heaps.
What I really like about this (in addition to the pe
I have just submitted a PR to brian's branch :)
On 4 November 2011 11:13, Peter Prettenhofer
wrote:
> [...]
Gilles,
I was not aware of your work in _tree.pyx. Looks great! Still, I
didn't touch any line in `find_best_split` so the merging/rebase
should be quite straight-forward (though not fast-forward). @Gilles:
it would be great if you could submit a PR to Brian's enh/ensemble
branch.
thanks,
Peter
Peter,
I have myself made a lot of changes in tree.py and _tree.pyx in a lot
of places in the code. Wouldn't it be easier for you to merge your
code into my files? As I see in [1, 2] your changes are localized, and
hence it would be quicker for you to merge them into my files than for
me merging a
2011/11/4 Brian Holt :
> [...]
Hi Gilles,
thanks!
I'll invest some more time into the array repr (tests, visitor,
benchmarks) and ask Brian for feedback - if that's ok I'd suggest we
merge the array repr into master and rebase ensemble (and gradient
boosting). I can do the rebasing -> it shouldn't be a huge problem for
random f
@pprett: Thanks for doing the hard work to change the tree into a numpy
representation. I have been thinking a lot about it, and I was just about
to implement it, but you've got there first. I have a few suggestions after
looking at your code that I'd like to try out, so I might make a clone.
---
2011/11/4 Peter Prettenhofer :
> [...]
[..]
>
> Interesting. What is the order of magnitude of the decrease in speed
> at fit time?
IMHO it's negligible
here are some timings for::
import numpy as np
rs = np.random.RandomState(13)
X = rs.rand(5, 100)
y = rs.randint(2, size=5)
from sklearn.tree import tree
clf = tree.Decision
2011/11/3 Peter Prettenhofer :
> [...]
Peter,
This looks very good! I will definitely have a look at it later.
However, as I warned in pull request #385, I have been making changes
[1] to the tree code and to the ensemble branch. I guess our future
patches are in conflict :( How should we proceed?
[1] https://github.com/glouppe/sciki
Hi everybody,
I created an experimental branch [1] which uses numpy arrays (as Gael
suggested) instead of the composite structure to represent the tree.
The reason for this was two-fold: first, storage is more compact (no
structure padding) and writing/reading to disc is more efficient and
second
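The snippet is cut off here, but the flat-array idea itself can be sketched. All names below (`children_left`, `feature`, etc.) are illustrative guesses at such a layout, not necessarily the branch's actual field names:

```python
import numpy as np

# Parallel arrays encode the tree: node i's attributes live at index i.
# -1 marks a leaf.  No Python objects per node, so storage is compact and
# the whole structure can be written to disc as a handful of arrays.
children_left  = np.array([1, -1, -1], dtype=np.int32)
children_right = np.array([2, -1, -1], dtype=np.int32)
feature        = np.array([0, -1, -1], dtype=np.int32)
threshold      = np.array([0.5, 0.0, 0.0], dtype=np.float64)
value          = np.array([0.5, 0.0, 1.0], dtype=np.float64)

def predict_one(x):
    """Walk the arrays from the root until a leaf is reached."""
    node = 0
    while children_left[node] != -1:
        if x[feature[node]] <= threshold[node]:
            node = children_left[node]
        else:
            node = children_right[node]
    return value[node]

print(predict_one([0.2]))  # left leaf  -> 0.0
print(predict_one([0.9]))  # right leaf -> 1.0
```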
Victor replied to me in a private message: it might be caused by this
bug http://bugs.python.org/issue12775 .
Brian, can you disable the gc and re-run your scripts to check
whether this is the case?
import gc
gc.disable()
Also Victor would like to know whether the situation is better in pyt
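In today's Python 3 terms (`pickle` rather than `cPickle`, and an in-memory payload standing in for a forest file), the suggested experiment looks roughly like this:

```python
import gc
import pickle

# Stand-in payload: a pickled forest would be a much larger object graph.
data = pickle.dumps([list(range(100)) for _ in range(1000)])

gc.disable()                 # suppress collector passes during the load
try:
    obj = pickle.loads(data)
finally:
    gc.enable()              # always restore the collector afterwards

print(len(obj))  # 1000
```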
2011/10/28 Brian Holt :
>> [...]
> Right, but it seems to me that this is exactly what we want to test the
> hypothesis. Maybe I am being dense, as I'm a bit rushing through my mail,
> but it seems to me that if you keep a reference to a, then you compensate
> for the difference that was pointed out in the discussion below, i
On Fri, Oct 28, 2011 at 10:55:08AM +0100, Brian Holt wrote:
> [...]
2011/10/28 Gilles Louppe :
>> [...]
Here is an interesting link:
http://revista.python.org.ar/2/en/html/memory-fragmentation.html
On 28 October 2011 12:12, Gilles Louppe wrote:
> Still, almost 4 minutes just to extend the python heap and reallocate
> a bunch of already allocated objects seems unlikely. Also I don't
> understand why the Python interpreter would need to "move" allocated
> object: it can just grow the heap, reallocate a larger buffer list (if
> needed, with just
> Interesting. This hypothesis should be testable, for instance by keeping
> a reference on 'a', appending it to a list. I'd be interested in the
> results, if you don't mind trying it out, Brian.
I'm not sure I understand. I thought that by appending to a list I am
keeping a reference to the object.
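The two variants under discussion can be contrasted in a small sketch (Python 3 `pickle`; the payload is a stand-in for a forest file):

```python
import pickle

# Stand-in payload for a pickled forest.
payload = pickle.dumps([list(range(100)) for _ in range(50)])

# Rebinding: after each iteration the previous object becomes garbage,
# so its memory can be reused by the next load.
for i in range(5):
    a = pickle.loads(payload)

# Keeping references: every loaded object stays alive in the list, so
# the heap can only grow across iterations.
kept = []
for i in range(5):
    kept.append(pickle.loads(payload))

print(len(kept))  # 5 live copies
```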
2011/10/28 Gilles Louppe :
>>
>> loaded 19 0:03:55.910640
>
> In contrast, in this case, forests can no longer be garbage-collected
> and new memory needs to be allocated at each iteration, the private heap
> needs to be extended and so on. In the process, I suspect that objects
> are moved to one p
On Fri, Oct 28, 2011 at 11:47:48AM +0200, Gilles Louppe wrote:
> [...]
> import cPickle
> from datetime import datetime
>
> for i in range(0, 20):
>     with open("forest%d.pkl" % (i), 'r') as f:
>         start = datetime.now()
>         a = cPickle.load(f)
>         print 'loaded ', i, datetime.now() - start
>
> produce these run-time results
>
> loaded 0 0:00:14.952436
> loaded
On Fri, Oct 28, 2011 at 10:30:01AM +0100, Brian Holt wrote:
> cPickle with HIGHEST_PROTOCOL is significantly faster, it averages 15
> seconds to load the 10 tree forest compared to the 5 minutes without.
Good. Thus we do not need to do any modifications to the existing code,
it seems.
> What stil
cPickle with HIGHEST_PROTOCOL is significantly faster, it averages 15
seconds to load the 10 tree forest compared to the 5 minutes without.
What still confuses me is why loading the forests and storing them in
a list should be any slower than loading them individually. In other
words, why should
[Scikit-learn-general] Storing and loading decision tree classifiers
100K nodes is not much larger than my test (60K)... have you checked
the memory consumption during the load operation? I suspect that you
run out of memory and the huge overhead is due to thrashing.
2011/10/27 Brian Holt :
Firstly, thanks for all the helpful comments. I didn't know that the
protocol made such a big difference, so until now in ignorance I've
been using the default.
That said, I left a test running last night on one of our centre's
servers and it took 8hrs to load 20 forests ( each with 10 trees,
dep
I just dumped and loaded a fairly large tree (~4 nodes; from
bench_sgd_covertype.py) with cPickle, both operations performed in
less than 1 sec (w/ and w/o HIGHEST_PROTOCOL).
Brian: how large are your trees (are they complete binary trees?)
best,
Peter
2011/10/26 Peter Prettenhofer :
> [...]
brian, try to save the tree using::
cPickle.dump(tree, f, cPickle.HIGHEST_PROTOCOL)
if this doesn't solve the issue we should reconsider Gaels array
representation.
best,
peter
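The round-trip behind Peter's suggestion, sketched in today's Python 3 terms (`pickle` instead of `cPickle`; the dict is a toy stand-in for a fitted tree):

```python
import io
import pickle

# Toy stand-in for a tree; a real call would dump the fitted estimator.
obj = {"feature": [0, -1, -1], "threshold": [0.5, 0.0, 0.0]}

buf = io.BytesIO()
# HIGHEST_PROTOCOL selects the binary protocol: smaller output and much
# faster to read back than the default text protocol of cPickle's day.
pickle.dump(obj, buf, pickle.HIGHEST_PROTOCOL)
buf.seek(0)
restored = pickle.load(buf)

print(restored == obj)  # True
```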
On 26.10.2011 14:37, Andreas Mueller wrote:
> [...]
On Wed, Oct 26, 2011 at 01:35:07PM +0100, Brian Holt wrote:
> [...]
> My question is; is there a way to improve the performance of loading
> classifiers, either using different pickle options (of which I don't
> know any, but there may be)
>
>
Just to be sure, you used the latest pickling format, right?
cPickle uses the oldest one by default afaik.
--
Once a Decision Tree ( or a forest ) has been trained, I almost always
want to save the resulting classifier to disk and then load the
classifier at a later stage for testing.
My dataset is 5.2GB on disk: (690K * 2K) float32s. I can load this
into memory using `np.load('dataset.npy')` in 20 secon
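That save-then-load-later workflow, sketched with a hypothetical stand-in estimator (plain `pickle` with the binary protocol; a real run would dump e.g. a fitted `sklearn.tree.DecisionTreeClassifier` instead):

```python
import pickle

class TinyStump:
    """Hypothetical stand-in for a trained classifier."""
    def __init__(self, feature, threshold):
        self.feature = feature
        self.threshold = threshold

    def predict(self, x):
        return int(x[self.feature] > self.threshold)

clf = TinyStump(feature=0, threshold=0.5)

# Save once after training ...
with open("clf.pkl", "wb") as f:
    pickle.dump(clf, f, pickle.HIGHEST_PROTOCOL)

# ... and reload later for testing.
with open("clf.pkl", "rb") as f:
    clf2 = pickle.load(f)

print(clf2.predict([0.9]))  # 1
```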