I have a question about the use of sample masks in the implementation of
sklearn/tree/tree.py
<https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py#L323>.
It's stated in the tree documentation
<http://scikit-learn.org/stable/modules/tree.html> (under 3.8.3 Complexity):
> sample mask is used to mask data points that are inactive at a given node,
> which avoids the copying of data (important for large datasets or training
> trees within an ensemble)
However, the implementation of the sample mask uses a boolean array to index
the 2-D numpy array. My understanding is that any fancy indexing (with an
integer array or a boolean mask) returns a copy of the data rather than a
view. So I ran memory_profiler on a toy example, and here's what I get:
> python -m memory_profiler profile_memory.py
Line #    Mem usage    Increment   Line Contents
================================================
     4                             @profile
     5     13.66 MB      0.00 MB   def test_masking():
     6     14.05 MB      0.38 MB       a = np.zeros((1000, 50))
     7     14.05 MB      0.00 MB       b = a[:500, :]
     8
     9     14.06 MB      0.01 MB       fancy_indices = np.arange(a.shape[0])
    10     14.06 MB      0.00 MB       random.shuffle(fancy_indices)
    11     14.38 MB      0.32 MB       c = a[fancy_indices[:800], :]
    12
    13     14.39 MB      0.01 MB       mask = np.ones((a.shape[0],), dtype=np.bool)
    14     14.77 MB      0.38 MB       d = a[mask]
    15
    16     14.77 MB      0.00 MB       return a, b, c, d
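This shows that the data is copied when indexing with the boolean mask
(line 14): the 0.38 MB increment matches the size of the full array. The
integer fancy index on line 11 copies as well, while the plain slice on
line 7 does not.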
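
For a check that doesn't depend on the profiler's memory accounting, here is
a minimal sketch of the same toy example using np.shares_memory. It is not
the original profile_memory.py: the array shapes mirror the profiled
function, but the use of np.random.default_rng and the seed are my own
assumptions for illustration.

```python
import numpy as np

a = np.zeros((1000, 50))

# Basic slicing returns a view onto the same buffer.
b = a[:500, :]

# Integer (fancy) indexing returns a newly allocated array.
rng = np.random.default_rng(0)  # seed chosen arbitrarily
fancy_indices = rng.permutation(a.shape[0])
c = a[fancy_indices[:800], :]

# Boolean-mask indexing also returns a newly allocated array.
mask = np.ones(a.shape[0], dtype=bool)
d = a[mask]

for name, arr in [("b", b), ("c", c), ("d", d)]:
    print(name, "shares memory with a:", np.shares_memory(a, arr))
```

With this I'd expect only b to report shared memory, which is consistent
with the increments in the profiler output.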
I'm writing an ensemble learner on top of trees (specifically, bagging),
and would love to not have to copy the data for each tree being fit.
Any thoughts?
Thanks!
Ian