[Scikit-learn-general] How do sample masks prevent data copy in trees?

Ian Wong Mon, 25 Jun 2012 13:52:04 -0700

I have a question about the use of sample masks in
sklearn/tree/tree.py
<https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py#L323>.


The tree documentation <http://scikit-learn.org/stable/modules/tree.html>
states (under 3.8.3 Complexity):

> sample mask is used to mask data points that are inactive at a given node,
> which avoids the copying of data (important for large datasets or training
> trees within an ensemble)


However, the implementation of the sample mask uses a boolean array to index
the 2D NumPy array. My understanding is that any fancy indexing, whether with
a boolean mask or an integer array, copies the data. So I ran memory_profiler
on a toy example, and here's what I get:

> python -m memory_profiler profile_memory.py
Line #    Mem usage  Increment   Line Contents
==============================================
     4                           @profile
     5     13.66 MB    0.00 MB   def test_masking():
     6     14.05 MB    0.38 MB     a = np.zeros((1000, 50))
     7     14.05 MB    0.00 MB     b = a[:500, :]
     8
     9     14.06 MB    0.01 MB     fancy_indices = np.arange(a.shape[0])
    10     14.06 MB    0.00 MB     random.shuffle(fancy_indices)
    11     14.38 MB    0.32 MB     c = a[fancy_indices[:800], :]
    12
    13     14.39 MB    0.01 MB     mask = np.ones((a.shape[0],), dtype=np.bool)
    14     14.77 MB    0.38 MB     d = a[mask]
    15
    16     14.77 MB    0.00 MB     return a, b, c, d

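This shows that the data is copied when indexing with the boolean mask
(line 14), just as with the integer indices (line 11); only the basic slice
(line 7) adds no memory.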
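
The same thing can be checked without a profiler by asking NumPy whether each
result shares memory with the original array. The sketch below is not the
original profile_memory.py; it assumes a NumPy version that provides
np.shares_memory and np.random.default_rng, and the function name
check_views_and_copies is made up for illustration.

```python
import numpy as np


def check_views_and_copies(n_samples=1000, n_features=50, seed=0):
    """Report which indexing styles return a view of `a` and which copy it.

    Hypothetical helper for illustration; mirrors the toy example above.
    """
    rng = np.random.default_rng(seed)
    a = np.zeros((n_samples, n_features))

    # Basic slicing: returns a view onto the same buffer.
    b = a[: n_samples // 2, :]

    # Integer-array ("fancy") indexing: returns a new array.
    fancy_indices = rng.permutation(n_samples)[: int(0.8 * n_samples)]
    c = a[fancy_indices, :]

    # Boolean-mask indexing: also returns a new array,
    # even when every entry of the mask is True.
    mask = np.ones(n_samples, dtype=bool)
    d = a[mask]

    return {
        "slice": np.shares_memory(a, b),
        "integer_index": np.shares_memory(a, c),
        "boolean_mask": np.shares_memory(a, d),
    }


if __name__ == "__main__":
    for name, shared in check_views_and_copies().items():
        print(f"{name}: shares memory with a = {shared}")
```

Only the basic slice should report shared memory, which matches the
increments in the profiler output.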

I'm writing an ensemble learner on top of trees (specifically, bagging),
and would love to not have to copy the data for each tree that's fitting.

Any thoughts?

Thanks!
Ian
