2012/6/25 Ian Wong <[email protected]>:
> Question about the use of sample masks in the implementation of
> sklearn/tree/tree.py.
>
> It's stated in the tree documentation (under 3.8.3 Complexity):
>>
>> sample mask is used to mask data points that are inactive at a given node,
>> which avoids the copying of data (important for large datasets or training
>> trees within an ensemble)
>
>
> However, the implementation of sample mask uses a boolean array to index the
> numpy 2d array. My understanding is that any fancy indexing causes a data
> copy. So I ran memory_profiler on a toy example, and here's what I get:
>
>> python -m memory_profiler profile_memory.py
> Line #    Mem usage  Increment   Line Contents
> ==============================================
>      4                           @profile
>      5     13.66 MB    0.00 MB   def test_masking():
>      6     14.05 MB    0.38 MB     a = np.zeros((1000, 50))
>      7     14.05 MB    0.00 MB     b = a[:500, :]
>      8
>      9     14.06 MB    0.01 MB     fancy_indices = np.arange(a.shape[0])
>     10     14.06 MB    0.00 MB     random.shuffle(fancy_indices)
>     11     14.38 MB    0.32 MB     c = a[fancy_indices[:800], :]
>     12
>     13     14.39 MB    0.01 MB     mask = np.ones((a.shape[0],), dtype=np.bool)
>     14     14.77 MB    0.38 MB     d = a[mask]
>     15
>     16     14.77 MB    0.00 MB     return a, b, c, d
>
> Which shows that data is copied (line 14).
>
> I'm writing an ensemble learner on top of trees (specifically, bagging), and
> would love to not have to copy the data for each tree that's fitting.
>
> Any thoughts?
>
> Thanks!
> Ian

Hi Ian,

Correct - fancy indexing and indexing with a boolean mask both cause a
data copy. The routines for tree building (e.g. ``_find_best_split``)
use the sample mask to check whether or not a sample is in the mask.
Fancy indexing is only used if the ``sample_mask`` gets too sparse
(because then ``_find_best_split`` would have to sweep a large array
without doing any work for most of the samples).
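To make the view-vs-copy distinction concrete, here is a small numpy sketch (independent of the tree code) showing that basic slicing returns a view while fancy and boolean indexing return copies; ``np.shares_memory`` makes the check explicit:

```python
import numpy as np

a = np.zeros((1000, 50))

# Basic slicing returns a view -- no data is copied.
b = a[:500, :]
assert np.shares_memory(a, b)

# Fancy (integer-array) indexing copies the selected rows.
idx = np.arange(800)
c = a[idx, :]
assert not np.shares_memory(a, c)

# Boolean-mask indexing also copies, matching Ian's profile above.
mask = np.ones(a.shape[0], dtype=bool)
d = a[mask]
assert not np.shares_memory(a, d)
```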

You can find the relevant code sections in ``tree.Tree.build``
(tree.py:288) and ``_tree._find_best_split`` (_tree.pyx:533).

DecisionTreeClassifier|Regressor allow you to pass ``sample_mask`` and
``X_argsorted``; your ensemble learner should use both in order to
avoid data copies and redundant computation of ``X_argsorted``.
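As a sketch of the idea in pure numpy (the ``fit(X, y, sample_mask=..., X_argsorted=...)`` keyword arguments are taken from the description above and may differ in your version), a bagging ensemble would presort ``X`` once and build one boolean mask per tree:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(100, 5)

# Presort each feature column once; every tree reuses this array
# instead of recomputing the sort per tree.
X_argsorted = np.argsort(X, axis=0)

n_trees = 3
masks = []
for _ in range(n_trees):
    # Sampling WITHOUT replacement, so a boolean mask suffices
    # and no copy of X is needed.
    sample_mask = np.zeros(X.shape[0], dtype=bool)
    sample_mask[rng.permutation(X.shape[0])[:70]] = True
    masks.append(sample_mask)
    # Each tree would then be fit roughly as:
    # tree.fit(X, y, sample_mask=sample_mask, X_argsorted=X_argsorted)
```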

Note: ``sample_mask`` only works if you sample without replacement; if
you sample with replacement you have to use fancy indexing, because
currently we don't support sample weights.
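A quick illustration of why a boolean mask cannot express sampling with replacement: repeated draws collapse to a single ``True`` entry, so the mask loses the multiplicity that sample weights would otherwise carry:

```python
import numpy as np

rng = np.random.RandomState(0)
n = 10

# Bootstrap sample: WITH replacement, so indices may repeat.
indices = rng.randint(0, n, size=n)

mask = np.zeros(n, dtype=bool)
mask[indices] = True

# The mask only records membership, not how often each sample was
# drawn -- its True count equals the number of *unique* indices.
assert mask.sum() == len(np.unique(indices))
```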

best,
 Peter

>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>



-- 
Peter Prettenhofer
