2012/6/25 Ian Wong <[email protected]>:
> Question about the use of sample masks in the implementation of
> sklearn/tree/tree.py.
>
> It's stated in the tree documentation (under 3.8.3 Complexity):
>>
>> sample mask is used to mask data points that are inactive at a given node,
>> which avoids the copying of data (important for large datasets or training
>> trees within an ensemble)
>
> However, the implementation of sample mask uses a boolean array to index the
> numpy 2d array. My understanding is that any fancy indexing causes data
> copy. So I ran memory_profiler on a toy example, and here's what I get:
>
>> python -m memory_profiler profile_memory.py
> Line #    Mem usage  Increment   Line Contents
> ==============================================
>      4                           @profile
>      5     13.66 MB    0.00 MB   def test_masking():
>      6     14.05 MB    0.38 MB       a = np.zeros((1000, 50))
>      7     14.05 MB    0.00 MB       b = a[:500, :]
>      8
>      9     14.06 MB    0.01 MB       fancy_indices = np.arange(a.shape[0])
>     10     14.06 MB    0.00 MB       random.shuffle(fancy_indices)
>     11     14.38 MB    0.32 MB       c = a[fancy_indices[:800], :]
>     12
>     13     14.39 MB    0.01 MB       mask = np.ones((a.shape[0],), dtype=np.bool)
>     14     14.77 MB    0.38 MB       d = a[mask]
>     15
>     16     14.77 MB    0.00 MB       return a, b, c, d
>
> Which shows that data is copied (line 14).
>
> I'm writing an ensemble learner on top of trees (specifically, bagging), and
> would love to not have to copy the data for each tree that's fitting.
>
> Any thoughts?
>
> Thanks!
> Ian
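The copy-versus-view behaviour in the profile above can also be checked directly, without a memory profiler. The following is a minimal sketch rather than part of the original exchange: it rebuilds the same three indexing cases and asks ``np.shares_memory`` whether each result still points into ``a``. It uses ``dtype=bool`` instead of ``np.bool`` because that alias is not available in every NumPy release, and it uses ``np.random.default_rng`` in place of ``random.shuffle``; both substitutions are assumptions made for portability.

```python
import numpy as np

a = np.zeros((1000, 50))

# Basic slicing returns a view onto the same buffer.
b = a[:500, :]

# Indexing with an integer array (fancy indexing) returns a new array.
rng = np.random.default_rng(0)
fancy_indices = rng.permutation(a.shape[0])
c = a[fancy_indices[:800], :]

# Indexing with a boolean mask also returns a new array,
# even when every entry of the mask is True.
mask = np.ones(a.shape[0], dtype=bool)
d = a[mask]

results = {
    name: np.shares_memory(a, arr)
    for name, arr in [("slice", b), ("fancy", c), ("mask", d)]
}

for name, shared in results.items():
    print(f"{name:>5}: shares memory with a = {shared}")
```

Only the slice reports shared memory; the fancy-indexed and masked results are independent copies, which matches the increments on lines 11 and 14 of the profile.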
Hi Ian,

correct - fancy indexing and indexing with a boolean mask both cause a data
copy.

The routines for tree building (e.g. ``_find_best_split``) use the sample
mask only to check whether or not a sample is in the mask. Fancy indexing is
only used if the ``sample_mask`` gets too sparse (because then
``_find_best_split`` has to sweep a large array without doing any work for
most of the samples).

You can find the relevant code sections in ``tree.Tree.build`` (tree.py:288)
and ``_tree._find_best_split`` (_tree.pyx:533).

DecisionTreeClassifier|Regressor allow you to pass ``sample_mask`` and
``X_argsorted``; your ensemble learner should use both in order to avoid data
copies and redundant computation of ``X_argsorted``.

Note: ``sample_mask`` only works if you sample without replacement; if you
sample with replacement you have to use fancy indexing, because currently we
don't support sample weights.

best,
 Peter

--
Peter Prettenhofer

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
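To make the reply's two points concrete, here is a rough NumPy-only sketch; it is not the code in ``tree.py`` or ``_tree.pyx``. The helpers ``subsample_mask`` and ``sweep_feature`` are hypothetical names, and computing ``X_argsorted`` with ``np.argsort(X, axis=0)`` is an assumption about what the presorted index array holds. The sketch shows a sweep that consults the mask per sample instead of indexing ``X`` with it, and why a bootstrap sample (with replacement) cannot be expressed as a boolean mask.

```python
import numpy as np


def subsample_mask(n_samples, n_draw, rng):
    """Draw n_draw samples WITHOUT replacement, encoded as a boolean mask."""
    mask = np.zeros(n_samples, dtype=bool)
    mask[rng.choice(n_samples, size=n_draw, replace=False)] = True
    return mask


def sweep_feature(X, X_argsorted, sample_mask, feature):
    """Visit one feature in sorted order, skipping inactive samples.

    X is only read element by element; it is never indexed with the
    mask, so no copy of the data is made.
    """
    active_values = []
    for i in X_argsorted[:, feature]:
        if not sample_mask[i]:
            continue  # inactive sample: skipped, but still costs an iteration
        active_values.append(X[i, feature])
    return active_values


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# Computed once; every tree in the ensemble could reuse it (assumed layout).
X_argsorted = np.argsort(X, axis=0)

mask = subsample_mask(X.shape[0], 120, rng)
values = sweep_feature(X, X_argsorted, mask, feature=2)

# Sampling WITH replacement repeats some rows. A boolean mask can only say
# "in" or "out", so the repeats would be lost without sample weights.
bootstrap = rng.integers(0, X.shape[0], size=X.shape[0])
n_unique = np.unique(bootstrap).size
print(f"bootstrap draws: {bootstrap.size}, distinct rows: {n_unique}")
```

The ``continue`` branch also illustrates the sparsity trade-off mentioned in the reply: when few entries of the mask are set, most iterations of the sweep do no useful work, at which point copying the active rows once may be cheaper.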
