Hi Andy,
The best way to understand the min_density parameter is to think of it as
'the minimum subset population density'. The idea is that if the density of
the active subset at a node drops below this threshold, the program should
copy those points into a new array and proceed to split using the copied
subset.
As an example, assume that there are 1000 datapoints in a set. At a given
node at depth 5, assume that only 90 of those datapoints are to be used to
find the split point. This works out to a density of 0.09. If the
min_density parameter is set to 0.1, then the program will not pass in the
original dataset with the mask, but will construct a new dataset containing
only these 90 values to pass to the split routine.
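A minimal sketch of that decision in NumPy terms (the function name
`maybe_compact` is illustrative, not scikit-learn's actual internal API):

```python
import numpy as np

def maybe_compact(X, sample_mask, min_density):
    """Sketch of the min_density rule described above.

    If the fraction of active samples falls below min_density, copy the
    active rows into a fresh, smaller array (with a trivially-true mask);
    otherwise keep the original array plus the existing mask.
    """
    density = sample_mask.sum() / sample_mask.shape[0]
    if density < min_density:
        # Copy: build a compact array holding only the active rows.
        return X[sample_mask], np.ones(sample_mask.sum(), dtype=bool)
    # No copy: reuse the full array with the existing mask.
    return X, sample_mask

# The example from the text: 90 of 1000 samples active, min_density=0.1.
X = np.arange(1000.0).reshape(1000, 1)
mask = np.zeros(1000, dtype=bool)
mask[:90] = True
X_sub, mask_sub = maybe_compact(X, mask, min_density=0.1)
print(X_sub.shape[0])  # 90: density 0.09 < 0.1, so the data was copied
```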
The reasoning behind the idea is that there are two extremes with a sliding
scale between them. At one extreme, we always copy the data as we descend
the tree to create the recursive partition. This tends to be a bit slower
(because of all the copying), and will also use a lot of memory.
At the other extreme, we always retain a single dataset and simply modify
the mask to indicate which datapoints should be considered at a particular
node. This avoids the copying and saves memory, but can itself be slow,
because the Cython code that computes the split needs to check potentially
lots of irrelevant datapoints for inclusion in the split operation.
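The cost of the mask-only extreme can be sketched with a toy split routine
(pure illustration, not the actual Cython code): both versions find the
same threshold, but the masked one must visit every row of the full
dataset.

```python
def best_threshold_masked(x, mask):
    # Shared dataset plus mask: visits all len(x) rows, skipping
    # the inactive ones.
    visited = 0
    vals = []
    for i in range(len(x)):
        visited += 1
        if mask[i]:
            vals.append(x[i])
    return (min(vals) + max(vals)) / 2.0, visited

def best_threshold_copied(x_sub):
    # Copied subset: visits only the active rows.
    return (min(x_sub) + max(x_sub)) / 2.0, len(x_sub)

x = list(range(1000))
mask = [i < 90 for i in x]
t_masked, n_masked = best_threshold_masked(x, mask)
t_copied, n_copied = best_threshold_copied(x[:90])
print(n_masked, n_copied)  # 1000 vs 90: same split, very different scan cost
```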
The solution we came up with was to allow the user to specify the density
at which the data should be copied, to suit the user's own speed and memory
trade-offs.
Hope it helps
Brian
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general