Hello,

Reading up on the theory behind tree-based regression, I concluded that there are various reasons to stop exploring the tree once a given node has been reached. Among these are the following two:

1. When starting to process a node, if its size (row count) is less than X, consider it a leaf.
2. When a split for a node is considered, if either side of the split has fewer than Y rows, ignore that split when selecting the best split.

As an example, let's consider a node with 45 rows for which a given split creates two children containing 5 and 40 rows respectively.

If I set X to 50, the node is a leaf and no split is attempted.
If I set X to 10 and Y to 15, the split is evaluated, but because one of its children has fewer than 15 rows, that split is ignored.
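
To make the two rules concrete, here is a minimal Python sketch of how I understand them. It is only an illustration of the logic, not Spark's implementation: the function name and the parameter names (min_node_size for X, min_child_size for Y) are my own, and the right child's size is simply assumed to be the node size minus the left child's size.

```python
def decide(node_size: int, left_size: int,
           min_node_size: int, min_child_size: int) -> str:
    """Apply the two stopping rules to one candidate split of one node.

    min_node_size is the hypothetical X parameter,
    min_child_size is the hypothetical Y parameter.
    """
    # Rule 1 (X): a node that is too small becomes a leaf,
    # and no split is evaluated at all.
    if node_size < min_node_size:
        return "leaf"

    # Assumption: every row of the node goes to exactly one child.
    right_size = node_size - left_size

    # Rule 2 (Y): a split that produces a too-small child is discarded
    # when choosing the best split.
    if left_size < min_child_size or right_size < min_child_size:
        return "split ignored"

    return "split kept"


if __name__ == "__main__":
    # A 45-row node whose candidate split sends 5 rows to the left child.
    print(decide(45, 5, min_node_size=50, min_child_size=15))
    print(decide(45, 5, min_node_size=10, min_child_size=15))
```

The two calls correspond to the two settings described above: the first stops at rule 1, the second passes rule 1 and is rejected by rule 2.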

I'm using DecisionTreeRegressor and RandomForestRegressor on our data, and because the former is implemented using the latter, they share the same parameters. Going through those parameters, I found minInstancesPerNode, which to me is the Y value, but I could not find any parameter for the X value.
Have I missed something?
If not, would there be a way to implement this?

Regards


