To me, they are different.
Y controls whether a split is a valid candidate when deciding which
one to follow.
X makes a node a leaf when it has too few rows to even consider
candidate splits.
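To make the difference concrete, here is a minimal sketch in plain
Python (not Spark code). The function name can_split and the row
counts are made up for illustration; X and Y are the two thresholds
discussed in this thread.

```python
def can_split(node_rows, left_rows, right_rows, x, y):
    """Return True if this split is allowed under both stopping rules.

    Hypothetical helper for illustration only, not part of Spark.
    """
    # Rule X: a node with fewer than x rows becomes a leaf outright.
    if node_rows < x:
        return False
    # Rule Y: a split is valid only if both children have at least y rows.
    return left_rows >= y and right_rows >= y


x, y = 50, 5

# A 120-row node split 30/90: allowed when X and Y are separate thresholds...
with_x_and_y = can_split(120, 30, 90, x=x, y=y)
# ...but rejected if a single Y threshold of max(X, Y) stands in for both.
with_max_only = can_split(120, 30, 90, x=0, y=max(x, y))

# A 45-row node split 5/40: stopped by X, although Y alone would accept it.
small_node_with_x = can_split(45, 5, 40, x=x, y=y)
small_node_without_x = can_split(45, 5, 40, x=0, y=y)
```

In this sketch, replacing the pair (X, Y) by a single threshold
max(X, Y) changes which splits are accepted, which is why I see them
as two separate parameters.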
颜发才(Yan Facai) wrote:
It seems that splitting will always stop when the row count of a node
is less than max(X, Y).
Hence, are they really different?
On Tue, Jun 27, 2017 at 11:07 PM, OBones <obo...@free.fr> wrote:
Hello,
Reading up on the theory behind tree-based regression, I
concluded that there are various reasons to stop exploring the
tree when a given node has been reached. Among these are the
following two:
1. When starting to process a node, if its size (row count) is
less than X then consider it a leaf
2. When a split for a node is considered, if any side of the split
has its size less than Y, then ignore it when selecting the best split
As an example, let's consider a node with 45 rows that, for a
given split, creates two children containing 5 and 40 rows
respectively.
If I set X to 50, then the node is a leaf and no split is attempted.
If I set X to 10 and Y to 15, then the splits are computed, but
because one of the children has fewer than 15 rows, that split is
ignored.
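The two rules above can be sketched in a few lines of plain Python.
This is only an illustration of the logic under my own assumptions,
not how Spark implements it: the function select_split, the tuple
layout (left rows, right rows, gain) and the gain values are all
invented for the example.

```python
from typing import List, Optional, Tuple

# Hypothetical representation of a candidate split:
# (rows in left child, rows in right child, gain of the split).
Split = Tuple[int, int, float]


def select_split(node_rows: int, candidates: List[Split],
                 x: int, y: int) -> Optional[Split]:
    """Return the best valid split, or None if the node becomes a leaf."""
    # Rule 1: too few rows in the node, so no split is even considered.
    if node_rows < x:
        return None
    # Rule 2: ignore any split where one child has fewer than y rows.
    valid = [s for s in candidates if s[0] >= y and s[1] >= y]
    if not valid:
        return None
    # Among the remaining splits, keep the one with the highest gain.
    return max(valid, key=lambda s: s[2])


# A 45-row node with two invented candidate splits.
candidates = [(5, 40, 0.9), (20, 25, 0.4)]

leaf_by_x = select_split(45, candidates, x=50, y=15)
filtered_by_y = select_split(45, candidates, x=10, y=15)
unfiltered = select_split(45, candidates, x=10, y=5)
```

With X = 50 the node is a leaf regardless of the candidates. With
X = 10 and Y = 15 the lopsided split is dropped and the balanced one
is selected, even though its invented gain is lower.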
I'm using DecisionTreeRegressor and RandomForestRegressor on our
data, and because the former is implemented using the latter, they
share the same parameters.
Going through those parameters, I found minInstancesPerNode, which
to me is the Y value, but I could not find any parameter for the X
value.
Have I missed something?
If not, would there be a way to implement this?
Regards
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org