To me, they are.
Y controls whether a given split is a valid candidate when choosing the best one. X turns a node into a leaf when it has too few elements for candidate splits to be considered at all.
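
To make the distinction concrete, here is a minimal sketch of the two rules in plain Python. It is not Spark code: the function names and the (left_rows, right_rows) representation of a split are hypothetical, chosen only for illustration.

```python
def is_leaf(node_rows, min_node_size):
    """Rule X: a node with fewer than min_node_size rows is never split."""
    return node_rows < min_node_size


def is_valid_split(left_rows, right_rows, min_child_size):
    """Rule Y: keep a candidate split only if both children are large enough."""
    return left_rows >= min_child_size and right_rows >= min_child_size


def candidate_splits(node_rows, splits, min_node_size, min_child_size):
    """Return the candidate splits that survive both rules.

    splits is a list of (left_rows, right_rows) pairs (hypothetical format).
    """
    if is_leaf(node_rows, min_node_size):
        return []  # rule X: no split is even attempted
    # rule Y: drop any split that produces a child that is too small
    return [s for s in splits if is_valid_split(s[0], s[1], min_child_size)]


# A 45-row node with two hypothetical candidate splits.
splits = [(5, 40), (20, 25)]
print(candidate_splits(45, splits, min_node_size=50, min_child_size=15))
print(candidate_splits(45, splits, min_node_size=10, min_child_size=15))
```

With X = 50 the node is a leaf and nothing is evaluated; with X = 10 and Y = 15 the splits are evaluated and only the one with a 5-row child is discarded. The two rules also differ from a single max(X, Y) threshold: with X = 40 and Y = 5, a 30-row node becomes a leaf even though a (15, 15) split would pass rule Y.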

颜发才(Yan Facai) wrote:
It seems that splitting will always stop when a node's row count is less than max(X, Y).
So are they actually different?



On Tue, Jun 27, 2017 at 11:07 PM, OBones <obo...@free.fr> wrote:

    Hello,

    Reading around on the theory behind tree-based regression, I
    concluded that there are various reasons to stop exploring the
    tree when a given node has been reached. Among these are the
    following two:

    1. When starting to process a node, if its size (row count) is
    less than X then consider it a leaf
    2. When a split for a node is considered, if any side of the split
    has its size less than Y, then ignore it when selecting the best split

    As an example, let's consider a node with 45 rows that, for a
    given split, creates two children containing 5 and 40 rows
    respectively.

    If I set X to 50, then the node is a leaf and no split is attempted.
    If I set X to 10 and Y to 15, then the splits are computed, but
    because one of the children has fewer than 15 rows, that split is ignored.

    I'm using DecisionTreeRegressor and RandomForestRegressor on our
    data and because the former is implemented using the latter, they
    both share the same parameters.
    Going through those parameters, I found minInstancesPerNode which
    to me is the Y value, but I could not find any parameter for the X
    value.
    Have I missed something?
    If not, would there be a way to implement this?

    Regards



    ---------------------------------------------------------------------
    To unsubscribe e-mail: user-unsubscr...@spark.apache.org



