Re: redundant decision tree model
Hello, a small recap for whoever is interested. There was already a ticket covering this case, which I failed to find when I checked; as a result, the other one has been correctly marked as a duplicate: https://issues.apache.org/jira/browse/SPARK-3159

I have created a PR for this that you can check here if interested: https://github.com/apache/spark/pull/20632
Re: redundant decision tree model
Thanks for your feedback Sean, I agree with you.

I have logged a JIRA case (https://issues.apache.org/jira/browse/SPARK-23409); I will take a look at the code in more detail and see if I come up with a PR to handle this.
Re: redundant decision tree model
I think the simple pruning you have in mind was just never implemented.

That sort of pruning wouldn't help much if the nodes maintained a distribution over classes, as those are rarely identical; but they just maintain a single class prediction, and after training I see no value in keeping those nodes. Whatever impurity gain the split managed on the training data is 'lost' when the prediction is collapsed to a single class anyway.

Whether it's easy to implement in the code I don't know, but it's straightforward conceptually.
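Sean's distinction can be illustrated with a tiny sketch (hypothetical leaf records in Python, not Spark's API): a pruning rule keyed on full class distributions would almost never fire, because sibling distributions are rarely identical, while a rule keyed on the collapsed single-class prediction fires whenever the predictions agree.

```python
# Two sibling leaves: different class distributions, same collapsed prediction.
left  = {"dist": [0.6, 0.4], "prediction": 0.0}
right = {"dist": [0.9, 0.1], "prediction": 0.0}

# Keyed on distributions, the siblings differ, so the split survives...
can_prune_on_dist = left["dist"] == right["dist"]
# ...but keyed on the single stored prediction, the split is redundant.
can_prune_on_prediction = left["prediction"] == right["prediction"]

print(can_prune_on_dist, can_prune_on_prediction)  # False True
```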
Re: redundant decision tree model
Hello Nick,
thanks for the pointer, that's interesting.

However, there seems to be a major difference with what I was discussing.

The JIRA issue relates to overfitting and considerations on information gain, while what I propose is a much simpler "syntactic" pruning.

Consider a fragment of the example above, the leftmost subtree in particular:

  If (feature 1 <= 0.5)
   If (feature 2 <= 0.5)
    If (feature 0 <= 0.5)
     Predict: 0.0
    Else (feature 0 > 0.5)
     Predict: 0.0
   Else (feature 2 > 0.5)
    If (feature 0 <= 0.5)
     Predict: 0.0
    Else (feature 0 > 0.5)
     Predict: 0.0

Which corresponds to the following "objects":

-InternalNode(prediction = 0.0, impurity = 0.48753462603878117, split = org.apache.spark.ml.tree.ContinuousSplit@fdf0)
--InternalNode(prediction = 0.0, impurity = 0.345679012345679, split = org.apache.spark.ml.tree.ContinuousSplit@ffe0)
---InternalNode(prediction = 0.0, impurity = 0.4445, split = org.apache.spark.ml.tree.ContinuousSplit@3fe0)
   LeafNode(prediction = 0.0, impurity = -1.0)
   LeafNode(prediction = 0.0, impurity = 0.0)
---InternalNode(prediction = 0.0, impurity = 0.2777, split = org.apache.spark.ml.tree.ContinuousSplit@3fe0)
   LeafNode(prediction = 0.0, impurity = 0.0)
   LeafNode(prediction = 0.0, impurity = -1.0)

A more comprehensive node-splitting policy based on impurity might prevent this situation (by splitting node "ffe0" you have an impurity gain on one child and a loss on the other), but independently of this, once the tree is built I would cut the redundant subtree and obtain the following:

-InternalNode(prediction = 0.0, impurity = 0.48753462603878117, split = org.apache.spark.ml.tree.ContinuousSplit@fdf0)
--LeafNode(prediction = 0.0, impurity = ...)

I cannot say whether this is relevant for all the tree ensemble methods, but it surely is for RF, even more than for DT, as the lever effect will be even higher (and the code generating them is the same: DT calls RF with numTrees = 1, from what I can see).
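A minimal sketch of this "syntactic" post-pruning (in Python for brevity, with hypothetical Leaf/Internal classes that only mimic Spark's LeafNode/InternalNode — this is not the actual Spark ML API):

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    prediction: float

@dataclass
class Internal:
    prediction: float
    left: "Node"
    right: "Node"

Node = Union[Leaf, Internal]

def prune(node: Node) -> Node:
    """Bottom-up pass: collapse any split whose two (already pruned) children
    are leaves with the same prediction into a single leaf."""
    if isinstance(node, Leaf):
        return node
    left, right = prune(node.left), prune(node.right)
    if isinstance(left, Leaf) and isinstance(right, Leaf) \
            and left.prediction == right.prediction:
        return Leaf(left.prediction)  # redundant split: one leaf suffices
    return Internal(node.prediction, left, right)

# The leftmost subtree from the example: every leaf predicts 0.0,
# so the whole subtree collapses to a single leaf.
subtree = Internal(0.0,
                   Internal(0.0, Leaf(0.0), Leaf(0.0)),
                   Internal(0.0, Leaf(0.0), Leaf(0.0)))
print(prune(subtree))  # Leaf(prediction=0.0)
```

Because the collapse condition is checked on the already-pruned children, a chain of redundant splits folds up in a single bottom-up pass.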
Being an optimization aimed at saving model memory footprint and invocation time, it is independent of any consideration of the statistical amortization of overfitting, as your reply seems to imply.

Am I missing something?

Best regards,
Alessandro
Re: redundant decision tree model
There is a long outstanding JIRA issue about it: https://issues.apache.org/jira/browse/SPARK-3155.

It is probably still a useful feature to have for trees, but the priority is not that high, since it may not be that useful for the tree ensemble models.
redundant decision tree model
Hello community,
I have recently manually inspected some decision trees computed with Spark (2.2.1, but the behavior is the same with the latest code on the repo).

I have observed that the trees are always complete, even if an entire subtree leads to the same prediction in its different leaves.

In such a case, the root of the subtree, instead of being an InternalNode, could simply be a LeafNode with the (shared) prediction.

I know that decision trees computed by scikit-learn share the same trait; I understand that this arises by construction, because you notice the redundancy only at the end.

So my question is: why is this "post-pruning" missing?

Three hypotheses:

1) It is not suitable (for a reason I fail to see)
2) Such an addition to the code is considered not worth it (in terms of code complexity, maybe)
3) It has been overlooked, but could be a favorable addition

For clarity, I have managed to isolate a small case to reproduce this, in what follows.

This is the dataset:

+-----+-------------+
|label|features     |
+-----+-------------+
|1.0  |[1.0,0.0,1.0]|
|1.0  |[0.0,1.0,0.0]|
|1.0  |[1.0,1.0,0.0]|
|0.0  |[0.0,0.0,0.0]|
|1.0  |[1.0,1.0,0.0]|
|0.0  |[0.0,1.0,1.0]|
|1.0  |[0.0,0.0,0.0]|
|0.0  |[0.0,1.0,1.0]|
|1.0  |[0.0,1.0,1.0]|
|0.0  |[1.0,0.0,0.0]|
|0.0  |[1.0,0.0,1.0]|
|1.0  |[0.0,1.0,1.0]|
|0.0  |[0.0,0.0,1.0]|
|0.0  |[1.0,0.0,1.0]|
|0.0  |[0.0,0.0,1.0]|
|0.0  |[1.0,1.0,1.0]|
|0.0  |[1.0,1.0,0.0]|
|1.0  |[1.0,1.0,1.0]|
|0.0  |[1.0,0.0,1.0]|
+-----+-------------+

Which generates the following model:

DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 3 with 15 nodes
  If (feature 1 <= 0.5)
   If (feature 2 <= 0.5)
    If (feature 0 <= 0.5)
     Predict: 0.0
    Else (feature 0 > 0.5)
     Predict: 0.0
   Else (feature 2 > 0.5)
    If (feature 0 <= 0.5)
     Predict: 0.0
    Else (feature 0 > 0.5)
     Predict: 0.0
  Else (feature 1 > 0.5)
   If (feature 2 <= 0.5)
    If (feature 0 <= 0.5)
     Predict: 1.0
    Else (feature 0 > 0.5)
     Predict: 1.0
   Else (feature 2 > 0.5)
    If (feature 0 <= 0.5)
     Predict: 0.0
    Else (feature 0 > 0.5)
     Predict: 0.0

As you can see, the following model would be equivalent, but smaller and faster to evaluate:

DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 2 with 5 nodes
  If (feature 1 <= 0.5)
   Predict: 0.0
  Else (feature 1 > 0.5)
   If (feature 2 <= 0.5)
    Predict: 1.0
   Else (feature 2 > 0.5)
    Predict: 0.0

This happens pretty often in real cases, and despite the small per-invocation gain of the "optimized" version, it can become non-negligible when the number of calls is massive, as one can expect in a Big Data context.

I would appreciate your opinion on this matter (whether it is relevant for a PR or not, pros/cons, etc.).

Best regards,
Alessandro
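One way to convince oneself that the pruning is safe is to transcribe the two printed trees as plain functions and compare them exhaustively; with three binary features there are only 8 possible inputs. A sketch in Python (the function names are made up; only the thresholds and predictions come from the printouts above):

```python
from itertools import product

def full_tree(f0, f1, f2):
    # Transcription of the 15-node model: note that in every innermost
    # split, both branches return the same value.
    if f1 <= 0.5:
        if f2 <= 0.5:
            return 0.0 if f0 <= 0.5 else 0.0  # same either way
        else:
            return 0.0 if f0 <= 0.5 else 0.0  # same either way
    else:
        if f2 <= 0.5:
            return 1.0 if f0 <= 0.5 else 1.0  # same either way
        else:
            return 0.0 if f0 <= 0.5 else 0.0  # same either way

def pruned_tree(f0, f1, f2):
    # Transcription of the 5-node pruned model.
    if f1 <= 0.5:
        return 0.0
    return 1.0 if f2 <= 0.5 else 0.0

# The two functions agree on every input combination.
assert all(full_tree(*x) == pruned_tree(*x)
           for x in product([0.0, 1.0], repeat=3))
print("equivalent on all 8 inputs")
```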