Re: Garbage stats in Random Forest leaf node?

2015-03-17 Thread Xiangrui Meng
This is the default value (Double.MinValue) for invalid gain:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/model/InformationGainStats.scala#L67

Please ignore it. Maybe we should update `toString` to use scientific notation.

-Xiangrui
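
To illustrate the `toString` remark above, here is a minimal sketch of how the sentinel looks under fixed-point versus scientific formatting. The object and method names (`InvalidGainFormatting`, `fixedPoint`, `scientific`) are hypothetical and not part of MLlib; the only thing taken from the thread is that the invalid gain is `Double.MinValue`.

```scala
import java.util.Locale

object InvalidGainFormatting {
  // From the thread: invalid gain defaults to Double.MinValue.
  val invalidGain: Double = Double.MinValue

  // Fixed-point formatting, similar to the "%f" used in the original post.
  def fixedPoint(x: Double): String = "%.2f".formatLocal(Locale.ROOT, x)

  // Scientific notation keeps the sentinel short and recognizable.
  def scientific(x: Double): String = "%.4e".formatLocal(Locale.ROOT, x)

  def main(args: Array[String]): Unit = {
    // The fixed-point string is several hundred characters long.
    println(fixedPoint(invalidGain).length)
    // The scientific form fits on one line.
    println(scientific(invalidGain))
  }
}
```

With scientific notation the value is immediately recognizable as the largest-magnitude negative double rather than as a corrupted statistic.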


On Mon, Mar 16, 2015 at 5:19 PM, cjwang c...@cjwang.us wrote:
 I dumped the trees in the random forest model, and occasionally saw a leaf
 node with strange stats:

 - pred=1.00 prob=0.80 imp=-1.00 gain=-17976931348623157.00


 Here impurity = -1 and gain = a giant negative number.  Normally, I would
 get a None from Node.stats at a leaf node.  Here it printed because Some(s)
 matches:

 node.stats match {
   case Some(s) => println(" imp=%f gain=%f".format(s.impurity, s.gain))
   case None => println()
 }


 Is it a bug?

 This doesn't seem to happen in models from DecisionTree, but my data sets
 are limited.



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Garbage-stats-in-Random-Forest-leaf-node-tp22087.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Garbage stats in Random Forest leaf node?

2015-03-17 Thread Xiangrui Meng
A node gets the default (invalid) gain in two cases: either the
minInstancesPerNode constraint or the minInfoGain constraint is not
satisfied:

https://github.com/apache/spark/blob/9b746f380869b54d673e3758ca5e4475f76c864a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala#L729
https://github.com/apache/spark/blob/9b746f380869b54d673e3758ca5e4475f76c864a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala#L745
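
As a rough illustration of the two cases, the sketch below returns a sentinel whenever a candidate split violates either constraint. It is not the MLlib implementation: `SplitValidity`, `GainStats`, and `evaluate` are hypothetical names, and the exact comparisons (strict versus non-strict) are assumptions. The sentinel values (gain = `Double.MinValue`, impurity = -1) mirror what was printed in the original report.

```scala
object SplitValidity {
  // Hypothetical stand-in for the stats record; not the MLlib class.
  final case class GainStats(gain: Double, impurity: Double)

  // Sentinel mirroring the values seen in the thread.
  val Invalid: GainStats = GainStats(Double.MinValue, -1.0)

  // Returns Invalid when either constraint is violated (assumed semantics).
  def evaluate(
      gain: Double,
      impurity: Double,
      leftCount: Long,
      rightCount: Long,
      minInstancesPerNode: Int,
      minInfoGain: Double): GainStats =
    if (leftCount < minInstancesPerNode || rightCount < minInstancesPerNode) Invalid
    else if (gain < minInfoGain) Invalid
    else GainStats(gain, impurity)

  def main(args: Array[String]): Unit = {
    // One child has too few instances: invalid.
    println(evaluate(0.2, 0.4, 1L, 50L, 5, 0.0) == Invalid)
    // The gain is below the threshold: invalid.
    println(evaluate(0.001, 0.4, 20L, 30L, 5, 0.01) == Invalid)
    // Both constraints hold: the real stats are kept.
    println(evaluate(0.2, 0.4, 20L, 30L, 5, 0.01))
  }
}
```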

On Tue, Mar 17, 2015 at 12:59 PM, Chang-Jia Wang c...@cjwang.us wrote:
 Just curious: why do most of the leaf nodes return None, while just a couple
 return the default?  Why would the gain be invalid?

 C.J.

 On Mar 17, 2015, at 11:53 AM, Xiangrui Meng men...@gmail.com wrote:

 This is the default value (Double.MinValue) for invalid gain:

 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/model/InformationGainStats.scala#L67

 Please ignore it. Maybe we should update `toString` to use scientific 
 notation.

 -Xiangrui







Garbage stats in Random Forest leaf node?

2015-03-16 Thread cjwang
I dumped the trees in the random forest model, and occasionally saw a leaf
node with strange stats:

- pred=1.00 prob=0.80 imp=-1.00 gain=-17976931348623157.00


Here impurity = -1 and gain = a giant negative number.  Normally, I would
get a None from Node.stats at a leaf node.  Here it printed because Some(s)
matches:

node.stats match {
  case Some(s) => println(" imp=%f gain=%f".format(s.impurity, s.gain))
  case None => println()
}


Is it a bug?

This doesn't seem to happen in models from DecisionTree, but my data sets
are limited.
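
One way to keep such placeholder stats out of a tree dump is to guard the `Some(s)` case, sketched below. `Stats` and `Node` are hypothetical stand-ins that model only the two fields used in this message, not the MLlib classes, and treating `gain == Double.MinValue` as "no valid split" is an assumption about how the default is represented.

```scala
import java.util.Locale

object LeafStatsPrinter {
  // Hypothetical stand-ins; only the fields used in this message are modelled.
  final case class Stats(impurity: Double, gain: Double)
  final case class Node(stats: Option[Stats])

  // Describes a node's stats, treating the Double.MinValue gain as a placeholder.
  def describe(node: Node): String = node.stats match {
    case Some(s) if s.gain == Double.MinValue =>
      " (no valid split)"
    case Some(s) =>
      " imp=%.2f gain=%.2f".formatLocal(Locale.ROOT, s.impurity, s.gain)
    case None =>
      ""
  }

  def main(args: Array[String]): Unit = {
    println(describe(Node(Some(Stats(0.32, 0.05)))))
    println(describe(Node(Some(Stats(-1.0, Double.MinValue)))))
    println(describe(Node(None)))
  }
}
```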


