Re: Garbage stats in Random Forest leaf node?
This is the default value (Double.MinValue) for invalid gain:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/model/InformationGainStats.scala#L67

Please ignore it. Maybe we should update `toString` to use scientific notation.

-Xiangrui

On Mon, Mar 16, 2015 at 5:19 PM, cjwang c...@cjwang.us wrote:
> I dumped the trees in the random forest model, and occasionally saw a leaf
> node with strange stats:
>
> - pred=1.00 prob=0.80 imp=-1.00 gain=-17976931348623157.00
>
> Here impurity = -1 and gain = a giant negative number. Normally, I would
> get a None from Node.stats at a leaf node. Here it printed because Some(s)
> matches:
>
>     node.stats match {
>       case Some(s) => println(" imp=%f gain=%f".format(s.impurity, s.gain))
>       case None => println()
>     }
>
> Is it a bug?
>
> This doesn't seem to be happening in the model from DecisionTree, but my
> data sets are limited.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
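To make the reply above concrete, here is a minimal sketch of how a tree dump could recognize the invalid-gain default instead of printing it with `%f`. It does not use Spark: `Stats` is a hypothetical stand-in for MLlib's InformationGainStats, reduced to the two fields discussed in the thread, and `InvalidGainDemo`, `isInvalid`, `sci`, and `describe` are names made up for this illustration. The sentinel values (gain = Double.MinValue, impurity = -1) are the ones reported in the thread.

```scala
import java.util.Locale

object InvalidGainDemo {
  // Hypothetical stand-in for MLlib's InformationGainStats (assumption:
  // only the two fields mentioned in the thread are modeled here).
  final case class Stats(gain: Double, impurity: Double)

  // Sentinel values as reported in the thread.
  val Invalid: Stats = Stats(gain = Double.MinValue, impurity = -1.0)

  // A gain equal to Double.MinValue is treated as "no valid split".
  def isInvalid(s: Stats): Boolean = s.gain == Double.MinValue

  // Scientific notation keeps the sentinel readable, unlike %f,
  // which expands it to hundreds of digits.
  def sci(x: Double): String = "%e".formatLocal(Locale.ROOT, x)

  // Format a node's stats, calling out the sentinel explicitly.
  def describe(stats: Option[Stats]): String = stats match {
    case Some(s) if isInvalid(s) => "no valid split (gain=" + sci(s.gain) + ")"
    case Some(s) => "imp=%.2f gain=%.2f".formatLocal(Locale.ROOT, s.impurity, s.gain)
    case None => "no stats"
  }
}
```

With this sketch, `InvalidGainDemo.describe(Some(InvalidGainDemo.Invalid))` reports the node as having no valid split rather than printing a giant negative number, while ordinary stats and `None` are formatted as before.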
Re: Garbage stats in Random Forest leaf node?
There are two cases: minInstancesPerNode not satisfied or minInfoGain not satisfied:

https://github.com/apache/spark/blob/9b746f380869b54d673e3758ca5e4475f76c864a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala#L729
https://github.com/apache/spark/blob/9b746f380869b54d673e3758ca5e4475f76c864a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala#L745

On Tue, Mar 17, 2015 at 12:59 PM, Chang-Jia Wang c...@cjwang.us wrote:
> Just curious, why do most of the leaf nodes return None, but just a couple
> return the default? Why would the gain be invalid?
>
> C.J.
>
> On Mar 17, 2015, at 11:53 AM, Xiangrui Meng men...@gmail.com wrote:
>> This is the default value (Double.MinValue) for invalid gain:
>>
>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/model/InformationGainStats.scala#L67
>>
>> Please ignore it. Maybe we should update `toString` to use scientific
>> notation.
>>
>> -Xiangrui
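The two cases named in the reply above can be sketched as follows. This is a simplified illustration, not the MLlib code at the linked lines: `SplitGainSketch`, `gainForSplit`, and its parameter list are hypothetical, and the weighted-impurity gain formula is an assumption about how the gain is computed. Only the two checks (minInstancesPerNode and minInfoGain) and the Double.MinValue default come from the thread.

```scala
object SplitGainSketch {
  // Default used for an invalid gain, as described in the thread.
  val InvalidGain: Double = Double.MinValue

  // Hypothetical helper: returns the information gain of a candidate split,
  // or InvalidGain when either stopping condition rejects the split.
  def gainForSplit(
      parentImpurity: Double,
      leftImpurity: Double,
      leftCount: Long,
      rightImpurity: Double,
      rightCount: Long,
      minInstancesPerNode: Int,
      minInfoGain: Double): Double = {
    // Case 1: one of the children would hold too few instances.
    if (leftCount < minInstancesPerNode || rightCount < minInstancesPerNode) {
      InvalidGain
    } else {
      val total = (leftCount + rightCount).toDouble
      // Assumed formula: parent impurity minus the weighted child impurities.
      val gain = parentImpurity -
        (leftCount / total) * leftImpurity -
        (rightCount / total) * rightImpurity
      // Case 2: the split does not improve impurity enough.
      if (gain < minInfoGain) InvalidGain else gain
    }
  }
}
```

Under these assumptions, a node whose best candidate split fails either check carries the invalid default in its stats, which would be consistent with only a few leaf nodes showing it.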
Garbage stats in Random Forest leaf node?
I dumped the trees in the random forest model, and occasionally saw a leaf node with strange stats:

- pred=1.00 prob=0.80 imp=-1.00 gain=-17976931348623157.00

Here impurity = -1 and gain = a giant negative number. Normally, I would get a None from Node.stats at a leaf node. Here it printed because Some(s) matches:

    node.stats match {
      case Some(s) => println(" imp=%f gain=%f".format(s.impurity, s.gain))
      case None => println()
    }

Is it a bug?

This doesn't seem to be happening in the model from DecisionTree, but my data sets are limited.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Garbage-stats-in-Random-Forest-leaf-node-tp22087.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.