There are two cases in which the gain is invalid: either minInstancesPerNode or minInfoGain is not satisfied:
https://github.com/apache/spark/blob/9b746f380869b54d673e3758ca5e4475f76c864a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala#L729
https://github.com/apache/spark/blob/9b746f380869b54d673e3758ca5e4475f76c864a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala#L745

On Tue, Mar 17, 2015 at 12:59 PM, Chang-Jia Wang <c...@cjwang.us> wrote:
> Just curious, why most of the leaf nodes returns None, but just a couple
> returns default? Why would the gain invalid?
>
> C.J.
>
> On Mar 17, 2015, at 11:53 AM, Xiangrui Meng <men...@gmail.com> wrote:
>
>> This is the default value (Double.MinValue) for invalid gain:
>>
>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/model/InformationGainStats.scala#L67
>>
>> Please ignore it. Maybe we should update `toString` to use scientific
>> notation.
>>
>> -Xiangrui
>>
>>
>> On Mon, Mar 16, 2015 at 5:19 PM, cjwang <c...@cjwang.us> wrote:
>>> I dumped the trees in the random forest model, and occasionally saw a leaf
>>> node with strange stats:
>>>
>>> - pred=1.000000 prob=0.800000 imp=-1.000000
>>> gain=-179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.000000
>>>
>>>
>>> Here impurity = -1 and gain = a giant negative number. Normally, I would
>>> get a None from Node.stats at a leaf node. Here it printed because Some(s)
>>> matches:
>>>
>>> node.stats match {
>>>   case Some(s) => println(" imp=%f gain=%f" format(s.impurity, s.gain))
>>>   case None => println
>>> }
>>>
>>>
>>> Is it a bug?
>>>
>>> This doesn't seem happening in the model from DecisionTree, but my data sets
>>> are limited.
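For anyone printing leaf stats the same way, here is a minimal sketch of how the two cases above could lead to the sentinel, and how a printer might skip it. It is not the MLlib code: GainStats, LeafStats, checkSplit and describe are hypothetical names made up for illustration, and the two checks are a simplified paraphrase of the linked lines. The only details taken from this thread are the Double.MinValue gain and the -1 impurity.

```scala
import java.util.Locale

// Hypothetical stand-in for MLlib's InformationGainStats, reduced to the
// two fields discussed in this thread.
final case class GainStats(impurity: Double, gain: Double)

object LeafStats {
  // Values seen in the dumped leaf: impurity = -1 and gain = Double.MinValue.
  val Invalid: GainStats = GainStats(impurity = -1.0, gain = Double.MinValue)

  // Simplified paraphrase of the two cases (an assumption, not the real code):
  // a candidate split is rejected when either child has too few instances
  // or when its gain is below the minimum.
  def checkSplit(candidate: GainStats,
                 leftCount: Long,
                 rightCount: Long,
                 minInstancesPerNode: Int,
                 minInfoGain: Double): GainStats =
    if (leftCount < minInstancesPerNode || rightCount < minInstancesPerNode) Invalid
    else if (candidate.gain < minInfoGain) Invalid
    else candidate

  // Format optional stats, printing a short marker instead of the
  // 309-digit number that %f produces for Double.MinValue.
  def describe(stats: Option[GainStats]): String = stats match {
    case Some(s) if s.gain == Double.MinValue => " (no valid split)"
    case Some(s) => " imp=%f gain=%f".formatLocal(Locale.ROOT, s.impurity, s.gain)
    case None => ""
  }
}
```

With this, println(LeafStats.describe(stats)) could replace the match in the original snippet, given a stats value shaped like GainStats. Comparing against Double.MinValue is exact here because the sentinel is assigned, never computed.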