We are using |RandomForestRegressor| from Spark 2.1.1 to train a model.

To make sure we choose appropriate parameters, we start with a very small dataset of 6024 rows. The regressor is created with this code:

    val rf = new RandomForestRegressor()
      .setLabelCol("MyLabel")
      .setFeaturesCol("MyFeatures")
      .setImpurity("variance")
      .setMaxDepth(3)
      .setMinInstancesPerNode(1)
      .setMinInfoGain(0.0)
      .setNumTrees(2)
      .setFeatureSubsetStrategy("onethird")
      .setMaxBins(32)
      .setSubsamplingRate(1.0)

    val model = rf.fit(train)
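
(For context, |train| is simply a DataFrame with a numeric "MyLabel" column and a "MyFeatures" vector column. It is not shown above; a minimal sketch of how it could be assembled follows, with a made-up file name and the assumption that all non-label columns are numeric:)

    import org.apache.spark.ml.feature.VectorAssembler

    // Hypothetical setup, not part of the original pipeline: read a small CSV
    // and assemble every non-label column into the "MyFeatures" vector.
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("small_dataset.csv")          // made-up file name

    val train = new VectorAssembler()
      .setInputCols(raw.columns.filter(_ != "MyLabel"))
      .setOutputCol("MyFeatures")
      .transform(raw)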

Using the debugger, I can observe the |ImpurityStats| for each |rootNode| of each |DecisionTreeModel| in the |trees| array. The stat I am interested in is the first one in the |stats| array, since it holds the number of rows the node was trained on.
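
(For reference, this is roughly how I walk the forest programmatically. It is only a sketch using the public API; the |stats| array itself I can only see in the debugger, since |impurityStats| does not seem to be publicly exposed in 2.1.1:)

    // Sketch: inspect each tree's root node through the public API.
    // The per-node row count (stats(0)) mentioned above is what I read
    // in the debugger; here I only print what is publicly available.
    model.trees.zipWithIndex.foreach { case (tree, i) =>
      println(s"tree $i: nodes = ${tree.numNodes}, " +
        s"root impurity = ${tree.rootNode.impurity}")
    }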

What I find strange is that this value for each |rootNode| is not always 6024: it is sometimes more and sometimes less. From my understanding of the method, I was under the impression that each tree would be trained on exactly the same number of rows as the original training set.

Looking at the source code, I could not fully figure out where this happens, nor why it was designed this way.

Are there any resources discussing this behavior?

