Bjoern Toldbod created SPARK-18678:
--------------------------------------

             Summary: Skewed feature subsampling in Random forest
                 Key: SPARK-18678
                 URL: https://issues.apache.org/jira/browse/SPARK-18678
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.0.2
            Reporter: Bjoern Toldbod

The feature subsampling in the RandomForest implementation (org.apache.spark.ml.tree.impl.RandomForest) is performed using SamplingUtils.reservoirSampleAndCount.

The sampling implementation skews feature selection in favor of features with a higher index. The skew is smaller when the number of features is large, but it completely dominates feature selection when the number of features is small.

The extreme case is 2 features with 1 feature to select. In this case the feature sampling always picks feature 1 and ignores feature 0.

As a result, models trained with feature subsampling on a small number of features are of low quality.
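
To make the reported symptom concrete, here is a minimal Python sketch, not taken from the Spark source. It compares textbook reservoir sampling (Algorithm R) with a hypothetical off-by-one variant in which the random draw leaves out the item currently being considered. This is only one way such a skew could arise and is an assumption for illustration. The names `reservoir_sample`, `skewed_reservoir_sample`, and `selection_counts` are invented for this example.

```python
import random
from collections import Counter


def reservoir_sample(items, k, rng):
    """Textbook Algorithm R: every item is kept with probability k/n."""
    if k <= 0:
        return []
    reservoir = []
    for seen, item in enumerate(items, start=1):
        if len(reservoir) < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Draw uniformly from 0..seen-1, i.e. counting the current item.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = item
    return reservoir


def skewed_reservoir_sample(items, k, rng):
    """Hypothetical buggy variant (an assumption, not Spark's code):
    the draw range excludes the current item, so later items
    replace earlier ones too often."""
    if k <= 0:
        return []
    reservoir = []
    for seen, item in enumerate(items, start=1):
        if len(reservoir) < k:
            reservoir.append(item)
        else:
            # Draw uniformly from 0..seen-2: one value too few.
            j = rng.randrange(seen - 1)
            if j < k:
                reservoir[j] = item
    return reservoir


def selection_counts(sampler, num_features, k, trials, seed=0):
    """Count how often each feature index is selected over many trials."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        counts.update(sampler(range(num_features), k, rng))
    return counts


if __name__ == "__main__":
    # The extreme case from the report: 2 features, select 1.
    print("uniform:", dict(selection_counts(reservoir_sample, 2, 1, 1000)))
    print("skewed: ", dict(selection_counts(skewed_reservoir_sample, 2, 1, 1000)))
```

With 2 features and 1 to select, the skewed variant draws from a range of size 1 when it reaches the second item, so it always replaces feature 0 with feature 1. That matches the extreme case described above. The textbook version selects each feature roughly half of the time.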