When training a RandomForest model, the Strategy class (in mllib.tree.configuration) provides a subsamplingRate parameter. I was hoping to use this to cut down on processing time for large datasets (more than 2 million rows and 9,000 predictors), but I've found that the runtime stays approximately constant (and sometimes noticeably increases) when I lower the value of subsamplingRate.
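For concreteness, a minimal sketch of the kind of comparison described above might look like the following. It is not the actual job: the object name, the synthetic data, the row and feature counts, the tree count, and the seed are all placeholders chosen for illustration, and it assumes Spark's RDD-based mllib API (Strategy.defaultStrategy and RandomForest.trainClassifier) running in local mode.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.configuration.Strategy

import scala.util.Random

// Hypothetical driver: trains the same forest at several subsamplingRate
// values and prints the wall-clock training time for each.
object SubsamplingRateTiming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SubsamplingRateTiming")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Synthetic stand-in data (assumed sizes, far smaller than the real dataset).
    val numRows = 20000
    val numFeatures = 50
    val data = sc.parallelize(0 until numRows, 8).map { i =>
      val rng = new Random(i)
      val features = Array.fill(numFeatures)(rng.nextDouble())
      val label = if (features(0) + features(1) > 1.0) 1.0 else 0.0
      LabeledPoint(label, Vectors.dense(features))
    }.cache()
    data.count() // materialize the cache so it is not charged to the first run

    for (rate <- Seq(1.0, 0.5, 0.1)) {
      val strategy = Strategy.defaultStrategy("Classification")
      strategy.subsamplingRate = rate // the parameter in question

      val start = System.nanoTime()
      // Arguments: input, strategy, numTrees, featureSubsetStrategy, seed.
      val model = RandomForest.trainClassifier(data, strategy, 20, "auto", 42)
      val seconds = (System.nanoTime() - start) / 1e9

      println(f"subsamplingRate=$rate%.2f trees=${model.numTrees} time=$seconds%.1fs")
    }

    sc.stop()
  }
}
```

Timings from a small local run like this will not necessarily match the behavior on a cluster with the full dataset; the sketch only pins down which parameter is being varied and how the runtime is measured.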
Is this the expected behavior? (And, if so, what is the intended purpose of this parameter?) Of course, I could always just subsample the input dataset prior to running RF, but I was hoping that the subsamplingRate (which ostensibly affects the sampling used during RF bagging) would decrease the amount of data processing without requiring me to entirely ignore large subsets of the data.

Thanks,
~ Andrew