When training a RandomForest model, the Strategy class (in 
mllib.tree.configuration) provides a subsamplingRate parameter.  I was hoping 
to use this to cut down on processing time for large datasets (more than 2 
million rows and 9,000 predictors), but I've found that the runtime stays approximately 
constant (and sometimes noticeably increases) when I try lowering the value of 
subsamplingRate.
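
For reference, here is a minimal sketch of the kind of setup I mean. The 
classification task, the tree count, the seed, and the helper name 
trainWithSubsampling are placeholders for illustration, not my actual job:

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.configuration.Strategy
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.rdd.RDD

// Hypothetical helper: train a forest with a given bagging subsample rate.
def trainWithSubsampling(data: RDD[LabeledPoint], rate: Double): RandomForestModel = {
  // Assumption: a classification task with the default strategy settings.
  val strategy = Strategy.defaultStrategy("Classification")

  // The parameter in question; it defaults to 1.0.
  strategy.subsamplingRate = rate

  RandomForest.trainClassifier(
    data,
    strategy,
    numTrees = 100,                 // placeholder value
    featureSubsetStrategy = "auto",
    seed = 42                       // placeholder value
  )
}
```

Calling this with rate = 1.0 and then with, say, rate = 0.1 on the same RDD 
is the comparison where I see no speedup.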

Is this the expected behavior?  (And, if so, what is the intended purpose of 
this parameter?)

Of course, I could always just subsample the input dataset prior to running RF, 
but I was hoping that the subsamplingRate (which ostensibly affects the 
sampling used during RF bagging) would decrease the amount of data processing 
without requiring me to entirely ignore large subsets of the data.
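
The pre-subsampling workaround would look roughly like the sketch below. 
Again, the fraction, seed, tree count, and the helper name trainOnSample are 
placeholders, and it assumes a classification task:

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.configuration.Strategy
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.rdd.RDD

// Hypothetical helper: discard rows up front, then train on what is left.
def trainOnSample(data: RDD[LabeledPoint], fraction: Double): RandomForestModel = {
  // Keep roughly `fraction` of the rows, sampled without replacement.
  // Every tree then sees only this subset; the remaining rows are never used.
  val sampled = data.sample(withReplacement = false, fraction = fraction, seed = 42L)

  // subsamplingRate is left at its default here, so bagging runs as usual
  // over the reduced dataset.
  val strategy = Strategy.defaultStrategy("Classification")

  RandomForest.trainClassifier(
    sampled,
    strategy,
    numTrees = 100,                 // placeholder value
    featureSubsetStrategy = "auto",
    seed = 42                       // placeholder value
  )
}
```

This cuts the data volume, but every tree is then restricted to the same 
reduced subset, which is the trade-off I would like to avoid.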

Thanks,

~ Andrew

