Because we don't have random access to the records, sampling still needs
to go through them sequentially. A lower subsamplingRate does save some
computation, but the saving is perhaps noticeable only if you have the data
cached in memory. A different random seed is used for each tree. -Xiangrui
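
To make the point above concrete, here is a minimal sketch in plain Python
(not Spark code) of a sequential sampler. The function name bagged_sample,
the Bernoulli keep/drop rule, and the seed offsets are assumptions made for
illustration only; they are not taken from MLlib's implementation. The
sketch only shows that, without random access, every record is read no
matter how small the sampling rate is, and that each tree gets its own seed.

```python
import random


def bagged_sample(records, subsampling_rate, seed):
    """Sample records in one sequential pass (hypothetical helper).

    Returns (kept, visited): the sampled records and the number of
    records that had to be read to produce them.
    """
    rng = random.Random(seed)
    kept = []
    visited = 0
    for rec in records:  # no random access: every record is read once
        visited += 1
        if rng.random() < subsampling_rate:
            kept.append(rec)
    return kept, visited


records = range(100_000)
for tree_id in range(3):
    # Assumed seeding scheme: one distinct seed per tree.
    kept, visited = bagged_sample(records, 0.1, seed=42 + tree_id)
    print(f"tree {tree_id}: visited={visited}, kept={len(kept)}")
```

In this sketch the visited count always equals the dataset size, so only the
work done on the kept records shrinks with the rate. If reading the records
dominates the runtime (for example, when the data is not cached), lowering
the rate would not be expected to help much.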

On Wed, Jun 3, 2015 at 4:40 PM, Andrew Leverentz <andylevere...@fico.com> wrote:
> When training a RandomForest model, the Strategy class (in
> mllib.tree.configuration) provides a subsamplingRate parameter.  I was
> hoping to use this to cut down on processing time for large datasets (more
> than 2MM rows and 9K predictors), but I’ve found that the runtime stays
> approximately constant (and sometimes noticeably increases) when I try
> lowering the value of subsamplingRate.
>
>
>
> Is this the expected behavior?  (And, if so, what is the intended purpose of
> this parameter?)
>
>
>
> Of course, I could always just subsample the input dataset prior to running
> RF, but I was hoping that the subsamplingRate (which ostensibly affects the
> sampling used during RF bagging) would decrease the amount of data
> processing without requiring me to entirely ignore large subsets of the
> data.
>
>
>
> Thanks,
>
>
>
> ~ Andrew
