You just mean you want to divide the data set into N subsets, splitting by user, rather than building one model per user, right?
I suppose you could filter the source RDD N times and build a model for each resulting subset; the N training jobs can be submitted in parallel from the driver. For example, say you divide into N subsets by user ID modulo N:

  val N = ...
  (0 until N).par.map(d =>
    DecisionTree.train(data.filter(_.userID % N == d), ...))

data should be cache()-ed here, of course. However, it may be faster and more principled to take random subsets directly:

  data.randomSplit(Array.fill(N)(1.0 / N)).par.map(subset =>
    DecisionTree.train(subset, ...))

On Sun, Jan 11, 2015 at 1:53 AM, Josh Buffum <jbuf...@gmail.com> wrote:

> I've got a data set of activity by user. For each user, I'd like to train a
> decision tree model. I currently have the feature creation step implemented
> in Spark and would naturally like to use mllib's decision tree model.
> However, it looks like the decision tree model expects the whole RDD and
> will train a single tree.
>
> Can I split the RDD by user (i.e. groupByKey) and then call the
> DecisionTree.trainClassifier in a reduce() or aggregate function to create a
> RDD[DecisionTreeModels]? Maybe train the model with an in-memory dataset
> instead of an RDD? Call sc.parallelize on the Iterable values in a groupBy
> to create a mini-RDD?
>
> Has anyone else tried something like this with success?
>
> Thanks!

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
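For anyone who wants to try the modulo-split pattern without a Spark cluster handy, here is a minimal, Spark-free sketch of the same idea in plain Scala, using Futures in place of `.par` and a trivial stand-in for `DecisionTree.train`. The `Record` case class and the `train` function (which just returns the mean label as the "model") are hypothetical stand-ins, not part of MLlib; only the split-by-userID-modulo-N structure carries over.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical stand-in for one feature row keyed by user.
case class Record(userID: Int, label: Double)

// Stand-in for DecisionTree.train: the "model" here is just the mean label.
def train(subset: Seq[Record]): Double =
  subset.map(_.label).sum / subset.size

// Toy data: 100 users, label = userID % 2.
val data: Seq[Record] = (0 until 100).map(i => Record(i, (i % 2).toDouble))

val N = 4

// One "model" per subset, subsets chosen by userID modulo N,
// trained concurrently and collected back on the driver side.
val models: Seq[Double] = Await.result(
  Future.sequence(
    (0 until N).map(d => Future(train(data.filter(_.userID % N == d))))),
  1.minute)

models.foreach(println)
```

With real MLlib you would keep the `DecisionTree.train(data.filter(...), ...)` form from above; the point of the sketch is only that the N filter-and-train jobs are independent and can run in parallel.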