I have a dataset of activity by user, and I'd like to train a separate
decision tree model for each user. The feature creation step is already
implemented in Spark, so I would naturally like to use MLlib's decision
tree model. However, it looks like the decision tree trainer expects a
whole RDD and trains a single tree from it.

Can I split the RDD by user (e.g. with groupByKey) and then call
DecisionTree.trainClassifier inside a reduce() or aggregate function to
create an RDD[DecisionTreeModel]? Or could I train the model on an
in-memory dataset instead of an RDD? Or call sc.parallelize on the Iterable
values from a groupBy to create a mini-RDD per user?
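
To make the in-memory variant concrete, here is a rough sketch in plain
Python with no Spark involved. Everything in it is an assumption for
illustration: a dict of lists stands in for groupByKey, and train_stump is
a hypothetical one-split learner standing in for whatever local (non-RDD)
tree trainer would be used. It is not MLlib's DecisionTree.

```python
from collections import defaultdict


def train_stump(rows):
    """Hypothetical local learner: pick the single split with fewest errors.

    rows: iterable of (label, features), with label in {0, 1}.
    Returns (feature_index, threshold, left_label, right_label).
    """
    rows = list(rows)
    n_features = len(rows[0][1])
    best = None
    for j in range(n_features):
        for threshold in sorted({f[j] for _, f in rows}):
            left = [y for y, f in rows if f[j] <= threshold]
            right = [y for y, f in rows if f[j] > threshold]
            # Majority label on each side (ties go to 0).
            left_label = int(sum(left) * 2 > len(left))
            right_label = int(sum(right) * 2 > len(right)) if right else left_label
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, (j, threshold, left_label, right_label))
    return best[1]


def predict(model, features):
    j, threshold, left_label, right_label = model
    return left_label if features[j] <= threshold else right_label


def train_per_user(records):
    """records: iterable of (user_id, (label, features))."""
    grouped = defaultdict(list)  # plays the role of groupByKey
    for user_id, row in records:
        grouped[user_id].append(row)
    # One independent model per user, each trained on in-memory data.
    return {user_id: train_stump(rows) for user_id, rows in grouped.items()}


# Made-up toy data: alice's label depends on feature 0, bob's on feature 1.
records = [
    ("alice", (0, [1.0, 5.0])),
    ("alice", (0, [2.0, 4.0])),
    ("alice", (1, [8.0, 5.0])),
    ("alice", (1, [9.0, 4.0])),
    ("bob", (1, [1.0, 1.0])),
    ("bob", (1, [2.0, 2.0])),
    ("bob", (0, [1.5, 9.0])),
    ("bob", (0, [2.5, 8.0])),
]
models = train_per_user(records)
```

In Spark, the same per-group function would presumably be passed to
mapValues after groupByKey, which assumes each user's rows fit in memory
on a single executor. It also avoids creating an RDD inside another RDD's
transformation, which as far as I know Spark does not support.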

Has anyone else tried something like this with success?

Thanks!
