You just mean you want to divide the data set into N subsets, splitting by user, rather than building one model per user, right?
I suppose you could filter the source RDD N times and build a model for each resulting subset; the N training jobs can be submitted in parallel from the driver. For example, say you divide into N subsets by user ID modulo N:

  val N = ...
  (0 until N).par.map(d =>
    DecisionTree.train(data.filter(_.userID % N == d), ...))

data should be cache()-ed here, of course. However, it may be faster and more principled to take random subsets directly:

  data.randomSplit(Array.fill(N)(1.0 / N)).par.map(subset =>
    DecisionTree.train(subset, ...))

On Sun, Jan 11, 2015 at 1:53 AM, Josh Buffum <jbuf...@gmail.com> wrote:

> I've got a data set of activity by user. For each user, I'd like to train a
> decision tree model. I currently have the feature creation step implemented
> in Spark and would naturally like to use mllib's decision tree model.
> However, it looks like the decision tree model expects the whole RDD and
> will train a single tree.
>
> Can I split the RDD by user (i.e. groupByKey) and then call the
> DecisionTree.trainClassifier in a reduce() or aggregate function to create a
> RDD[DecisionTreeModels]? Maybe train the model with an in-memory dataset
> instead of an RDD? Call sc.parallelize on the Iterable values in a groupBy
> to create a mini-RDD?
>
> Has anyone else tried something like this with success?
>
> Thanks!

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
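For anyone who wants to try the modulo-split pattern without a Spark cluster handy, here is a minimal, Spark-free sketch of the same idea in plain Scala, using Futures in place of `.par` and a trivial stand-in for `DecisionTree.train`. The `Record` case class and the `train` function (which just returns the mean label as the "model") are hypothetical stand-ins, not part of MLlib; only the split-by-userID-modulo-N structure carries over.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical stand-in for one feature row keyed by user.
case class Record(userID: Int, label: Double)

// Stand-in for DecisionTree.train: the "model" here is just the mean label.
def train(subset: Seq[Record]): Double =
  subset.map(_.label).sum / subset.size

// Toy data: 100 users, label = userID % 2.
val data: Seq[Record] = (0 until 100).map(i => Record(i, (i % 2).toDouble))

val N = 4

// One "model" per subset, subsets chosen by userID modulo N,
// trained concurrently and collected back on the driver side.
val models: Seq[Double] = Await.result(
  Future.sequence(
    (0 until N).map(d => Future(train(data.filter(_.userID % N == d))))),
  1.minute)

models.foreach(println)
```

With real MLlib you would keep the `DecisionTree.train(data.filter(...), ...)` form from above; the point of the sketch is only that the N filter-and-train jobs are independent and can run in parallel.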