Sean,

Thanks for the response. Is there some subtle difference between one model
partitioned across N users and N models, one per user? I think I'm missing
something in your question.

Looping through the RDD, filtering one user at a time, would certainly give
me the result I'm hoping for (i.e. a map of user => decision tree).
However, wouldn't that yield poor performance? The userIDs are not
integers, so I would either need to iterate through some in-memory array of
them (which could be quite large) or keep some distributed lookup table.
Neither seems great.
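
Just so I'm picturing the filter-per-user approach right, here's a rough
sketch of what I think it would look like (the hyperparameters are
placeholders, and it assumes data is an RDD[(String, LabeledPoint)] keyed
by user ID, already cached, with the distinct IDs fitting in driver
memory):

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// One full filter pass over the data set per user, with the training
// jobs parallelized on the driver via a parallel collection.
val userIds: Array[String] = data.keys.distinct().collect()
val models = userIds.par.map { id =>
  val subset = data.filter { case (uid, _) => uid == id }.values
  val model = DecisionTree.trainClassifier(subset,
    numClasses = 2,                                 // placeholder
    categoricalFeaturesInfo = Map.empty[Int, Int],  // placeholder
    impurity = "gini", maxDepth = 5, maxBins = 32)  // placeholders
  (id, model)
}.seq.toMap

That's one scan of the whole data set per user, which is what worries me.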

I tried the randomSplit thing. I wonder if I did something wrong there,
but some of the splits got RDDs with 0 tuples and some got RDDs with more
than one tuple. I guess that's to be expected with a random distribution?
In any case, that won't work for me, since it breaks the "one tree per
user" requirement. I guess I could randomly distribute user IDs and then
do the "scan everything and filter" step...
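
If I went that route, I suppose the string user IDs could be hashed into
N buckets so each user's tuples stay together (a hypothetical sketch; N
is a placeholder, and this only buckets the data; each bucket would still
need a per-user filter inside it):

// Adapt the modulo idea to non-integer user IDs via hashCode.
// The double modulo keeps the bucket nonnegative when hashCode is
// negative.
val N = 16 // placeholder bucket count
val buckets = (0 until N).map { b =>
  data.filter { case (uid, _) => ((uid.hashCode % N) + N) % N == b }
}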

How bad of an idea is it to do:

data.groupByKey.map { case (key, samples) =>
  // build one tree per user from that user's grouped samples
  val tree = DecisionTree.train(sc.makeRDD(samples.toSeq), ...)
  (key, tree)
}

Is there a way to tell Spark not to distribute the RDD created by
sc.makeRDD, but just deal with it on whatever Spark worker is handling
that key? Does that question make sense?
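
In other words, something like this is what I'm imagining (purely
hypothetical: no trainLocally exists in mllib as far as I know; its
DecisionTree.train wants an RDD, not a local collection):

// Train each user's tree locally on whichever worker holds that group,
// without ever creating a nested RDD.
data.groupByKey.map { case (key, samples) =>
  (key, trainLocally(samples.toSeq)) // hypothetical local learner
}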

Thanks!

Josh

On Sun, Jan 11, 2015 at 4:12 AM, Sean Owen <so...@cloudera.com> wrote:

> You just mean you want to divide the data set into N subsets, and do
> that dividing by user, not make one model per user, right?
>
> I suppose you could filter the source RDD N times, and build a model
> for each resulting subset. This can be parallelized on the driver. For
> example let's say you divide into N subsets depending on the value of
> the user ID modulo N:
>
> val N = ...
> (0 until N).par.map(d =>
>   DecisionTree.train(data.filter(_.userID % N == d), ...))
>
> data should be cache()-ed here of course.
>
> However it may be faster and more principled to take random subsets
> directly:
>
> data.randomSplit(Array.fill(N)(1.0 / N)).par.map(subset =>
>   DecisionTree.train(subset, ...))
>
> On Sun, Jan 11, 2015 at 1:53 AM, Josh Buffum <jbuf...@gmail.com> wrote:
> > I've got a data set of activity by user. For each user, I'd like to
> > train a decision tree model. I currently have the feature creation
> > step implemented in Spark and would naturally like to use mllib's
> > decision tree model. However, it looks like the decision tree model
> > expects the whole RDD and will train a single tree.
> >
> > Can I split the RDD by user (i.e. groupByKey) and then call
> > DecisionTree.trainClassifier in a reduce() or aggregate function to
> > create an RDD[DecisionTreeModel]? Maybe train the model with an
> > in-memory dataset instead of an RDD? Call sc.parallelize on the
> > Iterable values in a groupBy to create a mini-RDD?
> >
> > Has anyone else tried something like this with success?
> >
> > Thanks!
>
