You are right... my code example doesn't work :) I actually do want a decision tree per user. So, for 1 million users, I want 1 million trees. We're training against time series data, so there are still quite a few data points per user. My previous message, where I mentioned RDDs with no length, was, I think, a result of the way the random partitioning worked (I was partitioning into N groups where N was the total number of users).
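For context, the shape of what I'm after is just a grouped, per-key fold. Here is a minimal plain-Scala sketch with no Spark at all; `StubModel` and `trainLocal` are made-up names, and a trivial mean "model" stands in for a real decision tree, purely to show the grouping shape:

```scala
// Plain-Scala sketch of the "one model per user" shape (no Spark here;
// a trivial mean "model" stands in for a real decision tree).
case class StubModel(n: Int, mean: Double)

def trainLocal(points: Seq[Double]): StubModel =
  StubModel(points.size, points.sum / points.size)

val data: Seq[(String, Double)] =
  Seq("alice" -> 1.0, "alice" -> 3.0, "bob" -> 2.0)

// Group by user, then train one small local model per user's points.
val modelsByUser: Map[String, StubModel] =
  data.groupBy(_._1).map { case (user, kvs) =>
    user -> trainLocal(kvs.map(_._2))
  }

println(modelsByUser("alice"))  // prints StubModel(2,2.0)
```

The open question is how to get exactly this shape inside Spark without nesting RDDs.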
Given this, I'm thinking MLlib is not designed for this particular case? It appears optimized for training across large datasets. I was just hoping to leverage it, since creating my feature sets for the users was already in Spark.

On Mon, Jan 12, 2015 at 5:05 PM, Sean Owen <so...@cloudera.com> wrote:

> A model partitioned by users?
>
> I mean that if you have a million users, surely you don't mean to build a
> million models. There would be little data per user, right? It sounds like
> you have 0 sometimes.
>
> You would typically be generalizing across users, not examining them in
> isolation. Models are built on thousands or millions of data points.
>
> I assumed you were subsetting for cross-validation, in which case we are
> talking about making more like, say, 10 models. You usually take random
> subsets, but it might be as fine to subset as a function of a user ID if
> you like. Or maybe you do have some reason for segregating users and
> modeling them differently (e.g. different geographies or something).
>
> Your code doesn't work as is, since you are using RDDs inside RDDs. But I
> am also not sure you should do what it looks like you are trying to do.
>
> On Jan 13, 2015 12:32 AM, "Josh Buffum" <jbuf...@gmail.com> wrote:
>
>> Sean,
>>
>> Thanks for the response. Is there some subtle difference between one
>> model partitioned across N users and N models, one per user? I think
>> I'm missing something with your question.
>>
>> Looping through the RDD, filtering one user at a time, would certainly
>> give me the result I'm hoping for (i.e. a map of user => decision
>> tree); however, that seems like it would yield poor performance. The
>> user IDs are not integers, so I would either need to iterate through
>> some in-memory array of them (which could be quite large) or use some
>> distributed lookup table. Neither seems great.
>>
>> I tried the random split thing.
>> I wonder if I did something wrong there, but some of the splits got
>> RDDs with 0 tuples and some got RDDs with > 1 tuple. I guess that's to
>> be expected with a random distribution? However, that won't work for
>> me, since it breaks the "one tree per user" thing. I guess I could
>> randomly distribute user IDs and then do the "scan everything and
>> filter" step...
>>
>> How bad of an idea is it to do:
>>
>> data.groupByKey.map( kvp => {
>>   val (key, data) = kvp
>>   val tree = DecisionTree.train( sc.makeRDD(data), ... )
>>   (key, tree)
>> })
>>
>> Is there a way I could tell Spark not to distribute the RDD created by
>> sc.makeRDD(data), but just to deal with it on whatever Spark worker is
>> handling kvp? Does that question make sense?
>>
>> Thanks!
>>
>> Josh
>>
>> On Sun, Jan 11, 2015 at 4:12 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> You just mean you want to divide the data set into N subsets, dividing
>>> by user, not make one model per user, right?
>>>
>>> I suppose you could filter the source RDD N times and build a model
>>> for each resulting subset. This can be parallelized on the driver. For
>>> example, let's say you divide into N subsets depending on the value of
>>> the user ID modulo N:
>>>
>>> val N = ...
>>> (0 until N).par.map(d =>
>>>   DecisionTree.train(data.filter(_.userID % N == d), ...))
>>>
>>> data should be cache()-ed here, of course.
>>>
>>> However, it may be faster and more principled to take random subsets
>>> directly:
>>>
>>> data.randomSplit(Array.fill(N)(1.0 / N)).par.map(subset =>
>>>   DecisionTree.train(subset, ...))
>>>
>>> On Sun, Jan 11, 2015 at 1:53 AM, Josh Buffum <jbuf...@gmail.com> wrote:
>>>
>>> > I've got a data set of activity by user. For each user, I'd like to
>>> > train a decision tree model. I currently have the feature creation
>>> > step implemented in Spark and would naturally like to use MLlib's
>>> > decision tree model.
>>> > However, it looks like the decision tree model expects the whole
>>> > RDD and will train a single tree.
>>> >
>>> > Can I split the RDD by user (i.e. groupByKey) and then call
>>> > DecisionTree.trainClassifier in a reduce() or aggregate function to
>>> > create an RDD[DecisionTreeModel]? Maybe train the model with an
>>> > in-memory dataset instead of an RDD? Call sc.parallelize on the
>>> > Iterable values in a groupBy to create a mini-RDD?
>>> >
>>> > Has anyone else tried something like this with success?
>>> >
>>> > Thanks!
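A footnote on the modulo-N idea quoted above: the user IDs in this thread aren't integers, but hashing them first gives the same deterministic bucketing, and unlike randomSplit it guarantees each user's data lands in exactly one subset. A minimal plain-Scala sketch (no Spark; the `bucket` helper and the toy user list are illustrative):

```scala
// Deterministic bucketing by hashed user ID: the keyed alternative to
// randomSplit. Every user lands in exactly one of N buckets, so no
// user's data is ever scattered across subsets.
val users = Seq("alice", "bob", "carol", "dave", "eve")
val N = 3

def bucket(userId: String): Int =
  Math.floorMod(userId.hashCode, N)  // hashCode may be negative, so floorMod

val buckets: Map[Int, Seq[String]] = users.groupBy(bucket)

// Sanity checks: the buckets form a partition of the users.
assert(buckets.values.flatten.toSet == users.toSet)
assert(buckets.values.map(_.size).sum == users.size)
```

In Spark terms, this `bucket` function would play the role of `_.userID % N` in the filter-per-subset approach.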