Hi Josh,

I was trying out a decision tree ensemble using bagging. Here I am splitting the input with randomSplit and training a tree for each split. Here is sample code:
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.rdd.RDD

val bags: Int = 10
// Train one tree per random, roughly equal split of `training`, the input RDD.
val models: Array[DecisionTreeModel] =
  training.randomSplit(Array.fill(bags)(1.0 / bags)).map { data =>
    // args: numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins (placeholder values)
    DecisionTree.trainClassifier(toLabeledPoints(data), 2, Map[Int, Int](), "gini", 5, 32)
  }

def toLabeledPoints(data: RDD[Double]): RDD[LabeledPoint] = {
  // convert the raw data RDD to an RDD[LabeledPoint]
}

For your case, I think you need custom logic to split the dataset.
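By the way, a bare Double can't supply both a label and features, so here is one possible sketch of that helper, assuming each raw record is really a (label, features) pair -- adjust to whatever your actual schema is:

import org.apache.spark.mllib.linalg.Vectors

// Hypothetical record shape: (label, feature values).
def toLabeledPoints(data: RDD[(Double, Array[Double])]): RDD[LabeledPoint] =
  data.map { case (label, features) =>
    LabeledPoint(label, Vectors.dense(features))
  }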
Thanks,
Sourabh

On Tue, Jan 13, 2015 at 3:55 PM, Sean Owen <so...@cloudera.com> wrote:
> OK, I still wonder whether it's not better to make one big model. The
> usual assumption is that the user's identity isn't predictive per se.
> If every customer in your shop is truly unlike the others, most
> predictive analytics goes out the window. It's factors like our
> location, income, etc. that are predictive, and there aren't a million
> of those.
>
> But let's say it's so and you really need 1M RDDs. I think I'd just
> repeatedly filter the source RDD. That really won't be the slow step.
> I think the right way to do it is to create a list of all user IDs on
> the driver, turn it into a parallel collection (and override the # of
> threads it uses on the driver to something reasonable), and map each
> one to the result of filtering and modeling that user subset.
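(A rough sketch of what Sean describes here, assuming a hypothetical shape data: RDD[(String, LabeledPoint)] keyed by user ID, and that the distinct IDs fit in driver memory:)

import org.apache.spark.SparkContext._ // PairRDD ops on pre-1.3 Spark
import org.apache.spark.mllib.tree.DecisionTree
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

// Collect the distinct user IDs to the driver as a parallel collection.
val userIds = data.keys.distinct().collect().par
// Override the driver-side thread count to something reasonable.
userIds.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8))

val models = userIds.map { id =>
  // Each filter-and-train runs as its own set of Spark jobs;
  // data should be cache()-ed so the repeated filters stay cheap.
  val subset = data.filter { case (user, _) => user == id }.values
  // Placeholder trainClassifier parameters, as above.
  (id, DecisionTree.trainClassifier(subset, 2, Map[Int, Int](), "gini", 5, 32))
}.seq.toMap

The parallel map only controls how many filter-and-train jobs are in flight at once; Spark still schedules each job across the cluster.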
> The problem is just the overhead of scheduling millions and millions
> of tiny modeling jobs. It will still probably take a long time. Could
> be fine if you still have millions of data points per user. It's even
> appropriate. But then the challenge here is that you're processing
> trillions of data points! That will be fun.
>
> I think any distributed system is overkill and not designed for the
> case where data fits into memory. You can always take a local
> collection and call parallelize to make it into an RDD, so in that
> sense Spark can handle a tiny data set if you really want.
>
> I'm still not sure I've seen a case where you want to partition by
> user, but I trust you really need that.
>
> On Tue, Jan 13, 2015 at 1:30 AM, Josh Buffum <jbuf...@gmail.com> wrote:
> > You are right... my code example doesn't work :)
> >
> > I actually do want a decision tree per user. So, for 1 million users, I
> > want 1 million trees. We're training against time series data, so there
> > are still quite a few data points per user. My previous message where I
> > mentioned RDDs with no length was, I think, a result of the way the
> > random partitioning worked (I was partitioning into N groups where N
> > was the number of users... total).
> >
> > Given this, I'm thinking that mllib is not designed for this particular
> > case? It appears optimized for training across large datasets. I was
> > just hoping to leverage it, since creating my feature sets for the
> > users was already in Spark.
> >
> > On Mon, Jan 12, 2015 at 5:05 PM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> A model partitioned by users?
> >>
> >> I mean that if you have a million users, surely you don't mean to
> >> build a million models. There would be little data per user, right?
> >> Sounds like you have 0 sometimes.
> >>
> >> You would typically be generalizing across users, not examining them
> >> in isolation. Models are built on thousands or millions of data points.
> >>
> >> I assumed you were subsetting for cross-validation, in which case we
> >> are talking about making more like, say, 10 models. You usually take
> >> random subsets. But it might be as fine to subset as a function of a
> >> user ID if you like. Or maybe you do have some reason for segregating
> >> users and modeling them differently (e.g. different geographies or
> >> something).
> >>
> >> Your code doesn't work as-is, since you are using RDDs inside RDDs.
> >> But I am also not sure you should do what it looks like you are trying
> >> to do.
> >>
> >> On Jan 13, 2015 12:32 AM, "Josh Buffum" <jbuf...@gmail.com> wrote:
> >>>
> >>> Sean,
> >>>
> >>> Thanks for the response. Is there some subtle difference between one
> >>> model partitioned by N users and N models, one per user? I think I'm
> >>> missing something in your question.
> >>>
> >>> Looping through the RDD, filtering one user at a time, would
> >>> certainly give me the response that I am hoping for (i.e. a map of
> >>> user => decision tree); however, it seems like that would yield poor
> >>> performance. The user IDs are not integers, so I either need to
> >>> iterate through some in-memory array of them (could be quite large)
> >>> or have some distributed lookup table. Neither seems great.
> >>>
> >>> I tried the random split thing. I wonder if I did something wrong
> >>> there, but some of the splits got RDDs with 0 tuples and some got
> >>> RDDs with more than 1 tuple. I guess that's to be expected with some
> >>> random distribution? However, that won't work for me, since it breaks
> >>> the "one tree per user" thing. I guess I could randomly distribute
> >>> user IDs and then do the "scan everything and filter" step...
> >>>
> >>> How bad of an idea is it to do:
> >>>
> >>> data.groupByKey.map( kvp => {
> >>>   val (key, data) = kvp
> >>>   val tree = DecisionTree.train( sc.makeRDD(data), ... )
> >>>   (key, tree)
> >>> })
> >>>
> >>> Is there a way I could tell Spark not to distribute the RDD created
> >>> by sc.makeRDD(data), but just to deal with it on whatever Spark
> >>> worker is handling kvp? Does that question make sense?
> >>>
> >>> Thanks!
> >>>
> >>> Josh
> >>>
> >>> On Sun, Jan 11, 2015 at 4:12 AM, Sean Owen <so...@cloudera.com> wrote:
> >>>>
> >>>> You just mean you want to divide the data set into N subsets, and do
> >>>> that dividing by user, not make one model per user, right?
> >>>>
> >>>> I suppose you could filter the source RDD N times and build a model
> >>>> for each resulting subset. This can be parallelized on the driver.
> >>>> For example, let's say you divide into N subsets depending on the
> >>>> value of the user ID modulo N:
> >>>>
> >>>> val N = ...
> >>>> (0 until N).par.map(d =>
> >>>>   DecisionTree.train(data.filter(_.userID % N == d), ...))
> >>>>
> >>>> data should be cache()-ed here, of course.
> >>>>
> >>>> However, it may be faster and more principled to take random subsets
> >>>> directly:
> >>>>
> >>>> data.randomSplit(Array.fill(N)(1.0 / N)).par.map(subset =>
> >>>>   DecisionTree.train(subset, ...))
> >>>>
> >>>> On Sun, Jan 11, 2015 at 1:53 AM, Josh Buffum <jbuf...@gmail.com> wrote:
> >>>> > I've got a data set of activity by user. For each user, I'd like
> >>>> > to train a decision tree model. I currently have the feature
> >>>> > creation step implemented in Spark and would naturally like to use
> >>>> > mllib's decision tree model. However, it looks like the decision
> >>>> > tree model expects the whole RDD and will train a single tree.
> >>>> >
> >>>> > Can I split the RDD by user (i.e. groupByKey) and then call
> >>>> > DecisionTree.trainClassifier in a reduce() or aggregate function
> >>>> > to create an RDD[DecisionTreeModel]? Maybe train the model with an
> >>>> > in-memory dataset instead of an RDD? Call sc.parallelize on the
> >>>> > Iterable values in a groupBy to create a mini-RDD?
> >>>> >
> >>>> > Has anyone else tried something like this with success?
> >>>> >
> >>>> > Thanks!