Hi Josh,

I was trying out a decision tree ensemble using bagging. Here I am splitting the input with randomSplit and training a tree for each split. Here is sample code:
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.rdd.RDD

val bags: Int = 10
// Train one tree per random, roughly equal split of `training`, the input RDD.
val models: Array[DecisionTreeModel] =
  training.randomSplit(Array.fill(bags)(1.0 / bags)).map { data =>
    // args: numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins (placeholder values)
    DecisionTree.trainClassifier(toLabeledPoints(data), 2, Map[Int, Int](), "gini", 5, 32)
  }

def toLabeledPoints(data: RDD[Double]): RDD[LabeledPoint] = {
  // convert the raw data RDD to an RDD[LabeledPoint]
}

For your case, I think you need custom logic to split the dataset.
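By the way, a bare Double can't supply both a label and features, so here is one possible sketch of that helper, assuming each raw record is really a (label, features) pair -- adjust to whatever your actual schema is:

import org.apache.spark.mllib.linalg.Vectors

// Hypothetical record shape: (label, feature values).
def toLabeledPoints(data: RDD[(Double, Array[Double])]): RDD[LabeledPoint] =
  data.map { case (label, features) =>
    LabeledPoint(label, Vectors.dense(features))
  }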
Thanks,
Sourabh

On Tue, Jan 13, 2015 at 3:55 PM, Sean Owen <so...@cloudera.com> wrote:
> OK, I still wonder whether it's not better to make one big model. The
> usual assumption is that the user's identity isn't predictive per se.
> If every customer in your shop is truly unlike the others, most
> predictive analytics goes out the window. It's factors like our
> location, income, etc. that are predictive, and there aren't a million
> of those.
>
> But let's say it's so and you really need 1M RDDs. I think I'd just
> repeatedly filter the source RDD. That really won't be the slow step.
> I think the right way to do it is to create a list of all user IDs on
> the driver, turn it into a parallel collection (and override the # of
> threads it uses on the driver to something reasonable), and map each
> one to the result of filtering and modeling that user subset.
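(A rough sketch of what Sean describes here, assuming a hypothetical shape data: RDD[(String, LabeledPoint)] keyed by user ID, and that the distinct IDs fit in driver memory:)

import org.apache.spark.SparkContext._ // PairRDD ops on pre-1.3 Spark
import org.apache.spark.mllib.tree.DecisionTree
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

// Collect the distinct user IDs to the driver as a parallel collection.
val userIds = data.keys.distinct().collect().par
// Override the driver-side thread count to something reasonable.
userIds.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8))

val models = userIds.map { id =>
  // Each filter-and-train runs as its own set of Spark jobs;
  // data should be cache()-ed so the repeated filters stay cheap.
  val subset = data.filter { case (user, _) => user == id }.values
  // Placeholder trainClassifier parameters, as above.
  (id, DecisionTree.trainClassifier(subset, 2, Map[Int, Int](), "gini", 5, 32))
}.seq.toMap

The parallel map only controls how many filter-and-train jobs are in flight at once; Spark still schedules each job across the cluster.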
> The problem is just the overhead of scheduling millions and millions
> of tiny modeling jobs. It will still probably take a long time. Could
> be fine if you still have millions of data points per user. It's even
> appropriate. But then the challenge here is that you're processing
> trillions of data points! That will be fun.
>
> I think any distributed system is overkill and not designed for the
> case where data fits into memory. You can always take a local
> collection and call parallelize to make it into an RDD, so in that
> sense Spark can handle a tiny data set if you really want.
>
> I'm still not sure I've seen a case where you want to partition by
> user, but I trust you really need that.
>
> On Tue, Jan 13, 2015 at 1:30 AM, Josh Buffum <jbuf...@gmail.com> wrote:
> > You are right... my code example doesn't work :)
> >
> > I actually do want a decision tree per user. So, for 1 million users, I
> > want 1 million trees. We're training against time series data, so there
> > are still quite a few data points per user. My previous message where I
> > mentioned RDDs with no length was, I think, a result of the way the
> > random partitioning worked (I was partitioning into N groups where N
> > was the number of users... total).
> >
> > Given this, I'm thinking that mllib is not designed for this particular
> > case? It appears optimized for training across large datasets. I was
> > just hoping to leverage it, since creating my feature sets for the
> > users was already in Spark.
> >
> > On Mon, Jan 12, 2015 at 5:05 PM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> A model partitioned by users?
> >>
> >> I mean that if you have a million users, surely you don't mean to
> >> build a million models. There would be little data per user, right?
> >> Sounds like you have 0 sometimes.
> >>
> >> You would typically be generalizing across users, not examining them
> >> in isolation. Models are built on thousands or millions of data points.
> >>
> >> I assumed you were subsetting for cross-validation, in which case we
> >> are talking about making more like, say, 10 models. You usually take
> >> random subsets. But it might be as fine to subset as a function of a
> >> user ID if you like. Or maybe you do have some reason for segregating
> >> users and modeling them differently (e.g. different geographies or
> >> something).
> >>
> >> Your code doesn't work as-is, since you are using RDDs inside RDDs.
> >> But I am also not sure you should do what it looks like you are trying
> >> to do.
> >>
> >> On Jan 13, 2015 12:32 AM, "Josh Buffum" <jbuf...@gmail.com> wrote:
> >>>
> >>> Sean,
> >>>
> >>> Thanks for the response. Is there some subtle difference between one
> >>> model partitioned by N users and N models, one per user? I think I'm
> >>> missing something in your question.
> >>>
> >>> Looping through the RDD, filtering one user at a time, would
> >>> certainly give me the response that I am hoping for (i.e. a map of
> >>> user => decision tree); however, it seems like that would yield poor
> >>> performance. The user IDs are not integers, so I either need to
> >>> iterate through some in-memory array of them (could be quite large)
> >>> or have some distributed lookup table. Neither seems great.
> >>>
> >>> I tried the random split thing. I wonder if I did something wrong
> >>> there, but some of the splits got RDDs with 0 tuples and some got
> >>> RDDs with more than 1 tuple. I guess that's to be expected with some
> >>> random distribution? However, that won't work for me, since it breaks
> >>> the "one tree per user" thing. I guess I could randomly distribute
> >>> user IDs and then do the "scan everything and filter" step...
> >>>
> >>> How bad of an idea is it to do:
> >>>
> >>> data.groupByKey.map( kvp => {
> >>>   val (key, data) = kvp
> >>>   val tree = DecisionTree.train( sc.makeRDD(data), ... )
> >>>   (key, tree)
> >>> })
> >>>
> >>> Is there a way I could tell Spark not to distribute the RDD created
> >>> by sc.makeRDD(data), but just to deal with it on whatever Spark
> >>> worker is handling kvp? Does that question make sense?
> >>>
> >>> Thanks!
> >>>
> >>> Josh
> >>>
> >>> On Sun, Jan 11, 2015 at 4:12 AM, Sean Owen <so...@cloudera.com> wrote:
> >>>>
> >>>> You just mean you want to divide the data set into N subsets, and do
> >>>> that dividing by user, not make one model per user, right?
> >>>>
> >>>> I suppose you could filter the source RDD N times and build a model
> >>>> for each resulting subset. This can be parallelized on the driver.
> >>>> For example, let's say you divide into N subsets depending on the
> >>>> value of the user ID modulo N:
> >>>>
> >>>> val N = ...
> >>>> (0 until N).par.map(d =>
> >>>>   DecisionTree.train(data.filter(_.userID % N == d), ...))
> >>>>
> >>>> data should be cache()-ed here, of course.
> >>>>
> >>>> However, it may be faster and more principled to take random subsets
> >>>> directly:
> >>>>
> >>>> data.randomSplit(Array.fill(N)(1.0 / N)).par.map(subset =>
> >>>>   DecisionTree.train(subset, ...))
> >>>>
> >>>> On Sun, Jan 11, 2015 at 1:53 AM, Josh Buffum <jbuf...@gmail.com> wrote:
> >>>> > I've got a data set of activity by user. For each user, I'd like
> >>>> > to train a decision tree model. I currently have the feature
> >>>> > creation step implemented in Spark and would naturally like to use
> >>>> > mllib's decision tree model. However, it looks like the decision
> >>>> > tree model expects the whole RDD and will train a single tree.
> >>>> >
> >>>> > Can I split the RDD by user (i.e. groupByKey) and then call
> >>>> > DecisionTree.trainClassifier in a reduce() or aggregate function
> >>>> > to create an RDD[DecisionTreeModel]? Maybe train the model with an
> >>>> > in-memory dataset instead of an RDD? Call sc.parallelize on the
> >>>> > Iterable values in a groupBy to create a mini-RDD?
> >>>> >
> >>>> > Has anyone else tried something like this with success?
> >>>> >
> >>>> > Thanks!