You are right... my code example doesn't work :) I actually do want a decision tree per user. So, for 1 million users, I want 1 million trees. We're training against time series data, so there are still quite a few data points per user. My previous message, where I mentioned RDDs with no length, was, I think, a result of the way the random partitioning worked (I was partitioning into N groups where N was the total number of users).
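For context, the shape of what I'm after is just a grouped, per-key fold. Here is a minimal plain-Scala sketch with no Spark at all; `StubModel` and `trainLocal` are made-up names, and a trivial mean "model" stands in for a real decision tree, purely to show the grouping shape:

```scala
// Plain-Scala sketch of the "one model per user" shape (no Spark here;
// a trivial mean "model" stands in for a real decision tree).
case class StubModel(n: Int, mean: Double)

def trainLocal(points: Seq[Double]): StubModel =
  StubModel(points.size, points.sum / points.size)

val data: Seq[(String, Double)] =
  Seq("alice" -> 1.0, "alice" -> 3.0, "bob" -> 2.0)

// Group by user, then train one small local model per user's points.
val modelsByUser: Map[String, StubModel] =
  data.groupBy(_._1).map { case (user, kvs) =>
    user -> trainLocal(kvs.map(_._2))
  }

println(modelsByUser("alice"))  // prints StubModel(2,2.0)
```

The open question is how to get exactly this shape inside Spark without nesting RDDs.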
Given this, I'm thinking MLlib is not designed for this particular case? It appears optimized for training across large datasets. I was just hoping to leverage it, since creating my feature sets for the users was already in Spark.

On Mon, Jan 12, 2015 at 5:05 PM, Sean Owen <so...@cloudera.com> wrote:

> A model partitioned by users?
>
> I mean that if you have a million users, surely you don't mean to build a
> million models. There would be little data per user, right? It sounds like
> you have 0 sometimes.
>
> You would typically be generalizing across users, not examining them in
> isolation. Models are built on thousands or millions of data points.
>
> I assumed you were subsetting for cross-validation, in which case we are
> talking about making more like, say, 10 models. You usually take random
> subsets, but it might be as fine to subset as a function of a user ID if
> you like. Or maybe you do have some reason for segregating users and
> modeling them differently (e.g. different geographies or something).
>
> Your code doesn't work as is, since you are using RDDs inside RDDs. But I
> am also not sure you should do what it looks like you are trying to do.
>
> On Jan 13, 2015 12:32 AM, "Josh Buffum" <jbuf...@gmail.com> wrote:
>
>> Sean,
>>
>> Thanks for the response. Is there some subtle difference between one
>> model partitioned across N users and N models, one per user? I think
>> I'm missing something with your question.
>>
>> Looping through the RDD, filtering one user at a time, would certainly
>> give me the result I'm hoping for (i.e. a map of user => decision
>> tree); however, that seems like it would yield poor performance. The
>> user IDs are not integers, so I would either need to iterate through
>> some in-memory array of them (which could be quite large) or use some
>> distributed lookup table. Neither seems great.
>>
>> I tried the random split thing.
>> I wonder if I did something wrong there, but some of the splits got
>> RDDs with 0 tuples and some got RDDs with > 1 tuple. I guess that's to
>> be expected with a random distribution? However, that won't work for
>> me, since it breaks the "one tree per user" thing. I guess I could
>> randomly distribute user IDs and then do the "scan everything and
>> filter" step...
>>
>> How bad of an idea is it to do:
>>
>> data.groupByKey.map( kvp => {
>>   val (key, data) = kvp
>>   val tree = DecisionTree.train( sc.makeRDD(data), ... )
>>   (key, tree)
>> })
>>
>> Is there a way I could tell Spark not to distribute the RDD created by
>> sc.makeRDD(data), but just to deal with it on whatever Spark worker is
>> handling kvp? Does that question make sense?
>>
>> Thanks!
>>
>> Josh
>>
>> On Sun, Jan 11, 2015 at 4:12 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> You just mean you want to divide the data set into N subsets, dividing
>>> by user, not make one model per user, right?
>>>
>>> I suppose you could filter the source RDD N times and build a model
>>> for each resulting subset. This can be parallelized on the driver. For
>>> example, let's say you divide into N subsets depending on the value of
>>> the user ID modulo N:
>>>
>>> val N = ...
>>> (0 until N).par.map(d =>
>>>   DecisionTree.train(data.filter(_.userID % N == d), ...))
>>>
>>> data should be cache()-ed here, of course.
>>>
>>> However, it may be faster and more principled to take random subsets
>>> directly:
>>>
>>> data.randomSplit(Array.fill(N)(1.0 / N)).par.map(subset =>
>>>   DecisionTree.train(subset, ...))
>>>
>>> On Sun, Jan 11, 2015 at 1:53 AM, Josh Buffum <jbuf...@gmail.com> wrote:
>>>
>>> > I've got a data set of activity by user. For each user, I'd like to
>>> > train a decision tree model. I currently have the feature creation
>>> > step implemented in Spark and would naturally like to use MLlib's
>>> > decision tree model.
>>> > However, it looks like the decision tree model expects the whole
>>> > RDD and will train a single tree.
>>> >
>>> > Can I split the RDD by user (i.e. groupByKey) and then call
>>> > DecisionTree.trainClassifier in a reduce() or aggregate function to
>>> > create an RDD[DecisionTreeModel]? Maybe train the model with an
>>> > in-memory dataset instead of an RDD? Call sc.parallelize on the
>>> > Iterable values in a groupBy to create a mini-RDD?
>>> >
>>> > Has anyone else tried something like this with success?
>>> >
>>> > Thanks!
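A footnote on the modulo-N idea quoted above: the user IDs in this thread aren't integers, but hashing them first gives the same deterministic bucketing, and unlike randomSplit it guarantees each user's data lands in exactly one subset. A minimal plain-Scala sketch (no Spark; the `bucket` helper and the toy user list are illustrative):

```scala
// Deterministic bucketing by hashed user ID: the keyed alternative to
// randomSplit. Every user lands in exactly one of N buckets, so no
// user's data is ever scattered across subsets.
val users = Seq("alice", "bob", "carol", "dave", "eve")
val N = 3

def bucket(userId: String): Int =
  Math.floorMod(userId.hashCode, N)  // hashCode may be negative, so floorMod

val buckets: Map[Int, Seq[String]] = users.groupBy(bucket)

// Sanity checks: the buckets form a partition of the users.
assert(buckets.values.flatten.toSet == users.toSet)
assert(buckets.values.map(_.size).sum == users.size)
```

In Spark terms, this `bucket` function would play the role of `_.userID % N` in the filter-per-subset approach.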