Re: train many decision trees with a single spark job

2015-01-13 Thread sourabh chaki
Hi Josh, I was trying out a decision tree ensemble using bagging. Here I am splitting the input using randomSplit and training a tree for each split. Here is sample code: val bags: Int = 10; val models: Array[DecisionTreeModel] = training.randomSplit(Array.fill(bags)(1.0 / bags)).map {
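The snippet above is truncated by the archive; a fuller sketch of the same bagging approach might look like the following. The training parameters (numClasses, impurity, maxDepth, maxBins) are illustrative assumptions, not taken from the original message:

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.rdd.RDD

// Bagging sketch: split the training RDD into `bags` random subsets and
// train one decision tree per subset. `training` is assumed to be an
// RDD[LabeledPoint] built by an earlier feature-creation step.
val bags: Int = 10
val models: Array[DecisionTreeModel] =
  training.randomSplit(Array.fill(bags)(1.0 / bags)).map { subset =>
    DecisionTree.trainClassifier(
      subset,
      numClasses = 2,                            // assumed binary labels
      categoricalFeaturesInfo = Map[Int, Int](), // assume all features continuous
      impurity = "gini",
      maxDepth = 5,
      maxBins = 32)
  }
```

Ensemble predictions can then be combined on the driver, e.g. by majority vote over models.map(_.predict(features)).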

Re: train many decision trees with a single spark job

2015-01-13 Thread Sean Owen
OK, I still wonder whether it's not better to make one big model. The usual assumption is that the user's identity isn't predictive per se. If every customer in your shop is truly unlike the others, most predictive analytics goes out the window. It's factors like our location, income, etc. that are

Re: train many decision trees with a single spark job

2015-01-12 Thread Josh Buffum
Sean, Thanks for the response. Is there some subtle difference between one model partitioned by N users and N models, one per user? I think I'm missing something in your question. Looping through the RDD, filtering one user at a time, would certainly give me the response that I am hoping for
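The driver-side loop described here can be sketched as follows. It is correct but launches one filter-and-train job per user, which is exactly why it cannot scale to a million users (MLlib's trainer takes a whole RDD, so it must be called from the driver, never inside another RDD's closure). `data` is an assumed RDD keyed by user id, and the tree parameters are placeholders:

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.rdd.RDD

// Naive per-user loop on the driver: one Spark job per user id.
// `data: RDD[(Long, LabeledPoint)]` is assumed to exist.
val userIds: Array[Long] = data.keys.distinct().collect()
val perUserModels: Map[Long, DecisionTreeModel] = userIds.map { uid =>
  val subset = data.filter { case (id, _) => id == uid }.values
  uid -> DecisionTree.trainClassifier(
    subset, numClasses = 2, categoricalFeaturesInfo = Map[Int, Int](),
    impurity = "gini", maxDepth = 5, maxBins = 32)
}.toMap
```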

Re: train many decision trees with a single spark job

2015-01-12 Thread Josh Buffum
You are right... my code example doesn't work :) I actually do want a decision tree per user. So, for 1 million users, I want 1 million trees. We're training against time series data, so there are still quite a few data points per user. My previous message where I mentioned RDDs with no length

Re: train many decision trees with a single spark job

2015-01-12 Thread Sean Owen
A model partitioned by users? I mean that if you have a million users, surely you don't mean to build a million models. There would be little data per user, right? Sounds like you sometimes have 0. You would typically be generalizing across users, not examining them in isolation. Models are built

Re: train many decision trees with a single spark job

2015-01-11 Thread Sean Owen
You just mean you want to divide the data set into N subsets, and do that dividing by user, not make one model per user, right? I suppose you could filter the source RDD N times, and build a model for each resulting subset. This can be parallelized on the driver. For example, let's say you divide
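Sean's suggestion can be sketched as follows, assuming users are assigned to one of n groups by a hypothetical hash `groupOf`; the n training calls are issued concurrently from the driver via a Scala parallel collection. `data` and all tree parameters are assumptions for illustration:

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.rdd.RDD

// Divide users into n groups (not one model per user), filter the source
// RDD once per group, and train one model per subset. The n training jobs
// run concurrently from the driver. `data: RDD[(Long, LabeledPoint)]` is
// assumed to be keyed by user id.
val n = 10
def groupOf(userId: Long): Int = (userId % n).toInt // hypothetical grouping

val models: Seq[DecisionTreeModel] = (0 until n).par.map { g =>
  val subset = data.filter { case (userId, _) => groupOf(userId) == g }.values
  DecisionTree.trainClassifier(
    subset, numClasses = 2, categoricalFeaturesInfo = Map[Int, Int](),
    impurity = "gini", maxDepth = 5, maxBins = 32)
}.seq
```

Caching `data` before the loop avoids recomputing the source RDD on each of the n filter passes.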

train many decision trees with a single spark job

2015-01-10 Thread Josh Buffum
I've got a data set of activity by user. For each user, I'd like to train a decision tree model. I currently have the feature-creation step implemented in Spark and would naturally like to use MLlib's decision tree model. However, it looks like the decision tree model expects the whole RDD and