Re: train many decision trees with a single Spark job

2015-01-13 Thread sourabh chaki
Hi Josh,

I was trying out a decision tree ensemble using bagging. Here I am splitting
the input using randomSplit and training a tree for each split. Here is some
sample code:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.rdd.RDD

val bags: Int = 10
val models: Array[DecisionTreeModel] =
  training.randomSplit(Array.fill(bags)(1.0 / bags)).map { data =>
    // placeholder parameters: 2 classes, no categorical features, gini, depth 5, 32 bins
    DecisionTree.trainClassifier(toLabeledPoints(data), 2, Map[Int, Int](), "gini", 5, 32)
  }

def toLabeledPoints(data: RDD[Double]): RDD[LabeledPoint] = {
  // convert the raw data RDD to an RDD of LabeledPoint
  ???
}
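
A minimal sketch of how the bagged trees could then be combined at prediction
time (majority vote over the models array above); the helper name and the
assumption of classification labels are mine, not part of the original code:

import org.apache.spark.mllib.linalg.Vector

// majority vote across the bagged trees for one feature vector
def baggedPredict(models: Array[DecisionTreeModel], features: Vector): Double = {
  val votes = models.map(_.predict(features))
  // return the class label chosen by the largest number of trees
  votes.groupBy(identity).maxBy(_._2.length)._1
}

For regression labels you would average the predictions instead of voting.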

For your case, I think you need custom logic to split the dataset (by user rather than randomly).

Thanks
Sourabh


On Tue, Jan 13, 2015 at 3:55 PM, Sean Owen so...@cloudera.com wrote:

 OK, I still wonder whether it's not better to make one big model. The
 usual assumption is that the user's identity isn't predictive per se.
 If every customer in your shop is truly unlike the others, most
 predictive analytics goes out the window. It's factors like our
 location, income, etc that are predictive and there aren't a million
 of those.

 But let's say it's so and you really need 1M RDDs. I think I'd just
 repeatedly filter the source RDD. That really won't be the slow step.
 I think the right way to do it is to create a list of all user IDs on
 the driver, turn it into a parallel collection (and override the # of
 threads it uses on the driver to something reasonable) and map each
 one to the result of filtering and modeling that user subset.

 The problem is just the overhead of scheduling millions and millions
 of tiny modeling jobs. It will still probably take a long time. Could
 be fine if you still have millions of data points per user. It's even
 appropriate. But then the challenge here is that you're processing
 trillions of data points! That will be fun.

 I think any distributed system is overkill and not designed for the
 case where data fits into memory. You can always take a local
 collection and call parallelize to make it into an RDD, so in that
 sense Spark can handle a tiny data set if you really want.

 I'm still not sure I've seen a case where you want to partition by
 user but trust you really need that.

 On Tue, Jan 13, 2015 at 1:30 AM, Josh Buffum jbuf...@gmail.com wrote:
  You are right... my code example doesn't work :)
 
  I actually do want a decision tree per user. So, for 1 million users, I want
  1 million trees. We're training against time series data, so there are still
  quite a few data points per user. My previous message where I mentioned
  RDDs with no length was, I think, a result of the way the random
  partitioning worked (I was partitioning into N groups where N was the number
  of users... total).

  Given this, I'm thinking MLlib is not designed for this particular
  case? It appears optimized for training across large datasets. I was just
  hoping to leverage it since creating my feature sets for the users was
  already in Spark.
 
 
  On Mon, Jan 12, 2015 at 5:05 PM, Sean Owen so...@cloudera.com wrote:
 
  A model partitioned by users?
 
  I mean that if you have a million users surely you don't mean to build a
  million models. There would be little data per user right? Sounds like you
  have 0 sometimes.

  You would typically be generalizing across users not examining them in
  isolation. Models are built on thousands or millions of data points.

  I assumed you were subsetting for cross validation in which case we are
  talking about making more like say 10 models. You usually take random
  subsets. But it might be as fine to subset as a function of a user ID if you
  like. Or maybe you do have some reason for segregating users and modeling
  them differently (e.g. different geographies or something).
 
  Your code doesn't work as is since you are using RDDs inside RDDs. But I
  am also not sure you should do what it looks like you are trying to do.
 
  On Jan 13, 2015 12:32 AM, Josh Buffum jbuf...@gmail.com wrote:
 
  Sean,
 
  Thanks for the response. Is there some subtle difference between one
  model partitioned by N users and N models, one per user? I think I'm
  missing something with your question.
 
  Looping through the RDD filtering one user at a time would certainly give
  me the response that I am hoping for (i.e. a map of user => decision tree),
  however, that seems like it would yield poor performance? The userIDs are
  not integers, so I either need to iterate through some in-memory array of
  them (could be quite large) or have some distributed lookup table. Neither
  seems great.
 
  I tried the random split thing. I wonder if I did something wrong there,
  but some of the splits got RDDs with 0 tuples and some got RDDs with > 1
  tuple. I guess that's to be expected with some random distribution? However,
  that won't work for me since it breaks the one tree per user thing. I guess
  I could randomly distribute user IDs and then do the scan everything and
  filter step...
 
  How bad of an idea is it to do:
 
  data.groupByKey.map( kvp => {
    val (key, data) = kvp
    val tree = DecisionTree.train( sc.makeRDD(data), ... )
    (key, tree)
  })

Re: train many decision trees with a single Spark job

2015-01-13 Thread Sean Owen
OK, I still wonder whether it's not better to make one big model. The
usual assumption is that the user's identity isn't predictive per se.
If every customer in your shop is truly unlike the others, most
predictive analytics goes out the window. It's factors like our
location, income, etc that are predictive and there aren't a million
of those.

But let's say it's so and you really need 1M RDDs. I think I'd just
repeatedly filter the source RDD. That really won't be the slow step.
I think the right way to do it is to create a list of all user IDs on
the driver, turn it into a parallel collection (and override the # of
threads it uses on the driver to something reasonable) and map each
one to the result of filtering and modeling that user subset.
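
A minimal sketch of that approach, assuming the training data is an
RDD[(String, LabeledPoint)] keyed by user ID, that 16 concurrent jobs is a
sensible cap, and Scala 2.10/2.11-style parallel collections (the
DecisionTree parameters are placeholders):

import org.apache.spark.SparkContext._
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

// data: RDD[(String, LabeledPoint)] keyed by user ID, cache()-ed so repeated filters are cheap
val userIds = data.keys.distinct().collect().par
// cap how many filter+train jobs the driver submits at once
userIds.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(16))
val modelsByUser = userIds.map { id =>
  val subset = data.filter(_._1 == id).values
  // placeholder parameters: 2 classes, no categorical features, gini, depth 5, 32 bins
  (id, DecisionTree.trainClassifier(subset, 2, Map[Int, Int](), "gini", 5, 32))
}.seq.toMap

Each element of the parallel collection kicks off its own Spark job, so the
cap on the pool size is what keeps a million tiny jobs from hitting the
scheduler at once.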

The problem is just the overhead of scheduling millions and millions
of tiny modeling jobs. It will still probably take a long time. Could
be fine if you still have millions of data points per user. It's even
appropriate. But then the challenge here is that you're processing
trillions of data points! That will be fun.

I think any distributed system is overkill and not designed for the
case where data fits into memory. You can always take a local
collection and call parallelize to make it into an RDD, so in that
sense Spark can handle a tiny data set if you really want.
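
As an illustration of that, assuming the same RDD[(String, LabeledPoint)]
shape and imports as the sketch above, and that each user's points
comfortably fit in driver memory, you could group once and train each user's
tree from a tiny parallelized RDD:

// toLocalIterator streams one partition's worth of groups to the driver at a time
val modelsByUser = data.groupByKey().toLocalIterator.map { case (user, points) =>
  val tiny = sc.parallelize(points.toSeq)  // a small per-user RDD
  (user, DecisionTree.trainClassifier(tiny, 2, Map[Int, Int](), "gini", 5, 32))
}.toMap

This trains the users one at a time, which avoids nesting RDDs but gives up
the parallelism of the filter-per-user approach.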

I'm still not sure I've seen a case where you want to partition by
user but trust you really need that.

On Tue, Jan 13, 2015 at 1:30 AM, Josh Buffum jbuf...@gmail.com wrote:
 You are right... my code example doesn't work :)

 I actually do want a decision tree per user. So, for 1 million users, I want
 1 million trees. We're training against time series data, so there are still
 quite a few data points per user. My previous message where I mentioned
 RDDs with no length was, I think, a result of the way the random
 partitioning worked (I was partitioning into N groups where N was the number
 of users... total).

 Given this, I'm thinking MLlib is not designed for this particular
 case? It appears optimized for training across large datasets. I was just
 hoping to leverage it since creating my feature sets for the users was
 already in Spark.


 On Mon, Jan 12, 2015 at 5:05 PM, Sean Owen so...@cloudera.com wrote:

 A model partitioned by users?

 I mean that if you have a million users surely you don't mean to build a
 million models. There would be little data per user right? Sounds like you
 have 0 sometimes.

 You would typically be generalizing across users not examining them in
 isolation. Models are built on thousands or millions of data points.

 I assumed you were subsetting for cross validation in which case we are
 talking about making more like say 10 models. You usually take random
 subsets. But it might be as fine to subset as a function of a user ID if you
 like. Or maybe you do have some reason for segregating users and modeling
 them differently (e.g. different geographies or something).

 Your code doesn't work as is since you are using RDDs inside RDDs. But I
 am also not sure you should do what it looks like you are trying to do.

 On Jan 13, 2015 12:32 AM, Josh Buffum jbuf...@gmail.com wrote:

 Sean,

 Thanks for the response. Is there some subtle difference between one
 model partitioned by N users and N models, one per user? I think I'm
 missing something with your question.

 Looping through the RDD filtering one user at a time would certainly give
 me the response that I am hoping for (i.e. a map of user => decision tree),
 however, that seems like it would yield poor performance? The userIDs are
 not integers, so I either need to iterate through some in-memory array of
 them (could be quite large) or have some distributed lookup table. Neither
 seems great.

 I tried the random split thing. I wonder if I did something wrong there,
 but some of the splits got RDDs with 0 tuples and some got RDDs with > 1
 tuple. I guess that's to be expected with some random distribution? However,
 that won't work for me since it breaks the one tree per user thing. I
 guess I could randomly distribute user IDs and then do the scan everything
 and filter step...

 How bad of an idea is it to do:

 data.groupByKey.map( kvp => {
   val (key, data) = kvp
   val tree = DecisionTree.train( sc.makeRDD(data), ... )
   (key, tree)
 })

 Is there a way I could tell Spark not to distribute the RDD created by
 sc.makeRDD(data) but just to deal with it on whatever Spark worker is
 handling kvp? Does that question make sense?

 Thanks!

 Josh

 On Sun, Jan 11, 2015 at 4:12 AM, Sean Owen so...@cloudera.com wrote:

 You just mean you want to divide the data set into N subsets, and do
 that dividing by user, not make one model per user right?

 I suppose you could filter the source RDD N times, and build a model
 for each resulting subset. This can be parallelized on the driver. For
 example let's say you divide into N subsets depending on the value of
 the user ID modulo N:

 val N = ...
 (0 until N).par.map(d => DecisionTree.train(data.filter(_.userID % N == d), ...))

Re: train many decision trees with a single Spark job

2015-01-12 Thread Josh Buffum
Sean,

Thanks for the response. Is there some subtle difference between one model
partitioned by N users and N models, one per user? I think I'm missing
something with your question.

Looping through the RDD filtering one user at a time would certainly give
me the response that I am hoping for (i.e. a map of user => decision tree),
however, that seems like it would yield poor performance? The userIDs are
not integers, so I either need to iterate through some in-memory array of
them (could be quite large) or have some distributed lookup table. Neither
seems great.

I tried the random split thing. I wonder if I did something wrong there,
but some of the splits got RDDs with 0 tuples and some got RDDs with > 1
tuple. I guess that's to be expected with some random distribution?
However, that won't work for me since it breaks the one tree per user
thing. I guess I could randomly distribute user IDs and then do the scan
everything and filter step...

How bad of an idea is it to do:

data.groupByKey.map( kvp => {
  val (key, data) = kvp
  val tree = DecisionTree.train( sc.makeRDD(data), ... )
  (key, tree)
})

Is there a way I could tell Spark not to distribute the RDD created by
sc.makeRDD(data) but just to deal with it on whatever Spark worker is
handling kvp? Does that question make sense?

Thanks!

Josh

On Sun, Jan 11, 2015 at 4:12 AM, Sean Owen so...@cloudera.com wrote:

 You just mean you want to divide the data set into N subsets, and do
 that dividing by user, not make one model per user right?

 I suppose you could filter the source RDD N times, and build a model
 for each resulting subset. This can be parallelized on the driver. For
 example let's say you divide into N subsets depending on the value of
 the user ID modulo N:

 val N = ...
 (0 until N).par.map(d => DecisionTree.train(data.filter(_.userID % N
 == d), ...))

 data should be cache()-ed here of course.

 However it may be faster and more principled to take random subsets
 directly:

 data.randomSplit(Array.fill(N)(1.0 / N)).par.map(subset =>
 DecisionTree.train(subset, ...))

 On Sun, Jan 11, 2015 at 1:53 AM, Josh Buffum jbuf...@gmail.com wrote:
  I've got a data set of activity by user. For each user, I'd like to train a
  decision tree model. I currently have the feature creation step implemented
  in Spark and would naturally like to use MLlib's decision tree model.
  However, it looks like the decision tree model expects the whole RDD and
  will train a single tree.

  Can I split the RDD by user (i.e. groupByKey) and then call the
  DecisionTree.trainClassifier in a reduce() or aggregate function to create an
  RDD[DecisionTreeModel]? Maybe train the model with an in-memory dataset
  instead of an RDD? Call sc.parallelize on the Iterable values in a groupBy
  to create a mini-RDD?
 
  Has anyone else tried something like this with success?
 
  Thanks!



Re: train many decision trees with a single Spark job

2015-01-12 Thread Josh Buffum
You are right... my code example doesn't work :)

I actually do want a decision tree per user. So, for 1 million users, I
want 1 million trees. We're training against time series data, so there are
still quite a few data points per user. My previous message where I
mentioned RDDs with no length was, I think, a result of the way the random
partitioning worked (I was partitioning into N groups where N was the
number of users... total).

Given this, I'm thinking MLlib is not designed for this particular
case? It appears optimized for training across large datasets. I was just
hoping to leverage it since creating my feature sets for the users was
already in Spark.


On Mon, Jan 12, 2015 at 5:05 PM, Sean Owen so...@cloudera.com wrote:

 A model partitioned by users?

 I mean that if you have a million users surely you don't mean to build a
 million models. There would be little data per user right? Sounds like you
 have 0 sometimes.

 You would typically be generalizing across users not examining them in
 isolation. Models are built on thousands or millions of data points.

 I assumed you were subsetting for cross validation in which case we are
 talking about making more like say 10 models. You usually take random
 subsets. But it might be as fine to subset as a function of a user ID if
 you like. Or maybe you do have some reason for segregating users and
 modeling them differently (e.g. different geographies or something).

 Your code doesn't work as is since you are using RDDs inside RDDs. But I
 am also not sure you should do what it looks like you are trying to do.
 On Jan 13, 2015 12:32 AM, Josh Buffum jbuf...@gmail.com wrote:

 Sean,

 Thanks for the response. Is there some subtle difference between one
 model partitioned by N users and N models, one per user? I think I'm
 missing something with your question.

 Looping through the RDD filtering one user at a time would certainly give
 me the response that I am hoping for (i.e. a map of user => decision tree),
 however, that seems like it would yield poor performance? The userIDs are
 not integers, so I either need to iterate through some in-memory array of
 them (could be quite large) or have some distributed lookup table. Neither
 seems great.

 I tried the random split thing. I wonder if I did something wrong there,
 but some of the splits got RDDs with 0 tuples and some got RDDs with > 1
 tuple. I guess that's to be expected with some random distribution?
 However, that won't work for me since it breaks the one tree per user
 thing. I guess I could randomly distribute user IDs and then do the scan
 everything and filter step...

 How bad of an idea is it to do:

 data.groupByKey.map( kvp => {
   val (key, data) = kvp
   val tree = DecisionTree.train( sc.makeRDD(data), ... )
   (key, tree)
 })

 Is there a way I could tell Spark not to distribute the RDD created by
 sc.makeRDD(data) but just to deal with it on whatever Spark worker is
 handling kvp? Does that question make sense?

 Thanks!

 Josh

 On Sun, Jan 11, 2015 at 4:12 AM, Sean Owen so...@cloudera.com wrote:

 You just mean you want to divide the data set into N subsets, and do
 that dividing by user, not make one model per user right?

 I suppose you could filter the source RDD N times, and build a model
 for each resulting subset. This can be parallelized on the driver. For
 example let's say you divide into N subsets depending on the value of
 the user ID modulo N:

 val N = ...
 (0 until N).par.map(d => DecisionTree.train(data.filter(_.userID % N
 == d), ...))

 data should be cache()-ed here of course.

 However it may be faster and more principled to take random subsets
 directly:

 data.randomSplit(Array.fill(N)(1.0 / N)).par.map(subset =>
 DecisionTree.train(subset, ...))

 On Sun, Jan 11, 2015 at 1:53 AM, Josh Buffum jbuf...@gmail.com wrote:
   I've got a data set of activity by user. For each user, I'd like to train a
   decision tree model. I currently have the feature creation step implemented
   in Spark and would naturally like to use MLlib's decision tree model.
   However, it looks like the decision tree model expects the whole RDD and
   will train a single tree.

   Can I split the RDD by user (i.e. groupByKey) and then call the
   DecisionTree.trainClassifier in a reduce() or aggregate function to create an
   RDD[DecisionTreeModel]? Maybe train the model with an in-memory dataset
   instead of an RDD? Call sc.parallelize on the Iterable values in a groupBy
   to create a mini-RDD?
 
  Has anyone else tried something like this with success?
 
  Thanks!





Re: train many decision trees with a single Spark job

2015-01-12 Thread Sean Owen
A model partitioned by users?

I mean that if you have a million users surely you don't mean to build a
million models. There would be little data per user right? Sounds like you
have 0 sometimes.

You would typically be generalizing across users not examining them in
isolation. Models are built on thousands or millions of data points.

I assumed you were subsetting for cross validation in which case we are
talking about making more like say 10 models. You usually take random
subsets. But it might be as fine to subset as a function of a user ID if
you like. Or maybe you do have some reason for segregating users and
modeling them differently (e.g. different geographies or something).

Your code doesn't work as is since you are using RDDs inside RDDs. But I am
also not sure you should do what it looks like you are trying to do.
On Jan 13, 2015 12:32 AM, Josh Buffum jbuf...@gmail.com wrote:

 Sean,

 Thanks for the response. Is there some subtle difference between one model
 partitioned by N users and N models, one per user? I think I'm missing
 something with your question.

 Looping through the RDD filtering one user at a time would certainly give
 me the response that I am hoping for (i.e. a map of user => decision tree),
 however, that seems like it would yield poor performance? The userIDs are
 not integers, so I either need to iterate through some in-memory array of
 them (could be quite large) or have some distributed lookup table. Neither
 seems great.

 I tried the random split thing. I wonder if I did something wrong there,
 but some of the splits got RDDs with 0 tuples and some got RDDs with > 1
 tuple. I guess that's to be expected with some random distribution?
 However, that won't work for me since it breaks the one tree per user
 thing. I guess I could randomly distribute user IDs and then do the scan
 everything and filter step...

 How bad of an idea is it to do:

 data.groupByKey.map( kvp => {
   val (key, data) = kvp
   val tree = DecisionTree.train( sc.makeRDD(data), ... )
   (key, tree)
 })

 Is there a way I could tell Spark not to distribute the RDD created by
 sc.makeRDD(data) but just to deal with it on whatever Spark worker is
 handling kvp? Does that question make sense?

 Thanks!

 Josh

 On Sun, Jan 11, 2015 at 4:12 AM, Sean Owen so...@cloudera.com wrote:

 You just mean you want to divide the data set into N subsets, and do
 that dividing by user, not make one model per user right?

 I suppose you could filter the source RDD N times, and build a model
 for each resulting subset. This can be parallelized on the driver. For
 example let's say you divide into N subsets depending on the value of
 the user ID modulo N:

 val N = ...
 (0 until N).par.map(d => DecisionTree.train(data.filter(_.userID % N
 == d), ...))

 data should be cache()-ed here of course.

 However it may be faster and more principled to take random subsets
 directly:

 data.randomSplit(Array.fill(N)(1.0 / N)).par.map(subset =>
 DecisionTree.train(subset, ...))

 On Sun, Jan 11, 2015 at 1:53 AM, Josh Buffum jbuf...@gmail.com wrote:
   I've got a data set of activity by user. For each user, I'd like to train a
   decision tree model. I currently have the feature creation step implemented
   in Spark and would naturally like to use MLlib's decision tree model.
   However, it looks like the decision tree model expects the whole RDD and
   will train a single tree.

   Can I split the RDD by user (i.e. groupByKey) and then call the
   DecisionTree.trainClassifier in a reduce() or aggregate function to create an
   RDD[DecisionTreeModel]? Maybe train the model with an in-memory dataset
   instead of an RDD? Call sc.parallelize on the Iterable values in a groupBy
   to create a mini-RDD?
 
  Has anyone else tried something like this with success?
 
  Thanks!





Re: train many decision trees with a single Spark job

2015-01-11 Thread Sean Owen
You just mean you want to divide the data set into N subsets, and do
that dividing by user, not make one model per user right?

I suppose you could filter the source RDD N times, and build a model
for each resulting subset. This can be parallelized on the driver. For
example let's say you divide into N subsets depending on the value of
the user ID modulo N:

val N = ...
(0 until N).par.map(d => DecisionTree.train(data.filter(_.userID % N
== d), ...))

data should be cache()-ed here of course.

However it may be faster and more principled to take random subsets directly:

data.randomSplit(Array.fill(N)(1.0 / N)).par.map(subset =>
DecisionTree.train(subset, ...))
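
A quick way to sanity-check the resulting models is to score each one on its
own subset; a sketch, keeping each model next to its training error (N as
above, data an RDD[LabeledPoint], trainClassifier parameters are placeholders):

val results = data.randomSplit(Array.fill(N)(1.0 / N)).par.map { subset =>
  val model = DecisionTree.trainClassifier(subset, 2, Map[Int, Int](), "gini", 5, 32)
  // fraction of this subset's points that the tree mislabels
  val err = subset.map(p => (p.label, model.predict(p.features)))
    .filter { case (label, pred) => label != pred }.count().toDouble / subset.count()
  (model, err)
}.seq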

On Sun, Jan 11, 2015 at 1:53 AM, Josh Buffum jbuf...@gmail.com wrote:
 I've got a data set of activity by user. For each user, I'd like to train a
 decision tree model. I currently have the feature creation step implemented
 in Spark and would naturally like to use MLlib's decision tree model.
 However, it looks like the decision tree model expects the whole RDD and
 will train a single tree.

 Can I split the RDD by user (i.e. groupByKey) and then call the
 DecisionTree.trainClassifier in a reduce() or aggregate function to create an
 RDD[DecisionTreeModel]? Maybe train the model with an in-memory dataset
 instead of an RDD? Call sc.parallelize on the Iterable values in a groupBy
 to create a mini-RDD?

 Has anyone else tried something like this with success?

 Thanks!

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



train many decision trees with a single Spark job

2015-01-10 Thread Josh Buffum
I've got a data set of activity by user. For each user, I'd like to train a
decision tree model. I currently have the feature creation step implemented
in Spark and would naturally like to use MLlib's decision tree model.
However, it looks like the decision tree model expects the whole RDD and
will train a single tree.

Can I split the RDD by user (i.e. groupByKey) and then call the
DecisionTree.trainClassifier in a reduce() or aggregate function to create an
RDD[DecisionTreeModel]? Maybe train the model with an in-memory dataset
instead of an RDD? Call sc.parallelize on the Iterable values in a groupBy
to create a mini-RDD?

Has anyone else tried something like this with success?

Thanks!