Re: Random Forest on Spark

2014-04-17 Thread Debasish Das
MLlib has a decision tree... there is an RF PR which is not active now...
take that and swap its tree builder with the fast tree builder that's in
MLlib... search for the Spark JIRA... the code is based on the Google
PLANET paper.

I am sure people on the dev list are already working on it... send an email
over there to check the status.

There is also an RF in Cloudera Oryx, but we could not run it on our data
yet.

Weka 3.7.10 has a multi-threaded RF that is good for some ad hoc runs, but
it does not scale...
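
For reference, a minimal sketch of that "bagging around the MLlib tree" idea:
one bootstrap sample and one MLlib tree per forest member, with a majority
vote at prediction time. It assumes the Spark 1.0-era
DecisionTree.train(input, algo, impurity, maxDepth) overload, is not the RF
PR's actual code, and still lacks per-node feature sampling (see later in
this thread).

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.Algo
import org.apache.spark.mllib.tree.impurity.Gini
import org.apache.spark.mllib.tree.model.DecisionTreeModel

// Naive bagged forest: each member is an MLlib decision tree trained on a
// bootstrap sample (drawn with replacement, same expected size as the input).
def trainBaggedForest(data: RDD[LabeledPoint],
                      numTrees: Int,
                      maxDepth: Int): Seq[DecisionTreeModel] =
  (0 until numTrees).map { i =>
    val bootstrap = data.sample(true, 1.0, i)
    DecisionTree.train(bootstrap, Algo.Classification, Gini, maxDepth)
  }

// Predict by majority vote over the individual trees.
def predictForest(trees: Seq[DecisionTreeModel], features: Vector): Double =
  trees.map(_.predict(features)).groupBy(identity).maxBy(_._2.size)._1
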
 On Apr 17, 2014 2:45 AM, "Laeeq Ahmed"  wrote:

> Hi,
>
> For one of my applications, I want to use Random Forests (RF) on top of
> Spark. I see that MLlib currently does not have an implementation for RF.
> What other open-source RF implementations would be good to use with Spark
> in terms of speed?
>
> Regards,
> Laeeq Ahmed,
> KTH, Sweden.
>
>


Re: Random Forest on Spark

2014-04-17 Thread Sung Hwan Chung
Debasish, we've tested the MLlib decision tree a bit and it eats up too
much memory for RF purposes. Once the tree got to depth 8~9, it was easy to
get a heap exception, even with 2~4 GB of memory per worker.

With RF, it's very easy to reach depths of 100+ even with only 100,000+
rows (because the trees usually are not balanced). Additionally, the lack of
multi-class classification limits its applicability.

Also, RF requires random features per tree node to be effective (not just
bootstrap samples), and the MLlib decision tree doesn't support that.


Re: Random Forest on Spark

2014-04-17 Thread Evan R. Sparks
Multiclass classification, Gradient Boosting, and Random Forest support for
based on the recent Decision Tree implementation in MLlib.

Sung - I'd be curious to hear about your use of decision trees (and
forests) where you want to go to 100+ depth. My experience with random
forests has been that people typically build hundreds of shallow trees
(maybe depth 7 or 8), rather than a few (or many) really deep trees.

Generally speaking, we save passes over the data by computing histograms
per variable per candidate split at each *level* of a decision tree. This
can blow up as the tree gets deep, so I'd recommend a lot more memory than
2-4 GB per worker for most big data workloads.
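
To make the one-pass-per-level idea concrete, here is a rough sketch (not
MLlib's actual implementation) of accumulating split statistics for every
node at the current level in a single pass; nodeIndexOf and binOf are
hypothetical helpers. The key space - roughly (nodes at this level) x
features x bins x classes - is exactly what grows as the tree gets deeper.

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint

// One pass per tree level: for each (node at this level, feature, candidate
// bin, class label) accumulate a count; the best split for every node is
// then chosen from these histograms on the driver. nodeIndexOf (which
// current-level node a row falls into) and binOf (which candidate-split bin
// a feature value falls into) are hypothetical helpers, not MLlib APIs.
def levelHistograms(data: RDD[LabeledPoint],
                    nodeIndexOf: LabeledPoint => Int,
                    binOf: (Int, Double) => Int,
                    numFeatures: Int): Map[(Int, Int, Int, Int), Long] =
  data.flatMap { point =>
    val node = nodeIndexOf(point)
    (0 until numFeatures).map { f =>
      ((node, f, binOf(f, point.features(f)), point.label.toInt), 1L)
    }
  }.reduceByKey(_ + _)
   .collect()
   .toMap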


Re: Random Forest on Spark

2014-04-17 Thread Evan R. Sparks
Sorry - I meant to say that "Multiclass classification, Gradient Boosting,
and Random Forest support based on the recent Decision Tree implementation
in MLlib is planned and coming soon."


Re: Random Forest on Spark

2014-04-17 Thread Sung Hwan Chung
Evan,

I actually haven't heard of 'shallow' random forests. I think the only
scenarios where shallow trees are useful are boosting scenarios.

AFAIK, Random Forest is a variance-reducing technique and doesn't do much
about bias (although some people claim that it does have some bias-reducing
effect). Because shallow trees typically have higher bias than fully-grown
trees, people don't often use shallow trees with RF.

You can confirm this through some experiments with R's random forest
implementation as well. It lets you set limits on depth and/or pruning.

In contrast, boosting is a bias-reduction technique (and increases
variance), so people typically use shallow trees.

Our empirical experiments also confirmed that shallow trees resulted in
drastically lower accuracy for random forests.

There are some papers that mix boosting-like techniques with bootstrap
averaging (e.g. http://arxiv.org/pdf/1103.2068.pdf), where you could
potentially use shallow trees to build boosted learners but then average
the results of many boosted learners.


Re: Random Forest on Spark

2014-04-17 Thread Evan R. Sparks
Hmm... can you provide some pointers to examples where deep trees are
helpful?

Typically with Decision Trees you limit depth (either directly or
indirectly with minimum node size and minimum improvement criteria) to
avoid overfitting. I agree with the assessment that forests are a variance
reduction technique, but I'd be a little surprised if a bunch of hugely
deep trees don't overfit to training data. I guess I view limiting tree
depth as an analogue to regularization in linear models.


Re: Random Forest on Spark

2014-04-17 Thread Sung Hwan Chung
Well, if you read the original paper,
http://oz.berkeley.edu/~breiman/randomforest2001.pdf
"Grow the tree using CART methodology to maximum size and do not prune."

Now, The Elements of Statistical Learning (page 598) says that you could
potentially overfit a fully-grown regression random forest. However, this
effect is very slight, and likely negligible for classification.
http://www.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf

In our experiments, however, if the pruning is "drastic", the performance
actually becomes much worse. This makes intuitive sense IMO because a
decision tree is a non-parametric model, and the expressiveness of a tree
depends on the number of nodes.

With a huge amount of data (millions or even billions of rows), we found
that a depth of 10 is simply not adequate to build high-accuracy models.

Re: Random Forest on Spark

2014-04-17 Thread Sung Hwan Chung
Additionally, 'random features per node' (mtry in R) is a very important
feature for Random Forest. The variance reduction comes from the trees being
decorrelated from each other, and the random features per node often do more
for that than the bootstrap samples do. And this is something that would
have to be supported at the tree level.
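
To illustrate what that would mean at the tree level, here is a small,
purely illustrative sketch of per-node feature subsampling (not a patch
against the MLlib tree); the defaults mirror R's randomForest package:
sqrt(p) for classification and p/3 for regression.

import scala.util.Random

// mtry: at every node, only a random subset of the features is considered
// when searching for the best split. This per-node randomness is what
// decorrelates the trees, on top of the per-tree bootstrap sample.
def chooseSplitFeatures(numFeatures: Int, mtry: Int, rng: Random): Array[Int] =
  rng.shuffle((0 until numFeatures).toList).take(mtry).toArray

// Common defaults, as in R's randomForest package.
def defaultMtry(numFeatures: Int, classification: Boolean): Int =
  if (classification) math.max(1, math.sqrt(numFeatures).toInt)
  else math.max(1, numFeatures / 3)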

Re: Random Forest on Spark

2014-04-17 Thread Debasish Das
Evan,

Wasn't the MLlib decision tree implemented using ideas from Google's PLANET
paper? Does the paper also propose growing shallow trees?

Thanks.
Deb

Re: Random Forest on Spark

2014-04-17 Thread Sung Hwan Chung
I believe they show one example comparing a depth-1 ensemble vs. a depth-3
ensemble, but it is based on boosting, not bagging.

Re: Random Forest on Spark

2014-04-17 Thread Evan R. Sparks
What kind of data are you training on? These effects are *highly* data
dependent, and while saying "the depth of 10 is simply not adequate to
build high-accuracy models" may be accurate for the particular problem
you're modeling, it is not true in general. From a statistical perspective,
I consider each node in each tree an additional degree of freedom for the
model, and all else equal I'd expect a model with fewer degrees of freedom
to generalize better. Regardless, if there are lots of use cases for really
deep trees, we'd like to hear about them so that we can decide how
important they are to support!

In the context of CART - pruning very specifically refers to a step *after*
a tree has been constructed to some depth using cross-validation. This was
a variance reduction technique in the original tree work that is
unnecessary and computationally expensive in the context of forests. In the
original Random Forests paper, there are still stopping criteria - usually
either minimum leaf size or minimum split improvement (or both), so
"training to maximum depth" doesn't mean "train until you've completely
divided your dataset and there's one point per leaf." My point is that if
you set minimum leaf size to something like 0.2% of the dataset, then
you're not going to get deeper than 10 or 12 levels with a reasonably
balanced tree.
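
(As a rough back-of-the-envelope check of that bound - my arithmetic, not
from the thread:)

// If a balanced tree roughly halves the data at every split, a node at
// depth d holds about n / 2^d rows. With a minimum leaf size of 0.2% of the
// data, growth stops once n / 2^d < 0.002 * n, i.e. d > log2(500) ~ 9,
// consistent with "10 or 12 levels" for a reasonably balanced tree.
val minLeafFraction = 0.002
val maxBalancedDepth =
  math.ceil(math.log(1.0 / minLeafFraction) / math.log(2.0)).toInt  // == 9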

With respect to PLANET - our implementation is very much in the spirit of
planet, but has some key differences - there's good documentation on
exactly what the differences are forthcoming, so I won't belabor these
here. The differences are designed to 1) avoid data shuffling, and 2)
minimize number of passes over the training data. Of course, there are
tradeoffs involved, and there is at least one really good trick in the
PLANET work that we should leverage that we aren't yet - namely once the
nodes get small enough for data to fit easily on a single machine, data can
be shuffled and then the remainder of the tree can be trained in parallel
from each lower node on a single machine This would actually help with the
memory overheads in model training when trees get deep  - if someone wants
to modify the current implementation of trees in MLlib and contribute this
optimization as a pull request, it would be welcome!

At any rate, we'll take this feedback into account with respect to
improving the tree implementation, but if anyone can send over use cases or
(even better) datasets where really deep trees are necessary, that would be
great!

Re: Random Forest on Spark

2014-04-17 Thread Sung Hwan Chung
Yes, it should be data specific and perhaps we're biased toward the data
sets that we are playing with. To put things in perspective, we're highly
interested in (and I believe, our customers are):

1. large (hundreds of millions of rows)
2. multi-class classification - nowadays, dozens of target categories are
common and even thousands in some cases - you could imagine that this is a
big reason for us requiring more 'complex' models
3. high dimensional with thousands of descriptive and sort-of-independent
features

From the theoretical perspective, I would argue that it's usually in one's
best interest to prune as little as possible. I believe that pruning
inherently increases the bias of an individual tree, which RF can't do
anything about, while decreasing variance - which is what RF is for.

The default pruning criterion for R's reference implementation is a minimum
node size of 1 (meaning a fully-grown tree) for classification, and 5 for
regression. I'd imagine they did at least some empirical testing to justify
these values at the time - although that was a time of small datasets :).

FYI, we are also considering the MLlib decision tree for our Gradient
Boosting implementation; however, the memory requirement is still a bit too
steep (we were getting heap exceptions at a depth limit of 5 with 2 GB per
worker and approximately 1000 features). Now, 2 GB per worker is about what
we expect our typical customers would tolerate, and I don't think that's
unreasonable for shallow trees.

Re: Random Forest on Spark

2014-04-18 Thread Laeeq Ahmed
Has anyone tried Mahout RF or Stratosphere RF with Spark? Any comments?

Regards,
Laeeq

Re: Random Forest on Spark

2014-04-18 Thread Eustache DIEMERT
Another option is to use ExtraTrees as provided by scikit-learn with
pyspark:

https://github.com/pydata/pyrallel/blob/master/pyrallel/ensemble.py#L27-L59

This is a proof of concept right now and would need to be adapted to your
use case, but the core decision tree implementation is highly optimized and
could solve the memory issue mentioned by the OP.

Also, for scaling ensembles of decision trees there is the LambdaMART paper
[1], which is more modern/optimized in its approach, albeit using an MPI
implementation.

Finally, here [2] is a blog post of mine explaining the PLANET paper and
its limitations

[1] http://jmlr.org/proceedings/papers/v14/burges11a/burges11a.pdf

[2]
http://stochastics.komodo.re/implementing-distributed-gradient-boosted-trees-part-2.html

HTH,

Eustache

Re: Random Forest on Spark

2014-04-18 Thread Sean Owen
Mahout RDF is fairly old code. If you try it, try to use 1.0-SNAPSHOT,
as you will almost certainly need this patch to make it run reasonably
fast: https://issues.apache.org/jira/browse/MAHOUT-1419

I have not tried Stratosphere here.

Since we are on the subject of RDF on Hadoop, possibly on M/R, I don't
feel too bad advertising this: oryx also does
classification/regression via RDF:
https://github.com/cloudera/oryx#classification--regression-example

This is a fairly different design choice than, say, what's in the
PLANET paper. The one big negative is that trees are built only over a
sub-sample of the data. But given that big simplifying assumption, a
lot of other things work well. It's not iterative, so it is not handicapped
by being M/R-based. It may be of interest if you're building or benchmarking
things on Hadoop.
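
(For intuition, a very rough sketch of that sub-sample-per-tree design in
Spark terms - not how Oryx actually implements it: pull one manageable
sample to the driver, broadcast it, and let each task train a whole tree
locally. LocalTree / trainLocalTree are hypothetical placeholders for any
single-machine tree learner.)

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical single-machine tree learner; not a real API.
trait LocalTree
def trainLocalTree(rows: Array[LabeledPoint], seed: Long): LocalTree = ???

// Each tree is built from a sub-sample small enough to train entirely
// within one task, so no per-row shuffling happens across the cluster. The
// seed is passed so the learner can bootstrap/randomize per tree.
def trainSubsampledForest(sc: SparkContext,
                          data: RDD[LabeledPoint],
                          numTrees: Int,
                          sampleFraction: Double): Array[LocalTree] = {
  val sample = sc.broadcast(data.sample(false, sampleFraction, 42).collect())
  sc.parallelize(0 until numTrees, numTrees)
    .map(i => trainLocalTree(sample.value, i))
    .collect()
}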

Personally, going forward, I'm interested in something smarter (like
what I see is going into the new Spark impl) although there really are
some big design tradeoffs here, yes.


--
Sean Owen | Director, Data Science | London


Re: Random Forest on Spark

2014-04-18 Thread Eustache DIEMERT
Is there a PR or issue where GBT / RF progress in MLLib is tracked?


2014-04-17 21:11 GMT+02:00 Evan R. Sparks :

> Sorry - I meant to say that "Multiclass classification, Gradient
> Boosting, and Random Forest support based on the recent Decision Tree
> implementation in MLlib is planned and coming soon."
>
>
> On Thu, Apr 17, 2014 at 12:07 PM, Evan R. Sparks wrote:
>
>> Multiclass classification, Gradient Boosting, and Random Forest support
>> for based on the recent Decision Tree implementation in MLlib.
>>
>> Sung - I'd be curious to hear about your use of decision trees (and
>> forests) where you want to go to 100+ depth. My experience with random
>> forests has been that people typically build hundreds of shallow trees
>> (maybe depth 7 or 8), rather than a few (or many) really deep trees.
>>
>> Generally speaking, we save passes over the data by computing histograms
>> per variable per split at each *level* of a decision tree. This can blow up
>> as the level of the decision tree gets deep, but I'd recommend a lot more
>> memory than 2-4GB per worker for most big data workloads.
>>
>>
>>
>>
>>
>> On Thu, Apr 17, 2014 at 11:50 AM, Sung Hwan Chung <
>> coded...@cs.stanford.edu> wrote:
>>
>>> Debasish, we've tested the MLLib decision tree a bit and it eats up too
>>> much memory for RF purposes.
>>> Once the tree got to depth 8~9, it was easy to get heap exception, even
>>> with 2~4 GB of memory per worker.
>>>
>>> With RF, it's very easy to get 100+ depth in RF with even only 100,000+
>>> rows (because trees usually are not balanced). Additionally, the lack of
>>> multi-class classification limits its applicability.
>>>
>>> Also, RF requires random features per tree node to be effective (not
>>> just bootstrap samples), and MLLib decision tree doesn't support that.
>>>
>>>
>>> On Thu, Apr 17, 2014 at 10:27 AM, Debasish Das wrote:
>>>
 Mllib has decision treethere is a rf pr which is not active
 nowtake that and swap the tree builder with the fast tree builder
 that's in mllib...search for the spark jira...the code is based on google
 planet paper. ..

 I am sure people in devlist are already working on it...send an email
 to know the status over there...

 There is also a rf in cloudera oryx but we could not run it on our data
 yet

 Weka 3.7.10 has a multi thread rf that is good to do some adhoc runs
 but it does not scale...
  On Apr 17, 2014 2:45 AM, "Laeeq Ahmed"  wrote:

> Hi,
>
> For one of my application, I want to use Random forests(RF) on top of
> spark. I see that currenlty MLLib does not have implementation for RF. 
> What
> other opensource RF implementations will be great to use with spark in
> terms of speed?
>
> Regards,
> Laeeq Ahmed,
> KTH, Sweden.
>
>
>>>
>>
>


Re: Random Forest on Spark

2014-04-18 Thread Eustache DIEMERT
Sorry, I mixed up the link; it should be

https://gist.github.com/wpm/6454814

and the algorithm is not ExtraTrees but a basic ensemble of boosted trees.


2014-04-18 10:31 GMT+02:00 Eustache DIEMERT :

> Another option is to use ExtraTrees as provided by scikit-learn with
> pyspark:
>
> https://github.com/pydata/pyrallel/blob/master/pyrallel/ensemble.py#L27-L59
>
> this is a proof of concept right now and should be hacked to what you
> need, but the core decision tree implementation is highly optimized and
> could solve the memory issue mentioned by the OP.
>
> Also, for scaling ensembles of decision trees, there is the LambdaMART
> paper [1], which is more modern/optimized in its approach, albeit using an
> MPI implementation.
>
> Finally, here [2] is a blog post of mine explaining the PLANET paper and
> its limitations
>
> [1] http://jmlr.org/proceedings/papers/v14/burges11a/burges11a.pdf
>
> [2]
> http://stochastics.komodo.re/implementing-distributed-gradient-boosted-trees-part-2.html
>
> HTH,
>
> Eustache
>
>
>
>
>
>
> 2014-04-18 10:21 GMT+02:00 Laeeq Ahmed :
>
>  Have anyone tried mahout RF or Stratosphere RF with spark. Any comments.
>>
>> Regards,
>> Laeeq
>>   On Friday, April 18, 2014 3:11 AM, Sung Hwan Chung <
>> coded...@cs.stanford.edu> wrote:
>>   Yes, it should be data specific and perhaps we're biased toward the
>> data sets that we are playing with. To put things in perspective, we're
>> highly interested in (and I believe, our customers are):
>>
>> 1. large (hundreds of millions of rows)
>> 2. multi-class classification - nowadays, dozens of target categories are
>> common and even thousands in some cases - you could imagine that this is a
>> big reason for us requiring more 'complex' models
>> 3. high dimensional with thousands of descriptive and sort-of-independent
>> features
>>
>> From the theoretical perspective, I would argue that it's usually in the
>> best interest to prune as little as possible. I believe that pruning
>> inherently increases bias of an individual tree, which RF can't do anything
>> about while decreasing variance - which is what RF is for.
>>
>> The default pruning criteria for R's reference implementation is min-node
>> of 1 (meaning fully-grown tree) for classification, and 5 for regression.
>> I'd imagine they did at least some empirical testing to justify these
>> values at the time - although at a time of small datasets :).
>>
>> FYI, we are also considering the MLLib decision tree for our Gradient
>> Boosting implementation, however, the memory requirement is still a bit too
>> steep (we were getting heap exceptions at depth limit of 5 with 2GB per
>> worker with approximately 1000 features). Now 2GB per worker is about what
>> we expect our typical customers would tolerate and I don't think that it's
>> unreasonable for shallow trees.
>>
>>
>>
>> On Thu, Apr 17, 2014 at 3:54 PM, Evan R. Sparks wrote:
>>
>> What kind of data are you training on? These effects are *highly* data
>> dependent, and while saying "the depth of 10 is simply not adequate to
>> build high-accuracy models" may be accurate for the particular problem
>> you're modeling, it is not true in general. From a statistical perspective,
>> I consider each node in each tree an additional degree of freedom for the
>> model, and all else equal I'd expect a model with fewer degrees of freedom
>> to generalize better. Regardless, if there are lots of use cases for really
>> deep trees, we'd like to hear about them so that we can decide how
>> important they are to support!
>>
>> In the context of CART - pruning very specifically refers to a step
>> *after* a tree has been constructed to some depth using cross-validation.
>> This was a variance reduction technique in the original tree work that is
>> unnecessary and computationally expensive in the context of forests. In the
>> original Random Forests paper, there are still stopping criteria - usually
>> either minimum leaf size or minimum split improvement (or both), so
>> "training to maximum depth" doesn't mean "train until you've completely
>> divided your dataset and there's one point per leaf." My point is that if
>> you set minimum leaf size to something like 0.2% of the dataset, then
>> you're not going to get deeper than 10 or 12 levels with a reasonably
>> balanced tree.
>>
>> With respect to PLANET - our implementation is very much in the spirit of
>> planet, but has some key differences - there's good documentation on
>> exactly what the differences are forthcoming, so I won't belabor these
>> here. The differences are designed to 1) avoid data shuffling, and 2)
>> minimize number of passes over the training data. Of course, there are
>> tradeoffs involved, and there is at least one really good trick in the
>> PLANET work that we should leverage that we aren't yet - namely once the
>> nodes get small enough for data to fit easily on a single machine, data can
>> be shuffled and then the remainder of the tree can be trained in parallel
>> 

Re: Random Forest on Spark

2014-04-18 Thread Evan R. Sparks
Interesting, and thanks for the thoughts.

I think we're on the same page with 100s of millions of records. We've
tested the tree implementation in mllib on 1b rows and up to 100 features -
though this isn't hitting the 1000s of features you mention.

Obviously multi class support isn't there yet, but I can see your point
about deeper trees for many class problems. Will try them out on some image
processing stuff with 1k classes we're doing in the lab once they are more
developed to get a sense for where the issues are.

If you're only allocating 2GB/worker you're going to have a hard time
getting the real advantages of Spark.

For your 1k features causing heap exceptions at depth 5  - are these
categorical or continuous? The categorical vars create much smaller
histograms.

If you're fitting all continuous features, the memory requirements are
O(b*d*2^l) where b=number of histogram bins, d=number of features, and l =
level of the tree. Even accounting for object overhead, with the default
number of bins, the histograms at this depth should be on the order of tens of MB,
not 2GB - so I'm guessing your cached data is occupying a significant chunk
of that 2GB? In the tree PR - Hirakendu Das tested down to depth 10 on 500m
data points with 20 continuous features and was able to run without running
into memory issues (and scaling properties got better as the depth grew).
His worker mem was 7.5GB and 30% of that was reserved for caching. If you
wanted to go 1000 features at depth 10 I'd estimate a couple of gigs
necessary for heap space for the worker to compute/store the histograms,
and I guess 2x that on the master to do the reduce.
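
To make that concrete, here is a rough Scala sketch of the same
back-of-envelope math. The bin count and per-bin payload below are just
assumptions for illustration (they are not the actual MLlib defaults), so
treat the output as an order-of-magnitude estimate only:

    val bins = 32                 // assumed histogram bins per continuous feature
    val features = 1000           // roughly the 1k features discussed here
    val bytesPerBin = 3 * 8L      // e.g. count, sum, sum-of-squares as doubles (assumed)
    def levelBytes(level: Int): Long =
      bins.toLong * features * (1L << level) * bytesPerBin
    Seq(5, 10).foreach { l =>
      println(f"depth $l%2d: ~${levelBytes(l) / 1e6}%.0f MB of histograms at that level")
    }

With those assumed numbers you land around 25 MB at depth 5 and closer to
800 MB at depth 10, which is why the depth term quickly dominates.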

Again 2GB per worker is pretty tight, because there are overheads of just
starting the jvm, launching a worker, loading libraries, etc.

- Evan

On Apr 17, 2014, at 6:10 PM, Sung Hwan Chung 
wrote:

Yes, it should be data specific and perhaps we're biased toward the data
sets that we are playing with. To put things in perspective, we're highly
interested in (and I believe, our customers are):

1. large (hundreds of millions of rows)
2. multi-class classification - nowadays, dozens of target categories are
common and even thousands in some cases - you could imagine that this is a
big reason for us requiring more 'complex' models
3. high dimensional with thousands of descriptive and sort-of-independent
features

From the theoretical perspective, I would argue that it's usually in the
best interest to prune as little as possible. I believe that pruning
inherently increases bias of an individual tree, which RF can't do anything
about while decreasing variance - which is what RF is for.

The default pruning criteria for R's reference implementation is min-node
of 1 (meaning fully-grown tree) for classification, and 5 for regression.
I'd imagine they did at least some empirical testing to justify these
values at the time - although at a time of small datasets :).

FYI, we are also considering the MLLib decision tree for our Gradient
Boosting implementation, however, the memory requirement is still a bit too
steep (we were getting heap exceptions at depth limit of 5 with 2GB per
worker with approximately 1000 features). Now 2GB per worker is about what
we expect our typical customers would tolerate and I don't think that it's
unreasonable for shallow trees.



On Thu, Apr 17, 2014 at 3:54 PM, Evan R. Sparks wrote:

> What kind of data are you training on? These effects are *highly* data
> dependent, and while saying "the depth of 10 is simply not adequate to
> build high-accuracy models" may be accurate for the particular problem
> you're modeling, it is not true in general. From a statistical perspective,
> I consider each node in each tree an additional degree of freedom for the
> model, and all else equal I'd expect a model with fewer degrees of freedom
> to generalize better. Regardless, if there are lots of use cases for really
> deep trees, we'd like to hear about them so that we can decide how
> important they are to support!
>
> In the context of CART - pruning very specifically refers to a step
> *after* a tree has been constructed to some depth using cross-validation.
> This was a variance reduction technique in the original tree work that is
> unnecessary and computationally expensive in the context of forests. In the
> original Random Forests paper, there are still stopping criteria - usually
> either minimum leaf size or minimum split improvement (or both), so
> "training to maximum depth" doesn't mean "train until you've completely
> divided your dataset and there's one point per leaf." My point is that if
> you set minimum leaf size to something like 0.2% of the dataset, then
> you're not going to get deeper than 10 or 12 levels with a reasonably
> balanced tree.
>
> With respect to PLANET - our implementation is very much in the spirit of
> planet, but has some key differences - there's good documentation on
> exactly what the differences are forthcoming, so I won't belabor the

Re: Random Forest on Spark

2014-04-18 Thread Sung Hwan Chung
Thanks for the info on mem requirement.

I think that a lot of businesses would probably prefer to use Spark on top
of YARN, since that's what they invest in - a large Hadoop cluster. And the
default setting for YARN seems to cap memory per container at 8 GB - so
ideally, we would like to use a lot less than that (rather than telling
them to go and change their YARN settings).

A convenient feature would be to automatically figure things out and adapt
the algorithm to memory limits (e.g., process X nodes at a time instead of
all the nodes at the same level). Additionally, we noticed that the default
'Double' usage for LabeledPoint is very wasteful for the majority of data
sets. Float would do most of the time, and in fact a lot of datasets could
get away with Short or even Byte. Or in your case, since you're transforming
the data into bins anyway, you could probably cache the bin IDs (for which
Short or even Byte would do)?
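
As a rough, hypothetical sketch of what I mean (this is not MLlib code),
compare the per-row footprint of 1000 features kept as Doubles with the same
row stored as pre-computed bin IDs in a Byte array:

    val numFeatures = 1000
    val asDoubles: Array[Double] = Array.fill(numFeatures)(0.0)      // 8 bytes per value
    val asBinIds:  Array[Byte]   = Array.fill(numFeatures)(0.toByte) // 1 byte per value
    println(s"per row: ${numFeatures * 8} bytes as Doubles vs $numFeatures bytes as bin IDs")

Over 100M rows that is roughly 800 GB vs 100 GB of raw feature storage,
before any per-object JVM overhead, so the element type alone is an 8x swing.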



On Fri, Apr 18, 2014 at 8:43 AM, Evan R. Sparks wrote:

> Interesting, and thanks for the thoughts.
>
> I think we're on the same page with 100s of millions of records. We've
> tested the tree implementation in mllib on 1b rows and up to 100 features -
> though this isn't hitting the 1000s of features you mention.
>
> Obviously multi class support isn't there yet, but I can see your point
> about deeper trees for many class problems. Will try them out on some image
> processing stuff with 1k classes we're doing in the lab once they are more
> developed to get a sense for where the issues are.
>
> If you're only allocating 2GB/worker you're going to have a hard time
> getting the real advantages of Spark.
>
> For your 1k features causing heap exceptions at depth 5  - are these
> categorical or continuous? The categorical vars create much smaller
> histograms.
>
> If you're fitting all continuous features, the memory requirements are
> O(b*d*2^l) where b=number of histogram bins, d=number of features, and l =
> level of the tree. Even accounting for object overhead, with the default
> number of bins, the histograms at this depth should be order of 10s of MB,
> not 2GB - so I'm guessing your cached data is occupying a significant chunk
> of that 2GB? In the tree PR - Hirakendu Das tested down to depth 10 on 500m
> data points with 20 continuous features and was able to run without running
> into memory issues (and scaling properties got better as the depth grew).
> His worker mem was 7.5GB and 30% of that was reserved for caching. If you
> wanted to go 1000 features at depth 10 I'd estimate a couple of gigs
> necessary for heap space for the worker to compute/store the histograms,
> and I guess 2x that on the master to do the reduce.
>
> Again 2GB per worker is pretty tight, because there are overheads of just
> starting the jvm, launching a worker, loading libraries, etc.
>
> - Evan
>
> On Apr 17, 2014, at 6:10 PM, Sung Hwan Chung 
> wrote:
>
> Yes, it should be data specific and perhaps we're biased toward the data
> sets that we are playing with. To put things in perspective, we're highly
> interested in (and I believe, our customers are):
>
> 1. large (hundreds of millions of rows)
> 2. multi-class classification - nowadays, dozens of target categories are
> common and even thousands in some cases - you could imagine that this is a
> big reason for us requiring more 'complex' models
> 3. high dimensional with thousands of descriptive and sort-of-independent
> features
>
> From the theoretical perspective, I would argue that it's usually in the
> best interest to prune as little as possible. I believe that pruning
> inherently increases bias of an individual tree, which RF can't do anything
> about while decreasing variance - which is what RF is for.
>
> The default pruning criteria for R's reference implementation is min-node
> of 1 (meaning fully-grown tree) for classification, and 5 for regression.
> I'd imagine they did at least some empirical testing to justify these
> values at the time - although at a time of small datasets :).
>
> FYI, we are also considering the MLLib decision tree for our Gradient
> Boosting implementation, however, the memory requirement is still a bit too
> steep (we were getting heap exceptions at depth limit of 5 with 2GB per
> worker with approximately 1000 features). Now 2GB per worker is about what
> we expect our typical customers would tolerate and I don't think that it's
> unreasonable for shallow trees.
>
>
>
> On Thu, Apr 17, 2014 at 3:54 PM, Evan R. Sparks wrote:
>
>> What kind of data are you training on? These effects are *highly* data
>> dependent, and while saying "the depth of 10 is simply not adequate to
>> build high-accuracy models" may be accurate for the particular problem
>> you're modeling, it is not true in general. From a statistical perspective,
>> I consider each node in each tree an additional degree of freedom for the
>> model, and all else equal I'd expect a model with fewer degrees of freedom
>> to generalize better. Reg

Re: Random Forest on Spark

2014-04-18 Thread Debasish Das
Spark on YARN is a big pain due to the strict memory requirement per
container...

If you are stress testing it, could you use a standalone cluster and see at
which feature count the per-worker RAM requirement reaches 16 GB or
more...it is possible to get 16 GB instances on EC2 these days without much
trouble...

Another way is to run a feature selection algorithm to decrease the feature
space before running the decision tree or its variants...there is a PR on
entropy-based feature selection algorithms...wouldn't you want to use those
to cut down the feature count?

A feature extraction algorithm like matrix factorization and its variants
could be used to reduce the feature space as well...
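
Something along these lines, as a plain Scala sketch (not the PR's actual
code): score each categorical/binned feature by information gain against the
label and keep only the top k before handing the data to the tree trainer:

    def entropy(counts: Iterable[Int]): Double = {
      val n = counts.sum.toDouble
      counts.filter(_ > 0).map { c => val p = c / n; -p * math.log(p) / math.log(2) }.sum
    }

    // rows: (feature values, label); all values assumed categorical or already binned
    def infoGain(rows: Seq[(Array[Int], Int)], feature: Int): Double = {
      val labelEntropy = entropy(rows.groupBy(_._2).values.map(_.size))
      val conditional = rows.groupBy(_._1(feature)).values.map { grp =>
        (grp.size.toDouble / rows.size) * entropy(grp.groupBy(_._2).values.map(_.size))
      }.sum
      labelEntropy - conditional
    }

    def topKFeatures(rows: Seq[(Array[Int], Int)], numFeatures: Int, k: Int): Seq[Int] =
      (0 until numFeatures).sortBy(f => -infoGain(rows, f)).take(k)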



On Fri, Apr 18, 2014 at 10:53 AM, Sung Hwan Chung
wrote:

> Thanks for the info on mem requirement.
>
> I think that a lot of businesses would probably prefer to use Spark on top
> of YARN, since that's what they invest on - a large Hadoop cluster. And the
> default setting for YARN seems to cap memory per container to 8 GB - so
> ideally, we would like to use a lot less than that (rather than telling
> them, nooo change your YARN settings).
>
> A convenient feature would be to automatically figure things out, and try
> to adapt the algorithm to memory limits (e.g., process X # of nodes at a
> time, instead of all the nodes at the same level). Additionally, we noticed
> that the default 'Double' usage for LabelPoint is very wasteful for a
> majority of data sets. Float would do most of times and in fact, a lot of
> datasets could get away with using Short or even Byte. Or in your case,
> since you're transforming data to Bins anyways, you could probably cache
> BIN IDs (for which you could use Short or Byte even)?
>
>
>
> On Fri, Apr 18, 2014 at 8:43 AM, Evan R. Sparks wrote:
>
>> Interesting, and thanks for the thoughts.
>>
>> I think we're on the same page with 100s of millions of records. We've
>> tested the tree implementation in mllib on 1b rows and up to 100 features -
>> though this isn't hitting the 1000s of features you mention.
>>
>> Obviously multi class support isn't there yet, but I can see your point
>> about deeper trees for many class problems. Will try them out on some image
>> processing stuff with 1k classes we're doing in the lab once they are more
>> developed to get a sense for where the issues are.
>>
>> If you're only allocating 2GB/worker you're going to have a hard time
>> getting the real advantages of Spark.
>>
>> For your 1k features causing heap exceptions at depth 5  - are these
>> categorical or continuous? The categorical vars create much smaller
>> histograms.
>>
>> If you're fitting all continuous features, the memory requirements are
>> O(b*d*2^l) where b=number of histogram bins, d=number of features, and l =
>> level of the tree. Even accounting for object overhead, with the default
>> number of bins, the histograms at this depth should be order of 10s of MB,
>> not 2GB - so I'm guessing your cached data is occupying a significant chunk
>> of that 2GB? In the tree PR - Hirakendu Das tested down to depth 10 on 500m
>> data points with 20 continuous features and was able to run without running
>> into memory issues (and scaling properties got better as the depth grew).
>> His worker mem was 7.5GB and 30% of that was reserved for caching. If you
>> wanted to go 1000 features at depth 10 I'd estimate a couple of gigs
>> necessary for heap space for the worker to compute/store the histograms,
>> and I guess 2x that on the master to do the reduce.
>>
>> Again 2GB per worker is pretty tight, because there are overheads of just
>> starting the jvm, launching a worker, loading libraries, etc.
>>
>> - Evan
>>
>> On Apr 17, 2014, at 6:10 PM, Sung Hwan Chung 
>> wrote:
>>
>> Yes, it should be data specific and perhaps we're biased toward the data
>> sets that we are playing with. To put things in perspective, we're highly
>> interested in (and I believe, our customers are):
>>
>> 1. large (hundreds of millions of rows)
>> 2. multi-class classification - nowadays, dozens of target categories are
>> common and even thousands in some cases - you could imagine that this is a
>> big reason for us requiring more 'complex' models
>> 3. high dimensional with thousands of descriptive and sort-of-independent
>> features
>>
>> From the theoretical perspective, I would argue that it's usually in the
>> best interest to prune as little as possible. I believe that pruning
>> inherently increases bias of an individual tree, which RF can't do anything
>> about while decreasing variance - which is what RF is for.
>>
>> The default pruning criteria for R's reference implementation is min-node
>> of 1 (meaning fully-grown tree) for classification, and 5 for regression.
>> I'd imagine they did at least some empirical testing to justify these
>> values at the time - although at a time of small datasets :).
>>
>> FYI, we are also considering the MLLib decision tree for our Gradient
>> Boosting implementation, how

Re: Random Forest on Spark

2014-04-18 Thread Sung Hwan Chung
Debasish,

Unfortunately, we are bound to YARN, at least for the time being, because
that's what most of our customers would be using (unless all the Hadoop
vendors start supporting standalone Spark - I think Cloudera might do
that?).




On Fri, Apr 18, 2014 at 11:12 AM, Debasish Das wrote:

> Spark on YARN is a big pain due to the strict memory requirement per
> container...
>
> If you are stress testing it, could you use a standalone cluster and see
> at which feature upper bound does per worker RAM requirement reaches 16 GB
> or more...it is possible to get 16 GB instances on EC2 these days without
> much trouble.,..
>
> Another way is to run a feature selection algorithm to decrease features
> space before running decision tree or algorithm variants...There is a PR on
> entropy based feature selection algorithms...you don't want to use them to
> decrease features right ?
>
> A feature extraction algorithm like matrix factorization and it's variants
> could be used to decrease feature space as well...
>
>
>
> On Fri, Apr 18, 2014 at 10:53 AM, Sung Hwan Chung <
> coded...@cs.stanford.edu> wrote:
>
>> Thanks for the info on mem requirement.
>>
>> I think that a lot of businesses would probably prefer to use Spark on
>> top of YARN, since that's what they invest on - a large Hadoop cluster. And
>> the default setting for YARN seems to cap memory per container to 8 GB - so
>> ideally, we would like to use a lot less than that (rather than telling
>> them, nooo change your YARN settings).
>>
>> A convenient feature would be to automatically figure things out, and try
>> to adapt the algorithm to memory limits (e.g., process X # of nodes at a
>> time, instead of all the nodes at the same level). Additionally, we noticed
>> that the default 'Double' usage for LabelPoint is very wasteful for a
>> majority of data sets. Float would do most of times and in fact, a lot of
>> datasets could get away with using Short or even Byte. Or in your case,
>> since you're transforming data to Bins anyways, you could probably cache
>> BIN IDs (for which you could use Short or Byte even)?
>>
>>
>>
>> On Fri, Apr 18, 2014 at 8:43 AM, Evan R. Sparks wrote:
>>
>>> Interesting, and thanks for the thoughts.
>>>
>>> I think we're on the same page with 100s of millions of records. We've
>>> tested the tree implementation in mllib on 1b rows and up to 100 features -
>>> though this isn't hitting the 1000s of features you mention.
>>>
>>> Obviously multi class support isn't there yet, but I can see your point
>>> about deeper trees for many class problems. Will try them out on some image
>>> processing stuff with 1k classes we're doing in the lab once they are more
>>> developed to get a sense for where the issues are.
>>>
>>> If you're only allocating 2GB/worker you're going to have a hard time
>>> getting the real advantages of Spark.
>>>
>>> For your 1k features causing heap exceptions at depth 5  - are these
>>> categorical or continuous? The categorical vars create much smaller
>>> histograms.
>>>
>>> If you're fitting all continuous features, the memory requirements are
>>> O(b*d*2^l) where b=number of histogram bins, d=number of features, and l =
>>> level of the tree. Even accounting for object overhead, with the default
>>> number of bins, the histograms at this depth should be order of 10s of MB,
>>> not 2GB - so I'm guessing your cached data is occupying a significant chunk
>>> of that 2GB? In the tree PR - Hirakendu Das tested down to depth 10 on 500m
>>> data points with 20 continuous features and was able to run without running
>>> into memory issues (and scaling properties got better as the depth grew).
>>> His worker mem was 7.5GB and 30% of that was reserved for caching. If you
>>> wanted to go 1000 features at depth 10 I'd estimate a couple of gigs
>>> necessary for heap space for the worker to compute/store the histograms,
>>> and I guess 2x that on the master to do the reduce.
>>>
>>> Again 2GB per worker is pretty tight, because there are overheads of
>>> just starting the jvm, launching a worker, loading libraries, etc.
>>>
>>> - Evan
>>>
>>> On Apr 17, 2014, at 6:10 PM, Sung Hwan Chung 
>>> wrote:
>>>
>>> Yes, it should be data specific and perhaps we're biased toward the data
>>> sets that we are playing with. To put things in perspective, we're highly
>>> interested in (and I believe, our customers are):
>>>
>>> 1. large (hundreds of millions of rows)
>>> 2. multi-class classification - nowadays, dozens of target categories
>>> are common and even thousands in some cases - you could imagine that this
>>> is a big reason for us requiring more 'complex' models
>>> 3. high dimensional with thousands of descriptive and
>>> sort-of-independent features
>>>
>>> From the theoretical perspective, I would argue that it's usually in the
>>> best interest to prune as little as possible. I believe that pruning
>>> inherently increases bias of an individual tree, which RF can't do anything
>>> about while decreasing varia

Re: Random Forest on Spark

2014-04-18 Thread Manish Amde
Sorry for arriving late to the party! Evan has clearly explained the
current implementation, our future plans and key differences with the
PLANET paper. I don't think I can add more to his comments. :-)

I apologize for not creating the corresponding JIRA tickets for the tree
improvements (multiclass classification, deep trees, post-shuffle
single-machine computation for small datasets, code refactoring for
pluggable loss calculation) and tree ensembles (RF, GBT, AdaBoost,
ExtraTrees, partial implementation of RF). I will create them soon.

We are currently working on creating very fast ensemble trees, which will be
different from the current ensemble tree implementations in other libraries.
PRs for tree improvements would be great -- just make sure you go carefully
through the tree code (which I think is fairly well documented but
non-trivial to understand) and discuss your changes on JIRA before
implementing, to avoid duplication.

-Manish


On Fri, Apr 18, 2014 at 8:43 AM, Evan R. Sparks wrote:

> Interesting, and thanks for the thoughts.
>
> I think we're on the same page with 100s of millions of records. We've
> tested the tree implementation in mllib on 1b rows and up to 100 features -
> though this isn't hitting the 1000s of features you mention.
>
> Obviously multi class support isn't there yet, but I can see your point
> about deeper trees for many class problems. Will try them out on some image
> processing stuff with 1k classes we're doing in the lab once they are more
> developed to get a sense for where the issues are.
>
> If you're only allocating 2GB/worker you're going to have a hard time
> getting the real advantages of Spark.
>
> For your 1k features causing heap exceptions at depth 5  - are these
> categorical or continuous? The categorical vars create much smaller
> histograms.
>
> If you're fitting all continuous features, the memory requirements are
> O(b*d*2^l) where b=number of histogram bins, d=number of features, and l =
> level of the tree. Even accounting for object overhead, with the default
> number of bins, the histograms at this depth should be order of 10s of MB,
> not 2GB - so I'm guessing your cached data is occupying a significant chunk
> of that 2GB? In the tree PR - Hirakendu Das tested down to depth 10 on 500m
> data points with 20 continuous features and was able to run without running
> into memory issues (and scaling properties got better as the depth grew).
> His worker mem was 7.5GB and 30% of that was reserved for caching. If you
> wanted to go 1000 features at depth 10 I'd estimate a couple of gigs
> necessary for heap space for the worker to compute/store the histograms,
> and I guess 2x that on the master to do the reduce.
>
> Again 2GB per worker is pretty tight, because there are overheads of just
> starting the jvm, launching a worker, loading libraries, etc.
>
> - Evan
>
> On Apr 17, 2014, at 6:10 PM, Sung Hwan Chung 
> wrote:
>
> Yes, it should be data specific and perhaps we're biased toward the data
> sets that we are playing with. To put things in perspective, we're highly
> interested in (and I believe, our customers are):
>
> 1. large (hundreds of millions of rows)
> 2. multi-class classification - nowadays, dozens of target categories are
> common and even thousands in some cases - you could imagine that this is a
> big reason for us requiring more 'complex' models
> 3. high dimensional with thousands of descriptive and sort-of-independent
> features
>
> From the theoretical perspective, I would argue that it's usually in the
> best interest to prune as little as possible. I believe that pruning
> inherently increases bias of an individual tree, which RF can't do anything
> about while decreasing variance - which is what RF is for.
>
> The default pruning criteria for R's reference implementation is min-node
> of 1 (meaning fully-grown tree) for classification, and 5 for regression.
> I'd imagine they did at least some empirical testing to justify these
> values at the time - although at a time of small datasets :).
>
> FYI, we are also considering the MLLib decision tree for our Gradient
> Boosting implementation, however, the memory requirement is still a bit too
> steep (we were getting heap exceptions at depth limit of 5 with 2GB per
> worker with approximately 1000 features). Now 2GB per worker is about what
> we expect our typical customers would tolerate and I don't think that it's
> unreasonable for shallow trees.
>
>
>
> On Thu, Apr 17, 2014 at 3:54 PM, Evan R. Sparks wrote:
>
>> What kind of data are you training on? These effects are *highly* data
>> dependent, and while saying "the depth of 10 is simply not adequate to
>> build high-accuracy models" may be accurate for the particular problem
>> you're modeling, it is not true in general. From a statistical perspective,
>> I consider each node in each tree an additional degree of freedom for the
>> model, and all else equal I'd expect a model with fewer degrees of freedom
>> to generalize bett

Re: Random Forest on Spark

2014-04-18 Thread Sean Owen
On Fri, Apr 18, 2014 at 7:31 PM, Sung Hwan Chung
 wrote:
> Debasish,
>
> Unfortunately, we are bound to YARN, at least for the time being, because
> that's what most of our customers would be using (unless, all the Hadoop
> vendors start supporting standalone Spark - I think Cloudera might do
> that?).

Yes the CDH5.0.0 distro just runs Spark in stand-alone mode. Using the
YARN integration is still being worked on.


Re: Random Forest on Spark

2014-04-18 Thread Sandy Ryza
I don't think the YARN default of a max 8 GB container size is a good
justification for limiting memory per worker.  This is a sort of arbitrary
number that came from an era where MapReduce was the main YARN application
and machines generally had less memory.  I expect to see this configured
much higher in practice on most clusters running Spark.
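
For what it's worth, raising that ceiling is usually just a matter of the
standard yarn-site.xml properties - the values below are only illustrative:

    <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>24576</value>
    </property>
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>49152</value>
    </property>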

YARN integration is actually complete in CDH5.0.  We support it as well as
standalone mode.




On Fri, Apr 18, 2014 at 11:49 AM, Sean Owen  wrote:

> On Fri, Apr 18, 2014 at 7:31 PM, Sung Hwan Chung
>  wrote:
> > Debasish,
> >
> > Unfortunately, we are bound to YARN, at least for the time being, because
> > that's what most of our customers would be using (unless, all the Hadoop
> > vendors start supporting standalone Spark - I think Cloudera might do
> > that?).
>
> Yes the CDH5.0.0 distro just runs Spark in stand-alone mode. Using the
> YARN integration is still being worked on.
>


Re: Random Forest on Spark

2014-04-18 Thread Sung Hwan Chung
I would argue that memory in clusters is still a limited resource and it's
still beneficial to use memory as economically as possible. Let's say that
you are training a gradient boosted model in Spark, which could conceivably
take several hours to build hundreds to thousands of trees. You do not want
to be occupying a significant portion of the cluster memory such that
nobody else can run anything of significance.

We have a dataset that's only ~10 GB as CSV in the file system, but once we
cached the whole thing in Spark, it ballooned to 64 GB or so in memory, so we
had to use a lot more workers with memory just to cache the whole thing -
this was because, although all the features were byte-sized, MLLib defaults
to Double.
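
As illustrative arithmetic only (the exact numbers depend on the file and the
parser), a byte-sized value that takes a character or two in a CSV becomes an
8-byte Double once parsed into a dense vector:

    val valuesInCsv = 5e9                    // assume ~5B values in a ~10 GB CSV
    val onDiskGB    = valuesInCsv * 2 / 1e9  // ~2 bytes per value incl. delimiter
    val asDoublesGB = valuesInCsv * 8 / 1e9  // 8 bytes per value as Double
    println(f"on disk: ~$onDiskGB%.0f GB, cached as Doubles: >= $asDoublesGB%.0f GB before object overhead")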


On Fri, Apr 18, 2014 at 1:39 PM, Sandy Ryza  wrote:

> I don't think the YARN default of max 8GB container size is a good
> justification for limiting memory per worker.  This is a sort of arbitrary
> number that came from an era where MapReduce was the main YARN application
> and machines generally had less memory.  I expect to see this to get
> configured as much higher in practice on most clusters running Spark.
>
> YARN integration is actually complete in CDH5.0.  We support it as well as
> standalone mode.
>
>
>
>
> On Fri, Apr 18, 2014 at 11:49 AM, Sean Owen  wrote:
>
>> On Fri, Apr 18, 2014 at 7:31 PM, Sung Hwan Chung
>>  wrote:
>> > Debasish,
>> >
>> > Unfortunately, we are bound to YARN, at least for the time being,
>> because
>> > that's what most of our customers would be using (unless, all the Hadoop
>> > vendors start supporting standalone Spark - I think Cloudera might do
>> > that?).
>>
>> Yes the CDH5.0.0 distro just runs Spark in stand-alone mode. Using the
>> YARN integration is still being worked on.
>>
>
>


Re: Random Forest on Spark

2014-04-18 Thread Sung Hwan Chung
Sorry, that was incomplete information - I think Spark's compression helped
(not sure how much though), so the actual memory requirement may have been
smaller.


On Fri, Apr 18, 2014 at 3:16 PM, Sung Hwan Chung
wrote:

> I would argue that memory in clusters is still a limited resource and it's
> still beneficial to use memory as economically as possible. Let's say that
> you are training a gradient boosted model in Spark, which could conceivably
> take several hours to build hundreds to thousands of trees. You do not want
> to be occupying a significant portion of the cluster memory such that
> nobody else can run anything of significance.
>
> We have a dataset that's only ~10GB CSV in the file system, now once we
> cached the whole thing in Spark, it ballooned to 64 GB or so in memory and
> so we had to use a lot more workers with memory just so that we could cache
> the whole thing - this was due to the fact that although all the features
> were byte-sized, MLLib defaults to Double.
>
>
> On Fri, Apr 18, 2014 at 1:39 PM, Sandy Ryza wrote:
>
>> I don't think the YARN default of max 8GB container size is a good
>> justification for limiting memory per worker.  This is a sort of arbitrary
>> number that came from an era where MapReduce was the main YARN application
>> and machines generally had less memory.  I expect to see this to get
>> configured as much higher in practice on most clusters running Spark.
>>
>> YARN integration is actually complete in CDH5.0.  We support it as well
>> as standalone mode.
>>
>>
>>
>>
>> On Fri, Apr 18, 2014 at 11:49 AM, Sean Owen  wrote:
>>
>>> On Fri, Apr 18, 2014 at 7:31 PM, Sung Hwan Chung
>>>  wrote:
>>> > Debasish,
>>> >
>>> > Unfortunately, we are bound to YARN, at least for the time being,
>>> because
>>> > that's what most of our customers would be using (unless, all the
>>> Hadoop
>>> > vendors start supporting standalone Spark - I think Cloudera might do
>>> > that?).
>>>
>>> Yes the CDH5.0.0 distro just runs Spark in stand-alone mode. Using the
>>> YARN integration is still being worked on.
>>>
>>
>>
>


Re: Re: Random Forest on Spark

2014-04-18 Thread Sebastian Schelter

Hi,

Stratosphere does not have a real RF implementation yet; there is only a
prototype developed by students in a university course, and it is far from
production usage at this stage.


--sebastian

On 04/18/2014 10:31 AM, Sean Owen wrote:

Mahout RDF is fairly old code. If you try it, try to use 1.0-SNAPSHOT,
as you will almost certainly need this patch to make it run reasonably
fast: https://issues.apache.org/jira/browse/MAHOUT-1419

I have not tried Stratosphere here.

Since we are on the subject of RDF on Hadoop, possibly on M/R, I don't
feel too bad advertising this: oryx also does
classification/regression via RDF:
https://github.com/cloudera/oryx#classification--regression-example

This is a fairly different design choice than, say, what's in the
PLANET paper. The one big negative is that trees are built only over a
sub-sample of the data. But given that big simplifying assumption, a
lot of other things work well. It's not iterative so is not
handicapped by being M/R-based. May be of interest if building /
benchmarking stuff on Hadoop.

Personally, going forward, I'm interested in something smarter (like
what I see is going into the new Spark impl) although there really are
some big design tradeoffs here, yes.


--
Sean Owen | Director, Data Science | London


On Fri, Apr 18, 2014 at 9:21 AM, Laeeq Ahmed  wrote:

Have anyone tried mahout RF or Stratosphere RF with spark. Any comments.

Regards,
Laeeq
On Friday, April 18, 2014 3:11 AM, Sung Hwan Chung
 wrote:
Yes, it should be data specific and perhaps we're biased toward the data
sets that we are playing with. To put things in perspective, we're highly
interested in (and I believe, our customers are):

1. large (hundreds of millions of rows)
2. multi-class classification - nowadays, dozens of target categories are
common and even thousands in some cases - you could imagine that this is a
big reason for us requiring more 'complex' models
3. high dimensional with thousands of descriptive and sort-of-independent
features

 From the theoretical perspective, I would argue that it's usually in the
best interest to prune as little as possible. I believe that pruning
inherently increases bias of an individual tree, which RF can't do anything
about while decreasing variance - which is what RF is for.

The default pruning criteria for R's reference implementation is min-node of
1 (meaning fully-grown tree) for classification, and 5 for regression. I'd
imagine they did at least some empirical testing to justify these values at
the time - although at a time of small datasets :).

FYI, we are also considering the MLLib decision tree for our Gradient
Boosting implementation, however, the memory requirement is still a bit too
steep (we were getting heap exceptions at depth limit of 5 with 2GB per
worker with approximately 1000 features). Now 2GB per worker is about what
we expect our typical customers would tolerate and I don't think that it's
unreasonable for shallow trees.



On Thu, Apr 17, 2014 at 3:54 PM, Evan R. Sparks 
wrote:

What kind of data are you training on? These effects are *highly* data
dependent, and while saying "the depth of 10 is simply not adequate to build
high-accuracy models" may be accurate for the particular problem you're
modeling, it is not true in general. From a statistical perspective, I
consider each node in each tree an additional degree of freedom for the
model, and all else equal I'd expect a model with fewer degrees of freedom
to generalize better. Regardless, if there are lots of use cases for really
deep trees, we'd like to hear about them so that we can decide how important
they are to support!

In the context of CART - pruning very specifically refers to a step *after*
a tree has been constructed to some depth using cross-validation. This was a
variance reduction technique in the original tree work that is unnecessary
and computationally expensive in the context of forests. In the original
Random Forests paper, there are still stopping criteria - usually either
minimum leaf size or minimum split improvement (or both), so "training to
maximum depth" doesn't mean "train until you've completely divided your
dataset and there's one point per leaf." My point is that if you set minimum
leaf size to something like 0.2% of the dataset, then you're not going to
get deeper than 10 or 12 levels with a reasonably balanced tree.

With respect to PLANET - our implementation is very much in the spirit of
planet, but has some key differences - there's good documentation on exactly
what the differences are forthcoming, so I won't belabor these here. The
differences are designed to 1) avoid data shuffling, and 2) minimize number
of passes over the training data. Of course, there are tradeoffs involved,
and there is at least one really good trick in the PLANET work that we
should leverage that we aren't yet - namely once the nodes get small enough
for data to fit easily on a single machine, data can be shuffled and then
the re