Re: Quantile regression in tree models

2014-11-18 Thread Alessandro Baretta
Manish,

My use case for (asymmetric) absolute error is quite simply quantile
regression. In other words, I want to use Spark to learn conditional
cumulative distribution functions. See the quantile regression option in R's
gbm package.
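
To be concrete, the loss I have in mind is the standard pinball (quantile)
loss; writing it out here for reference (textbook material, nothing that
exists in MLlib today). For a target quantile \tau \in (0, 1):

L_\tau(y, \hat{y}) =
  \begin{cases}
    \tau\,(y - \hat{y})       & \text{if } y \ge \hat{y}, \\
    (1 - \tau)\,(\hat{y} - y) & \text{if } y < \hat{y},
  \end{cases}
\qquad
\frac{\partial L_\tau}{\partial \hat{y}} =
  \begin{cases}
    -\tau    & \text{if } y \ge \hat{y}, \\
    1 - \tau & \text{if } y < \hat{y}.
  \end{cases}

Minimizing the expected loss yields the conditional \tau-quantile of y given
the features; sweeping \tau traces out the conditional CDF, and \tau = 0.5
reduces to (half) the absolute error.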

If you either find or create a Jira ticket, I would be happy to give it a
shot. Is there a design doc explaining how the gradient boosting algorithm
is laid out in MLlib? I tried reading the code, but without a Rosetta
stone it's impossible to make sense of it.

Alex

On Mon, Nov 17, 2014 at 8:25 PM, Manish Amde manish...@gmail.com wrote:

 Hi Alessandro,

 I think absolute error as splitting criterion might be feasible with the
 current architecture -- I think the sufficient statistics we collect
 currently might be able to support this. Could you let us know scenarios
 where absolute error has significantly outperformed squared error for
 regression trees? Also, what's your use case that makes squared error
 undesirable?

 For gradient boosting, you are correct. The weak hypothesis weights refer
 to tree predictions in each of the branches. We plan to explain this in
 the 1.2 documentation and maybe add some more clarifications to the
 Javadoc.

 I will try to search for JIRAs or create new ones and update this thread.

 -Manish


 On Monday, November 17, 2014, Alessandro Baretta alexbare...@gmail.com
 wrote:

 Manish,

 Thanks for pointing me to the relevant docs. It is unfortunate that
 absolute error is not supported yet. I can't seem to find a Jira for it.

 Now, here's what the comments say in the current master branch:
 /**
  * :: Experimental ::
  * A class that implements Stochastic Gradient Boosting
  * for regression and binary classification problems.
  *
  * The implementation is based upon:
  *   J.H. Friedman.  Stochastic Gradient Boosting.  1999.
  *
  * Notes:
  *  - This currently can be run with several loss functions.  However, only SquaredError is
  *    fully supported.  Specifically, the loss function should be used to compute the gradient
  *    (to re-label training instances on each iteration) and to weight weak hypotheses.
  *    Currently, gradients are computed correctly for the available loss functions,
  *    but weak hypothesis weights are not computed correctly for LogLoss or AbsoluteError.
  *    Running with those losses will likely behave reasonably, but lacks the same guarantees.
  ...
  */

 By the looks of it, the GradientBoosting API would support an absolute-error
 loss function for quantile regression, were it not for the weak hypothesis
 weights. Does this refer to the weights of the leaves of the trees?

 Alex

 On Mon, Nov 17, 2014 at 2:24 PM, Manish Amde manish...@gmail.com wrote:

 Hi Alessandro,

 MLlib v1.1 supports variance as the impurity measure for regression, and
 Gini impurity and entropy for classification.
 http://spark.apache.org/docs/latest/mllib-decision-tree.html

 If the information gain calculation can be performed by distributed
 aggregation, then it might be possible to plug it into the existing
 implementation. We want to perform such calculations (e.g., the median) for
 the gradient boosting models (coming up in the 1.2 release) using absolute
 error and deviance as loss functions, but I don't think anyone is planning
 to work on it yet. :-)

 -Manish

 On Mon, Nov 17, 2014 at 11:11 AM, Alessandro Baretta 
 alexbare...@gmail.com wrote:

 I see that, as of v. 1.1, MLlib supports regression and classification tree
 models. I assume this means that it uses a squared-error loss function for
 the first and a logistic cost function for the second. I don't see support
 for quantile regression via an absolute-error cost function. Or am I
 missing something?

 If, as it seems, this is missing, how do you recommend implementing it?

 Alex






Re: Quantile regression in tree models

2014-11-18 Thread Manish Amde
Hi Alex,

Here is the ticket for refining tree predictions. Let's discuss this
further on the JIRA.
https://issues.apache.org/jira/browse/SPARK-4240

There is no ticket yet for quantile regression. It will be great if you
could create one and note down the corresponding loss function and gradient
calculations. There is a design doc that Joseph Bradley wrote for
supporting boosting algorithms with generic weak learners but it doesn't
include implementation details. I can definitely help you understand the
existing code if you decide to work on it. However, let's discuss the
relevance of the algorithm to MLlib on the JIRA. It seems like a nice
addition, though I am not sure about the implementation complexity. It will
be great to see what others think.
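
For anyone picking this up, here is a minimal sketch of the loss and gradient
such a ticket would record, in plain Scala. This is purely illustrative: the
object and method names are made up and this is not MLlib's Loss API.

// Pinball (quantile) loss for target quantile tau, and the negative gradient
// (pseudo-residual) used to re-label instances at each boosting iteration.
// Illustrative sketch only, not MLlib code.
object QuantileLossSketch {

  /** Loss of predicting `prediction` when the observed label is `label`. */
  def loss(tau: Double)(label: Double, prediction: Double): Double = {
    val diff = label - prediction
    if (diff >= 0) tau * diff else (tau - 1.0) * diff
  }

  /** Negative gradient of the loss with respect to the prediction. */
  def negativeGradient(tau: Double)(label: Double, prediction: Double): Double =
    if (label - prediction >= 0) tau else tau - 1.0
}

Setting tau = 0.5 gives half the absolute error, so an AbsoluteError loss
would fall out as a special case.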

-Manish

On Tue, Nov 18, 2014 at 10:07 AM, Alessandro Baretta alexbare...@gmail.com
wrote:

 Manish,

 My use case for (asymmetric) absolute error is quite simply quantile
 regression. In other words, I want to use Spark to learn conditional
 cumulative distribution functions. See the quantile regression option in
 R's gbm package.

 If you either find or create a Jira ticket, I would be happy to give it a
 shot. Is there a design doc explaining how the gradient boosting algorithm
 is laid out in MLlib? I tried reading the code, but without a Rosetta
 stone it's impossible to make sense of it.

 Alex

 On Mon, Nov 17, 2014 at 8:25 PM, Manish Amde manish...@gmail.com wrote:

 Hi Alessandro,

 I think absolute error as splitting criterion might be feasible with the
 current architecture -- I think the sufficient statistics we collect
 currently might be able to support this. Could you let us know scenarios
 where absolute error has significantly outperformed squared error for
 regression trees? Also, what's your use case that makes squared error
 undesirable?

 For gradient boosting, you are correct. The weak hypothesis weights refer
 to tree predictions in each of the branches. We plan to explain this in
 the 1.2 documentation and maybe add some more clarifications to the
 Javadoc.

 I will try to search for JIRAs or create new ones and update this thread.

 -Manish


 On Monday, November 17, 2014, Alessandro Baretta alexbare...@gmail.com
 wrote:

 Manish,

 Thanks for pointing me to the relevant docs. It is unfortunate that
 absolute error is not supported yet. I can't seem to find a Jira for it.

 Now, here's what the comments say in the current master branch:
 /**
  * :: Experimental ::
  * A class that implements Stochastic Gradient Boosting
  * for regression and binary classification problems.
  *
  * The implementation is based upon:
  *   J.H. Friedman.  Stochastic Gradient Boosting.  1999.
  *
  * Notes:
  *  - This currently can be run with several loss functions.  However, only SquaredError is
  *    fully supported.  Specifically, the loss function should be used to compute the gradient
  *    (to re-label training instances on each iteration) and to weight weak hypotheses.
  *    Currently, gradients are computed correctly for the available loss functions,
  *    but weak hypothesis weights are not computed correctly for LogLoss or AbsoluteError.
  *    Running with those losses will likely behave reasonably, but lacks the same guarantees.
  ...
  */

 By the looks of it, the GradientBoosting API would support an absolute-error
 loss function for quantile regression, were it not for the weak hypothesis
 weights. Does this refer to the weights of the leaves of the trees?

 Alex

 On Mon, Nov 17, 2014 at 2:24 PM, Manish Amde manish...@gmail.com
 wrote:

 Hi Alessandro,

 MLlib v1.1 supports variance as the impurity measure for regression, and
 Gini impurity and entropy for classification.
 http://spark.apache.org/docs/latest/mllib-decision-tree.html

 If the information gain calculation can be performed by distributed
 aggregation, then it might be possible to plug it into the existing
 implementation. We want to perform such calculations (e.g., the median) for
 the gradient boosting models (coming up in the 1.2 release) using absolute
 error and deviance as loss functions, but I don't think anyone is planning
 to work on it yet. :-)

 -Manish

 On Mon, Nov 17, 2014 at 11:11 AM, Alessandro Baretta 
 alexbare...@gmail.com wrote:

 I see that, as of v. 1.1, MLlib supports regression and classification tree
 models. I assume this means that it uses a squared-error loss function for
 the first and a logistic cost function for the second. I don't see support
 for quantile regression via an absolute-error cost function. Or am I
 missing something?

 If, as it seems, this is missing, how do you recommend implementing it?

 Alex







Re: Quantile regression in tree models

2014-11-17 Thread Manish Amde
Hi Alessandro,

MLlib v1.1 supports variance as the impurity measure for regression, and Gini
impurity and entropy for classification.
http://spark.apache.org/docs/latest/mllib-decision-tree.html

If the information gain calculation can be performed by distributed
aggregation, then it might be possible to plug it into the existing
implementation. We want to perform such calculations (e.g., the median) for
the gradient boosting models (coming up in the 1.2 release) using absolute
error and deviance as loss functions, but I don't think anyone is planning
to work on it yet. :-)
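
To make "distributed aggregation" concrete, the pattern the current regression
trees rely on looks roughly like the sketch below: per-partition partial sums
that merge associatively. This is illustrative only and does not mirror
MLlib's internal representation.

import org.apache.spark.rdd.RDD

// Per-node sufficient statistics for variance-based splits: each partition
// contributes a partial (count, sum, sum of squares) and the partials merge
// associatively. Illustrative sketch, not MLlib's internal code.
case class LabelStats(count: Long, sum: Double, sumSq: Double) {
  def add(y: Double): LabelStats = LabelStats(count + 1, sum + y, sumSq + y * y)
  def merge(that: LabelStats): LabelStats =
    LabelStats(count + that.count, sum + that.sum, sumSq + that.sumSq)
}

def collectStats(labels: RDD[Double]): LabelStats =
  labels.aggregate(LabelStats(0L, 0.0, 0.0))((s, y) => s.add(y), (a, b) => a.merge(b))

Anything that reduces to aggregates of this shape drops into the existing
machinery; the open question for absolute error is whether the median-based
calculation mentioned above admits a similarly compact, mergeable summary.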

-Manish

On Mon, Nov 17, 2014 at 11:11 AM, Alessandro Baretta alexbare...@gmail.com
wrote:

 I see that, as of v. 1.1, MLlib supports regression and classification tree
 models. I assume this means that it uses a squared-error loss function for
 the first and a logistic cost function for the second. I don't see support
 for quantile regression via an absolute-error cost function. Or am I
 missing something?

 If, as it seems, this is missing, how do you recommend implementing it?

 Alex



Re: Quantile regression in tree models

2014-11-17 Thread Alessandro Baretta
Manish,

Thanks for pointing me to the relevant docs. It is unfortunate that
absolute error is not supported yet. I can't seem to find a Jira for it.

Now, here's what the comments say in the current master branch:
/**
 * :: Experimental ::
 * A class that implements Stochastic Gradient Boosting
 * for regression and binary classification problems.
 *
 * The implementation is based upon:
 *   J.H. Friedman.  Stochastic Gradient Boosting.  1999.
 *
 * Notes:
 *  - This currently can be run with several loss functions.  However, only SquaredError is
 *    fully supported.  Specifically, the loss function should be used to compute the gradient
 *    (to re-label training instances on each iteration) and to weight weak hypotheses.
 *    Currently, gradients are computed correctly for the available loss functions,
 *    but weak hypothesis weights are not computed correctly for LogLoss or AbsoluteError.
 *    Running with those losses will likely behave reasonably, but lacks the same guarantees.
 ...
 */

By the looks of it, the GradientBoosting API would support an absolute-error
loss function for quantile regression, were it not for the weak hypothesis
weights. Does this refer to the weights of the leaves of the trees?
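
For reference, here is how I currently read the ensemble prediction: each tree
(weak hypothesis) contributes its prediction scaled by a per-tree weight. This
is a purely illustrative sketch of my reading, not the actual GradientBoosting
code.

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.model.DecisionTreeModel

// Weighted sum of the individual trees' predictions; "weak hypothesis weights"
// would be the treeWeights array in this picture. Illustrative only.
def boostedPredict(trees: Array[DecisionTreeModel],
                   treeWeights: Array[Double],
                   features: Vector): Double =
  trees.zip(treeWeights).map { case (tree, w) => w * tree.predict(features) }.sum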

Alex

On Mon, Nov 17, 2014 at 2:24 PM, Manish Amde manish...@gmail.com wrote:

 Hi Alessandro,

 MLlib v1.1 supports variance as the impurity measure for regression, and Gini
 impurity and entropy for classification.
 http://spark.apache.org/docs/latest/mllib-decision-tree.html

 If the information gain calculation can be performed by distributed
 aggregation, then it might be possible to plug it into the existing
 implementation. We want to perform such calculations (e.g., the median) for
 the gradient boosting models (coming up in the 1.2 release) using absolute
 error and deviance as loss functions, but I don't think anyone is planning
 to work on it yet. :-)

 -Manish

 On Mon, Nov 17, 2014 at 11:11 AM, Alessandro Baretta 
 alexbare...@gmail.com wrote:

 I see that, as of v. 1.1, MLlib supports regression and classification tree
 models. I assume this means that it uses a squared-error loss function for
 the first and a logistic cost function for the second. I don't see support
 for quantile regression via an absolute-error cost function. Or am I
 missing something?

 If, as it seems, this is missing, how do you recommend implementing it?

 Alex





Re: Quantile regression in tree models

2014-11-17 Thread Manish Amde
Hi Alessandro,

I think absolute error as splitting criterion might be feasible with the
current architecture -- I think the sufficient statistics we collect
currently might be able to support this. Could you let us know scenarios
where absolute error has significantly outperformed squared error for
regression trees? Also, what's your use case that makes squared error
undesirable?
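
To spell out what I mean by sufficient statistics: for variance-based
regression splits we only keep small per-node aggregates, and the gain of a
candidate split is computable from the parent and child aggregates alone. A
rough sketch of that calculation (illustrative, not our internal API):

// Aggregates collected for a node (or a candidate child) in one data pass.
// Illustrative sketch only, not MLlib's split-evaluation code.
case class NodeStats(count: Double, sum: Double, sumSq: Double) {
  /** Variance of the labels falling into this node. */
  def impurity: Double =
    if (count == 0) 0.0 else sumSq / count - math.pow(sum / count, 2)
}

/** Weighted impurity decrease from splitting a parent into two children. */
def gain(parent: NodeStats, left: NodeStats, right: NodeStats): Double =
  parent.impurity -
    (left.count / parent.count) * left.impurity -
    (right.count / parent.count) * right.impurity

Whether absolute error fits the same pattern depends on whether its per-node
optimum (the median) can be recovered, or approximated well enough, from
aggregates like these.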

For gradient boosting, you are correct. The weak hypothesis weights refer
to tree predictions in each of the branches. We plan to explain this in the
1.2 documentation and maybe add some more clarifications to the Javadoc.

I will try to search for JIRAs or create new ones and update this thread.

-Manish

On Monday, November 17, 2014, Alessandro Baretta alexbare...@gmail.com
wrote:

 Manish,

 Thanks for pointing me to the relevant docs. It is unfortunate that
 absolute error is not supported yet. I can't seem to find a Jira for it.

 Now, here's what the comments say in the current master branch:
 /**
  * :: Experimental ::
  * A class that implements Stochastic Gradient Boosting
  * for regression and binary classification problems.
  *
  * The implementation is based upon:
  *   J.H. Friedman.  Stochastic Gradient Boosting.  1999.
  *
  * Notes:
  *  - This currently can be run with several loss functions.  However, only SquaredError is
  *    fully supported.  Specifically, the loss function should be used to compute the gradient
  *    (to re-label training instances on each iteration) and to weight weak hypotheses.
  *    Currently, gradients are computed correctly for the available loss functions,
  *    but weak hypothesis weights are not computed correctly for LogLoss or AbsoluteError.
  *    Running with those losses will likely behave reasonably, but lacks the same guarantees.
  ...
  */

 By the looks of it, the GradientBoosting API would support an absolute-error
 loss function for quantile regression, were it not for the weak hypothesis
 weights. Does this refer to the weights of the leaves of the trees?

 Alex

 On Mon, Nov 17, 2014 at 2:24 PM, Manish Amde manish...@gmail.com wrote:

 Hi Alessandro,

 MLlib v1.1 supports variance as the impurity measure for regression, and Gini
 impurity and entropy for classification.
 http://spark.apache.org/docs/latest/mllib-decision-tree.html

 If the information gain calculation can be performed by distributed
 aggregation, then it might be possible to plug it into the existing
 implementation. We want to perform such calculations (e.g., the median) for
 the gradient boosting models (coming up in the 1.2 release) using absolute
 error and deviance as loss functions, but I don't think anyone is planning
 to work on it yet. :-)

 -Manish

 On Mon, Nov 17, 2014 at 11:11 AM, Alessandro Baretta 
 alexbare...@gmail.com wrote:

 I see that, as of v. 1.1, MLlib supports regression and classification tree
 models. I assume this means that it uses a squared-error loss function for
 the first and a logistic cost function for the second. I don't see support
 for quantile regression via an absolute-error cost function. Or am I
 missing something?

 If, as it seems, this is missing, how do you recommend implementing it?

 Alex