Manish,

My use case for an (asymmetric) absolute error loss is, quite simply, quantile regression. In other words, I want to use Spark to learn conditional cumulative distribution functions; see R's GBM quantile regression option.
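For context, a minimal sketch of the pinball (quantile) loss that GBM-style quantile regression relies on, under the assumption of Friedman's gradient boosting formulation. The object and method names below are illustrative only and are not part of the MLlib API; the negative gradient is what re-labels the training instances at each boosting iteration.

```scala
// Hypothetical sketch of the pinball (quantile) loss for quantile tau.
// Not MLlib code; names are illustrative.
object PinballLoss {

  /** Pinball loss: penalizes under- and over-prediction asymmetrically. */
  def loss(label: Double, prediction: Double, tau: Double): Double = {
    val diff = label - prediction
    if (diff >= 0) tau * diff else (tau - 1) * diff
  }

  /** Negative gradient (pseudo-residual) used to re-label instances each iteration. */
  def negativeGradient(label: Double, prediction: Double, tau: Double): Double =
    if (label > prediction) tau else tau - 1
}
```

Note that tau = 0.5 reduces this to (half of) the absolute error, i.e. median regression, which is why absolute-error support is the natural first step toward quantile regression.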
If you either find or create a Jira ticket, I would be happy to give it a shot. Is there a design doc explaining how the gradient boosting algorithm is laid out in MLlib? I tried reading the code, but without a "Rosetta stone" it's impossible to make sense of it.

Alex

On Mon, Nov 17, 2014 at 8:25 PM, Manish Amde <manish...@gmail.com> wrote:

> Hi Alessandro,
>
> I think absolute error as a splitting criterion might be feasible with the
> current architecture -- the sufficient statistics we currently collect
> might be able to support it. Could you let us know of scenarios where
> absolute error has significantly outperformed squared error for regression
> trees? Also, what is the use case that makes squared error undesirable?
>
> For gradient boosting, you are correct. The weak hypothesis weights refer
> to tree predictions in each of the branches. We plan to explain this in
> the 1.2 documentation and maybe add some more clarifications to the
> Javadoc.
>
> I will try to search for JIRAs or create new ones and update this thread.
>
> -Manish
>
>
> On Monday, November 17, 2014, Alessandro Baretta <alexbare...@gmail.com>
> wrote:
>
>> Manish,
>>
>> Thanks for pointing me to the relevant docs. It is unfortunate that
>> absolute error is not supported yet. I can't seem to find a Jira for it.
>>
>> Now, here's what the comments say in the current master branch:
>>
>> /**
>>  * :: Experimental ::
>>  * A class that implements Stochastic Gradient Boosting
>>  * for regression and binary classification problems.
>>  *
>>  * The implementation is based upon:
>>  *   J.H. Friedman. "Stochastic Gradient Boosting." 1999.
>>  *
>>  * Notes:
>>  *  - This currently can be run with several loss functions. However, only SquaredError is
>>  *    fully supported. Specifically, the loss function should be used to compute the gradient
>>  *    (to re-label training instances on each iteration) and to weight weak hypotheses.
>>  *    Currently, gradients are computed correctly for the available loss functions,
>>  *    but weak hypothesis weights are not computed correctly for LogLoss or AbsoluteError.
>>  *    Running with those losses will likely behave reasonably, but lacks the same guarantees.
>>  ...
>>  */
>>
>> By the looks of it, the GradientBoosting API would support an absolute
>> error type loss function to perform quantile regression, except for "weak
>> hypothesis weights". Does this refer to the weights of the leaves of the
>> trees?
>>
>> Alex
>>
>> On Mon, Nov 17, 2014 at 2:24 PM, Manish Amde <manish...@gmail.com> wrote:
>>
>>> Hi Alessandro,
>>>
>>> MLlib v1.1 supports variance for regression, and gini impurity and
>>> entropy for classification.
>>> http://spark.apache.org/docs/latest/mllib-decision-tree.html
>>>
>>> If the information gain calculation can be performed by distributed
>>> aggregation, then it might be possible to plug it into the existing
>>> implementation. We want to perform such calculations (e.g., the median)
>>> for the gradient boosting models (coming up in the 1.2 release) using
>>> absolute error and deviance as loss functions, but I don't think anyone
>>> is planning to work on it yet. :-)
>>>
>>> -Manish
>>>
>>> On Mon, Nov 17, 2014 at 11:11 AM, Alessandro Baretta <
>>> alexbare...@gmail.com> wrote:
>>>
>>>> I see that, as of v1.1, MLlib supports regression and classification
>>>> tree models. I assume this means that it uses a squared-error loss
>>>> function for the first and a logistic cost function for the second.
>>>> I don't see support for quantile regression via an absolute error
>>>> cost function. Or am I missing something?
>>>>
>>>> If, as it seems, this is missing, how do you recommend implementing it?
>>>>
>>>> Alex
>>>>
>>>
>>>
>>
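As a side note on the "weak hypothesis weights" question quoted above: a minimal sketch, assuming Friedman's formulation, of why the terminal-node constants differ by loss function. Squared error is minimized by the mean of the residuals falling in a leaf, while absolute error is minimized by their median, which is why weights computed for squared error are not correct for AbsoluteError. The names below are illustrative, not MLlib code.

```scala
// Hypothetical illustration of optimal terminal-node ("leaf") constants per loss.
object LeafValue {

  /** Squared error: the leaf constant that minimizes the loss is the mean of the residuals. */
  def squaredErrorLeaf(residuals: Seq[Double]): Double =
    residuals.sum / residuals.size

  /** Absolute error: the leaf constant that minimizes the loss is the median of the residuals. */
  def absoluteErrorLeaf(residuals: Seq[Double]): Double = {
    val sorted = residuals.sorted
    val n = sorted.size
    if (n % 2 == 1) sorted(n / 2)
    else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0
  }
}
```

Computing a median requires a distributed aggregation over the instances in each leaf, which is the calculation Manish refers to above as not yet supported.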