Manish,

My use case for (asymmetric) absolute error is, quite simply, quantile
regression. In other words, I want to use Spark to learn conditional
cumulative distribution functions. See the quantile regression option in
R's gbm package.
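
To make the loss concrete, here is a minimal sketch of the pinball
(asymmetric absolute error) loss and its gradient for a target quantile tau.
The names below are purely illustrative, not an existing MLlib API:

// Pinball (quantile) loss: minimizing its expectation yields the
// conditional tau-quantile, so sweeping tau over (0, 1) traces out a
// conditional cumulative distribution function.
object PinballLoss {
  def loss(prediction: Double, label: Double, tau: Double): Double = {
    val diff = label - prediction
    if (diff >= 0) tau * diff else (tau - 1.0) * diff
  }

  // Gradient with respect to the prediction: -tau if the label is above
  // the prediction, 1 - tau if it is below.
  def gradient(prediction: Double, label: Double, tau: Double): Double =
    if (label - prediction >= 0) -tau else 1.0 - tau
}

With tau = 0.5 this reduces to (half of) the ordinary absolute error;
varying tau gives the conditional CDF mentioned above.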

If you either find or create a Jira ticket, I would be happy to give it a
shot. Is there a design doc explaining how the gradient boosting algorithm
is laid out in MLlib? I tried reading the code, but without a "Rosetta
stone" it's impossible to make sense of it.

Alex

On Mon, Nov 17, 2014 at 8:25 PM, Manish Amde <manish...@gmail.com> wrote:

> Hi Alessandro,
>
> I think absolute error as a splitting criterion might be feasible with the
> current architecture -- the sufficient statistics we collect currently
> might be able to support it. Could you let us know the scenarios where
> absolute error has significantly outperformed squared error for regression
> trees? Also, what's your use case that makes squared error undesirable?
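
To make "sufficient statistics" concrete: for the variance criterion the
per-split statistics are just a count, a sum, and a sum of squares, which
merge associatively across partitions. A rough sketch with illustrative
names (not MLlib's internal API):

// Mergeable sufficient statistics for the variance impurity.
case class VarianceStats(count: Long, sum: Double, sumSquares: Double) {
  def merge(other: VarianceStats): VarianceStats =
    VarianceStats(count + other.count, sum + other.sum,
      sumSquares + other.sumSquares)

  // Variance impurity = E[y^2] - (E[y])^2.
  def impurity: Double =
    if (count == 0) 0.0
    else {
      val mean = sum / count
      sumSquares / count - mean * mean
    }
}

Whether an absolute-error criterion, which needs a median-like quantity per
node, can be driven from statistics of this shape is the feasibility
question here.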
>
> For gradient boosting, you are correct. The weak hypothesis weights refer
> to tree predictions in each of the branches. We plan to explain this in
> the 1.2 documentation and maybe add some more clarifications to the
> Javadoc.
>
> I will try to search for JIRAs or create new ones and update this thread.
>
> -Manish
>
>
> On Monday, November 17, 2014, Alessandro Baretta <alexbare...@gmail.com>
> wrote:
>
>> Manish,
>>
>> Thanks for pointing me to the relevant docs. It is unfortunate that
>> absolute error is not supported yet. I can't seem to find a Jira for it.
>>
>> Now, here's what the comment says in the current master branch:
>> /**
>>  * :: Experimental ::
>>  * A class that implements Stochastic Gradient Boosting
>>  * for regression and binary classification problems.
>>  *
>>  * The implementation is based upon:
>>  *   J.H. Friedman.  "Stochastic Gradient Boosting."  1999.
>>  *
>>  * Notes:
>>  *  - This currently can be run with several loss functions.  However, only SquaredError is
>>  *    fully supported.  Specifically, the loss function should be used to compute the gradient
>>  *    (to re-label training instances on each iteration) and to weight weak hypotheses.
>>  *    Currently, gradients are computed correctly for the available loss functions,
>>  *    but weak hypothesis weights are not computed correctly for LogLoss or AbsoluteError.
>>  *    Running with those losses will likely behave reasonably, but lacks the same guarantees.
>> ...
>> */
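
To unpack the two roles of the loss described in that comment, here is a
minimal, self-contained sketch of the generic boosting loop for the
squared-error case (purely illustrative, not the MLlib implementation):

// Generic stochastic gradient boosting: the loss supplies the
// pseudo-residuals used to re-label the data each iteration, and each
// fitted tree enters the ensemble with a weight (here simply the
// learning rate).
def boost(
    data: Seq[(Double, Array[Double])],  // (label, features)
    fitTree: Seq[(Double, Array[Double])] => (Array[Double] => Double),
    numIterations: Int,
    learningRate: Double): Array[Double] => Double = {
  var predict: Array[Double] => Double = _ => 0.0
  for (_ <- 0 until numIterations) {
    // Negative gradient of squared error is the residual label - prediction.
    val relabeled = data.map { case (label, features) =>
      (label - predict(features), features)
    }
    val tree = fitTree(relabeled)
    val prev = predict
    // The "weak hypothesis weight": how much the new tree contributes.
    predict = features => prev(features) + learningRate * tree(features)
  }
  predict
}

For AbsoluteError or LogLoss, the open issue is how that per-tree (or
per-leaf) contribution should be chosen, which is what the caveat refers to.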
>>
>> By the looks of it, the GradientBoosting API would support an absolute
>> error type loss function to perform quantile regression, except for "weak
>> hypothesis weights". Does this refer to the weights of the leaves of the
>> trees?
>>
>> Alex
>>
>> On Mon, Nov 17, 2014 at 2:24 PM, Manish Amde <manish...@gmail.com> wrote:
>>
>>> Hi Alessandro,
>>>
>>> MLlib v1.1 supports variance as the impurity measure for regression, and
>>> Gini impurity and entropy for classification.
>>> http://spark.apache.org/docs/latest/mllib-decision-tree.html
>>>
>>> If the information gain calculation can be performed by distributed
>>> aggregation then it might be possible to plug it into the existing
>>> implementation. We want to perform such calculations (e.g., the median)
>>> for the gradient boosting models (coming up in the 1.2 release) using
>>> absolute error and deviance as loss functions, but I don't think anyone
>>> is planning to work on it yet. :-)
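
To illustrate "by distributed aggregation" in Spark terms, a rough sketch
(the split IDs and the (count, sum, sumSquares) tuple are made up for
illustration; this is not MLlib's internal representation):

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Each record is (candidateSplitId, label); workers fold labels into small
// mergeable statistics, which are then combined across partitions.
def aggregateSplitStats(
    labeledSplits: RDD[(Int, Double)]): Map[Int, (Long, Double, Double)] =
  labeledSplits
    .aggregateByKey((0L, 0.0, 0.0))(
      { case ((n, s, sq), y) => (n + 1, s + y, sq + y * y) },
      { case ((n1, s1, sq1), (n2, s2, sq2)) =>
        (n1 + n2, s1 + s2, sq1 + sq2) })
    .collectAsMap()
    .toMap

Any criterion whose per-split statistics merge this way slots into such a
pass; an exact median (as absolute error needs) does not merge as simply,
which is the catch.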
>>>
>>> -Manish
>>>
>>> On Mon, Nov 17, 2014 at 11:11 AM, Alessandro Baretta <
>>> alexbare...@gmail.com> wrote:
>>>
>>>> I see that, as of v1.1, MLlib supports regression and classification
>>>> tree models. I assume this means that it uses a squared-error loss
>>>> function for the first and a logistic cost function for the second. I
>>>> don't see support for quantile regression via an absolute error cost
>>>> function. Or am I missing something?
>>>>
>>>> If, as it seems, this is missing, how do you recommend implementing it?
>>>>
>>>> Alex
>>>>
>>>
>>>
>>
