Re: Feedback: Feature request

2015-08-28 Thread Manish Amde
Sounds good. It's a request I have seen a few times in the past and have
needed personally. Maybe Joseph Bradley has something to add.

I think a JIRA to capture this would be great. We can then move this discussion
to the JIRA.
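
For reference, here is a rough, untested sketch of the kind of traversal a
toJson() helper could do, emitting the operator and operands as separate fields
(mirroring Cody's example below). It assumes the public Node/Split fields of
the 1.x mllib.tree.model classes (id, isLeaf, predict, split, leftNode,
rightNode), handles continuous splits only, and the JSON field names are
illustrative rather than a proposed API.

import org.apache.spark.mllib.tree.model.{DecisionTreeModel, Node}

// Sketch only: walk the trained tree and emit one JSON object per node, with
// the split kept as separate feature / operator / threshold fields.
def nodeToJson(node: Node): String = {
  if (node.isLeaf) {
    // node.predict is a Predict wrapper in recent 1.x releases.
    s"""{"id":${node.id},"predict":${node.predict.predict}}"""
  } else {
    // Continuous split: the left child satisfies feature <= threshold.
    val split = node.split.get
    val children =
      Seq(node.leftNode, node.rightNode).flatten.map(nodeToJson).mkString(",")
    s"""{"id":${node.id},"lhs":${split.feature},"op":"<=",""" +
      s""""rhs":${split.threshold},"children":[$children]}"""
  }
}

def treeToJson(model: DecisionTreeModel): String = nodeToJson(model.topNode)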

On Friday, August 28, 2015, Cody Koeninger  wrote:

> I wrote some code for this a while back; pretty sure it didn't need access
> to anything private in the decision tree / random forest model.  If people
> want it added to the API, I can put together a PR.
>
> I think it's important to have separately parseable operators / operands
> though, e.g.:
>
> "lhs":0,"op":"<=","rhs":-35.0
> On Aug 28, 2015 12:03 AM, "Manish Amde" wrote:
>
>> Hi James,
>>
>> It's a good idea. A JSON format is more convenient for visualization
>> though a little inconvenient to read. How about a toJson() method? It
>> might make the MLlib API inconsistent across models, though.
>>
>> You should probably create a JIRA for this.
>>
>> CC: dev list
>>
>> -Manish
>>
>> On Aug 26, 2015, at 11:29 AM, Murphy, James wrote:
>>
>> Hey all,
>>
>> In working with the DecisionTree classifier, I found it difficult to
>> extract rules that could easily facilitate visualization with libraries
>> like D3.
>>
>> So for example, using print(model.toDebugString()), I get the following
>> result:
>>
>>   If (feature 0 <= -35.0)
>>     If (feature 24 <= 176.0)
>>       Predict: 2.1
>>     If (feature 24 = 176.0)
>>       Predict: 4.2
>>     Else (feature 24 > 176.0)
>>       Predict: 6.3
>>   Else (feature 0 > -35.0)
>>     If (feature 24 <= 11.0)
>>       Predict: 4.5
>>     Else (feature 24 > 11.0)
>>       Predict: 10.2
>>
>> But ideally, I could see results in a more parseable format like JSON:
>>
>> {
>> "node": [
>> {
>> "name":"node1",
>> "rule":"feature 0 <= -35.0",
>> "children":[
>> {
>>   "name":"node2",
>>   "rule":"feature 24 <= 176.0",
>>   "children":[
>>   {
>>   "name":"node4",
>>   "rule":"feature 20 < 116.0",
>>   "predict": 2.1
>>   },
>>   {
>>   "name":"node5",
>>   "rule":"feature 20 = 116.0",
>>   "predict": 4.2
>>   },
>>   {
>>   "name":"node6",
>>   "rule":"feature 20 > 116.0",
>>   "predict": 6.3
>>   }
>>   ]
>> },
>> {
>> "name":"node3",
>> "rule":"feature 0 > -35.0",
>>   "children":[
>>   {
>>   "name":"node7",
>>   "rule":"feature 3 <= 11.0",
>>   "predict": 4.5
>>   },
>>   {
>>   "name":"node8",
>>   "rule":"feature 3 > 11.0",
>>   "predict": 10.2
>>   }
>>   ]
>> }
>> ]
>> }
>> ]
>> }
>>
>> Food for thought!
>>
>> Thanks,
>>
>> Jim


Re: Feedback: Feature request

2015-08-27 Thread Manish Amde
Hi James,

It's a good idea. A JSON format is more convenient for visualization
though a little inconvenient to read. How about a toJson() method? It
might make the MLlib API inconsistent across models, though.

You should probably create a JIRA for this.

CC: dev list

-Manish

> On Aug 26, 2015, at 11:29 AM, Murphy, James  wrote:
> 
> Hey all,
>  
> In working with the DecisionTree classifier, I found it difficult to extract 
> rules that could easily facilitate visualization with libraries like D3.
>  
> So for example, using print(model.toDebugString()), I get the following
> result:
>  
>   If (feature 0 <= -35.0)
>     If (feature 24 <= 176.0)
>       Predict: 2.1
>     If (feature 24 = 176.0)
>       Predict: 4.2
>     Else (feature 24 > 176.0)
>       Predict: 6.3
>   Else (feature 0 > -35.0)
>     If (feature 24 <= 11.0)
>       Predict: 4.5
>     Else (feature 24 > 11.0)
>       Predict: 10.2
>  
> But ideally, I could see results in a more parseable format like JSON:
>  
> {
> "node": [
> {
> "name":"node1",
> "rule":"feature 0 <= -35.0",
> "children":[
> {
>   "name":"node2",
>   "rule":"feature 24 <= 176.0",
>   "children":[
>   {
>   "name":"node4",
>   "rule":"feature 20 < 116.0",
>   "predict":  2.1
>   },
>   {
>   "name":"node5",
>   "rule":"feature 20 = 116.0",
>   "predict": 4.2
>   },
>   {
>   "name":"node6",
>   "rule":"feature 20 > 116.0",
>   "predict": 6.3
>   }
>   ]
> },
> {
> "name":"node3",
> "rule":"feature 0 > -35.0",
>   "children":[
>   {
>   "name":"node7",
>   "rule":"feature 3 <= 11.0",
>   "predict": 4.5
>   },
>   {
>   "name":"node8",
>   "rule":"feature 3 > 11.0",
>   "predict": 10.2
>   }
>   ]
> }
>  
> ]
> }
> ]
> }
>  
> Food for thought!
>  
> Thanks,
>  
> Jim
>  


Re: Welcoming three new committers

2015-02-03 Thread Manish Amde
Congratulations Cheng, Joseph and Sean.

On Tuesday, February 3, 2015, Zhan Zhang  wrote:

> Congratulations!
>
> On Feb 3, 2015, at 2:34 PM, Matei Zaharia wrote:
>
> > Hi all,
> >
> > The PMC recently voted to add three new committers: Cheng Lian, Joseph
> Bradley and Sean Owen. All three have been major contributors to Spark in
> the past year: Cheng on Spark SQL, Joseph on MLlib, and Sean on ML and many
> pieces throughout Spark Core. Join me in welcoming them as committers!
> >
> > Matei


Re: Quantile regression in tree models

2014-11-18 Thread Manish Amde
Hi Alex,

Here is the ticket for refining tree predictions. Let's discuss this
further on the JIRA.
https://issues.apache.org/jira/browse/SPARK-4240

There is no ticket yet for quantile regression. It would be great if you
could create one and note down the corresponding loss function and gradient
calculations. There is a design doc that Joseph Bradley wrote for
supporting boosting algorithms with generic weak learners, but it doesn't
include implementation details. I can definitely help you understand the
existing code if you decide to work on it. However, let's discuss the
relevance of the algorithm to MLlib on the JIRA. It seems like a nice
addition, though I am not sure about the implementation complexity. It
would be great to see what others think.
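
For reference, the quantile (pinball) loss and its (sub)gradient are small
enough to sketch here. This is a standalone illustration in plain Scala, not
wired into MLlib's Loss trait, to keep it version-agnostic; tau is the target
quantile in (0, 1), and tau = 0.5 recovers absolute error.

// Pinball loss: under-prediction is penalized by tau, over-prediction by (1 - tau).
def quantileLoss(tau: Double)(label: Double, prediction: Double): Double = {
  val diff = label - prediction
  if (diff >= 0) tau * diff else (tau - 1) * diff
}

// (Sub)gradient of the loss with respect to the prediction, i.e. the quantity
// used to re-label training instances at each boosting iteration.
def quantileGradient(tau: Double)(label: Double, prediction: Double): Double = {
  if (label - prediction >= 0) -tau else 1 - tau
}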

-Manish

On Tue, Nov 18, 2014 at 10:07 AM, Alessandro Baretta 
wrote:

> Manish,
>
> My use case for (asymmetric) absolute error is quite trivially quantile
> regression. In other words, I want to use Spark to learn conditional
> cumulative distribution functions. See R's GBM quantile regression option.
>
> If you either find or create a Jira ticket, I would be happy to give it a
> shot. Is there a design doc explaining how the gradient boosting algorithm
> is laid out in MLLib? I tried reading the code, but without a "Rosetta
> stone" it's impossible to make sense of it.
>
> Alex
>
> On Mon, Nov 17, 2014 at 8:25 PM, Manish Amde  wrote:
>
>> Hi Alessandro,
>>
>> I think absolute error as splitting criterion might be feasible with the
>> current architecture -- I think the sufficient statistics we collect
>> currently might be able to support this. Could you let us know scenarios
>> where absolute error has significantly outperformed squared error for
>> regression trees? Also, what's your use case that makes squared error
>> undesirable?
>>
>> For gradient boosting, you are correct. The weak hypothesis weights refer
>> to tree predictions in each of the branches. We plan to explain this in
>> the 1.2 documentation and maybe add some more clarifications to the
>> Javadoc.
>>
>> I will try to search for JIRAs or create new ones and update this thread.
>>
>> -Manish
>>
>>
>> On Monday, November 17, 2014, Alessandro Baretta 
>> wrote:
>>
>>> Manish,
>>>
>>> Thanks for pointing me to the relevant docs. It is unfortunate that
>>> absolute error is not supported yet. I can't seem to find a Jira for it.
>>>
>>> Now, here's what the comments say in the current master branch:
>>> /**
>>>  * :: Experimental ::
>>>  * A class that implements Stochastic Gradient Boosting
>>>  * for regression and binary classification problems.
>>>  *
>>>  * The implementation is based upon:
>>>  *   J.H. Friedman.  "Stochastic Gradient Boosting."  1999.
>>>  *
>>>  * Notes:
>>>  *  - This currently can be run with several loss functions.  However,
>>> only SquaredError is
>>>  *fully supported.  Specifically, the loss function should be used
>>> to compute the gradient
>>>  *(to re-label training instances on each iteration) and to weight
>>> weak hypotheses.
>>>  *Currently, gradients are computed correctly for the available loss
>>> functions,
>>>  *but weak hypothesis weights are not computed correctly for LogLoss
>>> or AbsoluteError.
>>>  *Running with those losses will likely behave reasonably, but lacks
>>> the same guarantees.
>>> ...
>>> */
>>>
>>> By the looks of it, the GradientBoosting API would support an absolute
>>> error type loss function to perform quantile regression, except for "weak
>>> hypothesis weights". Does this refer to the weights of the leaves of the
>>> trees?
>>>
>>> Alex
>>>
>>> On Mon, Nov 17, 2014 at 2:24 PM, Manish Amde 
>>> wrote:
>>>
>>>> Hi Alessandro,
>>>>
>>>> MLlib v1.1 supports variance for regression and gini impurity and
>>>> entropy for classification.
>>>> http://spark.apache.org/docs/latest/mllib-decision-tree.html
>>>>
>>>> If the information gain calculation can be performed by distributed
>>>> aggregation then it might be possible to plug it into the existing
>>>> implementation. We want to perform such calculations (e.g., the median) for
>>>> the gradient boosting models (coming up in the 1.2 release) using absolute
>>>> error and deviance as loss functions but I don't think anyone is planning
>>>> to work on it yet. :-)
>>>>
>>>> -Manish
>>>>
>>>> On Mon, Nov 17, 2014 at 11:11 AM, Alessandro Baretta <
>>>> alexbare...@gmail.com> wrote:
>>>>
>>>>> I see that, as of v. 1.1, MLLib supports regression and classification
>>>>> tree
>>>>> models. I assume this means that it uses a squared-error loss function
>>>>> for
>>>>> the first and logistic cost function for the second. I don't see
>>>>> support
>>>>> for quantile regression via an absolute error cost function. Or am I
>>>>> missing something?
>>>>>
>>>>> If, as it seems, this is missing, how do you recommend to implement it?
>>>>>
>>>>> Alex
>>>>>
>>>>
>>>>
>>>
>


Re: Quantile regression in tree models

2014-11-17 Thread Manish Amde
Hi Alessandro,

I think absolute error as splitting criterion might be feasible with the
current architecture -- I think the sufficient statistics we collect
currently might be able to support this. Could you let us know scenarios
where absolute error has significantly outperformed squared error for
regression trees? Also, what's your use case that makes squared error
undesirable?
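
To make the "sufficient statistics" point concrete, the variance criterion only
needs three per-node aggregates, which is what makes it cheap to compute by
distributed aggregation. The sketch below is illustrative plain Scala, not the
actual DTStatsAggregator internals; the open question in this thread is whether
the statistics already collected can also support an absolute-error criterion
(an exact median, for example, is not a function of these three aggregates
alone).

// Mergeable sufficient statistics for the variance impurity: count, sum, and
// sum of squares. add() folds in one label; merge() combines partial results
// from different partitions.
case class VarianceStats(count: Long, sum: Double, sumSq: Double) {
  def add(label: Double): VarianceStats =
    VarianceStats(count + 1, sum + label, sumSq + label * label)
  def merge(other: VarianceStats): VarianceStats =
    VarianceStats(count + other.count, sum + other.sum, sumSq + other.sumSq)
  // Variance = E[x^2] - (E[x])^2, computed once all statistics are aggregated.
  def impurity: Double =
    if (count == 0) 0.0 else sumSq / count - math.pow(sum / count, 2)
}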

For gradient boosting, you are correct. The weak hypothesis weights refer
to tree predictions in each of the branches. We plan to explain this in the
1.2 documentation and maybe add some more clarifications to the Javadoc.

I will try to search for JIRAs or create new ones and update this thread.

-Manish

On Monday, November 17, 2014, Alessandro Baretta 
wrote:

> Manish,
>
> Thanks for pointing me to the relevant docs. It is unfortunate that
> absolute error is not supported yet. I can't seem to find a Jira for it.
>
> Now, here's what the comments say in the current master branch:
> /**
>  * :: Experimental ::
>  * A class that implements Stochastic Gradient Boosting
>  * for regression and binary classification problems.
>  *
>  * The implementation is based upon:
>  *   J.H. Friedman.  "Stochastic Gradient Boosting."  1999.
>  *
>  * Notes:
>  *  - This currently can be run with several loss functions.  However,
> only SquaredError is
>  *fully supported.  Specifically, the loss function should be used to
> compute the gradient
>  *(to re-label training instances on each iteration) and to weight
> weak hypotheses.
>  *Currently, gradients are computed correctly for the available loss
> functions,
>  *but weak hypothesis weights are not computed correctly for LogLoss
> or AbsoluteError.
>  *Running with those losses will likely behave reasonably, but lacks
> the same guarantees.
> ...
> */
>
> By the looks of it, the GradientBoosting API would support an absolute
> error type loss function to perform quantile regression, except for "weak
> hypothesis weights". Does this refer to the weights of the leaves of the
> trees?
>
> Alex
>
> On Mon, Nov 17, 2014 at 2:24 PM, Manish Amde  > wrote:
>
>> Hi Alessandro,
>>
>> MLlib v1.1 supports variance for regression and gini impurity and entropy
>> for classification.
>> http://spark.apache.org/docs/latest/mllib-decision-tree.html
>>
>> If the information gain calculation can be performed by distributed
>> aggregation then it might be possible to plug it into the existing
>> implementation. We want to perform such calculations (e.g., the median) for
>> the gradient boosting models (coming up in the 1.2 release) using absolute
>> error and deviance as loss functions but I don't think anyone is planning
>> to work on it yet. :-)
>>
>> -Manish
>>
>> On Mon, Nov 17, 2014 at 11:11 AM, Alessandro Baretta <
>> alexbare...@gmail.com
>> > wrote:
>>
>>> I see that, as of v. 1.1, MLLib supports regression and classification
>>> tree
>>> models. I assume this means that it uses a squared-error loss function
>>> for
>>> the first and logistic cost function for the second. I don't see support
>>> for quantile regression via an absolute error cost function. Or am I
>>> missing something?
>>>
>>> If, as it seems, this is missing, how do you recommend to implement it?
>>>
>>> Alex
>>>
>>
>>
>


Re: Quantile regression in tree models

2014-11-17 Thread Manish Amde
Hi Alessandro,

MLlib v1.1 supports variance for regression and gini impurity and entropy
for classification.
http://spark.apache.org/docs/latest/mllib-decision-tree.html

If the information gain calculation can be performed by distributed
aggregation then it might be possible to plug it into the existing
implementation. We want to perform such calculations (e.g., the median) for
the gradient boosting models (coming up in the 1.2 release) using absolute
error and deviance as loss functions but I don't think anyone is planning
to work on it yet. :-)

-Manish

On Mon, Nov 17, 2014 at 11:11 AM, Alessandro Baretta 
wrote:

> I see that, as of v. 1.1, MLLib supports regression and classification tree
> models. I assume this means that it uses a squared-error loss function for
> the first and logistic cost function for the second. I don't see support
> for quantile regression via an absolute error cost function. Or am I
> missing something?
>
> If, as it seems, this is missing, how do you recommend to implement it?
>
> Alex
>


Re: Decision forests don't work with non-trivial categorical features

2014-10-13 Thread Manish Amde
Sean, sorry for missing out on the discussion.

Evan, you are correct: we are using the heuristic Sean suggested during the
multiclass PR for ordering high-arity categorical variables, based on the
impurity values for each categorical feature.
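
For anyone following along, the idea is roughly: rather than evaluating all
2^(k-1) groupings of k categories, order the categories by a per-category
statistic (e.g. the average label, or an impurity-based centroid) and only
consider the k-1 "prefix" splits of that ordering. A toy sketch of the idea,
not the DecisionTree internals:

// Toy illustration for a high-arity categorical feature in regression: sort
// categories by their mean label, then emit only the contiguous (prefix)
// splits of that ordering as candidates. For regression and binary
// classification this recovers the best split exactly; for multiclass it is a
// heuristic, as Joseph explains below.
def candidateSplits(labelsByCategory: Map[Int, Seq[Double]]): Seq[Set[Int]] = {
  val ordered = labelsByCategory.toSeq
    .map { case (category, labels) => (category, labels.sum / labels.size) }
    .sortBy { case (_, meanLabel) => meanLabel }
    .map { case (category, _) => category }
  // Each candidate sends a prefix of the ordered categories to the left child.
  (1 until ordered.size).map(i => ordered.take(i).toSet)
}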

Joseph, thanks for fixing the bug, which I think was a regression introduced when
we added support for RFs. I don't think we have seen this in 1.1.

-Manish

On Mon, Oct 13, 2014 at 11:55 AM, Joseph Bradley 
wrote:

> I think this is the fix:
>
> In this
> file:
> mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala
>
> methods "getFeatureOffset" and "getLeftRightFeatureOffsets" have sanity
> checks ("require") which are correct for DecisionTree but not for
> RandomForest.  You can remove those.  I've sent a PR with this and a few
> other small fixes:
>
> https://github.com/apache/spark/pull/2785
>
> I hope this fixes the bug!
>
> On Mon, Oct 13, 2014 at 11:19 AM, Sean Owen  wrote:
>
> > Great, we'll confer then. I'm using master / 1.2.0-SNAPSHOT. I'll send
> > some details directly under separate cover.
> >
> > On Mon, Oct 13, 2014 at 7:12 PM, Joseph Bradley 
> > wrote:
> > > Hi Sean,
> > >
> > > Sorry I didn't see this thread earlier!  (Thanks Ameet for pinging me.)
> > >
> > > Short version: That exception should not be thrown, so there is a bug
> > > somewhere.  The intended logic for handling high-arity categorical
> > features
> > > is about the best one can do, as far as I know.
> > >
> > > Bug finding: For my checking purposes, which branch of Spark are you
> > using,
> > > and do you have the options being submitted to DecisionTree?
> > >
> > > High-arity categorical features: As you have figured out, if you use a
> > > categorical feature with just a few categories, it is treated as
> > "unordered"
> > > so that we explicitly consider all exponentially many ways to split the
> > > categories into 2 groups.  If you use one with many categories, then it
> > is
> > > necessary to impose an order.  (The communication increases linearly in
> > the
> > > number of possible splits, so it would blow up if we considered all
> > > exponentially many splits.)  This order is chosen separately for each
> > node,
> > > so it is not a uniform order imposed over the entire tree.  This
> actually
> > > means that it is not a heuristic for regression and binary
> > classification;
> > > i.e., it chooses the same split as if we had explicitly considered all
> of
> > > the possible splits.  For multiclass classification, it is a heuristic,
> > but
> > > I don't know of a better solution.
> > >
> > > I'll check the code, but if you can forward info about the bug, that
> > would
> > > be very helpful.
> > >
> > > Thanks!
> > > Joseph
> > >
> >
>


Re: reduce, transform, combine

2014-05-04 Thread Manish Amde
Thanks DB. I will work with mapPartitions for now.

Question to the community in general: should we consider adding such an
operation to RDDs, especially as a developer API?
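
For the record, a minimal sketch of the mapPartitions workaround (a standalone
helper with illustrative names and signature, not an actual RDD API):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Fold each partition with seqOp, apply the (expensive) transformOp exactly
// once per partition, then combine the transformed per-partition values.
// Note: empty partitions contribute transformOp(zeroReduceValue).
def foldTransformCombine[T, V, U: ClassTag](
    rdd: RDD[T],
    zeroReduceValue: V,
    zeroCombineValue: U)(
    seqOp: (V, T) => V,
    transformOp: V => U,
    combOp: (U, U) => U): U = {
  rdd.mapPartitions { iter =>
    Iterator.single(transformOp(iter.foldLeft(zeroReduceValue)(seqOp)))
  }.fold(zeroCombineValue)(combOp)
}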

On Sun, May 4, 2014 at 1:41 AM, DB Tsai  wrote:

> You could easily achieve this with mapPartitions. However, it seems that it
> cannot be done using an aggregate-type operation. I can see that it's a
> generally useful operation. For now, you could use mapPartitions.
> Sincerely,
> DB Tsai
> ---
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
> On Sun, May 4, 2014 at 1:12 AM, Manish Amde  wrote:
>> I am currently using the RDD aggregate operation to reduce (fold) per
>> partition and then combine the per-partition results:
>> def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U)
>> => U): U
>>
>> I need to perform a transform operation after the seqOp and before the
>> combOp. The signature would look like
>> def foldTransformCombine[U: ClassTag](zeroReduceValue: V, zeroCombineValue:
>> U)(seqOp: (V, T) => V, transformOp: (V) => U, combOp: (U, U) => U): U
>>
>> This is especially useful in the scenario where the transformOp is
>> expensive and should be performed once per partition before combining. Is
>> there a way to accomplish this with existing RDD operations? If yes, great
>> but if not, should we consider adding such a general transformation to the
>> list of RDD operations?
>>
>> -Manish
>>

reduce, transform, combine

2014-05-04 Thread Manish Amde
I am currently using the RDD aggregate operation to reduce (fold) per
partition and then combine the per-partition results:
def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U)
=> U): U

I need to perform a transform operation after the seqOp and before the
combOp. The signature would look like
def foldTransformCombine[U: ClassTag](zeroReduceValue: V, zeroCombineValue:
U)(seqOp: (V, T) => V, transformOp: (V) => U, combOp: (U, U) => U): U

This is especially useful in the scenario where the transformOp is
expensive and should be performed once per partition before combining. Is
there a way to accomplish this with existing RDD operations? If yes, great
but if not, should we consider adding such a general transformation to the
list of RDD operations?

-Manish