[jira] [Commented] (SPARK-1547) Add gradient boosting algorithm to MLlib

2014-09-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14154201#comment-14154201
 ] 

Apache Spark commented on SPARK-1547:
-

User 'manishamde' has created a pull request for this issue:
https://github.com/apache/spark/pull/2607

> Add gradient boosting algorithm to MLlib
> 
>
> Key: SPARK-1547
> URL: https://issues.apache.org/jira/browse/SPARK-1547
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Manish Amde
>Assignee: Manish Amde
>
> This task requires adding the gradient boosting algorithm to Spark MLlib. The 
> implementation needs to adapt the gradient boosting algorithm to the scalable 
> tree implementation.
> The task involves:
> - Comparing the various tradeoffs and finalizing the algorithm before 
> implementation
> - Code implementation
> - Unit tests
> - Functional tests
> - Performance tests
> - Documentation
> [Ensembles design document (Google doc) | 
> https://docs.google.com/document/d/1J0Q6OP2Ggx0SOtlPgRUkwLASrAkUJw6m6EK12jRDSNg/]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1547) Add gradient boosting algorithm to MLlib

2014-09-30 Thread Manish Amde (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14154051#comment-14154051
 ] 

Manish Amde commented on SPARK-1547:


Adding Hirakendu's feedback on checkpointing below.

~

Hi Manish,

Just looked at the JIRA; quite impressive progress. I will create a JIRA 
account and submit the Apache CLA; I have been procrastinating on that for a 
long time :).

About checkpointing, as I may have discussed before, I had a very crude 
solution to the problem of long lineage chains. I simply cache the residual 
dataset periodically, after a certain number of iterations, and uncache the 
previously cached dataset prior to caching the current one. This materializes 
the dataset but apparently doesn't break the lineage graph. Nonetheless, in 
practice I see that speed improves as if I had started afresh. It's a crude 
form of checkpointing, but it would be good to do the real thing.
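The periodic cache/uncache pattern described above can be sketched as follows. 
This is a toy Python model, not Spark API: `ToyRDD` is a stub that only tracks 
lineage depth, and `map_subtract`, `boost`, and `break_interval` are 
illustrative names.

```python
class ToyRDD:
    """Stub standing in for a Spark RDD; tracks lineage depth, not data."""
    def __init__(self, lineage=0):
        self.lineage = lineage
        self.cached = False

    def map_subtract(self):
        # One boosting iteration: subtract the previous tree's predictions.
        # Each such map adds one step to the lineage chain.
        return ToyRDD(self.lineage + 1)

    def cache(self):
        self.cached = True
        return self

    def count(self):
        # An action that forces materialization of the cached dataset.
        return 0

    def unpersist(self):
        self.cached = False


def boost(data, num_iterations, break_interval):
    previously_cached = None
    for i in range(1, num_iterations + 1):
        data = data.map_subtract()
        if i % break_interval == 0:
            data.cache()
            data.count()                       # materialize now
            if previously_cached is not None:
                previously_cached.unpersist()  # drop the older cached copy
            previously_cached = data
    return data
```

Note that, matching the observation above, caching materializes the data but 
the lineage chain keeps growing; real checkpointing (`RDD.checkpoint`) would 
truncate it.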

About the choice of interval, it obviously depends on the dataset. Suppose 
there is a linear increase in time with iterations (the map operation to 
subtract the previous tree's predictions), and say each such map operation 
takes time t. If we break the chain every B iterations, then the total time 
for these map operations over one period is t + 2t + ... + Bt ~= B^2 * t. If 
the time taken to checkpoint/materialize is c, then the time for a period of B 
iterations is B^2 * t + c, so the time per iteration is (B^2 * t + c)/B = 
B*t + c/B. Minimizing this over B gives B = sqrt(c/t). (This ignores the -1s 
and factors of 2 in B^2 * t.)
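Under the same approximations, the calculation above can be checked 
numerically (t and c below are made-up values, not measurements):

```python
import math

def per_iteration_cost(B, t, c):
    # Approximate per-iteration cost when the chain is broken every B
    # iterations: one period costs roughly B^2 * t in map work plus one
    # checkpoint/materialization of cost c, so per iteration it is
    # (B^2 * t + c) / B = B*t + c/B.
    return B * t + c / B

def optimal_interval(t, c):
    # Minimizing B*t + c/B over B gives B = sqrt(c/t).
    return math.sqrt(c / t)

t, c = 0.5, 50.0            # hypothetical map time and checkpoint cost
B = optimal_interval(t, c)  # 10.0 for these values
```

With these made-up values B = 10, and the per-iteration cost at B = 10 (10.0) 
is lower than at either B = 5 or B = 20 (12.5 each).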

Of course, t and c depend on the dataset. So, setting aside all calculations, 
in my implementation B is a user-defined parameter. In practice, I run it once 
and watch how long each iteration takes, noting when it slows down to the 
point that it might as well have started afresh. Note that, mathematically, 
for the optimal B = sqrt(c/t), the total map time over a period is B^2 * t = 
c, which is the same as the checkpoint time.

Clearly, this can be automated by keeping track of iteration times: just 
subtract the actual computation time and leave some room for noisy running 
times. Nonetheless, it would be good to have a user-defined override via an 
optional parameter.

I think a similar optional parameter may also be provided to override the 
automatically determined batch size for deep trees.

Btw, the comment in the JIRA about support for sparse vectors is something 
nice to have and something I have been thinking about. Sparse datasets are now 
very common, although the go-to solution for them is linear models.

Thanks,
Hirakendu.




[jira] [Commented] (SPARK-1547) Add gradient boosting algorithm to MLlib

2014-09-29 Thread Manish Amde (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152251#comment-14152251
 ] 

Manish Amde commented on SPARK-1547:


Sure. I like your naming suggestion. 

I will rebase from the latest master now that the RF PR has been accepted.  I 
will create a WIP PR soon after (with tests and docs) so that we can discuss 
the code in greater detail.




[jira] [Commented] (SPARK-1547) Add gradient boosting algorithm to MLlib

2014-09-29 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152101#comment-14152101
 ] 

Joseph K. Bradley commented on SPARK-1547:
--

This will be great to have!  The WIP code and the list of to-do items look good 
to me.

Small comment: For the losses, it would be good to rename "residual" to either 
"pseudoresidual" (following Friedman's paper) or to "lossGradient" (which is 
more literal/accurate).  It would also be nice to have the loss classes compute 
the loss itself, so that we can compute that at the end (and later track it 
along the way).
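The suggestion above might look roughly like this (a hypothetical Python 
sketch; the class and method names are illustrative, not the actual MLlib 
API):

```python
from abc import ABC, abstractmethod

class Loss(ABC):
    """A pluggable loss exposing both the loss value and its gradient
    (the 'pseudoresidual' in Friedman's terminology)."""

    @abstractmethod
    def gradient(self, prediction, label):
        """Gradient of the loss with respect to the prediction."""

    @abstractmethod
    def loss(self, prediction, label):
        """The loss itself, so it can be computed at the end of training
        and, later, tracked along the way."""


class SquaredError(Loss):
    def gradient(self, prediction, label):
        # d/dF [ 0.5 * (y - F)^2 ] = -(y - F)
        return -(label - prediction)

    def loss(self, prediction, label):
        return 0.5 * (label - prediction) ** 2
```

Having each loss class carry both methods keeps the boosting loop loss-agnostic 
while still letting callers report training loss.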





[jira] [Commented] (SPARK-1547) Add gradient boosting algorithm to MLlib

2014-09-27 Thread Manish Amde (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14150964#comment-14150964
 ] 

Manish Amde commented on SPARK-1547:


Given the interest in boosting algorithms for the MLlib 1.2 release, I have 
revived the boosting work that Hirakendu Das and I worked on a few months ago, 
which had stalled due to the decision tree optimization effort.

A basic version of the GBT code with "pluggable" loss functions can be found 
here:
https://github.com/manishamde/spark/compare/gbt?diff=unified

I will put it up for a WIP PR in a few days if I don't hear any major concerns.

Here are a few things that are left:
1. Stochastic gradient boosting support -- I am waiting for the [RF 
ticket|https://issues.apache.org/jira/browse/SPARK-1545] to be closed so that I 
can re-use the BaggedPoint approach.
2. Checkpointing -- This approach will avoid long lineage chains. I need 
Hirakendu's input on this, especially his findings from large-scale 
experiments, and I also need to conduct experiments of my own.
3. Unit tests -- I have done some basic testing but need to add unit tests.
4. Classification support -- It should be straightforward to add.
5. Create public APIs.
6. Tests on multiple cluster sizes and datasets -- I will need help from the 
community on this front.
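For item 1, the subsampling step of stochastic gradient boosting can be 
sketched as follows (plain Python, not the BaggedPoint code; `fraction` and 
`seed` are illustrative parameters, and Bernoulli sampling is a simple 
stand-in for whatever scheme BaggedPoint uses):

```python
import random

def subsample(data, fraction, seed):
    # Stochastic gradient boosting: fit each tree on a random subsample
    # of the training data rather than on the full dataset.
    rng = random.Random(seed)
    return [x for x in data if rng.random() < fraction]
```

With `fraction=1.0` this reduces to deterministic gradient boosting on the 
full dataset.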

Feedback would be appreciated.




[jira] [Commented] (SPARK-1547) Add gradient boosting algorithm to MLlib

2014-09-26 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149918#comment-14149918
 ] 

Joseph K. Bradley commented on SPARK-1547:
--

[~hector.yee] I strongly agree about keeping ensembles general enough to work 
with any weak learning algorithm.  This is difficult now because of the lack of 
a general class hierarchy, but that will be easier after the [current API 
redesign|https://issues.apache.org/jira/browse/SPARK-1856].  Starting with 
trees, and later generalizing once the new API is available, will be great.




[jira] [Commented] (SPARK-1547) Add gradient boosting algorithm to MLlib

2014-07-01 Thread Hector Yee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049659#comment-14049659
 ] 

Hector Yee commented on SPARK-1547:
---

Just generic log loss with L1 regularization should suffice. Most of the work 
is in feature engineering anyway. There is no hurry at all; I already have 
several implementations outside MLlib that I am using. It would just be 
convenient to have another implementation to compare against.



[jira] [Commented] (SPARK-1547) Add gradient boosting algorithm to MLlib

2014-07-01 Thread Manish Amde (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049644#comment-14049644
 ] 

Manish Amde commented on SPARK-1547:


Yes, the loss function and the solver should ideally be independent of the 
tree algorithm and hence work with sparse data. Is there a particular non-tree 
algorithm you have in mind?

I will definitely keep your suggestion in mind during the implementation 
(coming up soon), but I might postpone it to a later release if it involves 
much more work than implementing it for decision trees, since the goal is to 
get ensembles built on top of decision trees ASAP.



[jira] [Commented] (SPARK-1547) Add gradient boosting algorithm to MLlib

2014-07-01 Thread Hector Yee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049394#comment-14049394
 ] 

Hector Yee commented on SPARK-1547:
---

Honestly, trees are most useful when the feature vectors are dense. Is there 
any possibility that the solver can be decoupled from the tree part to deal 
with sparse data?
