[jira] [Commented] (SPARK-1547) Add gradient boosting algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14154201#comment-14154201 ] Apache Spark commented on SPARK-1547:

User 'manishamde' has created a pull request for this issue: https://github.com/apache/spark/pull/2607

> Add gradient boosting algorithm to MLlib
>
> Key: SPARK-1547
> URL: https://issues.apache.org/jira/browse/SPARK-1547
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Affects Versions: 1.0.0
> Reporter: Manish Amde
> Assignee: Manish Amde
>
> This task requires adding the gradient boosting algorithm to Spark MLlib. The implementation needs to adapt the gradient boosting algorithm to the scalable tree implementation.
>
> The task involves:
> - Comparing the various tradeoffs and finalizing the algorithm before implementation
> - Code implementation
> - Unit tests
> - Functional tests
> - Performance tests
> - Documentation
>
> [Ensembles design document (Google doc)|https://docs.google.com/document/d/1J0Q6OP2Ggx0SOtlPgRUkwLASrAkUJw6m6EK12jRDSNg/]

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1547) Add gradient boosting algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14154051#comment-14154051 ] Manish Amde commented on SPARK-1547:

Adding Hirakendu's feedback on checkpointing below.

Hi Manish,

Just looked at the JIRA, quite impressive progress. Will create a JIRA account and submit the Apache CLA, long time procrastinating :).

About checkpointing, as I may have discussed before, I had a very crude solution to the problem with long lineage chains. I simply cache the residual dataset periodically, after a certain number of iterations, and uncache the previously cached dataset prior to caching the current one. This does materialize the dataset, but apparently doesn't break the lineage graph. Nonetheless, in practice I see that speed improves as if I had started afresh. It's a crude form of checkpointing, but it would be good to do the real thing.

About the choice of interval, obviously it depends on the dataset. Suppose there is a linear increase in time with iterations (from the map operation that subtracts the previous tree's predictions), and say it takes time t for each such map operation. If we break the chain every B iterations, then the total time for those map operations is t + 2t + ... + Bt ~= B^2 * t. Suppose the time taken to checkpoint/materialize is c; then the time for a period of B iterations is B^2 * t + c, so the time per iteration is (B^2 * t + c)/B = B*t + c/B, which is minimized at B = sqrt(c/t). (This ignores the -1s and factors of 2 in B^2 * t.) Of course, t and c depend on the dataset, so setting aside all calculations, in my implementation B is a user-defined parameter. In practice, I run the job once, watch how long each iteration takes, and note when it has slowed down to the point where it would be no worse to have started afresh. Note that mathematically, at the optimal B = sqrt(c/t), the map operations over one period add up to B^2 * t = c, the same as the checkpoint time.
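Hirakendu's B = sqrt(c/t) heuristic can be written down directly. The helper below and its names are illustrative, not code from any PR; t and c must be measured empirically on the actual dataset.

```scala
// Sketch of the interval heuristic above: t is the per-iteration map time and
// c the checkpoint/materialization time. Per-iteration cost B*t + c/B is
// minimized at B = sqrt(c/t); round to a whole number of iterations.
def optimalCheckpointInterval(mapTimeSec: Double, checkpointTimeSec: Double): Int = {
  require(mapTimeSec > 0 && checkpointTimeSec > 0)
  math.max(1, math.round(math.sqrt(checkpointTimeSec / mapTimeSec)).toInt)
}
```

For example, if each map operation takes 1 s and a checkpoint takes 100 s, the chain would be broken every sqrt(100/1) = 10 iterations.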
Clearly, this can be automated by keeping track of iteration times; just subtract the actual computation time and leave some room for noisy running times. Nonetheless, it would be good to have a user-defined override via an optional parameter. I think a similar optional parameter could also be provided to override the automatically determined batch size for deep trees.

Btw, the comment in the JIRA about support for sparse vectors is something nice to have and something I have been thinking about. Sparse datasets are now far too common to ignore, although the go-to solution for them is linear models.

Thanks, Hirakendu.
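The automation Hirakendu sketches might look like the following. The function, its parameter names, and the trigger rule (break the chain once an iteration's lineage overhead reaches the checkpoint cost, which is where the B = sqrt(c/t) analysis says the two balance) are assumptions for illustration, not code from the thread.

```scala
// Illustrative sketch: decide when to break the lineage chain by tracking
// iteration times. The lineage overhead is the current iteration's wall time
// minus the baseline (fresh-start) compute time; once that overhead reaches
// the estimated checkpoint cost, materializing pays for itself. A noise
// margin leaves room for noisy running times, as suggested above.
def shouldCheckpoint(
    iterTimeSec: Double,       // wall time of the current iteration
    baselineTimeSec: Double,   // compute time of a fresh (first) iteration
    checkpointCostSec: Double, // estimated time to checkpoint/materialize
    noiseMarginSec: Double = 0.0): Boolean = {
  val lineageOverhead = iterTimeSec - baselineTimeSec
  lineageOverhead - noiseMarginSec >= checkpointCostSec
}
```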
[jira] [Commented] (SPARK-1547) Add gradient boosting algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152251#comment-14152251 ] Manish Amde commented on SPARK-1547:

Sure. I like your naming suggestion. I will rebase from the latest master now that the RF PR has been accepted. I will create a WIP PR soon after (with tests and docs) so that we can discuss the code in greater detail.
[jira] [Commented] (SPARK-1547) Add gradient boosting algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152101#comment-14152101 ] Joseph K. Bradley commented on SPARK-1547:

This will be great to have! The WIP code and the list of to-do items look good to me. Small comment: for the losses, it would be good to rename "residual" to either "pseudoresidual" (following Friedman's paper) or "lossGradient" (which is more literal/accurate). It would also be nice to have the loss classes compute the loss itself, so that we can compute it at the end (and later track it along the way).
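A minimal shape for what Joseph suggests might look like the sketch below. The trait name, method signatures, and the squared-error instance are illustrative assumptions, not the actual MLlib API under discussion.

```scala
// Illustrative sketch: each loss exposes both its gradient (whose negation is
// the "pseudoresidual" the next tree is fit to) and the loss value itself, so
// the training loop can report error at the end and later track it per
// iteration.
trait Loss {
  def gradient(prediction: Double, label: Double): Double
  def loss(prediction: Double, label: Double): Double
}

object SquaredError extends Loss {
  // d/dF (F - y)^2 = 2 * (F - y); the pseudoresidual is -gradient = 2 * (y - F).
  def gradient(prediction: Double, label: Double): Double =
    2.0 * (prediction - label)
  def loss(prediction: Double, label: Double): Double = {
    val d = prediction - label
    d * d
  }
}
```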
[jira] [Commented] (SPARK-1547) Add gradient boosting algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14150964#comment-14150964 ] Manish Amde commented on SPARK-1547:

Given the interest in boosting algorithms for the MLlib 1.2 release, I have revived the boosting work that Hirakendu Das and I worked on a few months ago but which was stalled by the decision tree optimization effort. A basic version of the GBT code with "pluggable" loss functions can be found here: https://github.com/manishamde/spark/compare/gbt?diff=unified

I will put it up as a WIP PR in a few days if I don't hear any major concerns. Here are a few things that are left:
1. Stochastic gradient boosting support -- I am waiting for the [RF ticket|https://issues.apache.org/jira/browse/SPARK-1545] to be closed so that I can re-use the BaggedPoint approach.
2. Checkpointing -- This will avoid long lineage chains. I need Hirakendu's input here, especially his findings from large-scale experiments, and I also need to conduct experiments of my own.
3. Unit tests -- I have done some basic testing but still need to add unit tests.
4. Classification support -- This should be straightforward to add.
5. Public APIs.
6. Tests on multiple cluster sizes and datasets -- I will need help from the community on this front.

Feedback will be appreciated.
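For reference, the functional-gradient recurrence that "pluggable" losses plug into can be illustrated with a deliberately trivial weak learner: a single constant (the mean of the pseudoresiduals) standing in for a regression tree. This toy is not the linked code; every name in it is hypothetical.

```scala
// Toy gradient boosting with squared error: each iteration fits the
// pseudoresiduals (label - prediction, up to a constant factor, for squared
// loss) with a "weak learner" that predicts their mean, then adds a shrunken
// copy of that learner to the model. A real GBT fits a regression tree to the
// pseudoresiduals instead of taking their mean.
def boostConstant(labels: Array[Double], numIters: Int, learningRate: Double): Double = {
  var model = 0.0 // running prediction (identical for all points, since the learner is constant)
  for (_ <- 1 to numIters) {
    val pseudoResiduals = labels.map(_ - model)
    val weakLearner = pseudoResiduals.sum / pseudoResiduals.length
    model += learningRate * weakLearner
  }
  model
}
```

With squared error this converges to the label mean, which is indeed the loss minimizer over constant models, so the toy at least exercises the recurrence correctly.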
[jira] [Commented] (SPARK-1547) Add gradient boosting algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149918#comment-14149918 ] Joseph K. Bradley commented on SPARK-1547:

[~hector.yee] I strongly agree about keeping ensembles general enough to work with any weak learning algorithm. This is difficult now because of the lack of a general class hierarchy, but it will be easier after the [current API redesign|https://issues.apache.org/jira/browse/SPARK-1856]. Starting with trees, and later generalizing once the new API is available, will be great.
[jira] [Commented] (SPARK-1547) Add gradient boosting algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049659#comment-14049659 ] Hector Yee commented on SPARK-1547:

Just generic log loss with L1 regularization should suffice. Most of the work is in feature engineering anyway. It is no hurry at all; I already have several implementations not in MLlib that I am using. It would just be convenient to have another implementation to compare against.
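As a reference point for the "generic log loss" mentioned above, here is the loss and its gradient in margin form (labels in {-1, +1}, margin = label * prediction). The function names are illustrative; the L1 penalty would be applied to the model coefficients separately and is not shown.

```scala
// Log loss in margin form: L(m) = log(1 + exp(-m)) with m = y * f(x),
// y in {-1, +1}. Using log1p keeps the value stable for large positive margins.
def logLoss(margin: Double): Double = math.log1p(math.exp(-margin))

// dL/df = -y / (1 + exp(y * f)): the negative gradient (pseudoresidual)
// pushes the prediction toward the label.
def logLossGradient(label: Double, prediction: Double): Double = {
  val margin = label * prediction
  -label / (1.0 + math.exp(margin))
}
```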
[jira] [Commented] (SPARK-1547) Add gradient boosting algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049644#comment-14049644 ] Manish Amde commented on SPARK-1547:

Yes, the loss function and the solver should ideally be independent of the tree algorithm and hence work with sparse data. Any particular non-tree algorithm you had in mind? I will definitely keep your suggestion in mind during the implementation (coming up soon), but I might postpone it to a later release if it involves much more work than the decision tree version, since the goal is to get ensembles built on top of decision trees ASAP.
[jira] [Commented] (SPARK-1547) Add gradient boosting algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049394#comment-14049394 ] Hector Yee commented on SPARK-1547:

Honestly, trees are most useful when the feature vectors are dense. Any possibility that the solver can be decoupled from the tree part for dealing with sparse data?