GitHub user facaiy reopened a pull request: https://github.com/apache/spark/pull/17383
[SPARK-3165][MLlib][WIP] DecisionTree does not use sparsity in data

## What changes were proposed in this pull request?

DecisionTree should take advantage of sparse feature vectors. Aggregation over training data could handle the empty/zero-valued data elements more efficiently.

## How was this patch tested?

Modifying the inner implementation won't change the behavior of the DecisionTree module, hence all existing unit tests should pass. Some performance benchmarks are perhaps needed.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/facaiy/spark ENH/use_sparsity_in_decision_tree

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17383.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17383

----
commit d2eea0645110b3bcc6c0b905bc55e43e0af9debb
Author: 颜发才（Yan Facai） <facai....@gmail.com>
Date:   2017-03-22T05:45:58Z

    CLN: use Vector to implement binnedFeatures in TreePoint

commit 9ce6b813beffb9d58e7b2907425a1262610256be
Author: 颜发才（Yan Facai） <facai....@gmail.com>
Date:   2017-03-22T09:15:30Z

    BUG: fix for incompatible argument of predictImpl method

commit 37f05f9b0386acc8bea048e72aff2b9c37ca4ca6
Author: 颜发才（Yan Facai） <facai....@gmail.com>
Date:   2017-03-22T09:18:04Z

    CLN: create sparse vector when converting to TreePoint

commit c9664ce6c94b98cbc76253817e637d9a968e4bd6
Author: 颜发才（Yan Facai） <facai....@gmail.com>
Date:   2017-03-22T09:21:59Z

    CLN: change Array to Vector in TreePoint when created

commit d6ef9e512ea4a58db2dccf3e7cca95f9e8b0df8f
Author: 颜发才（Yan Facai） <facai....@gmail.com>
Date:   2017-03-23T02:12:22Z

    PREP: use Vector[Int] to store binnedFeature

commit 59eb779a9d4f711e7b28d31d579cc49e3d3cc370
Author: 颜发才（Yan Facai） <facai....@gmail.com>
Date:   2017-03-23T03:50:14Z

    CLN: change binnedFeatures from def to val

commit 9cbe577b408e987f3026d01316f5a7f2d4c5cfb2
Author: 颜发才（Yan Facai） <facai....@gmail.com>
Date:   2017-03-28T00:57:42Z

    CLN: use filter to select non-zero bits

commit b5b0dc8683b6e2d7d274aa8d39932dec61e6193d
Author: 颜发才（Yan Facai） <facai....@gmail.com>
Date:   2017-03-28T01:03:55Z

    BUG: fix, compile fails

commit cf7e3d8e03f73df725336d0d5a9dd6cc16e7bf95
Author: Yan Facai (颜发才) <facai....@gmail.com>
Date:   2017-07-05T05:42:09Z

    Merge branch 'master' into ENH/use_sparsity_in_decision_tree

commit 032d50d8c8a851671ba2754cec817d0f6e9ae70f
Author: Yan Facai (颜发才) <facai....@gmail.com>
Date:   2017-07-05T06:20:38Z

    CLN: use BSV in predictImpl

commit 257ddf773eb47499962d6cc57fd1323324dd4ab8
Author: Yan Facai (颜发才) <facai....@gmail.com>
Date:   2017-07-05T06:42:24Z

    ENH: create subclass TreeSparsePoint

commit 8a919735f9474283d263df78feb2e176f66917f3
Author: Yan Facai (颜发才) <facai....@gmail.com>
Date:   2017-07-05T06:58:54Z

    ENH: use TreeDensePoint when numFeatures < 10000

----
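
For readers skimming the commit list, the dense/sparse split it describes might look roughly like the sketch below. This is only an illustration under stated assumptions, not the code in the patch: the names `TreeDensePoint` and `TreeSparsePoint` and the 10000-feature threshold come from the commit messages, while the fields, the `fromBins` helper, and the convention that an unstored feature falls into bin 0 are hypothetical.

```scala
// Hypothetical sketch; class names follow the commit messages, everything else is assumed.
sealed trait TreePoint {
  def label: Double
  def numFeatures: Int
  def binnedFeature(featureIndex: Int): Int
}

// Dense layout: one bin index stored per feature.
final case class TreeDensePoint(label: Double, bins: Array[Int]) extends TreePoint {
  def numFeatures: Int = bins.length
  def binnedFeature(featureIndex: Int): Int = bins(featureIndex)
}

// Sparse layout: only non-zero bin indices are stored, sorted by feature index.
// Assumption: a feature that is not stored belongs to bin 0.
final case class TreeSparsePoint(
    label: Double,
    numFeatures: Int,
    indices: Array[Int],
    values: Array[Int]) extends TreePoint {
  def binnedFeature(featureIndex: Int): Int = {
    val pos = java.util.Arrays.binarySearch(indices, featureIndex)
    if (pos >= 0) values(pos) else 0
  }
}

object TreePoint {
  // Threshold taken from the commit "use TreeDensePoint when numFeatures < 10000".
  val DenseThreshold = 10000

  // Hypothetical helper: pick a layout from an already-binned feature array.
  def fromBins(label: Double, bins: Array[Int]): TreePoint =
    if (bins.length < DenseThreshold) {
      TreeDensePoint(label, bins)
    } else {
      val nonZero = bins.indices.filter(i => bins(i) != 0).toArray
      TreeSparsePoint(label, bins.length, nonZero, nonZero.map(i => bins(i)))
    }
}
```

With this layout, a point with 20000 features and two non-zero bins stores two index/value pairs instead of 20000 integers, at the cost of a binary search per lookup. Whether that trade-off pays off is what the performance benchmarks mentioned in the description would need to show.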