[GitHub] spark issue #17383: [SPARK-3165][MLlib] DecisionTree use sparsity in data

2017-09-26 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17383 Hi, since the work has been pending for a long time, I took a review myself. After careful review: as SparseVector is a compressed sparse row format, the only benefit of the PR would be
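Not part of the thread, but for readers unfamiliar with the layout being discussed: a sparse vector stores only the non-zero entries as parallel indices/values arrays. This is a minimal plain-Python sketch in the spirit of MLlib's `SparseVector`, not the actual Spark class.

```python
def to_sparse(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values

# Only 2 of the 5 entries are stored explicitly.
size, idx, vals = to_sparse([0.0, 3.0, 0.0, 0.0, 5.0])
# size == 5, idx == [1, 4], vals == [3.0, 5.0]
```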

[GitHub] spark issue #17383: [SPARK-3165][MLlib] DecisionTree use sparsity in data

2017-09-08 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17383 Sure, @WeichenXu123, perhaps one or two weeks later; is that OK? By the way, I think using a sparse representation can only reduce memory usage, and it comes at the cost of compute performance.
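A hedged illustration of the memory-vs-compute trade-off mentioned above: a sparse layout stores roughly `2 * nnz` numbers instead of `n`, but random access needs a binary search over the index array instead of direct O(1) indexing. The function name is illustrative, not from the PR.

```python
import bisect

def sparse_get(indices, values, i):
    """O(log nnz) element lookup in an (indices, values) sparse vector."""
    pos = bisect.bisect_left(indices, i)
    if pos < len(indices) and indices[pos] == i:
        return values[pos]
    return 0.0  # implicit zero: not stored at all

# Stored entries are found; everything else is an implicit 0.0.
present = sparse_get([1, 4], [3.0, 5.0], 4)   # 5.0
missing = sparse_get([1, 4], [3.0, 5.0], 2)   # 0.0
```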

[GitHub] spark issue #17383: [SPARK-3165][MLlib] DecisionTree use sparsity in data

2017-09-07 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17383 @facaiy So can you do a benchmark first (by generating random test data)? Then we can see how much this speeds things up.
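A rough sketch of the kind of benchmark being suggested: generate random vectors at a chosen sparsity and time an operation in each representation. All names and numbers here are illustrative assumptions, not code from the PR; a real benchmark would run both dense and sparse variants on Spark itself.

```python
import random
import time

def random_dense(n, sparsity, rng):
    """Random dense vector where roughly `sparsity` of the entries are zero."""
    return [0.0 if rng.random() < sparsity else rng.uniform(0.1, 1.0)
            for _ in range(n)]

def dot_dense(a, b):
    return sum(x * y for x, y in zip(a, b))

rng = random.Random(42)  # fixed seed for a reproducible run
a = random_dense(10_000, 0.9, rng)
b = random_dense(10_000, 0.9, rng)

start = time.perf_counter()
dot_dense(a, b)
elapsed = time.perf_counter() - start  # compare against a sparse variant
```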

[GitHub] spark issue #17383: [SPARK-3165][MLlib] DecisionTree use sparsity in data

2017-09-06 Thread facaiy
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17383 Thank you for the comment. Very good questions; at least for me, the answer to both is no. In most cases, we feed dense raw data into tree models. However, if large dimensions are required,