[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2018-06-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17000 Can one of the admins verify this patch?

[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2018-05-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17000 Can one of the admins verify this patch?

[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2018-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17000 Can one of the admins verify this patch?

[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-12-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17000 Can one of the admins verify this patch?

[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-11-06 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17000 @MLnick It looks like VF-LBFGS has a different scenario. In VF algos, the vectors will be too large to store in driver memory, so we slice the vectors into different machines (stored by `RDD[V

[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-03-12 Thread ZunwenYou
Github user ZunwenYou commented on the issue: https://github.com/apache/spark/pull/17000 ping @yanboliang, please have a look at this improvement.

[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-02-23 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17000 cc @yanboliang - it seems actually similar in effect to the VL-BFGS work with RDD-based coefficients?

[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-02-23 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17000 I'm not totally certain there will be a huge benefit from porting the vector summary to the UDAF framework. But there are API-level benefits to doing so. Perhaps there is a way to incorporate the `sliceAg

[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-02-23 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17000 @ZunwenYou yes I understand that the `sliceAggregate` is different from SPARK-19634 and more comparable to `treeAggregate`. But I'm not sure, if we plan to port the vector summary to use `DataFrame`

[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-02-21 Thread ZunwenYou
Github user ZunwenYou commented on the issue: https://github.com/apache/spark/pull/17000 Hi, @MLnick Firstly, `sliceAggregate` is a common aggregate for array-like data. Besides the `MultivariateOnlineSummarizer` case, it can be used in many large-scale machine learning cases. I chose `Mu

[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-02-20 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17000 Is the speedup coming mostly from the `MultivariateOnlineSummarizer` stage? See https://issues.apache.org/jira/browse/SPARK-19634 which is for porting this operation to use DataFrame UDAF and

[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-02-20 Thread ZunwenYou
Github user ZunwenYou commented on the issue: https://github.com/apache/spark/pull/17000 Hi, @hhbyyh In our experiment, the class **_MultivariateOnlineSummarizer_** contains 8 arrays; if the dimension reaches 20 million, the memory footprint of MultivariateOnlineSummarizer is 1280M (8B
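The 1280M figure quoted above can be checked with a short calculation. This is only a sketch: it assumes each of the 8 arrays holds one 8-byte value per feature dimension, and the function name is hypothetical, not part of Spark.

```python
# Assumption (not confirmed by the truncated comment): each of the 8 arrays
# stores one 8-byte value per feature dimension.
NUM_ARRAYS = 8
DIMENSION = 20_000_000
BYTES_PER_VALUE = 8


def summarizer_bytes(num_arrays, dimension, bytes_per_value=BYTES_PER_VALUE):
    """Estimate the raw array payload of one summarizer instance, in bytes."""
    return num_arrays * dimension * bytes_per_value


total_bytes = summarizer_bytes(NUM_ARRAYS, DIMENSION)
total_mb = total_bytes // 10**6  # decimal megabytes
print(total_mb, "MB")
```

Under these assumptions, every partial aggregate exchanged between executors carries a payload of this size, which is the cost the thread is concerned about.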

[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-02-20 Thread ZunwenYou
Github user ZunwenYou commented on the issue: https://github.com/apache/spark/pull/17000 Hi, @MLnick You are right, sliceAggregate splits an array into smaller chunks before shuffle. It has three advantages. Firstly, it shuffles less data than treeAggregate during the whol
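The idea of splitting an array into chunks before the shuffle can be illustrated without Spark. The sketch below is not the code from the pull request; `slice_array` and `slice_aggregate` are hypothetical names, plain Python lists stand in for per-partition arrays, and element-wise addition stands in for the combine step.

```python
def slice_array(values, num_slices):
    """Split a list into contiguous chunks, each tagged with its slice index."""
    size = -(-len(values) // num_slices)  # ceiling division
    return [(i, values[i * size:(i + 1) * size]) for i in range(num_slices)]


def slice_aggregate(partitions, num_slices):
    """Sum equally sized arrays element-wise, one slice at a time."""
    merged = {}
    # "Map side": every partition emits (slice index, chunk) pairs.
    for part in partitions:
        for idx, chunk in slice_array(part, num_slices):
            if idx in merged:
                # "Reduce side": chunks sharing an index are combined.
                merged[idx] = [a + b for a, b in zip(merged[idx], chunk)]
            else:
                merged[idx] = list(chunk)
    # "Driver": concatenate the reduced chunks in slice order.
    result = []
    for idx in sorted(merged):
        result.extend(merged[idx])
    return result


partitions = [[1.0, 2.0, 3.0, 4.0, 5.0], [10.0, 20.0, 30.0, 40.0, 50.0]]
total = slice_aggregate(partitions, num_slices=2)
print(total)
```

In a distributed setting, each slice index could be handled by a different reducer, so no single task would need to receive a whole array from every partition.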

[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-02-20 Thread hhbyyh
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/17000 Hi @ZunwenYou Do you know why treeAggregate failed when the feature dimension reached 20 million? I think this could potentially help with the 2G disk shuffle spill limit. (to be ver

[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-02-20 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17000 Just to be clear - this is essentially just splitting an array up into smaller chunks so that overall communication is more efficient? It would be good to look at why Spark is not doing a good job wi

[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17000 Can one of the admins verify this patch?