Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/17000
@MLnick It looks like VF-LBFGS has a different scenario. In VF algos, the vectors will be too large to store in driver memory, so we slice the vectors across different machines (stored as `RDD[Vector]`).
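To make that layout concrete, here is a rough sketch of slicing a coefficient vector into an `RDD` keyed by slice id. The names here (`sliceVector`, `numSlices`) are illustrative, not the VF-LBFGS code, and a real VF algorithm would build the slices distributedly rather than from a driver-side array:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Illustrative only: distribute a huge coefficient vector as slices so no
// single machine (including the driver) has to hold all of it at once.
def sliceVector(sc: SparkContext, v: Array[Double], numSlices: Int): RDD[(Int, Array[Double])] = {
  val sliceLen = math.ceil(v.length.toDouble / numSlices).toInt
  val slices = (0 until numSlices).map { i =>
    val from = i * sliceLen
    (i, v.slice(from, math.min(from + sliceLen, v.length)))
  }
  sc.parallelize(slices, numSlices)
}
```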
Github user ZunwenYou commented on the issue:
https://github.com/apache/spark/pull/17000
ping @yanboliang, please have a look at this improvement.
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17000
cc @yanboliang - it actually seems similar in effect to the VL-BFGS work with RDD-based coefficients?
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17000
I'm not totally certain there will be a huge benefit from porting the vector summary to the UDAF framework, but there are API-level benefits to doing so. Perhaps there is a way to incorporate the `sliceAggregate` idea there.
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17000
@ZunwenYou yes, I understand that `sliceAggregate` is different from SPARK-19634 and more comparable to `treeAggregate`. But I'm not sure, if we plan to port the vector summary to use the `DataFrame` UDAF framework, how this approach would fit in.
Github user ZunwenYou commented on the issue:
https://github.com/apache/spark/pull/17000
Hi, @MLnick
Firstly, `sliceAggregate` is a common aggregate for array-like data. Besides the `MultivariateOnlineSummarizer` case, it can be used in many large-scale machine learning cases. I chose `MultivariateOnlineSummarizer` as the benchmark case.
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17000
Is the speedup coming mostly from the `MultivariateOnlineSummarizer` stage? See https://issues.apache.org/jira/browse/SPARK-19634, which is for porting this operation to use a DataFrame UDAF.
Github user ZunwenYou commented on the issue:
https://github.com/apache/spark/pull/17000
Hi, @hhbyyh
In our experiment, the class `MultivariateOnlineSummarizer` contains 8 arrays; if the dimension reaches 20 million, a single `MultivariateOnlineSummarizer` takes 1280 MB of memory (8 bytes × 20 million × 8 arrays).
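The arithmetic behind that figure, assuming 8-byte doubles:

```scala
// 8 arrays x 20 million dimensions x 8 bytes per double
val bytes = 8L * 20000000L * 8L   // 1,280,000,000 bytes
val mb    = bytes / 1000000L      // = 1280 MB, matching the figure above
```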
Github user ZunwenYou commented on the issue:
https://github.com/apache/spark/pull/17000
Hi, @MLnick
You are right: `sliceAggregate` splits an array into smaller chunks before the shuffle.
It has three advantages. Firstly, the shuffle data is smaller than `treeAggregate`'s during the whole job.
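A sketch of the scheme as described above (my own reconstruction, not the PR's implementation; the real `sliceAggregate` signature may differ). It assumes `combOp` is elementwise, so it remains valid on aligned slices of the arrays:

```scala
import org.apache.spark.rdd.RDD

def sliceAggregate[T](data: RDD[T], dim: Int, numSlices: Int)(
    seqOp: (Array[Double], T) => Array[Double],
    combOp: (Array[Double], Array[Double]) => Array[Double]): Array[Double] = {
  val sliceLen = math.ceil(dim.toDouble / numSlices).toInt
  val reduced = data
    .mapPartitions { iter =>
      // One full-dimension partial per partition, cut into keyed chunks
      // before the shuffle so no task ships a whole full-size array.
      val partial = iter.foldLeft(new Array[Double](dim))(seqOp)
      (0 until numSlices).iterator.map { i =>
        val from = i * sliceLen
        (i, partial.slice(from, math.min(from + sliceLen, dim)))
      }
    }
    .reduceByKey(combOp, numSlices) // each slice is reduced on its own machine
    .collect()
  // Stitch the reduced slices back into one full-dimension array.
  val result = new Array[Double](dim)
  reduced.foreach { case (i, chunk) =>
    if (chunk.nonEmpty) System.arraycopy(chunk, 0, result, i * sliceLen, chunk.length)
  }
  result
}
```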
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/17000
Hi @ZunwenYou, do you know the reason that `treeAggregate` failed when the feature dimension reaches 20 million?
I think this can potentially help with the 2 GB disk shuffle spill limit (to be verified).
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17000
Just to be clear - this is essentially just splitting an array up into smaller chunks so that overall communication is more efficient? It would be good to look at why Spark is not doing a good job with the large arrays in the first place.
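For comparison, the `treeAggregate` baseline being discussed: every partial result is a whole summarizer over all dimensions, and entire summarizers are shuffled at each tree level. This mirrors how MLlib's column statistics are computed today:

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
import org.apache.spark.rdd.RDD

// Baseline: each partial is a full summarizer (8 arrays of `dim` doubles),
// and whole summarizers are merged and shuffled at every tree level.
def summarize(data: RDD[Vector]): MultivariateOnlineSummarizer =
  data.treeAggregate(new MultivariateOnlineSummarizer)(
    seqOp = (s, v) => s.add(v),
    combOp = (s1, s2) => s1.merge(s2),
    depth = 2)
```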
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17000
Can one of the admins verify this patch?