GitHub user KyleLi1985 opened a pull request:
https://github.com/apache/spark/pull/23271
[SPARK-26318][SQL] Enhance function merge performance in Row
## What changes were proposed in this pull request?
Enhance the performance of the merge function in Row, e.g. by doing the work one time
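A minimal sketch of that single-pass idea (my illustration, not the actual patch; mergeRows is a hypothetical helper, while the real change is in Row.merge):

```scala
import org.apache.spark.sql.Row

// Hypothetical single-pass merge: allocate the result array once and
// copy every value into it, instead of concatenating the rows'
// value sequences once per input row.
def mergeRows(rows: Seq[Row]): Row = {
  val values = new Array[Any](rows.map(_.length).sum)
  var offset = 0
  for (row <- rows; i <- 0 until row.length) {
    values(offset) = row.get(i)
    offset += 1
  }
  Row.fromSeq(values.toSeq)
}
```

Copying once avoids building an intermediate sequence for every merged row.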
Github user KyleLi1985 commented on a diff in the pull request:
https://github.com/apache/spark/pull/23126#discussion_r237551217
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
---
@@ -128,6 +128,82 @@ class RowMatrix @Since("
Github user KyleLi1985 commented on a diff in the pull request:
https://github.com/apache/spark/pull/23126#discussion_r237532703
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
---
@@ -128,6 +128,82 @@ class RowMatrix @Since("
Github user KyleLi1985 commented on a diff in the pull request:
https://github.com/apache/spark/pull/23126#discussion_r237505113
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
---
@@ -128,6 +128,69 @@ class RowMatrix @Since("
Github user KyleLi1985 commented on a diff in the pull request:
https://github.com/apache/spark/pull/23126#discussion_r236927771
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/RowMatrixSuite.scala
---
@@ -266,6 +266,16 @@ class RowMatrixSuite extends
Github user KyleLi1985 commented on a diff in the pull request:
https://github.com/apache/spark/pull/23126#discussion_r236927721
--- Diff: mllib/src/test/java/org/apache/spark/ml/feature/JavaPCASuite.java
---
@@ -67,7 +66,7 @@ public void testPCA() {
JavaRDD<Vector> dataRDD
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/23126
Added a test case in RowMatrixSuite for this PR.
The Breeze output is
6.711333870761802E-11 -3.833375461575691E-12
-3.833375461575691E-12 2.916662578525011E-12
Before
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/23126
Align the expected-data processing in JavaPCASuite with the behavior of the PCA fit function.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/23126
Ok, I will do it later
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/23126
Um, the unit tests in Spark do indeed cover both cases, but there is a
closeToZero function to handle the accuracy problem, so
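For context, such a tolerance check can be sketched like this (my illustration of the idea, not the suite's actual helper; the epsilon is an assumption):

```scala
import breeze.linalg.{DenseMatrix => BDM}

// Treat a matrix as "close to zero" when its largest absolute entry
// falls below a small tolerance, rather than demanding exact equality.
def closeToZero(m: BDM[Double], eps: Double = 1e-8): Boolean =
  m.valuesIterator.map(math.abs).max < eps
```

Comparing an observed matrix against an expected one then reduces to closeToZero(observed - expected).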
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/23126
Sure, the test cases include both the sparse and the dense case.
I ran these cases again for the new commit;
we used data from
http://archive.ics.uci.edu/ml/datasets/EEG+Steady-State+Visual+Evoked
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/23126
That would be better; I will update the commit.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/23126
Plus, I did some more tests on real data after adding this commit.
We used data from
http://archive.ics.uci.edu/ml/datasets/EEG+Steady-State+Visual+Evoked+Potential+Signals
and data
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/23126
After adding this commit,
we get the following result from the RowMatrix computeCovariance function.
For the input data
1.0,2.0,3.0,4.0,5.0
2.0,3.0,1.0,2.0,6.0
RowMatrix function
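For reference, feeding that input to computeCovariance looks roughly like this (a sketch; the existing SparkContext sc and the row layout are assumptions on my part):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Build a RowMatrix from the two observation rows above and compute
// the sample covariance of the five columns.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0, 4.0, 5.0),
  Vectors.dense(2.0, 3.0, 1.0, 2.0, 6.0)))
val cov = new RowMatrix(rows).computeCovariance()
println(cov)
```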
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/23126
Comparing Spark's computeCovariance function in RowMatrix for DenseVector with
NumPy's cov function,
I found two problems; the results are below:
1) The Spark function computeCovariance
GitHub user KyleLi1985 opened a pull request:
https://github.com/apache/spark/pull/23126
[SPARK-26158] [MLLIB] fix covariance accuracy problem for DenseVector
## What changes were proposed in this pull request?
Enhance accuracy of the covariance logic in RowMatrix for function
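As I understand the problem, the shortcut form Cov = G/(n-1) minus a scaled outer product of the mean cancels catastrophically when values are large relative to their spread; centering first is the numerically stable form. A small local sketch of that stable form (plain arrays, not the RowMatrix patch itself):

```scala
// Stable covariance: subtract the column means first, then accumulate
// the centered cross products. Rows are observations, columns features.
def stableCovariance(data: Array[Array[Double]]): Array[Array[Double]] = {
  val n = data.length
  val d = data.head.length
  val mean = Array.tabulate(d)(j => data.map(_(j)).sum / n)
  val centered = data.map(row => Array.tabulate(d)(j => row(j) - mean(j)))
  Array.tabulate(d, d) { (i, j) =>
    centered.map(r => r(i) * r(j)).sum / (n - 1)
  }
}
```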
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
> Thanks @KyleLi1985 this looks like a nice win in the end. Thanks for your
investigation.
@srowen @HyukjinKwon @mgaido91 Thanks for the review. It is my pleasure.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
@SparkQA retest this please
Github user KyleLi1985 commented on a diff in the pull request:
https://github.com/apache/spark/pull/22893#discussion_r232457128
--- Diff: python/pyspark/ml/clustering.py ---
@@ -88,6 +88,14 @@ def clusterSizes(self):
"""
return
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
@SparkQA test this please
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
It seems the related file spark/python/pyspark/ml/clustering.py has been
changed in recent days. My latest local commit is still "bfe60fc on 30
Jul", so I need to re-fork Spark
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
@AmplabJenkins test this please
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
I put together the final test cases for the sparse and dense cases on realistic data
to test the new commit:
[SparkMLlibTest.txt](https://github.com/apache/spark/files/2561442/SparkMLlibTest.txt
Github user KyleLi1985 commented on a diff in the pull request:
https://github.com/apache/spark/pull/22893#discussion_r231838390
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala ---
@@ -521,19 +521,21 @@ object MLUtils extends Logging {
* The bound
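For context, this hunk is in MLUtils' fast squared-distance code, which guards a cheap dot-product identity with a precision bound. A self-contained sketch of the idea (my paraphrase; the bound expression here is a rough stand-in for Spark's exact formula):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b is cheap but cancels badly
// when a and b are nearly equal, so fall back to the exact computation
// when the estimated relative error exceeds `precision`.
def guardedSqDist(a: Vector, b: Vector, precision: Double = 1e-6): Double = {
  val (aArr, bArr) = (a.toArray, b.toArray)
  val aSq = aArr.map(x => x * x).sum
  val bSq = bArr.map(x => x * x).sum
  val dot = aArr.zip(bArr).map { case (x, y) => x * y }.sum
  val fast = aSq + bSq - 2.0 * dot
  val eps = 2.220446049250313e-16                     // double machine epsilon
  val bound = 2.0 * eps * (aSq + bSq) / (fast + eps)  // rough error estimate
  if (bound < precision) fast else Vectors.sqdist(a, b)
}
```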
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
> OK, the Spark part doesn't seem relevant. The input might be more
realistic here, yes. I was commenting that your test code doesn't show what
you're testing, though I understand you manua
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
> So the pull request right now doesn't reflect what you tested, but you
tested the version pasted above. You're saying that the optimization just never
helps the dense-dense case, and sqd
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
> Hm, actually that's the best case. You're exercising the case where the
code path you prefer is fast. And the case where the precision bound applies is
exactly the case where the branch
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
Here is my test for the sparse-sparse, dense-dense, and sparse-dense cases:
`
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg
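The quoted test is cut off by the archive; a self-contained sketch in the same spirit (vector size, sparsity, and iteration count are my assumptions) times Vectors.sqdist for the three combinations:

```scala
import scala.util.Random

import org.apache.spark.mllib.linalg.{Vector, Vectors}

object SqDistBench {
  def main(args: Array[String]): Unit = {
    val rnd = new Random(42)
    val n = 10000
    val dense1 = Vectors.dense(Array.fill(n)(rnd.nextDouble()))
    val dense2 = Vectors.dense(Array.fill(n)(rnd.nextDouble()))
    // Roughly 1% non-zeros at evenly spaced indices.
    val idx = (0 until n by 100).toArray
    val sparse1 = Vectors.sparse(n, idx, idx.map(_ => rnd.nextDouble()))
    val sparse2 = Vectors.sparse(n, idx, idx.map(_ => rnd.nextDouble()))

    def time(label: String, a: Vector, b: Vector): Unit = {
      val start = System.nanoTime()
      (1 to 1000).foreach(_ => Vectors.sqdist(a, b))
      println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
    }

    time("sparse-sparse", sparse1, sparse2)
    time("dense-dense", dense1, dense2)
    time("sparse-dense", sparse1, dense2)
  }
}
```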
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
> I don't think BLAS matters here as these are all vector-vector operations
and f2jblas is used directly (i.e. stays in the JVM).
>
> Are all the vectors dense? I suppose
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
> then I think you have to try with native BLAS installed, otherwise the
results are not valid IMHO.
This part only uses f2j for the calculation, as I said in my last comment, so the
performance
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
> then I think you have to try with native BLAS installed, otherwise the
results are not valid IMHO.
OK. For a fair result, I will
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
> @KyleLi1985 do you have native BLAS installed?
As the code comment says: // For level-1 routines, we use the Java implementation
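That comment comes from Spark's mllib BLAS wrapper; the dispatch it describes can be sketched like this (a paraphrase of the pattern, not the actual source):

```scala
import com.github.fommil.netlib.{BLAS => NetlibBLAS, F2jBLAS}

// Level-1 (vector-vector) routines always go through the pure-JVM f2j
// implementation, so installing a native BLAS does not change them;
// only level-2/3 (matrix) routines can reach native code.
object Level1Blas {
  private val f2jBLAS: NetlibBLAS = new F2jBLAS

  // y := a * x + y, computed entirely on the JVM.
  def axpy(a: Double, x: Array[Double], y: Array[Double]): Unit =
    f2jBLAS.daxpy(x.length, a, x, 1, y, 1)
}
```

This is why the results above are unaffected by whether a native BLAS is installed.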
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
End-to-End TEST situation:
use the code below to test
`
test("kmeanproblem") {
  val rdd = sc
    .textFile("/Users/liliang/Desktop/inputdata.txt"
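The quoted test is truncated; a sketch of such an end-to-end run (k, the iteration count, and the timing harness are my assumptions; sc is an existing SparkContext):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse one comma-separated vector per line, then time KMeans training
// and report the clustering cost.
val data = sc
  .textFile("/Users/liliang/Desktop/inputdata.txt")
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .cache()
val start = System.nanoTime()
val model = KMeans.train(data, k = 8, maxIterations = 20)
println(s"cost=${model.computeCost(data)}, " +
  s"time=${(System.nanoTime() - start) / 1e9} s")
```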
GitHub user KyleLi1985 opened a pull request:
https://github.com/apache/spark/pull/22893
[SPARK-25868][MLlib] Fix a performance problem in one part of Spark MLlib KMeans logic
## What changes were proposed in this pull request