[GitHub] spark pull request #23271: [SPARK-26318][SQL] Enhance function merge perform...

2018-12-10 Thread KyleLi1985
GitHub user KyleLi1985 opened a pull request: https://github.com/apache/spark/pull/23271 [SPARK-26318][SQL] Enhance function merge performance in Row ## What changes were proposed in this pull request? Enhance function merge performance in Row Like do 1 time

[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...

2018-11-29 Thread KyleLi1985
Github user KyleLi1985 commented on a diff in the pull request: https://github.com/apache/spark/pull/23126#discussion_r237551217 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala --- @@ -128,6 +128,82 @@ class RowMatrix @Since("

[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...

2018-11-29 Thread KyleLi1985
Github user KyleLi1985 commented on a diff in the pull request: https://github.com/apache/spark/pull/23126#discussion_r237532703 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala --- @@ -128,6 +128,82 @@ class RowMatrix @Since("

[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...

2018-11-29 Thread KyleLi1985
Github user KyleLi1985 commented on a diff in the pull request: https://github.com/apache/spark/pull/23126#discussion_r237505113 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala --- @@ -128,6 +128,69 @@ class RowMatrix @Since("

[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...

2018-11-27 Thread KyleLi1985
Github user KyleLi1985 commented on a diff in the pull request: https://github.com/apache/spark/pull/23126#discussion_r236927771 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/RowMatrixSuite.scala --- @@ -266,6 +266,16 @@ class RowMatrixSuite extends

[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...

2018-11-27 Thread KyleLi1985
Github user KyleLi1985 commented on a diff in the pull request: https://github.com/apache/spark/pull/23126#discussion_r236927721 --- Diff: mllib/src/test/java/org/apache/spark/ml/feature/JavaPCASuite.java --- @@ -67,7 +66,7 @@ public void testPCA() { JavaRDD dataRDD

[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...

2018-11-27 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/23126 Added a test case to RowMatrixSuite for this PR. The breeze output is the 2x2 matrix [6.711333870761802E-11 -3.833375461575691E-12; -3.833375461575691E-12 2.916662578525011E-12] Before

[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...

2018-11-27 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/23126 align JavaPCASuite expected data process behavior with PCA function fit --- - To unsubscribe, e-mail: reviews-unsubscr

[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...

2018-11-26 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/23126 Ok, I will do it later

[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...

2018-11-26 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/23126 Um, the unit tests in Spark do indeed cover both cases. But there is a function closeToZero that handles the accuracy problem, so

[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...

2018-11-26 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/23126 Sure, the test cases include the sparse and dense cases. We ran these cases again for the new commit, using data from http://archive.ics.uci.edu/ml/datasets/EEG+Steady-State+Visual+Evoked

[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...

2018-11-26 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/23126 It would be better, update the commit

[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...

2018-11-23 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/23126 Plus, we did some more tests on real data after adding this commit. We use data from http://archive.ics.uci.edu/ml/datasets/EEG+Steady-State+Visual+Evoked+Potential+Signals and data

[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...

2018-11-23 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/23126 After adding this commit, we get the following result for the RowMatrix computeCovariance function. For the input data 1.0,2.0,3.0,4.0,5.0 2.0,3.0,1.0,2.0,6.0 RowMatrix function
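
For readers following the covariance discussion, here is a minimal sketch of why a covariance routine can lose accuracy. It assumes the problem is the familiar cancellation in the one-pass formula (Gramian minus scaled product of means); the thread snippets do not confirm that this is exactly what RowMatrix does. The function names `covariance_one_pass` and `covariance_centered` are hypothetical, and the code is plain Python rather than Spark code. Each inner list is treated as one observation (row), which is an assumption about how the input data above is laid out.

```python
def column_means(rows):
    """Mean of each column, where every row is one observation."""
    n = len(rows)
    d = len(rows[0])
    return [sum(r[j] for r in rows) / n for j in range(d)]


def covariance_one_pass(rows):
    """Sample covariance as sum(x_i * x_j)/(n-1) - n/(n-1) * mean_i * mean_j.

    Assumed formulation for illustration: it subtracts two large,
    nearly equal numbers when the data sits far from zero.
    """
    n = len(rows)
    d = len(rows[0])
    means = column_means(rows)
    cov = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            gram = sum(r[i] * r[j] for r in rows)
            cov[i][j] = gram / (n - 1) - n / (n - 1) * means[i] * means[j]
    return cov


def covariance_centered(rows):
    """Sample covariance computed on mean-centered data (two passes)."""
    n = len(rows)
    d = len(rows[0])
    means = column_means(rows)
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
```

On small, well-scaled inputs the two functions agree. On data with a large offset (for example two observations near 1e9 that differ by 1.0), the one-pass version can return a value far from the true variance while the centered version stays accurate.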

[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...

2018-11-23 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/23126 Comparing Spark's computeCovariance function in RowMatrix for DenseVector with NumPy's cov function, we find two problems. Below is the result: 1) The Spark function computeCovariance

[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...

2018-11-23 Thread KyleLi1985
GitHub user KyleLi1985 opened a pull request: https://github.com/apache/spark/pull/23126 [SPARK-26158] [MLLIB] fix covariance accuracy problem for DenseVector ## What changes were proposed in this pull request? Enhance accuracy of the covariance logic in RowMatrix for function

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-14 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 > Thanks @KyleLi1985 this looks like a nice win in the end. Thanks for your investigation. @srowen @HyukjinKwon @mgaido91 Thanks for review. It is my pleas

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-10 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 @SparkQA retest this please

[GitHub] spark pull request #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmea...

2018-11-10 Thread KyleLi1985
Github user KyleLi1985 commented on a diff in the pull request: https://github.com/apache/spark/pull/22893#discussion_r232457128 --- Diff: python/pyspark/ml/clustering.py --- @@ -88,6 +88,14 @@ def clusterSizes(self): """ return

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-09 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 @SparkQA test this please

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-09 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 It seems the related file spark/python/pyspark/ml/clustering.py has been changed in the last few days. My latest local commit is still at "bfe60fc on 30 Jul", so I need to re-fork spar

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-09 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 @AmplabJenkins test this please

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-08 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 I put together the final test cases for the sparse and dense cases on realistic data to test the new commit [SparkMLlibTest.txt](https://github.com/apache/spark/files/2561442/SparkMLlibTest.txt

[GitHub] spark pull request #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmea...

2018-11-08 Thread KyleLi1985
Github user KyleLi1985 commented on a diff in the pull request: https://github.com/apache/spark/pull/22893#discussion_r231838390 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala --- @@ -521,19 +521,21 @@ object MLUtils extends Logging { * The bound

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-03 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 > OK, the Spark part doesn't seem relevant. The input might be more realistic here, yes. I was commenting that your test code doesn't show what you're testing, though I understand you manua

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-03 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 > So the pull request right now doesn't reflect what you tested, but you tested the version pasted above. You're saying that the optimization just never helps the dense-dense case, and sqd

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-02 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 > Hm, actually that's the best case. You're exercising the case where the code path you prefer is fast. And the case where the precision bound applies is exactly the case where the branch
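
The "precision bound" and "branch" mentioned in this exchange can be illustrated with a small sketch. It assumes the technique under discussion is a norm-based squared distance (||a||^2 + ||b||^2 - 2 a.b) that falls back to a direct element-wise computation when rounding error could dominate. The bound formula, the default `precision`, and the names `norm` and `fast_squared_distance` are assumptions chosen for illustration and are not taken from the thread; this is plain Python, not the MLUtils code.

```python
import math
import sys

# Machine epsilon for IEEE doubles; using it in the bound is an assumption.
EPSILON = sys.float_info.epsilon


def norm(v):
    """Euclidean norm of a dense vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def fast_squared_distance(v1, norm1, v2, norm2, precision=1e-6):
    """Squared Euclidean distance using precomputed norms when it is safe.

    If the estimated relative error of the norm-based formula is below
    `precision`, only a dot product is needed. Otherwise the vectors are
    nearly equal in norm, cancellation is likely, and the distance is
    computed directly from the element-wise differences.
    """
    sum_squared_norm = norm1 * norm1 + norm2 * norm2
    norm_diff = norm1 - norm2
    bound = 2.0 * EPSILON * sum_squared_norm / (norm_diff * norm_diff + EPSILON)
    if bound < precision:
        dot = sum(a * b for a, b in zip(v1, v2))
        return sum_squared_norm - 2.0 * dot
    return sum((a - b) ** 2 for a, b in zip(v1, v2))
```

Under these assumptions, the shortcut is taken when the two norms differ clearly, and the fallback runs when the points are close together. That is the situation the quoted comment describes: the case where the bound matters is also the case where the slower branch is taken.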

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-01 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 > Hm, actually that's the best case. You're exercising the case where the code path you prefer is fast. And the case where the precision bound applies is exactly the case where the branch

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-01 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 > Hm, actually that's the best case. You're exercising the case where the code path you prefer is fast. And the case where the precision bound applies is exactly the case where the branch

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-01 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 Here is my test for the sparse-sparse, dense-dense, and sparse-dense cases: ` import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.mllib.linalg

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-01 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 > I don't think BLAS matters here as these are all vector-vector operations and f2jblas is used directly (i.e. stays in the JVM). > > Are all the vectors dense? I suppose

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-01 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 > then I think you have to try with native BLAS installed, otherwise the results are not valid IMHO. This part only uses F2J for the calculation, as I said in my last comment, so the performa

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-01 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 > I don't think BLAS matters here as these are all vector-vector operations and f2jblas is used directly (i.e. stays in the JVM). > > Are all the vectors dense? I suppose

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-10-31 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 > then I think you have to try with native BLAS installed, otherwise the results are not valid IMHO. OK, for a fair result, I will

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-10-31 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 > @KyleLi1985 do you have native BLAS installed? As the code says: // For level-1 routines, we use Java implementat

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-10-31 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 End-to-end test situation: use the code below to test ` test("kmeanproblem") { val rdd = sc .textFile("/Users/liliang/Desktop/inputdata.txt"

[GitHub] spark pull request #22893: One part of Spark MLlib Kmean Logic Performance p...

2018-10-30 Thread KyleLi1985
GitHub user KyleLi1985 opened a pull request: https://github.com/apache/spark/pull/22893 One part of Spark MLlib Kmean Logic Performance problem [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic Performance problem ## What changes were proposed in this pull request