[GitHub] spark issue #14937: [SPARK-8519][SPARK-11560] [ML] [MLlib] Optimize KMeans i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14937

Build finished. Test FAILed.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14937

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85510/
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14937

**[Test build #85510 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85510/consoleFull)** for PR 14937 at commit [`1c31cda`](https://github.com/apache/spark/commit/1c31cda0f78b8c2b11406d76da447e9b3216a97d).
* This patch **fails PySpark unit tests**.
* This patch **does not merge cleanly**.
* This patch adds no public classes.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14937

**[Test build #85510 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85510/consoleFull)** for PR 14937 at commit [`1c31cda`](https://github.com/apache/spark/commit/1c31cda0f78b8c2b11406d76da447e9b3216a97d).
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/14937

@sethah Yeah, I agree it's better to run more tests against large-scale data. If the number of features or clusters is large, the cost of slicing the center array (and a few other spots) can be optimized; I did not pay much attention to those. And we definitely should really understand the performance test results, so feel free to share yours. When I did this optimization, we found `KMeans` was usually used when the number of features is not too large. If users have high-dimensional data, they usually reduce the feature dimension with `PCA`, `LDA`, or similar algorithms and then feed the result into `KMeans` for clustering. So the optimization should focus more on data that is not very high-dimensional if we cannot guarantee better performance in all cases. That said, it would be great if we can figure out a way that benefits both cases. Thanks.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
Github user sethah commented on the issue: https://github.com/apache/spark/pull/14937

@yanboliang I ran the test. The master branch runs in 10 seconds and the current patch runs in 6 seconds. Still, on such a small dataset the results are meaningless in my opinion. I also ran both branches at larger scale: in one case the master branch takes ~20 seconds per iteration while this patch takes 10 minutes. I traced it down to the way the data is being copied. Could you also run tests at scale to verify this?

With some refactoring, I ran some very preliminary tests (data size approximately 100 GB, with 100-1k clusters) and saw that this branch improves performance in some cases and degrades it in others. We need to test this at scale to really understand the implications, I think. I will try to summarize my results sometime in the next week. I think we will see performance gains when the number of features/clusters is large.
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/14937

@sethah You can try the following piece of code, even on a single node:

```scala
import org.apache.spark.ml.clustering.KMeans

val dataset = spark.read.format("libsvm")
  .options(Map("vectorType" -> "dense"))
  .load("/Users/yliang/Downloads/libsvm/combined")
val kmeans = new KMeans().setK(3).setSeed(1L).setTol(1E-16).setMaxIter(100).setInitMode("random")
val model = kmeans.fit(dataset)
```

You can find the dataset at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html . I ran it against master and against this PR; the two spend different amounts of time and converge in different numbers of iterations.

Before this PR (master code):
```
Time: 32.076 seconds. Iteration number: 35.
```
After this PR:
```
Time: 16.322 seconds. Iteration number: 85.
```

I think the value of `tol` is not set properly, which causes the two implementations to converge after different numbers of iterations. We could use a more robust dataset, or force each implementation to run a fixed number of iterations and compare the time spent, but we can still get some sense from this result. Please feel free to try this test in your environment, and let me know whether it can be reproduced. Thanks.
Github user sethah commented on the issue: https://github.com/apache/spark/pull/14937

A small update: I have run a few tests on a refactored version of this patch which avoids some data copying. I have found at least one case where the current patch is faster, but many where it is not. I'll try to post formal results at some point. (All test cases used dense data, btw.)

In the meantime, it would be helpful to have more detail about the tests above. They are rather small datasets. How many centers were used? How were the timings observed? Thanks!
Github user sethah commented on the issue: https://github.com/apache/spark/pull/14937

@yanboliang I ran some tests on a 3-node bare-metal cluster (144 cores, 384 GB RAM) on some dense synthetic data. I installed OpenBLAS customized for the hardware on the nodes (I can confirm it's successfully using NativeBLAS; I'm not positive it's optimized, though). With this patch as-is, I was at first seeing something like 10-minute iteration times compared to ~30 seconds on the master branch. After refactoring the code to avoid some copying, I am still seeing about a 3-5x slowdown with this approach. I am still working through some of the timings and I haven't done a lot of experimentation with the block size. I will give more details at some point. For now, I can point out that copying the center in [here](https://github.com/yanboliang/spark/blob/1c31cda0f78b8c2b11406d76da447e9b3216a97d/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L379) seems to have a huge impact.
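To illustrate the cost being discussed: below is a plain-Scala sketch (illustrative names only, not the actual `KMeans.scala` code) of finding the nearest center over a packed centers array, once with a per-iteration `slice` copy like the line referenced above, and once indexing into the packed array directly. The allocation in the first version happens once per (point, center) pair inside the innermost loop, which is where the slowdown was traced to.

```scala
object SliceCost {
  // Hypothetical layout: k centers of dimension d packed row-major into one
  // flat array. Names and layout are assumptions for illustration.

  // Version that copies each center out with slice() before measuring the
  // distance: one fresh Array[Double] allocation per (point, center) pair.
  def nearestWithSlice(point: Array[Double], centers: Array[Double], k: Int, d: Int): Int = {
    var best = 0
    var bestDist = Double.MaxValue
    var j = 0
    while (j < k) {
      val center = centers.slice(j * d, (j + 1) * d) // allocates a copy every iteration
      var dist = 0.0
      var i = 0
      while (i < d) { val diff = point(i) - center(i); dist += diff * diff; i += 1 }
      if (dist < bestDist) { bestDist = dist; best = j }
      j += 1
    }
    best
  }

  // Same computation indexing into the packed array via an offset:
  // no per-iteration allocation at all.
  def nearestNoCopy(point: Array[Double], centers: Array[Double], k: Int, d: Int): Int = {
    var best = 0
    var bestDist = Double.MaxValue
    var j = 0
    while (j < k) {
      val off = j * d
      var dist = 0.0
      var i = 0
      while (i < d) { val diff = point(i) - centers(off + i); dist += diff * diff; i += 1 }
      if (dist < bestDist) { bestDist = dist; best = j }
      j += 1
    }
    best
  }
}
```

Both versions compute the same nearest index; only the allocation behavior differs, which matters when the loop runs billions of times per iteration at scale.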
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/14937

@sethah I think the test result can be reproduced against the current patch; however, there are two issues that should be considered:
* Make sure you installed an optimized/native BLAS on your system and loaded it correctly in the JVM via netlib-java. Otherwise, it will fall back to the Java implementation.
* Make sure you load the dataset as DenseVectors, which will be converted into a DenseMatrix and get the performance improvement:

```scala
val df = spark.read.format("libsvm").options(Map("vectorType" -> "dense")).load(path)
```

Spark loads libsvm-format datasets into SparseVector/SparseMatrix by default, and the code will then fall into the sparse-data branch, which causes a huge performance degradation. Could you share some of your test details? If you have already handled the above two tips correctly, please let me know as well. I'm on a business travel and will resolve the merge conflicts in a few days. I would really appreciate hearing your thoughts about this issue. Thanks.
Github user sethah commented on the issue: https://github.com/apache/spark/pull/14937

@yanboliang I began to run some performance tests on this patch today. With the patch as it is, I am seeing a huge performance **_degradation_**. The most critical reason is the slicing (copying) of the centers array inside the innermost while loop. The reason I ask is that I don't see how the results posted in this PR could even occur against the current patch. Were those from an older version? I know this PR has gone through several iterations, so I'm just trying to get a sense of where those results came from. It would be great if we could resolve the merge conflicts and start moving the review along. I'm happy to help :)
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/14937

@srowen Please feel free to send that PR. This PR involves some significant changes and should be discussed carefully; it may not be merged very quickly. Thanks!
Github user srowen commented on the issue: https://github.com/apache/spark/pull/14937

@yanboliang would it be useful if I worked on a PR to just remove `runs`? I had started that already. But I don't want to cause a big merge conflict for you if you're going to update this and merge it soon.
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/14937

@srowen Yes, I'm working on this. You can see the performance test results in the PR description. The optimized k-means gets performance improvements of about 2-4x by using native BLAS level-3 matrix-matrix multiplications for dense input. However, we saw performance degradation for sparse input: for example, the new implementation spent almost twice as much time as the old one when training a k-means model on the famous MNIST dataset. Given the current performance test results, I think we should only apply this optimization for dense input and let sparse input still run the old code. I have sent the performance test results to @mengxr and am waiting for his opinion. I would also appreciate your thoughts and suggestions. Thanks!
Github user srowen commented on the issue: https://github.com/apache/spark/pull/14937

@yanboliang are you still working on this? It seems like an important change; I'd love to help get it in.