[GitHub] spark issue #14937: [SPARK-8519][SPARK-11560] [ML] [MLlib] Optimize KMeans i...

2017-12-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14937
  
Build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14937: [SPARK-8519][SPARK-11560] [ML] [MLlib] Optimize KMeans i...

2017-12-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14937
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85510/
Test FAILed.





[GitHub] spark issue #14937: [SPARK-8519][SPARK-11560] [ML] [MLlib] Optimize KMeans i...

2017-12-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14937
  
**[Test build #85510 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85510/consoleFull)** for PR 14937 at commit [`1c31cda`](https://github.com/apache/spark/commit/1c31cda0f78b8c2b11406d76da447e9b3216a97d).
 * This patch **fails PySpark unit tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.





[GitHub] spark issue #14937: [SPARK-8519][SPARK-11560] [ML] [MLlib] Optimize KMeans i...

2017-12-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14937
  
**[Test build #85510 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85510/consoleFull)** for PR 14937 at commit [`1c31cda`](https://github.com/apache/spark/commit/1c31cda0f78b8c2b11406d76da447e9b3216a97d).





[GitHub] spark issue #14937: [SPARK-8519][SPARK-11560] [ML] [MLlib] Optimize KMeans i...

2016-11-03 Thread yanboliang
Github user yanboliang commented on the issue:

https://github.com/apache/spark/pull/14937
  
@sethah Yeah, I agree it's better to run more tests against large-scale 
data. If the number of features or clusters is large, the cost of slicing the 
center array (and some other places) can be optimized; I did not pay much 
attention to that. And we should definitely understand the performance test 
results thoroughly, so feel free to share your results.
When I did this optimization, we found that ```KMeans``` is usually used 
when the number of features is not too large. If users have high-dimensional 
data, they usually reduce the feature dimension with ```PCA```, ```LDA```, or 
similar algorithms and then feed the result into ```KMeans``` for clustering. 
So the optimization should focus more on not-very-high-dimensional data if we 
cannot guarantee better performance in all cases. However, it would be great 
if we can figure out a way to benefit both cases. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---




[GitHub] spark issue #14937: [SPARK-8519][SPARK-11560] [ML] [MLlib] Optimize KMeans i...

2016-11-02 Thread sethah
Github user sethah commented on the issue:

https://github.com/apache/spark/pull/14937
  
@yanboliang I ran the test. The master branch runs in 10 seconds and the 
current patch runs in 6 seconds. Still, the results are meaningless in my 
opinion on such a small dataset. I also ran both branches at larger scale and I 
saw that master branch takes ~20 seconds per iteration in one case while this 
patch takes 10 minutes. I traced it down to the way the data is being copied. 
Could you also run tests at scale to verify this?

Again, with some refactoring I ran some very preliminary tests (data size 
approximately 100 GB, with 100-1k clusters) and saw that this branch improves 
performance in some cases and degrades it in others. We need to test this at 
scale to really understand the implications, I think. I will try to summarize 
my results sometime in the next week. I think we will see performance gains 
when the number of features/clusters is large.





[GitHub] spark issue #14937: [SPARK-8519][SPARK-11560] [ML] [MLlib] Optimize KMeans i...

2016-11-02 Thread yanboliang
Github user yanboliang commented on the issue:

https://github.com/apache/spark/pull/14937
  
@sethah You can try the following piece of code even in a single node:
```Scala
import org.apache.spark.ml.clustering.KMeans

val dataset = spark.read.format("libsvm")
  .options(Map("vectorType" -> "dense"))
  .load("/Users/yliang/Downloads/libsvm/combined")
val kmeans = new KMeans()
  .setK(3)
  .setSeed(1L)
  .setTol(1E-16)
  .setMaxIter(100)
  .setInitMode("random")
val model = kmeans.fit(dataset)
```
You can find the dataset at 
https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html .
I ran it against both master and this PR; each spends a different amount of 
time per iteration.
Before this PR (master code):
```
Time: 32.076 seconds.
Iteration number: 35.
```
After this PR:
```
Time: 16.322 seconds.
Iteration number: 85.
```
I think the value of ```tol``` is not set properly, which causes the two 
implementations to converge after different numbers of iterations. We could 
use a more robust dataset, or force each one to run a fixed number of 
iterations to compare the time spent, but we can still get some sense from 
this result. Please feel free to try this test in your environment, and let 
me know whether it can be reproduced. Thanks.
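One way to make such a timing comparison apples-to-apples is to disable the convergence check so both branches execute the same number of Lloyd iterations. A hedged sketch using the same `ml.clustering.KMeans` API as the snippet above (the specific `k`, seed, and iteration count are illustrative):

```Scala
import org.apache.spark.ml.clustering.KMeans

// With tol = 0.0 the convergence check effectively never fires early
// (only at an exact fixed point), so both branches run maxIter Lloyd
// iterations and total wall-clock times become directly comparable.
val kmeans = new KMeans()
  .setK(3)
  .setSeed(1L)           // same seed => same "random" initial centers
  .setTol(0.0)           // disable early convergence
  .setMaxIter(50)        // fixed iteration count for both runs
  .setInitMode("random")
```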





[GitHub] spark issue #14937: [SPARK-8519][SPARK-11560] [ML] [MLlib] Optimize KMeans i...

2016-11-01 Thread sethah
Github user sethah commented on the issue:

https://github.com/apache/spark/pull/14937
  
A small update: I have run a few tests on a refactored version of this 
patch which avoids some data copying. I have found at least one case where the 
current patch is faster, but many where it is not. I'll try to post formal 
results at some point. (All test cases use dense data, by the way.)

In the meantime, I think it would be helpful to have more detail about the 
tests above. They are rather small datasets. How many centers were used? How 
were the timings observed? Thanks!





[GitHub] spark issue #14937: [SPARK-8519][SPARK-11560] [ML] [MLlib] Optimize KMeans i...

2016-10-30 Thread sethah
Github user sethah commented on the issue:

https://github.com/apache/spark/pull/14937
  
@yanboliang I ran some tests on a 3-node bare-metal cluster (144 cores, 384 
GB RAM) on some dense synthetic data. I installed OpenBLAS customized for the 
hardware on the nodes (I can confirm it's successfully using NativeBLAS; I'm 
not positive it's optimized, though).

With this patch at first, I was seeing iteration times of something like 10 
minutes, compared to ~30 seconds on the master branch. After refactoring the 
code to avoid some copying, I am still seeing about a 3-5x slowdown with this 
approach. I am still working through some of the timings, and I haven't done a 
lot of experimentation with the block size. I will give more details at some 
point. For now, I can point out that copying the center in 
[here](https://github.com/yanboliang/spark/blob/1c31cda0f78b8c2b11406d76da447e9b3216a97d/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L379)
 seems to have a huge impact. 
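The cost of slicing can be illustrated without Spark. A minimal plain-Scala sketch (names and sizes are illustrative, not from the patch) contrasting a per-lookup `slice`, which allocates and copies a fresh array on every call, with allocation-free offset-based indexing into the same flat buffer:

```Scala
val dim = 4
// 3 centers of dimension 4, stored flattened in one array.
val centers = Array.tabulate(3 * dim)(_.toDouble)

// Per-lookup cost with slicing: a fresh dim-length array per call.
def dotWithSlice(point: Array[Double], c: Int): Double = {
  val center = centers.slice(c * dim, (c + 1) * dim) // copies dim doubles
  var s = 0.0
  var i = 0
  while (i < dim) { s += point(i) * center(i); i += 1 }
  s
}

// Allocation-free alternative: read through an offset.
def dotWithOffset(point: Array[Double], c: Int): Double = {
  val off = c * dim
  var s = 0.0
  var i = 0
  while (i < dim) { s += point(i) * centers(off + i); i += 1 }
  s
}
```

Both variants compute the same value; inside a triply nested loop over points and centers, only the second avoids a per-iteration allocation and copy.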





[GitHub] spark issue #14937: [SPARK-8519][SPARK-11560] [ML] [MLlib] Optimize KMeans i...

2016-10-30 Thread yanboliang
Github user yanboliang commented on the issue:

https://github.com/apache/spark/pull/14937
  
@sethah I think the test result can be reproduced with the current 
patch; however, there are two issues that should be considered:
* Make sure you installed an optimized/native BLAS on your system and loaded 
it correctly in the JVM via netlib-java. Otherwise, it will fall back to the 
Java implementation.
* Make sure you load the dataset as DenseVectors, which will be converted 
into a DenseMatrix to get the performance improvement.

```Scala
val df = spark.read.format("libsvm")
  .options(Map("vectorType" -> "dense"))
  .load(path)
```
Spark loads libsvm-format datasets into SparseVector/SparseMatrix by 
default, in which case execution falls into the sparse-data branch, causing a 
huge performance degradation.

Could you share some of your test details? If you have already handled the 
above two tips correctly, please let me know as well. I'm on business travel 
and will resolve the merge conflicts in a few days. I would really appreciate 
hearing your thoughts on this issue. Thanks.





[GitHub] spark issue #14937: [SPARK-8519][SPARK-11560] [ML] [MLlib] Optimize KMeans i...

2016-10-29 Thread sethah
Github user sethah commented on the issue:

https://github.com/apache/spark/pull/14937
  
@yanboliang I began to run some performance tests on this patch today. With 
the patch as it is, I am seeing a huge performance **_degradation_**. The 
most critical reason is the slicing (copying) of the centers array inside the 
innermost while loop. I don't see how the results posted in this PR could 
even have occurred with the current patch. Were those from an older version? 
I know this PR has gone through several iterations, so I'm just trying to get 
a sense of where those results came from. 

It would be great if we could resolve the merge conflicts and start moving 
the review along. I'm happy to help :)





[GitHub] spark issue #14937: [SPARK-8519][SPARK-11560] [ML] [MLlib] Optimize KMeans i...

2016-10-03 Thread yanboliang
Github user yanboliang commented on the issue:

https://github.com/apache/spark/pull/14937
  
@srowen Please feel free to send that PR. This PR involves some significant 
changes and should be carefully discussed, so it may not be merged very 
quickly. Thanks! 





[GitHub] spark issue #14937: [SPARK-8519][SPARK-11560] [ML] [MLlib] Optimize KMeans i...

2016-10-03 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14937
  
@yanboliang would it be useful if I worked on a PR to just remove `runs`? I 
had started that already. But I don't want to cause a big merge conflict for 
you if you're going to update this and merge it soon. 





[GitHub] spark issue #14937: [SPARK-8519][SPARK-11560] [ML] [MLlib] Optimize KMeans i...

2016-09-18 Thread yanboliang
Github user yanboliang commented on the issue:

https://github.com/apache/spark/pull/14937
  
@srowen Yes, I'm working on this. You can see the performance test results 
in the PR description. We found that the optimized k-means gets performance 
improvements of about 2-4x by using native BLAS level-3 matrix-matrix 
multiplications for dense input. However, we saw performance degradation for 
sparse input. For example, the new implementation spent almost twice as much 
time as the old one when training a k-means model on the well-known MNIST 
dataset.
Given the current performance test results, I think we should only apply 
this optimization for dense input and let sparse input still run the old 
code.
I have sent the performance test results to @mengxr and am waiting for his 
opinion. I would also appreciate your thoughts and suggestions. Thanks!
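The level-3 idea behind the dense-input path can be sketched in plain Scala (an illustrative stand-in; the real code would hand the cross-term product to a native BLAS gemm over a whole block of points). Squared distances decompose as ||x - c||^2 = ||x||^2 + ||c||^2 - 2 * x.c, so all point-to-center cross terms come from a single matrix-matrix product:

```Scala
val points  = Array(Array(0.0, 0.0), Array(3.0, 4.0)) // 2 points, dim 2
val centers = Array(Array(0.0, 0.0), Array(3.0, 4.0)) // 2 centers

def sqNorm(v: Array[Double]): Double = v.map(x => x * x).sum

// Cross-term matrix X * C^T; this is the part native BLAS level-3
// (matrix-matrix multiply) accelerates for dense input.
val cross = points.map(p => centers.map(c =>
  p.zip(c).map { case (a, b) => a * b }.sum))

// Squared distances assembled from precomputed norms and cross terms.
val dist2 = Array.tabulate(points.length, centers.length) { (i, j) =>
  sqNorm(points(i)) + sqNorm(centers(j)) - 2.0 * cross(i)(j)
}
```

This is also why the sparse path does not benefit in the same way: a sparse-times-dense product cannot use the dense level-3 kernels directly.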





[GitHub] spark issue #14937: [SPARK-8519][SPARK-11560] [ML] [MLlib] Optimize KMeans i...

2016-09-18 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14937
  
@yanboliang are you still working on this? It seems like an important 
change; I'd love to help get it in.

