[GitHub] [spark] zhengruifeng commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

GitBox Mon, 02 Mar 2020 23:56:12 -0800

zhengruifeng commented on issue #27758: [SPARK-31007][ML] KMeans optimization 
based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593816313
 
 
   I update the `radii` to the `statistics` including more than `radii`, so 
another bound can be used in both distance measurers: In `findClosest`, when 
cluster `i` is too far alway from current closest cluster `bestIndex` (for 
example, distance(cluster_i, cluster_bestIndex) > 2 * 
distance(cluster_bestIndex, x) in `EuclideanDistance`), then we can say that 
point x should not belong to cluster `i`, so no need to compute 
distance(cluster_i, x).
   
   Then I retest above testsuite, the prediction is a litter faster, but not 
significantly.
   
   I also test on **cosine-distance**, results are:
   
   |Test on webspam| This PR(k=2) | This PR(k=4) | This PR(k=8) | This PR(k=16) 
| Master(k=2) | Master(k=4) | Master(k=8) | Master(k=16)  |
   
|------|----------|------------|----------|------------|----------|------------|----------|------------|
   |Train Duration 
(sec)|28.851|39.434|107.571|306.996|29.291|37.332|99.98|275.32|
   |NumIters|10|7|11|14|10|7|11|14|
   
|Cost|3585.915362367295|2830.5043071043824|2410.540059493046|2057.831172250597|3585.9153623672946|2830.5043071043824|2410.540059493046|2057.8311722505964|
   |Prediction Duration (millsec)|29|29|29|33|32|29|32|35|
   
   
   |Test on a9a| This PR(k=2) | This PR(k=4) | This PR(k=8) | This PR(k=16) | 
This PR(k=32) | This PR(k=64) | This PR(k=128) | This PR(k=256) | Master(k=2) | 
Master(k=4) | Master(k=8) | Master(k=16)  | Master(k=32) | Master(k=64) | 
Master(k=238) | Master(k=3566)  |
   
|------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|
   |Train Duration 
(sec)|0.445|0.559|0.77|1.067|1.379|2.114|3.857|7.786|0.461|0.613|0.728|1.208|1.547|2.387|4.287|9.11|
   |NumIters|4|9|10|20|20|20|20|20|4|9|10|20|20|20|20|20|
   
|Cost|9458.881512958757|8727.181294576074|7646.536181047704|6743.890831633205|6063.089195117649|5381.196166489522|4787.4797985497125|4275.705880212388|9458.881512958757|8727.181294576074|7646.536181047704|6743.890831633205|6063.089195117649|5381.196166489522|4787.4797985497125|4275.705880212388|
   |Prediction Duration 
(millsec)|29|30|28|28|28|28|29|28|32|30|34|30|31|31|30|36|
   
   
   We can see that KMeans impls with cosine distance have the (almost) same 
convergen.
   And the prediction is about 10% faster than master.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] zhengruifeng commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Reply via email to