zhengruifeng commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality URL: https://github.com/apache/spark/pull/27758#issuecomment-593816313 I update the `radii` to the `statistics` including more than `radii`, so another bound can be used in both distance measurers: In `findClosest`, when cluster `i` is too far alway from current closest cluster `bestIndex` (for example, distance(cluster_i, cluster_bestIndex) > 2 * distance(cluster_bestIndex, x) in `EuclideanDistance`), then we can say that point x should not belong to cluster `i`, so no need to compute distance(cluster_i, x). Then I retest above testsuite, the prediction is a litter faster, but not significantly. I also test on **cosine-distance**, results are: |Test on webspam| This PR(k=2) | This PR(k=4) | This PR(k=8) | This PR(k=16) | Master(k=2) | Master(k=4) | Master(k=8) | Master(k=16) | |------|----------|------------|----------|------------|----------|------------|----------|------------| |Train Duration (sec)|28.851|39.434|107.571|306.996|29.291|37.332|99.98|275.32| |NumIters|10|7|11|14|10|7|11|14| |Cost|3585.915362367295|2830.5043071043824|2410.540059493046|2057.831172250597|3585.9153623672946|2830.5043071043824|2410.540059493046|2057.8311722505964| |Prediction Duration (millsec)|29|29|29|33|32|29|32|35| |Test on a9a| This PR(k=2) | This PR(k=4) | This PR(k=8) | This PR(k=16) | This PR(k=32) | This PR(k=64) | This PR(k=128) | This PR(k=256) | Master(k=2) | Master(k=4) | Master(k=8) | Master(k=16) | Master(k=32) | Master(k=64) | Master(k=238) | Master(k=3566) | |------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------| |Train Duration (sec)|0.445|0.559|0.77|1.067|1.379|2.114|3.857|7.786|0.461|0.613|0.728|1.208|1.547|2.387|4.287|9.11| |NumIters|4|9|10|20|20|20|20|20|4|9|10|20|20|20|20|20| |Cost|9458.881512958757|8727.181294576074|7646.536181047704|6743.890831633205|6063.089195117649|5381.196166489522|4787.4797985497125|4275.705880212388|9458.881512958757|8727.181294576074|7646.536181047704|6743.890831633205|6063.089195117649|5381.196166489522|4787.4797985497125|4275.705880212388| |Prediction Duration (millsec)|29|30|28|28|28|28|29|28|32|30|34|30|31|31|30|36| We can see that KMeans impls with cosine distance have the (almost) same convergen. And the prediction is about 10% faster than master.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org