Github user srowen commented on the issue: https://github.com/apache/spark/pull/14937 @yanboliang here are a few other changes I made in my PR that accidentally duplicated some of this work. Refer to https://github.com/apache/spark/pull/14948 for details. For your consideration: I think getRuns/setRuns should be formally deprecated and the runs param to the constructor removed (it's private). There are some mentions of 'runs' in the docs that should be removed too at this point. mergeContribs and the "type WeightedPoint" don't really serve a purpose at this point and can be 'inlined' IMHO. Minor: the "contribs.iterator" can really be an iterator only over triples with non-zero counts, which eliminates the filtering by 0 counts The "run finished" log message is obsolete now. Minor, but in k-means|| the sample of 1 element is very slightly better if it's without replacement. Won't matter much but otherwise you might sample a couple elements. pointsWithCosts.flatMap might be a little faster as filter + map instead because virtually every element is filtered out. mergeNewCenters() is pretty superfluous, because it's simpler to compute newCenters, then add it to centers, in the same loop. No clear() or multiple calls to update this. weightMap can be computed with countByValue directly
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org