Derrick Burns created SPARK-3218:
------------------------------------

             Summary: K-Means clusterer can fail on degenerate data
                 Key: SPARK-3218
                 URL: https://issues.apache.org/jira/browse/SPARK-3218
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 1.0.2
            Reporter: Derrick Burns

The KMeans parallel implementation selects points to be cluster centers with probability weighted by their distance to the current cluster centers. However, if there are fewer than k DISTINCT points in the data set, this approach will fail. Further, the recent check-in that works around this problem results in the same point being selected repeatedly as a cluster center.

The fix is to allow fewer than k cluster centers to be selected. This requires several changes to the code, because the number of cluster centers is woven into the implementation.

I have a version of the code that addresses this problem and also generalizes the distance metric. However, I see that there are literally hundreds of outstanding pull requests. If someone will commit to working with me to sponsor the pull request, I will create it.
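
To make the proposed behavior concrete, here is a minimal sketch in plain Python. It is not Spark's code and not the patch mentioned above. It assumes a simple sequential, distance-weighted seeding (in the style of k-means++, rather than the parallel variant MLlib uses) with squared Euclidean distance. The names `choose_centers` and `squared_distance` are hypothetical. The point it illustrates is the early exit: once every point coincides with an already chosen center, no further distinct center exists, so the function returns fewer than k centers instead of failing or repeating a point.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def choose_centers(points, k, seed=0):
    """Distance-weighted seeding that returns at most k distinct centers.

    Illustrative sketch only; hypothetical helper, not part of MLlib.
    """
    if not points or k <= 0:
        return []
    rng = random.Random(seed)
    centers = [points[rng.randrange(len(points))]]
    # costs[i] is the squared distance from points[i] to its nearest center.
    costs = [squared_distance(p, centers[0]) for p in points]

    while len(centers) < k:
        total = sum(costs)
        if total == 0.0:
            # Every point coincides with a chosen center, so the data has
            # fewer than k distinct points. Stop with fewer than k centers.
            break
        # Sample an index with probability costs[i] / total.
        threshold = rng.random() * total
        cumulative = 0.0
        chosen = None
        for i, cost in enumerate(costs):
            cumulative += cost
            # Skip zero-cost points so an existing center is never re-picked.
            if cost > 0 and cumulative >= threshold:
                chosen = i
                break
        if chosen is None:
            # Guard against floating-point rounding: take the farthest point.
            chosen = max(range(len(points)), key=costs.__getitem__)
        new_center = points[chosen]
        centers.append(new_center)
        costs = [min(c, squared_distance(p, new_center))
                 for p, c in zip(points, costs)]
    return centers


if __name__ == "__main__":
    # Eight points but only two distinct values; ask for five centers.
    data = [(1.0, 1.0)] * 5 + [(2.0, 2.0)] * 3
    print(choose_centers(data, k=5))
```

With the degenerate input in the usage example (eight points, two distinct values, k = 5), the sketch returns two centers. The rest of a clusterer built on it would then need to size its per-cluster state from the number of centers returned rather than from k, which is the kind of change the report describes as woven through the implementation.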