Derrick Burns created SPARK-3253:
------------------------------------

             Summary: KMeans cluster will fail on large number of clusters/high dimensional data
                 Key: SPARK-3253
                 URL: https://issues.apache.org/jira/browse/SPARK-3253
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 1.0.2
            Reporter: Derrick Burns

The latest changes, which use broadcast to send the cluster centers to the workers, keep the closure size small, but they do not avoid the problem of returning the cluster centers to the master in the final collect() stage. At this step, collect() may fail because the resulting cluster centers are larger than the Akka frame size can accommodate.

What is frustrating is that nothing indicates the failure was caused by exceeding the frame size. This makes it a Major issue, even though there is a simple workaround: increasing the frame size.

What would be helpful is a check BEFORE the clusterer begins the heavy lifting. The check would compute the expected result size and compare it to the Akka frame size. If the result won't fit, it should at the very least emit a warning.
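
A minimal sketch of the kind of check described above, written in plain Python rather than against the MLlib code. All function names here are hypothetical. It assumes the centers are dense vectors of 64-bit doubles and that the frame size is given in megabytes (the default of 10 is an assumption based on the spark.akka.frameSize setting). It ignores serialization overhead, so the estimate is only a lower bound on the real message size.

```python
import warnings

# Assumption: each coordinate of a center is stored as a 64-bit double.
BYTES_PER_DOUBLE = 8


def estimate_centers_bytes(k, dim):
    """Lower bound on the payload of k dense centers of dimension dim."""
    return k * dim * BYTES_PER_DOUBLE


def fits_in_frame(k, dim, frame_size_mb=10):
    """Return True if the estimated payload fits in frame_size_mb megabytes."""
    return estimate_centers_bytes(k, dim) <= frame_size_mb * 1024 * 1024


def warn_if_too_large(k, dim, frame_size_mb=10):
    """Warn before clustering starts if the result is unlikely to fit.

    Returns True when a warning was emitted, False otherwise.
    """
    if fits_in_frame(k, dim, frame_size_mb):
        return False
    warnings.warn(
        "Estimated size of %d centers of dimension %d is %d bytes, which "
        "exceeds the frame size of %d MB; consider increasing the frame size."
        % (k, dim, estimate_centers_bytes(k, dim), frame_size_mb)
    )
    return True
```

For example, 10,000 centers of dimension 1,000 come to at least 80,000,000 bytes, well above a 10 MB frame, so warn_if_too_large(10000, 1000) would emit the warning before any clustering work is done.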