Derrick Burns created SPARK-3253:
------------------------------------

             Summary: KMeans cluster will fail on large number of clusters/high dimensional data
                 Key: SPARK-3253
                 URL: https://issues.apache.org/jira/browse/SPARK-3253
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 1.0.2
            Reporter: Derrick Burns


The latest changes, which use broadcast to communicate cluster centers to 
workers, keep the closure size small but do not avoid the problem of returning 
the cluster centers to the master in the final collect() stage. At this step, 
the collect() may fail because the resulting cluster centers are larger than 
the Akka frame size can accommodate. What is frustrating is that nothing 
indicates that the failure was caused by exceeding the frame size. This makes 
it a Major issue, even though there is a simple workaround: increasing the 
frame size.

What would help is a check BEFORE the clusterer begins the heavy lifting. 
The check would compute the expected result size and compare it to the Akka 
frame size. If the result won't fit, it should at the very least emit a warning.
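
A minimal sketch of such a pre-flight check, written as plain Python rather than as MLlib code. It assumes that the centers are dense vectors of 64-bit doubles, that serialization adds roughly 20% overhead, and that the frame size defaults to 10 MB (the spark.akka.frameSize setting, which is expressed in MB). The function names and the overhead figure are hypothetical, chosen only for illustration; the real limit also depends on how the results are serialized.

```python
BYTES_PER_DOUBLE = 8          # assumption: dense centers stored as 64-bit doubles
BYTES_PER_MB = 1024 * 1024


def estimated_centers_bytes(k, dim, overhead_percent=20):
    """Rough serialized size of k dense centers of dimension dim.

    overhead_percent is an assumed allowance for serialization overhead,
    not a measured value.
    """
    raw = k * dim * BYTES_PER_DOUBLE
    # Integer ceiling of raw * (100 + overhead_percent) / 100.
    return (raw * (100 + overhead_percent) + 99) // 100


def frame_size_warning(k, dim, frame_size_mb=10):
    """Return a warning string if the centers may not fit in one frame, else None."""
    needed = estimated_centers_bytes(k, dim)
    limit = frame_size_mb * BYTES_PER_MB
    if needed <= limit:
        return None
    # Smallest whole number of MB that would hold the estimate.
    suggested_mb = (needed + BYTES_PER_MB - 1) // BYTES_PER_MB
    return ("KMeans: estimated result size of %d bytes exceeds the frame size "
            "of %d MB; consider setting spark.akka.frameSize to at least %d"
            % (needed, frame_size_mb, suggested_mb))


if __name__ == "__main__":
    # 10,000 centers of dimension 1,000 are far larger than a 10 MB frame.
    print(frame_size_warning(10000, 1000))
```

Calling frame_size_warning(k, dim) before the first iteration would surface the problem up front instead of after the final collect() fails. A small job such as k=1000, dim=100 returns None and runs unchanged.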



