[ https://issues.apache.org/jira/browse/SPARK-18731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15724488#comment-15724488 ]
Sean Owen commented on SPARK-18731: ----------------------------------- Yes, the scheduler delay comes from having to transmit the huge centroids at each step. This is inherent in how k-means is implemented. It proceeds by broadcasting the centroids, which at least means it's transmitted just once per executor. I don't think more partitions helps here. Thinking about it, I would actually be surprised if the broadcast variable's size is part of the task size, because it's not transmitted as part of the task closure. That makes me wonder if, actually, the k-means implementation is inadvertently also sending the centroids a second time as part of the task closure. I haven't looked into it, but are you able to enable DEBUG logging, and then look for messages from util.ClosureCleaner? it will print some details about what it's serializing. Or attach a debugger to the driver, and break at the statement where it logs the warning, and see what object is part of the task closure? if I'm right, it shouldn't have a copy of the centroids, because that's broadcast. If it has something huge I think that's a problem. > Task size in K-means is so large > -------------------------------- > > Key: SPARK-18731 > URL: https://issues.apache.org/jira/browse/SPARK-18731 > Project: Spark > Issue Type: Improvement > Components: MLlib > Affects Versions: 1.6.1 > Reporter: Xiaoye Sun > Priority: Minor > Original Estimate: 5h > Remaining Estimate: 5h > > When run the KMeans algorithm for a large model (e.g. 100k features and 100 > centers), there will be warning shown for many of the stages saying that the > task size is very large. Here is an example warning. > WARN TaskSetManager: Stage 23 contains a task of very large size (56256 KB). > The maximum recommended task size is 100 KB. > This could happen at (sum at KMeansModel.scala:88), (takeSample at > KMeans.scala:378), (aggregate at KMeans.scala:404) and (collect at > KMeans.scala:436). -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org