[ https://issues.apache.org/jira/browse/SPARK-18731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15726824#comment-15726824 ]
Sean Owen commented on SPARK-18731:
-----------------------------------

Yes, that's the kind of thing worth looking at. Nothing here is obviously amiss, and you can see that the Broadcast variable with the centers is in the closure, which is fine. The broadcast itself isn't big.

Are you able to run your app in a debugger and break in TaskSetManager around:

{code}
if (serializedTask.limit > TaskSetManager.TASK_SIZE_TO_WARN_KB * 1024 &&
    !emittedTaskSizeWarning) {
  emittedTaskSizeWarning = true
  logWarning(s"Stage ${task.stageId} contains a task of very large size " +
    s"(${serializedTask.limit / 1024} KB). The maximum recommended task size is " +
    s"${TaskSetManager.TASK_SIZE_TO_WARN_KB} KB.")
}
{code}

and then see what is in the task that is serialized? If the taskBinary is large, that would further suggest it's a big closure, and you would know exactly which task it is.

(Does anyone else know a better way to debug why a closure has something big in it?)

> Task size in K-means is so large
> --------------------------------
>
>                 Key: SPARK-18731
>                 URL: https://issues.apache.org/jira/browse/SPARK-18731
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.6.1
>            Reporter: Xiaoye Sun
>            Priority: Minor
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> When running the KMeans algorithm for a large model (e.g. 100k features and 100
> centers), a warning is shown for many of the stages saying that the
> task size is very large. Here is an example warning.
> WARN TaskSetManager: Stage 23 contains a task of very large size (56256 KB).
> The maximum recommended task size is 100 KB.
> This could happen at (sum at KMeansModel.scala:88), (takeSample at
> KMeans.scala:378), (aggregate at KMeans.scala:404) and (collect at
> KMeans.scala:436).
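
If attaching a debugger is awkward, a rough alternative to the check suggested in the comment above is to serialize a suspect closure by hand and look at its size. The following is only a sketch, and it assumes the closure goes through plain Java serialization. ClosureSizeProbe, serializedSize and the 100 x 1000 centers array are hypothetical names and sizes made up for illustration; none of them are part of Spark or of the KMeans code.

{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical helper, not part of Spark.
object ClosureSizeProbe {

  /** Number of bytes produced by plain Java serialization of `obj`. */
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(obj) finally out.close()
    bytes.size()
  }

  def main(args: Array[String]): Unit = {
    // Made-up stand-in for cluster centers: 100 centers x 1000 features.
    val centers: Array[Array[Double]] = Array.fill(100, 1000)(0.0)

    // Captures `centers` directly, so the whole array is serialized with the function.
    val capturing: Double => Double = x => x + centers(0)(0)

    // Captures nothing large.
    val small: Double => Double = x => x + 1.0

    println(s"capturing closure: ${serializedSize(capturing) / 1024} KB")
    println(s"small closure:     ${serializedSize(small) / 1024} KB")
  }
}
{code}

Comparing the sizes of individual functions this way can help narrow down which captured value is inflating the task, though it does not replace inspecting the actual serialized task in TaskSetManager.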