[ 
https://issues.apache.org/jira/browse/SPARK-18731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15724488#comment-15724488
 ] 

Sean Owen commented on SPARK-18731:
-----------------------------------

Yes, the scheduler delay comes from having to transmit the huge centroids at 
each step. This is inherent in how k-means is implemented. It proceeds by 
broadcasting the centroids, which at least means it's transmitted just once per 
executor. I don't think more partitions helps here.

Thinking about it, I would actually be surprised if the broadcast variable's 
size is part of the task size, because it's not transmitted as part of the task 
closure. That makes me wonder if, actually, the k-means implementation is 
inadvertently also sending the centroids a second time as part of the task 
closure. 

I haven't looked into it, but are you able to enable DEBUG logging, and then 
look for messages from util.ClosureCleaner? it will print some details about 
what it's serializing. Or attach a debugger to the driver, and break at the 
statement where it logs the warning, and see what object is part of the task 
closure? if I'm right, it shouldn't have a copy of the centroids, because 
that's broadcast. If it has something huge I think that's a problem.

> Task size in K-means is so large
> --------------------------------
>
>                 Key: SPARK-18731
>                 URL: https://issues.apache.org/jira/browse/SPARK-18731
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.6.1
>            Reporter: Xiaoye Sun
>            Priority: Minor
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> When run the KMeans algorithm for a large model (e.g. 100k features and 100 
> centers), there will be warning shown for many of the stages saying that the 
> task size is very large. Here is an example warning. 
> WARN TaskSetManager: Stage 23 contains a task of very large size (56256 KB). 
> The maximum recommended task size is 100 KB.
> This could happen at (sum at KMeansModel.scala:88), (takeSample at 
> KMeans.scala:378), (aggregate at KMeans.scala:404) and (collect at 
> KMeans.scala:436). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to