[ https://issues.apache.org/jira/browse/SPARK-18731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15726855#comment-15726855 ]

Xiaoye Sun commented on SPARK-18731:
------------------------------------

The size of the broadcast variable (the model) in my case is about 100MB. 
However, I believe TorrentBroadcast does not serialize the broadcast data into 
the task itself; instead, the data is fetched by the worker later, when the 
task first requests it. So the large task size looks strange to me, as does 
the fact that the task size grows with the model size. I am not very familiar 
with the debugger, so it would be great if someone else could help look into 
this. I will also try it myself, but it may take some time.

Thanks Sean!
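To illustrate the point above: if the centers were captured in the task closure, the serialized task would carry the whole model, whereas a broadcast variable ships only a small handle and executors fetch the blocks lazily via TorrentBroadcast on first access. The sketch below is not Spark code, just plain Python with `pickle` standing in for Spark's closure serializer; the task name, handle string, and array size are made up for illustration.

```python
import pickle

# Hypothetical "model": 100k cluster-center values
# (the real report is ~100 MB; this is scaled down).
centers = [float(i) for i in range(100_000)]

# What closure capture effectively ships with every task: the data itself.
payload_with_data = pickle.dumps(("find_nearest_center", centers))

# What a broadcast variable ships instead: a small handle; the data is
# fetched separately, once per executor, when first accessed.
payload_with_handle = pickle.dumps(("find_nearest_center", "broadcast_0"))

print(len(payload_with_data), len(payload_with_handle))
```

If the ~56 MB tasks in the warning scale with the model, that suggests the centers are ending up in the first payload rather than the second.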

> Task size in K-means is so large
> --------------------------------
>
>                 Key: SPARK-18731
>                 URL: https://issues.apache.org/jira/browse/SPARK-18731
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.6.1
>            Reporter: Xiaoye Sun
>            Priority: Minor
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> When running the KMeans algorithm with a large model (e.g. 100k features and 
> 100 centers), a warning is shown for many of the stages saying that the task 
> size is very large. Here is an example warning: 
> WARN TaskSetManager: Stage 23 contains a task of very large size (56256 KB). 
> The maximum recommended task size is 100 KB.
> This could happen at (sum at KMeansModel.scala:88), (takeSample at 
> KMeans.scala:378), (aggregate at KMeans.scala:404) and (collect at 
> KMeans.scala:436). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
