[ https://issues.apache.org/jira/browse/SPARK-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048429#comment-14048429 ]
Piotr Szul commented on SPARK-2138:
-----------------------------------

I tested my code with 1.0.1 and I can confirm that I now get an error when the task size is bigger than akka.frameSize:

{noformat}
14/07/01 11:55:21 INFO DAGScheduler: Failed to run collectAsMap at KMeans.scala:190
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 645:52 was 12306732 bytes which exceeds spark.akka.frameSize (10485760 bytes). Consider using broadcast variables for large values.
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
{noformat}

This solves my problem, at least to the extent that I can increase the frameSize and have my clustering complete. The error message, however, suggests that it would be better to use broadcast variables in the k-means implementation. Perhaps this issue can be closed and a new one created for that.

Also, in my case the increasing task size across k-means iterations was (I believe) due to my use of sparse features, which initially give cluster centers with many zeros (hence good compression and a smaller serialized size). In subsequent iterations the centers have more uniform distributions, and the task size increases because they cannot be compressed to the same extent.

It would be good if the original reporter checked the solution.
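The compression effect described above can be demonstrated with a small, self-contained JVM sketch (plain Scala, not Spark code; the object name and array sizes are purely illustrative): deflating a mostly-zero byte array, analogous to early sparse cluster centers, produces far fewer bytes than deflating a near-uniform one, analogous to later-iteration centers.

```scala
import java.util.zip.Deflater
import scala.util.Random

object CompressionSizeDemo {
  // Deflate a byte array and return the compressed length in bytes.
  def compressedSize(data: Array[Byte]): Int = {
    val deflater = new Deflater()
    deflater.setInput(data)
    deflater.finish()
    val buf = new Array[Byte](data.length + 64)
    var total = 0
    while (!deflater.finished()) {
      total += deflater.deflate(buf)
    }
    deflater.end()
    total
  }

  def main(args: Array[String]): Unit = {
    val n = 1 << 16
    val rng = new Random(42)

    // "Sparse" centers: mostly zeros with a few random entries,
    // like early k-means iterations over sparse features.
    val sparse = Array.fill[Byte](n)(0)
    for (_ <- 0 until n / 50) sparse(rng.nextInt(n)) = rng.nextInt().toByte

    // "Dense" centers: near-uniform bytes, like later iterations.
    val dense = Array.fill[Byte](n)(rng.nextInt().toByte)

    val s = compressedSize(sparse)
    val d = compressedSize(dense)
    println(s"sparse compressed: $s bytes, dense compressed: $d bytes")
    // The zero-heavy array compresses to a small fraction of the size.
    assert(s < d)
  }
}
```

This mirrors why the serialized task starts small and grows across iterations even though the number of centers is unchanged: it is the entropy of the center values, not their count, that changes.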
> The KMeans algorithm in MLlib can lead to the serialized task size becoming bigger and bigger
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-2138
>                 URL: https://issues.apache.org/jira/browse/SPARK-2138
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 0.9.0, 0.9.1
>            Reporter: DjvuLee
>            Assignee: Xiangrui Meng
>
> When the algorithm reaches a certain stage and runs the reduceByKey() function, it can lead to executor loss and task loss; after several occurrences, the application exits.
> When this error occurs, the size of the serialized task is bigger than 10MB, and it becomes larger as the iterations increase.
> The data generation file: https://gist.github.com/djvulee/7e3b2c9eb33ff0037622
> The running code: https://gist.github.com/djvulee/6bf00e60885215e3bfd5

-- This message was sent by Atlassian JIRA (v6.2#6252)