[ https://issues.apache.org/jira/browse/SPARK-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048429#comment-14048429 ]

Piotr Szul commented on SPARK-2138:
-----------------------------------

I tested my code with 1.0.1 and I can confirm that I now get an error
when the task size is bigger than spark.akka.frameSize:

{noformat}
14/07/01 11:55:21 INFO DAGScheduler: Failed to run collectAsMap at 
KMeans.scala:190
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to 
stage failure: Serialized task 645:52 was 12306732 bytes which exceeds 
spark.akka.frameSize (10485760 bytes). Consider using broadcast variables for 
large values.
        at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
        at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
{noformat}

This solves my problem, at least to the extent that I can increase the frameSize
and have my clustering done.
The error message, however, suggests that it would be better to use
broadcast variables in the KMeans implementation.
Perhaps this issue can be closed and a new one created for that.
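
For reference, the workaround I used is just raising the frame size in SparkConf (the value below is an example, not a recommendation), and the broadcast approach the error message hints at would look roughly like the sketch below. This is a hand-written illustration, not the actual MLlib code; the data, center arrays, and distance helper are placeholders:

{code}
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastCentersSketch {
  // Plain squared Euclidean distance; stand-in for whatever MLlib uses internally.
  def squaredDistance(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def main(args: Array[String]): Unit = {
    // Workaround: raise the Akka frame size (value in MB; 64 is just an example).
    val conf = new SparkConf()
      .setAppName("kmeans-broadcast-sketch")
      .set("spark.akka.frameSize", "64")
    val sc = new SparkContext(conf)

    // Placeholder data standing in for the real feature vectors and centers.
    val points = sc.parallelize(Seq(
      Array(1.0, 0.0, 0.0),
      Array(0.0, 1.0, 0.0),
      Array(0.9, 0.1, 0.0)))
    val centers = Array(Array(1.0, 0.0, 0.0), Array(0.0, 1.0, 0.0))

    // Broadcast the centers once per node instead of capturing them in every
    // task closure, so the serialized task stays small even if the centers grow.
    val bcCenters = sc.broadcast(centers)
    val assignments = points.map { p =>
      val cs = bcCenters.value
      cs.indices.minBy(i => squaredDistance(cs(i), p)) -> p.toSeq
    }.countByKey()

    println(assignments)
    sc.stop()
  }
}
{code}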

Also, in my case the increasing task size across KMeans iterations was (I
believe) due to my use of sparse features, which initially give cluster centers
with many zeros (so they compress well and the task is small). In subsequent
iterations the centers have more uniform value distributions, and the task size
increases because they can no longer be compressed to the same extent.
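
A back-of-the-envelope way to see the compression effect (a standalone illustration, nothing Spark-specific; it just Java-serializes and gzips an array of zeros versus an array of uniform random values):

{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.util.zip.GZIPOutputStream

object CompressibilityCheck {
  // Serialize with Java serialization, gzip it, and return the compressed size.
  def compressedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(new GZIPOutputStream(bytes))
    out.writeObject(obj)
    out.close()
    bytes.size()
  }

  def main(args: Array[String]): Unit = {
    val n = 100000
    val zeros   = Array.fill(n)(0.0)                              // centers full of zeros
    val uniform = Array.fill(n)(scala.util.Random.nextDouble())   // more uniform values

    println(s"mostly zeros:   ${compressedSize(zeros)} bytes after gzip")
    println(s"uniform values: ${compressedSize(uniform)} bytes after gzip")
  }
}
{code}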

It would be good if the original reporter checked the solution.






> The KMeans algorithm in the MLlib can lead to the Serialized Task size become 
> bigger and bigger
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-2138
>                 URL: https://issues.apache.org/jira/browse/SPARK-2138
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 0.9.0, 0.9.1
>            Reporter: DjvuLee
>            Assignee: Xiangrui Meng
>
> When the algorithm reaches a certain stage and runs the reduceByKey()
> function, it can lead to Executor lost and Task lost; after several such
> failures the application exits.
> When this error occurs, the size of the serialized task is bigger than 10MB,
> and the size becomes larger as the iterations increase.
> the data generation file: https://gist.github.com/djvulee/7e3b2c9eb33ff0037622
> the running code: https://gist.github.com/djvulee/6bf00e60885215e3bfd5



--
This message was sent by Atlassian JIRA
(v6.2#6252)
