[jira] [Commented] (SPARK-4743) Use SparkEnv.serializer instead of closureSerializer in aggregateByKey and foldByKey

2016-03-25 Thread Jack Franson (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-4743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212726#comment-15212726 ]

Jack Franson commented on SPARK-4743:
--------------------------------------

Hi, I'm still running into "Task not serializable" errors with aggregateByKey 
when my initial value is an Avro object (I've configured the Kryo serializer to 
handle it via a custom KryoRegistrator). I don't see the issue when I pass an 
empty instance of the Avro object (via new MyAvroObject()) as the initial 
value, but then the Kryo serializer throws an exception because required 
fields are null. To get around that, I tried creating an initial value with 
all the required fields set to defaults, but then I hit a 
java.io.NotSerializableException, which causes the "Task not serializable" 
exception that fails the job. That seems to indicate that Java serialization 
is taking over again.
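
For reference, this is roughly the shape of what I'm running (MyAvroObject 
stands in for my generated Avro class; mergeValue and mergeCombiners are 
placeholders for my merge logic):

    // Kryo registration, as mentioned above (spark.kryo.registrator points here).
    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.serializer.KryoRegistrator

    class MyKryoRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        kryo.register(classOf[MyAvroObject]) // placeholder for the generated Avro class
      }
    }

    // The failing call: the zero value travels through task serialization,
    // which uses the Java closure serializer and rejects the Avro object.
    val zero = new MyAvroObject() // or a builder with required fields defaulted
    val result = pairs.aggregateByKey(zero)(
      (acc, v) => mergeValue(acc, v),   // seqOp: fold one value into the accumulator
      (a, b)   => mergeCombiners(a, b)  // combOp: merge two partial accumulators
    )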

This is on Spark 1.5.2 with Avro 1.7.7. The line throwing the fatal exception 
is org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304).

This is the closest ticket I could find on this issue, so I'm wondering 
whether there are further tweaks the Spark libraries could make to use 
SparkEnv.serializer, or whether the problem is on my end (any tips in that 
case would be much appreciated!).
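
In the meantime, the workaround I'm testing is to drop down to combineByKey 
and construct the initial value inside createCombiner, so the Avro object is 
built on the executors instead of being serialized with the task (names like 
MyValue, freshDefault, and the merge functions are placeholders again):

    // combineByKey never ships a zero value: createCombiner builds the initial
    // accumulator on the executor from the first value seen for each key.
    val result = pairs.combineByKey(
      (v: MyValue) => mergeValue(freshDefault(), v),              // createCombiner
      (acc: MyAvroObject, v: MyValue) => mergeValue(acc, v),      // mergeValue
      (a: MyAvroObject, b: MyAvroObject) => mergeCombiners(a, b)  // mergeCombiners
    )

    // freshDefault() builds a new Avro object with required fields populated;
    // since it runs on the executors, only shuffle data (handled by Kryo)
    // crosses a serialization boundary.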

Thanks for your help.

> Use SparkEnv.serializer instead of closureSerializer in aggregateByKey and 
> foldByKey
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-4743
>                 URL: https://issues.apache.org/jira/browse/SPARK-4743
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Ivan Vergiliev
>            Assignee: Ivan Vergiliev
>              Labels: performance
>             Fix For: 1.3.0
>
>
> AggregateByKey and foldByKey in PairRDDFunctions both use the closure 
> serializer to serialize and deserialize the initial value. This means that 
> the Java serializer is always used, which can be very expensive if there's a 
> large number of groups. Calling combineByKey manually and using the normal 
> serializer instead of the closure one improved the performance on the dataset 
> I'm testing with by about 30-35%.
> I'm not familiar enough with the codebase to be certain that replacing the 
> serializer here is OK, but it works correctly in my tests, and it's only 
> serializing a single value of type U, which should be serializable by the 
> default one since it can be the output of a job. Let me know if I'm missing 
> anything.
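
For context, the approach described above corresponds roughly to the following 
change inside PairRDDFunctions: serialize the zero value once with the data 
serializer (SparkEnv.get.serializer, i.e. Kryo when so configured) rather than 
the closure serializer, then deserialize a fresh copy per key. A simplified 
sketch, not the exact committed patch:

    import java.nio.ByteBuffer
    import org.apache.spark.SparkEnv

    // Serialize the zero value once on the driver with the data serializer
    // (Kryo if configured) instead of the Java-only closure serializer.
    val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
    val zeroArray = new Array[Byte](zeroBuffer.limit)
    zeroBuffer.get(zeroArray)

    // On the executors, deserialize a fresh clone of the zero value per key.
    lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
    val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))

    // aggregateByKey then delegates to combineByKey with these functions.
    combineByKey[U](
      (v: V) => seqOp(createZero(), v),  // build the accumulator from a fresh zero
      seqOp,                             // merge a value into the accumulator
      combOp,                            // merge two accumulators
      partitioner
    )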






[jira] [Commented] (SPARK-4743) Use SparkEnv.serializer instead of closureSerializer in aggregateByKey and foldByKey

2014-12-04 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-4743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234165#comment-14234165 ]

Apache Spark commented on SPARK-4743:
--------------------------------------

User 'IvanVergiliev' has created a pull request for this issue:
https://github.com/apache/spark/pull/3605



