[ 
https://issues.apache.org/jira/browse/SPARK-2114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-2114:
------------------------------

    Description: 
For groupByKey and join transformations, Spark tasks on the reduce side 
deserialize every record into a Java object before calling any user function.  

This causes all kinds of problems for garbage collection - when aggregating 
enough data, objects can escape the young gen and trigger full GCs down the 
line.  Additionally, when records are spilled, they must be serialized and 
deserialized multiple times.

It would be helpful to allow aggregations on serialized data - using some sort 
of RawHasher interface that could implement hashCode and equals for serialized 
records.  This would also require encoding record boundaries in the serialized 
format, which I'm not sure we currently do.

  was:
For all the ByKey transformations, Spark tasks on the reduce side deserialize 
every record into a Java object before calling any user function.  

This causes all kinds of problems for garbage collection - when aggregating 
enough data, objects can escape the young gen and trigger full GCs down the 
line.  Additionally, when records are spilled, they must be serialized and 
deserialized multiple times.

It would be helpful to allow aggregations on serialized data - using some sort 
of RawHasher interface that could implement hashCode and equals for serialized 
records.  This would also require encoding record boundaries in the serialized 
format, which I'm not sure we currently do.


> Aggregations on raw data
> ------------------------
>
>                 Key: SPARK-2114
>                 URL: https://issues.apache.org/jira/browse/SPARK-2114
>             Project: Spark
>          Issue Type: New Feature
>          Components: Shuffle, Spark Core
>    Affects Versions: 1.0.0
>            Reporter: Sandy Ryza
>
> For groupByKey and join transformations, Spark tasks on the reduce side 
> deserialize every record into a Java object before calling any user function. 
>  
> This causes all kinds of problems for garbage collection - when aggregating 
> enough data, objects can escape the young gen and trigger full GCs down the 
> line.  Additionally, when records are spilled, they must be serialized and 
> deserialized multiple times.
> It would be helpful to allow aggregations on serialized data - using some 
> sort of RawHasher interface that could implement hashCode and equals for 
> serialized records.  This would also require encoding record boundaries in 
> the serialized format, which I'm not sure we currently do.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to