[ https://issues.apache.org/jira/browse/SPARK-2114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sandy Ryza updated SPARK-2114: ------------------------------ Summary: groupByKey and joins on raw data (was: Aggregations on raw data) > groupByKey and joins on raw data > -------------------------------- > > Key: SPARK-2114 > URL: https://issues.apache.org/jira/browse/SPARK-2114 > Project: Spark > Issue Type: New Feature > Components: Shuffle, Spark Core > Affects Versions: 1.0.0 > Reporter: Sandy Ryza > > For groupByKey and join transformations, Spark tasks on the reduce side > deserialize every record into a Java object before calling any user function. > > This causes all kinds of problems for garbage collection - when aggregating > enough data, objects can escape the young gen and trigger full GCs down the > line. Additionally, when records are spilled, they must be serialized and > deserialized multiple times. > It would be helpful to allow aggregations on serialized data - using some > sort of RawHasher interface that could implement hashCode and equals for > serialized records. This would also require encoding record boundaries in > the serialized format, which I'm not sure we currently do. -- This message was sent by Atlassian JIRA (v6.2#6252)