Tycho Lamerigts created CRUNCH-485:
--------------------------------------
Summary: groupByKey on Spark incorrect if key is Avro record with
defined sort order
Key: CRUNCH-485
URL: https://issues.apache.org/jira/browse/CRUNCH-485
Project: Crunch
Issue Type: Bug
Components: Core
Affects Versions: 0.11.0
Reporter: Tycho Lamerigts
Assignee: Josh Wills
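groupByKey on Spark is incorrect if the key type is an Avro record with a defined
sort order (http://avro.apache.org/docs/1.7.7/spec.html#order).
Instead of honoring that order, it serializes the entire Avro record to a binary
blob (byte array) and groups identical blobs. This is wrong: for example, two keys
that differ only in a field marked "order": "ignore" should land in the same group,
but their blobs differ, so they end up in separate groups. By contrast, groupByKey
on MapReduce works as expected, because it takes Avro's sort order into account.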
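To make the difference concrete, here is a minimal, library-free sketch. It uses neither Crunch nor Avro: the Key class, its "note" field (standing in for a field declared with "order": "ignore"), and the toBlob encoding are all hypothetical and only illustrate how blob-based grouping and sort-order-based grouping can disagree.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SortOrderGroupingSketch {

    // Hypothetical key: "id" takes part in comparison, while "note" stands in
    // for a field that an Avro schema would declare with "order": "ignore".
    static final class Key {
        final String id;
        final String note;

        Key(String id, String note) {
            this.id = id;
            this.note = note;
        }
    }

    // Stand-in for serializing the whole record: every field ends up in the blob.
    static ByteBuffer toBlob(Key k) {
        byte[] id = k.id.getBytes(StandardCharsets.UTF_8);
        byte[] note = k.note.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(8 + id.length + note.length);
        buf.putInt(id.length).put(id).putInt(note.length).put(note);
        buf.flip();
        return buf;
    }

    // Grouping on the blob: two keys share a group only if all bytes match.
    static Map<ByteBuffer, List<Key>> groupByBlob(List<Key> keys) {
        Map<ByteBuffer, List<Key>> groups = new LinkedHashMap<>();
        for (Key k : keys) {
            groups.computeIfAbsent(toBlob(k), b -> new ArrayList<>()).add(k);
        }
        return groups;
    }

    // Grouping with an order-aware comparator: the ignored field plays no role.
    static Map<Key, List<Key>> groupBySortOrder(List<Key> keys) {
        Map<Key, List<Key>> groups = new TreeMap<>(Comparator.comparing((Key k) -> k.id));
        for (Key k : keys) {
            groups.computeIfAbsent(k, x -> new ArrayList<>()).add(k);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Key> keys = Arrays.asList(
            new Key("a", "first"), new Key("a", "second"), new Key("b", "third"));
        System.out.println("blob groups: " + groupByBlob(keys).size());
        System.out.println("sort-order groups: " + groupBySortOrder(keys).size());
    }
}
```

With these three sample keys, the blob-based grouping keeps the two "a" keys apart, while the comparator-based grouping puts them together.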
The culprit is probably the following code from
org.apache.crunch.impl.spark.collect.PGroupedTableImpl#getJavaRDDLikeInternal:
{code}
groupedRDD = parentRDD
    .map(new PairMapFunction(ptype.getOutputMapFn(), runtime.getRuntimeContext()))
    .mapToPair(new MapOutputFunction(keySerde, valueSerde))
    .groupByKey(numPartitions);
{code}
where MapOutputFunction simply converts the entire key object to a binary blob,
without taking sort order into account.
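One possible direction, sketched here without Crunch, Spark, or Avro: wrap the serialized key in an object whose equals() and hashCode() delegate to an order-aware comparison, since hash-based grouping relies on those two methods. OrderAwareKey, TOY_ORDER, TOY_HASH, and the toy byte layout are hypothetical names and assumptions for illustration only. In a real fix the comparator and hash might be backed by the schema-aware compare and hashCode methods of Avro's BinaryData class, but that is not verified against the Crunch code base.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntFunction;

public class OrderAwareKeySketch {

    // Hypothetical wrapper around serialized key bytes. equals() and hashCode()
    // are delegated, so they can skip bytes that belong to ignored fields.
    static final class OrderAwareKey {
        final byte[] bytes;
        private final Comparator<byte[]> order;
        private final ToIntFunction<byte[]> hasher;

        OrderAwareKey(byte[] bytes, Comparator<byte[]> order, ToIntFunction<byte[]> hasher) {
            this.bytes = bytes;
            this.order = order;
            this.hasher = hasher;
        }

        @Override
        public boolean equals(Object other) {
            return other instanceof OrderAwareKey
                && order.compare(bytes, ((OrderAwareKey) other).bytes) == 0;
        }

        @Override
        public int hashCode() {
            return hasher.applyAsInt(bytes);
        }
    }

    // Toy layout (an assumption): byte 0 is the ordered field, the rest is ignored.
    static final Comparator<byte[]> TOY_ORDER = (a, b) -> Byte.compare(a[0], b[0]);
    static final ToIntFunction<byte[]> TOY_HASH = b -> (int) b[0];

    // Hash-based grouping that now sees order-equal keys as the same key.
    static Map<OrderAwareKey, List<String>> group(List<byte[]> keys, List<String> values) {
        Map<OrderAwareKey, List<String>> groups = new HashMap<>();
        for (int i = 0; i < keys.size(); i++) {
            OrderAwareKey key = new OrderAwareKey(keys.get(i), TOY_ORDER, TOY_HASH);
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(values.get(i));
        }
        return groups;
    }

    public static void main(String[] args) {
        List<byte[]> keys = Arrays.asList(
            new byte[] {1, 10}, new byte[] {1, 20}, new byte[] {2, 30});
        List<String> values = Arrays.asList("v1", "v2", "v3");
        System.out.println("groups: " + group(keys, values).size());
    }
}
```

The important property is that equals() and hashCode() agree with each other and with the sort order; if only one of them ignored the extra bytes, hash-based grouping would still split the keys.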