I'd like to override the logic of comparing keys for equality in
groupByKey. Kinda like how combineByKey allows you to pass in the combining
logic for "values", I'd like to do the same for keys.

My code looks like this:
val res = rdd.groupBy(myPartitioner)
Here, rdd is of type RDD[(MyKey, MyValue)], so res turns out to be of type
RDD[(MyKey, Seq[MyValue])]

MyKey is defined as case class MyKey(field1: Int, field2: Int)
and myPartitioner's getPartition(key: Any), here key is of type MyKey and
the partitioning logic is an expression on both field1 and field2.

I'm guessing the groupBy uses "equals" to compare like instances of MyKey.
Currently, the "equals" method of MyKey uses both field1 and field2, as
would be natural to its implementation. However, I'd like to have the
groupBy only use field1. Any pointers on how I can go about doing it?

One way is the following, but I'd like to avoid creating all those MyNewKey
objects:
val partitionedRdd = rdd.partitionBy(myPartitioner)
val mappedRdd = partitionedRdd.mapPartitions(partition =>
    partition.map(case (myKey, myValue) => (new MyNewKey(myKey.field1),
myValue)),
    preservesPartitioning=true)
val groupedRdd = mappedRdd.groupByKey()


Thanks,
Ameet

Reply via email to