I have had to use this as well.
Sometimes, I create a POJO to hold the multi-dimensional key to make things
easier.
e.g.:

case class MultiKey(i: Int, j: Int, k: Int)
then I can define a reduce function over the multikey, e.g.

def reduceByI(mkv1: (MultiKey, Value), mkv2: (MultiKey, Value)) =
  if (mkv1._1.i > mkv2._1.i) mkv1 else mkv2
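To make the idea concrete, here is a minimal, self-contained sketch using plain Scala collections in place of an RDD. The `MultiKey` fields, the `Value` type, and the "keep the pair with the larger i" reduce are all illustrative assumptions, not anything prescribed by Spark:

```scala
// Illustrative POJO-style key; field types are assumed to be Int.
case class MultiKey(i: Int, j: Int, k: Int)

object MultiKeyReduceDemo {
  type Value = Double

  // Reduce over the multikey: keep whichever pair has the larger i component.
  def reduceByI(mkv1: (MultiKey, Value), mkv2: (MultiKey, Value)): (MultiKey, Value) =
    if (mkv1._1.i > mkv2._1.i) mkv1 else mkv2

  def main(args: Array[String]): Unit = {
    val data = Seq(
      (MultiKey(1, 10, 100), 1.0),
      (MultiKey(3, 10, 100), 2.0),
      (MultiKey(2, 20, 100), 3.0)
    )
    // Group on j and reduce each group -- locally, this is what
    // reduceByKey does on a pair RDD keyed by j.
    val reduced = data.groupBy(_._1.j).values.map(_.reduce(reduceByI))
    reduced.foreach(println)
  }
}
```

The same `reduceByI` can be handed to `reduceByKey` once the relevant key component has been moved into the key slot of a pair RDD.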
If you had RDD[((i, j, k), value)] then you could reduce by j by essentially
mapping j into the key slot, doing the reduce, and then mapping it back:

rdd.map { case ((i, j, k), v) => (j, (i, k, v)) }
   .reduceByKey( ... )
   .map { case (j, (i, k, v)) => ((i, j, k), v) }
It's not pretty, but I've had to use this pattern before too.
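A runnable sketch of that key-rotation pattern, with plain Scala collections standing in for the RDD; the `reduceByKey` helper, the sample data, and summing values per j are all assumptions made for illustration:

```scala
object ReduceByJDemo {
  // Minimal local stand-in for RDD.reduceByKey on a Seq of pairs.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Seq[(K, V)] =
    pairs.groupBy(_._1).map { case (k, grp) => (k, grp.map(_._2).reduce(f)) }.toSeq

  def main(args: Array[String]): Unit = {
    val rdd = Seq(
      ((1, 10, 100), 5.0),
      ((2, 10, 200), 7.0),
      ((3, 20, 100), 1.0)
    )
    // Rotate j into the key slot, sum values per j (keeping the first
    // element's i and k), then rotate j back into the composite key.
    val result = reduceByKey(rdd.map { case ((i, j, k), v) => (j, (i, k, v)) }) {
      case ((i, k, v1), (_, _, v2)) => (i, k, v1 + v2)
    }.map { case (j, (i, k, v)) => ((i, j, k), v) }
    result.foreach(println)
  }
}
```

Note that the i and k carried back out are those of the first element in each group, which is arbitrary; in practice the reduce function decides what to do with the non-key components.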
Hi,
How is it possible to reduce by multidimensional keys?
For example, if every line is a tuple like:
(i, j, k, value)
or, alternatively:
((i, j, k), value)

how can Spark handle reducing over j, or over k?