Thanks, Mark and Reynold, for the quick response.
On Mon, Jan 27, 2014 at 5:07 AM, Reynold Xin <[email protected]> wrote:

> While I echo Mark's sentiment, versioning has nothing to do with this
> problem. It has been the case even in Spark 0.8.0.
>
> Note that mapSideCombine is turned off for groupByKey, so there is no need
> to merge any combiners.
>
>
> On Sun, Jan 26, 2014 at 12:22 PM, Archit Thakur
> <[email protected]> wrote:
>
> > Hi,
> >
> > Below is the implementation of groupByKey (v0.8.0):
> >
> >   def groupByKey(partitioner: Partitioner): RDD[(K, Seq[V])] = {
> >     def createCombiner(v: V) = ArrayBuffer(v)
> >     def mergeValue(buf: ArrayBuffer[V], v: V) = buf += v
> >     val bufs = combineByKey[ArrayBuffer[V]](
> >       createCombiner _, mergeValue _, null, partitioner,
> >       mapSideCombine=false)
> >     bufs.asInstanceOf[RDD[(K, Seq[V])]]
> >   }
> >
> > and combineValuesByKey (Aggregator.scala):
> >
> >   def combineValuesByKey(iter: Iterator[_ <: Product2[K, V]]): Iterator[(K, C)] = {
> >     val combiners = new JHashMap[K, C]
> >     for (kv <- iter) {
> >       val oldC = combiners.get(kv._1)
> >       if (oldC == null) {
> >         combiners.put(kv._1, createCombiner(kv._2))
> >       } else {
> >         combiners.put(kv._1, mergeValue(oldC, kv._2))
> >       }
> >     }
> >     combiners.iterator
> >   }
> >
> > My doubt is: why is null being passed for the mergeCombiners closure?
> >
> > If two different partitions have the same key, wouldn't there be a
> > requirement to merge them afterwards?
> >
> > Thanks,
> > Archit.
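For anyone else following the thread, here is a minimal, self-contained sketch (plain Scala, no Spark dependency; the partition data and the names GroupByKeySketch, mapPartition1, mapPartition2 are made up for illustration) of why mergeCombiners is never needed when mapSideCombine is false: the shuffle routes all raw values for a key to a single reducer, so one combineValuesByKey-style pass over them builds the complete group.

  import scala.collection.mutable.{ArrayBuffer, HashMap}

  object GroupByKeySketch {
    // Reduce-side combining: the same logic as the quoted
    // Aggregator.combineValuesByKey, specialized to the groupByKey
    // combiner functions (createCombiner / mergeValue).
    def combineValuesByKey[K, V](iter: Iterator[(K, V)]): Iterator[(K, ArrayBuffer[V])] = {
      val combiners = HashMap.empty[K, ArrayBuffer[V]]
      for ((k, v) <- iter) {
        combiners.get(k) match {
          case None      => combiners(k) = ArrayBuffer(v)   // createCombiner
          case Some(buf) => buf += v                        // mergeValue
        }
      }
      combiners.iterator
    }

    def main(args: Array[String]): Unit = {
      // Two hypothetical map-side partitions that both contain key "a".
      val mapPartition1 = Seq("a" -> 1, "b" -> 2)
      val mapPartition2 = Seq("a" -> 3, "c" -> 4)

      // With mapSideCombine = false, no combiners are built on the map
      // side. The shuffle delivers every raw (key, value) pair for a
      // given key to the same reduce partition, so the reducer sees all
      // values for "a" at once.
      val shuffledToOneReducer = (mapPartition1 ++ mapPartition2).iterator

      // One reduce-side pass builds the complete group per key. There are
      // never two partial combiners for the same key, so mergeCombiners
      // is never invoked -- which is why groupByKey can pass null for it.
      combineValuesByKey(shuffledToOneReducer).foreach(println)
      // e.g. (a,ArrayBuffer(1, 3)), (b,ArrayBuffer(2)), (c,ArrayBuffer(4))
    }
  }

By contrast, with mapSideCombine = true each map partition would produce its own partial ArrayBuffer per key, and the reduce side would then need a real mergeCombiners to concatenate those partial buffers.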
