[ https://issues.apache.org/jira/browse/SPARK-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Reynold Xin resolved SPARK-2534.
--------------------------------
       Resolution: Fixed
    Fix Version/s: 1.0.2
                   1.1.0

> Avoid pulling in the entire RDD or PairRDDFunctions in various operators
> ------------------------------------------------------------------------
>
>                 Key: SPARK-2534
>                 URL: https://issues.apache.org/jira/browse/SPARK-2534
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
>            Priority: Critical
>             Fix For: 1.1.0, 1.0.2
>
>
> The way groupByKey is written actually pulls the entire PairRDDFunctions into
> the 3 closures, sometimes resulting in gigantic task sizes:
> {code}
>   def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = {
>     // groupByKey shouldn't use map side combine because map side combine does not
>     // reduce the amount of data shuffled and requires all map side data be inserted
>     // into a hash table, leading to more objects in the old gen.
>     def createCombiner(v: V) = ArrayBuffer(v)
>     def mergeValue(buf: ArrayBuffer[V], v: V) = buf += v
>     def mergeCombiners(c1: ArrayBuffer[V], c2: ArrayBuffer[V]) = c1 ++ c2
>     val bufs = combineByKey[ArrayBuffer[V]](
>       createCombiner _, mergeValue _, mergeCombiners _, partitioner,
>       mapSideCombine=false)
>     bufs.mapValues(_.toIterable)
>   }
> {code}
> Changing the functions from def to val would solve it.
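
The capture described in the issue can be seen outside Spark with a small, self-contained sketch. Everything named here (Heavy, ClosureSize, createCombinerDef, viaDef, viaVal) is hypothetical and not part of Spark. Heavy merely stands in for an object like PairRDDFunctions that is expensive to serialize, and plain Java serialization is used as a rough proxy for the way a task's closures are shipped. One deliberate difference from the issue: the sketch eta-expands an instance method rather than a local def, since how local defs are compiled varies between Scala versions, whereas a method value on an instance method always keeps a reference to `this`. It is written for Scala 2, where the `method _` syntax is accepted.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for PairRDDFunctions: a serializable object
// that carries a large payload.
class Heavy(val payload: Array[Byte]) extends Serializable {

  // Instance method. It never touches `payload`, but turning it into a
  // function value still requires a receiver, so the function holds `this`.
  def createCombinerDef(v: Int): ArrayBuffer[Int] = ArrayBuffer(v)

  // def-based: eta-expansion of an instance method captures `this`.
  def viaDef: Int => ArrayBuffer[Int] = createCombinerDef _

  // val-based: a function literal that references nothing on `this`.
  def viaVal: Int => ArrayBuffer[Int] = {
    val createCombiner = (v: Int) => ArrayBuffer(v)
    createCombiner
  }
}

object ClosureSize {
  // Size in bytes of `obj` after Java serialization.
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.size()
  }

  def main(args: Array[String]): Unit = {
    val heavy = new Heavy(new Array[Byte](1 << 20)) // 1 MiB payload
    println(s"def-based closure: ${serializedSize(heavy.viaDef)} bytes")
    println(s"val-based closure: ${serializedSize(heavy.viaVal)} bytes")
  }
}
```

With a recent Scala 2 compiler the def-based function should serialize together with the 1 MiB payload, while the val-based one should stay small. Exact sizes depend on the compiler and JVM version, so the numbers are best treated as indicative only.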