[jira] [Updated] (SPARK-2534) Avoid pulling in the entire RDD or PairRDDFunctions in various operators
[ https://issues.apache.org/jira/browse/SPARK-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2534: - Component/s: Spark Core Avoid pulling in the entire RDD or PairRDDFunctions in various operators Key: SPARK-2534 URL: https://issues.apache.org/jira/browse/SPARK-2534 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.1.0, 1.0.2 The way groupByKey is written actually pulls the entire PairRDDFunctions into the 3 closures, sometimes resulting in gigantic task sizes: {code} def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = { // groupByKey shouldn't use map side combine because map side combine does not // reduce the amount of data shuffled and requires all map side data be inserted // into a hash table, leading to more objects in the old gen. def createCombiner(v: V) = ArrayBuffer(v) def mergeValue(buf: ArrayBuffer[V], v: V) = buf += v def mergeCombiners(c1: ArrayBuffer[V], c2: ArrayBuffer[V]) = c1 ++ c2 val bufs = combineByKey[ArrayBuffer[V]]( createCombiner _, mergeValue _, mergeCombiners _, partitioner, mapSideCombine=false) bufs.mapValues(_.toIterable) } {code} Changing the functions from def to val would solve it. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2534) Avoid pulling in the entire RDD or PairRDDFunctions in various operators
[ https://issues.apache.org/jira/browse/SPARK-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2534: --- Summary: Avoid pulling in the entire RDD or PairRDDFunctions in various operators (was: Avoid pulling in the entire RDD in groupByKey) Avoid pulling in the entire RDD or PairRDDFunctions in various operators Key: SPARK-2534 URL: https://issues.apache.org/jira/browse/SPARK-2534 Project: Spark Issue Type: Bug Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical The way groupByKey is written actually pulls the entire PairRDDFunctions into the 3 closures, sometimes resulting in gigantic task sizes: {code} def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = { // groupByKey shouldn't use map side combine because map side combine does not // reduce the amount of data shuffled and requires all map side data be inserted // into a hash table, leading to more objects in the old gen. def createCombiner(v: V) = ArrayBuffer(v) def mergeValue(buf: ArrayBuffer[V], v: V) = buf += v def mergeCombiners(c1: ArrayBuffer[V], c2: ArrayBuffer[V]) = c1 ++ c2 val bufs = combineByKey[ArrayBuffer[V]]( createCombiner _, mergeValue _, mergeCombiners _, partitioner, mapSideCombine=false) bufs.mapValues(_.toIterable) } {code} Changing the functions from def to val would solve it. -- This message was sent by Atlassian JIRA (v6.2#6252)