[jira] [Updated] (SPARK-2534) Avoid pulling in the entire RDD or PairRDDFunctions in various operators

2014-08-05 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-2534:
-

Component/s: Spark Core

 Avoid pulling in the entire RDD or PairRDDFunctions in various operators
 

 Key: SPARK-2534
 URL: https://issues.apache.org/jira/browse/SPARK-2534
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical
 Fix For: 1.1.0, 1.0.2


 The way groupByKey is written actually pulls the entire PairRDDFunctions into 
 the 3 closures, sometimes resulting in gigantic task sizes:
 {code}
   def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = {
 // groupByKey shouldn't use map side combine because map side combine 
 does not
 // reduce the amount of data shuffled and requires all map side data be 
 inserted
 // into a hash table, leading to more objects in the old gen.
 def createCombiner(v: V) = ArrayBuffer(v)
 def mergeValue(buf: ArrayBuffer[V], v: V) = buf += v
 def mergeCombiners(c1: ArrayBuffer[V], c2: ArrayBuffer[V]) = c1 ++ c2
 val bufs = combineByKey[ArrayBuffer[V]](
   createCombiner _, mergeValue _, mergeCombiners _, partitioner, 
 mapSideCombine=false)
 bufs.mapValues(_.toIterable)
   }
 {code}
 Changing the functions from def to val would solve it. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2534) Avoid pulling in the entire RDD or PairRDDFunctions in various operators

2014-07-16 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2534:
---

Summary: Avoid pulling in the entire RDD or PairRDDFunctions in various 
operators  (was: Avoid pulling in the entire RDD in groupByKey)

 Avoid pulling in the entire RDD or PairRDDFunctions in various operators
 

 Key: SPARK-2534
 URL: https://issues.apache.org/jira/browse/SPARK-2534
 Project: Spark
  Issue Type: Bug
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical

 The way groupByKey is written actually pulls the entire PairRDDFunctions into 
 the 3 closures, sometimes resulting in gigantic task sizes:
 {code}
   def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = {
 // groupByKey shouldn't use map side combine because map side combine 
 does not
 // reduce the amount of data shuffled and requires all map side data be 
 inserted
 // into a hash table, leading to more objects in the old gen.
 def createCombiner(v: V) = ArrayBuffer(v)
 def mergeValue(buf: ArrayBuffer[V], v: V) = buf += v
 def mergeCombiners(c1: ArrayBuffer[V], c2: ArrayBuffer[V]) = c1 ++ c2
 val bufs = combineByKey[ArrayBuffer[V]](
   createCombiner _, mergeValue _, mergeCombiners _, partitioner, 
 mapSideCombine=false)
 bufs.mapValues(_.toIterable)
   }
 {code}
 Changing the functions from def to val would solve it. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)