Re: partitioning via groupByKey

2014-03-19 Thread Jaka Jančar
The former: a single new RDD is returned. Check the PairRDDFunctions docs (http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions):

    def groupByKey(): RDD[(K, Seq[V])]
    Group the values for each key in the RDD into a single sequence.

On Wednesday, March 19,
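To make the "single RDD, several partitions" point concrete, here is a minimal sketch in plain Scala. It is not Spark code and does not use the Spark API. The names `HashPartitionSketch`, `partitionFor` and `groupByKeyModel` are hypothetical, and the non-negative modulo of `hashCode` is an assumed stand-in for hash partitioning. It only models the idea that grouped records land in a partition chosen by their key while the result stays one collection.

```scala
// Plain-Scala model of hash partitioning; this is NOT Spark and uses no Spark classes.
object HashPartitionSketch {

  // Map a key to a partition index in [0, numPartitions).
  // The adjustment keeps the index non-negative when hashCode is negative.
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Group the values for each key, then place every (key, values) entry in the
  // partition its key hashes to. The result is ONE collection with
  // numPartitions slices, not several independent collections.
  def groupByKeyModel[K, V](records: Seq[(K, V)], numPartitions: Int): Vector[Map[K, Seq[V]]] = {
    val grouped: Map[K, Seq[V]] =
      records.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    Vector.tabulate(numPartitions) { p =>
      grouped.filter { case (k, _) => partitionFor(k, numPartitions) == p }
    }
  }

  def main(args: Array[String]): Unit = {
    val records = Seq("a" -> 1, "b" -> 2, "a" -> 3, "c" -> 4, "b" -> 5)
    val partitions = groupByKeyModel(records, 3)
    // Each key shows up in exactly one slice, together with all of its values.
    partitions.zipWithIndex.foreach { case (part, i) => println(s"partition $i: $part") }
  }
}
```

In this model the "nodes" are just the slices of the returned `Vector`. The shuffle changes which slice holds a key, not how many collections exist.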

partitioning via groupByKey

2014-03-19 Thread Adrian Mocanu
When you partition via groupByKey, tuples (parts of the RDD) are moved from one node to another based on key (hash partitioning). Do the tuples remain part of one RDD as before, just moved to different nodes, or does this shuffling create, say, several RDDs which will have parts of the original