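The former: groupByKey returns a single new RDD. The shuffle only changes which
node holds each tuple; it does not split the data into several RDDs.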
Check the PairRDDFunctions docs
(http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions):
def groupByKey(): RDD[(K, Seq[V])]
Group the values for each key in the RDD into a single sequence.
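
To make the "one RDD, reshuffled partitions" point concrete, here is a rough sketch in plain Python. It does not use Spark at all: the function name group_by_key, the list-of-lists representation of partitions, and the num_partitions parameter are all made up for illustration, and real Spark uses its own partitioner and hash function. It only shows the idea that the key decides the target partition and that the result is still one dataset.

```python
from collections import defaultdict

def group_by_key(partitions, num_partitions):
    """Simulate a hash-partitioned groupByKey (illustrative only, not Spark's API)."""
    shuffled = [defaultdict(list) for _ in range(num_partitions)]
    for part in partitions:
        for key, value in part:
            # Hash partitioning: the key alone decides the target partition,
            # so all values for a key end up in the same place.
            shuffled[hash(key) % num_partitions][key].append(value)
    # The result is still one dataset: a list of partitions of (key, values) pairs.
    return [sorted(p.items()) for p in shuffled]

# Two input partitions, standing in for data held on two nodes.
before = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")]]
after = group_by_key(before, num_partitions=2)
# after == [[(2, ['b'])], [(1, ['a', 'c']), (3, ['d'])]]
```

Both values for key 1 started in different partitions and ended up together in one, but there is still a single collection before and after.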
On Wednesday, March 19,
When you partition via groupByKey, tuples (parts of the RDD) are moved from one
node to another based on their key (hash partitioning).
Do the tuples remain part of one RDD as before, just moved to different nodes, or
does this shuffling create, say, several RDDs which will have parts of the
original