When a MappedRDD is handled by the groupByKey transformation, tuples with the same key that are distributed across different worker nodes will be collected onto one worker node, i.e. (K, V1), (K, V2), ..., (K, Vn) -> (K, Seq(V1, V2, ..., Vn)).
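As an aside, the collapse described above can be sketched in plain Python (no Spark needed; the function names group_by_key and reduce_by_key are illustrative, mimicking the Spark operations locally). It also contrasts groupByKey with a reduceByKey-style fold, which keeps only one running aggregate per key instead of the full Seq:

```python
from collections import defaultdict

pairs = [("K", 1), ("K", 2), ("K", 3), ("J", 10)]

def group_by_key(pairs):
    # groupByKey semantics: every value for a key is materialized in one
    # sequence -- in Spark, on a single node. This is where OOM risk lives.
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return dict(grouped)

def reduce_by_key(pairs, f):
    # reduceByKey-style semantics: values are combined pairwise as they
    # arrive, so only one partial aggregate per key is ever held in memory.
    acc = {}
    for k, v in pairs:
        acc[k] = f(acc[k], v) if k in acc else v
    return acc

print(group_by_key(pairs))                        # {'K': [1, 2, 3], 'J': [10]}
print(reduce_by_key(pairs, lambda a, b: a + b))   # {'K': 6, 'J': 10}
```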
I want to know whether the value Seq(V1, V2, ..., Vn) of a tuple in the grouped RDD can span multiple nodes, or whether it must reside on a single node, when I set the number of partitions in groupByKey. If Seq(V1, V2, ..., Vn) can only reside in the memory of one machine, there is an out-of-memory risk whenever its size exceeds that machine's JVM memory limit. If that case happens, how should we deal with it?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-groupByKey-partition-out-of-memory-tp13669.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.