I am currently using Spark Streaming. During each batch I group my records with groupByKey, then call foreachRDD and foreachPartition and write the results to an external datastore.
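For context, the per-partition write pattern described above can be sketched roughly as follows. This is a plain-Python simulation, not Spark code: `DataStoreClient` is a hypothetical stand-in for a real datastore client, and `write_partition` plays the role of the function you would pass to foreachPartition (so the non-serializable client is created once per partition on the executor, not on the driver).

```python
# Sketch of the foreachPartition write pattern, assuming a hypothetical client.

class DataStoreClient:
    """Hypothetical datastore client; records writes in memory for illustration."""
    store = {}

    def __init__(self):
        self.open = True

    def write(self, key, values):
        # Persist all values for one key in a single write.
        DataStoreClient.store[key] = values

    def close(self):
        self.open = False


def write_partition(partition):
    # In Spark, foreachPartition invokes a function like this once per
    # partition, with an iterator of (key, values) pairs after groupByKey.
    client = DataStoreClient()
    try:
        for key, values in partition:
            client.write(key, values)
    finally:
        client.close()


# Simulated groupByKey output: each (key, values) pair lives in one partition.
partitions = [
    [("lol", [1, 2, 3])],
    [("ee", [4, 5]), ("aa", [6])],
]
for p in partitions:
    write_partition(p)
```

The one-connection-per-partition shape is the usual design for streaming writes: creating the client per record would be far too expensive, and creating it on the driver would fail serialization.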
My only concern is whether this is future-proof. I know groupByKey uses the HashPartitioner by default. To check, I loaded large text files into memory, ran groupByKey, and printed out the internals of the partitions. I have two questions:

#1 Will my implementation ever break in the future? Could partitions or groupByKey work differently in a later release?

#2 After groupByKey, is it possible for the values of one key to exist on more than one partition?

Notes: I'm aware that groupByKey is not very efficient, but I am not working with large amounts of data and can process batches very quickly. In the test below I could have used aggregateByKey, since I only printed the sum; however, my real implementation is quite different, and I need every value for each key, so I cannot reduce the data.

Output from a 1-million-line test log file:

Partition HashCode: 965943941  Key: lol Size: 2346
Partition HashCode: 1605678983 Key: ee  Size: 4692
Partition HashCode: 1605678983 Key: aa  Size: 32844
Partition HashCode: 1605678983 Key: gg  Size: 4692
Partition HashCode: 1605678983 Key: dd  Size: 11730
Partition HashCode: 1605678983 Key: hh  Size: 4692
Partition HashCode: 1605678983 Key: kk  Size: 2346
Partition HashCode: 1605678983 Key: tt  Size: 4692
Partition HashCode: 1605678983 Key: ff  Size: 2346
Partition HashCode: 1605678983 Key: bb  Size: 18768
Partition HashCode: 1605678983 Key: cc  Size: 14076

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-groupByKey-does-it-always-create-at-least-1-partition-per-key-tp22938.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
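On question #2: with a HashPartitioner the target partition is a pure function of the key (Spark computes a non-negative `key.hashCode % numPartitions`), so after the shuffle all values for a key land in one partition, and groupByKey emits a single (key, values) pair there. The following pure-Python simulation illustrates that invariant; Python's built-in `hash` stands in for Java's `hashCode`, and `%` with a positive divisor is already non-negative in Python, matching Spark's non-negative mod.

```python
from collections import defaultdict

def hash_partition(key, num_partitions):
    # Stand-in for Spark's HashPartitioner: a deterministic, key-only mapping.
    return hash(key) % num_partitions

records = [("aa", 1), ("bb", 2), ("aa", 3), ("cc", 4), ("bb", 5), ("aa", 6)]
num_partitions = 4

# Simulate the shuffle groupByKey performs: route every record for a key to
# the single partition its hash selects, then group values inside it.
partitions = defaultdict(lambda: defaultdict(list))
for key, value in records:
    partitions[hash_partition(key, num_partitions)][key].append(value)

# Invariant: each key appears in exactly one partition.
for key in {k for k, _ in records}:
    owners = [p for p, groups in partitions.items() if key in groups]
    assert len(owners) == 1
```

This matches the printed output above, where every key reports exactly one partition hash code (several keys sharing partition 1605678983 is expected; the guarantee is one partition per key, not one key per partition).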