I am currently using Spark Streaming. During my batch processing I must
call groupByKey; afterwards I call foreachRDD and foreachPartition and
write to an external datastore.
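
Roughly, the shape of the job looks like this (a simplified sketch;
extractKey and the datastore client are placeholders, not my real code):

    stream
      .map(line => (extractKey(line), line))   // pair each record with its key
      .groupByKey()                            // gather all values for a key
      .foreachRDD { rdd =>
        rdd.foreachPartition { iter =>
          val store = connectToStore()         // placeholder: one connection per partition
          iter.foreach { case (key, values) => store.write(key, values) }
          store.close()
        }
      }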

My only concern is whether this is future-proof. I know groupByKey uses
the HashPartitioner by default. To check, I have loaded large text files
into memory, run groupByKey, and printed out the contents of the
resulting partitions.
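
For reference, the default partitioner's behavior is easy to check in
isolation (a minimal illustration, not taken from my job):

    import org.apache.spark.HashPartitioner

    // HashPartitioner derives a partition index from the key's hashCode,
    // so the same key always maps to the same partition index.
    val p = new HashPartitioner(4)
    println(p.getPartition("aa"))  // repeated calls return the same index
    println(p.getPartition("bb"))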

I have two questions.
#1 Will my implementation ever break in the future? Will partitions and
groupByKey work differently?
#2 Is it possible for a (key, values) pair to exist on more than one
partition after using groupByKey?

Notes: I'm aware groupByKey is not very efficient. However, I am not
working with large amounts of data and can process batches very quickly.
Below I could have used aggregateByKey, since all I printed was a sum;
my real implementation is quite different, though, and I need every
value for each key, so I cannot reduce the data.
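
Roughly what I ran to produce the output below (simplified; grouped is
the RDD after groupByKey, and the partition label here is just the
iterator's identity hash code):

    grouped.foreachPartition { iter =>
      val partId = System.identityHashCode(iter)  // stand-in partition label
      iter.foreach { case (key, values) =>
        println(s"Partition HashCode: $partId Key:$key Size:${values.size}")
      }
    }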

Output from a 1-million-line test log file:
Partition HashCode: 965943941 Key:lol Size:2346
Partition HashCode: 1605678983 Key:ee Size:4692
Partition HashCode: 1605678983 Key:aa Size:32844
Partition HashCode: 1605678983 Key:gg Size:4692
Partition HashCode: 1605678983 Key:dd Size:11730
Partition HashCode: 1605678983 Key:hh Size:4692
Partition HashCode: 1605678983 Key:kk Size:2346
Partition HashCode: 1605678983 Key:tt Size:4692
Partition HashCode: 1605678983 Key:ff Size:2346
Partition HashCode: 1605678983 Key:bb Size:18768
Partition HashCode: 1605678983 Key:cc Size:14076
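
For comparison, if all I needed were the per-key counts, an
aggregateByKey version might look like this (pairs being the keyed RDD
before grouping):

    // count values per key without materializing all of them
    val counts = pairs.aggregateByKey(0)(
      (acc, _) => acc + 1,  // seqOp: count within a partition
      _ + _                 // combOp: merge partial counts across partitions
    )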



