By definition, all the values for a given key end up in a single
partition: groupByKey shuffles the data with a Partitioner
(HashPartitioner by default), and the partition index is a
deterministic function of the key alone. This is some of the oldest
API in Spark and will continue to work as it does now.
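
If you want to check this empirically, here is a minimal sketch
(assuming a local SparkContext named sc) that counts how many distinct
partitions each key occupies after groupByKey; the assertion never
fires:

import org.apache.spark.HashPartitioner

// Duplicate keys deliberately scattered across input partitions.
val pairs = sc.parallelize(
  Seq(("aa", 1), ("bb", 2), ("aa", 3), ("cc", 4), ("aa", 5)),
  numSlices = 3)

// groupByKey shuffles with this partitioner (HashPartitioner is
// also the default).
val grouped = pairs.groupByKey(new HashPartitioner(4))

// For every key, collect the indices of the partitions holding it,
// then count how many distinct partitions that is.
val partitionsPerKey = grouped
  .mapPartitionsWithIndex((idx, iter) =>
    iter.map { case (k, _) => (k, idx) })
  .groupByKey()
  .mapValues(_.toSet.size)
  .collect()

partitionsPerKey.foreach { case (k, n) =>
  assert(n == 1, s"key $k appears in $n partitions")
}

A sketch of the foreachRDD/foreachPartition write pattern you describe
follows your quoted message below.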

On Mon, May 18, 2015 at 10:38 AM, tomboyle <icyn...@gmail.com> wrote:

> I am currently using Spark Streaming. During my batch processing I
> must groupByKey. Afterwards I call foreachRDD and foreachPartition
> and write to an external datastore.
>
> My only concern is whether this is future-proof. I know groupByKey
> uses the HashPartitioner by default. To make sure, I have loaded
> large text files into memory, run groupByKey, and printed out the
> contents of the resulting partitions.
>
> I have two questions.
> #1: Will my implementation ever break in the future? Will partitions
> and groupByKey work differently?
> #2: Is it possible for a (key, values) pair to exist on more than one
> partition after using groupByKey?
>
> Notes: I'm aware that groupByKey is not very efficient. However, I am
> not working with large amounts of data and can process batches very
> quickly. Below I could have used aggregateByKey, since I only print a
> per-key size, but my real implementation is quite different: I need
> every value for each key and cannot reduce the data.
>
> One-million-line test log file:
> Partition HashCode: 965943941 Key:lol Size:2346
> Partition HashCode: 1605678983 Key:ee Size:4692
> Partition HashCode: 1605678983 Key:aa Size:32844
> Partition HashCode: 1605678983 Key:gg Size:4692
> Partition HashCode: 1605678983 Key:dd Size:11730
> Partition HashCode: 1605678983 Key:hh Size:4692
> Partition HashCode: 1605678983 Key:kk Size:2346
> Partition HashCode: 1605678983 Key:tt Size:4692
> Partition HashCode: 1605678983 Key:ff Size:2346
> Partition HashCode: 1605678983 Key:bb Size:18768
> Partition HashCode: 1605678983 Key:cc Size:14076
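
For completeness, here is a rough sketch of the write pattern described
above (groupByKey, then foreachRDD and foreachPartition). ExternalStore
and its connect/write/close methods are placeholders for whatever
datastore client is in use, not part of Spark; stream is assumed to be
a pair DStream:

stream
  .groupByKey()
  .foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      // One connection per partition, not per record; this code runs
      // on the executor that owns the partition.
      val conn = ExternalStore.connect()
      try {
        partition.foreach { case (key, values) =>
          conn.write(key, values)
        }
      } finally {
        conn.close()
      }
    }
  }

Since all the values for a key land in a single partition, each
(key, values) record is written exactly once per batch.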
