If I have an RDD that already happens to be partitioned by a key, how efficient can I expect a groupBy operation on that key to be? I would expect that Spark shouldn't need to shuffle data between nodes, and should only do a small amount of work checking the partitions to discover that nothing needs to move.
Now, what if we're talking about a Parquet dataset created with DataFrameWriter.partitionBy(...)? Will Spark SQL be smart enough to avoid a shuffle when I group by a key the data is already partitioned by? - Philip