If I have an RDD that is already partitioned by a key, how efficient can I
expect a groupBy operation on that key to be? I would expect that Spark
shouldn't have to move any data between nodes, and should only need a small
amount of work to inspect the existing partitioner and discover that
nothing has to be shuffled.
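
Concretely, something like this in spark-shell (a toy sketch; the data and
the choice of HashPartitioner are just placeholders):

    import org.apache.spark.HashPartitioner

    // sc is predefined in spark-shell.
    val byKey = sc
      .parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
      .partitionBy(new HashPartitioner(8))

    // groupByKey() with no arguments picks up the existing partitioner,
    // so I'd expect no second shuffle beyond the one partitionBy
    // already performed.
    val grouped = byKey.groupByKey()
    println(grouped.toDebugString)  // lineage should show no new ShuffledRDD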

Now, what if we're talking about a Parquet dataset created with
DataFrameWriter.partitionBy(...)? Will Spark SQL be smart enough to avoid
a shuffle when I group by a key that the data is already partitioned by?
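
Again as a sketch (the path and the toy DataFrame are made up):

    // spark is predefined in spark-shell.
    val df = spark.range(100).selectExpr("id % 10 AS key", "id AS value")

    // Write a Parquet dataset partitioned on disk by "key".
    df.write.mode("overwrite").partitionBy("key").parquet("/tmp/events")

    // Read it back and aggregate on that same column.
    val counts = spark.read.parquet("/tmp/events").groupBy("key").count()
    counts.explain()  // does the physical plan still contain an Exchange?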

- Philip
