If I have an RDD that already happens to be partitioned by a key, how efficient can I expect a groupBy operation on that key to be? I would expect that Spark shouldn't need to shuffle data between nodes, and should only do a small amount of work checking the partitions to discover that nothing needs to move.
Now, what if we're talking about a Parquet dataset created with DataFrameWriter.partitionBy(...)? Will Spark SQL be smart enough to avoid a shuffle when I group by a key the data is already partitioned by? - Philip