Posting this on behalf of John Humphreys, who asked it in the MapR community; I think it may benefit all users:
https://community.mapr.com/thread/22719-re-how-can-i-partition-data-in-drill

1. If I had Spark re-partition a data frame based on a column and then saved the data frame to Parquet, this post indicates that Drill would query on that column faster, correct?
2. Does the coalesce count (the number of .snappy.parquet files inside the overall Parquet directory) make a big difference? Spark defaults to 200.
3. Does sorting the data help too, or does partitioning sort it implicitly?

Thanks,
Saurabh
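For context, the setup behind questions 1 and 2 can be sketched in PySpark roughly as below. This is a minimal sketch, not a tested recipe: it assumes a working Spark installation, and the column name `event_date` plus the paths `input_path` and `output_path` are hypothetical placeholders.

```python
# Hypothetical sketch: write a DataFrame partitioned by a column so that
# engines reading the Parquet output (e.g. Drill) can prune partitions,
# and control the number of output files per partition.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-example").getOrCreate()

df = spark.read.parquet("input_path")  # placeholder path

(df
 .repartition("event_date")            # shuffle so rows with the same value land together
 .sortWithinPartitions("event_date")   # partitioning alone does not sort rows within files
 .write
 .partitionBy("event_date")            # one directory per value -> pruning at query time
 .parquet("output_path"))              # placeholder path

# To shrink the file count from the shuffle default of 200 partitions:
# df.coalesce(8).write.partitionBy("event_date").parquet("output_path")
```

Note that `partitionBy` on the writer (directory layout) and `repartition`/`coalesce` (in-memory partition and output-file count) are separate knobs, which is part of what questions 1 and 2 are asking about.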