Hello,

I think there is some confusion around the term "repartitioning", since it can be understood in two different ways:

* changing the number of partitions independently of the content
* regrouping rows that share the same value in a given column into the same partitions (potentially changing the number of partitions at the same time)

And this comes from the Spark API itself: there is a repartition method on the DataFrame that supports both, and a partitionBy on the DataFrameWriter that takes care of laying out the partition directories.
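To make the two meanings concrete, here is a minimal sketch (Spark 2.x Scala API; the input path, the DataFrame df and the column "country" are just hypothetical examples):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("partitioning-demo").getOrCreate()
    val df = spark.read.parquet("/data/events")   // hypothetical input path

    // Meaning 1: change the number of partitions, independent of content.
    // Every task in the write stage emits one Parquet file, so this gives 50 files.
    df.repartition(50).write.parquet("/data/out_plain")

    // Meaning 2: regroup rows by the value of a column (hash partitioning),
    // here also fixing the partition count at 50. Rows with the same country
    // end up in the same partition, but are not sorted.
    df.repartition(50, col("country")).write.parquet("/data/out_grouped")

    // DataFrameWriter.partitionBy is different again: it manages the directory
    // layout on disk, writing one subdirectory per distinct value,
    // e.g. /data/out_dirs/country=US/...
    df.write.partitionBy("country").parquet("/data/out_dirs")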
1) Repartition without grouping in Spark does impact the number of files that are generated, since every task in the write stage produces one Parquet file. It does not magically speed up scanning by itself, beyond the fact that multiple files increase the parallelism on the reading side.

2) Coalesce is functionally equivalent to repartition without grouping, but is better in the sense that it does not force a shuffle, since partitions can be merged inside the same workers; the downside is that it can lead to skewed partition sizes. There is a nice explanation here:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html

I found that Spark's default of 200 shuffle partitions is almost always wrong: too big for small datasets, too small for large ones. There are some initiatives in Spark around this point, but it is probably better to find your local optimum yourself.

3) Sorting the data helps, because it produces tight value ranges across the files and inside the row groups of each file. This lets Drill aggressively prune whole files, or whole row groups within files, at planning time, which leads to very nice speed-ups. (Partitioning alone does not sort the data implicitly.) There is a presentation around these topics here:
https://conferences.oreilly.com/strata/strata-ny-2016/public/schedule/detail/52110

Regards,
Joel

On Tue, Feb 6, 2018 at 9:00 PM, Saurabh Mahapatra <saurabhmahapatr...@gmail.com> wrote:

> Posting this for John Humphreys who posted this in the MapR community but I
> think this may benefit all users:
>
> https://community.mapr.com/thread/22719-re-how-can-i-partition-data-in-drill
>
> 1. If I had Spark re-partition a data frame based on a column, and then
>    saved the data frame to parquet, this post is indicating that drill
>    would query based on that column faster, correct?
> 2. Does the coalesce # (the number of .snappy.parquet files inside the
>    whole parquet file) make a big difference? Spark defaults to 200.
> 3. Also, does sorting the data help too? Or does partitioning sort it
>    implicitly?
>
> Thanks,
> Saurabh
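P.S. A minimal write-side sketch tying points 1-3 together (again Spark 2.x Scala; the paths, the shuffle-partition count, the coalesce target, and the event_date column are hypothetical, and as said above the right numbers depend on your dataset):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("drill-friendly-write").getOrCreate()

    // Point 2: replace the 200-partition shuffle default with your own optimum.
    spark.conf.set("spark.sql.shuffle.partitions", "64")

    val df = spark.read.parquet("/data/events")   // hypothetical input path

    // Point 2: coalesce merges partitions inside the same workers (no shuffle),
    // cheaper than repartition(16), but the merged partitions can end up skewed.
    df.coalesce(16).write.parquet("/data/out_coalesced")

    // Point 3: a global sort range-partitions the rows, so each of the 64 output
    // files (and each row group inside them) covers a tight, non-overlapping
    // range of event_date; Drill can then prune whole files or row groups at
    // planning time when a query filters on that column.
    df.sort("event_date").write.parquet("/data/out_sorted")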