Hello,

I think there is some confusion around the name "repartitioning", since it
can be understood in two different ways:
 * changing the number of partitions independently of the content
 * regrouping data with the same value in a given column into the same
partition (potentially changing the number of partitions at the same time).
This ambiguity comes from the Spark API itself: the DataFrame repartition
API supports both, while the DataFrameWriter partitionBy API takes care of
managing the partitioning directories.
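To make the distinction concrete, here is a small Scala sketch of the three
variants (the paths and the "country" column are made up for illustration):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("partition-demo").getOrCreate()
    val df = spark.read.parquet("/data/events")  // hypothetical input path

    // 1) Change the number of partitions only: the write stage runs
    //    50 tasks and therefore produces 50 Parquet files.
    df.repartition(50).write.parquet("/data/out_by_count")

    // 2) Regroup rows sharing the same "country" value into the same
    //    partition (the partition count can change at the same time).
    df.repartition(col("country")).write.parquet("/data/out_by_column")

    // 3) DataFrameWriter.partitionBy: one directory per distinct
    //    "country" value, e.g. /data/out_dirs/country=FR/.
    df.write.partitionBy("country").parquet("/data/out_dirs")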

1) Repartition without grouping in Spark does impact the number of files
that will be generated (every task in the write stage produces one Parquet
file). It does not magically speed up scanning, except that multiple files
increase parallelism on the read side.

2) Coalesce is functionally equivalent to repartition without grouping, but
is cheaper in the sense that it does not force a shuffle of the data, since
partitions can be merged inside the same workers. It can, however, lead to
skewed partition sizes.
There is a nice explanation here:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html
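A minimal side-by-side comparison, assuming an existing DataFrame df:

    // coalesce: merge down to 4 partitions without a full shuffle;
    // upstream partitions are combined on the same workers, so the
    // resulting partition sizes can be uneven.
    val merged = df.coalesce(4)

    // repartition: full shuffle, but the 4 resulting partitions are
    // evenly balanced.
    val balanced = df.repartition(4)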

I found that Spark's default of 200 shuffle partitions
(spark.sql.shuffle.partitions) is almost always wrong: it is too big for
small datasets and too small for large ones. There are some initiatives in
Spark around this point, but it is probably better to find your local
optimum yourself.
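For example, to override the default for a session (64 here is an arbitrary
value, purely for illustration):

    // spark.sql.shuffle.partitions controls how many partitions (and
    // therefore tasks) every shuffle produces; tune it to your data volume.
    spark.conf.set("spark.sql.shuffle.partitions", "64")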

3) Sorting the data helps, because it produces tight data ranges across the
various files and inside the row groups of each file. This allows Drill to
aggressively prune whole files, or whole row groups inside files, at
planning time, which leads to very nice speed-ups.
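For instance, assuming queries mostly filter on a hypothetical event_date
column:

    // Group rows by event_date, then sort within each partition so that
    // every output file and every row group inside it covers a narrow
    // event_date range, which the Parquet min/max statistics expose to
    // Drill's pruning.
    df.repartition(col("event_date"))
      .sortWithinPartitions("event_date")
      .write
      .parquet("/data/out_sorted")  // hypothetical output path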

There is a presentation around these topics here:
https://conferences.oreilly.com/strata/strata-ny-2016/public/schedule/detail/52110

Regards, Joel


On Tue, Feb 6, 2018 at 9:00 PM, Saurabh Mahapatra <
saurabhmahapatr...@gmail.com> wrote:

> Posting this for John Humphreys who posted this in the MapR community but I
> think this may benefit all users:
>
> https://community.mapr.com/thread/22719-re-how-can-i-partition-data-in-drill
>
>
>    1. If I had Spark re-partition a data frame based on a column, and then
>    saved the data frame to parquet, this post is indicating that Drill
>    would query based on that column faster, correct?
>    2. Does the coalesce # (the number of .snappy.parquet files inside the
>    whole parquet file) make a big difference?  Spark defaults to 200.
>    3. Also, does sorting the data help too?  Or does partitioning sort it
>    implicitly?
>
>
> Thanks,
> Saurabh
>
