Re: [Python] dataset filter performance and partitioning

Joris Van den Bossche Fri, 25 Sep 2020 05:07:09 -0700

Hi Josh,

Thanks for the question!

In general, filtering on partition columns will be faster than filtering on
actual data columns using row group statistics. For partition-based
filtering, the scanner can skip full files based on the information from
the file path (in your example case, there are 11 files, for which it can
select 4 of them to actually read), while for row-group-based filtering, it
actually needs to parse the statistics of all row groups of all files to
determine which can be skipped, which is typically more information to
process compared to the file paths.

That said, there are some oddities I noticed:

- As I mentioned, I expect partition-based filtering to faster, but not
that much faster (certainly in a case with a limited number of files, the
overhead of the parsing / filtering row groups should be really minimal)
- Inspecting the result a bit, it seems that for the first dataset (without
partitioning) it's not actually applying the filter correctly. The min/max
for the name column are included in the row group statistics, but the isin
filter didn't actually filter them out. Something to investigate, but that
certainly explains the difference in performance (it's actually reading all
data, and only filtering after reading, not skipping some parts before
reading)
- In your case, the partitioning has the same name as one of the actual
columns in the data files. I am not sure this corner case of duplicate
fields is tested very well, or how the filtering will work?

Joris

On Thu, 24 Sep 2020 at 21:02, Josh Mayer <[email protected]> wrote:

> I am comparing two datasets with a filter on a string column (that is also
> a partition column). I create the dataset from a common metadata file. In
> the first case I omit the partitioning information whereas in the second I
> include it. I would expect the performance to be similar since the column
> statistics should be able to identify the same row groups as the
> partitioning. However, I'm seeing the first case run almost 3x slower. Is
> this expected?
>
> An example is here (I'm running on linux, python 3.8, pyarrow 1.0.1):
>
> https://gist.github.com/josham/5d7cf52f9ea60b1b2bbef1e768ea992f
>

Re: [Python] dataset filter performance and partitioning

Reply via email to