[Python] dataset filter performance and partitioning

Josh Mayer Thu, 24 Sep 2020 12:02:56 -0700

I am comparing two datasets with a filter on a string column (that is also
a partition column). I create the dataset from a common metadata file. In
the first case I omit the partitioning information whereas in the second I
include it. I would expect the performance to be similar since the column
statistics should be able to identify the same row groups as the
partitioning. However, I'm seeing the first case run almost 3x slower. Is
this expected?
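
To make that expectation concrete, here is a toy model in plain Python. It is not pyarrow code and says nothing about how pyarrow works internally. The RowGroup class, the "key" column name, the file paths, and the min/max values are all made up for illustration. It only shows that, when each file holds a single partition value, pruning on min/max statistics and pruning on the hive-style path pick the same row groups.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical stand-in for one row group's metadata."""
    path: str     # hive-style file path, e.g. "key=a/part-0.parquet"
    key_min: str  # min statistic of the string column in this row group
    key_max: str  # max statistic of the string column in this row group


def prune_by_statistics(row_groups, value):
    # Keep a row group only if `value` can fall inside its [min, max] range.
    return [rg for rg in row_groups if rg.key_min <= value <= rg.key_max]


def prune_by_partition(row_groups, column, value):
    # Keep a row group only if its path has a "column=value" directory segment.
    wanted = f"{column}={value}"
    return [rg for rg in row_groups if wanted in rg.path.split("/")]


# Made-up layout: every file contains exactly one value of "key".
row_groups = [
    RowGroup("key=a/part-0.parquet", "a", "a"),
    RowGroup("key=a/part-1.parquet", "a", "a"),
    RowGroup("key=b/part-0.parquet", "b", "b"),
    RowGroup("key=c/part-0.parquet", "c", "c"),
]

by_stats = prune_by_statistics(row_groups, "b")
by_partition = prune_by_partition(row_groups, "key", "b")

# Both strategies select the same single row group.
assert by_stats == by_partition
print([rg.path for rg in by_stats])
```

In this model both approaches do the same amount of selection work, which is why I expected the timings to be close.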


An example is here (I'm running on Linux, Python 3.8, pyarrow 1.0.1):

https://gist.github.com/josham/5d7cf52f9ea60b1b2bbef1e768ea992f
