Using a small toy example, the "isin" filter is indeed not working for
filtering row groups:

>>> import numpy as np
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> import pyarrow.parquet as pq
>>> table = pa.table({"name": np.repeat(["a", "b", "c", "d"], 5),
...                   "value": np.arange(20)})
>>> pq.write_table(table, "test_filter_string.parquet", row_group_size=5)
>>> dataset = ds.dataset("test_filter_string.parquet")
# get the single file fragment (dataset consists of one file)
>>> fragment = list(dataset.get_fragments())[0]
>>> fragment.ensure_complete_metadata()

# check that we do have statistics for our row groups
>>> fragment.row_groups[0].statistics
{'name': {'min': 'a', 'max': 'a'}, 'value': {'min': 0, 'max': 4}}

# I created the file such that there are 4 row groups (each with a
# unique value in the "name" column)
>>> fragment.split_by_row_group()
[<pyarrow._dataset.ParquetFileFragment at 0x7ff783939810>,
 <pyarrow._dataset.ParquetFileFragment at 0x7ff783728cd8>,
 <pyarrow._dataset.ParquetFileFragment at 0x7ff78376c9c0>,
 <pyarrow._dataset.ParquetFileFragment at 0x7ff7835efd68>]

# simple equality filter works as expected -> only single row group left
>>> filter = ds.field("name") == "a"
>>> fragment.split_by_row_group(filter)
[<pyarrow._dataset.ParquetFileFragment at 0x7ff783662738>]

# isin filter does not work
>>> filter = ds.field("name").isin(["a", "b"])
>>> fragment.split_by_row_group(filter)
[<pyarrow._dataset.ParquetFileFragment at 0x7ff7837f46f0>,
 <pyarrow._dataset.ParquetFileFragment at 0x7ff783627a98>,
 <pyarrow._dataset.ParquetFileFragment at 0x7ff783581b70>,
 <pyarrow._dataset.ParquetFileFragment at 0x7ff7835fb780>]

Filtering with "isin" on partition columns, on the other hand, works fine.
I opened https://issues.apache.org/jira/browse/ARROW-10091 to track this as
a possible enhancement.

Now, to explain why partitions are the "easier" case: the partition
information gets translated into an equality expression, in your example
something like "name == 'a'", while the row group statistics give a
greater-than-or-equal / less-than-or-equal expression built from the
min/max, such as "(name >= 'a') & (name <= 'a')". Comparing an equality
with an "isin" expression like "name in ['a', 'b']" is straightforward; for
the min/max expression we would additionally need to handle the special
case where min and max are equal.
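
In the meantime, a possible workaround is to spell the membership test out
as an explicit disjunction of equalities, since those are the expressions
that the statistics comparison already understands. This is just a sketch;
I haven't checked whether the "or" expression actually gets simplified
against the statistics in the current version:

>>> filter = (ds.field("name") == "a") | (ds.field("name") == "b")
>>> fragment.split_by_row_group(filter)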

Joris

On Fri, 25 Sep 2020 at 14:06, Joris Van den Bossche <
[email protected]> wrote:

> Hi Josh,
>
> Thanks for the question!
>
> In general, filtering on partition columns will be faster than filtering
> on actual data columns using row group statistics. For partition-based
> filtering, the scanner can skip full files based on the information in
> the file paths (in your example there are 11 files, of which it can
> select 4 to actually read). For row-group-based filtering, it has to
> parse the statistics of all row groups of all files to determine which
> ones can be skipped, which is typically more information to process than
> the file paths.
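>
> As a rough sketch of the two setups (the "data/" path and the hive-style
> layout are placeholders here, adjust to your actual files):
>
> >>> import pyarrow.dataset as ds
> >>> # partition-aware: the filter can prune whole files from the paths
> >>> part_ds = ds.dataset("data/", format="parquet", partitioning="hive")
> >>> # without partition info: pruning has to rely on row group statistics
> >>> plain_ds = ds.dataset("data/", format="parquet")
> >>> filt = ds.field("name").isin(["a", "b"])
> >>> part_ds.to_table(filter=filt)
> >>> plain_ds.to_table(filter=filt)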
>
> That said, there are some oddities I noticed:
>
> - As I mentioned, I expect partition-based filtering to be faster, but
> not that much faster (certainly with a limited number of files, the
> overhead of parsing / filtering the row group statistics should be
> really minimal).
> - Inspecting the result a bit, it seems that for the first dataset
> (without partitioning) the filter is not actually being used to skip row
> groups. The min/max for the name column are included in the row group
> statistics, but the isin filter didn't filter any row groups out (see
> the sketch after this list). Something to investigate, but that
> certainly explains the difference in performance: it's actually reading
> all data and only filtering after reading, instead of skipping some
> parts before reading.
> - In your case, the partitioning field has the same name as one of the
> actual columns in the data files. I am not sure this corner case of
> duplicate fields is tested very well, or how the filtering behaves there.
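>
> To check the second point, the raw statistics of one of the files can be
> inspected directly (the file name below is just a placeholder for one of
> your data files):
>
> >>> import pyarrow.parquet as pq
> >>> md = pq.ParquetFile("part-0.parquet").metadata
> >>> md.num_row_groups
> >>> md.row_group(0).column(0).statistics  # min/max of the first column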
>
> Joris
>
> On Thu, 24 Sep 2020 at 21:02, Josh Mayer <[email protected]> wrote:
>
>> I am comparing two datasets with a filter on a string column (that is
>> also a partition column). I create the dataset from a common metadata file.
>> In the first case I omit the partitioning information whereas in the second
>> I include it. I would expect the performance to be similar since the column
>> statistics should be able to identify the same row groups as the
>> partitioning. However, I'm seeing the first case run almost 3x slower. Is
>> this expected?
>>
>> An example is here (I'm running on linux, python 3.8, pyarrow 1.0.1):
>>
>> https://gist.github.com/josham/5d7cf52f9ea60b1b2bbef1e768ea992f
>>
>
