Re: [Python] dataset filter performance and partitioning

Troy Zimmerman Fri, 25 Sep 2020 08:56:42 -0700

Hi Joris,

I have a dumb question — if the “isin” expression is returning all row groups, 
why does it appear to still work?


For example ole, I created a similar toy setup, and while all row groups seem 
to “match” the expression (i.e. I get all fragments with the expression), the 
table that “to_table” returns has only the rows I expect. Does the filtering 
happen again somewhere upstream?

Best,
Troy

> On Sep 25, 2020, at 07:35, Joris Van den Bossche 
> <[email protected]> wrote:
> 
> 
> Using a small toy example, the "isin" filter is indeed not working for 
> filtering row groups:
> 
> >>> table = pa.table({"name": np.repeat(["a", "b", "c", "d"], 5), "value": 
> >>> np.arange(20)})
> >>> pq.write_table(table, "test_filter_string.parquet", row_group_size=5)
> >>> dataset = ds.dataset("test_filter_string.parquet")
> # get the single file fragment (dataset consists of one file)
> >>> fragment = list(dataset.get_fragments())[0]
> >>> fragment.ensure_complete_metadata()
> 
> # check that we do have statistics for our row groups
> >>> fragment.row_groups[0].statistics
> {'name': {'min': 'a', 'max': 'a'}, 'value': {'min': 0, 'max': 4}}
> 
> # I created the file such that there are 4 row groups (each with a unique 
> value in the name column)
> >>> fragment.split_by_row_group()
> [<pyarrow._dataset.ParquetFileFragment at 0x7ff783939810>,
>  <pyarrow._dataset.ParquetFileFragment at 0x7ff783728cd8>,
>  <pyarrow._dataset.ParquetFileFragment at 0x7ff78376c9c0>,
>  <pyarrow._dataset.ParquetFileFragment at 0x7ff7835efd68>]
> 
> # simple equality filter works as expected -> only single row group left
> >>> filter = ds.field("name") == "a"
> >>> fragment.split_by_row_group(filter)
> [<pyarrow._dataset.ParquetFileFragment at 0x7ff783662738>]
> 
> # isin filter does not work
> >>> filter = ds.field("name").isin(["a", "b"])
> >>> fragment.split_by_row_group(filter)
> [<pyarrow._dataset.ParquetFileFragment at 0x7ff7837f46f0>,
>  <pyarrow._dataset.ParquetFileFragment at 0x7ff783627a98>,
>  <pyarrow._dataset.ParquetFileFragment at 0x7ff783581b70>,
>  <pyarrow._dataset.ParquetFileFragment at 0x7ff7835fb780>]
> 
> While filtering with "isin" on partition columns is working fine. I opened 
> https://issues.apache.org/jira/browse/ARROW-10091 to track this as a possible 
> enhancement. 
> Now, to explain why for partitions this is an "easier" case: the partition 
> information gets translated into an equality expression, with your example an 
> expression like "name == 'a' ", while the statistics give a bigger/lesser 
> than expression, such as "(name > 'a') & (name < 'a')" (from the min/max). So 
> for the equality it is more trivial to compare this with an "isin" expression 
> like "name in ['a', 'b']" (for the min/max expression, we would need to check 
> the special case where min/max is equal).
> 
> Joris
> 
>> On Fri, 25 Sep 2020 at 14:06, Joris Van den Bossche 
>> <[email protected]> wrote:
>> Hi Josh,
>> 
>> Thanks for the question!
>> 
>> In general, filtering on partition columns will be faster than filtering on 
>> actual data columns using row group statistics. For partition-based 
>> filtering, the scanner can skip full files based on the information from the 
>> file path (in your example case, there are 11 files, for which it can select 
>> 4 of them to actually read), while for row-group-based filtering, it 
>> actually needs to parse the statistics of all row groups of all files to 
>> determine which can be skipped, which is typically more information to 
>> process compared to the file paths. 
>> 
>> That said, there are some oddities I noticed:
>> 
>> - As I mentioned, I expect partition-based filtering to faster, but not that 
>> much faster (certainly in a case with a limited number of files, the 
>> overhead of the parsing / filtering row groups should be really minimal)
>> - Inspecting the result a bit, it seems that for the first dataset (without 
>> partitioning) it's not actually applying the filter correctly. The min/max 
>> for the name column are included in the row group statistics, but the isin 
>> filter didn't actually filter them out. Something to investigate, but that 
>> certainly explains the difference in performance (it's actually reading all 
>> data, and only filtering after reading, not skipping some parts before 
>> reading)
>> - In your case, the partitioning has the same name as one of the actual 
>> columns in the data files. I am not sure this corner case of duplicate 
>> fields is tested very well, or how the filtering will work?
>> 
>> Joris
>> 
>>> On Thu, 24 Sep 2020 at 21:02, Josh Mayer <[email protected]> wrote:
>>> I am comparing two datasets with a filter on a string column (that is also 
>>> a partition column). I create the dataset from a common metadata file. In 
>>> the first case I omit the partitioning information whereas in the second I 
>>> include it. I would expect the performance to be similar since the column 
>>> statistics should be able to identify the same row groups as the 
>>> partitioning. However, I'm seeing the first case run almost 3x slower. Is 
>>> this expected?
>>> 
>>> An example is here (I'm running on linux, python 3.8, pyarrow 1.0.1):
>>> 
>>> https://gist.github.com/josham/5d7cf52f9ea60b1b2bbef1e768ea992f

Re: [Python] dataset filter performance and partitioning

Reply via email to