Re: [Python] dataset filter performance and partitioning

Troy Zimmerman Fri, 25 Sep 2020 09:30:43 -0700

Please disregard. As I said, it was dumb. :) I spent a bit of time 
familiarizing myself with the code and it makes much more sense now.


> On Sep 25, 2020, at 10:56, Troy Zimmerman <[email protected]> wrote:
> 
> 
> Hi Joris,
> 
> I have a dumb question — if the “isin” expression is returning all row 
> groups, why does it appear to still work?
> 
> For example ole, I created a similar toy setup, and while all row groups seem 
> to “match” the expression (i.e. I get all fragments with the expression), the 
> table that “to_table” returns has only the rows I expect. Does the filtering 
> happen again somewhere upstream?
> 
> Best,
> Troy
> 
>>> On Sep 25, 2020, at 07:35, Joris Van den Bossche 
>>> <[email protected]> wrote:
>>> 
>> 
>> Using a small toy example, the "isin" filter is indeed not working for 
>> filtering row groups:
>> 
>> >>> table = pa.table({"name": np.repeat(["a", "b", "c", "d"], 5), "value": 
>> >>> np.arange(20)})
>> >>> pq.write_table(table, "test_filter_string.parquet", row_group_size=5)
>> >>> dataset = ds.dataset("test_filter_string.parquet")
>> # get the single file fragment (dataset consists of one file)
>> >>> fragment = list(dataset.get_fragments())[0]
>> >>> fragment.ensure_complete_metadata()
>> 
>> # check that we do have statistics for our row groups
>> >>> fragment.row_groups[0].statistics
>> {'name': {'min': 'a', 'max': 'a'}, 'value': {'min': 0, 'max': 4}}
>> 
>> # I created the file such that there are 4 row groups (each with a unique 
>> value in the name column)
>> >>> fragment.split_by_row_group()
>> [<pyarrow._dataset.ParquetFileFragment at 0x7ff783939810>,
>>  <pyarrow._dataset.ParquetFileFragment at 0x7ff783728cd8>,
>>  <pyarrow._dataset.ParquetFileFragment at 0x7ff78376c9c0>,
>>  <pyarrow._dataset.ParquetFileFragment at 0x7ff7835efd68>]
>> 
>> # simple equality filter works as expected -> only single row group left
>> >>> filter = ds.field("name") == "a"
>> >>> fragment.split_by_row_group(filter)
>> [<pyarrow._dataset.ParquetFileFragment at 0x7ff783662738>]
>> 
>> # isin filter does not work
>> >>> filter = ds.field("name").isin(["a", "b"])
>> >>> fragment.split_by_row_group(filter)
>> [<pyarrow._dataset.ParquetFileFragment at 0x7ff7837f46f0>,
>>  <pyarrow._dataset.ParquetFileFragment at 0x7ff783627a98>,
>>  <pyarrow._dataset.ParquetFileFragment at 0x7ff783581b70>,
>>  <pyarrow._dataset.ParquetFileFragment at 0x7ff7835fb780>]
>> 
>> While filtering with "isin" on partition columns is working fine. I opened 
>> https://issues.apache.org/jira/browse/ARROW-10091 to track this as a 
>> possible enhancement. 
>> Now, to explain why for partitions this is an "easier" case: the partition 
>> information gets translated into an equality expression, with your example 
>> an expression like "name == 'a' ", while the statistics give a bigger/lesser 
>> than expression, such as "(name > 'a') & (name < 'a')" (from the min/max). 
>> So for the equality it is more trivial to compare this with an "isin" 
>> expression like "name in ['a', 'b']" (for the min/max expression, we would 
>> need to check the special case where min/max is equal).
>> 
>> Joris
>> 
>>> On Fri, 25 Sep 2020 at 14:06, Joris Van den Bossche 
>>> <[email protected]> wrote:
>>> Hi Josh,
>>> 
>>> Thanks for the question!
>>> 
>>> In general, filtering on partition columns will be faster than filtering on 
>>> actual data columns using row group statistics. For partition-based 
>>> filtering, the scanner can skip full files based on the information from 
>>> the file path (in your example case, there are 11 files, for which it can 
>>> select 4 of them to actually read), while for row-group-based filtering, it 
>>> actually needs to parse the statistics of all row groups of all files to 
>>> determine which can be skipped, which is typically more information to 
>>> process compared to the file paths. 
>>> 
>>> That said, there are some oddities I noticed:
>>> 
>>> - As I mentioned, I expect partition-based filtering to faster, but not 
>>> that much faster (certainly in a case with a limited number of files, the 
>>> overhead of the parsing / filtering row groups should be really minimal)
>>> - Inspecting the result a bit, it seems that for the first dataset (without 
>>> partitioning) it's not actually applying the filter correctly. The min/max 
>>> for the name column are included in the row group statistics, but the isin 
>>> filter didn't actually filter them out. Something to investigate, but that 
>>> certainly explains the difference in performance (it's actually reading all 
>>> data, and only filtering after reading, not skipping some parts before 
>>> reading)
>>> - In your case, the partitioning has the same name as one of the actual 
>>> columns in the data files. I am not sure this corner case of duplicate 
>>> fields is tested very well, or how the filtering will work?
>>> 
>>> Joris
>>> 
>>>> On Thu, 24 Sep 2020 at 21:02, Josh Mayer <[email protected]> wrote:
>>>> I am comparing two datasets with a filter on a string column (that is also 
>>>> a partition column). I create the dataset from a common metadata file. In 
>>>> the first case I omit the partitioning information whereas in the second I 
>>>> include it. I would expect the performance to be similar since the column 
>>>> statistics should be able to identify the same row groups as the 
>>>> partitioning. However, I'm seeing the first case run almost 3x slower. Is 
>>>> this expected?
>>>> 
>>>> An example is here (I'm running on linux, python 3.8, pyarrow 1.0.1):
>>>> 
>>>> https://gist.github.com/josham/5d7cf52f9ea60b1b2bbef1e768ea992f

Re: [Python] dataset filter performance and partitioning

Reply via email to