Please disregard. As I said, it was dumb. :) I spent a bit of time familiarizing myself with the code and it makes much more sense now.
> On Sep 25, 2020, at 10:56, Troy Zimmerman <[email protected]> wrote: > > > Hi Joris, > > I have a dumb question — if the “isin” expression is returning all row > groups, why does it appear to still work? > > For example ole, I created a similar toy setup, and while all row groups seem > to “match” the expression (i.e. I get all fragments with the expression), the > table that “to_table” returns has only the rows I expect. Does the filtering > happen again somewhere upstream? > > Best, > Troy > >>> On Sep 25, 2020, at 07:35, Joris Van den Bossche >>> <[email protected]> wrote: >>> >> >> Using a small toy example, the "isin" filter is indeed not working for >> filtering row groups: >> >> >>> table = pa.table({"name": np.repeat(["a", "b", "c", "d"], 5), "value": >> >>> np.arange(20)}) >> >>> pq.write_table(table, "test_filter_string.parquet", row_group_size=5) >> >>> dataset = ds.dataset("test_filter_string.parquet") >> # get the single file fragment (dataset consists of one file) >> >>> fragment = list(dataset.get_fragments())[0] >> >>> fragment.ensure_complete_metadata() >> >> # check that we do have statistics for our row groups >> >>> fragment.row_groups[0].statistics >> {'name': {'min': 'a', 'max': 'a'}, 'value': {'min': 0, 'max': 4}} >> >> # I created the file such that there are 4 row groups (each with a unique >> value in the name column) >> >>> fragment.split_by_row_group() >> [<pyarrow._dataset.ParquetFileFragment at 0x7ff783939810>, >> <pyarrow._dataset.ParquetFileFragment at 0x7ff783728cd8>, >> <pyarrow._dataset.ParquetFileFragment at 0x7ff78376c9c0>, >> <pyarrow._dataset.ParquetFileFragment at 0x7ff7835efd68>] >> >> # simple equality filter works as expected -> only single row group left >> >>> filter = ds.field("name") == "a" >> >>> fragment.split_by_row_group(filter) >> [<pyarrow._dataset.ParquetFileFragment at 0x7ff783662738>] >> >> # isin filter does not work >> >>> filter = ds.field("name").isin(["a", "b"]) >> >>> fragment.split_by_row_group(filter) >> [<pyarrow._dataset.ParquetFileFragment at 0x7ff7837f46f0>, >> <pyarrow._dataset.ParquetFileFragment at 0x7ff783627a98>, >> <pyarrow._dataset.ParquetFileFragment at 0x7ff783581b70>, >> <pyarrow._dataset.ParquetFileFragment at 0x7ff7835fb780>] >> >> While filtering with "isin" on partition columns is working fine. I opened >> https://issues.apache.org/jira/browse/ARROW-10091 to track this as a >> possible enhancement. >> Now, to explain why for partitions this is an "easier" case: the partition >> information gets translated into an equality expression, with your example >> an expression like "name == 'a' ", while the statistics give a bigger/lesser >> than expression, such as "(name > 'a') & (name < 'a')" (from the min/max). >> So for the equality it is more trivial to compare this with an "isin" >> expression like "name in ['a', 'b']" (for the min/max expression, we would >> need to check the special case where min/max is equal). >> >> Joris >> >>> On Fri, 25 Sep 2020 at 14:06, Joris Van den Bossche >>> <[email protected]> wrote: >>> Hi Josh, >>> >>> Thanks for the question! >>> >>> In general, filtering on partition columns will be faster than filtering on >>> actual data columns using row group statistics. For partition-based >>> filtering, the scanner can skip full files based on the information from >>> the file path (in your example case, there are 11 files, for which it can >>> select 4 of them to actually read), while for row-group-based filtering, it >>> actually needs to parse the statistics of all row groups of all files to >>> determine which can be skipped, which is typically more information to >>> process compared to the file paths. >>> >>> That said, there are some oddities I noticed: >>> >>> - As I mentioned, I expect partition-based filtering to faster, but not >>> that much faster (certainly in a case with a limited number of files, the >>> overhead of the parsing / filtering row groups should be really minimal) >>> - Inspecting the result a bit, it seems that for the first dataset (without >>> partitioning) it's not actually applying the filter correctly. The min/max >>> for the name column are included in the row group statistics, but the isin >>> filter didn't actually filter them out. Something to investigate, but that >>> certainly explains the difference in performance (it's actually reading all >>> data, and only filtering after reading, not skipping some parts before >>> reading) >>> - In your case, the partitioning has the same name as one of the actual >>> columns in the data files. I am not sure this corner case of duplicate >>> fields is tested very well, or how the filtering will work? >>> >>> Joris >>> >>>> On Thu, 24 Sep 2020 at 21:02, Josh Mayer <[email protected]> wrote: >>>> I am comparing two datasets with a filter on a string column (that is also >>>> a partition column). I create the dataset from a common metadata file. In >>>> the first case I omit the partitioning information whereas in the second I >>>> include it. I would expect the performance to be similar since the column >>>> statistics should be able to identify the same row groups as the >>>> partitioning. However, I'm seeing the first case run almost 3x slower. Is >>>> this expected? >>>> >>>> An example is here (I'm running on linux, python 3.8, pyarrow 1.0.1): >>>> >>>> https://gist.github.com/josham/5d7cf52f9ea60b1b2bbef1e768ea992f
