Re: [Python] dataset filter performance and partitioning

2020-10-19 Thread Joris Van den Bossche
Coming back to this older thread, specifically on the topic of "duplicated" information as both partition field and actual column in the data: On Fri, 25 Sep 2020 at 14:43, Robin Kåveland Hansen wrote: > Hi, > > Just thought I'd chime in on this point: > > > - In your case, the partitioning has…
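
As a point of contrast with the duplicated-field situation discussed here, a minimal sketch (paths and column names are illustrative, not from the thread) of how pyarrow's own pq.write_to_dataset sidesteps the duplication by dropping partition columns from the written files:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"name": ["a"] * 5 + ["b"] * 5, "value": list(range(10))})
    pq.write_to_dataset(table, "no_dup", partition_cols=["name"])

    # write_to_dataset removes the partition column from the files, so
    # "name" lives only in the directory names ("name=a/", "name=b/")
    # and the same field never exists in both places.
    print(pq.read_table("no_dup/name=a").column_names)  # ['value']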

Re: [Python] dataset filter performance and partitioning

2020-09-25 Thread Wes McKinney
Hi Matt, This is because `arrow::compute::IsIn` is being called on each batch materialized by the datasets API, so the internal kernel state is being set up and torn down for every batch. This means that a large hash table is being set up and torn down many times rather than only created once as wi…
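
A sketch of the workaround this explanation implies (toy data standing in for the real dataset; this is not code from the thread): materialize first, then evaluate is_in once over the whole column, so the value-set hash table is built a single time:

    import pyarrow as pa
    import pyarrow.compute as pc
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    pq.write_table(pa.table({"id": list(range(100))}), "ids.parquet")
    dataset = ds.dataset("ids.parquet")
    value_set = pa.array(range(0, 100, 7))  # imagine millions of values

    # Pushed into the scan: the is_in hash table is rebuilt for every
    # record batch the scanner materializes.
    slow = dataset.to_table(filter=ds.field("id").isin(value_set))

    # Materialize once, then run is_in a single time over the column,
    # building the hash table only once.
    table = dataset.to_table()
    fast = table.filter(pc.is_in(table["id"], value_set=value_set))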

Re: [Python] dataset filter performance and partitioning

2020-09-25 Thread Troy Zimmerman
Please disregard. As I said, it was dumb. :) I spent a bit of time familiarizing myself with the code and it makes much more sense now. > On Sep 25, 2020, at 10:56, Troy Zimmerman wrote: > > Hi Joris, > > I have a dumb question — if the “isin” expression is returning all row > groups, why…

Re: [Python] dataset filter performance and partitioning

2020-09-25 Thread Troy Zimmerman
Hi Joris, I have a dumb question — if the “isin” expression is returning all row groups, why does it appear to still work? For example, I created a similar toy setup, and while all row groups seem to “match” the expression (i.e. I get all fragments with the expression), the table that “to_table”…
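
The answer is visible in a small sketch (same shape as the toy setup that appears further down the thread): the scanner still evaluates the filter on every row it reads, so row-group pruning only affects speed, never correctness:

    import numpy as np
    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    table = pa.table({"name": np.repeat(["a", "b", "c", "d"], 5),
                      "value": np.arange(20)})
    pq.write_table(table, "test_filter_string.parquet", row_group_size=5)
    dataset = ds.dataset("test_filter_string.parquet")

    # Even when no row groups are skipped, the filter is still applied
    # to the rows that are read, so the result is correctly filtered.
    result = dataset.to_table(filter=ds.field("name").isin(["a", "c"]))
    print(result.num_rows)  # 10, not 20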

Re: [Python] dataset filter performance and partitioning

2020-09-25 Thread Matthew Corley
I don't want to cloud this discussion, but I do want to mention that when I tested isin filtering with a high-cardinality set (on the order of many millions of values) on a parquet dataset with a single row group (so filtering has to happen after the data is loaded), it performed much worse in terms of runtime…
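
A rough way to reproduce that comparison (sizes and paths are made up; the thread does not include this code):

    import time
    import numpy as np
    import pyarrow as pa
    import pyarrow.compute as pc
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    # One row group, so nothing can be skipped via statistics.
    table = pa.table({"key": np.arange(2_000_000)})
    pq.write_table(table, "one_rowgroup.parquet", row_group_size=len(table))
    dataset = ds.dataset("one_rowgroup.parquet")
    big_set = pa.array(np.arange(0, 2_000_000, 2))  # high-cardinality set

    t0 = time.time()
    pushed = dataset.to_table(filter=ds.field("key").isin(big_set))
    print("isin pushed into the scan:", time.time() - t0)

    t0 = time.time()
    loaded = dataset.to_table()
    manual = loaded.filter(pc.is_in(loaded["key"], value_set=big_set))
    print("load, then is_in once:", time.time() - t0)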

Re: [Python] dataset filter performance and partitioning

2020-09-25 Thread Josh Mayer
Thanks Joris, that info is very helpful. A few follow-up questions; you mention that: > ... it actually needs to parse the statistics of all row groups of all files to determine which can be skipped ... Is that something that is only done once (and perhaps stored inside a dataset object in some o…
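
For reference, a sketch of the usage pattern the question is about (the _metadata path is hypothetical): build the dataset object once and reuse it across filtered reads, rather than re-running discovery per query:

    import pyarrow.dataset as ds

    # Discovery (including reading the common metadata file) happens here.
    dataset = ds.parquet_dataset("data/_metadata", partitioning="hive")

    # Reusing the same object avoids repeating the discovery work per
    # query; whether the row-group statistics themselves are parsed once
    # or once per filter is exactly the open question in this message.
    t1 = dataset.to_table(filter=ds.field("name") == "a")
    t2 = dataset.to_table(filter=ds.field("name") == "b")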

Re: [Python] dataset filter performance and partitioning

2020-09-25 Thread Robin Kåveland Hansen
Hi, Just thought I'd chime in on this point: > - In your case, the partitioning has the same name as one of the actual columns in the data files. I am not sure this corner case of duplicate fields is tested very well, or how the filtering will work? I _think_ this is the default behaviour for py…
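
The corner case is easy to construct by hand (names are illustrative; this is not code from the thread): write hive-style directories while also keeping the partition column inside the files:

    import os
    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    # "name" ends up both in the directory names and in the file schema.
    for name in ["a", "b"]:
        os.makedirs(f"dup/name={name}", exist_ok=True)
        part = pa.table({"name": [name] * 5, "value": list(range(5))})
        pq.write_table(part, f"dup/name={name}/part-0.parquet")

    # Discovery now sees the field twice, once from the path and once
    # from the files; how filtering resolves that is the question here.
    dataset = ds.dataset("dup", partitioning="hive")
    print(dataset.schema)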

Re: [Python] dataset filter performance and partitioning

2020-09-25 Thread Joris Van den Bossche
Using a small toy example, the "isin" filter is indeed not working for filtering row groups: >>> table = pa.table({"name": np.repeat(["a", "b", "c", "d"], 5), "value": np.arange(20)}) >>> pq.write_table(table, "test_filter_string.parquet", row_group_size=5) >>> dataset = ds.dataset("test_filter_st…
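
The preview cuts the snippet off; here is a runnable reconstruction (the continuation past the visible lines is an assumption based on the rest of the thread):

    import numpy as np
    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    table = pa.table({"name": np.repeat(["a", "b", "c", "d"], 5),
                      "value": np.arange(20)})
    pq.write_table(table, "test_filter_string.parquet", row_group_size=5)
    dataset = ds.dataset("test_filter_string.parquet")

    # With 5 rows per row group, each "name" value has its own row
    # group, so effective pruning would keep only 1 of the 4 groups.
    fragment = next(dataset.get_fragments())
    kept = list(fragment.split_by_row_group(ds.field("name").isin(["a"])))
    print(len(kept))  # the thread reports all 4 groups are kept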

Re: [Python] dataset filter performance and partitioning

2020-09-25 Thread Joris Van den Bossche
Hi Josh, Thanks for the question! In general, filtering on partition columns will be faster than filtering on actual data columns using row group statistics. For partition-based filtering, the scanner can skip full files based on the information from the file path (in your example case, there are…
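
A small sketch of the difference being described (paths are illustrative): with hive partitioning, a filter on the partition field can discard whole files from their paths alone, before any file is opened:

    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    table = pa.table({"name": ["a"] * 5 + ["b"] * 5, "value": list(range(10))})
    pq.write_to_dataset(table, "partitioned", partition_cols=["name"])
    dataset = ds.dataset("partitioned", partitioning="hive")

    # Selecting fragments needs only the paths ("name=a/...", "name=b/...");
    # no parquet footer or statistics have to be read for this step.
    frags = list(dataset.get_fragments(filter=ds.field("name") == "a"))
    print(len(frags))  # only the files under name=a/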

[Python] dataset filter performance and partitioning

2020-09-24 Thread Josh Mayer
I am comparing two datasets with a filter on a string column (that is also a partition column). I create the dataset from a common metadata file. In the first case I omit the partitioning information, whereas in the second I include it. I would expect the performance to be similar since the column s…
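
The comparison described here, sketched (paths are hypothetical, and this assumes the string column is present both in the file data and in the hive-style paths):

    import pyarrow.dataset as ds

    expr = ds.field("name").isin(["a", "b"])

    # 1. Partitioning information omitted: "name" is only a data column,
    #    so the filter must rely on row-group statistics.
    flat = ds.parquet_dataset("data/_metadata")
    t1 = flat.to_table(filter=expr)

    # 2. Partitioning information included: "name" is also a partition
    #    field, so whole files can be skipped based on their paths.
    part = ds.parquet_dataset("data/_metadata", partitioning="hive")
    t2 = part.to_table(filter=expr)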