Coming back to this older thread, specifically on the topic of "duplicated"
information as both partition field and actual column in the data:
On Fri, 25 Sep 2020 at 14:43, Robin Kåveland Hansen
wrote:
> Hi,
>
> Just thought I'd chime in on this point:
>
> > - In your case, the partitioning has
Hi Matt,
This is because `arrow::compute::IsIn` is being called on each batch
materialized by the datasets API so the internal kernel state is being
set up and torn down for every batch. This means that a large hash
table is being set up and torn down many times rather than only
created once as wi
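
The cost difference described in the message above can be illustrated without Arrow at all. What follows is a minimal pure-Python sketch, not the `arrow::compute::IsIn` implementation: the names `build_lookup`, `isin_per_batch` and `isin_shared` are hypothetical, and a Python `frozenset` stands in for the kernel's hash table.

```python
from typing import Iterable, List, Sequence

build_count = 0  # how many times the lookup structure has been built


def build_lookup(values: Iterable[int]) -> frozenset:
    """Stand-in for setting up the kernel's hash table (hypothetical helper)."""
    global build_count
    build_count += 1
    return frozenset(values)


def isin_per_batch(batches: Sequence[Sequence[int]],
                   value_set: Sequence[int]) -> List[List[bool]]:
    """Rebuild the lookup for every batch, then discard it."""
    result = []
    for batch in batches:
        lookup = build_lookup(value_set)  # set up again for each batch
        result.append([v in lookup for v in batch])
    return result


def isin_shared(batches: Sequence[Sequence[int]],
                value_set: Sequence[int]) -> List[List[bool]]:
    """Build the lookup once and reuse it for all batches."""
    lookup = build_lookup(value_set)
    return [[v in lookup for v in batch] for batch in batches]


batches = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
value_set = [1, 4, 100]

build_count = 0
per_batch_mask = isin_per_batch(batches, value_set)
per_batch_builds = build_count  # one build per batch

build_count = 0
shared_mask = isin_shared(batches, value_set)
shared_builds = build_count  # a single build in total
```

Both functions return the same masks; only the number of builds differs. With a value set in the millions, each of those builds is expensive, so the per-batch variant pays that cost once per batch instead of once overall.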
Please disregard. As I said, it was dumb. :) I spent a bit of time
familiarizing myself with the code and it makes much more sense now.
> On Sep 25, 2020, at 10:56, Troy Zimmerman wrote:
>
>
> Hi Joris,
>
> I have a dumb question — if the “isin” expression is returning all row
> groups, why
Hi Joris,
I have a dumb question — if the “isin” expression is returning all row groups,
why does it appear to still work?
For example, I created a similar toy setup, and while all row groups seem
to “match” the expression (i.e. I get all fragments with the expression), the
table that “to_
I don't want to cloud this discussion, but I do want to mention that when I
tested isin filtering with a high cardinality (on the order of many
millions) set on a parquet dataset with a single rowgroup (so filtering
should have to happen after data load), it performed much worse in terms of
runtime
Thanks Joris, that info is very helpful. A few follow up questions, you
mention that:
> ... it actually needs to parse the statistics of all row groups of all
files to determine which can be skipped ...
Is that something that is only done once (and perhaps stored inside a
dataset object in some o
Hi,
Just thought I'd chime in on this point:
> - In your case, the partitioning has the same name as one of the actual
columns in the data files. I am not sure this corner case of duplicate
fields is tested very well, or how the filtering will work?
I _think_ this is the default behaviour for py
Using a small toy example, the "isin" filter is indeed not working for
filtering row groups:
>>> table = pa.table({"name": np.repeat(["a", "b", "c", "d"], 5), "value": np.arange(20)})
>>> pq.write_table(table, "test_filter_string.parquet", row_group_size=5)
>>> dataset = ds.dataset("test_filter_st
Hi Josh,
Thanks for the question!
In general, filtering on partition columns will be faster than filtering on
actual data columns using row group statistics. For partition-based
filtering, the scanner can skip full files based on the information from
the file path (in your example case, there are
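
The distinction drawn in the message above, skipping whole files from the path versus inspecting row group statistics, can be sketched conceptually in plain Python. This is an illustration under stated assumptions, not how the datasets API is implemented: the file layout, the `(min, max)` statistics and the function names `prune_by_partition` and `prune_by_statistics` are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical layout: hive-style path -> (min, max) statistics per row group
files: Dict[str, List[Tuple[str, str]]] = {
    "name=a/part-0.parquet": [("a", "a"), ("a", "a")],
    "name=b/part-0.parquet": [("b", "b"), ("b", "b")],
    "name=c/part-0.parquet": [("c", "c"), ("c", "c")],
}


def prune_by_partition(files: Dict[str, List[Tuple[str, str]]],
                       column: str, wanted: Set[str]) -> List[str]:
    """Keep files whose path encodes a wanted value; no statistics are read."""
    kept = []
    for path in files:
        key, _, value = path.split("/")[0].partition("=")
        if key == column and value in wanted:
            kept.append(path)
    return kept


def prune_by_statistics(files: Dict[str, List[Tuple[str, str]]],
                        wanted: Set[str]) -> Tuple[List[Tuple[str, int]], int]:
    """Keep row groups whose [min, max] range could contain a wanted value."""
    kept, inspected = [], 0
    for path, row_groups in files.items():
        for index, (lo, hi) in enumerate(row_groups):
            inspected += 1  # every row group's statistics are looked at
            if any(lo <= v <= hi for v in wanted):
                kept.append((path, index))
    return kept, inspected


wanted = {"a"}
kept_files = prune_by_partition(files, "name", wanted)
kept_row_groups, stats_inspected = prune_by_statistics(files, wanted)
```

In this sketch the partition-based path decides from three path strings alone, while the statistics-based path has to look at all six row groups before it can discard four of them.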
I am comparing two datasets with a filter on a string column (that is also
a partition column). I create the dataset from a common metadata file. In
the first case I omit the partitioning information whereas in the second I
include it. I would expect the performance to be similar since the column
s