i have a dataframe on which i need to run many queries that start with a
filter on a column x.

currently i write the dataframe out to a parquet datasource partitioned by
field x, after which i repeatedly read the datasource back in from parquet.
the queries are efficient because the filter gets pushed into the
datasource, which filters out directories, so only a subset of the data
gets read for every query.
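
to make the mechanism concrete, here is a rough sketch in plain python of what i understand the directory pruning to be doing. this is not spark code: the helper names (write_partitioned, read_filtered), the csv files and the part-0.csv file name are all made up for illustration, and the only thing taken from my actual setup is the key=value directory layout.

```python
import csv
import tempfile
from pathlib import Path


def write_partitioned(rows, root, key):
    """Write rows into one directory per distinct value of `key` (key=value layout)."""
    by_value = {}
    for row in rows:
        by_value.setdefault(row[key], []).append(row)
    for value, group in by_value.items():
        part_dir = Path(root) / f"{key}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        # the partition column is encoded in the directory name, not in the file
        fields = [f for f in group[0] if f != key]
        with open(part_dir / "part-0.csv", "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=fields)
            writer.writeheader()
            for row in group:
                writer.writerow({f: row[f] for f in fields})


def read_filtered(root, key, value):
    """Read only the directory matching key == value; return (rows, files_opened)."""
    part_dir = Path(root) / f"{key}={value}"
    rows, files_opened = [], 0
    if not part_dir.is_dir():
        return rows, files_opened
    for path in sorted(part_dir.glob("*.csv")):
        files_opened += 1
        with open(path, newline="") as fh:
            for rec in csv.DictReader(fh):
                rec[key] = value  # restore the partition column from the path
                rows.append(rec)
    return rows, files_opened


if __name__ == "__main__":
    data = [{"x": "a", "v": "1"}, {"x": "b", "v": "2"}, {"x": "a", "v": "3"}]
    with tempfile.TemporaryDirectory() as root:
        write_partitioned(data, root, "x")
        rows, opened = read_filtered(root, "x", "a")
        print(rows, opened)
```

the point of the sketch is that a filter on x turns into a path lookup, so files under the other x=... directories are never opened.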

how can i achieve the same efficiency without going to datasource and back?
the round trip feels artificial and unnecessary.
