I have a DataFrame on which I need to run many queries, each of which starts with a filter on a column x.
Currently I write the DataFrame out to a Parquet datasource partitioned by field x, and then repeatedly read the datasource back in from Parquet. The queries are efficient because the filter gets pushed down into the datasource, which filters out whole directories, so only a subset of the data is read for each query. How can I achieve the same efficiency without going to the datasource and back? The round trip feels artificial and unnecessary.
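
To make the mechanism I'm relying on concrete, here is a minimal plain-Python sketch of directory-based partition pruning. It is not Spark or Parquet code: the helper names `write_partitioned` and `read_filtered`, the JSON-lines file format, and the sample rows are all made up for illustration. It only mimics the `x=<value>/` directory layout, where a filter on x means just one directory is opened.

```python
import json
import tempfile
from pathlib import Path


def write_partitioned(rows, root, key="x"):
    """Append each row to root/<key>=<value>/part.jsonl (hypothetical layout)."""
    root = Path(root)
    for row in rows:
        part_dir = root / f"{key}={row[key]}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "part.jsonl", "a") as f:
            f.write(json.dumps(row) + "\n")


def read_filtered(root, value, key="x"):
    """Read only the directory for key == value; other directories are never opened."""
    part_file = Path(root) / f"{key}={value}" / "part.jsonl"
    if not part_file.exists():
        return []  # no partition for this value, so nothing to read
    with open(part_file) as f:
        return [json.loads(line) for line in f]


# Illustrative data only.
rows = [{"x": 1, "y": "a"}, {"x": 2, "y": "b"}, {"x": 1, "y": "c"}]
root = tempfile.mkdtemp()
write_partitioned(rows, root)

# A query that starts with "x == 1" touches a single directory.
subset = read_filtered(root, 1)
```

What I'm after is this same "touch only the matching subset" behaviour, but without first writing the data out and reading it back.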