efficient filtering on a dataframe

Koert Kuipers Tue, 06 Dec 2016 11:06:20 -0800

i have a dataframe on which i need to run many queries that start with a
filter on a column x.


currently i write the dataframe out to a parquet datasource partitioned by
column x, after which i repeatedly read the datasource back in from parquet.
the queries are efficient because the filter gets pushed into the
datasource, which filters out directories, so only a subset of the data
gets read for every query.
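
to make the directory-pruning idea concrete, here is a minimal sketch in plain python (standard library only). it is not spark's implementation and uses no spark api: the helper names write_partitioned and read_filtered, the csv file format and the sample rows are all assumptions made for illustration. it only mimics the layout of one directory per value of x, so a filter on x can skip every other directory.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def write_partitioned(rows, root, key):
    """Write rows (dicts) under root/<key>=<value>/part.csv, one directory per key value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    for value, group in groups.items():
        part_dir = Path(root) / f"{key}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        # The partition column is encoded in the directory name, not in the file.
        fields = [f for f in group[0] if f != key]
        with open(part_dir / "part.csv", "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=fields)
            writer.writeheader()
            for row in group:
                writer.writerow({f: row[f] for f in fields})


def read_filtered(root, key, wanted):
    """Read only the directory for key == wanted; return (rows, files_opened)."""
    rows, files_opened = [], 0
    part_dir = Path(root) / f"{key}={wanted}"
    if part_dir.is_dir():
        for path in sorted(part_dir.glob("*.csv")):
            files_opened += 1
            with open(path, newline="") as fh:
                for row in csv.DictReader(fh):
                    row[key] = wanted  # restore the partition column from the path
                    rows.append(row)
    return rows, files_opened


# Hypothetical sample data; "x" plays the role of the filter column.
data = [
    {"x": "a", "y": "1"},
    {"x": "b", "y": "2"},
    {"x": "a", "y": "3"},
    {"x": "c", "y": "4"},
]

root = tempfile.mkdtemp()
write_partitioned(data, root, "x")
result, files_opened = read_filtered(root, "x", "a")
```

the filter never opens the directories for the other values of x, which is the effect described above: the cost of a query depends on the size of the matching partition rather than on the size of the whole dataset.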

how can i achieve the same efficiency without going to a datasource and back?
the round trip feels artificial and unnecessary.
