Data-source/partition pruning with remotely stored non-parquet files

Lokendra Singh Panwar Tue, 19 Feb 2019 16:18:04 -0800

Hi All,

I am writing a custom storage plugin to read and query non-static json
files stored on remote services and wanted to use something similar to
Drill's partition pruning to optimise my queries.


The files are looked up dynamically within the plugin via an external
service, based on the table-id and, optionally, on 'age', one of the
attributes in the json files. IOW, the lookup service API resembles:
List<FileLocations> getDataSources (String tableId)
List<FileLocations> getDataSources (String tableId, long ageStart, long ageEnd)

So, a query like SELECT * FROM pluginName.tableId WHERE age > 10 AND age <
20 could be optimised to scan only the files matching that age range,
rather than all the data-sources for all ages.
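
A minimal sketch of the pruning described above, under stated assumptions:
file locations are plain strings rather than the real FileLocations type,
the service treats ageStart and ageEnd as inclusive, and the filter bounds
have already been extracted from the query. LookupService, resolveFiles and
FakeService are hypothetical names for illustration and are not part of
Drill.

```java
import java.util.ArrayList;
import java.util.List;

public class AgePruningSketch {

    // Hypothetical stand-in for the external lookup service; file locations
    // are plain strings here instead of the real FileLocations type.
    interface LookupService {
        List<String> getDataSources(String tableId);
        List<String> getDataSources(String tableId, long ageStart, long ageEnd);
    }

    // Picks the narrower lookup when the filter bounds 'age' on both sides.
    // lowerExclusive/upperExclusive come from "age > lower AND age < upper";
    // null means that side is unbounded, in which case we fall back to the
    // unpruned lookup. ASSUMPTION: the service treats ageStart and ageEnd
    // as inclusive, so the exclusive bounds are shifted by one.
    static List<String> resolveFiles(LookupService service, String tableId,
                                     Long lowerExclusive, Long upperExclusive) {
        if (lowerExclusive == null || upperExclusive == null) {
            return service.getDataSources(tableId);
        }
        return service.getDataSources(tableId, lowerExclusive + 1, upperExclusive - 1);
    }

    // Toy in-memory service for the demo: one file per age from 0 to 29.
    static final class FakeService implements LookupService {
        public List<String> getDataSources(String tableId) {
            return getDataSources(tableId, 0, 29);
        }

        public List<String> getDataSources(String tableId, long ageStart, long ageEnd) {
            List<String> files = new ArrayList<>();
            for (long age = Math.max(0, ageStart); age <= Math.min(29, ageEnd); age++) {
                files.add(tableId + "/age-" + age + ".json");
            }
            return files;
        }
    }

    public static void main(String[] args) {
        LookupService service = new FakeService();
        // WHERE age > 10 AND age < 20: only ages 11..19 are looked up.
        System.out.println(resolveFiles(service, "tableId", 10L, 20L).size());
        // No usable filter on age: every file is looked up.
        System.out.println(resolveFiles(service, "tableId", null, null).size());
    }
}
```

With the toy service, the filtered call resolves 9 files and the unfiltered
call resolves 30. The sketch does not show how the bounds would reach the
plugin from Drill's planner, which is the open question here.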

From my understanding of Drill's documentation so far, this would be
hard to do because:
a) Since the remote json files are non-static, meaning they are continually
changed by the external service, generating static Parquet files and using
Parquet metadata for pruning is not going to help, or the files would need
to be regenerated for every query. (Also, CTAS operations are not allowed
on my system.)
b) Drill's pushdown capability is apparently also limited to select
subqueries of the form 'SELECT col FROM (SELECT * FROM tableid)'. So,
it would not be applicable to generic SELECT queries.

I just wanted to confirm that my understanding is correct and that I have
not overlooked some aspect of Drill that enables this type of pruning.

Thanks,
Lokendra
