Data-source/partition pruning with remotely stored non-parquet files

2019-02-19 Thread Lokendra Singh Panwar
Hi All,

I am writing a custom storage plugin to read and query non-static json
files stored on remote services and wanted to use something similar to
Drill's partition pruning to optimise my queries.

The files are looked up dynamically within the plugin via an external
service, based on the table-id and, optionally, one of the attributes in
the json files, 'age'. IOW, the lookup service API resembles:
List getDataSources(String tableId)
List getDataSources(String tableId, long ageStart, long ageEnd)

So, a query like SELECT * FROM pluginName.tableId WHERE age > 10 AND age <
20 could be optimised to scan only a limited set of files rather than all
the data sources across all ages.

From my understanding of Drill's documentation so far, this would be hard
to do because:
a) Since the remote json files are non-static, i.e. they are continually
modified by the external service, generating static Parquet files and using
Parquet metadata for pruning is not going to help, or the metadata would
have to be regenerated for every query. (Also, CTAS operations are not
allowed on my system.)
b) Drill's pushdown capability also appears to be limited to
'SELECT col FROM (SELECT * FROM tableid)' types of subqueries, so it would
not apply to generic SELECT queries.

I just wanted to confirm that my understanding is correct and that I have
not overlooked some aspect of Drill that enables this type of pruning.

Thanks,
Lokendra


Some queries related to writing a custom storage-plugin

2019-01-31 Thread Lokendra Singh Panwar
Hi All,

I am relatively new to Drill and trying to write a custom storage plugin.

I have a couple of (naive-sounding) queries, so I mostly need some brief
pointers:

a) Why does a StoragePlugin have to implement registerSchemas() (coming
from SchemaFactory)? I assumed that Drill would discover the data schema
on the fly, so there should be no need for the plugin to register it
beforehand.

(I created a version of my plugin that skipped implementing the
registerSchemas() method, assuming the schema would be discovered, and
tried a "SELECT * FROM myplugin.`tableid`", which threw a
"VALIDATION_ERROR: Schema [[myplugin]] is not valid with respect to either
root schema or current default schema" --> So I suspect that might be due
to my not implementing registerSchemas(), hence the question.)
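That suspicion matches how validation works: Drill's SQL validator resolves `myplugin.tableid` against a schema tree that each plugin must populate up front via registerSchemas(); only the table's *data* (its row types) can be discovered on the fly, not the namespace itself. A minimal, self-contained sketch of that idea, using a stand-in SchemaTree rather than the real SchemaConfig/SchemaPlus types:

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of why registerSchemas() is needed: the validator can
// only resolve schema.table names that were mounted into the schema tree.
public class SchemaRegistrationSketch {

    // Stand-in for the schema tree Drill validates queries against.
    public static final class SchemaTree {
        private final Map<String, Set<String>> schemas = new HashMap<>();

        public void addSchema(String name) {
            schemas.putIfAbsent(name, new HashSet<>());
        }

        public void addTable(String schema, String table) {
            schemas.get(schema).add(table);
        }

        // Mirrors the validation step that failed with VALIDATION_ERROR:
        // an unregistered schema simply does not resolve.
        public boolean resolves(String schema, String table) {
            return schemas.containsKey(schema) && schemas.get(schema).contains(table);
        }
    }

    // What a plugin's registerSchemas() conceptually does: mount a schema
    // named after the plugin under the root, so `myplugin.tableid` validates.
    public static void registerSchemas(SchemaTree root,
                                       String pluginName,
                                       Collection<String> tableIds) {
        root.addSchema(pluginName);
        for (String t : tableIds) {
            root.addTable(pluginName, t);
        }
    }
}
```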

b) Similarly, I see plugins creating their own table class extending either
DrillTable or DynamicDrillTable and then overriding the
RelDataType getRowType(RelDataTypeFactory typeFactory) method, which seems
to convert the relation's items to Drill types. But I see similar type
conversion also being done in the RecordReader classes when creating and
loading the value vectors. Am I reading it right that we are doing this
twice?
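One way to think about the apparent duplication: the two conversions happen at different phases and answer different questions. At planning time, getRowType() only has to give the validator/optimizer a row shape, which for a schema-less source can be a dynamic row type where columns are effectively ANY (roughly what DynamicDrillTable provides); at execution time, the RecordReader assigns concrete value-vector types from the data it actually reads. A simplified sketch of that split, using a stand-in enum rather than Drill's real type system:

```java
// Illustrative sketch of the two typing phases from question b).
// DrillType and both methods are simplified stand-ins, not Drill's API.
public class TwoPhaseTypingSketch {

    // Stand-in for a few of Drill's minor types.
    public enum DrillType { ANY, BIGINT, FLOAT8, VARCHAR }

    // Planning-time answer, analogous to getRowType() on a dynamic table:
    // without reading the data, every column is reported as ANY.
    public static DrillType planningType(String columnName) {
        return DrillType.ANY;
    }

    // Execution-time answer, analogous to a RecordReader choosing a
    // concrete value-vector type from the json value it actually sees.
    public static DrillType executionType(Object jsonValue) {
        if (jsonValue instanceof Long || jsonValue instanceof Integer) {
            return DrillType.BIGINT;
        }
        if (jsonValue instanceof Double) {
            return DrillType.FLOAT8;
        }
        return DrillType.VARCHAR;
    }
}
```

So the work is not redundant so much as staged: a coarse, data-free answer for the planner, then a precise one for the vectors.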

Any pointers will be greatly appreciated.

Thanks,
Lokendra