Re: [DISCUSS] Schema queries - solutions?

Paul Rogers Mon, 17 Feb 2020 23:48:14 -0800

Hi Igor,

Thanks! I should have remembered that bit of SQL.


Yes, if we can generalize `DESCRIBE`, we could create another path of some kind 
through the plugins that say, "return schema, not data."

Then, for the HDF5 use case we could have:

DESCRIBE TABLE `dfs`.`myFile.hdf5` -- returns schema


And

SELECT * FROM `dfs`.`myFile.hdf5` -- returns data


Nice solution! I'll file a feature request.


The next interesting bit about HDF5 is that it is a file system, it contains 
multiple data sets. Would be great to be able to express that in the FROM 
clause:

SELECT * FROM `dfs`.`myFile.hdf5`.`dataSet1`

>From my random walks though Calcite, it appears that we can have any level of 
>schema/table path. True? We'd need some way to resolve a name part to a file, 
>then ask the format plugin for that file if it supports additional parts. This 
>seems pretty obscure. Have we done anything like that before? Maybe in storage 
>(rather than format) plugin?

Thanks,
- Paul

 

    On Monday, February 17, 2020, 11:34:48 PM PST, Igor Guzenko 
<ihor.huzenko....@gmail.com> wrote:  
 
 Hello Paul,

Seems like we simply need to improve our DESCRIBE [1] table functionality.

[1] https://drill.apache.org/docs/describe/

Thanks,
Igor

On Tue, Feb 18, 2020 at 9:23 AM Paul Rogers <par0...@yahoo.com.invalid>
wrote:

> Hi All,
>
> Charles has a little PR,  #1978, that adds a convenient feature to his
> HDF5 format reader: the ability to query the schema of the file. (It seems
> that HDF5 is a bit like a zip file: it contains a set of files. Unlike zip,
> each file is a data set with a schema.) Charles added a clever way to tell
> the reader that the user wants a schema rather than data.
>
> If we think a bit, we realize that a schema query would be handy for any
> data source. Maybe I want to know the fields in a JSON or Parquet file
> without getting the data for those fields (and, for example, inferring type
> and nullability from data.)
>
> In a relational DB, we'd get the schema by querying system tables. We'd do
> the same thing in Hive because Hive requires an up-front schema. But, Drill
> is unique in that it can infer schema at run time; no previous schema
> required. So, we have no system tables to answer schema questions. Instead,
> we want to get the schema directly from the data source itself by doing a
> query.
>
> (This feature would be in addition to the case when the Metastore does
> hold a schema.)
>
>
> How might we accomplish the same result? Can we create some kind of
> "virtual" system table that tells us to rewrite the query to get schema?
> Something like:
>
> SELECT * FROM sys.columns WHERE tableName = `dfs`.`my/path/someFile.json`
>
> Or, maybe some implied columns in the table schema?
>
>
> SELECT schema.* FROM `dfs`.`my/path/someFile.json`
>
>
> Or, maybe some special schema name space?
>
> SELECT schema.* FROM schema.`dfs`.`my/path/someFile.json`
>
>
> Anyone know of any system that has an elegant solution we could mimic?
> Other suggestions?
>
>
> Thanks,
> - Paul
>
>

Re: [DISCUSS] Schema queries - solutions?

Reply via email to