[ https://issues.apache.org/jira/browse/DRILL-6552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16581738#comment-16581738 ]

Paul Rogers commented on DRILL-6552:
------------------------------------

[~parthc], outstanding summary; I think you've got the major points.

Perhaps you might add an item 4: how Drill will consume the information: 
changes to the planner, system tables, physical plan, readers and code 
generation to exploit the metadata. Working that out will help sharpen the 
definition of the metadata required. (See, in particular, the bugs associated 
with schema ambiguities in JSON.)
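
As a sketch of the kind of ambiguity involved (hypothetical data, not taken 
from a specific bug report): two files in the same JSON directory can imply 
conflicting types for the same column, and the metadata design has to record 
which type the readers and code generation should use:

```
file1.json:  {"a": 1}        -- Drill infers column "a" as BIGINT
file2.json:  {"a": "one"}    -- Drill infers column "a" as VARCHAR
```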

To expand on a point in your summary: it would be handy to separate the 
concerns of implementation (concurrent writes, versioning) from the needs of 
the planner/execution engine (the ability to query the information efficiently).

To motivate why that separation is useful: it would be handy to associate a 
metastore plugin with a storage plugin and/or workspace. I may have a bunch of 
"official" Parquet files described in HMS. But I may also have a bunch of 
ad-hoc JSON or other files described in, say, a schema file associated with my 
ad-hoc directory. If I can define a metastore plugin per workspace, I can 
point my working directory at a simple, file-based metastore. Later, when I do 
ETL to Parquet, I can create entries in HMS.
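
A sketch of what a per-workspace association might look like, extending 
Drill's existing dfs workspace configuration (the "metastore" key and its URI 
values are hypothetical, invented here for illustration):

```json
{
  "workspaces": {
    "scratch": {
      "location": "/data/adhoc",
      "writable": true,
      "defaultInputFormat": "json",
      "metastore": "file:///data/adhoc/.drill_schema"
    },
    "warehouse": {
      "location": "/data/warehouse",
      "writable": false,
      "defaultInputFormat": "parquet",
      "metastore": "hms://metastore-host:9083"
    }
  }
}
```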

This project sounds pretty large. Given the modular structure you've outlined, 
the initial implementation might focus on the API and the Drill internals 
changes to use the data. Create starter implementations for HMS, Drill's 
existing Parquet metadata, an easy-to-use file-based description for ad-hoc 
uses, and a system-table-based approach when querying a JDBC data store.
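
To make the modular idea concrete, here is a minimal sketch of what such a 
pluggable API might look like. All names here (DrillMetastore, TableMetadata, 
InMemoryMetastore) are hypothetical, invented for illustration; the real API 
would need versioning, statistics beyond row count, and so on:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch only: not an actual Drill API.
interface DrillMetastore {
    Optional<TableMetadata> get(String workspace, String table);
    void put(String workspace, String table, TableMetadata metadata);
}

final class TableMetadata {
    final String schemaJson;   // serialized schema for the table
    final long rowCount;       // a basic statistic the planner could use

    TableMetadata(String schemaJson, long rowCount) {
        this.schemaJson = schemaJson;
        this.rowCount = rowCount;
    }
}

// Starter implementation backed by an in-memory map; file-based, HMS-backed,
// or system-table-backed implementations would plug in behind the same
// interface, selected per workspace.
final class InMemoryMetastore implements DrillMetastore {
    private final Map<String, TableMetadata> tables = new HashMap<>();

    @Override
    public Optional<TableMetadata> get(String workspace, String table) {
        return Optional.ofNullable(tables.get(workspace + "." + table));
    }

    @Override
    public void put(String workspace, String table, TableMetadata metadata) {
        tables.put(workspace + "." + table, metadata);
    }
}
```

The point of the interface is that the planner and readers code against it 
once, while each backend (HMS, Parquet metadata cache, schema file) is just 
another implementation.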

Once the interfaces work, and the internals consume the metadata, then work 
can move on to a Drill-specific implementation of the metastore. (Or, work 
with the team splitting HMS out of Hive.)

Further thought: the weaknesses of HMS are well known. Perhaps no one expected 
it to become so heavily used when it was first cranked out on top of an RDBMS. 
If the Drill team invests in a new, more scalable solution, it would be great 
if Drill's metastore could (eventually) implement the HMS APIs so that the 
"DMS" (Drill MetaStore) becomes a viable, scalable alternative. Doing that 
would encourage others to contribute to building the metastore. This means 
Drill might want to influence the HMS APIs (the current ones were quick hacks) 
to allow Drill to better use HMS, and to allow the Drill metastore to be 
consumed by HMS clients.

> Drill Metadata management "Drill MetaStore"
> -------------------------------------------
>
>                 Key: DRILL-6552
>                 URL: https://issues.apache.org/jira/browse/DRILL-6552
>             Project: Apache Drill
>          Issue Type: New Feature
>          Components: Metadata
>    Affects Versions: 1.13.0
>            Reporter: Vitalii Diravka
>            Assignee: Vitalii Diravka
>            Priority: Major
>             Fix For: 2.0.0
>
>
> It would be useful for Drill to have some sort of metastore which would 
> enable Drill to remember previously defined schemata so Drill doesn’t have to 
> do the same work over and over again.
> It would allow Drill to store schema and statistics, which would accelerate 
> query validation, planning, and execution. It would also increase the 
> stability of Drill and help avoid various kinds of issues: "schema change" 
> exceptions, the "limit 0" optimization, and so on.
> One of the main candidates is Hive Metastore.
> Starting from version 3.0, the Hive Metastore can run as a separate service 
> from the Hive server:
> [https://cwiki.apache.org/confluence/display/Hive/AdminManual+Metastore+3.0+Administration]
> An optional enhancement is storing Drill's profiles, UDFs, and plugin 
> configs in some kind of metastore as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
