[ https://issues.apache.org/jira/browse/DRILL-6552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16581724#comment-16581724 ]
Parth Chandra commented on DRILL-6552:
--------------------------------------
Some thoughts I jotted down on this topic a while ago. (These may go beyond
what people have in mind for the first cut, but I figured I'd throw them into
the discussion anyway.)
There are three parts to this problem:
1) The design of the schema of the metastore itself (the schema).
2) The storage of the metadata to the metastore (the store).
3) The metadata APIs.
As an example, the metadata cache for Parquet is a metadata store that
describes Parquet files, their schema, the rowgroups within the files, and
statistics for the rowgroups. There have been at least three versions of the
information kept for Parquet files; i.e. the schema and APIs have had at least
three versions. The storage layer is simply files on HDFS. This solution was
easy to develop, but it has shortcomings when it comes to concurrent access and
updates, and it does not scale well for directories that may have tens of
thousands of files.
The schema must be versioned so that multiple versions of Drill can access data
from the same metastore. This means that not only must the schema be versioned,
but the metastore API must also provide backward and forward compatibility.
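A rough sketch of what versioned access could look like, assuming metadata
entries are stored as tagged field maps. Every name here (Entry, V2Reader, the
"rowCount" and "location" fields) is hypothetical and not taken from Drill. The
reader fills in a default for fields that older writers did not know about, and
ignores fields added by newer writers:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these names come from Drill.
final class VersionedMetadataSketch {

  /** A metadata entry tagged with the schema version that wrote it. */
  static final class Entry {
    final int schemaVersion;
    final Map<String, String> fields;

    Entry(int schemaVersion, Map<String, String> fields) {
      this.schemaVersion = schemaVersion;
      this.fields = new HashMap<>(fields); // defensive copy
    }
  }

  /** A reader built against schema version 2, assumed to have added "rowCount". */
  static final class V2Reader {
    static final int MY_VERSION = 2;

    /** Backward compatibility: version 1 entries lack rowCount, so return -1. */
    static long rowCount(Entry e) {
      String raw = e.fields.get("rowCount");
      return raw == null ? -1L : Long.parseLong(raw);
    }

    /** Forward compatibility: read the known field, ignore unknown ones. */
    static String location(Entry e) {
      return e.fields.get("location");
    }

    static boolean writtenByNewerVersion(Entry e) {
      return e.schemaVersion > MY_VERSION;
    }
  }

  public static void main(String[] args) {
    Map<String, String> fields = new HashMap<>();
    fields.put("location", "/data/orders");
    Entry old = new Entry(1, fields);     // written before rowCount existed
    fields.put("rowCount", "1000");
    fields.put("bloomFilter", "true");    // a field this reader has never seen
    Entry newer = new Entry(3, fields);

    System.out.println(V2Reader.rowCount(old));
    System.out.println(V2Reader.rowCount(newer));
    System.out.println(V2Reader.writtenByNewerVersion(newer));
  }
}
```

The trade-off in this sketch is that readers must tolerate missing and unknown
fields rather than fail on a version mismatch.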
The schema representation must be extensible, allowing new objects to be added
without requiring additional code in the metastore access layer. For instance,
a new storage plugin may have properties that we are not aware of but that
still need to be stored and retrieved.
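One way to picture this, as a minimal sketch only: the access layer handles a
generic object made of a kind, a name, and an open property map, so a new kind
of object needs no new code in that layer. MetadataObject, GenericCatalog, and
the example property names are all made up for illustration:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; none of these names come from Drill.
final class ExtensibleMetadataSketch {

  /** A generic metadata object: the catalog never interprets its properties. */
  static final class MetadataObject {
    final String kind;   // e.g. "table" or "plugin-config" (illustrative)
    final String name;
    final Map<String, String> properties;

    MetadataObject(String kind, String name, Map<String, String> properties) {
      this.kind = kind;
      this.name = name;
      this.properties =
          Collections.unmodifiableMap(new LinkedHashMap<String, String>(properties));
    }
  }

  /** Access layer that stores any kind of object without kind-specific code. */
  static final class GenericCatalog {
    private final Map<String, MetadataObject> objects = new LinkedHashMap<>();

    void put(MetadataObject o) {
      objects.put(o.kind + "/" + o.name, o);
    }

    /** Returns null when no object with this kind and name was stored. */
    MetadataObject get(String kind, String name) {
      return objects.get(kind + "/" + name);
    }
  }

  public static void main(String[] args) {
    GenericCatalog catalog = new GenericCatalog();

    // A plugin the catalog knows nothing about, with its own properties.
    Map<String, String> props = new LinkedHashMap<>();
    props.put("endpoint", "example-host:1234");
    props.put("batchSize", "500");
    catalog.put(new MetadataObject("plugin-config", "myplugin", props));

    MetadataObject loaded = catalog.get("plugin-config", "myplugin");
    System.out.println(loaded.properties);
  }
}
```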
An initial list of items that can be stored:
Schemas (tables, columns, types). Types may be complex.
Files, file splits and locality information.
Table partitioning information.
Column statistics.
UDF and built-in function definitions.
Storage plugin configurations.
Runtime metadata (query profiles).
The store must allow concurrent reads and writes. Reads are likely to be orders
of magnitude more frequent than writes. A common use case is Parquet files
produced by an external source being added to a subdirectory every day or every
hour while the parent directory (and therefore all subdirectories under it) is
being queried by end users.
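To illustrate the read-heavy pattern, here is a small in-memory sketch using a
copy-on-write snapshot: readers always see a consistent immutable map and never
block, while the rarer writers pay for a copy. This is just one possible
technique, not a proposal for the actual store, and SnapshotFileIndexSketch and
its methods are hypothetical names:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch; none of these names come from Drill.
final class SnapshotFileIndexSketch {

  // Immutable snapshot of file path -> row count, swapped atomically on write.
  private final AtomicReference<SortedMap<String, Long>> snapshot =
      new AtomicReference<>(
          Collections.unmodifiableSortedMap(new TreeMap<String, Long>()));

  /** Writer path: copy the current snapshot, add the file, publish the copy. */
  void addFile(String path, long rowCount) {
    snapshot.updateAndGet(old -> {
      TreeMap<String, Long> copy = new TreeMap<String, Long>(old);
      copy.put(path, rowCount);
      return Collections.unmodifiableSortedMap(copy);
    });
  }

  /** Reader path: lock-free lookup of all files under a directory. */
  List<String> filesUnder(String dir) {
    String prefix = dir.endsWith("/") ? dir : dir + "/";
    SortedMap<String, Long> view = snapshot.get();
    List<String> result = new ArrayList<>();
    // Paths sharing a prefix are contiguous in sorted order.
    for (String path : view.tailMap(prefix).keySet()) {
      if (!path.startsWith(prefix)) {
        break;
      }
      result.add(path);
    }
    return result;
  }

  public static void main(String[] args) {
    SnapshotFileIndexSketch index = new SnapshotFileIndexSketch();
    index.addFile("/data/events/2018-08-15/part-0.parquet", 100L);
    index.addFile("/data/events/2018-08-16/part-0.parquet", 120L);
    index.addFile("/data/other/part-0.parquet", 5L);
    System.out.println(index.filesUnder("/data/events"));
  }
}
```

Copying the whole map on every write is only reasonable because writes are
assumed to be rare; a real store would need something more incremental.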
The implementation must scale: it must be able to store metadata for hundreds
of thousands of data files and serve hundreds of concurrent reads of that
metadata.
It is highly desirable that the Drill planner and execution engine be able to
access the metadata without knowledge of the underlying store. The underlying
store may be the file system, a relational database, a NoSQL database, another
metastore, or may even be in-memory. This necessarily implies an API design
that separates out the underlying storage. Note that this also allows an
existing metastore (like the Hive Metastore) to be used.
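As a minimal sketch of that separation, assuming a deliberately tiny API: the
planner-side code below depends only on an interface, and an in-memory store is
one interchangeable implementation behind it. TableMetadataStore,
TableMetadata, InMemoryStore, and estimateRows are hypothetical names, not
existing Drill classes:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; none of these names come from Drill.
final class MetastoreApiSketch {

  /** Minimal table-level metadata, just enough for the example. */
  static final class TableMetadata {
    final String tableName;
    final long rowCount;

    TableMetadata(String tableName, long rowCount) {
      this.tableName = tableName;
      this.rowCount = rowCount;
    }
  }

  /** The only type callers see; it says nothing about where data lives. */
  interface TableMetadataStore {
    Optional<TableMetadata> find(String tableName);

    void save(TableMetadata metadata);
  }

  /** One possible backend. A file, database, or Hive-backed one could replace it. */
  static final class InMemoryStore implements TableMetadataStore {
    private final Map<String, TableMetadata> tables = new ConcurrentHashMap<>();

    @Override
    public Optional<TableMetadata> find(String tableName) {
      return Optional.ofNullable(tables.get(tableName));
    }

    @Override
    public void save(TableMetadata metadata) {
      tables.put(metadata.tableName, metadata);
    }
  }

  /** Planner-side helper: written against the interface, not a backend. */
  static long estimateRows(TableMetadataStore store, String table, long fallback) {
    return store.find(table).map(m -> m.rowCount).orElse(fallback);
  }

  public static void main(String[] args) {
    TableMetadataStore store = new InMemoryStore();
    store.save(new TableMetadata("orders", 1000L));
    System.out.println(estimateRows(store, "orders", -1L));
    System.out.println(estimateRows(store, "missing", -1L));
  }
}
```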
The initial implementation of the metastore may need to have support for more
than one implementation of the underlying store.
Since accessing the metastore is a critical operation, the metastore
implementation must not have a single point of failure.
> Drill Metadata management "Drill MetaStore"
> -------------------------------------------
>
> Key: DRILL-6552
> URL: https://issues.apache.org/jira/browse/DRILL-6552
> Project: Apache Drill
> Issue Type: New Feature
> Components: Metadata
> Affects Versions: 1.13.0
> Reporter: Vitalii Diravka
> Assignee: Vitalii Diravka
> Priority: Major
> Fix For: 2.0.0
>
>
> It would be useful for Drill to have some sort of metastore which would
> enable Drill to remember previously defined schemata so Drill doesn’t have to
> do the same work over and over again.
> It would allow storing schemas and statistics, which would speed up query
> validation, planning and execution. It would also make Drill more stable
> and help avoid various kinds of issues: "schema change
> Exceptions", "limit 0" optimization and so on.
> One of the main candidates is Hive Metastore.
> Starting with version 3.0, Hive Metastore can run as a separate service from
> Hive server:
> [https://cwiki.apache.org/confluence/display/Hive/AdminManual+Metastore+3.0+Administration]
> An optional enhancement is storing Drill's profiles, UDFs and plugin configs
> in some kind of metastore as well.