paul-rogers commented on a change in pull request #1953: Add docs for Drill
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r374438429
##########
File path:
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
##########
@@ -0,0 +1,408 @@
+---
+title: "Using Drill Metastore"
+parent: "Drill Metastore"
+date: 2020-01-31
+---
+
+Drill 1.17 introduces the Drill Metastore which stores the table schema and
table statistics. Statistics allow Drill to better create optimal query plans.
+
+The Metastore is a Beta feature; it is subject to change. We encourage you to
try it and provide feedback.
+Because the Metastore is in Beta, the SQL commands and Metastore formats may
change in the next release.
+{% include startnote.html %}In Drill 1.17, this feature is supported for
Parquet tables only and is disabled by default.{% include endnote.html %}
+
+## Enabling Drill Metastore
+
+To use the Drill Metastore, you must enable it at the session or system level
with one of the following commands:
+
+ SET `metastore.enabled` = true;
+ ALTER SYSTEM SET `metastore.enabled` = true;
+
+Alternatively, you can enable the option in the Drill Web UI at
`http://<drill-hostname-or-ip-address>:8047/options`.
+
+## Computing and storing table metadata to Drill Metastore
+
+Once you enable the Metastore, the next step is to populate it with data.
Drill can query a table whether that table
+ has a Metastore entry or not. (If you are familiar with Hive, then you know
that Hive requires that all tables have
+ Hive Metastore entries before you can query them.) In Drill, only add data to
the Metastore when doing so improves
+ query performance. In general, large tables benefit from statistics more than
small tables do.
+
+Unlike Hive, Drill does not require you to declare a schema. Instead, Drill
infers the schema by scanning your table
+ in the same way as it is done during regular select and computes some
metadata like `MIN` / `MAX` column values and
+ `NULLS_COUNT` designated as "metadata" to be able to produce more
optimizations like filter push-down, etc. If
+ `planner.statistics.use` option is enabled, this command will also calculate
and store table statistics into Drill
+ Metastore.
+
+## Configuration
+
+Default Metastore configuration is defined in `drill-metastore-default.conf`
file.
+It can be overridden in `drill-metastore-override.conf`. Distribution
configuration can be
+indicated in `drill-metastore-distrib.conf`.
+
+All configuration properties should reside in `drill.metastore` namespace.
+Metastore implementation based on class implementation config property
`drill.metastore.implementation.class`.
+The default value is the following:
+
+```
+drill.metastore: {
+ implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore"
+}
+```
+
+Note, that currently out of box Iceberg Metastore is available and is the
default one. Though any custom
+ implementation can be added by placing the JAR into classpath which has the
implementation of
+ `org.apache.drill.metastore.Metastore` interface and indicating custom class
in the `drill.metastore.implementation.class`.
+
+### Metastore Components
+
+Metastore can store metadata for various components: tables, views, etc.
+Current implementation provides fully functioning support for tables component.
+Views component support is not implemented but contains stub methods to show
+how new Metastore components like UDFs, storage plugins, etc. can be added in
the future.
+
+### Metastore Tables
+
+Metastore Tables component contains metadata about Drill tables, including
general information, as well as
+information about table segments, files, row groups, partitions.
+
+Full table metadata consists of two major concepts: general information and
top-level segments metadata.
+Table general information contains basic table information and corresponds to
the `BaseTableMetadata` class.
+
+A table can be non-partitioned and partitioned. Non-partitioned tables have
only one top-level segment
+which is called default (`MetadataInfo#DEFAULT_SEGMENT_KEY`). Partitioned
tables may have several top-level segments.
+Each top-level segment can include metadata about inner segments, files, row
groups, and partitions.
Review comment:
This is a bit too vague to be useful. First, describe how the Metastore
describes a table. Maybe, "A table is assumed to be made up of one or more
files which reside in one or more directories. Every table described by the
Metastore must reside in a top-level folder; the Metastore cannot describe bare
files. (True?)
If a table consists of a single directory, then it is non-partitioned. The
single directory can contain any number of files. Larger tables tend to have
subdirectories. Each subdirectory is a partition and such tables are called
'partitioned.'"
It looks like the material then tries to say how the information is
stored. I would describe that separately since there is nothing a user can do
with the information.
Instead, it would be better to show an example. Do we have a way to query
Metastore data? If so, then say:
"The following shows metadata for a non-partitioned table:
```
SELECT ...
<show output>
```
Partitioned tables use *partition keys*: the directory name describes data
within that directory. For example, the "2019" folder holds data for the year
2019. The Metastore contains an entry for each partition. (Show an example.)
Partitions can be nested: we might have subdirectories 01-12 within the 2019
folder. (Explain how this appears when queried.)
The Metastore assumes your directory names follow a convention. You may be
familiar with the Hive `col=value` format. Drill instead uses (describe what it
is.)"
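To make the proposed wording concrete, here is a minimal Python sketch of the partitioned / non-partitioned distinction as described above. It only illustrates the suggested doc text and is not how Drill or the Metastore implements it; the helper names `partition_keys` and `is_partitioned` and the sample file paths are made up for the example.

```python
from pathlib import PurePosixPath


def partition_keys(file_paths):
    """Map each file (path relative to the table root) to its partition keys.

    Hypothetical helper: the directory names between the table root and the
    file are treated as the partition keys, outermost first.
    """
    return {
        path: PurePosixPath(path).parts[:-1]
        for path in file_paths
    }


def is_partitioned(file_paths):
    """A table counts as partitioned if any file sits in a subdirectory."""
    return any(keys for keys in partition_keys(file_paths).values())


# Non-partitioned: all files live directly in the table directory.
flat = ["part-0.parquet", "part-1.parquet"]

# Partitioned, with nested partitions: year, then month.
nested = ["2019/01/part-0.parquet", "2019/02/part-0.parquet"]
```

With these inputs, `flat` has no partition keys for any file, while each file in `nested` carries a year key and a month key. Something this small in the docs, next to real query output, would show readers what a "partition" means before the storage details come up.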