paul-rogers commented on a change in pull request #1953: Add docs for Drill Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r374442403
########## File path: _docs/performance-tuning/drill-metastore/010-using-drill-metastore.md ########## @@ -0,0 +1,408 @@ +--- +title: "Using Drill Metastore" +parent: "Drill Metastore" +date: 2020-01-31 +--- + +Drill 1.17 introduces the Drill Metastore which stores the table schema and table statistics. Statistics allow Drill to better create optimal query plans. + +The Metastore is a Beta feature; it is subject to change. We encourage you to try it and provide feedback. +Because the Metastore is in Beta, the SQL commands and Metastore formats may change in the next release. +{% include startnote.html %}In Drill 1.17, this feature is supported for Parquet tables only and is disabled by default.{% include endnote.html %} + +## Enabling Drill Metastore + +To use the Drill Metastore, you must enable it at the session or system level with one of the following commands: + + SET `metastore.enabled` = true; + ALTER SYSTEM SET `metastore.enabled` = true; + +Alternatively, you can enable the option in the Drill Web UI at `http://<drill-hostname-or-ip-address>:8047/options`. + +## Computing and storing table metadata to Drill Metastore + +Once you enable the Metastore, the next step is to populate it with data. Drill can query a table whether that table + has a Metastore entry or not. (If you are familiar with Hive, then you know that Hive requires that all tables have + Hive Metastore entries before you can query them.) In Drill, only add data to the Metastore when doing so improves + query performance. In general, large tables benefit from statistics more than small tables do. + +Unlike Hive, Drill does not require you to declare a schema. Instead, Drill infers the schema by scanning your table + in the same way as it is done during regular select and computes some metadata like `MIN` / `MAX` column values and + `NULLS_COUNT` designated as "metadata" to be able to produce more optimizations like filter push-down, etc. 
If + `planner.statistics.use` option is enabled, this command will also calculate and store table statistics into Drill + Metastore. + +## Configuration + +Default Metastore configuration is defined in `drill-metastore-default.conf` file. +It can be overridden in `drill-metastore-override.conf`. Distribution configuration can be +indicated in `drill-metastore-distrib.conf`. + +All configuration properties should reside in `drill.metastore` namespace. +Metastore implementation based on class implementation config property `drill.metastore.implementation.class`. +The default value is the following: + +``` +drill.metastore: { + implementation.class: "org.apache.drill.metastore.iceberg.IcebergMetastore" +} +``` + +Note, that currently out of box Iceberg Metastore is available and is the default one. Though any custom + implementation can be added by placing the JAR into classpath which has the implementation of + `org.apache.drill.metastore.Metastore` interface and indicating custom class in the `drill.metastore.implementation.class`. + +### Metastore Components + +Metastore can store metadata for various components: tables, views, etc. +Current implementation provides fully functioning support for tables component. +Views component support is not implemented but contains stub methods to show +how new Metastore components like UDFs, storage plugins, etc. can be added in the future. + +### Metastore Tables + +Metastore Tables component contains metadata about Drill tables, including general information, as well as +information about table segments, files, row groups, partitions. + +Full table metadata consists of two major concepts: general information and top-level segments metadata. +Table general information contains basic table information and corresponds to the `BaseTableMetadata` class. + +A table can be non-partitioned and partitioned. Non-partitioned tables have only one top-level segment +which is called default (`MetadataInfo#DEFAULT_SEGMENT_KEY`). 
Partitioned tables may have several top-level segments. +Each top-level segment can include metadata about inner segments, files, row groups, and partitions. + +A unique table identifier in Metastore Tables is a combination of storage plugin, workspace, and table name. +Table metadata inside is grouped by top-level segments, unique identifier of the top-level segment and its metadata +is storage plugin, workspace, table name, and metadata key. + +### Related Session/System Options + +The following options are set via `ALTER SYSTEM SET`, or `ALTER SESSION SET` or via the Drill Web console. + +- **metastore.enabled** +Enables Drill Metastore usage to be able to store table metadata during ANALYZE TABLE commands execution and to be able + to read table metadata during regular queries execution or when querying some INFORMATION_SCHEMA tables. Default is `false`. +- **metastore.metadata.store.depth_level** +Specifies the maximum level of depth for collecting metadata. Same options as the _level_ option above. Default is `'ALL'`. +- **metastore.retrieval.retry_attempts** +If you run the `ANALYZE TABLE` command at the same time as queries run, then the query can read incorrect or corrupt statistics. +Drill will reload statistics and replan the query. This option specifies the maximum number of retry attempts. Default is `5`. +- **metastore.metadata.fallback_to_file_metadata** +Allows using [file metadata cache]({{site.baseurl}}/docs/refresh-table-metadata) for the case when required metadata is absent in the Metastore. +Default is `true`. +- **metastore.metadata.use_schema** +The `ANALYZE TABLE` command infers table schema as it gathers statistics. This option tells Drill to use that schema information while planning the query. +Disable this option if Drill has inferred the schema incorrectly, or schema will be provided separately (see [CREATE OR REPLACE SCHEMA]({{site.baseurl}}/docs/create-or-replace-schema)). +Default is `true`. 
+- **metastore.metadata.use_statistics** +Enables obtaining table and column statistics, stored in the Metastore, at the planning stage. Default is `true`. +Enable `planner.statistics.use` to be able to use statistics during query planning. +- **metastore.metadata.ctas.auto-collect** +Drill provides the [`CREATE TABLE AS`]({{site.baseurl}}/docs/create-or-replace-schema) commands to create new tables. +This option causes Drill to gather schema and statistics for those tables automatically as they are written. +This option is not implemented for now. Possible values: `'ALL'`, `'SCHEMA'`, `'NONE'`. Default is `'NONE'`. +- **drill.exec.storage.implicit.last_modified_time.column.label** +Sets the implicit column name for the last modified time (`lmt`) column. Used when producing Metastore analyze. You can + set the last modified time column name to custom name when current column name clashes which column name present in the + table. If your table contains a column name with the same name as an implicit column, the implicit column takes + priority and shadows column from the table. +- **drill.exec.storage.implicit.row_group_index.column.label** +Sets the implicit column name for the row group index (`rgi`) column. Used when producing Metastore analyze. You can + set row group index column name to custom name when current column name clashes which column name present in the + table. If your table contains a column name with the same name as an implicit column, the implicit column takes + priority and shadows column from the table. +- **drill.exec.storage.implicit.row_group_length.column.label** +Sets the implicit column name for the row group length (`rgl`) column. Used when producing Metastore analyze. You can + set row group length column name to custom name when current column name clashes which column name present in the + table. 
If your table contains a column name with the same name as an implicit column, the implicit column takes + priority and shadows column from the table. +- **drill.exec.storage.implicit.row_group_start.column.label** +Sets the implicit column name for the row group start (`rgs`) column. Used when producing Metastore analyze. You can + set row group start column name to custom name when current column name clashes which column name present in the + table. If your table contains a column name with the same name as an implicit column, the implicit column takes + priority and shadows column from the table. + +## Incremental analysis + +If you have computed statistics for a table, and issue `ANALYZE TABLE` a second time, Drill will attempt to update

Review comment:
This sentence touches on two topics not yet described: "compute statistics" and "ANALYZE TABLE". We need to explain that. Something like:

You create Metastore metadata by running the ANALYZE TABLE command. The first time you run it, the Metastore will infer the schema and (depending on which options you have selected) populate statistics.

Tables change over time. To keep the Metastore up-to-date, you must periodically run ANALYZE TABLE again on each changed table. When you do ... (insert material from here).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services