vvysotskyi commented on a change in pull request #1953: Add docs for Drill 
Metastore
URL: https://github.com/apache/drill/pull/1953#discussion_r373564880
 
 

 ##########
 File path: 
_docs/performance-tuning/drill-metastore/010-using-drill-metastore.md
 ##########
 @@ -0,0 +1,378 @@
+---
+title: "Using Drill Metastore"
+parent: "Drill Metastore"
+date: 2020-01-30
+---
+
+Drill 1.17 introduces the Drill Metastore, which stores table schemas and table statistics. Statistics help Drill create more efficient query plans.
+
+The Metastore is a Beta feature and is subject to change. We encourage you to try it and provide feedback.
+Because the Metastore is in Beta, the SQL commands and Metastore formats may change in the next release.
+{% include startnote.html %}In Drill 1.17, this feature is supported for 
Parquet tables only and is disabled by default.{% include endnote.html %}
+
+## Enabling Drill Metastore
+
+To use the Drill Metastore, you must enable it at the session or system level 
with one of the following commands:
+
+       SET `metastore.enabled` = true;
+       ALTER SYSTEM SET `metastore.enabled` = true;
+
+Alternatively, you can enable the option in the Drill Web UI at 
`http://<drill-hostname-or-ip-address>:8047/options`.
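+
+To confirm that the option took effect, one possible check is to query the `sys.options` system table. This is a minimal sketch; the columns returned depend on the Drill version, so it selects all of them:
+
+```
+-- Show the current value and scope of the Metastore switch.
+SELECT * FROM sys.options WHERE name = 'metastore.enabled';
+```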
+
+## Computing and storing table metadata to Drill Metastore
+
+Once you enable the Metastore, the next step is to populate it with data. 
Drill can query a table whether that table
+ has a Metastore entry or not. (If you are familiar with Hive, then you know 
that Hive requires that all tables have
+ Hive Metastore entries before you can query them.) In Drill, only add data to 
the Metastore when doing so improves
+ query performance. In general, large tables benefit from statistics more than 
small tables do.
+
+Unlike Hive, Drill does not require you to declare a schema. Instead, the `ANALYZE TABLE` command infers the schema by scanning your table,
+ in the same way as a regular `SELECT` does. It also computes values such as the MIN / MAX of each column and the NULLS COUNT,
+ collectively called "metadata", which enable further optimizations such as filter push-down. If the
+ `planner.statistics.use` option is enabled, the command also calculates table statistics and stores them in the Drill Metastore.
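+
+As a minimal sketch of the statistics option mentioned above, the session below enables `planner.statistics.use` before analyzing a table. The table name `dfs.tmp.lineitem` is borrowed from the Examples section and stands in for any Parquet table:
+
+```
+-- Also collect table statistics during ANALYZE TABLE.
+SET `planner.statistics.use` = true;
+-- Compute metadata (and statistics) and store them in the Metastore.
+ANALYZE TABLE dfs.tmp.lineitem REFRESH METADATA;
+```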
+
+## Configuration
+
+The default Metastore configuration is defined in the `drill-metastore-default.conf` file.
+It can be overridden in `drill-metastore-override.conf`. Distribution-specific configuration can be
+placed in `drill-metastore-distrib.conf`.
+
+All configuration properties should reside in the `drill.metastore` namespace.
+The Metastore implementation is selected by the `drill.metastore.implementation.class` configuration property.
+
+### Metastore Components
+
+The Metastore can store metadata for various components: tables, views, etc.
+The current implementation provides fully functioning support for the tables component.
+Support for the views component is not implemented, but it contains stub methods that show
+how new Metastore components, such as UDFs or storage plugins, could be added in the future.
+
+### Metastore Tables
+
+The Metastore Tables component contains metadata about Drill tables, including general information as well as
+information about table segments, files, row groups, and partitions.
+
+Full table metadata consists of two major parts: general information and top-level segment metadata.
+General information contains basic information about the table and corresponds to the `BaseTableMetadata` class.
+
+A table can be partitioned or non-partitioned. Non-partitioned tables have only one top-level segment,
+which is called the default segment (`MetadataInfo#DEFAULT_SEGMENT_KEY`). Partitioned tables may have several top-level segments.
+Each top-level segment can include metadata about inner segments, files, row groups, and partitions.
+
+The unique identifier of a table in Metastore Tables is the combination of storage plugin, workspace, and table name.
+Within a table, metadata is grouped by top-level segment. The unique identifier of a top-level segment and its metadata
+is the combination of storage plugin, workspace, table name, and metadata key.
+
+### Related Session/System Options
+
+You can set the following options with `ALTER SYSTEM SET` or `ALTER SESSION SET`, or in the Drill Web UI.
+
+- **metastore.enabled**
+Enables the Drill Metastore, so that Drill can store table metadata when executing `ANALYZE TABLE` commands and
+ read table metadata when executing regular queries or querying some INFORMATION_SCHEMA tables. Default is `false`.
+- **metastore.metadata.store.depth_level**
+Specifies the maximum depth level for collecting metadata. Accepts the same values as the `LEVEL` clause of the `ANALYZE TABLE` command. Default is `'ALL'`.
+- **metastore.retrieval.retry_attempts**
+If you run the `ANALYZE TABLE` command while queries are running, a query can read incorrect or corrupt statistics.
+In that case, Drill reloads the statistics and replans the query. This option specifies the maximum number of retry attempts. Default is `5`.
+- **metastore.metadata.fallback_to_file_metadata**
+Allows Drill to use the [file metadata cache]({{site.baseurl}}/docs/refresh-table-metadata) when the required metadata is absent from the Metastore.
+Default is `true`.
+- **metastore.metadata.use_schema**
+The `ANALYZE TABLE` command infers the table schema as it gathers statistics. This option tells Drill to use that schema information while planning the query.
+Disable this option if Drill has inferred the schema incorrectly, or if the schema is provided separately (see [CREATE OR REPLACE SCHEMA]({{site.baseurl}}/docs/create-or-replace-schema)).
+Default is `true`.
+- **metastore.metadata.use_statistics**
+Allows Drill to obtain table and column statistics stored in the Metastore at the planning stage. Default is `true`.
+The `planner.statistics.use` option must also be enabled for Drill to use statistics during query planning.
+- **metastore.metadata.ctas.auto-collect**
+Drill provides the [`CREATE TABLE AS`]({{site.baseurl}}/docs/create-table-as-ctas) command to create new tables.
+This option causes Drill to gather schema and statistics for those tables automatically as they are written.
+This option is not yet active. Possible values: `'ALL'`, `'SCHEMA'`, `'NONE'`. Default is `'NONE'`.
+- **drill.exec.storage.implicit.last_modified_time.column.label**
+Sets the implicit column name for the last modified time (`lmt`) column. Used when running `ANALYZE TABLE` for the Metastore.
+- **drill.exec.storage.implicit.row_group_index.column.label**
+Sets the implicit column name for the row group index (`rgi`) column. Used when running `ANALYZE TABLE` for the Metastore.
+- **drill.exec.storage.implicit.row_group_length.column.label**
+Sets the implicit column name for the row group length (`rgl`) column. Used when running `ANALYZE TABLE` for the Metastore.
+- **drill.exec.storage.implicit.row_group_start.column.label**
+Sets the implicit column name for the row group start (`rgs`) column. Used when running `ANALYZE TABLE` for the Metastore.
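+
+For illustration, the statements below change a few of these options. The values are arbitrary examples rather than recommendations:
+
+```
+-- Allow more retry attempts when metadata changes during planning (system-wide).
+ALTER SYSTEM SET `metastore.retrieval.retry_attempts` = 10;
+-- For this session only, do not fall back to the file metadata cache.
+ALTER SESSION SET `metastore.metadata.fallback_to_file_metadata` = false;
+-- Restore the system-level default.
+ALTER SYSTEM RESET `metastore.retrieval.retry_attempts`;
+```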
+
+## Incremental analysis
+
+If you have computed statistics for a table and issue `ANALYZE TABLE` a second time, Drill attempts to update the existing statistics. This is called "incremental analysis."
+Incremental analysis computes metadata only for files and partitions changed since the last analysis and reuses
+ up-to-date metadata from the Metastore where possible.
+
+Drill performs incremental analysis only when the `ANALYZE TABLE` command is compatible with the previous run:
+- The list of columns in the `COLUMNS` clause must be a subset of the interesting columns from the previous run.
+- The metadata level in the `LEVEL` clause must be the same as in the previous run.
+
+If either of these two conditions is not met, Drill performs a full analysis of the entire table.
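+
+A sketch of how these conditions play out, assuming the `dfs.tmp.lineitem` table from the Examples section, the standard TPC-H column names `l_orderkey`, `l_partkey`, and `l_suppkey`, and that `'file'` is an accepted level value:
+
+```
+-- First run: full analysis of two columns at file level.
+ANALYZE TABLE dfs.tmp.lineitem COLUMNS (l_orderkey, l_partkey) REFRESH METADATA 'file' LEVEL;
+-- Subset of the previous columns, same level: eligible for incremental analysis.
+ANALYZE TABLE dfs.tmp.lineitem COLUMNS (l_orderkey) REFRESH METADATA 'file' LEVEL;
+-- l_suppkey was not analyzed before, so this is not a subset: expect a full analysis.
+ANALYZE TABLE dfs.tmp.lineitem COLUMNS (l_orderkey, l_suppkey) REFRESH METADATA 'file' LEVEL;
+```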
+
+## General Information
+
+- Drill 1.17 supports the Metastore and `ANALYZE TABLE` only for tables stored as Parquet files and only when they are stored in the `DFS` storage plugin.
+- The first time you execute `ANALYZE TABLE` for a table, Drill scans the entire table (all files).
+When you next issue the same command, Drill scans only those files added since the previous run.
+If the table metadata is already up to date, the command returns the following message:
+
+
+```
+apache drill (dfs.tmp)> analyze table lineitem refresh metadata;
++-------+---------------------------------------------------------+
+|  ok   |                         summary                         |
++-------+---------------------------------------------------------+
+| false | Table metadata is up to date, analyze wasn't performed. |
++-------+---------------------------------------------------------+
+```
+
+### Metadata usage
+
+Drill uses the Metastore in several places. When you run a query over multiple directories, files, or Parquet row groups,
+Drill uses statistics to "prune" the scan, that is, to skip the directories, files, or row groups that
+do not contain data your query needs. If you add new files or directories and do not rerun `ANALYZE TABLE`,
+Drill assumes that the existing metadata is invalid and does not use it. Rerun `ANALYZE TABLE` periodically so that
+ Drill can use table metadata when possible.
+
+### Limitations
+
+This feature is currently in the Beta phase (preview, experimental) for Drill 1.17 and applies only to Parquet
+ tables in this release. You must enable this feature through the `metastore.enabled` system/session option.
+
+## Examples
+
+Examples throughout this topic use the files and directories described in the 
following section, Directory and File Setup.
+
+### Directory and File Setup
+
+Download the [TPC-H sf1 tables](https://s3-us-west-1.amazonaws.com/drill-public/tpch/sf1/tpch_sf1_parquet.tar.gz) and unpack the archive.
+
+Create a `lineitem` directory in `/tmp/` with two subdirectories named `s1` and `s2`, and copy the table data into each of them:
+
+    mkdir /tmp/lineitem
+    mkdir /tmp/lineitem/s1
+    mkdir /tmp/lineitem/s2
+    cp -r TPCH/lineitem /tmp/lineitem/s1
+    cp -r TPCH/lineitem /tmp/lineitem/s2
+
+Query the directory `/tmp/lineitem`:
+
+```
+SELECT count(*) FROM dfs.tmp.lineitem;
++----------+
+|  EXPR$0  |
++----------+
+| 12002430 |
++----------+
+1 row selected (0.291 seconds)
+```
+
+Notice that the query plan contains a group scan with `usedMetastore = false`:
+
+
+```
+00-00    Screen : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost = {2.1 rows, 2.1 cpu, 1.0 io, 0.0 network, 0.0 memory}, id = 8410
+00-01      Project(EXPR$0=[$0]) : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost = {2.0 rows, 2.0 cpu, 1.0 io, 0.0 network, 0.0 memory}, id = 8409
+00-02        DirectScan(groupscan=[selectionRoot = file:/tmp/lineitem, numFiles = 12, usedMetadataSummaryFile = false, usedMetastore = false, ...
+```
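+
+A plan like the one above can be produced with `EXPLAIN PLAN`. This is a minimal sketch; the `INCLUDING ALL ATTRIBUTES` variant is assumed to be what adds the row type and cost details shown above:
+
+```
+-- Print the query plan and look for usedMetastore in the scan operator.
+EXPLAIN PLAN INCLUDING ALL ATTRIBUTES FOR SELECT count(*) FROM dfs.tmp.lineitem;
+```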
+
+### Computing and storing table metadata to Drill Metastore
+
+Run the [ANALYZE TABLE]({{site.baseurl}}/docs/analyze-table-refresh-metadata) command on the table whose metadata should
+ be computed and stored in the Drill Metastore:
+
+
+```
+apache drill> ANALYZE TABLE dfs.tmp.lineitem REFRESH METADATA;
++------+-------------------------------------------------------------+
+|  ok  |                           summary                           |
++------+-------------------------------------------------------------+
+| true | Collected / refreshed metadata for table [dfs.tmp.lineitem] |
++------+-------------------------------------------------------------+
+1 row selected (32.257 seconds)
+```
+
+The output of this command provides the status of the command execution and 
its summary.
+
+Once the table metadata is collected and stored, Drill uses it when querying the table. To ensure that it was used, its
 
 Review comment:
   Thanks, updated.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
