[CARBONDATA-2575] Add document to explain DataMap Management Add document to explain DataMap Management
This closes #2360 Project: http://git-wip-us.apache.org/repos/asf/carbondata/repo Commit: http://git-wip-us.apache.org/repos/asf/carbondata/commit/d401e060 Tree: http://git-wip-us.apache.org/repos/asf/carbondata/tree/d401e060 Diff: http://git-wip-us.apache.org/repos/asf/carbondata/diff/d401e060 Branch: refs/heads/carbonstore Commit: d401e060adcc531d66468dc61f4d468768cfea3f Parents: 5f68a79 Author: Jacky Li <jacky.li...@qq.com> Authored: Mon Jun 4 21:18:31 2018 +0800 Committer: Jacky Li <jacky.li...@qq.com> Committed: Tue Jun 5 19:57:06 2018 +0800 ---------------------------------------------------------------------- docs/datamap/datamap-management.md | 111 ++++++++++++++++++++++++++++++++ 1 file changed, 111 insertions(+) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/carbondata/blob/d401e060/docs/datamap/datamap-management.md ---------------------------------------------------------------------- diff --git a/docs/datamap/datamap-management.md b/docs/datamap/datamap-management.md new file mode 100644 index 0000000..01bb69f --- /dev/null +++ b/docs/datamap/datamap-management.md @@ -0,0 +1,111 @@ +# CarbonData DataMap Management + +## Overview + +DataMap can be created using following DDL + +``` + CREATE DATAMAP [IF NOT EXISTS] datamap_name + [ON TABLE main_table] + USING "datamap_provider" + [WITH DEFERRED REBUILD] + DMPROPERTIES ('key'='value', ...) + AS + SELECT statement +``` + +Currently, there are 5 DataMap implementation in CarbonData. + +| DataMap Provider | Description | DMPROPERTIES | Management | +| ---------------- | ---------------------------------------- | ---------------------------------------- | ---------------- | +| preaggregate | single table pre-aggregate table | No DMPROPERTY is required | Automatic | +| timeseries | time dimension rollup table. | event_time, xx_granularity, please refer to [Timeseries DataMap](https://github.com/apache/carbondata/blob/master/docs/datamap/timeseries-datamap-guide.md) | Automatic | +| mv | multi-table pre-aggregate table, | No DMPROPERTY is required | Manual | +| lucene | lucene indexing for text column | index_columns to specifying the index columns | Manual/Automatic | +| bloom | bloom filter for high cardinality column, geospatial column | index_columns to specifying the index columns | Manual/Automatic | + +## DataMap Management + +There are two kinds of management semantic for DataMap. + +1. Autmatic Refresh: Create datamap without `WITH DEFERED REBUILD` in the statement +2. Manual Refresh: Create datamap with `WITH DEFERED REBUILD` in the statement + +### Automatic Refresh + +When user creates a datamap on the main table without using `WITH DEFERED REBUILD` syntax, the datamap will be managed by system automatically. +For every data load to the main table, system will immediately triger a load to the datamap automatically. These two data loading (to main table and datamap) is executed in a transactional manner, meaning that it will be either both success or neither success. + +The data loading to datamap is incremental based on Segment concept, avoiding a expesive total rebuild. + +If user perform following command on the main table, system will return failure. (reject the operation) + +1. Data management command: `UPDATE/DELETE/DELETE SEGMENT`. +2. Schema management command: `ALTER TABLE DROP COLUMN`, `ALTER TABLE CHANGE DATATYPE`, + `ALTER TABLE RENAME`. Note that adding a new column is supported, and for dropping columns and + change datatype command, CarbonData will check whether it will impact the pre-aggregate table, if + not, the operation is allowed, otherwise operation will be rejected by throwing exception. +3. Partition management command: `ALTER TABLE ADD/DROP PARTITION + +If user do want to perform above operations on the main table, user can first drop the datamap, perform the operation, and re-create the datamap again. + +If user drop the main table, the datamap will be dropped immediately too. + +### Manual Refresh + +When user creates a datamap specifying maunal refresh semantic, the datamap is created with status *disabled* and query will NOT use this datamap until user can issue REBUILD DATAMAP command to build the datamap. For every REBUILD DATAMAP command, system will trigger a full rebuild of the datamap. After rebuild is done, system will change datamap status to *enabled*, so that it can be used in query rewrite. + +For every new data loading, data update, delete, the related datamap will be made *disabled*. + +If the main table is dropped by user, the related datamap will be dropped immediately. + +*Note: If you are creating a datamap on external table, you need to do manual managment of the datamap.* + + + +## DataMap Catalog + +Currently, when user creates a datamap, system will store the datamap metadata in a configurable *system* folder in HDFS or S3. + +In this *system* folder, it contains: + +- DataMapSchema file. It is a json file containing schema for one datamap. Ses DataMapSchema class. If user creates 100 datamaps (on different tables), there will be 100 files in *system* folder. +- DataMapStatus file. Only one file, it is in json format, and each entry in the file represents for one datamap. Ses DataMapStatusDetail class + +There is a DataMapCatalog interface to retrieve schema of all datamap, it can be used in optimizer to get the metadata of datamap. + + + +## DataMap Related Commands + +### Explain + +How can user know whether datamap is used in the query? + +User can use EXPLAIN command to know, it will print out something like + +```text +== CarbonData Profiler == +Hit mv DataMap: datamap1 +Scan Table: default.datamap1_table ++- filter: ++- pruning by CG DataMap ++- all blocklets: 1 + skipped blocklets: 0 +``` + +### Show DataMap + +There is a SHOW DATAMAPS command, when this is issued, system will read all datamap from *system* folder and print all information on screen. The current information includes: + +- DataMapName +- DataMapProviderName like mv, preaggreagte, timeseries, etc +- Associated Table + +### Compaction on DataMap + +This feature applies for preaggregate datamap only + +Running Compaction command (`ALTER TABLE COMPACT`) on main table will **not automatically** compact the pre-aggregate tables created on the main table. User need to run Compaction command separately on each pre-aggregate table to compact them. + +Compaction is an optional operation for pre-aggregate table. If compaction is performed on main table but not performed on pre-aggregate table, all queries still can benefit from pre-aggregate tables. To further improve the query performance, compaction on pre-aggregate tables can be triggered to merge the segments and files in the pre-aggregate tables.