[CARBONDATA-2815][Doc] Add documentation for spilling memory and datamap rebuild

Add documentation for: 1. spilling unsafe memory during data loading, 2. datamap
rebuild for index datamaps

This closes #2604


Project: http://git-wip-us.apache.org/repos/asf/carbondata/repo
Commit: http://git-wip-us.apache.org/repos/asf/carbondata/commit/cc3f2bea
Tree: http://git-wip-us.apache.org/repos/asf/carbondata/tree/cc3f2bea
Diff: http://git-wip-us.apache.org/repos/asf/carbondata/diff/cc3f2bea

Branch: refs/heads/branch-1.4
Commit: cc3f2bea32b27662e808be3bbe402cb684afdccd
Parents: 41bd359
Author: xuchuanyin <xuchuan...@hust.edu.cn>
Authored: Thu Aug 2 22:39:49 2018 +0800
Committer: ravipesala <ravi.pes...@gmail.com>
Committed: Thu Aug 9 23:50:43 2018 +0530

----------------------------------------------------------------------
 docs/configuration-parameters.md   |   3 +-
 docs/datamap/datamap-management.md | 119 ++++++++++++++++++++++++++++++++
 2 files changed, 121 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/carbondata/blob/cc3f2bea/docs/configuration-parameters.md
----------------------------------------------------------------------
diff --git a/docs/configuration-parameters.md b/docs/configuration-parameters.md
index 77cf230..eee85e2 100644
--- a/docs/configuration-parameters.md
+++ b/docs/configuration-parameters.md
@@ -69,7 +69,8 @@ This section provides the details of all the configurations 
required for CarbonD
 | carbon.options.bad.record.path |  | Specifies the HDFS path where bad records are stored. By default the value is Null. This path must be configured by the user if the bad record logger is enabled or the bad record action is redirect. | |
 | carbon.enable.vector.reader | true | This parameter increases the performance of select queries as it fetches a columnar batch of 4*1024 rows instead of fetching data row by row. | |
 | carbon.blockletgroup.size.in.mb | 64 MB | The data are read as a group of blocklets which are called blocklet groups. This parameter specifies the size of the blocklet group. A higher value results in better sequential IO access. The minimum value is 16 MB; any value lower than 16 MB will reset to the default value (64 MB). |  |
-| carbon.task.distribution | block | **block**: Setting this value will launch 
one task per block. This setting is suggested in case of concurrent queries and 
queries having big shuffling scenarios. **custom**: Setting this value will 
group the blocks and distribute it uniformly to the available resources in the 
cluster. This enhances the query performance but not suggested in case of 
concurrent queries and queries having big shuffling scenarios. **blocklet**: 
Setting this value will launch one task per blocklet. This setting is suggested 
in case of concurrent queries and queries having big shuffling scenarios. 
**merge_small_files**: Setting this value will merge all the small partitions 
to a size of (128 MB is the default value of 
"spark.sql.files.maxPartitionBytes",it is configurable) during querying. The 
small partitions are combined to a map task to reduce the number of read task. 
This enhances the performance. | | 
+| carbon.task.distribution | block | **block**: Setting this value will launch one task per block. This setting is suggested in case of concurrent queries and queries having big shuffling scenarios. **custom**: Setting this value will group the blocks and distribute them uniformly to the available resources in the cluster. This enhances the query performance but is not suggested in case of concurrent queries and queries having big shuffling scenarios. **blocklet**: Setting this value will launch one task per blocklet. This setting is suggested in case of concurrent queries and queries having big shuffling scenarios. **merge_small_files**: Setting this value will merge all the small partitions to a size of 128 MB (the default value of "spark.sql.files.maxPartitionBytes", it is configurable) during querying. The small partitions are combined into a map task to reduce the number of read tasks. This enhances the performance. | |
+| carbon.load.sortmemory.spill.percentage | 0 | If unsafe memory is used during data loading, this configuration controls the behavior of spilling in-memory pages to disk. Internally, during sorting, CarbonData sorts data in pages and adds them to unsafe memory. If the memory is insufficient, CarbonData spills the pages to disk and generates sort temp files. This configuration controls how much of the in-memory page data will be spilled to disk, measured by size. The size can be calculated by multiplying this configuration value (as a percentage) with 'carbon.sort.storage.inmemory.size.inmb'. For example, the default value 0 means that no pages already in unsafe memory will be spilled and all the newly sorted data will be spilled to disk; the value 50 means that if the unsafe memory is insufficient, about half of the pages in unsafe memory will be spilled to disk, while the value 100 means that almost all pages in unsafe memory will be spilled. **Note**: This configuration only works for 'LOCAL_SORT' and 'BATCH_SORT', and the actual spilling behavior may be slightly different in each data load. | Integer values between 0 and 100 |
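
As an illustration only (the 512 MB and 50% figures below are arbitrary example values, not recommendations), the spill percentage is typically set together with the in-memory sort size in carbon.properties:

```
# example value: allow up to 512 MB of unsafe memory for in-memory sort pages
carbon.sort.storage.inmemory.size.inmb=512
# example value: spill roughly half of the in-memory pages to disk when unsafe memory is insufficient
carbon.load.sortmemory.spill.percentage=50
```

With these example values, a spill would write on the order of 50% * 512 MB = 256 MB of sorted pages to sort temp files, following the calculation described above.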
 
 * **Compaction Configuration**
   

http://git-wip-us.apache.org/repos/asf/carbondata/blob/cc3f2bea/docs/datamap/datamap-management.md
----------------------------------------------------------------------
diff --git a/docs/datamap/datamap-management.md 
b/docs/datamap/datamap-management.md
new file mode 100644
index 0000000..1695a23
--- /dev/null
+++ b/docs/datamap/datamap-management.md
@@ -0,0 +1,119 @@
+# CarbonData DataMap Management
+
+## Overview
+
+A DataMap can be created using the following DDL:
+
+```
+  CREATE DATAMAP [IF NOT EXISTS] datamap_name
+  [ON TABLE main_table]
+  USING "datamap_provider"
+  [WITH DEFERRED REBUILD]
+  DMPROPERTIES ('key'='value', ...)
+  AS
+    SELECT statement
+```
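
For example, a bloomfilter index datamap on a hypothetical table could be created as follows (the table and column names are made up; the bloomfilter provider and its index_columns DMPROPERTY are listed in the table below, and index datamaps do not use the trailing `AS SELECT` part):

```
  CREATE DATAMAP IF NOT EXISTS dm_city_bloom
  ON TABLE sales_main
  USING "bloomfilter"
  DMPROPERTIES ('index_columns'='city')
```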
+
+Currently, there are five DataMap implementations in CarbonData.
+
+| DataMap Provider | Description                              | DMPROPERTIES                             | Management       |
+| ---------------- | ---------------------------------------- | ---------------------------------------- | ---------------- |
+| preaggregate     | single-table pre-aggregate table         | No DMPROPERTY is required                | Automatic        |
+| timeseries       | time-dimension rollup table              | event_time, xx_granularity; please refer to [Timeseries DataMap](https://github.com/apache/carbondata/blob/master/docs/datamap/timeseries-datamap-guide.md) | Automatic        |
+| mv               | multi-table pre-aggregate table          | No DMPROPERTY is required                | Manual           |
+| lucene           | Lucene indexing for text columns         | index_columns to specify the index columns | Manual/Automatic |
+| bloomfilter      | bloom filter for high-cardinality columns, geospatial columns | index_columns to specify the index columns | Manual/Automatic |
+
+## DataMap Management
+
+There are two kinds of management semantics for DataMaps:
+
+1. Automatic Refresh: Create the datamap without `WITH DEFERRED REBUILD` in the statement; this is the default.
+2. Manual Refresh: Create the datamap with `WITH DEFERRED REBUILD` in the statement.
+
+### Automatic Refresh
+
+When a user creates a datamap on the main table without the `WITH DEFERRED REBUILD` syntax, the datamap is managed by the system automatically.
+For every data load to the main table, the system immediately triggers a load to the datamap. These two data loads (to the main table and to the datamap) are executed in a transactional manner, meaning that either both succeed or neither succeeds.
+
+The data loading to the datamap is incremental, based on the Segment concept, avoiding an expensive total rebuild.
+
+If the user performs any of the following commands on the main table, the system will return failure (reject the operation):
+
+1. Data management commands: `UPDATE/DELETE/DELETE SEGMENT`.
+2. Schema management commands: `ALTER TABLE DROP COLUMN`, `ALTER TABLE CHANGE DATATYPE`,
+   `ALTER TABLE RENAME`. Note that adding a new column is supported; for the drop-column and
+   change-datatype commands, CarbonData will check whether the operation impacts the pre-aggregate table. If
+   not, the operation is allowed; otherwise the operation will be rejected by throwing an exception.
+3. Partition management commands: `ALTER TABLE ADD/DROP PARTITION`.
+
+If the user does want to perform the above operations on the main table, the user can first drop the datamap, perform the operation, and then re-create the datamap, as sketched below.
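
A minimal sketch of that workflow, using hypothetical datamap, table, and column names (the syntax follows the CREATE/DROP DATAMAP DDL shown earlier; verify the exact ALTER TABLE form against your release):

```
  DROP DATAMAP IF EXISTS dm_city_bloom ON TABLE sales_main;
  ALTER TABLE sales_main DROP COLUMNS (obsolete_col);
  CREATE DATAMAP dm_city_bloom
  ON TABLE sales_main
  USING "bloomfilter"
  DMPROPERTIES ('index_columns'='city');
```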
+
+If the user drops the main table, the datamap will be dropped immediately as well.
+
+We recommend using this management semantic for index datamaps.
+
+### Manual Refresh
+
+When a user creates a datamap specifying the manual refresh semantic, the datamap is created with status *disabled*, and queries will NOT use this datamap until the user issues a REBUILD DATAMAP command to build it. For every REBUILD DATAMAP command, the system triggers a full rebuild of the datamap. After the rebuild is done, the system changes the datamap status to *enabled*, so that it can be used in query rewrite.
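
A minimal sketch of this flow, assuming a hypothetical mv datamap named dm_sales_agg over a hypothetical sales_main table (the clause ordering follows the DDL shown in the Overview):

```
  CREATE DATAMAP dm_sales_agg
  USING "mv"
  WITH DEFERRED REBUILD
  AS
    SELECT city, sum(amount) FROM sales_main GROUP BY city;

  REBUILD DATAMAP dm_sales_agg;
```

The datamap stays *disabled* after creation and after each load to sales_main; each REBUILD DATAMAP re-enables it once the full rebuild completes.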
+
+For every new data load, update, or delete, the related datamap will be made *disabled*,
+which means that subsequent queries will not benefit from the datamap until it becomes *enabled* again.
+
+If the main table is dropped by the user, the related datamap will be dropped immediately.
+
+**Note**:
++ If you are creating a datamap on an external table, you need to manage the datamap manually.
++ For an index datamap such as the BloomFilter datamap, there is no need to do manual refresh.
+ By default it uses automatic refresh,
+ which means its data will get refreshed immediately after the datamap is created or the main table is loaded.
+ Manual refresh on such a datamap will have no impact.
+
+
+
+## DataMap Catalog
+
+Currently, when a user creates a datamap, the system stores the datamap metadata in a configurable *system* folder in HDFS or S3.
+
+This *system* folder contains:
+
+- DataMapSchema file. It is a JSON file containing the schema for one datamap. See the DataMapSchema class. If a user creates 100 datamaps (on different tables), there will be 100 such files in the *system* folder.
+- DataMapStatus file. There is only one such file; it is in JSON format, and each entry in the file represents one datamap. See the DataMapStatusDetail class.
+
+There is a DataMapCatalog interface to retrieve the schemas of all datamaps; it can be used in the optimizer to get the metadata of datamaps.
+
+
+
+## DataMap Related Commands
+
+### Explain
+
+How can a user know whether a datamap is used in a query?
+
+The user can use the EXPLAIN command to find out; it will print out something like:
+
+```text
+== CarbonData Profiler ==
+Hit mv DataMap: datamap1
+Scan Table: default.datamap1_table
++- filter:
++- pruning by CG DataMap
++- all blocklets: 1
+   skipped blocklets: 0
+```
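
For instance, prefixing an ordinary query with EXPLAIN (the table and column names here are hypothetical) would produce profiler output like the block above when a datamap is hit:

```
  EXPLAIN SELECT city, sum(amount)
  FROM sales_main
  GROUP BY city;
```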
+
+### Show DataMap
+
+There is a SHOW DATAMAPS command; when it is issued, the system reads all datamaps from the *system* folder and prints their information on screen (a usage sketch follows the list below). The current information includes:
+
+- DataMapName
+- DataMapProviderName, like mv, preaggregate, timeseries, etc.
+- Associated Table
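
A usage sketch, assuming the command variant that is scoped to a main table (the table name is hypothetical, and the exact keyword form, SHOW DATAMAP vs. SHOW DATAMAPS, should be checked against the release in use):

```
  SHOW DATAMAP ON TABLE sales_main;
```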
+
+### Compaction on DataMap
+
+This feature applies to the preaggregate datamap only.
+
+Running the Compaction command (`ALTER TABLE COMPACT`) on the main table will **not automatically** compact the pre-aggregate tables created on the main table. The user needs to run the Compaction command separately on each pre-aggregate table to compact them, as sketched below.
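
A sketch with hypothetical names, assuming a main table sales_main with a preaggregate datamap agg_sales whose backing child table is assumed here to be named sales_main_agg_sales (verify the actual child-table name in your deployment):

```
  -- compacting the main table does not compact its pre-aggregate tables
  ALTER TABLE sales_main COMPACT 'minor';
  -- compact the assumed pre-aggregate child table explicitly
  ALTER TABLE sales_main_agg_sales COMPACT 'minor';
```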
+
+Compaction is an optional operation for pre-aggregate tables. If compaction is performed on the main table but not on a pre-aggregate table, all queries can still benefit from the pre-aggregate tables. To further improve query performance, compaction on pre-aggregate tables can be triggered to merge the segments and files in the pre-aggregate tables.
