[CARBONDATA-2966]Update Documentation For Avro DataType conversion

Updated the documentation for the following features:
1. Avro DataType conversion to carbon
2. Remove min, max for varchar columns
3. LRU enhancements for driver cache

This closes #2756


Project: http://git-wip-us.apache.org/repos/asf/carbondata/repo
Commit: http://git-wip-us.apache.org/repos/asf/carbondata/commit/b3a5e3a8
Tree: http://git-wip-us.apache.org/repos/asf/carbondata/tree/b3a5e3a8
Diff: http://git-wip-us.apache.org/repos/asf/carbondata/diff/b3a5e3a8

Branch: refs/heads/branch-1.5
Commit: b3a5e3a8bb4b051779f91bca071336703742296c
Parents: d84cd81
Author: Indhumathi27 <indhumathi...@gmail.com>
Authored: Mon Sep 24 23:34:04 2018 +0530
Committer: kunal642 <kunalkapoor...@gmail.com>
Committed: Wed Sep 26 16:16:02 2018 +0530

----------------------------------------------------------------------
 docs/configuration-parameters.md           |  6 ++-
 docs/faq.md                                | 16 +++++++
 docs/sdk-guide.md                          | 55 +++++++++++++++++--------
 docs/supported-data-types-in-carbondata.md |  1 +
 4 files changed, 58 insertions(+), 20 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/carbondata/blob/b3a5e3a8/docs/configuration-parameters.md
----------------------------------------------------------------------
diff --git a/docs/configuration-parameters.md b/docs/configuration-parameters.md
index c6b0fcb..7edae47 100644
--- a/docs/configuration-parameters.md
+++ b/docs/configuration-parameters.md
@@ -42,6 +42,7 @@ This section provides the details of all the configurations required for the Car
 | carbon.lock.type | LOCALLOCK | This configuration specifies the type of lock to be acquired during concurrent operations on the table. The following types of lock implementation are supported: - LOCALLOCK: Lock is created on the local file system as a file. This lock is useful when only one spark driver (thrift server) runs on a machine and no other CarbonData spark application is launched concurrently. - HDFSLOCK: Lock is created on the HDFS file system as a file. This lock is useful when multiple CarbonData spark applications are launched, no ZooKeeper is running on the cluster, and HDFS supports file based locking. |
 | carbon.lock.path | TABLEPATH | This configuration specifies the path where lock files have to be created. It is recommended to configure a ZooKeeper lock type or to configure an HDFS lock path (in this property) in case of an S3 file system, as locking is not feasible on S3. |
 | carbon.unsafe.working.memory.in.mb | 512 | CarbonData supports storing data in off-heap memory for certain operations during data loading and query. This helps to avoid Java GC and thereby improves the overall performance. The minimum value recommended is 512MB. Any value below this is reset to the default value of 512MB. **NOTE:** The below formulas explain how to arrive at the off-heap size required. <u>Memory Required For Data Loading:</u> (*carbon.number.of.cores.while.loading*) * (Number of tables to load in parallel) * (*offheap.sort.chunk.size.inmb* + *carbon.blockletgroup.size.in.mb* + *carbon.blockletgroup.size.in.mb*/3.5). <u>Memory required for Query:</u> SPARK_EXECUTOR_INSTANCES * (*carbon.blockletgroup.size.in.mb* + *carbon.blockletgroup.size.in.mb* * 3.5) * spark.executor.cores |
+| carbon.unsafe.driver.working.memory.in.mb | 60% of JVM Heap Memory | CarbonData supports storing data in unsafe on-heap memory in the driver for certain operations like insert into and query, for loading the datamap cache. The minimum value recommended is 512MB. |
 | carbon.update.sync.folder | /tmp/carbondata | CarbonData maintains last modification time entries in modifiedTime.mdt to determine the schema changes and reload only when necessary. This configuration specifies the path where the file needs to be written. |
 | carbon.invisible.segments.preserve.count | 200 | CarbonData maintains each data load entry in the tablestatus file. The entries from this file are not deleted for those segments that are compacted or dropped, but are made invisible. If the number of data loads is very high, the size and number of entries in the tablestatus file can become too large, causing unnecessary reading of all data. This configuration specifies the number of segment entries to be maintained after they are compacted or dropped. Beyond this, the entries are moved to a separate history tablestatus file. **NOTE:** The entries in the tablestatus file help to identify the operations performed on a CarbonData table and are also used for checkpointing during various data manipulation operations. This is similar to an AUDIT file maintaining all the operations and their status. Hence the entries are never deleted but moved to a separate history file. |
 | carbon.lock.retries | 3 | CarbonData ensures consistency of operations by blocking certain operations from running in parallel. In order to block the operations from running in parallel, a lock is obtained on the table. This configuration specifies the maximum number of retries to obtain the lock for any operation other than load. **NOTE:** Data manipulation operations like Compaction, UPDATE, DELETE or LOADING, UPDATE, DELETE are not allowed to run in parallel. However, data loading can happen in parallel to compaction. |
@@ -92,7 +93,8 @@ This section provides the details of all the configurations required for the Car
 | carbon.load.directWriteHdfs.enabled | false | During data load, all the carbondata files are written to the local disk and finally copied to the target location in HDFS. Enabling this parameter will make carbondata files be written directly onto the target HDFS location, bypassing the local disk. **NOTE:** Writing directly to HDFS saves local disk IO (once for writing the files and again for copying to HDFS), thereby improving the performance. But the drawback is that when data loading fails or the application crashes, unwanted carbondata files will remain in the target HDFS location until they are cleared during the next data load or by running the *CLEAN FILES* DDL command. |
 | carbon.options.serialization.null.format | \N | Based on the business scenarios, some columns might need to be loaded with null values. As a null value cannot be written in csv files, some special characters might be adopted to specify null values. This configuration can be used to specify the null value format in the data being loaded. |
 | carbon.sort.storage.inmemory.size.inmb | 512 | CarbonData writes every ***carbon.sort.size*** number of records to intermediate temp files during data loading to ensure the memory footprint is within limits. When the ***enable.unsafe.sort*** configuration is enabled, instead of using ***carbon.sort.size***, which is based on row count, the size occupied in memory is used to determine when to flush data pages to intermediate temp files. This configuration determines the memory to be used for storing data pages in memory. **NOTE:** Configuring a higher value ensures more data is maintained in memory and hence increases data loading performance due to reduced or no IO. Based on the memory availability in the nodes of the cluster, configure the value accordingly. |
-| carbon.column.compressor | snappy | CarbonData will compress the column values using the compressor specified by this configuration. Currently CarbonData supports 'snappy' and 'zstd' compressors. | |
+| carbon.column.compressor | snappy | CarbonData will compress the column values using the compressor specified by this configuration. Currently CarbonData supports 'snappy' and 'zstd' compressors. |
+| carbon.minmax.allowed.byte.count | 200 | CarbonData will write the min/max values for string/varchar type columns using the byte count specified by this configuration. The maximum value is 1000 bytes (500 characters) and the minimum value is 10 bytes (5 characters). **NOTE:** This property is useful for reducing the store size, thereby improving the query performance, but can lead to query degradation if the value is not configured properly. |
 
 ## Compaction Configuration
 
@@ -117,7 +119,7 @@ This section provides the details of all the configurations required for the Car
 
 | Parameter | Default Value | Description |
 |--------------------------------------|---------------|---------------------------------------------------|
-| carbon.max.driver.lru.cache.size | -1 | Maximum memory **(in MB)** upto which the driver process can cache the data (BTree and dictionary values). Beyond this, least recently used data will be removed from cache before loading new set of values.Default value of -1 means there is no memory limit for caching. Only integer values greater than 0 are accepted.**NOTE:** Minimum number of entries that needs to be removed from cache in order to load the new set of data is determined and unloaded.ie.,for example if 3 cache entries qualify for pre-emption, out of these, those entries that free up more cache memory is removed prior to others. |
+| carbon.max.driver.lru.cache.size | -1 | Maximum memory **(in MB)** up to which the driver process can cache the data (BTree and dictionary values). Beyond this, least recently used data will be removed from the cache before loading a new set of values. The default value of -1 means there is no memory limit for caching. Only integer values greater than 0 are accepted. **NOTE:** The minimum number of entries that needs to be removed from the cache in order to load the new set of data is determined and unloaded, i.e., for example, if 3 cache entries qualify for pre-emption, out of these, those entries that free up more cache memory are removed prior to others. Please refer to [FAQs](./faq.md#how-to-check-lru-cache-memory-footprint) for checking the LRU cache memory footprint. |
 | carbon.max.executor.lru.cache.size | -1 | Maximum memory **(in MB)** up to which the executor process can cache the data (BTree and reverse dictionary values). The default value of -1 means there is no memory limit for caching. Only integer values greater than 0 are accepted. **NOTE:** If this parameter is not configured, then the value of ***carbon.max.driver.lru.cache.size*** will be used. |
 | max.query.execution.time | 60 | Maximum time allowed for one query to be executed. The value is in minutes. |
 | carbon.enableMinMax | true | CarbonData maintains metadata which enables pruning of unnecessary files from being scanned as per the query conditions. To achieve pruning, the min and max of each column are maintained. Based on the filter condition in the query, certain data can be skipped from scanning by matching the filter value against the min and max values of the column(s) present in that carbondata file. This pruning enhances query performance significantly. |
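
The new driver working memory, min/max byte count and LRU cache properties above can also be set programmatically. Below is a minimal sketch (not part of this patch) using CarbonProperties; the values shown are illustrative, not recommendations:

```java
import org.apache.carbondata.core.util.CarbonProperties;

public class CarbonConfigSketch {
  public static void main(String[] args) {
    CarbonProperties props = CarbonProperties.getInstance();
    // Illustrative values only; tune them for your cluster.
    props.addProperty("carbon.unsafe.driver.working.memory.in.mb", "512");
    props.addProperty("carbon.minmax.allowed.byte.count", "200");
    props.addProperty("carbon.max.driver.lru.cache.size", "1024");
    // Worked example of the data-loading off-heap formula documented above,
    // assuming carbon.number.of.cores.while.loading = 2, one table loaded in
    // parallel, offheap.sort.chunk.size.inmb = 64 and
    // carbon.blockletgroup.size.in.mb = 64:
    //   2 * 1 * (64 + 64 + 64 / 3.5) = 292.57..., i.e. roughly 293 MB.
    props.addProperty("carbon.unsafe.working.memory.in.mb", "512");
  }
}
```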

http://git-wip-us.apache.org/repos/asf/carbondata/blob/b3a5e3a8/docs/faq.md
----------------------------------------------------------------------
diff --git a/docs/faq.md b/docs/faq.md
index 8ec7290..dbf9155 100644
--- a/docs/faq.md
+++ b/docs/faq.md
@@ -28,6 +28,7 @@
 * [Why aggregate query is not fetching data from aggregate table?](#why-aggregate-query-is-not-fetching-data-from-aggregate-table)
 * [Why all executors are showing success in Spark UI even after Dataload command failed at Driver side?](#why-all-executors-are-showing-success-in-spark-ui-even-after-dataload-command-failed-at-driver-side)
 * [Why different time zone result for select query output when query SDK writer output?](#why-different-time-zone-result-for-select-query-output-when-query-sdk-writer-output)
+* [How to check LRU cache memory footprint?](#how-to-check-lru-cache-memory-footprint)
 
 # TroubleShooting
 
@@ -212,7 +213,22 @@ cluster timezone is Asia/Shanghai
 TimeZone.setDefault(TimeZone.getTimeZone("Asia/Shanghai"))
 ```
 
+## How to check LRU cache memory footprint?
+To observe the LRU cache memory footprint in the logs, configure the below properties in the log4j.properties file.
+```
+log4j.logger.org.apache.carbondata.core.memory.UnsafeMemoryManager = DEBUG
+log4j.logger.org.apache.carbondata.core.cache.CarbonLRUCache = DEBUG
+```
+These properties will enable the DEBUG log for CarbonLRUCache and UnsafeMemoryManager, which will print information about the memory consumed; this can be used to decide the LRU cache size. **Note:** Enabling the DEBUG log will degrade query performance.
 
+**Example:**
+```
+18/09/26 15:05:28 DEBUG UnsafeMemoryManager: pool-44-thread-1 Memory block (org.apache.carbondata.core.memory.MemoryBlock@21312095) is created with size 10. Total memory used 413Bytes, left 536870499Bytes
+18/09/26 15:05:29 DEBUG CarbonLRUCache: main Required size for entry /home/target/store/default/stored_as_carbondata_table/Fact/Part0/Segment_0/0_1537954529044.carbonindexmerge :: 181 Current cache size :: 0
+18/09/26 15:05:30 DEBUG UnsafeMemoryManager: main Freeing memory of size: 105available memory:  536870836
+18/09/26 15:05:30 DEBUG UnsafeMemoryManager: main Freeing memory of size: 76available memory:  536870912
+18/09/26 15:05:30 INFO CarbonLRUCache: main Removed entry from InMemory lru cache :: /home/target/store/default/stored_as_carbondata_table/Fact/Part0/Segment_0/0_1537954529044.carbonindexmerge
+```
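
As an alternative to editing log4j.properties, the same two loggers can be raised to DEBUG programmatically. A minimal sketch, assuming log4j 1.x (the logging backend bundled with Spark at the time) is on the classpath:

```java
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class EnableLruDebugLogs {
  public static void main(String[] args) {
    // Same effect as the log4j.properties entries shown above.
    Logger.getLogger("org.apache.carbondata.core.memory.UnsafeMemoryManager")
        .setLevel(Level.DEBUG);
    Logger.getLogger("org.apache.carbondata.core.cache.CarbonLRUCache")
        .setLevel(Level.DEBUG);
    // Revert to INFO afterwards, since DEBUG logging degrades query performance.
  }
}
```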
 
 ## Getting tablestatus.lock issues When loading data
 

http://git-wip-us.apache.org/repos/asf/carbondata/blob/b3a5e3a8/docs/sdk-guide.md
----------------------------------------------------------------------
diff --git a/docs/sdk-guide.md b/docs/sdk-guide.md
index d1e4bc5..be42b3f 100644
--- a/docs/sdk-guide.md
+++ b/docs/sdk-guide.md
@@ -181,22 +181,31 @@ public class TestSdkJson {
 ```
 
 ## Datatypes Mapping
-Each of SQL data types are mapped into data types of SDK. Following are the mapping:
-
-| SQL DataTypes | Mapped SDK DataTypes |
-|---------------|----------------------|
-| BOOLEAN | DataTypes.BOOLEAN |
-| SMALLINT | DataTypes.SHORT |
-| INTEGER | DataTypes.INT |
-| BIGINT | DataTypes.LONG |
-| DOUBLE | DataTypes.DOUBLE |
-| VARCHAR | DataTypes.STRING |
-| FLOAT | DataTypes.FLOAT |
-| BYTE | DataTypes.BYTE |
-| DATE | DataTypes.DATE |
-| TIMESTAMP | DataTypes.TIMESTAMP |
-| STRING | DataTypes.STRING |
-| DECIMAL | DataTypes.createDecimalType(precision, scale) |
+Each of the SQL data types and Avro data types is mapped to a data type of the SDK. The mapping is as follows:
+
+| SQL DataTypes | Avro DataTypes | Mapped SDK DataTypes |
+|---------------|----------------|----------------------|
+| BOOLEAN | BOOLEAN | DataTypes.BOOLEAN |
+| SMALLINT |  -  | DataTypes.SHORT |
+| INTEGER | INTEGER | DataTypes.INT |
+| BIGINT | LONG | DataTypes.LONG |
+| DOUBLE | DOUBLE | DataTypes.DOUBLE |
+| VARCHAR |  -  | DataTypes.STRING |
+| FLOAT | FLOAT | DataTypes.FLOAT |
+| BYTE |  -  | DataTypes.BYTE |
+| DATE | DATE | DataTypes.DATE |
+| TIMESTAMP |  -  | DataTypes.TIMESTAMP |
+| STRING | STRING | DataTypes.STRING |
+| DECIMAL | DECIMAL | DataTypes.createDecimalType(precision, scale) |
+| ARRAY | ARRAY | DataTypes.createArrayType(elementType) |
+| STRUCT | RECORD | DataTypes.createStructType(fields) |
+|  -  | ENUM | DataTypes.STRING |
+|  -  | UNION | DataTypes.createStructType(types) |
+|  -  | MAP | DataTypes.createMapType(keyType, valueType) |
+|  -  | TimeMillis | DataTypes.INT |
+|  -  | TimeMicros | DataTypes.LONG |
+|  -  | TimestampMillis | DataTypes.TIMESTAMP |
+|  -  | TimestampMicros | DataTypes.TIMESTAMP |
 
 **NOTE:**
 1. Carbon supports the below logical types of Avro.
@@ -209,12 +218,22 @@ Each of SQL data types are mapped into data types of SDK. Following are the mapp
  c. Timestamp (microsecond precision)
    The timestamp-micros logical type represents an instant on the global timeline, independent of a particular time zone or calendar, with a precision of one microsecond.
    A timestamp-micros logical type annotates an Avro long, where the long stores the number of microseconds from the unix epoch, 1 January 1970 00:00:00.000000 UTC.
+ d. Decimal
+    The decimal logical type represents an arbitrary-precision signed decimal number of the form unscaled × 10^(-scale).
+    A decimal logical type annotates Avro bytes or fixed types. The byte array must contain the two's-complement representation of the unscaled integer value in big-endian byte order. The scale is fixed, and is specified using an attribute.
+ e. Time (millisecond precision)
+    The time-millis logical type represents a time of day, with no reference to a particular calendar, time zone or date, with a precision of one millisecond.
+    A time-millis logical type annotates an Avro int, where the int stores the number of milliseconds after midnight, 00:00:00.000.
+ f. Time (microsecond precision)
+    The time-micros logical type represents a time of day, with no reference to a particular calendar, time zone or date, with a precision of one microsecond.
+    A time-micros logical type annotates an Avro long, where the long stores the number of microseconds after midnight, 00:00:00.000000.
+
     
    Currently the values of logical types are not validated by carbon.
    It is expected that the Avro record passed by the user is already validated by Avro record generator tools.
 2. If the string data is more than 32K in length, use withTableProperties() with "long_string_columns" property
-    or directly use DataTypes.VARCHAR if it is carbon schema.      
-
+    or directly use DataTypes.VARCHAR if it is carbon schema.
+ 3. Avro Bytes, Fixed and Duration data types are not yet supported.
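
To illustrate the mapping table above, here is a minimal sketch of writing one Avro record through the SDK. It assumes the CarbonWriter builder with withAvroInput() described earlier in this guide; the schema, field names and output path are illustrative:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.carbondata.sdk.file.CarbonWriter;

public class AvroWriterSketch {
  public static void main(String[] args) throws Exception {
    // Avro LONG maps to DataTypes.LONG and Avro STRING maps to
    // DataTypes.STRING, as per the mapping table above.
    String avroSchema = "{\"type\":\"record\",\"name\":\"sample\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"name\",\"type\":\"string\"}]}";
    Schema schema = new Schema.Parser().parse(avroSchema);

    GenericData.Record record = new GenericData.Record(schema);
    record.put("id", 1L);
    record.put("name", "carbon");

    // "/tmp/avro_carbon" is an illustrative output path.
    CarbonWriter writer = CarbonWriter.builder()
        .outputPath("/tmp/avro_carbon")
        .withAvroInput(schema)
        .build();
    writer.write(record);
    writer.close();
  }
}
```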
 ## Run SQL on files directly
 Instead of creating a table and querying it, you can also query that file directly with SQL.
 

http://git-wip-us.apache.org/repos/asf/carbondata/blob/b3a5e3a8/docs/supported-data-types-in-carbondata.md
----------------------------------------------------------------------
diff --git a/docs/supported-data-types-in-carbondata.md b/docs/supported-data-types-in-carbondata.md
index fd13079..daf1acf 100644
--- a/docs/supported-data-types-in-carbondata.md
+++ b/docs/supported-data-types-in-carbondata.md
@@ -45,6 +45,7 @@
   * Complex Types
     * arrays: ARRAY``<data_type>``
     * structs: STRUCT``<col_name : data_type COMMENT col_comment, ...>``
+    * maps: MAP``<primitive_type, data_type>``
     
     **NOTE**: Only a 2-level complex type schema is supported for now.
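
For the newly documented MAP type, a map column can be declared through the SDK using the DataTypes.createMapType(keyType, valueType) factory from the sdk-guide mapping table. A minimal sketch; the column name is illustrative:

```java
import org.apache.carbondata.core.metadata.datatype.DataTypes;
import org.apache.carbondata.sdk.file.Field;

public class MapFieldSketch {
  public static void main(String[] args) {
    // Declares a MAP<STRING, INT> column named "props".
    Field mapField = new Field("props",
        DataTypes.createMapType(DataTypes.STRING, DataTypes.INT));
    System.out.println(mapField.getDataType());
  }
}
```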
 
