This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
     new 0ac5f3f  [DOC] Doc changes for release 0.6.0 (#2011)
0ac5f3f is described below

commit 0ac5f3f4e20cee484412ed89e6631b2171196f0c
Author: Bhavani Sudha Saktheeswaran <bhavanisud...@gmail.com>
AuthorDate: Mon Aug 24 11:05:13 2020 -0700

    [DOC] Doc changes for release 0.6.0 (#2011)

    * [DOC] Change instructions and queries supported by PrestoDB

    * Adding video and blog from 'PrestoDB and Apache Hudi' talk on Presto Meetup

    * Config page changes

    - Add doc for using jdbc during hive sync
    - Fix index types to include all available indexes
    - Fix default val for hoodie.copyonwrite.insert.auto.split
    - Add doc for user defined bulk insert partitioner class
    - Add simple index configs
    - Reorder all index configs to be grouped together
    - Add docs for auto cleaning and async cleaning
    - Add docs for rollback parallelism and marker based rollback
    - Add doc for bulk-insert sort modes
    - Add doc for markers delete parallelism

    * CR feedback

    Co-authored-by: Vinoth Chandar <vin...@apache.org>
---
 docs/_docs/1_2_structure.md        |  2 +-
 docs/_docs/1_4_powered_by.md       |  3 ++
 docs/_docs/1_5_comparison.md       |  4 +-
 docs/_docs/2_3_querying_data.cn.md |  8 ++--
 docs/_docs/2_3_querying_data.md    | 21 ++++++---
 docs/_docs/2_4_configurations.md   | 88 ++++++++++++++++++++++++++++++++++----
 docs/_docs/2_6_deployment.md       |  6 +--
 7 files changed, 107 insertions(+), 25 deletions(-)

diff --git a/docs/_docs/1_2_structure.md b/docs/_docs/1_2_structure.md
index ddcdb1a..1c59960 100644
--- a/docs/_docs/1_2_structure.md
+++ b/docs/_docs/1_2_structure.md
@@ -16,6 +16,6 @@ Hudi (pronounced “Hoodie”) ingests & manages storage of large analytical tab
 <img class="docimage" src="/assets/images/hudi_intro_1.png" alt="hudi_intro_1.png" />
 </figure>

-By carefully managing how data is laid out in storage & how it’s exposed to queries, Hudi is able to power a rich data ecosystem where external sources can be ingested in near real-time and made available for interactive SQL Engines like [Presto](https://prestodb.io) & [Spark](https://spark.apache.org/sql/), while at the same time capable of being consumed incrementally from processing/ETL frameworks like [Hive](https://hive.apache.org/) & [Spark](https://spark.apache.org/docs/latest/) t [...]
+By carefully managing how data is laid out in storage & how it’s exposed to queries, Hudi is able to power a rich data ecosystem where external sources can be ingested in near real-time and made available for interactive SQL Engines like [PrestoDB](https://prestodb.io) & [Spark](https://spark.apache.org/sql/), while at the same time capable of being consumed incrementally from processing/ETL frameworks like [Hive](https://hive.apache.org/) & [Spark](https://spark.apache.org/docs/latest/) [...]

 Hudi broadly consists of a self contained Spark library to build tables and integrations with existing query engines for data access. See [quickstart](/docs/quick-start-guide) for a demo.

diff --git a/docs/_docs/1_4_powered_by.md b/docs/_docs/1_4_powered_by.md
index a731979..8e093a4 100644
--- a/docs/_docs/1_4_powered_by.md
+++ b/docs/_docs/1_4_powered_by.md
@@ -113,6 +113,8 @@ Using Hudi at Yotpo for several usages. Firstly, integrated Hudi as a writer in

 14. ["Apache Hudi - Design/Code Walkthrough Session for Contributors"](https://www.youtube.com/watch?v=N2eDfU_rQ_U) - By Vinoth Chandar, July 2020, Hudi community.

["PrestoDB and Apache Hudi"](https://youtu.be/nA3rwOdmm3A) - By Bhavani Sudha Saktheeswaran and Brandon Scheller, Aug 2020, PrestoDB Community Meetup. + ## Articles 1. ["The Case for incremental processing on Hadoop"](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop) - O'reilly Ideas article by Vinoth Chandar @@ -122,6 +124,7 @@ Using Hudi at Yotpo for several usages. Firstly, integrated Hudi as a writer in 5. ["Apache Hudi grows cloud data lake maturity"](https://searchdatamanagement.techtarget.com/news/252484740/Apache-Hudi-grows-cloud-data-lake-maturity) 6. ["Building a Large-scale Transactional Data Lake at Uber Using Apache Hudi"](https://eng.uber.com/apache-hudi-graduation/) - Uber eng blog by Nishith Agarwal 7. ["Hudi On Hops"](https://www.diva-portal.org/smash/get/diva2:1413103/FULLTEXT01.pdf) - By NETSANET GEBRETSADKAN KIDANE +8. ["PrestoDB and Apachi Hudi](https://prestodb.io/blog/2020/08/04/prestodb-and-hudi) - PrestoDB - Hudi integration blog by Bhavani Sudha Saktheeswaran and Brandon Scheller ## Powered by diff --git a/docs/_docs/1_5_comparison.md b/docs/_docs/1_5_comparison.md index 32b73c6..41131a8 100644 --- a/docs/_docs/1_5_comparison.md +++ b/docs/_docs/1_5_comparison.md @@ -31,7 +31,7 @@ we expect Hudi to positioned at something that ingests parquet with superior per Hive transactions does not offer the read-optimized storage option or the incremental pulling, that Hudi does. In terms of implementation choices, Hudi leverages the full power of a processing framework like Spark, while Hive transactions feature is implemented underneath by Hive tasks/queries kicked off by user or the Hive metastore. Based on our production experience, embedding Hudi as a library into existing Spark pipelines was much easier and less operationally heavy, compared with the other approach. -Hudi is also designed to work with non-hive enginers like Presto/Spark and will incorporate file formats other than parquet over time. +Hudi is also designed to work with non-hive engines like PrestoDB/Spark and will incorporate file formats other than parquet over time. ## HBase @@ -49,7 +49,7 @@ integration of Hudi library with Spark/Spark streaming DAGs. In case of Non-Spar and later sent into a Hudi table via a Kafka topic/DFS intermediate file. In more conceptual level, data processing pipelines just consist of three components : `source`, `processing`, `sink`, with users ultimately running queries against the sink to use the results of the pipeline. Hudi can act as either a source or sink, that stores data on DFS. Applicability of Hudi to a given stream processing pipeline ultimately boils down to suitability -of Presto/SparkSQL/Hive for your queries. +of PrestoDB/SparkSQL/Hive for your queries. More advanced use cases revolve around the concepts of [incremental processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop), which effectively uses Hudi even inside the `processing` engine to speed up typical batch pipelines. 
diff --git a/docs/_docs/2_3_querying_data.cn.md b/docs/_docs/2_3_querying_data.cn.md
index c72c2b7..5332790 100644
--- a/docs/_docs/2_3_querying_data.cn.md
+++ b/docs/_docs/2_3_querying_data.cn.md
@@ -36,7 +36,7 @@ language: cn
 |**Hive**|Y|Y|
 |**Spark SQL**|Y|Y|
 |**Spark Datasource**|Y|Y|
-|**Presto**|Y|N|
+|**PrestoDB**|Y|N|
 |**Impala**|Y|N|

@@ -47,7 +47,7 @@ language: cn
 |**Hive**|Y|Y|Y|
 |**Spark SQL**|Y|Y|Y|
 |**Spark Datasource**|Y|N|Y|
-|**Presto**|N|N|Y|
+|**PrestoDB**|Y|N|Y|
 |**Impala**|N|N|Y|

@@ -187,9 +187,9 @@ Dataset<Row> hoodieRealtimeViewDF = spark.read().format("org.apache.hudi")
 | checkExists(keys) | 检查提供的键是否存在于Hudi数据集中 |

-## Presto
+## PrestoDB

-Presto是一种常用的查询引擎,可提供交互式查询性能。 Hudi RO表可以在Presto中无缝查询。
+PrestoDB是一种常用的查询引擎,可提供交互式查询性能。 Hudi RO表可以在PrestoDB中无缝查询。
 这需要在整个安装过程中将`hudi-presto-bundle` jar放入`<presto_install>/plugin/hive-hadoop2/`中。

 ## Impala (3.4 or later)

diff --git a/docs/_docs/2_3_querying_data.md b/docs/_docs/2_3_querying_data.md
index 33a5c13..0af3418 100644
--- a/docs/_docs/2_3_querying_data.md
+++ b/docs/_docs/2_3_querying_data.md
@@ -9,7 +9,7 @@ last_modified_at: 2019-12-30T15:59:57-04:00
 Conceptually, Hudi stores data physically once on DFS, while providing 3 different ways of querying, as explained
 [before](/docs/concepts.html#query-types). Once the table is synced to the Hive metastore, it provides external Hive tables backed by Hudi's custom inputformats. Once the proper hudi
-bundle has been installed, the table can be queried by popular query engines like Hive, Spark SQL, Spark Datasource API and Presto.
+bundle has been installed, the table can be queried by popular query engines like Hive, Spark SQL, Spark Datasource API and PrestoDB.

 Specifically, following Hive tables are registered based off [table name](/docs/configurations.html#TABLE_NAME_OPT_KEY)
 and [table type](/docs/configurations.html#TABLE_TYPE_OPT_KEY) configs passed during write.
@@ -40,7 +40,7 @@ Following tables show whether a given query is supported on specific query engin
 |**Hive**|Y|Y|
 |**Spark SQL**|Y|Y|
 |**Spark Datasource**|Y|Y|
-|**Presto**|Y|N|
+|**PrestoDB**|Y|N|
 |**Impala**|Y|N|

@@ -53,7 +53,7 @@ Note that `Read Optimized` queries are not applicable for COPY_ON_WRITE tables.
 |**Hive**|Y|Y|Y|
 |**Spark SQL**|Y|Y|Y|
 |**Spark Datasource**|Y|N|Y|
-|**Presto**|N|N|Y|
+|**PrestoDB**|Y|N|Y|
 |**Impala**|N|N|Y|

@@ -176,10 +176,19 @@ Additionally, `HoodieReadClient` offers the following functionality using Hudi's
 | filterExists() | Filter out already existing records from the provided `RDD[HoodieRecord]`. Useful for de-duplication |
 | checkExists(keys) | Check if the provided keys exist in a Hudi table |
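Before the PrestoDB section below, a minimal snapshot-query sketch for the Spark Datasource rows in the tables above; the table path and partition-glob depth are hypothetical, and a `spark` session (e.g. spark-shell) is assumed:

```scala
// Minimal sketch of a snapshot query through the Spark Datasource
// (hypothetical base path; use one glob level per partition column).
val snapshotDF = spark.read
  .format("org.apache.hudi")
  .load("/data/hudi_trips/*/*/*")
snapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
spark.sql("select count(*) from hudi_trips_snapshot").show()
```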
-## Presto
+## PrestoDB

-Presto is a popular query engine, providing interactive query performance. Presto currently supports snapshot queries on COPY_ON_WRITE and read optimized queries
-on MERGE_ON_READ Hudi tables. This requires the `hudi-presto-bundle` jar to be placed into `<presto_install>/plugin/hive-hadoop2/`, across the installation.
+PrestoDB is a popular query engine, providing interactive query performance. PrestoDB currently supports snapshot querying on COPY_ON_WRITE tables.
+Both snapshot and read optimized queries are supported on MERGE_ON_READ Hudi tables. Since the PrestoDB-Hudi integration has evolved over time, the
+installation instructions vary with the PrestoDB version. The table below lists the query types supported and the installation steps
+for each version of PrestoDB.
+
+| **PrestoDB Version** | **Installation description** | **Query types supported** |
+|----------------------|------------------------------|---------------------------|
+| < 0.233 | Requires the `hudi-presto-bundle` jar to be placed into `<presto_install>/plugin/hive-hadoop2/`, across the installation. | Snapshot querying on COW tables. Read optimized querying on MOR tables. |
+| >= 0.233 | No action needed. Hudi (0.5.1-incubating) is a compile time dependency. | Snapshot querying on COW tables. Read optimized querying on MOR tables. |
+| >= 0.240 | No action needed. Hudi 0.5.3 version is a compile time dependency. | Snapshot querying on both COW and MOR tables. |

 ## Impala (3.4 or later)

diff --git a/docs/_docs/2_4_configurations.md b/docs/_docs/2_4_configurations.md
index 5536ac0..aa472dd 100644
--- a/docs/_docs/2_4_configurations.md
+++ b/docs/_docs/2_4_configurations.md
@@ -128,6 +128,11 @@ This is useful to store checkpointing information, in a consistent way with the
 #### HIVE_ASSUME_DATE_PARTITION_OPT_KEY {#HIVE_ASSUME_DATE_PARTITION_OPT_KEY}
 Property: `hoodie.datasource.hive_sync.assume_date_partitioning`, Default: `false` <br/>
 <span style="color:grey">Assume partitioning is yyyy/mm/dd</span>
+
+#### HIVE_USE_JDBC_OPT_KEY {#HIVE_USE_JDBC_OPT_KEY}
+Property: `hoodie.datasource.hive_sync.use_jdbc`, Default: `true` <br/>
+<span style="color:grey">Use JDBC when hive synchronization is enabled</span>
+

 ### Read Options

@@ -187,6 +192,18 @@ Property: `hoodie.table.name` [Required] <br/>
 Property: `hoodie.bulkinsert.shuffle.parallelism`<br/>
 <span style="color:grey">Bulk insert is meant to be used for large initial imports and this parallelism determines the initial number of files in your table. Tune this to achieve a desired optimal size during initial import.</span>

+#### withUserDefinedBulkInsertPartitionerClass(className = x.y.z.UserDefinedPartitionerClass) {#withUserDefinedBulkInsertPartitionerClass}
+Property: `hoodie.bulkinsert.user.defined.partitioner.class`<br/>
+<span style="color:grey">If specified, this class will be used to re-partition input records before they are inserted.</span>
+
+#### withBulkInsertSortMode(mode = BulkInsertSortMode.GLOBAL_SORT) {#withBulkInsertSortMode}
+Property: `hoodie.bulkinsert.sort.mode`<br/>
+<span style="color:grey">Sorting mode to use for sorting records on bulk insert. This is leveraged when a user-defined partitioner is not configured. Default is GLOBAL_SORT.
+Available values are - **GLOBAL_SORT**: ensures the best file sizes, with the lowest memory overhead, at the cost of a global sort.
+**PARTITION_SORT**: strikes a balance by only sorting within a partition, still keeping the memory overhead of writing low, with best-effort file sizing.
+**NONE**: no sorting; fastest, and matches `spark.write.parquet()` in terms of number of files and overheads.
+</span>
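Pulling the new bulk-insert knobs together, a minimal Scala sketch; `inputDF`, `basePath` and the key/precombine fields are assumed to exist, and `PARTITION_SORT` is chosen purely to show the option:

```scala
// Bulk-insert sketch using the sort mode documented above (hypothetical names).
inputDF.write.format("org.apache.hudi")
  .option("hoodie.table.name", "imported_table")
  .option("hoodie.datasource.write.operation", "bulk_insert") // large initial import
  .option("hoodie.bulkinsert.shuffle.parallelism", "200")     // initial number of files
  .option("hoodie.bulkinsert.sort.mode", "PARTITION_SORT")    // sort within partitions only
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode("overwrite")
  .save(basePath)
```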
+
 #### withParallelism(insert_shuffle_parallelism = 1500, upsert_shuffle_parallelism = 1500) {#withParallelism}
 Property: `hoodie.insert.shuffle.parallelism`, `hoodie.upsert.shuffle.parallelism`<br/>
 <span style="color:grey">Once data has been initially imported, this parallelism controls initial parallelism for reading input records. Ensure this value is high enough say: 1 partition for 1 GB of input data</span>

@@ -211,10 +228,22 @@ Property: `hoodie.assume.date.partitioning`<br/>
 Property: `hoodie.consistency.check.enabled`<br/>
 <span style="color:grey">Should HoodieWriteClient perform additional checks to ensure written files' are listable on the underlying filesystem/storage. Set this to true, to workaround S3's eventual consistency model and ensure all data written as a part of a commit is faithfully available for queries. </span>

+#### withRollbackParallelism(rollbackParallelism = 100) {#withRollbackParallelism}
+Property: `hoodie.rollback.parallelism`<br/>
+<span style="color:grey">Determines the parallelism for rollback of commits.</span>
+
+#### withRollbackUsingMarkers(rollbackUsingMarkers = false) {#withRollbackUsingMarkers}
+Property: `hoodie.rollback.using.markers`<br/>
+<span style="color:grey">Enables a more efficient mechanism for rollbacks based on the marker files generated during the writes. Turned off by default.</span>
+
+#### withMarkersDeleteParallelism(parallelism = 100) {#withMarkersDeleteParallelism}
+Property: `hoodie.markers.delete.parallelism`<br/>
+<span style="color:grey">Determines the parallelism for deleting marker files.</span>
+
 ### Index configs
 Following configs control indexing behavior, which tags incoming records as either inserts or updates to older records.

-[withIndexConfig](#withIndexConfig) (HoodieIndexConfig) <br/>
+[withIndexConfig](#index-configs) (HoodieIndexConfig) <br/>
 <span style="color:grey">This is pluggable to have a external index (HBase) or use the default bloom filter stored in the Parquet files</span>

 #### withIndexClass(indexClass = "x.y.z.UserDefinedIndex") {#withIndexClass}
 Property: `hoodie.index.class` <br/>

@@ -223,7 +252,9 @@ Property: `hoodie.index.class` <br/>
 #### withIndexType(indexType = BLOOM) {#withIndexType}
 Property: `hoodie.index.type` <br/>
-<span style="color:grey">Type of index to use. Default is Bloom filter. Possible options are [BLOOM | HBASE | INMEMORY]. Bloom filters removes the dependency on a external system and is stored in the footer of the Parquet Data Files</span>
+<span style="color:grey">Type of index to use. Default is Bloom filter. Possible options are [BLOOM | GLOBAL_BLOOM | SIMPLE | GLOBAL_SIMPLE | INMEMORY | HBASE]. Bloom filters remove the dependency on an external system and are stored in the footer of the Parquet data files</span>
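For instance, opting into the SIMPLE index type added in this release is a one-option change; a sketch with hypothetical `upsertDF`, `basePath` and field names:

```scala
// Index-type sketch (hypothetical names); any of the listed types can be substituted.
upsertDF.write.format("org.apache.hudi")
  .option("hoodie.table.name", "my_table")
  .option("hoodie.index.type", "SIMPLE") // or GLOBAL_SIMPLE for uniqueness across partitions
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode("append")
  .save(basePath)
```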
+
+#### Bloom Index configs

 #### bloomFilterNumEntries(numEntries = 60000) {#bloomFilterNumEntries}
 Property: `hoodie.index.bloom.num_entries` <br/>
@@ -233,6 +264,10 @@ Property: `hoodie.index.bloom.num_entries` <br/>
 Property: `hoodie.index.bloom.fpp` <br/>
 <span style="color:grey">Only applies if index type is BLOOM. <br/> Error rate allowed given the number of entries. This is used to calculate how many bits should be assigned for the bloom filter and the number of hash functions. This is usually set very low (default: 0.000000001), we like to tradeoff disk space for lower false positives</span>

+#### bloomIndexParallelism(0) {#bloomIndexParallelism}
+Property: `hoodie.bloom.index.parallelism` <br/>
+<span style="color:grey">Only applies if index type is BLOOM. <br/> This is the amount of parallelism for index lookup, which involves a Spark Shuffle. By default, this is auto computed based on input workload characteristics</span>
+
 #### bloomIndexPruneByRanges(pruneRanges = true) {#bloomIndexPruneByRanges}
 Property: `hoodie.bloom.index.prune.by.ranges` <br/>
 <span style="color:grey">Only applies if index type is BLOOM. <br/> When true, range information from files to leveraged speed up index lookups. Particularly helpful, if the key has a monotonously increasing prefix, such as timestamp.</span>

@@ -249,13 +284,27 @@ Property: `hoodie.bloom.index.use.treebased.filter` <br/>
 Property: `hoodie.bloom.index.bucketized.checking` <br/>
 <span style="color:grey">Only applies if index type is BLOOM. <br/> When true, bucketized bloom filtering is enabled. This reduces skew seen in sort based bloom index lookup</span>

+#### bloomIndexFilterType(filterType = BloomFilterTypeCode.SIMPLE) {#bloomIndexFilterType}
+Property: `hoodie.bloom.index.filter.type` <br/>
+<span style="color:grey">Filter type used. Default is BloomFilterTypeCode.SIMPLE. Available values are [BloomFilterTypeCode.SIMPLE, BloomFilterTypeCode.DYNAMIC_V0]. Dynamic bloom filters auto-size themselves based on the number of keys</span>
+
+#### bloomIndexFilterDynamicMaxEntries(maxNumberOfEntries = 100000) {#bloomIndexFilterDynamicMaxEntries}
+Property: `hoodie.bloom.index.filter.dynamic.max.entries` <br/>
+<span style="color:grey">The threshold for the maximum number of keys to record in a dynamic Bloom filter row. Only applies if filter type is BloomFilterTypeCode.DYNAMIC_V0.</span>
+
 #### bloomIndexKeysPerBucket(keysPerBucket = 10000000) {#bloomIndexKeysPerBucket}
 Property: `hoodie.bloom.index.keys.per.bucket` <br/>
 <span style="color:grey">Only applies if bloomIndexBucketizedChecking is enabled and index type is bloom. <br/> This configuration controls the "bucket" size which tracks the number of record-key checks made against a single file and is the unit of work allocated to each partition performing bloom filter lookup. A higher value would amortize the fixed cost of reading a bloom filter to memory. </span>

-#### bloomIndexParallelism(0) {#bloomIndexParallelism}
-Property: `hoodie.bloom.index.parallelism` <br/>
-<span style="color:grey">Only applies if index type is BLOOM. <br/> This is the amount of parallelism for index lookup, which involves a Spark Shuffle. By default, this is auto computed based on input workload characteristics</span>
+##### withBloomIndexInputStorageLevel(level = MEMORY_AND_DISK_SER) {#withBloomIndexInputStorageLevel}
+Property: `hoodie.bloom.index.input.storage.level` <br/>
+<span style="color:grey">Only applies when [bloomIndexUseCaching](#bloomIndexUseCaching) is set. Determines what level of persistence is used to cache input RDDs.<br/> Refer to org.apache.spark.storage.StorageLevel for different values</span>
+
+##### bloomIndexUpdatePartitionPath(updatePartitionPath = false) {#bloomIndexUpdatePartitionPath}
+Property: `hoodie.bloom.index.update.partition.path` <br/>
+<span style="color:grey">Only applies if index type is GLOBAL_BLOOM. <br/>When set to true, an update including the partition path of a record that already exists will result in inserting the incoming record into the new partition and deleting the original record in the old partition. When set to false, the original record will only be updated in the old partition.</span>
+
+#### HBase Index configs

 #### hbaseZkQuorum(zkString) [Required] {#hbaseZkQuorum}
 Property: `hoodie.index.hbase.zkquorum` <br/>
@@ -273,10 +322,23 @@ Property: `hoodie.index.hbase.zknode.path` <br/>
 Property: `hoodie.index.hbase.table` <br/>
 <span style="color:grey">Only applies if index type is HBASE. HBase Table name to use as the index. Hudi stores the row_key and [partition_path, fileID, commitTime] mapping in the table.</span>
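A hedged sketch wiring the HBase index properties above through datasource options; the ZooKeeper quorum, port, znode path and table name are placeholders, and the index table is assumed to already exist in HBase:

```scala
// HBase index sketch (placeholder connection details and table names).
upsertDF.write.format("org.apache.hudi")
  .option("hoodie.table.name", "my_table")
  .option("hoodie.index.type", "HBASE")
  .option("hoodie.index.hbase.zkquorum", "zk1,zk2,zk3")     // placeholder quorum hosts
  .option("hoodie.index.hbase.zkport", "2181")
  .option("hoodie.index.hbase.zknode.path", "/hbase")
  .option("hoodie.index.hbase.table", "hudi_record_index")  // pre-created HBase table
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode("append")
  .save(basePath)
```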

-##### bloomIndexUpdatePartitionPath(updatePartitionPath = false) {#bloomIndexUpdatePartitionPath}
-Property: `hoodie.bloom.index.update.partition.path` <br/>
-<span style="color:grey">Only applies if index type is GLOBAL_BLOOM. <br/>When set to true, an update including the partition path of a record that already exists will result in inserting the incoming record into the new partition and deleting the original record in the old partition. When set to false, the original record will only be updated in the old partition.</span>
+#### Simple Index configs
+
+#### simpleIndexUseCaching(useCaching = true) {#simpleIndexUseCaching}
+Property: `hoodie.simple.index.use.caching` <br/>
+<span style="color:grey">Only applies if index type is SIMPLE. <br/> When true, the input RDD will be cached to speed up index lookup by reducing IO for computing parallelism or affected partitions</span>
+
+##### withSimpleIndexInputStorageLevel(level = MEMORY_AND_DISK_SER) {#withSimpleIndexInputStorageLevel}
+Property: `hoodie.simple.index.input.storage.level` <br/>
+<span style="color:grey">Only applies when [simpleIndexUseCaching](#simpleIndexUseCaching) is set. Determines what level of persistence is used to cache input RDDs.<br/> Refer to org.apache.spark.storage.StorageLevel for different values</span>
+
+#### withSimpleIndexParallelism(parallelism = 50) {#withSimpleIndexParallelism}
+Property: `hoodie.simple.index.parallelism` <br/>
+<span style="color:grey">Only applies if index type is SIMPLE. <br/> This is the amount of parallelism for index lookup, which involves a Spark Shuffle.</span>
+
+#### withGlobalSimpleIndexParallelism(parallelism = 100) {#withGlobalSimpleIndexParallelism}
+Property: `hoodie.global.simple.index.parallelism` <br/>
+<span style="color:grey">Only applies if index type is GLOBAL_SIMPLE. <br/> This is the amount of parallelism for index lookup, which involves a Spark Shuffle.</span>

 ### Storage configs
 Controls aspects around sizing parquet and log files.
@@ -331,6 +393,14 @@ Property: `hoodie.cleaner.policy` <br/>
 Property: `hoodie.cleaner.commits.retained` <br/>
 <span style="color:grey">Number of commits to retain. So data will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much you can incrementally pull on this table</span>

+#### withAutoClean(autoClean = true) {#withAutoClean}
+Property: `hoodie.clean.automatic` <br/>
+<span style="color:grey">Whether to run cleaning immediately after each commit, if there is anything to clean up</span>
+
+#### withAsyncClean(asyncClean = false) {#withAsyncClean}
+Property: `hoodie.clean.async` <br/>
+<span style="color:grey">Only applies when [withAutoClean](#withAutoClean) is turned on. When turned on, runs the cleaner asynchronously alongside writing.</span>
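A minimal sketch of the new cleaning knobs; `df`, `basePath` and the field names are assumed, and async cleaning only takes effect when auto clean is on, per the docs above:

```scala
// Cleaning sketch: retain 10 commits, clean automatically and asynchronously.
df.write.format("org.apache.hudi")
  .option("hoodie.table.name", "my_table")
  .option("hoodie.cleaner.commits.retained", "10") // also bounds the incremental-pull window
  .option("hoodie.clean.automatic", "true")        // clean after each commit (default)
  .option("hoodie.clean.async", "true")            // run the cleaner alongside writing
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode("append")
  .save(basePath)
```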
+
 #### archiveCommitsWith(minCommits = 96, maxCommits = 128) {#archiveCommitsWith}
 Property: `hoodie.keep.min.commits`, `hoodie.keep.max.commits` <br/>
 <span style="color:grey">Each commit is a small file in the `.hoodie` directory. Since DFS typically does not favor lots of small files, Hudi archives older commits into a sequential log. A commit is published atomically by a rename of the commit file.</span>

@@ -349,7 +419,7 @@ Property: `hoodie.copyonwrite.insert.split.size` <br/>

 #### autoTuneInsertSplits(true) {#autoTuneInsertSplits}
 Property: `hoodie.copyonwrite.insert.auto.split` <br/>
-<span style="color:grey">Should hudi dynamically compute the insertSplitSize based on the last 24 commit's metadata. Turned off by default. </span>
+<span style="color:grey">Should hudi dynamically compute the insertSplitSize based on the last 24 commits' metadata. Turned on by default. </span>

 #### approxRecordSize(size = 1024) {#approxRecordSize}
 Property: `hoodie.copyonwrite.record.size.estimate` <br/>

diff --git a/docs/_docs/2_6_deployment.md b/docs/_docs/2_6_deployment.md
index df54add..9aadf2a 100644
--- a/docs/_docs/2_6_deployment.md
+++ b/docs/_docs/2_6_deployment.md
@@ -21,9 +21,9 @@ Specifically, we will cover the following aspects.

 All in all, Hudi deploys with no long running servers or additional infrastructure cost to your data lake. In fact, Hudi pioneered this model of building a transactional distributed storage layer using existing infrastructure and its heartening to see other systems adopting similar approaches as well. Hudi writing is done via Spark jobs (DeltaStreamer or custom Spark datasource jobs), deployed per standard Apache Spark [recommendations](https://spark.apache.org/docs/latest/cluster-overview.html).
-Querying Hudi tables happens via libraries installed into Apache Hive, Apache Spark or Presto and hence no additional infrastructure is necessary.
+Querying Hudi tables happens via libraries installed into Apache Hive, Apache Spark or PrestoDB and hence no additional infrastructure is necessary.

-A typical Hudi data ingestion can be achieved in 2 modes. In a singe run mode, Hudi ingestion reads next batch of data, ingest them to Hudi table and exits. In continuous mode, Hudi ingestion runs as a long-running service executing ingestion in a loop.
+A typical Hudi data ingestion can be achieved in 2 modes. In a single run mode, Hudi ingestion reads the next batch of data, ingests it into the Hudi table and exits. In continuous mode, Hudi ingestion runs as a long-running service executing ingestion in a loop.

 With Merge_On_Read Table, Hudi ingestion needs to also take care of compacting delta files. Again, compaction can be performed in an asynchronous-mode by letting compaction run concurrently with ingestion or in a serial fashion with one after another.

@@ -529,7 +529,7 @@ Compaction successfully repaired

 ## Troubleshooting

-Section below generally aids in debugging Hudi failures. Off the bat, the following metadata is added to every record to help triage issues easily using standard Hadoop SQL engines (Hive/Presto/Spark)
+Section below generally aids in debugging Hudi failures. Off the bat, the following metadata is added to every record to help triage issues easily using standard Hadoop SQL engines (Hive/PrestoDB/Spark)

 - **_hoodie_record_key** - Treated as a primary key within each DFS partition, basis of all updates/inserts
 - **_hoodie_commit_time** - Last commit that touched this record
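For example, a quick triage using those metadata columns from a Spark shell might look like this sketch; the table path and record key are placeholders:

```scala
// Triage sketch: find when a given record was last written (placeholder path/key).
val df = spark.read.format("org.apache.hudi").load("/data/hudi_table/*/*")
df.where(df("_hoodie_record_key") === "key-123")
  .select("_hoodie_record_key", "_hoodie_commit_time")
  .show(false)
```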