This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
     new 0ac5f3f  [DOC] Doc changes for release 0.6.0 (#2011)
0ac5f3f is described below

commit 0ac5f3f4e20cee484412ed89e6631b2171196f0c
Author: Bhavani Sudha Saktheeswaran <bhavanisud...@gmail.com>
AuthorDate: Mon Aug 24 11:05:13 2020 -0700

    [DOC] Doc changes for release 0.6.0 (#2011)

    * [DOC] Change instructions and queries supported by PrestoDB

    * Adding video and blog from 'PrestoDB and Apache Hudi' talk on Presto Meetup

    * Config page changes

    - Add doc for using jdbc during hive sync
    - Fix index types to include all available indexes
    - Fix default val for hoodie.copyonwrite.insert.auto.split
    - Add doc for user defined bulk insert partitioner class
    - Add simple index configs
    - Reorder all index configs to be grouped together
    - Add docs for auto cleaning and async cleaning
    - Add docs for rollback parallelism and marker based rollback
    - Add doc for bulk-insert sort modes
    - Add doc for markers delete parallelism

    * CR feedback

    Co-authored-by: Vinoth Chandar <vin...@apache.org>
---
 docs/_docs/1_2_structure.md        |  2 +-
 docs/_docs/1_4_powered_by.md       |  3 ++
 docs/_docs/1_5_comparison.md       |  4 +-
 docs/_docs/2_3_querying_data.cn.md |  8 ++--
 docs/_docs/2_3_querying_data.md    | 21 ++++++---
 docs/_docs/2_4_configurations.md   | 88 ++++++++++++++++++++++++++++++++++----
 docs/_docs/2_6_deployment.md       |  6 +--
 7 files changed, 107 insertions(+), 25 deletions(-)

diff --git a/docs/_docs/1_2_structure.md b/docs/_docs/1_2_structure.md
index ddcdb1a..1c59960 100644
--- a/docs/_docs/1_2_structure.md
+++ b/docs/_docs/1_2_structure.md
@@ -16,6 +16,6 @@ Hudi (pronounced “Hoodie”) ingests & manages storage of large analytical tab
 <img class="docimage" src="/assets/images/hudi_intro_1.png" alt="hudi_intro_1.png" />
 </figure>

-By carefully managing how data is laid out in storage & how it’s exposed to queries, Hudi is able to power a rich data ecosystem where external sources can be ingested in near real-time and made available for interactive SQL Engines like [Presto](https://prestodb.io) & [Spark](https://spark.apache.org/sql/), while at the same time capable of being consumed incrementally from processing/ETL frameworks like [Hive](https://hive.apache.org/) & [Spark](https://spark.apache.org/docs/latest/) t [...]
+By carefully managing how data is laid out in storage & how it’s exposed to queries, Hudi is able to power a rich data ecosystem where external sources can be ingested in near real-time and made available for interactive SQL Engines like [PrestoDB](https://prestodb.io) & [Spark](https://spark.apache.org/sql/), while at the same time capable of being consumed incrementally from processing/ETL frameworks like [Hive](https://hive.apache.org/) & [Spark](https://spark.apache.org/docs/latest/) [...]

 Hudi broadly consists of a self contained Spark library to build tables and integrations with existing query engines for data access. See [quickstart](/docs/quick-start-guide) for a demo.

diff --git a/docs/_docs/1_4_powered_by.md b/docs/_docs/1_4_powered_by.md
index a731979..8e093a4 100644
--- a/docs/_docs/1_4_powered_by.md
+++ b/docs/_docs/1_4_powered_by.md
@@ -113,6 +113,8 @@ Using Hudi at Yotpo for several usages. Firstly, integrated Hudi as a writer in

 14. ["Apache Hudi - Design/Code Walkthrough Session for Contributors"](https://www.youtube.com/watch?v=N2eDfU_rQ_U) - By Vinoth Chandar, July 2020, Hudi community.

["PrestoDB and Apache Hudi"](https://youtu.be/nA3rwOdmm3A) - By Bhavani Sudha Saktheeswaran and Brandon Scheller, Aug 2020, PrestoDB Community Meetup. + ## Articles 1. ["The Case for incremental processing on Hadoop"](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop) - O'reilly Ideas article by Vinoth Chandar @@ -122,6 +124,7 @@ Using Hudi at Yotpo for several usages. Firstly, integrated Hudi as a writer in 5. ["Apache Hudi grows cloud data lake maturity"](https://searchdatamanagement.techtarget.com/news/252484740/Apache-Hudi-grows-cloud-data-lake-maturity) 6. ["Building a Large-scale Transactional Data Lake at Uber Using Apache Hudi"](https://eng.uber.com/apache-hudi-graduation/) - Uber eng blog by Nishith Agarwal 7. ["Hudi On Hops"](https://www.diva-portal.org/smash/get/diva2:1413103/FULLTEXT01.pdf) - By NETSANET GEBRETSADKAN KIDANE +8. ["PrestoDB and Apachi Hudi](https://prestodb.io/blog/2020/08/04/prestodb-and-hudi) - PrestoDB - Hudi integration blog by Bhavani Sudha Saktheeswaran and Brandon Scheller ## Powered by diff --git a/docs/_docs/1_5_comparison.md b/docs/_docs/1_5_comparison.md index 32b73c6..41131a8 100644 --- a/docs/_docs/1_5_comparison.md +++ b/docs/_docs/1_5_comparison.md @@ -31,7 +31,7 @@ we expect Hudi to positioned at something that ingests parquet with superior per Hive transactions does not offer the read-optimized storage option or the incremental pulling, that Hudi does. In terms of implementation choices, Hudi leverages the full power of a processing framework like Spark, while Hive transactions feature is implemented underneath by Hive tasks/queries kicked off by user or the Hive metastore. Based on our production experience, embedding Hudi as a library into existing Spark pipelines was much easier and less operationally heavy, compared with the other approach. -Hudi is also designed to work with non-hive enginers like Presto/Spark and will incorporate file formats other than parquet over time. +Hudi is also designed to work with non-hive engines like PrestoDB/Spark and will incorporate file formats other than parquet over time. ## HBase @@ -49,7 +49,7 @@ integration of Hudi library with Spark/Spark streaming DAGs. In case of Non-Spar and later sent into a Hudi table via a Kafka topic/DFS intermediate file. In more conceptual level, data processing pipelines just consist of three components : `source`, `processing`, `sink`, with users ultimately running queries against the sink to use the results of the pipeline. Hudi can act as either a source or sink, that stores data on DFS. Applicability of Hudi to a given stream processing pipeline ultimately boils down to suitability -of Presto/SparkSQL/Hive for your queries. +of PrestoDB/SparkSQL/Hive for your queries. More advanced use cases revolve around the concepts of [incremental processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop), which effectively uses Hudi even inside the `processing` engine to speed up typical batch pipelines. 
diff --git a/docs/_docs/2_3_querying_data.cn.md b/docs/_docs/2_3_querying_data.cn.md
index c72c2b7..5332790 100644
--- a/docs/_docs/2_3_querying_data.cn.md
+++ b/docs/_docs/2_3_querying_data.cn.md
@@ -36,7 +36,7 @@ language: cn
 |**Hive**|Y|Y|
 |**Spark SQL**|Y|Y|
 |**Spark Datasource**|Y|Y|
-|**Presto**|Y|N|
+|**PrestoDB**|Y|N|
 |**Impala**|Y|N|

@@ -47,7 +47,7 @@ language: cn
 |**Hive**|Y|Y|Y|
 |**Spark SQL**|Y|Y|Y|
 |**Spark Datasource**|Y|N|Y|
-|**Presto**|N|N|Y|
+|**PrestoDB**|Y|N|Y|
 |**Impala**|N|N|Y|

@@ -187,9 +187,9 @@ Dataset<Row> hoodieRealtimeViewDF = spark.read().format("org.apache.hudi")
 | checkExists(keys) | 检查提供的键是否存在于Hudi数据集中 |

-## Presto
+## PrestoDB

-Presto是一种常用的查询引擎,可提供交互式查询性能。 Hudi RO表可以在Presto中无缝查询。
+PrestoDB是一种常用的查询引擎,可提供交互式查询性能。 Hudi RO表可以在PrestoDB中无缝查询。
 这需要在整个安装过程中将`hudi-presto-bundle` jar放入`<presto_install>/plugin/hive-hadoop2/`中。

 ## Impala (3.4 or later)

diff --git a/docs/_docs/2_3_querying_data.md b/docs/_docs/2_3_querying_data.md
index 33a5c13..0af3418 100644
--- a/docs/_docs/2_3_querying_data.md
+++ b/docs/_docs/2_3_querying_data.md
@@ -9,7 +9,7 @@ last_modified_at: 2019-12-30T15:59:57-04:00
 Conceptually, Hudi stores data physically once on DFS, while providing 3 different ways of querying, as explained
 [before](/docs/concepts.html#query-types). Once the table is synced to the Hive metastore, it provides external Hive tables backed by Hudi's custom inputformats. Once the proper hudi
-bundle has been installed, the table can be queried by popular query engines like Hive, Spark SQL, Spark Datasource API and Presto.
+bundle has been installed, the table can be queried by popular query engines like Hive, Spark SQL, Spark Datasource API and PrestoDB.

 Specifically, following Hive tables are registered based off [table name](/docs/configurations.html#TABLE_NAME_OPT_KEY)
 and [table type](/docs/configurations.html#TABLE_TYPE_OPT_KEY) configs passed during write.
@@ -40,7 +40,7 @@ Following tables show whether a given query is supported on specific query engin
 |**Hive**|Y|Y|
 |**Spark SQL**|Y|Y|
 |**Spark Datasource**|Y|Y|
-|**Presto**|Y|N|
+|**PrestoDB**|Y|N|
 |**Impala**|Y|N|

@@ -53,7 +53,7 @@ Note that `Read Optimized` queries are not applicable for COPY_ON_WRITE tables.
 |**Hive**|Y|Y|Y|
 |**Spark SQL**|Y|Y|Y|
 |**Spark Datasource**|Y|N|Y|
-|**Presto**|N|N|Y|
+|**PrestoDB**|Y|N|Y|
 |**Impala**|N|N|Y|

@@ -176,10 +176,19 @@ Additionally, `HoodieReadClient` offers the following functionality using Hudi's
 | filterExists() | Filter out already existing records from the provided `RDD[HoodieRecord]`. Useful for de-duplication |
 | checkExists(keys) | Check if the provided keys exist in a Hudi table |
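Before the PrestoDB section below, a minimal snapshot-query sketch for the Spark Datasource rows in the tables above; the table path and partition-glob depth are hypothetical, and a `spark` session (e.g. spark-shell) is assumed:

```scala
// Minimal sketch of a snapshot query through the Spark Datasource
// (hypothetical base path; use one glob level per partition column).
val snapshotDF = spark.read
  .format("org.apache.hudi")
  .load("/data/hudi_trips/*/*/*")
snapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
spark.sql("select count(*) from hudi_trips_snapshot").show()
```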
-## Presto
+## PrestoDB

-Presto is a popular query engine, providing interactive query performance. Presto currently supports snapshot queries on COPY_ON_WRITE and read optimized queries
-on MERGE_ON_READ Hudi tables. This requires the `hudi-presto-bundle` jar to be placed into `<presto_install>/plugin/hive-hadoop2/`, across the installation.
+PrestoDB is a popular query engine, providing interactive query performance. PrestoDB currently supports snapshot querying on COPY_ON_WRITE tables.
+Both snapshot and read optimized queries are supported on MERGE_ON_READ Hudi tables. Since the PrestoDB-Hudi integration has evolved over time, the
+installation instructions vary with the PrestoDB version. The table below lists the query types supported and the installation steps
+for each version of PrestoDB.
+
+| **PrestoDB Version** | **Installation description** | **Query types supported** |
+|----------------------|------------------------------|---------------------------|
+| < 0.233 | Requires the `hudi-presto-bundle` jar to be placed into `<presto_install>/plugin/hive-hadoop2/`, across the installation. | Snapshot querying on COW tables. Read optimized querying on MOR tables. |
+| >= 0.233 | No action needed. Hudi (0.5.1-incubating) is a compile time dependency. | Snapshot querying on COW tables. Read optimized querying on MOR tables. |
+| >= 0.240 | No action needed. Hudi 0.5.3 version is a compile time dependency. | Snapshot querying on both COW and MOR tables. |

 ## Impala (3.4 or later)

diff --git a/docs/_docs/2_4_configurations.md b/docs/_docs/2_4_configurations.md
index 5536ac0..aa472dd 100644
--- a/docs/_docs/2_4_configurations.md
+++ b/docs/_docs/2_4_configurations.md
@@ -128,6 +128,11 @@ This is useful to store checkpointing information, in a consistent way with the
 #### HIVE_ASSUME_DATE_PARTITION_OPT_KEY {#HIVE_ASSUME_DATE_PARTITION_OPT_KEY}
 Property: `hoodie.datasource.hive_sync.assume_date_partitioning`, Default: `false` <br/>
 <span style="color:grey">Assume partitioning is yyyy/mm/dd</span>
+
+#### HIVE_USE_JDBC_OPT_KEY {#HIVE_USE_JDBC_OPT_KEY}
+Property: `hoodie.datasource.hive_sync.use_jdbc`, Default: `true` <br/>
+<span style="color:grey">Use JDBC when hive synchronization is enabled</span>
+

 ### Read Options

@@ -187,6 +192,18 @@ Property: `hoodie.table.name` [Required] <br/>
 Property: `hoodie.bulkinsert.shuffle.parallelism`<br/>
 <span style="color:grey">Bulk insert is meant to be used for large initial imports and this parallelism determines the initial number of files in your table. Tune this to achieve a desired optimal size during initial import.</span>

+#### withUserDefinedBulkInsertPartitionerClass(className = x.y.z.UserDefinedPartitionerClass) {#withUserDefinedBulkInsertPartitionerClass}
+Property: `hoodie.bulkinsert.user.defined.partitioner.class`<br/>
+<span style="color:grey">If specified, this class will be used to re-partition input records before they are inserted.</span>
+
+#### withBulkInsertSortMode(mode = BulkInsertSortMode.GLOBAL_SORT) {#withBulkInsertSortMode}
+Property: `hoodie.bulkinsert.sort.mode`<br/>
+<span style="color:grey">Sorting mode to use for sorting records on bulk insert. This is leveraged when a user-defined partitioner is not configured. Default is GLOBAL_SORT.
+Available values are - **GLOBAL_SORT**: ensures the best file sizes, with the lowest memory overhead, at the cost of a global sort.
+**PARTITION_SORT**: strikes a balance by only sorting within a partition, still keeping the memory overhead of writing low, with best-effort file sizing.
+**NONE**: no sorting; fastest, and matches `spark.write.parquet()` in terms of number of files and overheads.
+</span>
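Pulling the new bulk-insert knobs together, a minimal Scala sketch; `inputDF`, `basePath` and the key/precombine fields are assumed to exist, and `PARTITION_SORT` is chosen purely to show the option:

```scala
// Bulk-insert sketch using the sort mode documented above (hypothetical names).
inputDF.write.format("org.apache.hudi")
  .option("hoodie.table.name", "imported_table")
  .option("hoodie.datasource.write.operation", "bulk_insert") // large initial import
  .option("hoodie.bulkinsert.shuffle.parallelism", "200")     // initial number of files
  .option("hoodie.bulkinsert.sort.mode", "PARTITION_SORT")    // sort within partitions only
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode("overwrite")
  .save(basePath)
```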
+
 #### withParallelism(insert_shuffle_parallelism = 1500, upsert_shuffle_parallelism = 1500) {#withParallelism}
 Property: `hoodie.insert.shuffle.parallelism`, `hoodie.upsert.shuffle.parallelism`<br/>
 <span style="color:grey">Once data has been initially imported, this parallelism controls initial parallelism for reading input records. Ensure this value is high enough say: 1 partition for 1 GB of input data</span>

@@ -211,10 +228,22 @@ Property: `hoodie.assume.date.partitioning`<br/>
 Property: `hoodie.consistency.check.enabled`<br/>
 <span style="color:grey">Should HoodieWriteClient perform additional checks to ensure written files' are listable on the underlying filesystem/storage. Set this to true, to workaround S3's eventual consistency model and ensure all data written as a part of a commit is faithfully available for queries. </span>

+#### withRollbackParallelism(rollbackParallelism = 100) {#withRollbackParallelism}
+Property: `hoodie.rollback.parallelism`<br/>
+<span style="color:grey">Determines the parallelism for rollback of commits.</span>
+
+#### withRollbackUsingMarkers(rollbackUsingMarkers = false) {#withRollbackUsingMarkers}
+Property: `hoodie.rollback.using.markers`<br/>
+<span style="color:grey">Enables a more efficient mechanism for rollbacks based on the marker files generated during the writes. Turned off by default.</span>
+
+#### withMarkersDeleteParallelism(parallelism = 100) {#withMarkersDeleteParallelism}
+Property: `hoodie.markers.delete.parallelism`<br/>
+<span style="color:grey">Determines the parallelism for deleting marker files.</span>
+
 ### Index configs
 Following configs control indexing behavior, which tags incoming records as either inserts or updates to older records.

-[withIndexConfig](#withIndexConfig) (HoodieIndexConfig) <br/>
+[withIndexConfig](#index-configs) (HoodieIndexConfig) <br/>
 <span style="color:grey">This is pluggable to have a external index (HBase) or use the default bloom filter stored in the Parquet files</span>

 #### withIndexClass(indexClass = "x.y.z.UserDefinedIndex") {#withIndexClass}
 Property: `hoodie.index.class` <br/>

@@ -223,7 +252,9 @@ Property: `hoodie.index.class` <br/>
 #### withIndexType(indexType = BLOOM) {#withIndexType}
 Property: `hoodie.index.type` <br/>
-<span style="color:grey">Type of index to use. Default is Bloom filter. Possible options are [BLOOM | HBASE | INMEMORY]. Bloom filters removes the dependency on a external system and is stored in the footer of the Parquet Data Files</span>
+<span style="color:grey">Type of index to use. Default is Bloom filter. Possible options are [BLOOM | GLOBAL_BLOOM | SIMPLE | GLOBAL_SIMPLE | INMEMORY | HBASE]. Bloom filters remove the dependency on an external system and are stored in the footer of the Parquet data files</span>
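For instance, opting into the SIMPLE index type added in this release is a one-option change; a sketch with hypothetical `upsertDF`, `basePath` and field names:

```scala
// Index-type sketch (hypothetical names); any of the listed types can be substituted.
upsertDF.write.format("org.apache.hudi")
  .option("hoodie.table.name", "my_table")
  .option("hoodie.index.type", "SIMPLE") // or GLOBAL_SIMPLE for uniqueness across partitions
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode("append")
  .save(basePath)
```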
+
+#### Bloom Index configs

 #### bloomFilterNumEntries(numEntries = 60000) {#bloomFilterNumEntries}
 Property: `hoodie.index.bloom.num_entries` <br/>
@@ -233,6 +264,10 @@ Property: `hoodie.index.bloom.num_entries` <br/>
 Property: `hoodie.index.bloom.fpp` <br/>
 <span style="color:grey">Only applies if index type is BLOOM. <br/> Error rate allowed given the number of entries. This is used to calculate how many bits should be assigned for the bloom filter and the number of hash functions. This is usually set very low (default: 0.000000001), we like to tradeoff disk space for lower false positives</span>

+#### bloomIndexParallelism(0) {#bloomIndexParallelism}
+Property: `hoodie.bloom.index.parallelism` <br/>
+<span style="color:grey">Only applies if index type is BLOOM. <br/> This is the amount of parallelism for index lookup, which involves a Spark Shuffle. By default, this is auto computed based on input workload characteristics</span>
+
 #### bloomIndexPruneByRanges(pruneRanges = true) {#bloomIndexPruneByRanges}
 Property: `hoodie.bloom.index.prune.by.ranges` <br/>
 <span style="color:grey">Only applies if index type is BLOOM. <br/> When true, range information from files to leveraged speed up index lookups. Particularly helpful, if the key has a monotonously increasing prefix, such as timestamp.</span>

@@ -249,13 +284,27 @@ Property: `hoodie.bloom.index.use.treebased.filter` <br/>
 Property: `hoodie.bloom.index.bucketized.checking` <br/>
 <span style="color:grey">Only applies if index type is BLOOM. <br/> When true, bucketized bloom filtering is enabled. This reduces skew seen in sort based bloom index lookup</span>

+#### bloomIndexFilterType(filterType = BloomFilterTypeCode.SIMPLE) {#bloomIndexFilterType}
+Property: `hoodie.bloom.index.filter.type` <br/>
+<span style="color:grey">Filter type used. Default is BloomFilterTypeCode.SIMPLE. Available values are [BloomFilterTypeCode.SIMPLE, BloomFilterTypeCode.DYNAMIC_V0]. Dynamic bloom filters auto-size themselves based on the number of keys</span>
+
+#### bloomIndexFilterDynamicMaxEntries(maxNumberOfEntries = 100000) {#bloomIndexFilterDynamicMaxEntries}
+Property: `hoodie.bloom.index.filter.dynamic.max.entries` <br/>
+<span style="color:grey">The threshold for the maximum number of keys to record in a dynamic Bloom filter row. Only applies if filter type is BloomFilterTypeCode.DYNAMIC_V0.</span>
+
 #### bloomIndexKeysPerBucket(keysPerBucket = 10000000) {#bloomIndexKeysPerBucket}
 Property: `hoodie.bloom.index.keys.per.bucket` <br/>
 <span style="color:grey">Only applies if bloomIndexBucketizedChecking is enabled and index type is bloom. <br/> This configuration controls the "bucket" size which tracks the number of record-key checks made against a single file and is the unit of work allocated to each partition performing bloom filter lookup. A higher value would amortize the fixed cost of reading a bloom filter to memory. </span>

-#### bloomIndexParallelism(0) {#bloomIndexParallelism}
-Property: `hoodie.bloom.index.parallelism` <br/>
-<span style="color:grey">Only applies if index type is BLOOM. <br/> This is the amount of parallelism for index lookup, which involves a Spark Shuffle. By default, this is auto computed based on input workload characteristics</span>
+##### withBloomIndexInputStorageLevel(level = MEMORY_AND_DISK_SER) {#withBloomIndexInputStorageLevel}
+Property: `hoodie.bloom.index.input.storage.level` <br/>
+<span style="color:grey">Only applies when [bloomIndexUseCaching](#bloomIndexUseCaching) is set. Determines what level of persistence is used to cache input RDDs.<br/> Refer to org.apache.spark.storage.StorageLevel for different values</span>
+
+##### bloomIndexUpdatePartitionPath(updatePartitionPath = false) {#bloomIndexUpdatePartitionPath}
+Property: `hoodie.bloom.index.update.partition.path` <br/>
+<span style="color:grey">Only applies if index type is GLOBAL_BLOOM. <br/>When set to true, an update including the partition path of a record that already exists will result in inserting the incoming record into the new partition and deleting the original record in the old partition. When set to false, the original record will only be updated in the old partition.</span>
+
+#### HBase Index configs

 #### hbaseZkQuorum(zkString) [Required] {#hbaseZkQuorum}
 Property: `hoodie.index.hbase.zkquorum` <br/>
@@ -273,10 +322,23 @@ Property: `hoodie.index.hbase.zknode.path` <br/>
 Property: `hoodie.index.hbase.table` <br/>
 <span style="color:grey">Only applies if index type is HBASE. HBase Table name to use as the index. Hudi stores the row_key and [partition_path, fileID, commitTime] mapping in the table.</span>
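A hedged sketch wiring the HBase index properties above through datasource options; the ZooKeeper quorum, port, znode path and table name are placeholders, and the index table is assumed to already exist in HBase:

```scala
// HBase index sketch (placeholder connection details and table names).
upsertDF.write.format("org.apache.hudi")
  .option("hoodie.table.name", "my_table")
  .option("hoodie.index.type", "HBASE")
  .option("hoodie.index.hbase.zkquorum", "zk1,zk2,zk3")     // placeholder quorum hosts
  .option("hoodie.index.hbase.zkport", "2181")
  .option("hoodie.index.hbase.zknode.path", "/hbase")
  .option("hoodie.index.hbase.table", "hudi_record_index")  // pre-created HBase table
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode("append")
  .save(basePath)
```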

-##### bloomIndexUpdatePartitionPath(updatePartitionPath = false) {#bloomIndexUpdatePartitionPath}
-Property: `hoodie.bloom.index.update.partition.path` <br/>
-<span style="color:grey">Only applies if index type is GLOBAL_BLOOM. <br/>When set to true, an update including the partition path of a record that already exists will result in inserting the incoming record into the new partition and deleting the original record in the old partition. When set to false, the original record will only be updated in the old partition.</span>
+#### Simple Index configs
+
+#### simpleIndexUseCaching(useCaching = true) {#simpleIndexUseCaching}
+Property: `hoodie.simple.index.use.caching` <br/>
+<span style="color:grey">Only applies if index type is SIMPLE. <br/> When true, the input RDD will be cached to speed up index lookup by reducing IO for computing parallelism or affected partitions</span>
+
+##### withSimpleIndexInputStorageLevel(level = MEMORY_AND_DISK_SER) {#withSimpleIndexInputStorageLevel}
+Property: `hoodie.simple.index.input.storage.level` <br/>
+<span style="color:grey">Only applies when [simpleIndexUseCaching](#simpleIndexUseCaching) is set. Determines what level of persistence is used to cache input RDDs.<br/> Refer to org.apache.spark.storage.StorageLevel for different values</span>
+
+#### withSimpleIndexParallelism(parallelism = 50) {#withSimpleIndexParallelism}
+Property: `hoodie.simple.index.parallelism` <br/>
+<span style="color:grey">Only applies if index type is SIMPLE. <br/> This is the amount of parallelism for index lookup, which involves a Spark Shuffle.</span>
+
+#### withGlobalSimpleIndexParallelism(parallelism = 100) {#withGlobalSimpleIndexParallelism}
+Property: `hoodie.global.simple.index.parallelism` <br/>
+<span style="color:grey">Only applies if index type is GLOBAL_SIMPLE. <br/> This is the amount of parallelism for index lookup, which involves a Spark Shuffle.</span>

 ### Storage configs
 Controls aspects around sizing parquet and log files.
@@ -331,6 +393,14 @@ Property: `hoodie.cleaner.policy` <br/>
 Property: `hoodie.cleaner.commits.retained` <br/>
 <span style="color:grey">Number of commits to retain. So data will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much you can incrementally pull on this table</span>

+#### withAutoClean(autoClean = true) {#withAutoClean}
+Property: `hoodie.clean.automatic` <br/>
+<span style="color:grey">Whether to run cleaning immediately after each commit, if there is anything to clean up</span>
+
+#### withAsyncClean(asyncClean = false) {#withAsyncClean}
+Property: `hoodie.clean.async` <br/>
+<span style="color:grey">Only applies when [withAutoClean](#withAutoClean) is turned on. When turned on, runs the cleaner asynchronously alongside writing.</span>
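A minimal sketch of the new cleaning knobs; `df`, `basePath` and the field names are assumed, and async cleaning only takes effect when auto clean is on, per the docs above:

```scala
// Cleaning sketch: retain 10 commits, clean automatically and asynchronously.
df.write.format("org.apache.hudi")
  .option("hoodie.table.name", "my_table")
  .option("hoodie.cleaner.commits.retained", "10") // also bounds the incremental-pull window
  .option("hoodie.clean.automatic", "true")        // clean after each commit (default)
  .option("hoodie.clean.async", "true")            // run the cleaner alongside writing
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode("append")
  .save(basePath)
```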
+
 #### archiveCommitsWith(minCommits = 96, maxCommits = 128) {#archiveCommitsWith}
 Property: `hoodie.keep.min.commits`, `hoodie.keep.max.commits` <br/>
 <span style="color:grey">Each commit is a small file in the `.hoodie` directory. Since DFS typically does not favor lots of small files, Hudi archives older commits into a sequential log. A commit is published atomically by a rename of the commit file.</span>

@@ -349,7 +419,7 @@ Property: `hoodie.copyonwrite.insert.split.size` <br/>

 #### autoTuneInsertSplits(true) {#autoTuneInsertSplits}
 Property: `hoodie.copyonwrite.insert.auto.split` <br/>
-<span style="color:grey">Should hudi dynamically compute the insertSplitSize based on the last 24 commit's metadata. Turned off by default. </span>
+<span style="color:grey">Should hudi dynamically compute the insertSplitSize based on the last 24 commits' metadata. Turned on by default. </span>

 #### approxRecordSize(size = 1024) {#approxRecordSize}
 Property: `hoodie.copyonwrite.record.size.estimate` <br/>

diff --git a/docs/_docs/2_6_deployment.md b/docs/_docs/2_6_deployment.md
index df54add..9aadf2a 100644
--- a/docs/_docs/2_6_deployment.md
+++ b/docs/_docs/2_6_deployment.md
@@ -21,9 +21,9 @@ Specifically, we will cover the following aspects.

 All in all, Hudi deploys with no long running servers or additional infrastructure cost to your data lake. In fact, Hudi pioneered this model of building a transactional distributed storage layer using existing infrastructure and its heartening to see other systems adopting similar approaches as well. Hudi writing is done via Spark jobs (DeltaStreamer or custom Spark datasource jobs), deployed per standard Apache Spark [recommendations](https://spark.apache.org/docs/latest/cluster-overview.html).
-Querying Hudi tables happens via libraries installed into Apache Hive, Apache Spark or Presto and hence no additional infrastructure is necessary.
+Querying Hudi tables happens via libraries installed into Apache Hive, Apache Spark or PrestoDB and hence no additional infrastructure is necessary.

-A typical Hudi data ingestion can be achieved in 2 modes. In a singe run mode, Hudi ingestion reads next batch of data, ingest them to Hudi table and exits. In continuous mode, Hudi ingestion runs as a long-running service executing ingestion in a loop.
+A typical Hudi data ingestion can be achieved in 2 modes. In a single run mode, Hudi ingestion reads the next batch of data, ingests it into the Hudi table and exits. In continuous mode, Hudi ingestion runs as a long-running service executing ingestion in a loop.

 With Merge_On_Read Table, Hudi ingestion needs to also take care of compacting delta files. Again, compaction can be performed in an asynchronous-mode by letting compaction run concurrently with ingestion or in a serial fashion with one after another.

@@ -529,7 +529,7 @@ Compaction successfully repaired

 ## Troubleshooting

-Section below generally aids in debugging Hudi failures. Off the bat, the following metadata is added to every record to help triage issues easily using standard Hadoop SQL engines (Hive/Presto/Spark)
+Section below generally aids in debugging Hudi failures. Off the bat, the following metadata is added to every record to help triage issues easily using standard Hadoop SQL engines (Hive/PrestoDB/Spark)

 - **_hoodie_record_key** - Treated as a primary key within each DFS partition, basis of all updates/inserts
 - **_hoodie_commit_time** - Last commit that touched this record
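For example, a quick triage using those metadata columns from a Spark shell might look like this sketch; the table path and record key are placeholders:

```scala
// Triage sketch: find when a given record was last written (placeholder path/key).
val df = spark.read.format("org.apache.hudi").load("/data/hudi_table/*/*")
df.where(df("_hoodie_record_key") === "key-123")
  .select("_hoodie_record_key", "_hoodie_commit_time")
  .show(false)
```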