This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
     new 6dacc12699 [HUDI-4583][DOCS] Optimal write configs for bulk insert (#6399)
6dacc12699 is described below

commit 6dacc126995f18406d76019eae047270de43a44d
Author: Sagar Sumit <sagarsumi...@gmail.com>
AuthorDate: Tue Aug 16 23:34:35 2022 +0530

    [HUDI-4583][DOCS] Optimal write configs for bulk insert (#6399)
---
 website/docs/performance.md | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/website/docs/performance.md b/website/docs/performance.md
index e64b0e551f..5bb7f935a1 100644
--- a/website/docs/performance.md
+++ b/website/docs/performance.md
@@ -30,6 +30,35 @@ the conventional alternatives for achieving these tasks.
 
 ### Write Path
 
+#### Bulk Insert
+
+Write configurations in Hudi are optimized for incremental upserts by default. In fact, the default write operation type is UPSERT as well.
+For simple append-only use cases that bulk load data, the following set of configurations is recommended for optimal writing:
+```
+-- Use “bulk-insert” write-operation instead of the default “upsert”
+hoodie.datasource.write.operation = BULK_INSERT
+-- Disable populating meta columns and metadata, and enable virtual keys
+hoodie.populate.meta.fields = false
+hoodie.metadata.enable = false
+-- Enable snappy compression codec for fewer CPU cycles (but more storage overhead)
+hoodie.parquet.compression.codec = snappy
+```
+
+For ingesting via spark-sql:
+```
+-- Use “bulk-insert” write-operation instead of the default “upsert”
+hoodie.sql.insert.mode = non-strict
+hoodie.sql.bulk.insert.enable = true
+-- Disable populating meta columns and metadata, and enable virtual keys
+hoodie.populate.meta.fields = false
+hoodie.metadata.enable = false
+-- Enable snappy compression codec for fewer CPU cycles (but more storage overhead)
+hoodie.parquet.compression.codec = snappy
+```
+
+We recently benchmarked Hudi against the TPC-DS workload.
+Please check out [our blog](/blog/2022/06/29/Apache-Hudi-vs-Delta-Lake-transparent-tpc-ds-lakehouse-performance-benchmarks) for more details.
+
 #### Upserts
 
 Following shows the speed up obtained for NoSQL database ingestion, from incrementally upserting on a Hudi table on the copy-on-write storage,
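
For illustration only (not part of the commit above): a minimal sketch of how the datasource-path configurations from the diff might be passed in a Spark write. It assumes a SparkSession with the Hudi bundle on the classpath and an existing DataFrame `inputDF`; the table name and base path are hypothetical.

```scala
// Minimal sketch, assuming an existing DataFrame `inputDF` and a Hudi-enabled
// Spark session; "my_table" and the base path below are hypothetical.
import org.apache.spark.sql.SaveMode

inputDF.write
  .format("hudi")
  .option("hoodie.table.name", "my_table")                     // hypothetical table name
  .option("hoodie.datasource.write.operation", "bulk_insert")  // bulk insert instead of the default upsert
  .option("hoodie.populate.meta.fields", "false")              // skip meta columns, i.e. use virtual keys
  .option("hoodie.metadata.enable", "false")                   // disable the metadata table
  .option("hoodie.parquet.compression.codec", "snappy")        // fewer CPU cycles, more storage overhead
  .mode(SaveMode.Append)
  .save("/tmp/hudi/my_table")                                  // hypothetical base path
```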