This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 6587cad67c [DOCS] Add faq for async/offline compaction options (#5304)
6587cad67c is described below

commit 6587cad67c6b3e8463315d5bad34404ca2cf5e8a
Author: Bhavani Sudha Saktheeswaran <2179254+bhasu...@users.noreply.github.com>
AuthorDate: Thu Apr 28 17:43:05 2022 -0700

    [DOCS] Add faq for async/offline compaction options (#5304)
    
    Co-authored-by: Bhavani Sudha Saktheeswaran <sudha@vmacs.local>
---
 website/docs/compaction.md |  2 +-
 website/learn/faq.md       | 22 ++++++++++++++++++++++
 2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/website/docs/compaction.md b/website/docs/compaction.md
index 015d21ec68..fe679f4ac9 100644
--- a/website/docs/compaction.md
+++ b/website/docs/compaction.md
@@ -10,7 +10,7 @@ Compaction is executed asynchronously with Hudi by default. 
Async Compaction is
 
 1. ***Compaction Scheduling***: This is done by the ingestion job. In this 
step, Hudi scans the partitions and selects **file
    slices** to be compacted. A compaction plan is finally written to Hudi 
timeline.
-1. ***Compaction Execution***: A separate process reads the compaction plan and performs compaction of file slices.
+1. ***Compaction Execution***: In this step, the compaction plan is read and the file slices are compacted.
 
There are a few ways by which we can execute compactions asynchronously.
 
diff --git a/website/learn/faq.md b/website/learn/faq.md
index ace41cc7b6..ab6501a88b 100644
--- a/website/learn/faq.md
+++ b/website/learn/faq.md
@@ -253,6 +253,28 @@ Simplest way to run compaction on MOR dataset is to run 
the [compaction inline](
 
That said, for obvious reasons of not blocking ingestion for compaction, you
may want to run it asynchronously as well. This can be done either via a 
separate [compaction 
job](https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactor.java)
 that is scheduled by your workflow scheduler/notebook independently. If you 
are using delta streamer, then you can run in [continuous 
mode](https://github.com/apache/hudi/blob/d3edac4612bde2fa9dec [...]
 
+### What options do I have for asynchronous/offline compaction on a MOR dataset?
+
+There are a couple of options, depending on how you write to Hudi. But first, let us briefly understand what is involved. There are two parts to compaction:
+- Scheduling: In this step, Hudi scans the partitions and selects the file slices to be compacted. A compaction plan is finally written to the Hudi timeline. Scheduling needs tighter coordination with other writers (regular ingestion is considered one of the writers). If scheduling is done inline with the ingestion job, this coordination is taken care of automatically. Otherwise, when scheduling happens asynchronously, a lock provider needs to be configured for this coordination among multiple writers (see the configuration sketch right after this list).
+- Execution: In this step, the compaction plan is read and the file slices are compacted. Execution does not need the same level of coordination with other writers as the scheduling step, and can easily be decoupled from the ingestion job.
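+
+The snippet below is a minimal, hypothetical sketch (not part of this commit) of the kind of lock provider configuration such multi-writer coordination needs, shown for a Spark datasource writer with a ZooKeeper-based lock provider. The table name, fields, paths and ZooKeeper endpoint are placeholders.
+```scala
+import org.apache.spark.sql.SparkSession
+
+val spark = SparkSession.builder().appName("hudi-lock-provider-sketch").getOrCreate()
+val df = spark.read.parquet("/tmp/incoming_batch")   // hypothetical input batch
+
+df.write.format("hudi").
+  option("hoodie.table.name", "my_mor_table").       // hypothetical table name
+  option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
+  option("hoodie.datasource.write.recordkey.field", "uuid").
+  option("hoodie.datasource.write.precombine.field", "ts").
+  // Multi-writer coordination: optimistic concurrency control backed by a lock provider.
+  option("hoodie.write.concurrency.mode", "optimistic_concurrency_control").
+  option("hoodie.write.lock.provider", "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider").
+  option("hoodie.write.lock.zookeeper.url", "zk-host").          // hypothetical ZooKeeper host
+  option("hoodie.write.lock.zookeeper.port", "2181").
+  option("hoodie.write.lock.zookeeper.lock_key", "my_mor_table").
+  option("hoodie.write.lock.zookeeper.base_path", "/hudi/locks").
+  mode("append").
+  save("/tmp/hudi/my_mor_table")                     // hypothetical base path
+```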
+
+Depending on how you write to Hudi, these are the options currently available.
+- DeltaStreamer:
+   - In continuous mode, asynchronous compaction is achieved by default. Here, scheduling is done inline by the ingestion job, and compaction execution is carried out asynchronously by a separate parallel thread.
+   - In non-continuous mode, only inline compaction is possible.
+   - Please note that in either mode, passing `--disable-compaction` disables compaction completely.
+- Spark datasource:
+   - Async scheduling and async execution can be achieved by periodically running an offline Hudi Compactor Utility or the Hudi CLI. However, this needs a lock provider to be configured.
+   - Alternatively, from 0.11.0 onwards, to avoid the dependency on lock providers, scheduling alone can be done inline by the regular writer using the config `hoodie.compact.schedule.inline`, and compaction execution can then be done offline by periodically triggering the Hudi Compactor Utility or the Hudi CLI (see the write sketch after this list).
+- Spark structured streaming:
+   - Compactions are scheduled and executed asynchronously inside the streaming job. Async compaction is enabled by default for structured streaming jobs on Merge-On-Read tables.
+   - Please note that it is not possible to disable async compaction for a MOR dataset with Spark structured streaming.
+- Flink:
+   - Async compaction is enabled by default for Merge-On-Read tables.
+   - Offline compaction can be achieved by setting ```compaction.async.enabled``` to ```false``` and periodically running the [Flink offline Compactor](https://hudi.apache.org/docs/next/compaction/#flink-offline-compaction). When running the offline compactor, one needs to ensure there are no active writes to the table.
+   - A third option (highly recommended over the second one) is to schedule the compactions from the regular ingestion job and execute the compaction plans from an offline job. To achieve this, set ```compaction.async.enabled``` to ```false``` and ```compaction.schedule.enabled``` to ```true```, and then run the [Flink offline Compactor](https://hudi.apache.org/docs/next/compaction/#flink-offline-compaction) periodically to execute the plans (see the Flink sketch after this list).
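+
+As a sketch of the Spark datasource option above (0.11.0 onwards), the hypothetical snippet below performs a regular write that only schedules a compaction plan inline, leaving execution to a later offline run of the Hudi Compactor Utility or the Hudi CLI. Table name, fields and paths are placeholders.
+```scala
+import org.apache.spark.sql.SparkSession
+
+val spark = SparkSession.builder().appName("hudi-inline-schedule-sketch").getOrCreate()
+val df = spark.read.parquet("/tmp/incoming_batch")    // hypothetical input batch
+
+df.write.format("hudi").
+  option("hoodie.table.name", "my_mor_table").        // hypothetical table name
+  option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
+  option("hoodie.datasource.write.recordkey.field", "uuid").
+  option("hoodie.datasource.write.precombine.field", "ts").
+  option("hoodie.compact.inline", "false").           // do not execute compaction inside this job
+  option("hoodie.compact.schedule.inline", "true").   // only write the compaction plan to the timeline
+  mode("append").
+  save("/tmp/hudi/my_mor_table")                      // hypothetical base path
+```
+The plans written this way are then executed by the offline Hudi Compactor Utility or the Hudi CLI on whatever cadence suits your pipeline.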
+
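+For the third Flink option, the hypothetical snippet below (issued from the Scala Table API; the table schema, names and path are placeholders) defines a Hudi sink that keeps compaction scheduling in the ingestion job while disabling async execution, so that the Flink offline Compactor can execute the plans.
+```scala
+import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}
+
+val tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode())
+tEnv.executeSql(
+  """
+    |CREATE TABLE hudi_mor_sink (
+    |  uuid STRING,
+    |  name STRING,
+    |  ts   TIMESTAMP(3),
+    |  PRIMARY KEY (uuid) NOT ENFORCED
+    |) WITH (
+    |  'connector' = 'hudi',
+    |  'path' = 'file:///tmp/hudi/hudi_mor_sink',      -- hypothetical base path
+    |  'table.type' = 'MERGE_ON_READ',
+    |  'compaction.async.enabled' = 'false',           -- do not execute compaction in the streaming job
+    |  'compaction.schedule.enabled' = 'true'          -- still schedule compaction plans from the ingestion job
+    |)
+  """.stripMargin)
+```
+The scheduled plans are then executed by running the Flink offline Compactor periodically, as described in the third option above.
+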
 ### What performance/ingest latency can I expect for Hudi writing?
 
The speed at which you can write into Hudi depends on the [write operation](https://hudi.apache.org/docs/writing_data/) and some trade-offs you make along the way, like file sizing. Just as databases incur overhead over direct/raw file I/O on disks, Hudi operations may have overhead from supporting database-like features compared to reading/writing raw DFS files. That said, Hudi implements advanced techniques from database literature to keep these minimal. User is encouraged to ha [...]
