bvaradar commented on a change in pull request #1996:
URL: https://github.com/apache/hudi/pull/1996#discussion_r475285098



##########
File path: docs/_posts/2020-08-21-async-compaction-deployment-model.md
##########
@@ -0,0 +1,99 @@
+---
+title: "Async Compaction Deployment Models"
+excerpt: "Mechanisms for executing compaction jobs in Hudi asynchronously"
+author: vbalaji
+category: blog
+---
+
+We will look at different deployment models for executing compactions asynchronously.
+
+# Compaction
+
+For Merge-On-Read tables, data is stored using a combination of columnar (e.g. parquet) and row-based (e.g. avro) file formats.
+Updates are logged to delta files and later compacted to produce new versions of the columnar files, either synchronously or
+asynchronously. One of the main motivations behind Merge-On-Read is to reduce data latency when ingesting records.
+Hence, it makes sense to run compaction asynchronously without blocking ingestion.
+
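+To make this concrete, a plain batch upsert into a Merge-On-Read table through the Spark datasource could
+look like the minimal sketch below. Each such write adds a delta commit whose log files are later compacted
+into new columnar file versions. The dataframe `inputDf`, the record key/partition/precombine fields,
+`tableName` and `tablePath` are placeholders, not part of a specific pipeline.
+
+```java
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.config.HoodieWriteConfig;
+
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SaveMode;
+
+// inputDf is an existing Dataset<Row>; tableName and tablePath identify the Hudi table.
+// Writing with table type MERGE_ON_READ logs updates into delta files,
+// which a later compaction folds back into columnar base files.
+inputDf.write().format("org.apache.hudi")
+       .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY(), "MERGE_ON_READ")
+       .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
+       .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
+       .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
+       .option(HoodieWriteConfig.TABLE_NAME, tableName)
+       .mode(SaveMode.Append)
+       .save(tablePath);
+```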
+
+# Async Compaction
+
+Async Compaction is performed in 2 steps:
+
+1. ***Compaction Scheduling***: This is done by the ingestion job. In this step, Hudi scans the partitions and selects **file
+slices** to be compacted. A compaction plan is finally written to the Hudi timeline.
+1. ***Compaction Execution***: A separate process reads the compaction plan and performs compaction of file slices, as sketched below.
+
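+Below is a minimal sketch of these two steps using the Java write client. The exact client class and method
+signatures can differ across Hudi versions, and `jsc` and `writeConfig` stand in for an existing
+`JavaSparkContext` and `HoodieWriteConfig`.
+
+```java
+import org.apache.hudi.client.HoodieWriteClient;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieWriteConfig;
+
+import org.apache.spark.api.java.JavaSparkContext;
+
+// jsc and writeConfig are assumed to be created elsewhere.
+HoodieWriteClient client = new HoodieWriteClient(jsc, writeConfig);
+
+// Step 1 - Compaction Scheduling: select file slices and write a
+// compaction plan to the timeline; returns the compaction instant time.
+Option<String> instant = client.scheduleCompaction(Option.empty());
+
+// Step 2 - Compaction Execution: read the plan for that instant and
+// compact the file slices (usually done by a separate process).
+if (instant.isPresent()) {
+  client.compact(instant.get());
+}
+```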
+  
+# Deployment Models
+
+There are a few ways by which we can execute compactions asynchronously.
+
+## Spark Structured Streaming
+
+With 0.6.0, we now have support for running async compactions in Spark
+Structured Streaming jobs. Compactions are scheduled and executed asynchronously inside the
+streaming job. Async compactions are enabled by default for structured streaming jobs
+on Merge-On-Read tables.
+
+Here is an example snippet in Java:
+
+```java
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.HoodieDataSourceHelpers;
+import org.apache.hudi.config.HoodieCompactionConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.streaming.DataStreamWriter;
+import org.apache.spark.sql.streaming.OutputMode;
+import org.apache.spark.sql.streaming.ProcessingTime;
+
+ // streamingInput is a streaming Dataset<Row>; operationType, tableType,
+ // tableName, checkpointLocation and tablePath are defined elsewhere.
+ DataStreamWriter<Row> writer = streamingInput.writeStream().format("org.apache.hudi")
+        .option(DataSourceWriteOptions.OPERATION_OPT_KEY(), operationType)
+        .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY(), tableType)
+        .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
+        .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
+        .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
+        .option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP, "10")
+        .option(DataSourceWriteOptions.ASYNC_COMPACT_ENABLE_OPT_KEY(), "true")
+        .option(HoodieWriteConfig.TABLE_NAME, tableName)
+        .option("checkpointLocation", checkpointLocation)
+        .outputMode(OutputMode.Append());
+ // Trigger a micro-batch every 30 seconds.
+ writer.trigger(new ProcessingTime(30000)).start(tablePath);
+```
+
+## DeltaStreamer Continuous Mode

Review comment:
       Fixed. Thanks,



