nsivabalan commented on a change in pull request #3525:
URL: https://github.com/apache/hudi/pull/3525#discussion_r694961636



##########
File path: website/blog/2021-08-23-async-clustering.md
##########
@@ -0,0 +1,153 @@
+---
+title: "Asynchronous Clustering using Hudi"
+excerpt: "How to setup Hudi for asynchronous clustering"
+author: codope 
+category: blog
+---
+
+In one of the [previous 
blog](https://hudi.apache.org/blog/2021/01/27/hudi-clustering-intro) posts, we 
introduced a new
+kind of table service called clustering to reorganize data for improved query 
performance without compromising on
+ingestion speed. We learnt how to setup inline clustering. In this post, we 
will discuss what has changed since then and
+see how asynchronous clustering can be setup using HoodieClusteringJob as well 
as DeltaStreamer utility.
+
+## Introduction
+
+On a high level, clustering creates a plan based on a configurable strategy, 
groups eligible files based on specific
+criteria and then executes the plan. Hudi's [MVCC 
model](https://hudi.apache.org/docs/concurrency_control) provides
+snapshot isolation between multiple table services, which allows writers to 
continue with ingestion while clustering
+runs in the background. For a more detailed overview of the clustering 
architecture please check out the previous blog
+post.
+
+## Clustering Strategies
+
+As mentioned before, clustering plan as well as execution depends on 
configurable strategy. These strategies can be
+broadly classified into three types: clustering plan strategy, execution 
strategy and update strategy.
+
+### Plan Strategy
+
+This strategy comes into play while creating clustering plan. It helps to 
decide what file groups should be clustered.
+Let's look at different plan strategies that are available with Hudi. Note 
that these strategies are easily pluggable
+using this 
[config](https://hudi.apache.org/docs/next/configurations#hoodieclusteringplanstrategyclass).
+
+1. `SparkSizeBasedClusteringPlanStrategy`: It selects file slices based on
+   the [small file 
limit](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategysmallfilelimit)
+   of base files and creates clustering groups upto max file size allowed per 
group. The max size can be specified using

Review comment:
       guess we don't talk much about these strategies in previous blogs. so, 
may be good to talk about some use-cases here. for eg, something like "for cold 
partitions, users might want to stitch lot of medium sized files to larger ones 
to reduce lot of files in the data lake"

##########
File path: website/blog/2021-08-23-async-clustering.md
##########
@@ -0,0 +1,153 @@
+---
+title: "Asynchronous Clustering using Hudi"
+excerpt: "How to setup Hudi for asynchronous clustering"
+author: codope 
+category: blog
+---
+
+In one of the [previous 
blog](https://hudi.apache.org/blog/2021/01/27/hudi-clustering-intro) posts, we 
introduced a new
+kind of table service called clustering to reorganize data for improved query 
performance without compromising on
+ingestion speed. We learnt how to setup inline clustering. In this post, we 
will discuss what has changed since then and
+see how asynchronous clustering can be setup using HoodieClusteringJob as well 
as DeltaStreamer utility.
+
+## Introduction
+
+On a high level, clustering creates a plan based on a configurable strategy, 
groups eligible files based on specific
+criteria and then executes the plan. Hudi's [MVCC 
model](https://hudi.apache.org/docs/concurrency_control) provides
+snapshot isolation between multiple table services, which allows writers to 
continue with ingestion while clustering
+runs in the background. For a more detailed overview of the clustering 
architecture please check out the previous blog
+post.
+
+## Clustering Strategies
+
+As mentioned before, clustering plan as well as execution depends on 
configurable strategy. These strategies can be
+broadly classified into three types: clustering plan strategy, execution 
strategy and update strategy.
+
+### Plan Strategy
+
+This strategy comes into play while creating clustering plan. It helps to 
decide what file groups should be clustered.
+Let's look at different plan strategies that are available with Hudi. Note 
that these strategies are easily pluggable
+using this 
[config](https://hudi.apache.org/docs/next/configurations#hoodieclusteringplanstrategyclass).
+
+1. `SparkSizeBasedClusteringPlanStrategy`: It selects file slices based on
+   the [small file 
limit](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategysmallfilelimit)
+   of base files and creates clustering groups upto max file size allowed per 
group. The max size can be specified using
+   this 
[config.](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategymaxbytespergroup)
+2. `SparkRecentDaysClusteringPlanStrategy`: It looks back previous 'N' days 
partitions and creates a plan that will
+   cluster the 'small' file slices within those partitions. This is the 
default strategy.
+3. `SparkSelectedPartitionsClusteringPlanStrategy`: In case you want to 
cluster only specific partitions within a range,
+   no matter how old or new are those partitions, then this strategy could be 
useful. To use this partition, one needs
+   to set below two configs additionally (both begin and end partitions are 
inclusive):
+
+```
+hoodie.clustering.plan.strategy.cluster.begin.partition
+hoodie.clustering.plan.strategy.cluster.end.partition
+```
+
+**NOTE**: All the strategies are partition-aware and the latter two are still 
bound by the size limits of the first

Review comment:
       MD tip. use note as follows. it will render well. 
   :::note
   content for note. 
   :::

##########
File path: website/blog/2021-08-23-async-clustering.md
##########
@@ -0,0 +1,153 @@
+---
+title: "Asynchronous Clustering using Hudi"
+excerpt: "How to setup Hudi for asynchronous clustering"
+author: codope 
+category: blog
+---
+
+In one of the [previous 
blog](https://hudi.apache.org/blog/2021/01/27/hudi-clustering-intro) posts, we 
introduced a new
+kind of table service called clustering to reorganize data for improved query 
performance without compromising on
+ingestion speed. We learnt how to setup inline clustering. In this post, we 
will discuss what has changed since then and
+see how asynchronous clustering can be setup using HoodieClusteringJob as well 
as DeltaStreamer utility.
+
+## Introduction
+
+On a high level, clustering creates a plan based on a configurable strategy, 
groups eligible files based on specific
+criteria and then executes the plan. Hudi's [MVCC 
model](https://hudi.apache.org/docs/concurrency_control) provides
+snapshot isolation between multiple table services, which allows writers to 
continue with ingestion while clustering
+runs in the background. For a more detailed overview of the clustering 
architecture please check out the previous blog
+post.
+
+## Clustering Strategies
+
+As mentioned before, clustering plan as well as execution depends on 
configurable strategy. These strategies can be
+broadly classified into three types: clustering plan strategy, execution 
strategy and update strategy.
+
+### Plan Strategy
+
+This strategy comes into play while creating clustering plan. It helps to 
decide what file groups should be clustered.
+Let's look at different plan strategies that are available with Hudi. Note 
that these strategies are easily pluggable
+using this 
[config](https://hudi.apache.org/docs/next/configurations#hoodieclusteringplanstrategyclass).
+
+1. `SparkSizeBasedClusteringPlanStrategy`: It selects file slices based on
+   the [small file 
limit](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategysmallfilelimit)
+   of base files and creates clustering groups upto max file size allowed per 
group. The max size can be specified using
+   this 
[config.](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategymaxbytespergroup)
+2. `SparkRecentDaysClusteringPlanStrategy`: It looks back previous 'N' days 
partitions and creates a plan that will
+   cluster the 'small' file slices within those partitions. This is the 
default strategy.
+3. `SparkSelectedPartitionsClusteringPlanStrategy`: In case you want to 
cluster only specific partitions within a range,
+   no matter how old or new are those partitions, then this strategy could be 
useful. To use this partition, one needs
+   to set below two configs additionally (both begin and end partitions are 
inclusive):
+
+```
+hoodie.clustering.plan.strategy.cluster.begin.partition
+hoodie.clustering.plan.strategy.cluster.end.partition
+```
+
+**NOTE**: All the strategies are partition-aware and the latter two are still 
bound by the size limits of the first
+strategy.
+
+### Execution Strategy
+
+After building the clustering groups in the planning phase, Hudi applies 
execution strategy, for each group, primarily
+based on sort columns and size. The strategy can be specified using
+this 
[config](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringexecutionstrategyclass).
+
+`SparkSortAndSizeExecutionStrategy` is the default strategy. Users can specify 
the columns to sort the data by, when
+clustering using
+this 
[config](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategysortcolumns).
 Apart from
+that, we can also set [max file 
size](https://hudi.apache.org/docs/next/configurations/#hoodieparquetmaxfilesize)
+for the parquet files produced due to clustering. The strategy uses bulk 
insert to write data into new files, in which
+case, Hudi implicitly uses a partitioner that does sorting based on specified 
columns. In this way, the strategy changes
+the data layout in a way that not only improves query performance but also 
balance rewrite overhead automatically.
+
+Now this strategy can be executed either as a single spark job or multiple 
jobs depending on number of clustering groups
+created in the planning phase. By default, Hudi will submit multiple spark 
jobs and union the results. In case you want
+to force Hudi to use single spark job, set the execution strategy
+class 
[config](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringexecutionstrategyclass)
+to `SingleSparkJobExecutionStrategy`.
+
+### Update Strategy
+
+Currently, clustering can only be scheduled for tables/partitions not 
receiving any concurrent updates. By default,
+the [config for update 
strategy](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringupdatesstrategy)
 is
+set to ***SparkRejectUpdateStrategy***. However, in some use-cases updates are 
very sparse and the default strategy to
+simply reject updates and throw an error does not seem fair. In such 
use-cases, users can set the config to ***
+SparkAllowUpdateStrategy***.
+
+We discussed the critical strategy configurations. All other configurations 
related to clustering are
+listed 
[here](https://hudi.apache.org/docs/next/configurations/#Clustering-Configs). 
Out of this list, a few
+configurations that will be very useful are:
+
+|  Config key  | Remarks | Default |
+|  -----------  | -------  | ------- |
+| `hoodie.clustering.async.enabled` | Enable running of clustering service, 
asynchronously as inserts happen on the table. | False |

Review comment:
       minor. "... async as **writes** happen to hudi table"

##########
File path: website/blog/2021-08-23-async-clustering.md
##########
@@ -0,0 +1,153 @@
+---
+title: "Asynchronous Clustering using Hudi"
+excerpt: "How to setup Hudi for asynchronous clustering"
+author: codope 
+category: blog
+---
+
+In one of the [previous 
blog](https://hudi.apache.org/blog/2021/01/27/hudi-clustering-intro) posts, we 
introduced a new
+kind of table service called clustering to reorganize data for improved query 
performance without compromising on
+ingestion speed. We learnt how to setup inline clustering. In this post, we 
will discuss what has changed since then and
+see how asynchronous clustering can be setup using HoodieClusteringJob as well 
as DeltaStreamer utility.
+
+## Introduction
+
+On a high level, clustering creates a plan based on a configurable strategy, 
groups eligible files based on specific
+criteria and then executes the plan. Hudi's [MVCC 
model](https://hudi.apache.org/docs/concurrency_control) provides
+snapshot isolation between multiple table services, which allows writers to 
continue with ingestion while clustering
+runs in the background. For a more detailed overview of the clustering 
architecture please check out the previous blog
+post.
+
+## Clustering Strategies
+
+As mentioned before, clustering plan as well as execution depends on 
configurable strategy. These strategies can be
+broadly classified into three types: clustering plan strategy, execution 
strategy and update strategy.
+
+### Plan Strategy
+
+This strategy comes into play while creating clustering plan. It helps to 
decide what file groups should be clustered.
+Let's look at different plan strategies that are available with Hudi. Note 
that these strategies are easily pluggable
+using this 
[config](https://hudi.apache.org/docs/next/configurations#hoodieclusteringplanstrategyclass).
+
+1. `SparkSizeBasedClusteringPlanStrategy`: It selects file slices based on
+   the [small file 
limit](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategysmallfilelimit)
+   of base files and creates clustering groups upto max file size allowed per 
group. The max size can be specified using
+   this 
[config.](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategymaxbytespergroup)
+2. `SparkRecentDaysClusteringPlanStrategy`: It looks back previous 'N' days 
partitions and creates a plan that will
+   cluster the 'small' file slices within those partitions. This is the 
default strategy.
+3. `SparkSelectedPartitionsClusteringPlanStrategy`: In case you want to 
cluster only specific partitions within a range,
+   no matter how old or new are those partitions, then this strategy could be 
useful. To use this partition, one needs
+   to set below two configs additionally (both begin and end partitions are 
inclusive):
+
+```
+hoodie.clustering.plan.strategy.cluster.begin.partition
+hoodie.clustering.plan.strategy.cluster.end.partition
+```
+
+**NOTE**: All the strategies are partition-aware and the latter two are still 
bound by the size limits of the first
+strategy.
+
+### Execution Strategy
+
+After building the clustering groups in the planning phase, Hudi applies 
execution strategy, for each group, primarily
+based on sort columns and size. The strategy can be specified using
+this 
[config](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringexecutionstrategyclass).
+
+`SparkSortAndSizeExecutionStrategy` is the default strategy. Users can specify 
the columns to sort the data by, when
+clustering using
+this 
[config](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategysortcolumns).
 Apart from
+that, we can also set [max file 
size](https://hudi.apache.org/docs/next/configurations/#hoodieparquetmaxfilesize)
+for the parquet files produced due to clustering. The strategy uses bulk 
insert to write data into new files, in which
+case, Hudi implicitly uses a partitioner that does sorting based on specified 
columns. In this way, the strategy changes
+the data layout in a way that not only improves query performance but also 
balance rewrite overhead automatically.
+
+Now this strategy can be executed either as a single spark job or multiple 
jobs depending on number of clustering groups
+created in the planning phase. By default, Hudi will submit multiple spark 
jobs and union the results. In case you want
+to force Hudi to use single spark job, set the execution strategy
+class 
[config](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringexecutionstrategyclass)
+to `SingleSparkJobExecutionStrategy`.
+
+### Update Strategy
+
+Currently, clustering can only be scheduled for tables/partitions not 
receiving any concurrent updates. By default,
+the [config for update 
strategy](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringupdatesstrategy)
 is
+set to ***SparkRejectUpdateStrategy***. However, in some use-cases updates are 
very sparse and the default strategy to
+simply reject updates and throw an error does not seem fair. In such 
use-cases, users can set the config to ***
+SparkAllowUpdateStrategy***.
+
+We discussed the critical strategy configurations. All other configurations 
related to clustering are
+listed 
[here](https://hudi.apache.org/docs/next/configurations/#Clustering-Configs). 
Out of this list, a few
+configurations that will be very useful are:
+
+|  Config key  | Remarks | Default |
+|  -----------  | -------  | ------- |
+| `hoodie.clustering.async.enabled` | Enable running of clustering service, 
asynchronously as inserts happen on the table. | False |
+| `hoodie.clustering.async.max.commits` | Control frequency of async 
clustering by specifying after how many commits clustering should be triggered. 
| 4 |
+| `hoodie.clustering.preserve.commit.metadata` | When rewriting data, 
preserves existing _hoodie_commit_time. This means users can run incremental 
queries on clustered data without any side-effects. | False |

Review comment:
       just curious on why the default value is false. I was expecting to be 
true. 

##########
File path: website/blog/2021-08-23-async-clustering.md
##########
@@ -0,0 +1,153 @@
+---
+title: "Asynchronous Clustering using Hudi"
+excerpt: "How to setup Hudi for asynchronous clustering"
+author: codope 
+category: blog
+---
+
+In one of the [previous 
blog](https://hudi.apache.org/blog/2021/01/27/hudi-clustering-intro) posts, we 
introduced a new
+kind of table service called clustering to reorganize data for improved query 
performance without compromising on
+ingestion speed. We learnt how to setup inline clustering. In this post, we 
will discuss what has changed since then and
+see how asynchronous clustering can be setup using HoodieClusteringJob as well 
as DeltaStreamer utility.
+
+## Introduction
+
+On a high level, clustering creates a plan based on a configurable strategy, 
groups eligible files based on specific
+criteria and then executes the plan. Hudi's [MVCC 
model](https://hudi.apache.org/docs/concurrency_control) provides
+snapshot isolation between multiple table services, which allows writers to 
continue with ingestion while clustering
+runs in the background. For a more detailed overview of the clustering 
architecture please check out the previous blog
+post.
+
+## Clustering Strategies
+
+As mentioned before, clustering plan as well as execution depends on 
configurable strategy. These strategies can be
+broadly classified into three types: clustering plan strategy, execution 
strategy and update strategy.
+
+### Plan Strategy
+
+This strategy comes into play while creating clustering plan. It helps to 
decide what file groups should be clustered.
+Let's look at different plan strategies that are available with Hudi. Note 
that these strategies are easily pluggable
+using this 
[config](https://hudi.apache.org/docs/next/configurations#hoodieclusteringplanstrategyclass).
+
+1. `SparkSizeBasedClusteringPlanStrategy`: It selects file slices based on
+   the [small file 
limit](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategysmallfilelimit)
+   of base files and creates clustering groups upto max file size allowed per 
group. The max size can be specified using
+   this 
[config.](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategymaxbytespergroup)
+2. `SparkRecentDaysClusteringPlanStrategy`: It looks back previous 'N' days 
partitions and creates a plan that will
+   cluster the 'small' file slices within those partitions. This is the 
default strategy.
+3. `SparkSelectedPartitionsClusteringPlanStrategy`: In case you want to 
cluster only specific partitions within a range,
+   no matter how old or new are those partitions, then this strategy could be 
useful. To use this partition, one needs
+   to set below two configs additionally (both begin and end partitions are 
inclusive):
+
+```
+hoodie.clustering.plan.strategy.cluster.begin.partition
+hoodie.clustering.plan.strategy.cluster.end.partition
+```
+
+**NOTE**: All the strategies are partition-aware and the latter two are still 
bound by the size limits of the first
+strategy.
+
+### Execution Strategy
+
+After building the clustering groups in the planning phase, Hudi applies 
execution strategy, for each group, primarily
+based on sort columns and size. The strategy can be specified using
+this 
[config](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringexecutionstrategyclass).
+
+`SparkSortAndSizeExecutionStrategy` is the default strategy. Users can specify 
the columns to sort the data by, when
+clustering using
+this 
[config](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategysortcolumns).
 Apart from
+that, we can also set [max file 
size](https://hudi.apache.org/docs/next/configurations/#hoodieparquetmaxfilesize)

Review comment:
       How does hoodieparquetmaxfilesize differ from 
hoodieclusteringplanstrategymaxbytespergroup? ideally both should go hand in 
hand right? or am I missing something here ? Or in other words, why would 
someone set different values for both these configs? 

##########
File path: website/blog/2021-08-23-async-clustering.md
##########
@@ -0,0 +1,153 @@
+---
+title: "Asynchronous Clustering using Hudi"
+excerpt: "How to setup Hudi for asynchronous clustering"
+author: codope 
+category: blog
+---
+
+In one of the [previous 
blog](https://hudi.apache.org/blog/2021/01/27/hudi-clustering-intro) posts, we 
introduced a new
+kind of table service called clustering to reorganize data for improved query 
performance without compromising on
+ingestion speed. We learnt how to setup inline clustering. In this post, we 
will discuss what has changed since then and
+see how asynchronous clustering can be setup using HoodieClusteringJob as well 
as DeltaStreamer utility.
+
+## Introduction
+
+On a high level, clustering creates a plan based on a configurable strategy, 
groups eligible files based on specific
+criteria and then executes the plan. Hudi's [MVCC 
model](https://hudi.apache.org/docs/concurrency_control) provides
+snapshot isolation between multiple table services, which allows writers to 
continue with ingestion while clustering
+runs in the background. For a more detailed overview of the clustering 
architecture please check out the previous blog
+post.
+
+## Clustering Strategies
+
+As mentioned before, clustering plan as well as execution depends on 
configurable strategy. These strategies can be
+broadly classified into three types: clustering plan strategy, execution 
strategy and update strategy.
+
+### Plan Strategy
+
+This strategy comes into play while creating clustering plan. It helps to 
decide what file groups should be clustered.
+Let's look at different plan strategies that are available with Hudi. Note 
that these strategies are easily pluggable
+using this 
[config](https://hudi.apache.org/docs/next/configurations#hoodieclusteringplanstrategyclass).
+
+1. `SparkSizeBasedClusteringPlanStrategy`: It selects file slices based on
+   the [small file 
limit](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategysmallfilelimit)
+   of base files and creates clustering groups upto max file size allowed per 
group. The max size can be specified using
+   this 
[config.](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategymaxbytespergroup)
+2. `SparkRecentDaysClusteringPlanStrategy`: It looks back previous 'N' days 
partitions and creates a plan that will
+   cluster the 'small' file slices within those partitions. This is the 
default strategy.
+3. `SparkSelectedPartitionsClusteringPlanStrategy`: In case you want to 
cluster only specific partitions within a range,
+   no matter how old or new are those partitions, then this strategy could be 
useful. To use this partition, one needs
+   to set below two configs additionally (both begin and end partitions are 
inclusive):
+
+```
+hoodie.clustering.plan.strategy.cluster.begin.partition
+hoodie.clustering.plan.strategy.cluster.end.partition
+```
+
+**NOTE**: All the strategies are partition-aware and the latter two are still 
bound by the size limits of the first
+strategy.
+
+### Execution Strategy
+
+After building the clustering groups in the planning phase, Hudi applies 
execution strategy, for each group, primarily
+based on sort columns and size. The strategy can be specified using
+this 
[config](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringexecutionstrategyclass).
+
+`SparkSortAndSizeExecutionStrategy` is the default strategy. Users can specify 
the columns to sort the data by, when
+clustering using
+this 
[config](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategysortcolumns).
 Apart from
+that, we can also set [max file 
size](https://hudi.apache.org/docs/next/configurations/#hoodieparquetmaxfilesize)
+for the parquet files produced due to clustering. The strategy uses bulk 
insert to write data into new files, in which
+case, Hudi implicitly uses a partitioner that does sorting based on specified 
columns. In this way, the strategy changes
+the data layout in a way that not only improves query performance but also 
balance rewrite overhead automatically.
+
+Now this strategy can be executed either as a single spark job or multiple 
jobs depending on number of clustering groups
+created in the planning phase. By default, Hudi will submit multiple spark 
jobs and union the results. In case you want
+to force Hudi to use single spark job, set the execution strategy
+class 
[config](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringexecutionstrategyclass)
+to `SingleSparkJobExecutionStrategy`.
+
+### Update Strategy
+
+Currently, clustering can only be scheduled for tables/partitions not 
receiving any concurrent updates. By default,
+the [config for update 
strategy](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringupdatesstrategy)
 is
+set to ***SparkRejectUpdateStrategy***. However, in some use-cases updates are 
very sparse and the default strategy to
+simply reject updates and throw an error does not seem fair. In such 
use-cases, users can set the config to ***
+SparkAllowUpdateStrategy***.
+
+We discussed the critical strategy configurations. All other configurations 
related to clustering are
+listed 
[here](https://hudi.apache.org/docs/next/configurations/#Clustering-Configs). 
Out of this list, a few
+configurations that will be very useful are:
+
+|  Config key  | Remarks | Default |
+|  -----------  | -------  | ------- |
+| `hoodie.clustering.async.enabled` | Enable running of clustering service, 
asynchronously as inserts happen on the table. | False |
+| `hoodie.clustering.async.max.commits` | Control frequency of async 
clustering by specifying after how many commits clustering should be triggered. 
| 4 |
+| `hoodie.clustering.preserve.commit.metadata` | When rewriting data, 
preserves existing _hoodie_commit_time. This means users can run incremental 
queries on clustered data without any side-effects. | False |
+
+## Setup Asynchronous Clustering
+
+Previously, we have seen how users
+can [setup inline 
clustering](https://hudi.apache.org/blog/2021/01/27/hudi-clustering-intro#setting-up-clustering).
+Additionally, users can
+leverage 
[HoodieClusteringJob](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance#RFC19Clusteringdataforfreshnessandqueryperformance-SetupforAsyncclusteringJob)
+to setup 2-step asynchronous clustering.
+
+### HoodieClusteringJob
+
+With the release of Hudi version 0.9.0, we can schedule as well as execute 
clustering in the same step. We just need to
+specify the `—mode` or `-m` option. There are three modes:
+
+1. `schedule`: Make a clustering plan. This gives an instant which can be 
passed in execute mode.
+2. `execute`: Execute a clustering plan at given instant which means 
--instant-time is required here.
+3. `scheduleAndExecute`: Make a clustering plan first and execute that plan 
immediately.
+
+A sample spark-submit command to setup HoodieClusteringJob is as below:
+
+```bash
+spark-submit \
+--class org.apache.hudi.utilities.HoodieClusteringJob \
+/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.12-0.9.0-SNAPSHOT.jar
 \
+--props /path/to/config/clusteringjob.properties \
+--mode scheduleAndExecute \
+--base-path /path/to/hudi_table/basePath \
+--table-name hudi_table_schedule_clustering \
+--spark-memory 1g
+```
+
+### HoodieDeltaStreamer
+
+This brings us to our users' favorite utility in Hudi. Now, we can trigger 
asynchronous clustering with DeltaStreamer.
+Just set the `hoodie.clustering.async.enabled` config to true and specify 
other clustering config in properties file
+whose location can be pased as `—props` when starting the deltastreamer (just 
like in the case of HoodieClusteringJob).
+
+A sample spark-submit command to setup HoodieDeltaStreamer is as below:
+
+```bash
+spark-submit \
+--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
+/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.12-0.9.0-SNAPSHOT.jar
 \
+--props /path/to/config/clustering_kafka.properties \
+--schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider 
\
+--source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
+--source-ordering-field impresssiontime \
+--table-type COPY_ON_WRITE \
+--target-base-path /path/to/hudi_table/basePath \
+--target-table impressions_cow_cluster \
+--op INSERT \
+--hoodie-conf hoodie.clustering.async.enabled=true

Review comment:
       Do you know if we can bolden a particular line within bash code snippet? 
if yes, can you highlight the ones required for clustering in this spark-submit 
job command.

##########
File path: website/blog/2021-08-23-async-clustering.md
##########
@@ -0,0 +1,153 @@
+---
+title: "Asynchronous Clustering using Hudi"
+excerpt: "How to setup Hudi for asynchronous clustering"
+author: codope 
+category: blog
+---
+
+In one of the [previous 
blog](https://hudi.apache.org/blog/2021/01/27/hudi-clustering-intro) posts, we 
introduced a new
+kind of table service called clustering to reorganize data for improved query 
performance without compromising on
+ingestion speed. We learnt how to setup inline clustering. In this post, we 
will discuss what has changed since then and
+see how asynchronous clustering can be setup using HoodieClusteringJob as well 
as DeltaStreamer utility.
+
+## Introduction
+
+On a high level, clustering creates a plan based on a configurable strategy, 
groups eligible files based on specific
+criteria and then executes the plan. Hudi's [MVCC 
model](https://hudi.apache.org/docs/concurrency_control) provides
+snapshot isolation between multiple table services, which allows writers to 
continue with ingestion while clustering
+runs in the background. For a more detailed overview of the clustering 
architecture please check out the previous blog
+post.
+
+## Clustering Strategies
+
+As mentioned before, clustering plan as well as execution depends on 
configurable strategy. These strategies can be
+broadly classified into three types: clustering plan strategy, execution 
strategy and update strategy.
+
+### Plan Strategy
+
+This strategy comes into play while creating clustering plan. It helps to 
decide what file groups should be clustered.
+Let's look at different plan strategies that are available with Hudi. Note 
that these strategies are easily pluggable
+using this 
[config](https://hudi.apache.org/docs/next/configurations#hoodieclusteringplanstrategyclass).
+
+1. `SparkSizeBasedClusteringPlanStrategy`: It selects file slices based on
+   the [small file 
limit](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategysmallfilelimit)
+   of base files and creates clustering groups upto max file size allowed per 
group. The max size can be specified using
+   this 
[config.](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategymaxbytespergroup)
+2. `SparkRecentDaysClusteringPlanStrategy`: It looks back previous 'N' days 
partitions and creates a plan that will
+   cluster the 'small' file slices within those partitions. This is the 
default strategy.
+3. `SparkSelectedPartitionsClusteringPlanStrategy`: In case you want to 
cluster only specific partitions within a range,
+   no matter how old or new are those partitions, then this strategy could be 
useful. To use this partition, one needs
+   to set below two configs additionally (both begin and end partitions are 
inclusive):
+
+```
+hoodie.clustering.plan.strategy.cluster.begin.partition
+hoodie.clustering.plan.strategy.cluster.end.partition
+```
+
+**NOTE**: All the strategies are partition-aware and the latter two are still 
bound by the size limits of the first
+strategy.
+
+### Execution Strategy
+
+After building the clustering groups in the planning phase, Hudi applies 
execution strategy, for each group, primarily
+based on sort columns and size. The strategy can be specified using
+this 
[config](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringexecutionstrategyclass).
+
+`SparkSortAndSizeExecutionStrategy` is the default strategy. Users can specify 
the columns to sort the data by, when

Review comment:
       as of now, we have only one execution strategy is it ? 

##########
File path: website/blog/2021-08-23-async-clustering.md
##########
@@ -0,0 +1,153 @@
+---
+title: "Asynchronous Clustering using Hudi"
+excerpt: "How to setup Hudi for asynchronous clustering"
+author: codope 
+category: blog
+---
+
+In one of the [previous 
blog](https://hudi.apache.org/blog/2021/01/27/hudi-clustering-intro) posts, we 
introduced a new
+kind of table service called clustering to reorganize data for improved query 
performance without compromising on
+ingestion speed. We learnt how to setup inline clustering. In this post, we 
will discuss what has changed since then and
+see how asynchronous clustering can be setup using HoodieClusteringJob as well 
as DeltaStreamer utility.
+
+## Introduction
+
+On a high level, clustering creates a plan based on a configurable strategy, 
groups eligible files based on specific
+criteria and then executes the plan. Hudi's [MVCC 
model](https://hudi.apache.org/docs/concurrency_control) provides
+snapshot isolation between multiple table services, which allows writers to 
continue with ingestion while clustering
+runs in the background. For a more detailed overview of the clustering 
architecture please check out the previous blog
+post.
+
+## Clustering Strategies
+
+As mentioned before, clustering plan as well as execution depends on 
configurable strategy. These strategies can be
+broadly classified into three types: clustering plan strategy, execution 
strategy and update strategy.
+
+### Plan Strategy
+
+This strategy comes into play while creating clustering plan. It helps to 
decide what file groups should be clustered.
+Let's look at different plan strategies that are available with Hudi. Note 
that these strategies are easily pluggable
+using this 
[config](https://hudi.apache.org/docs/next/configurations#hoodieclusteringplanstrategyclass).
+
+1. `SparkSizeBasedClusteringPlanStrategy`: It selects file slices based on
+   the [small file 
limit](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategysmallfilelimit)
+   of base files and creates clustering groups upto max file size allowed per 
group. The max size can be specified using
+   this 
[config.](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategymaxbytespergroup)
+2. `SparkRecentDaysClusteringPlanStrategy`: It looks back previous 'N' days 
partitions and creates a plan that will
+   cluster the 'small' file slices within those partitions. This is the 
default strategy.
+3. `SparkSelectedPartitionsClusteringPlanStrategy`: In case you want to 
cluster only specific partitions within a range,
+   no matter how old or new are those partitions, then this strategy could be 
useful. To use this partition, one needs
+   to set below two configs additionally (both begin and end partitions are 
inclusive):
+
+```
+hoodie.clustering.plan.strategy.cluster.begin.partition
+hoodie.clustering.plan.strategy.cluster.end.partition
+```
+
+**NOTE**: All the strategies are partition-aware and the latter two are still 
bound by the size limits of the first
+strategy.
+
+### Execution Strategy
+
+After building the clustering groups in the planning phase, Hudi applies 
execution strategy, for each group, primarily
+based on sort columns and size. The strategy can be specified using
+this 
[config](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringexecutionstrategyclass).
+
+`SparkSortAndSizeExecutionStrategy` is the default strategy. Users can specify 
the columns to sort the data by, when
+clustering using
+this 
[config](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategysortcolumns).
 Apart from
+that, we can also set [max file 
size](https://hudi.apache.org/docs/next/configurations/#hoodieparquetmaxfilesize)
+for the parquet files produced due to clustering. The strategy uses bulk 
insert to write data into new files, in which
+case, Hudi implicitly uses a partitioner that does sorting based on specified 
columns. In this way, the strategy changes
+the data layout in a way that not only improves query performance but also 
balance rewrite overhead automatically.
+
+Now this strategy can be executed either as a single spark job or multiple 
jobs depending on number of clustering groups
+created in the planning phase. By default, Hudi will submit multiple spark 
jobs and union the results. In case you want
+to force Hudi to use single spark job, set the execution strategy
+class 
[config](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringexecutionstrategyclass)
+to `SingleSparkJobExecutionStrategy`.
+
+### Update Strategy
+
+Currently, clustering can only be scheduled for tables/partitions not 
receiving any concurrent updates. By default,
+the [config for update 
strategy](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringupdatesstrategy)
 is
+set to ***SparkRejectUpdateStrategy***. However, in some use-cases updates are 
very sparse and the default strategy to

Review comment:
       Can we add a line to explain what is SparkRejectUpdateStrategy. Next 
line implicitly talks about it, but thats in the context of 
SparkAllowUpdateStrategy.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to