(hudi) branch asf-site updated: docs: Update documentation for new features in Hudi 1.2.0 (#18867)

yihua Thu, 28 May 2026 09:28:09 -0700

This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git



The following commit(s) were added to refs/heads/asf-site by this push:
     new e3a28fbdaf94 docs: Update documentation for new features in Hudi 1.2.0 
(#18867)
e3a28fbdaf94 is described below

commit e3a28fbdaf946a6a5cc0f46229dac9d687f2aa99
Author: Y Ethan Guo <[email protected]>
AuthorDate: Thu May 28 09:27:50 2026 -0700

    docs: Update documentation for new features in Hudi 1.2.0 (#18867)
---
 website/docs/ai_overview.md                    |   4 +-
 website/docs/azure_hoodie.md                   |  14 +-
 website/docs/blob_unstructured_data.md         |  45 ++++-
 website/docs/cleaning.md                       |  47 +++++
 website/docs/cli.md                            |  25 ++-
 website/docs/clustering.md                     |  59 +++++-
 website/docs/compaction.md                     |  13 +-
 website/docs/concurrency_control.md            |  42 ++++-
 website/docs/deployment.md                     |   4 +-
 website/docs/flink-quick-start-guide.md        |  33 ++--
 website/docs/flink_tuning.md                   |  47 +++++
 website/docs/hoodie_streaming_ingestion.md     | 106 ++++++++++-
 website/docs/ingestion_flink.md                | 244 +++++++++++++++++++++++--
 website/docs/key_generation.md                 |  31 +++-
 website/docs/lance_file_format.md              | 129 +++++++++++--
 website/docs/metadata.md                       |  31 ++++
 website/docs/metadata_indexing.md              |  33 +++-
 website/docs/metrics.md                        |  67 ++++---
 website/docs/overview.mdx                      |   4 +-
 website/docs/precommit_validator.md            |  67 +++++++
 website/docs/procedures.md                     |  60 +++++-
 website/docs/reading_tables_batch_reads.md     |  26 +++
 website/docs/reading_tables_streaming_reads.md |  44 +++++
 website/docs/sql_ddl.md                        |  55 +++++-
 website/docs/sql_queries.md                    |  33 +++-
 website/docs/syncing_aws_glue_data_catalog.md  |   4 +
 website/docs/syncing_metastore.md              |  41 +++++
 website/docs/variant_type.md                   |  26 ++-
 website/docs/vector_search.md                  |  42 ++++-
 website/docs/writing_data.md                   |  29 ++-
 30 files changed, 1293 insertions(+), 112 deletions(-)

diff --git a/website/docs/ai_overview.md b/website/docs/ai_overview.md
index 11c695d1bf9c..c28bcae6543a 100644
--- a/website/docs/ai_overview.md
+++ b/website/docs/ai_overview.md
@@ -18,7 +18,7 @@ Apache Hudi's AI-native capabilities bring this vision to 
life with four foundat
 
 ### VECTOR Type and Similarity Search
 
-Store high-dimensional embedding vectors as first-class column types and run 
approximate nearest neighbor (ANN)
+Store high-dimensional embedding vectors as first-class column types and run 
vector similarity
 search directly in Spark SQL.
 
 ```sql
@@ -99,7 +99,7 @@ query performance, while keeping the flexibility for 
everything else.
 Hudi's pluggable file format architecture supports **Lance**, a modern 
columnar format purpose-built for
 AI/ML workloads. Lance provides:
 
-- Efficient vector indexing and ANN search
+- Native vector column encoding (`FixedSizeList`) — no conversion overhead at 
the file-format layer
 - Fast random access for training data sampling
 - Optimized storage for high-dimensional arrays and nested structures
 
diff --git a/website/docs/azure_hoodie.md b/website/docs/azure_hoodie.md
index a22d66598141..2730f6f9411a 100644
--- a/website/docs/azure_hoodie.md
+++ b/website/docs/azure_hoodie.md
@@ -2,7 +2,7 @@
 title: Microsoft Azure
 keywords: [ hudi, hive, azure, spark, presto]
 summary: In this page, we go over how to configure Hudi with Azure filesystem.
-last_modified_at: 2020-05-25T19:00:57-04:00
+last_modified_at: 2026-05-27T00:00:00-00:00
 ---
 In this page, we explain how to use Hudi on Microsoft Azure.
 
@@ -49,6 +49,18 @@ This combination works out of the box. No extra config 
needed.
     .load("/mountpoint/hudi-tables/customer")
   ```
 
+## Concurrency Control
+
+As of Hudi 1.2.0, the storage-based lock provider supports Azure ADLS Gen2 
(`abfs://`, `abfss://`) and Azure Blob Storage (`wasb://`, `wasbs://`) base 
paths for concurrency control. This allows multi-writer pipelines on Azure to 
use storage-native conditional writes for locking, without requiring external 
systems such as ZooKeeper or Hive Metastore.
+
+Add `hudi-azure-bundle` to your classpath and set:
+
+```properties
+hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.StorageBasedLockProvider
+```
+
+The lock client supports multiple Azure authentication methods (connection 
string, SAS token, managed identity, service principal, and 
`DefaultAzureCredential`). See [Concurrency Control — Azure Storage-Based 
Lock](concurrency_control.md#azure-storage-based-lock) for the full 
configuration reference and authentication precedence.
+
 ## Related Resources
 
 <h3>Blogs</h3>
diff --git a/website/docs/blob_unstructured_data.md 
b/website/docs/blob_unstructured_data.md
index 74acc799a230..253656565084 100644
--- a/website/docs/blob_unstructured_data.md
+++ b/website/docs/blob_unstructured_data.md
@@ -3,7 +3,7 @@ title: "Unstructured Data"
 keywords: [ hudi, blob, unstructured data, images, binary, pdf, audio, video, 
inline, out-of-line, read_blob]
 summary: "Store and query unstructured data (images, PDFs, audio, video) in 
Hudi tables using the BLOB type with inline or out-of-line storage"
 toc: true
-last_modified_at: 2026-04-25T00:00:00-00:00
+last_modified_at: 2026-05-27T00:00:00-00:00
 ---
 
 import Tabs from '@theme/Tabs';
@@ -72,6 +72,7 @@ schema = pa.schema([
             pa.field("external_path", pa.string()),
             pa.field("offset",        pa.int64()),
             pa.field("length",        pa.int64()),
+            pa.field("managed",       pa.bool_()),
         ])),
     ]), metadata={b"hudi_type": b"BLOB"}),
 ])
@@ -84,7 +85,7 @@ The BLOB internal structure is a struct with three fields:
   - `external_path` — file path for out-of-line data
   - `offset` — byte offset in the file (null means read from start)
   - `length` — byte length to read (null means read to end of file)
-  - `managed` — boolean indicating whether Hudi manages the external file
+  - `managed` — boolean. Only meaningful for `OUT_OF_LINE` blobs. Marks 
whether Hudi owns the lifecycle of the referenced external file. **Not consumed 
by the cleaner yet** — set the value to record intent, and a future cleaner 
implementation will use it: `true` → cleaner may delete the external file when 
the blob row is no longer referenced; `false` → cleaner will leave the external 
file in place.
 
 </TabItem>
 </Tabs>
@@ -112,7 +113,7 @@ INSERT INTO media_assets VALUES (
     named_struct(
         'type',      'INLINE',
         'data',      /* binary literal or column reference */,
-        'reference', CAST(NULL AS STRUCT<external_path: STRING, offset: 
BIGINT, length: BIGINT>)
+        'reference', CAST(NULL AS STRUCT<external_path: STRING, offset: 
BIGINT, length: BIGINT, managed: BOOLEAN>)
     )
 );
 ```
@@ -158,7 +159,8 @@ INSERT INTO media_assets VALUES (
         'reference', named_struct(
             'external_path', 's3://my-bucket/media/container_001.bin',
             'offset',        8388608,       -- byte offset in the container
-            'length',        1073741824     -- number of bytes
+            'length',        1073741824,    -- number of bytes
+            'managed',       false          -- intent flag; not consumed by 
the cleaner yet
         )
     )
 );
@@ -290,15 +292,44 @@ Out-of-line BLOBs keep the Hudi table footprint extremely 
small:
 
 | Property | Default | Description |
 |:---------|:--------|:------------|
-| `hoodie.read.blob.inline.mode` | `CONTENT` | Controls how INLINE BLOBs are 
read. `CONTENT` materializes raw bytes in the `data` column. `DESCRIPTOR` 
surfaces `(position, size)` coordinates rewritten as OUT_OF_LINE references. |
+| `hoodie.read.blob.inline.mode` | `DESCRIPTOR` | Controls how INLINE BLOBs 
are read. `DESCRIPTOR` (default) returns an out-of-line-shaped reference 
pointing at the in-file coordinates of the bytes — no bytes are materialized. 
`CONTENT` materializes the raw inline bytes directly in the `data` field on 
every read. |
 | `hoodie.blob.batching.max.gap.bytes` | `4096` | Maximum gap (in bytes) 
between consecutive byte ranges before they are merged into a single read. 
Larger values reduce I/O calls at the cost of reading some unused bytes. |
 | `hoodie.blob.batching.lookahead.size` | `50` | Number of rows to buffer for 
batch read detection. Larger values improve batching for sorted data but 
increase memory usage. |
 
 :::note
-DESCRIPTOR mode is only supported on Lance-backed tables. CONTENT mode is 
always used for internal
-operations (compaction, merge, log replay) regardless of this setting.
+`DESCRIPTOR` mode is the default for all storage formats including Lance. 
`CONTENT` mode is always
+used for internal operations (compaction, merge, log replay) regardless of 
this setting.
 :::
 
+:::caution Calling read_blob() on INLINE columns under DESCRIPTOR mode
+Under the default `DESCRIPTOR` mode, calling `read_blob()` on an INLINE BLOB 
column **throws** —
+the raw bytes are not materialized in the scan, so there is nothing for 
`read_blob()` to return.
+To read inline bytes with `read_blob()`, switch to `CONTENT` mode first:
+
+```sql
+SET hoodie.read.blob.inline.mode=CONTENT;
+SELECT asset_id, read_blob(content) AS raw_bytes
+FROM media_assets
+WHERE asset_id = 'asset_001';
+```
+
+This setting affects only INLINE columns — OUT_OF_LINE columns always fetch 
from the external path
+regardless of mode.
+:::
+
+## Metastore Sync
+
+When syncing BLOB column schemas to Hive or BigQuery, Hudi maps the BLOB 
struct to the target
+catalog's native struct type:
+
+| Catalog | BLOB representation |
+|:--------|:-------------------|
+| Hive | `STRUCT<type:STRING, data:BINARY, 
reference:STRUCT<external_path:STRING, offset:BIGINT, length:BIGINT, 
managed:BOOLEAN>>` |
+| BigQuery | Equivalent `STRUCT` fields |
+
+The raw binary payload is preserved in the struct representation, but 
`read_blob()` is a Spark SQL
+function and is not available in Hive or BigQuery directly.
+
 ## Best Practices
 
 1. **Choose the right mode** — Use inline for small, frequently-accessed 
objects. Use out-of-line for
diff --git a/website/docs/cleaning.md b/website/docs/cleaning.md
index c3498c8aa914..829d0d57cd6d 100644
--- a/website/docs/cleaning.md
+++ b/website/docs/cleaning.md
@@ -3,6 +3,7 @@ title: Cleaning
 toc: true
 toc_min_heading_level: 2
 toc_max_heading_level: 4
+last_modified_at: 2026-05-27T00:00:00-00:00
 ---
 ## Background
 Cleaning is a table service employed by Hudi to reclaim space occupied by 
older versions of data and keep storage costs 
@@ -50,6 +51,41 @@ Hudi cleaner currently supports the below cleaning policies 
to keep a certain nu
   be retained are cleaned. Currently you can configure by parameter 
[`hoodie.clean.hours.retained`](https://hudi.apache.org/docs/configurations/#hoodiecleanerhoursretained).
   The corresponding Flink related config is 
[`clean.retain_hours`](https://hudi.apache.org/docs/configurations/#cleanretain_hours).
 
+#### Empty Clean Commits for Append-Only Tables
+
+Append-only tables never accumulate updates, so the cleaner's 
`earliest_commit_to_retain` pointer never advances —
+causing the cleaner to scan the full table history on every run. Hudi 1.2.0 
introduced periodic _empty clean commits_
+to advance this pointer even when there is nothing to delete.
+
+| Config Name | Default | Description |
+|---|---|---|
+| `hoodie.write.empty.clean.interval.hours` | `-1` (disabled) | Interval in 
hours at which an empty clean commit is created. `-1` disables the feature. 
Must be `-1` or `>= 1`. When enabled, the cleaner advances 
`earliest_commit_to_retain` so that subsequent clean plans only scan partitions 
modified after the last empty clean's pointer. |
+
+#### Capping the Number of Commits Cleaned per Run
+
+Since 1.2.0, you can limit how many commits are cleaned in a single clean run, 
which is useful for controlling job
+duration on tables that have fallen significantly behind on cleaning.
+
+| Config Name | Default | Description |
+|---|---|---|
+| `hoodie.clean.max.commits.to.clean` | `Long.MAX_VALUE` (unbounded) | Maximum 
number of commits cleaned in a single clean commit. Applicable when the 
cleaning policy is `KEEP_LATEST_COMMITS` or `KEEP_LATEST_BY_HOURS`. Must be `>= 
1`. |
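+
+A minimal sketch of how the two settings above might be combined for an
+append-only table that has fallen behind on cleaning. The values `24` and
+`100` are illustrative assumptions, not recommendations:
+
+```properties
+# Assumed value: create an empty clean commit at most once every 24 hours
+hoodie.write.empty.clean.interval.hours=24
+# Assumed value: clean at most 100 commits per clean run
+# (applies to KEEP_LATEST_COMMITS and KEEP_LATEST_BY_HOURS)
+hoodie.clean.max.commits.to.clean=100
+```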
+
+#### Full-Clean Partition Filtering
+
+When incremental cleaning is disabled 
(`hoodie.clean.incremental.enabled=false`), the cleaner scans every partition on
+every run. For very large tables this can cause OOM during planning. Hudi 
1.2.0 added two configs to restrict which
+partitions are examined.
+
+:::note
+Both configs require `hoodie.clean.incremental.enabled=false`. If both are 
set, `hoodie.clean.partition.filter.selected`
+takes precedence over the regex.
+:::
+
+| Config Name | Default | Description |
+|---|---|---|
+| `hoodie.clean.partition.filter.regex` | (none) | Java regex pattern; only 
partitions whose path matches are cleaned. |
+| `hoodie.clean.partition.filter.selected` | (none) | Comma-separated list of 
partition paths to clean; takes precedence over the regex when both are set. |
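+
+As a rough sketch, a full (non-incremental) clean restricted by regex could be
+configured as follows. The `yyyy/MM/dd` partition layout and the pattern are
+hypothetical:
+
+```properties
+# Both filter configs require incremental cleaning to be disabled
+hoodie.clean.incremental.enabled=false
+# Hypothetical pattern: only clean partitions under 2026/05
+hoodie.clean.partition.filter.regex=2026/05/.*
+# Alternative: an explicit list, which takes precedence over the regex
+# hoodie.clean.partition.filter.selected=2026/05/26,2026/05/27
+```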
+
 ### Configs
 For details about all possible configurations and their default values see the 
[configuration 
docs](https://hudi.apache.org/docs/next/configurations/#Clean-Configs).
 For Flink related configs refer 
[here](https://hudi.apache.org/docs/next/configurations/#FLINK_SQL).
@@ -76,6 +112,17 @@ hoodie.clean.async=true
 
 For Flink based writing, this is the default mode of cleaning. Please refer to 
[`clean.async.enabled`](https://hudi.apache.org/docs/configurations/#cleanasyncenabled)
 for details.
 
+#### Pre-Write Cleaner Policy
+
+By default the cleaner runs _after_ a write commits. Hudi 1.2.0 introduced 
`hoodie.prewrite.cleaner.policy`, which
+lets you force a clean (or rollback of failed writes) _before_ each write 
begins. This is useful in multi-writer
+deployments where you want a deterministic table state before every write — 
see [concurrency control](concurrency_control.md)
+for related multi-writer configuration.
+
+| Config Name | Default | Description |
+|---|---|---|
+| `hoodie.prewrite.cleaner.policy` | `NONE` | Pre-write cleaning action. 
`NONE`: no pre-write action (default). `CLEAN`: run a clean pass before each 
write — this also rolls back failed writes as part of the clean. 
`ROLLBACK_FAILED_WRITES`: only roll back any failed writes before each write, 
without running a full clean. |
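+
+For example, a writer that should only roll back failed writes before each
+write, without paying for a full clean, might set the following (a sketch, not
+a recommendation):
+
+```properties
+hoodie.prewrite.cleaner.policy=ROLLBACK_FAILED_WRITES
+```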
+
 #### Run independently
 Hoodie Cleaner can also be run as a separate process. Following is the command 
for running the cleaner independently:
 ```
diff --git a/website/docs/cli.md b/website/docs/cli.md
index a29ffdfc2637..ddb8132d3cf3 100644
--- a/website/docs/cli.md
+++ b/website/docs/cli.md
@@ -1,7 +1,7 @@
 ---
 title: CLI
 keywords: [hudi, cli]
-last_modified_at: 2021-08-18T15:59:57-04:00
+last_modified_at: 2026-05-27T00:00:00-00:00
 ---
 
 ### Local set up
@@ -340,6 +340,13 @@ $ hdfs dfs -ls /app/uber/trips/.hoodie/*.inflight
 -rw-r--r--   3 vinoth supergroup     321984 2016-10-05 23:18 
/app/uber/trips/.hoodie/20161005225920.inflight
 ```
 
+To list all inflight and requested instants that have been running longer than 
a specified number of minutes, use `commits show_inflights`:
+
+```shell
+hudi:trips->commits show_inflights --lookbackInMins 30
+```
+
+This lists every inflight or requested instant whose requested timestamp is 
older than 30 minutes, showing the commit time, action type, and current state. 
This is useful for detecting hung or stuck writes. The `--lookbackInMins` 
option defaults to `0` (returns all inflight/requested instants).
 
 ### Drilling Down to a specific Commit
 
@@ -675,6 +682,22 @@ corresponding to the library release version is used:
 upgrade table
 ```
 
+### Record Index Lookup
+
+To look up a record's file location via the Record Level Index (RLI) stored in 
the Metadata Table:
+
+```shell
+hudi:trips->metadata lookup-record-index --record_key <key>
+```
+
+For a partitioned (non-global) RLI, the partition path is required:
+
+```shell
+hudi:trips->metadata lookup-record-index --record_key <key> --partition_path 
<partition>
+```
+
+The `--partition_path` argument is optional for a global RLI (where record 
keys are unique across all partitions) and required for a partitioned RLI; 
omitting it for a partitioned RLI returns an error. The output columns are 
`Record key`, `Partition path`, `File Id`, and `Instant time`.
+
 ### Change Hudi Table Type
 There are cases we want to change the hudi table type. For example, change COW 
table to MOR for more efficient and 
 lower latency ingestion; change MOR to COW for better read performance and 
compatibility with downstream engines.
diff --git a/website/docs/clustering.md b/website/docs/clustering.md
index 7442967b6505..7426bf9e2dcc 100644
--- a/website/docs/clustering.md
+++ b/website/docs/clustering.md
@@ -2,7 +2,7 @@
 title: Clustering
 summary: "In this page, we describe async compaction in Hudi."
 toc: true
-last_modified_at: 2025-11-24T02:44:48
+last_modified_at: 2026-05-27T00:00:00-00:00
 ---
 
 ## Background
@@ -134,6 +134,47 @@ dynamically expanding the buckets for bucket index 
datasets.
 :::note The latter two strategies are applicable only for the Spark engine.
 :::
 
+#### CommitBasedClusteringPlanStrategy
+
+Hudi 1.2.0 introduced 
`org.apache.hudi.table.action.cluster.strategy.CommitBasedClusteringPlanStrategy`,
 a plan
+strategy that schedules clustering based on commit patterns rather than just 
file size. It groups file slices by the
+commits that produced them, making it easier to cluster data written in 
specific time windows or under specific commit
+criteria.
+
+| Config Name | Default | Description |
+|---|---|---|
+| `hoodie.clustering.plan.strategy.class` | 
`SparkSizeBasedClusteringPlanStrategy` | Set to 
`org.apache.hudi.table.action.cluster.strategy.CommitBasedClusteringPlanStrategy`
 to use commit-based planning. |
+| `hoodie.clustering.plan.strategy.earliest.commit.to.cluster` | (none) | 
Earliest commit time (exclusive) to start clustering from. Only commits after 
this instant are considered. Useful for incrementally clustering new data while 
skipping already-clustered history. |
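+
+A minimal sketch of enabling commit-based planning. The instant time is a
+placeholder to replace with a commit time from your own timeline:
+
+```properties
+hoodie.clustering.plan.strategy.class=org.apache.hudi.table.action.cluster.strategy.CommitBasedClusteringPlanStrategy
+# Exclusive lower bound: only commits after this instant are considered
+hoodie.clustering.plan.strategy.earliest.commit.to.cluster=<earliest-instant-time>
+```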
+
+#### SparkStreamCopyClusteringPlanStrategy
+
+Available since Hudi 1.2.0, 
`org.apache.hudi.client.clustering.plan.strategy.SparkStreamCopyClusteringPlanStrategy`
+is a Spark-only plan strategy that performs binary file stitching (byte-level 
copy) rather than re-reading and
+re-writing records. This can be significantly faster when the goal is simply 
to coalesce small files and sort order is
+not required. It is paired with
+`org.apache.hudi.client.clustering.run.strategy.SparkStreamCopyClusteringExecutionStrategy`.
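+
+A sketch of pairing the two classes. It assumes the execution strategy is
+selected through the existing `hoodie.clustering.execution.strategy.class`
+config, which this section does not spell out:
+
+```properties
+hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkStreamCopyClusteringPlanStrategy
+# Assumption: set via the standard execution strategy config key
+hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkStreamCopyClusteringExecutionStrategy
+```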
+
+#### Single-Group Clustering Control
+
+| Config Name | Default | Description |
+|---|---|---|
+| `hoodie.clustering.plan.strategy.single.group.clustering.enabled` | `true` | 
Whether to generate a clustering plan when only one file group is eligible. Set 
to `false` to skip clustering when there is nothing meaningful to consolidate 
(i.e., the partition already has a single file group). |
+
+#### File-Slice Sort Order in Clustering Plan Generation
+
+Since 1.2.0, the order in which file slices are packed into clustering groups 
is configurable, giving more control over
+which files are colocated and how groups are filled.
+
+| Config Name | Default | Description |
+|---|---|---|
+| `hoodie.clustering.plan.strategy.file.slices.sort.by` | `SIZE` | 
Comma-separated list of fields used to sort file slices when packing them into 
clustering groups within a partition. `SIZE`: sort by file size descending 
(largest first). `INSTANT_TIME`: sort by commit time ascending (oldest files 
first). Example: `INSTANT_TIME,SIZE` sorts by commit time then by size. |
+
+#### Driver-Side Plan Generation
+
+| Config Name | Default | Description |
+|---|---|---|
+| `hoodie.clustering.plan.generation.use.local.engine.context` | `false` | 
When enabled, clustering group computation runs on the driver (local engine 
context) instead of being distributed across executors. Enable when there are 
only a few partitions with many files, where driver-local computation is more 
resource-efficient than allocating executor slots. |
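+
+The planning options in the three tables above can be combined. A sketch with
+illustrative, non-default values:
+
+```properties
+# Skip planning when only one file group is eligible
+hoodie.clustering.plan.strategy.single.group.clustering.enabled=false
+# Sort file slices by commit time (oldest first), then by size
+hoodie.clustering.plan.strategy.file.slices.sort.by=INSTANT_TIME,SIZE
+# Compute clustering groups on the driver (few partitions, many files)
+hoodie.clustering.plan.generation.use.local.engine.context=true
+```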
+
 ### Execution Strategy
 
 After building the clustering groups in the planning phase, Hudi applies 
execution strategy, for each group, primarily
@@ -251,6 +292,22 @@ In addition to the basic mode options, HoodieClusteringJob 
supports the followin
 These retry options are only effective when using `--mode scheduleAndExecute`. 
The `--retry-last-failed-job` option requires `--job-max-processing-time-ms` to 
be set to a positive value to detect stale inflight instants.
 :::
 
+#### Automatic Expiration of Stale Clustering Instants
+
+When a clustering job is scheduled but never successfully executed (e.g., due 
to a driver failure), the inflight
+`replacecommit` instant blocks future clustering runs. Hudi 1.2.0 adds 
automatic expiration of such stale clustering
+instants, complementing the manual retry options above.
+
+:::note
+Expired clustering plan cleanup requires 
`hoodie.clean.failed.writes.policy=LAZY`. With LAZY cleaning, the rollback of
+failed writes (triggered on the next write) also rolls back expired clustering 
instants.
+:::
+
+| Config Name | Default | Description |
+|---|---|---|
+| `hoodie.clustering.enable.expirations` | `false` | When enabled, rollback of 
failed writes (under LAZY cleaning) also rolls back clustering `replacecommit` 
instants whose heartbeat has expired. Clustering jobs record a heartbeat before 
scheduling so other writers can detect stale attempts. |
+| `hoodie.clustering.expiration.threshold.mins` | `60` | A clustering instant 
is not considered expired unless its creation time is at least this many 
minutes old. Acts as a guardrail to avoid rolling back clustering attempts that 
are still in progress. |
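+
+A sketch of enabling expiration. The 120-minute threshold is an illustrative
+assumption (the default is 60):
+
+```properties
+# Required: expired clustering instants are rolled back under LAZY cleaning
+hoodie.clean.failed.writes.policy=LAZY
+hoodie.clustering.enable.expirations=true
+# Assumed value: never expire clustering instants younger than 2 hours
+hoodie.clustering.expiration.threshold.mins=120
+```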
+
 Note that to run this job while the original writer is still running, please 
enable multi-writing:
 
 ```properties
diff --git a/website/docs/compaction.md b/website/docs/compaction.md
index 89b9214f0bd9..bf1dee7c12d6 100644
--- a/website/docs/compaction.md
+++ b/website/docs/compaction.md
@@ -4,7 +4,7 @@ summary: "In this page, we describe async compaction in Hudi."
 toc: true
 toc_min_heading_level: 2
 toc_max_heading_level: 4
-last_modified_at: 2025-11-24T02:44:48
+last_modified_at: 2026-05-27T00:00:00-00:00
 ---
 ## Background
 
@@ -104,6 +104,17 @@ BoundedPartitionAwareCompactionStrategy</li></ul>
 Please refer to [advanced 
configs](https://hudi.apache.org/docs/next/configurations#Compaction-Configs) 
for more details.
 :::
 
+#### Metadata Table Compaction Trigger Strategy
+
+Available since Hudi 1.2.0, the metadata table (MDT) supports the same set of 
compaction trigger strategies as the
+data table, plus a time-based option.
+
+| Config Name | Default | Description |
+|---|---|---|
+| `hoodie.metadata.compact.trigger.strategy` | `NUM_COMMITS` | Trigger 
strategy for MDT compaction. Accepts the same values as 
`hoodie.compact.inline.trigger.strategy`: `NUM_COMMITS`, 
`NUM_COMMITS_AFTER_LAST_REQUEST`, `TIME_ELAPSED`, `NUM_AND_TIME`, 
`NUM_OR_TIME`. |
+| `hoodie.metadata.compact.max.delta.commits` | `10` | Number of delta commits 
after the last MDT compaction before a new one is scheduled (for 
`NUM_COMMITS`-based strategies). |
+| `hoodie.metadata.compact.max.delta.seconds` | `7200` | Elapsed seconds after 
the last MDT compaction before scheduling a new one. Takes effect only for 
`TIME_ELAPSED`, `NUM_AND_TIME`, and `NUM_OR_TIME` strategies. |
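+
+For instance, to schedule MDT compaction when either threshold is reached,
+assuming `NUM_OR_TIME` has the same either-condition meaning it has for the
+data table:
+
+```properties
+hoodie.metadata.compact.trigger.strategy=NUM_OR_TIME
+hoodie.metadata.compact.max.delta.commits=10
+# Assumed value: one hour instead of the 7200-second default
+hoodie.metadata.compact.max.delta.seconds=3600
+```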
+
 ## Ways to trigger Compaction
 
 ### Inline
diff --git a/website/docs/concurrency_control.md 
b/website/docs/concurrency_control.md
index d6231aca79b7..2056acfca64c 100644
--- a/website/docs/concurrency_control.md
+++ b/website/docs/concurrency_control.md
@@ -4,7 +4,7 @@ summary: On this page, we discuss how to perform concurrent 
writes to Hudi table
 toc: true
 toc_min_heading_level: 2
 toc_max_heading_level: 4
-last_modified_at: 2025-11-23T14:20:00
+last_modified_at: 2026-05-27T00:00:00-00:00
 ---
 
 Concurrency control defines how different writers, readers, and table services 
coordinate access to a Hudi table. Hudi ensures atomic writes by publishing 
commits atomically to the timeline, stamped with an instant time that denotes 
when the action is deemed to have occurred. Unlike general-purpose file version 
control, Hudi draws a clear distinction between writer processes that issue 
[write operations](write_operations.md), table services that (re)write 
data/metadata to optimize or per [...]
@@ -47,6 +47,7 @@ Add the corresponding cloud bundle to your classpath:
 
 * For S3: `hudi-aws-bundle`
 * For GCS: `hudi-gcp-bundle`
+* For Azure (`abfs://`, `abfss://`, `wasb://`, `wasbs://`): `hudi-azure-bundle`
 
 Set this configuration:
 
@@ -54,7 +55,7 @@ Set this configuration:
 
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.StorageBasedLockProvider
 ```
 
-Supported for S3 and GCS (additional systems planned). This cloud-native 
design works directly with storage features, simplifying large-scale cloud 
operations.
+Supported for S3, GCS, and Azure ADLS Gen2 / Azure Blob Storage. This 
cloud-native design works directly with storage features, simplifying 
large-scale cloud operations.
 
 Optional tuning configurations:
 
@@ -63,6 +64,27 @@ Optional tuning configurations:
 | hoodie.write.lock.storage.validity.timeout.secs | 300 (Optional) | Validity 
period (seconds) for each new lock. The provider renews its lock until the 
lease extends or timeout occurs.<br /><br />`Config Param: 
STORAGE_BASED_LOCK_VALIDITY_TIMEOUT_SECS`<br />`Since Version: 1.0.2` |
 | hoodie.write.lock.storage.renew.interval.secs   | 30 (Optional)  | Interval 
(seconds) between renewal attempts.<br /><br />`Config Param: 
STORAGE_BASED_LOCK_RENEW_INTERVAL_SECS`<br />`Since Version: 1.0.2`             
                                                              |
 
+#### Azure Storage-Based Lock
+
+Authentication is resolved in the following precedence order:
+
+| Priority | Config Key | Description |
+|----------|------------|-------------|
+| 1 (highest) | `hoodie.write.lock.azure.connection.string` | Azure Storage 
connection string |
+| 2 | `hoodie.write.lock.azure.sas.token` | SAS token (Azure does not 
recommend SAS tokens for production) |
+| 3 | `hoodie.write.lock.azure.managed.identity.client.id` | Client ID of a 
user-assigned managed identity (`ManagedIdentityCredential`) |
+| 4 | `hoodie.write.lock.azure.client.tenant.id` + `.client.id` + 
`.client.secret` | Service principal via `ClientSecretCredential` — all three 
must be set |
+| 5 (lowest) | _(none)_ | `DefaultAzureCredential` chain (system-assigned 
managed identity, environment variables, etc.) |
+
+Example configuration for service-principal authentication:
+
+```properties
+hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.StorageBasedLockProvider
+hoodie.write.lock.azure.client.tenant.id=<your-tenant-id>
+hoodie.write.lock.azure.client.id=<your-app-client-id>
+hoodie.write.lock.azure.client.secret=<your-client-secret>
+```
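+
+A comparable sketch for a user-assigned managed identity, using the key from
+the precedence table above. The client ID is a placeholder:
+
+```properties
+hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.StorageBasedLockProvider
+hoodie.write.lock.azure.managed.identity.client.id=<your-managed-identity-client-id>
+```
+
+Because the connection string and SAS token rank higher in the precedence
+order, leave those keys unset when relying on a managed identity.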
+
 ### Zookeeper-Based Lock Provider
 
 ```properties
@@ -359,6 +381,22 @@ hoodie.write.lock.client.num_retries
 
 *Setting the right values for these depends on a case by case basis; some 
defaults have been provided for general cases.*
 
+## Pre-Write Cleaner Policy
+
+When running multi-writer pipelines, failed writes can accumulate on storage 
if a writer crashes before a clean cycle runs. Hudi 1.2.0 introduces 
`hoodie.prewrite.cleaner.policy` to proactively handle this at write startup:
+
+| Config Key | Default | Description |
+|---|---|---|
+| `hoodie.prewrite.cleaner.policy` | `NONE` | Policy applied before starting a 
new ingestion write commit. `NONE`: no pre-write action (default). `CLEAN`: 
force a clean table service call (also rolls back failed writes). 
`ROLLBACK_FAILED_WRITES`: only roll back failed writes without running a full 
clean. |
+
+This is useful when a writer repeatedly crashes before a post-write clean can 
complete, leaving failed writes to accumulate. See [Cleaning](cleaning.md) for 
the full list of cleaning configurations.
+
+## Lock Audit Logging and Diagnostics
+
+The storage-based lock provider supports optional audit logging of lock 
operations. When enabled, a `.hoodie/lock/audit_enabled.json` marker is written 
to the table base path and lock acquisition/release events are recorded for 
post-hoc debugging.
+
+For ZooKeeper-based locking, the ZK lock node now stores the Spark application 
ID of the writer holding the lock, making it easier to correlate lock holders 
with running Spark jobs in cluster UIs.
+
 ## Caveats
 
 If you are using the `WriteClient` API, please note that multiple writes to 
the table need to be initiated from 2 different instances of the write client.
diff --git a/website/docs/deployment.md b/website/docs/deployment.md
index 51f92e40d407..62497285a333 100644
--- a/website/docs/deployment.md
+++ b/website/docs/deployment.md
@@ -3,7 +3,7 @@ title: Deployment
 keywords: [ hudi, administration, operation, devops, deployment]
 summary: This section offers an overview of tools available to operate an 
ecosystem of Hudi
 toc: true
-last_modified_at: 2019-12-30T15:59:57-04:00
+last_modified_at: 2026-05-27T00:00:00-00:00
 ---
 
 This section provides all the help you need to deploy and operate Hudi tables 
at scale.
@@ -32,6 +32,8 @@ from varied sources such as DFS, Kafka and DB Changelogs and 
ingest them to hudi
 To use Hudi Streamer in Spark, the `hudi-utilities-slim-bundle` and Hudi Spark 
bundle are required, by adding
 `--packages 
org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.1,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.1`
 to the `spark-submit` command.
 
+Pick the Spark bundle that matches your Spark runtime and Scala version: for 
example, `hudi-spark3.3-bundle_2.12`, `hudi-spark3.4-bundle_2.12`, 
`hudi-spark3.5-bundle_2.12` or `hudi-spark3.5-bundle_2.13` (Spark 3.5 is 
available for both Scala 2.12 and 2.13), `hudi-spark4.0-bundle_2.13`, or 
`hudi-spark4.1-bundle_2.13`. Spark 4.0 and 4.1 require Java 17 or later at 
runtime; Spark 3.x runs on Java 8 or later.
+
 - **Run Once Mode** : In this mode, Hudi Streamer performs one ingestion round 
which includes incrementally pulling events from upstream sources and ingesting 
them to hudi table. Background operations like cleaning old file versions and 
archiving hoodie timeline are automatically executed as part of the run. For 
Merge-On-Read tables, Compaction is also run inline as part of ingestion unless 
disabled by passing the flag "--disable-compaction". By default, Compaction is 
run inline for ever [...]
 
 Here is an example invocation for reading from kafka topic in a single-run 
mode and writing to Merge On Read table type in a yarn cluster.
diff --git a/website/docs/flink-quick-start-guide.md 
b/website/docs/flink-quick-start-guide.md
index 2a04dc85abb2..23ef5efbe3cc 100644
--- a/website/docs/flink-quick-start-guide.md
+++ b/website/docs/flink-quick-start-guide.md
@@ -1,7 +1,7 @@
 ---
 title: "Flink Quick Start"
 toc: true
-last_modified_at: 2025-11-22T14:30:00+08:00
+last_modified_at: 2026-05-27T00:00:00-00:00
 ---
 import Tabs from '@theme/Tabs';
 import TabItem from '@theme/TabItem';
@@ -12,12 +12,13 @@ This page introduces Flink–Hudi integration and 
demonstrates how Flink brings
 
 ### Flink Support Matrix
 
-| Hudi   | Supported Flink versions                                            
   |
-| :----- | 
:--------------------------------------------------------------------- |
-| 1.1.x  | 1.17.x, 1.18.x, 1.19.x, 1.20.x (default build), 2.0.x               
   |
-| 1.0.x  | 1.14.x, 1.15.x, 1.16.x, 1.17.x, 1.18.x, 1.19.x, 1.20.x (default 
build) |
-| 0.15.x | 1.14.x, 1.15.x, 1.16.x, 1.17.x, 1.18.x                              
   |
-| 0.14.x | 1.13.x, 1.14.x, 1.15.x, 1.16.x, 1.17.x                              
   |
+| Hudi   | Supported Flink versions                                                |
+| :----- | :---------------------------------------------------------------------- |
+| 1.2.x  | 1.17.x, 1.18.x, 1.19.x, 1.20.x (default build), 2.0.x, 2.1.x            |
+| 1.1.x  | 1.17.x, 1.18.x, 1.19.x, 1.20.x (default build), 2.0.x                   |
+| 1.0.x  | 1.14.x, 1.15.x, 1.16.x, 1.17.x, 1.18.x, 1.19.x, 1.20.x (default build)  |
+| 0.15.x | 1.14.x, 1.15.x, 1.16.x, 1.17.x, 1.18.x                                  |
+| 0.14.x | 1.13.x, 1.14.x, 1.15.x, 1.16.x, 1.17.x                                  |
 
 ### Download Flink and Start Flink cluster
 
@@ -62,9 +63,9 @@ You can build the jar manually under path 
`hudi-source-dir/packaging/hudi-flink-
 Now start the SQL CLI:
 
 ```bash
-# For Flink versions: 1.17-1.20, 2.0
-export FLINK_VERSION=1.20 
-export HUDI_VERSION=1.1.1
+# Supported Flink versions for Hudi 1.2.x: 1.17, 1.18, 1.19, 1.20 (default build), 2.0, 2.1
+export FLINK_VERSION=1.20
+export HUDI_VERSION=1.2.0
 wget https://repo1.maven.org/maven2/org/apache/hudi/hudi-flink${FLINK_VERSION}-bundle/${HUDI_VERSION}/hudi-flink${FLINK_VERSION}-bundle-${HUDI_VERSION}.jar -P /tmp/
 ./bin/sql-client.sh embedded -j /tmp/hudi-flink${FLINK_VERSION}-bundle-${HUDI_VERSION}.jar shell
 ```
@@ -77,11 +78,11 @@ The SQL CLI only executes the SQL line by line.
 
 Please add the desired dependency to your project:
 ```xml
-<!-- For Flink versions 1.17-1.20, 2.0-->
+<!-- Supported Flink versions for Hudi 1.2.x: 1.17, 1.18, 1.19, 1.20 (default build), 2.0, 2.1 -->
 <properties>
-    <flink.version>1.20.0</flink.version>
+    <flink.version>1.20.1</flink.version>
     <flink.binary.version>1.20</flink.binary.version>
-    <hudi.version>1.1.1</hudi.version>
+    <hudi.version>1.2.0</hudi.version>
 </properties>
 <dependency>
     <groupId>org.apache.hudi</groupId>
@@ -446,9 +447,9 @@ feature is that it lets you author streaming pipelines on 
streaming or batch dat
 
 - **Quick Start**: Read the Quick Start section above to get started quickly 
with the Flink SQL client to write to (and read from) Hudi.
 - **Configuration**: For [Global 
Configuration](flink_tuning.md#global-configurations), set up through 
`$FLINK_HOME/conf/flink-conf.yaml`. For per-job configuration, set up through 
[Table Option](flink_tuning.md#table-options).
-- **Writing Data** : Flink supports different modes for writing, such as [CDC 
Ingestion](ingestion_flink.md#cdc-ingestion), [Bulk 
Insert](ingestion_flink.md#bulk-insert), [Index 
Bootstrap](ingestion_flink.md#index-bootstrap), [Changelog 
Mode](ingestion_flink.md#changelog-mode) and [Append 
Mode](ingestion_flink.md#append-mode). Flink also supports multiple streaming 
writers with [non-blocking concurrency 
control](sql_dml.md#non-blocking-concurrency-control-experimental).
-- **Reading Data** : Flink supports different modes for reading, such as 
[Streaming Query](sql_queries.md#streaming-query) and [Incremental 
Query](sql_queries.md#incremental-query).
-- **Tuning**: For write/read tasks, this guide provides some tuning 
suggestions, such as [Memory Optimization](flink_tuning.md#memory-optimization) 
and [Write Rate Limit](flink_tuning.md#write-rate-limit).
+- **Writing Data** : Flink supports different modes for writing, such as [CDC 
Ingestion](ingestion_flink.md#cdc-ingestion), [Bulk 
Insert](ingestion_flink.md#bulk-insert), [Index 
Bootstrap](ingestion_flink.md#index-bootstrap), [Changelog 
Mode](ingestion_flink.md#changelog-mode) and [Append 
Mode](ingestion_flink.md#append-mode). For high-throughput append pipelines, 
choose an [append write buffer mode](ingestion_flink.md#append-write-buffer). 
For upsert workloads at scale, use [Record-Leve [...]
+- **Reading Data** : Flink supports different modes for reading, such as 
[Streaming Query](sql_queries.md#streaming-query) and [Incremental 
Query](sql_queries.md#incremental-query). For improved push-down and resumable 
reads, see [Flink Source V2](ingestion_flink.md#flink-source-v2). For 
dimension-table joins, use [lookup join](ingestion_flink.md#lookup-join) with 
an optional off-heap RocksDB cache.
+- **Tuning**: For write/read tasks, this guide provides some tuning 
suggestions, such as [Memory 
Optimization](flink_tuning.md#memory-optimization), the [Managed-Memory Write 
Buffer](flink_tuning.md#managed-memory-write-buffer), and [Write Rate 
Limit](flink_tuning.md#write-rate-limit).
 - **Optimization**: Offline compaction is supported: [Offline 
Compaction](compaction.md#flink-offline-compaction).
 - **Query Engines**: Besides Flink, many other engines are integrated: [Hive 
Query](syncing_metastore.md#flink-setup), [Presto Query](sql_queries.md#presto).
 - **Catalog**: A Hudi‑specific catalog is supported: [Hudi 
Catalog](sql_ddl/#create-catalog).
diff --git a/website/docs/flink_tuning.md b/website/docs/flink_tuning.md
index 28e70e48b1f2..bd08b4be0fba 100644
--- a/website/docs/flink_tuning.md
+++ b/website/docs/flink_tuning.md
@@ -115,3 +115,50 @@ the `write.rate.limit` option can be turned on to ensure 
smooth writing.
 |  Option Name  | Required | Default | Remarks |
 |  -----------  | -------  | ------- | ------- |
 | `write.rate.limit` | `false` | `0` | Turn off by default |
+
+## Managed-Memory Write Buffer
+
+By default, the Flink write buffer uses JVM heap memory (`ON_HEAP`). In 
containerized environments where heap memory is tightly budgeted, you can 
switch to Flink's managed (off-heap) memory pool to reduce GC pressure and 
avoid OOM errors.
+
+:::note
+When using the `MANAGED` memory type, make sure `taskmanager.memory.managed.size` in `flink-conf.yaml` is large enough to hold the write buffer.
+:::
+
+|  Option Name  | Description | Default | Remarks |
+|  -----------  | -------  | ------- | ------- |
+| `write.buffer.memory.type` | Memory type for the write buffer: `ON_HEAP` (default, uses JVM heap) or `MANAGED` (uses Flink managed off-heap memory) | `ON_HEAP` | Switch to `MANAGED` to avoid OOM in memory-constrained deployments |
+| `write.memory.segment.page.size` | Page size in bytes for memory segments used in the write buffer | `32768` (32 KB) | Tune for workload characteristics; larger pages reduce overhead for large records |
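+
+As a rough sketch, the two options above could be set per table as shown below. The table name, schema, and path are placeholders introduced only for illustration, and setting these keys through the table's `WITH` clause is an assumption based on how the other Flink write options are configured:
+
+```sql
+-- Hypothetical table; only the two write buffer options come from the table above.
+CREATE TABLE managed_buffer_demo (
+  id BIGINT,
+  name STRING,
+  ts TIMESTAMP(3),
+  PRIMARY KEY (id) NOT ENFORCED
+) WITH (
+  'connector' = 'hudi',
+  'path' = 'hdfs:///tmp/managed_buffer_demo',   -- placeholder path
+  'write.buffer.memory.type' = 'MANAGED',       -- use Flink managed (off-heap) memory
+  'write.memory.segment.page.size' = '32768'    -- default page size, shown explicitly
+);
+```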
+
+## Disruptor Buffer Tuning
+
+When `write.buffer.type=DISRUPTOR` is set in the table options (see [Append 
Write Buffer](ingestion_flink.md#append-write-buffer)), the following tuning 
options control the Disruptor ring buffer:
+
+|  Option Name  | Description | Default | Remarks |
+|  -----------  | -------  | ------- | ------- |
+| `write.buffer.disruptor.ring.size` | Size of the Disruptor ring buffer (must be a power of 2) | `16384` | Larger values absorb write bursts but consume more heap memory |
+| `write.buffer.disruptor.wait.strategy` | Wait strategy for the Disruptor consumer: `BLOCKING_WAIT` (default), `SLEEPING_WAIT`, `YIELDING_WAIT`, `BUSY_SPIN_WAIT` | `BLOCKING_WAIT` | `BLOCKING_WAIT` is safest for containerized environments; `BUSY_SPIN_WAIT` offers lowest latency at the cost of a dedicated CPU core |
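+
+A minimal sketch of how these options might be combined on an append-mode table. The table name, schema, path, and the non-default ring size are illustrative assumptions; `write.operation` = `insert` is assumed here as the way to select append mode, and `write.buffer.sort.keys` is included because the Append Write Buffer section lists it as required for `DISRUPTOR`:
+
+```sql
+-- Hypothetical append-only table tuned for bursty writes.
+CREATE TABLE disruptor_demo (
+  event_id BIGINT,
+  event_type STRING,
+  ts TIMESTAMP(3)
+) WITH (
+  'connector' = 'hudi',
+  'path' = 'hdfs:///tmp/disruptor_demo',                    -- placeholder path
+  'write.operation' = 'insert',                             -- assumed: append mode
+  'write.buffer.type' = 'DISRUPTOR',
+  'write.buffer.sort.keys' = 'event_type,ts',               -- illustrative sort keys
+  'write.buffer.disruptor.ring.size' = '32768',             -- must be a power of 2 (2^15)
+  'write.buffer.disruptor.wait.strategy' = 'BLOCKING_WAIT'  -- default strategy, shown explicitly
+);
+```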
+
+## Timeline-Server-Based Markers
+
+As of Hudi 1.2.0, Flink writers support the `TIMELINE_SERVER_BASED` marker type (`hoodie.write.markers.type=TIMELINE_SERVER_BASED`). It is recommended over `DIRECT` markers on object stores (S3, GCS, ADLS), where expensive directory listings make `DIRECT` markers slow.
+
+```sql
+CREATE TABLE my_table (...)
+WITH (
+  'connector' = 'hudi',
+  'path' = 's3a://my-bucket/my-table',
+  'hoodie.write.markers.type' = 'TIMELINE_SERVER_BASED'
+  -- other options
+);
+```
+
+## Source V2 Read-Lag Metrics
+
+When [Source V2](ingestion_flink.md#flink-source-v2) is enabled 
(`read.source-v2.enabled=true`), the following read-lag metrics are emitted to 
help monitor streaming pipeline health:
+
+| Metric | Description |
+|--------|-------------|
+| `issuedInstantDelay` | Time elapsed (ms) between when a new instant was written and when the source issued it for reading |
+| `sourceReaderIdleTime` | Time (ms) the source reader has been idle (no new splits assigned) |
+
+These metrics are exposed through Flink's standard metrics system and can be 
forwarded to Prometheus, JMX, or other reporters.
diff --git a/website/docs/hoodie_streaming_ingestion.md 
b/website/docs/hoodie_streaming_ingestion.md
index 286d5765c751..1fb282a5d718 100644
--- a/website/docs/hoodie_streaming_ingestion.md
+++ b/website/docs/hoodie_streaming_ingestion.md
@@ -146,7 +146,9 @@ Usage: <main class> [options]
       Default: 0
     --op
       Takes one of these values : UPSERT (default), INSERT, BULK_INSERT,
-      INSERT_OVERWRITE, INSERT_OVERWRITE_TABLE, DELETE_PARTITION
+      INSERT_OVERWRITE, INSERT_OVERWRITE_TABLE, DELETE_PARTITION, DELETE
+      (DELETE extracts HoodieKeys from source records and deletes the
+      corresponding records from the table.)
       Default: UPSERT
       Possible Values: [INSERT, INSERT_PREPPED, UPSERT, UPSERT_PREPPED, 
BULK_INSERT, BULK_INSERT_PREPPED, DELETE, DELETE_PREPPED, BOOTSTRAP, 
INSERT_OVERWRITE, CLUSTER, DELETE_PARTITION, INSERT_OVERWRITE_TABLE, COMPACT, 
INDEX, ALTER_SCHEMA, LOG_COMPACT, UNKNOWN]
     --payload-class
@@ -503,6 +505,77 @@ Check out [Kafka source 
config](https://hudi.apache.org/docs/configurations#Kafk
 Hudi Streamer also supports ingesting from Apache Pulsar via 
`org.apache.hudi.utilities.sources.PulsarSource`.
 Check out [Pulsar source 
config](https://hudi.apache.org/docs/configurations#Pulsar-Source-Configs) for 
more details.
 
+#### Amazon Kinesis
+
+Use the `JsonKinesisSource` 
(`org.apache.hudi.utilities.sources.JsonKinesisSource`) to ingest JSON records 
from an AWS Kinesis Data Stream into a Hudi table. It reads from every shard in 
parallel, tracks per-shard progress in the Hudi Streamer checkpoint, 
automatically handles shard splits and merges, and de-aggregates records 
produced by the Kinesis Producer Library (KPL).
+
+##### Common configuration
+
+All keys use the prefix `hoodie.streamer.source.kinesis.`. The settings most 
users need:
+
+| Config key | Default | Description |
+|---|---|---|
+| `hoodie.streamer.source.kinesis.stream.name` | (required) | Kinesis Data 
Streams stream name. |
+| `hoodie.streamer.source.kinesis.region` | (required) | AWS region for the 
stream (e.g., `us-east-1`). |
+| `hoodie.streamer.source.kinesis.starting.position` | `LATEST` | Where to 
start when no checkpoint exists yet. `LATEST` starts at the tip of each shard; 
`EARLIEST` replays from `TRIM_HORIZON`. |
+| `hoodie.streamer.source.kinesis.max.events` | `5000000` | Maximum number of 
records read per batch across all shards. Tune to control batch size. |
+| `hoodie.streamer.source.kinesis.partitions` | `0` | Spark partitions to use 
when reading. `0` means one Spark partition per Kinesis shard. Set a positive 
value to repartition for downstream parallelism. |
+
+For credentials, the source uses the default AWS credential chain (instance 
profile, environment variables, etc.). Authentication for custom endpoints 
(e.g., LocalStack), API-level rate limiting, and retry tuning are also 
available — see the [configurations reference](configurations.md) for the full 
list of `hoodie.streamer.source.kinesis.*` keys.
+
+##### Checkpoint format
+
+Hudi Streamer persists Kinesis progress as a single checkpoint string on the 
timeline. Each batch advances the checkpoint to the last record successfully 
read from every shard, so a failed batch can be retried without skipping or 
duplicating records.
+
+The checkpoint encodes per-shard state in plain text:
+
+```
+streamName,shardId:value,shardId:value,...
+```
+
+Each `value` is one of:
+
+- `lastSeq` — last sequence number consumed from an open shard.
+- `lastSeq@arrivalTime` — same, with the record's approximate arrival time 
(epoch millis) for lag/observability.
+- `lastSeq|endSeq` — closed shard. `endSeq` is the shard's final sequence 
number, used to detect data loss if the shard expires before being fully 
consumed.
+- `lastSeq@arrivalTime|endSeq` — closed shard with arrival time.
+
+Example (sequence numbers abbreviated; Kinesis assigns each record a long decimal sequence number, typically 56 digits):
+
+```
+my-stream,shardId-000000000000:49590…88898,shardId-000000000001:49590…96306
+```
+
+You don't need to construct or parse this string yourself — it is read and 
updated automatically by the source — but it's useful for debugging, manual 
checkpoint resets, or comparing progress across shards.
+
+##### Minimal spark-submit example
+
+```properties
+# kinesis-source.properties
+hoodie.streamer.source.kinesis.stream.name=my-stream
+hoodie.streamer.source.kinesis.region=us-east-1
+hoodie.streamer.source.kinesis.starting.position=LATEST
+
+# Standard Hudi write / key-gen configs
+hoodie.datasource.write.recordkey.field=id
+hoodie.datasource.write.partitionpath.field=event_date
+hoodie.table.ordering.fields=ts
+```
+
+```bash
+spark-submit \
+  --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.2.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.2.0 \
+  --class org.apache.hudi.utilities.streamer.HoodieStreamer \
+  hudi-utilities-slim-bundle-*.jar \
+  --props kinesis-source.properties \
+  --source-class org.apache.hudi.utilities.sources.JsonKinesisSource \
+  --table-type COPY_ON_WRITE \
+  --target-base-path s3://my-bucket/hudi/my-table \
+  --target-table my_db.my_table \
+  --op UPSERT \
+  --continuous
+```
+
 #### Cloud storage event sources
 AWS S3 storage provides an event notification service which will post 
notifications when certain events happen in your S3 bucket: 
 https://docs.aws.amazon.com/AmazonS3/latest/userguide/NotificationHowTo.html
@@ -636,3 +709,34 @@ to how you run Hudi Streamer.
 ```
 
 For detailed information on how to configure and use 
`HoodieMultiTableStreamer`, please refer [blog 
section](/blog/2020/08/22/ingest-multiple-tables-using-hudi).
+
+## On-Demand Hive Sync (HudiHiveSyncJob)
+
+`org.apache.hudi.utilities.HudiHiveSyncJob` is a standalone Spark job that 
syncs a Hudi table's metadata to Hive metastore independently of any ingestion 
workflow. It is useful for backfills, manual data corrections, or reconciling 
metastore metadata after direct writes.
+
+### Arguments
+
+| Argument | Required | Description |
+|---|---|---|
+| `--base-path` / `-sp` | Yes | Base path of the Hudi table. |
+| `--base-file-format` / `-bff` | No | Base file format. Default: `PARQUET`. |
+| `--props-file-path` | No | Path to a properties file with Hudi / Hive sync 
configs. |
+| `--hoodie-conf` | No | Inline config override (repeatable). |
+| `--spark-master` | No | Spark master URL. Inherits from environment if 
unset. |
+
+### Example
+
+```bash
+spark-submit \
+  --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.2.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.2.0 \
+  --class org.apache.hudi.utilities.HudiHiveSyncJob \
+  hudi-utilities-slim-bundle-*.jar \
+  --base-path s3://my-bucket/hudi/my-table \
+  --base-file-format PARQUET \
+  --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
+  --hoodie-conf hoodie.datasource.hive_sync.metastore.uris=thrift://hive-metastore:9083 \
+  --hoodie-conf hoodie.datasource.hive_sync.database=my_db \
+  --hoodie-conf hoodie.datasource.hive_sync.table=my_table
+```
+
+All `hoodie.datasource.hive_sync.*` options accepted by the DataSource writer 
are also accepted here. See [Syncing to Hive Metastore](syncing_metastore.md) 
for the full list.
diff --git a/website/docs/ingestion_flink.md b/website/docs/ingestion_flink.md
index e720a748c1c8..2e534e286d67 100644
--- a/website/docs/ingestion_flink.md
+++ b/website/docs/ingestion_flink.md
@@ -1,7 +1,7 @@
 ---
 title: Using Flink
 keywords: [hudi, flink, streamer, ingestion]
-last_modified_at: 2025-11-22T12:53:57+08:00
+last_modified_at: 2026-05-27T00:00:00-00:00
 ---
 
 ## CDC Ingestion
@@ -112,15 +112,24 @@ the compaction options `compaction.delta_commits` and 
`compaction.delta_seconds`
 
 For `INSERT` mode write operations, new Parquet files are written directly, 
and the [auto‑file sizing](file_sizing.md) is not enabled.
 
-### In-Memory Buffer Sort
+### Append Write Buffer
 
-For append-only workloads, Hudi supports in-memory buffer sorting to improve 
Parquet compression ratio. When enabled, data is sorted within the write buffer 
before being flushed to disk. This improves columnar file compression 
efficiency by grouping similar values together.
+For append-only workloads, Hudi supports several write-buffer strategies that 
improve Parquet compression ratio and write throughput. Data is sorted or 
batched within the write buffer before being flushed to disk, grouping similar 
values together for better columnar compression.
 
-| Option Name                 | Required | Default | Remarks                   
                                                                                
                    |
-|-----------------------------|----------|---------|-------------------------------------------------------------------------------------------------------------------------------|
-| `write.buffer.sort.enabled` | `false`  | `false` | Whether to enable buffer 
sort within append write function. Improves Parquet compression ratio by 
sorting data before writing |
-| `write.buffer.sort.keys`    | `false`  | `N/A`   | Sort keys concatenated by 
comma (e.g., `col1,col2`). Required when `write.buffer.sort.enabled` is `true`  
                    |
-| `write.buffer.size`         | `false`  | `1000`  | Buffer size in number of 
records. When buffer reaches this size, data is sorted and flushed to disk      
                     |
+The buffer strategy is selected with `write.buffer.type`. In Hudi 1.2.0 this 
replaces the deprecated `write.buffer.sort.enabled` flag.
+
+| Option Name                               | Required | Default | Remarks |
+|-------------------------------------------|----------|---------|---------|
+| `write.buffer.type`                       | `false`  | `NONE`  | Buffer type for append write. Values: `NONE` (no buffering), `BOUNDED_IN_MEMORY` (double buffer with async write), `DISRUPTOR` (ring-buffer with async write, recommended for higher throughput), `CONTINUOUS_SORT` (TreeMap-based continuous sort with incremental draining) |
+| `write.buffer.size`                       | `false`  | `1000`  | Record count threshold at which the buffer is flushed. Applies to all non-`NONE` buffer types |
+| `write.buffer.sort.keys`                  | `false`  | `N/A`   | Comma-separated sort key columns (e.g., `col1,col2`). Required for `DISRUPTOR` and `CONTINUOUS_SORT` modes |
+| `write.buffer.sort.continuous.drain.size` | `false`  | `1`     | Number of records drained per flush cycle in `CONTINUOUS_SORT` mode. Default 1 provides smooth incremental draining; increase for batching (e.g., 10–100) |
+
+:::note
+`write.buffer.sort.enabled` is deprecated as of 1.2.0. Use 
`write.buffer.type=DISRUPTOR` instead for equivalent behavior. The `DISRUPTOR` 
and `CONTINUOUS_SORT` modes require `write.buffer.sort.keys` to be set.
+:::
+
+For Disruptor-specific tuning options, see 
[flink_tuning.md](flink_tuning.md#disruptor-buffer-tuning).
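+
+As an illustrative sketch (the table name, schema, and path are placeholders, and `write.operation` = `insert` is assumed to select append mode), a `CONTINUOUS_SORT` buffer with batched draining could be configured like this:
+
+```sql
+-- Hypothetical append-only table that keeps the buffer sorted by region, then user.
+CREATE TABLE sorted_append_demo (
+  user_id BIGINT,
+  region STRING,
+  ts TIMESTAMP(3)
+) WITH (
+  'connector' = 'hudi',
+  'path' = 'hdfs:///tmp/sorted_append_demo',         -- placeholder path
+  'write.operation' = 'insert',                      -- assumed: append mode
+  'write.buffer.type' = 'CONTINUOUS_SORT',
+  'write.buffer.sort.keys' = 'region,user_id',       -- required for CONTINUOUS_SORT
+  'write.buffer.size' = '1000',                      -- default flush threshold, in records
+  'write.buffer.sort.continuous.drain.size' = '50'   -- drain 50 records per cycle instead of 1
+);
+```
+
+Sorting on a low-cardinality column first (here `region`) tends to group similar values together, which is the effect the section above credits for better columnar compression.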
 
 ### Disable Meta Fields
 
@@ -156,7 +165,7 @@ Only Copy‑on‑Write tables are supported.
 
 ### Clustering Plan Strategy
 
-Custom clustering strategy is supported.
+Custom clustering strategies are supported. Hudi 1.2.0 adds `FlinkSkipSingleFileClusteringPlanStrategy` (`org.apache.hudi.client.clustering.plan.strategy.FlinkSkipSingleFileClusteringPlanStrategy`), which skips file groups that already consist of a single file, reducing unnecessary rewrites.
 
 | Option Name                                             | Required | Default 
| Remarks                                                                       
                                                                   |
 
|---------------------------------------------------------|----------|---------|--------------------------------------------------------------------------------------------------------------------------------------------------|
@@ -185,7 +194,7 @@ Hudi Flink writer supports two types of writer indexes:
 | Cross‑Partition Changes | Cannot handle changes among partitions (unless 
input is a CDC stream)                                                          
                                                                                
                                                             | No limit on 
handling cross‑partition changes                                                
                           |
 
 :::note
-Bucket index supports only the `UPSERT` write operation and cannot be used 
with the [append mode](#append-mode) in Flink.
+Bucket index supports `UPSERT` write operations on both COW and MOR tables. As 
of Hudi 1.2.0, MOR + bucket index + upsert is fully supported. Bucket index 
cannot be used with the [append mode](#append-mode) in Flink.
 :::
 
 ### Bucket Index Examples
@@ -349,10 +358,215 @@ For Flink streaming reads, rate limiting helps avoid 
backpressure when processin
 
 The average read rate can be calculated as: **`read.splits.limit` / 
`read.streaming.check-interval`** splits per second.
 
+Hudi 1.2.0 adds `read.commits.limit`, which complements `read.splits.limit` by 
capping the number of commits (instants) consumed per check interval. This is 
useful when tables have many small commits — limiting commits bounds the number 
of splits regardless of their individual size.
+
+### Options
+
+| Option Name                     | Required | Default             | Remarks |
+|---------------------------------|----------|---------------------|---------|
+| `write.rate.limit`              | `false`  | `0`                 | Write record rate limit per second to prevent traffic jitter and improve stability. Default is 0 (no limit) |
+| `read.splits.limit`             | `false`  | `Integer.MAX_VALUE` | Maximum number of splits allowed to read in each instant check for streaming reads. Average read rate = `read.splits.limit`/`read.streaming.check-interval`. Default is no limit |
+| `read.commits.limit`            | `false`  | `(none)`            | Maximum number of commits (instants) allowed to read in each check interval. Complements `read.splits.limit`. Average rate = `read.commits.limit`/`read.streaming.check-interval`. Default is no limit |
+| `read.streaming.check-interval` | `false`  | `60`                | Check interval in seconds for streaming reads. Default is 60 seconds (1 minute) |
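+
+To make the arithmetic concrete, here is a sketch with illustrative values (the table definition and the specific limits are assumptions, not recommendations). With a 60-second check interval, a split limit of 120 averages 120 / 60 = 2 splits per second, and a commit limit of 6 averages 6 / 60 = 0.1 commits per second:
+
+```sql
+-- Hypothetical streaming-read table with both read limits set.
+CREATE TABLE rate_limited_source (
+  id BIGINT,
+  payload STRING,
+  ts TIMESTAMP(3),
+  PRIMARY KEY (id) NOT ENFORCED
+) WITH (
+  'connector' = 'hudi',
+  'path' = 'hdfs:///tmp/rate_limited_source',   -- placeholder path
+  'table.type' = 'MERGE_ON_READ',
+  'read.streaming.enabled' = 'true',
+  'read.streaming.check-interval' = '60',       -- seconds between instant checks
+  'read.splits.limit' = '120',                  -- at most 120 splits per check
+  'read.commits.limit' = '6'                    -- at most 6 commits per check
+);
+```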
+
+## Flink Source V2
+
+Hudi 1.2.0 introduces a new Flink source implementation ([RFC-95](https://github.com/apache/hudi/blob/master/rfc/rfc-95/rfc-95.md)) based on [FLIP-27](https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface), available as an opt-in feature via the `read.source-v2.enabled` flag.
+
+### Why Source V2?
+
+The legacy Hudi Flink source was built on Flink's `SourceFunction` API. The 
FLIP-27 rewrite brings:
+
+- **Resumable split assignment** — splits can be checkpointed independently, 
enabling finer-grained recovery
+- **Checkpoint alignment** — the new API participates in Flink's coordinated 
checkpoint protocol, improving end-to-end consistency
+- **Push-down support** — predicate push-down, partition pruning, and `LIMIT` 
push-down are supported through the new source interface, reducing data scanned 
at the source level
+
+### Enabling Source V2
+
+```sql
+CREATE TABLE t1 (
+  uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
+  name VARCHAR(10),
+  age INT,
+  ts TIMESTAMP(3),
+  `partition` VARCHAR(20)
+)
+PARTITIONED BY (`partition`)
+WITH (
+  'connector' = 'hudi',
+  'path' = '${path}',
+  'table.type' = 'MERGE_ON_READ',
+  'read.source-v2.enabled' = 'true'  -- enable the FLIP-27 source
+);
+```
+
+### Options
+
+| Option Name               | Required | Default | Remarks |
+|---------------------------|----------|---------|---------|
+| `read.source-v2.enabled`  | `false`  | `false` | Whether to use the FLIP-27 new source (Source V2) to consume data files. Default is the legacy source |
+
+### Savepoint Incompatibility
+
+:::warning
+Savepoints taken with the **legacy source** (`read.source-v2.enabled=false`) 
are **not compatible** with the Source V2 source, and vice versa. When 
switching from the legacy source to Source V2, start a fresh job without 
restoring from a legacy savepoint. If you need to preserve read progress, 
record the last committed instant time and use `read.start-commit` to resume 
from that point.
+:::
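+
+A sketch of the migration path described above, assuming the last instant consumed by the legacy job was recorded as `20260527120000` (a made-up value) and that the new job is started fresh rather than restored from a savepoint:
+
+```sql
+-- Same schema as the Source V2 example; only the start commit is new.
+CREATE TABLE t1_v2 (
+  uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
+  name VARCHAR(10),
+  age INT,
+  ts TIMESTAMP(3),
+  `partition` VARCHAR(20)
+)
+PARTITIONED BY (`partition`)
+WITH (
+  'connector' = 'hudi',
+  'path' = '${path}',
+  'table.type' = 'MERGE_ON_READ',
+  'read.streaming.enabled' = 'true',
+  'read.source-v2.enabled' = 'true',
+  'read.start-commit' = '20260527120000'  -- placeholder instant recorded from the legacy job
+);
+```
+
+Before relying on this, check whether the start commit is treated as inclusive in your version, so that the recorded instant is neither re-read nor skipped.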
+
+## Record-Level Index (RLI) Bucket Indexing for Flink
+
+As of Hudi 1.2.0, the Flink writer supports the Record-Level Index (RLI) 
backed by the metadata table, in addition to the existing `FLINK_STATE` and 
`BUCKET` index types. RLI is stored in the metadata table and avoids the 
state-backend overhead of `FLINK_STATE`, while supporting full global or 
partition-scoped uniqueness guarantees.
+
+Two RLI variants are available via `index.type`:
+
+- `RECORD_LEVEL_INDEX` — partitioned RLI; enforces uniqueness per (partition 
path, record key) pair
+- `GLOBAL_RECORD_LEVEL_INDEX` — global RLI; enforces uniqueness across all 
partitions
+
+### Bootstrap
+
+When enabling RLI on an existing table, the bootstrap process loads existing 
record locations into RocksDB before the first write. Bootstrap is triggered by 
setting `index.bootstrap.enabled=true`.
+
+```sql
+CREATE TABLE my_hudi_table (
+  id BIGINT,
+  name STRING,
+  ts BIGINT,
+  dt STRING,
+  PRIMARY KEY (id) NOT ENFORCED
+)
+PARTITIONED BY (dt)
+WITH (
+  'connector' = 'hudi',
+  'path' = 'hdfs:///warehouse/my_hudi_table',
+  'table.type' = 'MERGE_ON_READ',
+  'index.type' = 'RECORD_LEVEL_INDEX',
+  'metadata.enabled' = 'true',
+  'index.bootstrap.enabled' = 'true',  -- enable bootstrap on first run
+  'index.bootstrap.rocksdb.path' = '/tmp/hudi-rli-rocksdb'
+);
+```
+
+Once bootstrap completes (after the first successful checkpoint), you can 
optionally restart the job with `index.bootstrap.enabled=false` to skip the 
bootstrap operators. Leaving them enabled is harmless — they become no-ops on 
subsequent runs and do not affect write performance.
+
+### In-Pipeline MDT Compaction
+
+For RLI workloads, the metadata table (MDT) accumulates log files that need 
periodic compaction. The option `metadata.compaction.async.enabled` (default 
`true`) runs MDT compaction inside the Flink pipeline after every 
`metadata.compaction.delta_commits` (default `10`) delta commits.
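+
+For illustration only, the two options could be set explicitly alongside an RLI table definition; the table name, schema, path, and the lower `delta_commits` value are assumptions:
+
+```sql
+-- Hypothetical RLI table that compacts the metadata table every 5 delta commits.
+CREATE TABLE rli_compaction_demo (
+  id BIGINT,
+  name STRING,
+  dt STRING,
+  PRIMARY KEY (id) NOT ENFORCED
+)
+PARTITIONED BY (dt)
+WITH (
+  'connector' = 'hudi',
+  'path' = 'hdfs:///tmp/rli_compaction_demo',     -- placeholder path
+  'table.type' = 'MERGE_ON_READ',
+  'index.type' = 'RECORD_LEVEL_INDEX',
+  'metadata.enabled' = 'true',
+  'metadata.compaction.async.enabled' = 'true',   -- default, shown explicitly
+  'metadata.compaction.delta_commits' = '5'       -- compact more often than the default of 10
+);
+```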
+
+### Options
+
+| Option Name                         | Required | Default         | Remarks |
+|-------------------------------------|----------|-----------------|---------|
+| `index.type`                        | `false`  | `FLINK_STATE`   | Set to `RECORD_LEVEL_INDEX` or `GLOBAL_RECORD_LEVEL_INDEX` to use the metadata-table-backed RLI |
+| `index.bootstrap.enabled`           | `false`  | `false`         | Bootstrap the index from the existing table on first run. Blocks checkpoints during bootstrap |
+| `index.bootstrap.rocksdb.path`      | `false`  | system temp dir | Local path for RocksDB storage during RLI bootstrap. Each task manager creates a unique subdirectory under this path |
+| `index.rli.cache.size`              | `false`  | `256`           | Maximum memory in MB for the RLI cache per bucket-assign task. Dynamically adjusted based on historical usage |
+| `index.rli.lookup.minibatch.size`   | `false`  | `1000`          | Maximum records buffered per mini-batch during RLI lookup. Mini-batching reduces individual index lookups. Minimum effective value is 1000 |
+| `metadata.compaction.async.enabled` | `false`  | `true`          | Whether to run MDT compaction asynchronously within the Flink pipeline. Recommended to keep enabled for RLI workloads |
+| `metadata.compaction.delta_commits` | `false`  | `10`            | Number of MDT delta commits that trigger in-pipeline compaction |
+
+:::note
+`GLOBAL_RECORD_LEVEL_INDEX` requires `metadata.enabled=true` and 
`index.global.enabled=true`. The Flink table factory validates these 
constraints automatically.
+:::
+
+## Lookup Join
+
+Hudi 1.2.0 adds a RocksDB-backed cache option for Flink lookup joins against 
Hudi dimension tables. This avoids JVM heap pressure when the dimension table 
is large.
+
 ### Options
 
-| Option Name                     | Required | Default             | Remarks   
                                                                                
                                                                                
       |
-|---------------------------------|----------|---------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| `write.rate.limit`              | `false`  | `0`                 | Write 
record rate limit per second to prevent traffic jitter and improve stability. 
Default is 0 (no limit)                                                         
             |
-| `read.splits.limit`             | `false`  | `Integer.MAX_VALUE` | Maximum 
number of splits allowed to read in each instant check for streaming reads. 
Average read rate = `read.splits.limit`/`read.streaming.check-interval`. 
Default is no limit |
-| `read.streaming.check-interval` | `false`  | `60`                | Check 
interval in seconds for streaming reads. Default is 60 seconds (1 minute)       
                                                                                
           |
+| Option Name                  | Required | Default                                 | Remarks |
+|------------------------------|----------|-----------------------------------------|---------|
+| `lookup.join.cache.type`     | `false`  | `heap`                                  | Storage backend for the lookup join cache. `heap` (default) stores rows in JVM heap; `rocksdb` stores rows off-heap in an embedded RocksDB instance |
+| `lookup.join.rocksdb.path`   | `false`  | `${java.io.tmpdir}/hudi-lookup-rocksdb` | Local directory for RocksDB data when `lookup.join.cache.type=rocksdb`. Cleaned up when the lookup function closes |
+| `lookup.async`               | `false`  | `false`                                 | Whether to enable async lookup join. Async join can improve throughput when the lookup function has high latency |
+| `lookup.async-thread-number` | `false`  | `16`                                    | Number of threads for async lookup join |
+
+### Example
+
+```sql
+-- Streaming fact table with a processing-time attribute
+CREATE TABLE orders (
+  order_id BIGINT,
+  customer_id BIGINT,
+  amount DOUBLE,
+  proc_time AS PROCTIME(),
+  PRIMARY KEY (order_id) NOT ENFORCED
+) WITH (
+  'connector' = 'hudi',
+  'path' = 'hdfs:///warehouse/orders',
+  'table.type' = 'MERGE_ON_READ',
+  'read.streaming.enabled' = 'true'
+);
+
+-- Hudi dimension table with RocksDB-backed lookup cache
+CREATE TABLE customers (
+  customer_id BIGINT,
+  name STRING,
+  city STRING,
+  PRIMARY KEY (customer_id) NOT ENFORCED
+) WITH (
+  'connector' = 'hudi',
+  'path' = 'hdfs:///warehouse/customers',
+  'lookup.join.cache.type' = 'rocksdb',
+  'lookup.join.rocksdb.path' = '/tmp/hudi-lookup-rocksdb'
+);
+
+-- Lookup join keyed by the fact table's processing-time attribute
+SELECT o.order_id, c.name, o.amount
+FROM orders AS o
+JOIN customers FOR SYSTEM_TIME AS OF o.proc_time AS c
+  ON o.customer_id = c.customer_id;
+```
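+
+If lookups are latency-bound rather than memory-bound, the async options from the table above can be set on the dimension table instead. This is a sketch that reuses the `customers` schema from the example; the table name `customers_async` and the thread count are illustrative values, not recommendations.
+
+```sql
+-- Hypothetical variant of the dimension table using async lookup
+CREATE TABLE customers_async (
+  customer_id BIGINT,
+  name STRING,
+  city STRING,
+  PRIMARY KEY (customer_id) NOT ENFORCED
+) WITH (
+  'connector' = 'hudi',
+  'path' = 'hdfs:///warehouse/customers',
+  'lookup.async' = 'true',               -- default is false
+  'lookup.async-thread-number' = '32'    -- default is 16
+);
+```
+
+The lookup join query itself is unchanged; only the table referenced in `FOR SYSTEM_TIME AS OF` differs.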
+
+## Virtual Metadata Columns
+
+Hudi metadata fields can be declared as `METADATA VIRTUAL` columns in the 
Flink DDL. This allows accessing system metadata (e.g., commit time, record 
key) without storing them as regular data columns.
+
+```sql
+CREATE TABLE events (
+  event_id BIGINT,
+  payload STRING,
+  -- virtual metadata columns (read-only, not persisted as data)
+  _hoodie_commit_time     STRING METADATA VIRTUAL,
+  _hoodie_record_key      STRING METADATA VIRTUAL,
+  _hoodie_partition_path  STRING METADATA VIRTUAL,
+  PRIMARY KEY (event_id) NOT ENFORCED
+)
+WITH (
+  'connector' = 'hudi',
+  'path' = 'hdfs:///warehouse/events'
+);
+
+-- Query metadata alongside data
+SELECT event_id, _hoodie_commit_time, payload FROM events;
+```
+
+:::note
+Only `VIRTUAL` metadata columns are supported. All valid virtual columns 
correspond to Hudi's built-in meta fields (`_hoodie_commit_time`, 
`_hoodie_commit_seqno`, `_hoodie_record_key`, `_hoodie_partition_path`, 
`_hoodie_file_name`, `_hoodie_operation`).
+:::
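+
+Virtual metadata columns can also appear in predicates. The sketch below assumes the `events` table defined above; the timestamp literal is a made-up value in Hudi's instant-time string format, and this page does not state whether such a filter is pushed down to the source.
+
+```sql
+-- Rows written by commits at or after a (hypothetical) instant time
+SELECT event_id, payload, _hoodie_partition_path
+FROM events
+WHERE _hoodie_commit_time >= '20260501000000000';
+```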
+
+## Advanced Options
+
+### Hadoop Configuration Pass-through
+
+Hadoop filesystem configuration properties can be passed to the Flink writer 
using the `properties.hadoop.*` prefix (or directly as `hadoop.*`):
+
+```sql
+WITH (
+  'connector' = 'hudi',
+  'path' = 's3a://my-bucket/my-table',
+  'properties.hadoop.fs.s3a.access.key' = 'AKID...',
+  'properties.hadoop.fs.s3a.secret.key' = '...'
+)
+```
+
+### Kafka Offset Tracing
+
+The optional `kafka.offset.trace.*` options configure the checkpoint-service-based Kafka offset lookup used in some deployment environments. They are advanced, internal-facing settings and have no functional impact on standard Hudi writes:
+
+| Option Name                              | Default              | Remarks                                                |
+|------------------------------------------|----------------------|--------------------------------------------------------|
+| `kafka.offset.trace.caller.service.name` | `ingestion-rt`       | Caller service name for checkpoint-service RPC headers |
+| `kafka.offset.trace.checkpoint.service`  | `athena-job-manager` | Checkpoint service name                                |
+| `kafka.offset.trace.dc`                  | `(none)`             | Data center for checkpoint offset lookup               |
+| `kafka.offset.trace.env`                 | `(none)`             | Environment for checkpoint offset lookup               |
+| `kafka.offset.trace.job.name`            | `(none)`             | Flink job name for checkpoint offset lookup            |
diff --git a/website/docs/key_generation.md b/website/docs/key_generation.md
index 3a7b109c3363..5f00956cefdb 100644
--- a/website/docs/key_generation.md
+++ b/website/docs/key_generation.md
@@ -2,7 +2,7 @@
 title: Key Generation
 summary: "In this page, we describe key generation in Hudi."
 toc: true
-last_modified_at:
+last_modified_at: 2026-05-27T00:00:00-00:00
 ---
 
 Hudi needs some way to point to records in the table, so that base/log files 
can be merged efficiently for updates/deletes, 
@@ -210,6 +210,35 @@ Partition path generated from key generator: "2020040118"
 Input field value: "20200401" <br/>
 Partition path generated from key generator: "04/01/2020"
 
+## Slash-Separated Date Partitioning
+
+By default, Hudi writes date-valued partition paths as a flat string (e.g. 
`2024-03-15`).
+When `hoodie.datasource.write.slash.separated.date.partitioning` is set to 
`true`, partition field
+values in `yyyy-MM-dd` format are stored as `yyyy/MM/dd` directory hierarchies 
(e.g. `2024/03/15`).
+
+| Config Name                                                 | Default | Description |
+|-------------------------------------------------------------|---------|-------------|
+| `hoodie.datasource.write.slash.separated.date.partitioning` | `false` | When `true`, transforms date partition values from `yyyy-MM-dd` into `yyyy/MM/dd` directory paths. Cannot be used together with hive-style partitioning (`hoodie.datasource.write.hive_style_partitioning=true`). |
+
+Example:
+
+```scala
+df.write.format("hudi")
+  .option("hoodie.datasource.write.partitionpath.field", "event_date")
+  .option("hoodie.datasource.write.slash.separated.date.partitioning", "true")
+  .option("hoodie.table.name", tableName)
+  .mode("append")
+  .save(basePath)
+```
+
+A record with `event_date = "2024-03-15"` will be stored under 
`basePath/2024/03/15/` instead of
+`basePath/2024-03-15/`.
+
+:::note
+`SHOW PARTITIONS` in Spark SQL correctly handles slash-separated date 
partition paths: it displays
+the value in `yyyy-MM-dd` form (normalizing the `/` separators back to `-`) 
for readability.
+:::
+
 ## Related Resources
 
 <h3>Blogs</h3>
diff --git a/website/docs/lance_file_format.md 
b/website/docs/lance_file_format.md
index 51269107f188..9754a4e31797 100644
--- a/website/docs/lance_file_format.md
+++ b/website/docs/lance_file_format.md
@@ -1,18 +1,28 @@
 ---
 title: "Lance File Format"
 keywords: [ hudi, lance, file format, vector, AI, ML, columnar, ANN, indexing]
-summary: "Use the Lance columnar file format with Hudi for vector-optimized 
storage, ANN indexing, and efficient ML workloads"
+summary: "Use the Lance columnar file format with Hudi for vector-friendly 
storage and efficient ML workloads"
 toc: true
-last_modified_at: 2026-04-25T00:00:00-00:00
+last_modified_at: 2026-05-27T00:00:00-00:00
 ---
 
 [Lance](https://lancedb.github.io/lance/) is a modern columnar data format 
designed for AI and machine learning
 workloads. Hudi's pluggable storage architecture lets you use Lance as the 
base file format alongside Parquet
 and ORC, unlocking vector indexing, fast random access, and optimized 
high-dimensional array storage.
 
+:::caution Engine Support
+Lance file format support is **Spark-only**. Attempting to read a Lance-backed 
table from Flink or Hive throws a
+`HoodieValidationException`:
+> Lance base file format is currently only supported with the Spark engine. 
Please use Parquet, ORC, or HFile
+> for non-Spark engines (Flink, Hive, Presto, Trino).
+
+The Lance JAR is **not bundled** in the Hudi distribution — you must add it to 
your Spark classpath
+(see [Required Dependencies](#required-dependencies)).
+:::
+
 ## Enabling Lance in Hudi
 
-### Table Creation
+### Table Creation (COW)
 
 Set the base file format to `lance` in table properties:
 
@@ -26,7 +36,26 @@ TBLPROPERTIES (
     primaryKey = 'id',
     type = 'cow',
     hoodie.record.merger.impls = 'org.apache.hudi.DefaultSparkRecordMerger',
-    hoodie.datasource.write.base.file.format = 'lance'
+    hoodie.table.base.file.format = 'lance'
+);
+```
+
+### Table Creation (MOR)
+
+Lance base files also work with MOR tables: Lance files act as base files while Avro log files capture
+incremental changes. Compaction merges the log files back into Lance base files.
+
+```sql
+CREATE TABLE my_ai_table_mor (
+    id        STRING,
+    embedding VECTOR(768),
+    metadata  STRING
+) USING hudi
+TBLPROPERTIES (
+    primaryKey = 'id',
+    type = 'mor',
+    hoodie.record.merger.impls = 'org.apache.hudi.DefaultSparkRecordMerger',
+    hoodie.table.base.file.format = 'lance'
 );
 ```
 
@@ -39,18 +68,21 @@ TBLPROPERTIES (
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.record.merger.impls",
            "org.apache.hudi.DefaultSparkRecordMerger")
-   .option("hoodie.datasource.write.base.file.format", "lance")
+   .option("hoodie.table.base.file.format", "lance")
    .mode("overwrite")
    .save("/path/to/my_ai_table"))
 ```
 
 ### Required Dependencies
 
-Add the Lance Spark bundle to your Spark classpath:
+The Lance JAR is not bundled in Hudi. Add the appropriate Lance Spark bundle 
JAR to your Spark classpath:
 
-| Component | Maven Coordinates |
-|:----------|:-----------------|
-| Lance Spark Bundle (Spark 3.5) | 
`org.lance:lance-spark-bundle-3.5_2.12:0.4.0` |
+| Spark Version | Bundle JAR (Maven Central) |
+|:--------------|:---------------------------|
+| Spark 3.4 | `org.lance:lance-spark-bundle-3.4_2.12:0.4.0` |
+| Spark 3.5 | `org.lance:lance-spark-bundle-3.5_2.12:0.4.0` |
+| Spark 4.0 | `org.lance:lance-spark-bundle-4.0_2.13:0.4.0` |
+| Spark 4.1 | `org.lance:lance-spark-bundle-4.1_2.13:0.4.0` |
 
 ```bash
 export LANCE_BUNDLE_JAR=/path/to/lance-spark-bundle-3.5_2.12-0.4.0.jar
@@ -74,7 +106,6 @@ file-level storage:
 │  (same Hudi concepts as Parquet)  │
 ├───────────────────────────────────┤
 │     Lance Data Files (.lance)     │
-│  IVF-PQ vector index              │
 │  Columnar storage                 │
 │  Fragment-based layout            │
 ├───────────────────────────────────┤
@@ -87,11 +118,49 @@ All Hudi table services work with Lance-backed tables:
 - **Compaction** — merges log files into Lance base files
 - **Clustering** — reorganizes Lance files for better data locality
 - **Cleaning** — removes old Lance file versions
-- **Metadata indexing** — column stats and bloom filters work across Lance 
files
+- **Metadata indexing** — bloom filters work across Lance files; column stats 
and partition stats are
+  **automatically disabled** for Lance tables
+
+## VECTOR Storage on Lance
+
+VECTOR columns are stored natively in Lance as `FixedSizeList<Float32/Float64, 
dim>` — Lance's own
+vector column encoding, so embeddings are written without conversion overhead 
at the file-format
+layer.
+
+Only **FLOAT** and **DOUBLE** element types are supported as VECTOR columns on 
Lance. INT8 vectors
+are not yet supported and will fail fast at write time.
+
+See [Vector Search](vector_search.md) for the `hudi_vector_search` TVF that 
queries VECTOR columns.
+
+## BLOB Columns on Lance
+
+INLINE BLOB columns on Lance default to `DESCRIPTOR` read mode — standard 
queries return an
+out-of-line-shaped reference descriptor rather than materializing the raw 
bytes. To read inline
+byte content via `read_blob()`, set `hoodie.read.blob.inline.mode=CONTENT`. See
+[Unstructured Data](blob_unstructured_data.md) for full documentation.
+
+## Schema Evolution
+
+Lance supports the following schema changes at the Hudi layer:
+
+| Operation | Supported? |
+|:----------|:-----------|
+| Add column | Yes |
+| Rename column | Yes (via Hudi schema evolution) |
+| Promote `FLOAT` → `DOUBLE` | **No** — not supported on Lance |
+| Promote `FLOAT` → `STRING` | **No** — not supported on Lance |
+| Drop column | Yes |
+
+:::caution
+`FLOAT → DOUBLE` and `FLOAT → STRING` type promotions are supported for 
Parquet tables but **not**
+for Lance. Attempting these on a Lance table will fail. Use `DOUBLE` from the 
start if you anticipate
+needing higher precision.
+:::
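+
+The supported operations in the table can be sketched in Spark SQL as follows. The example reuses the `my_ai_table_mor` table from above; the new column names are hypothetical, and enabling `hoodie.schema.on.read.enable` for rename and drop is an assumption carried over from Hudi's general schema evolution behavior rather than something this page states for Lance.
+
+```sql
+-- Assumed prerequisite for rename/drop via Hudi schema evolution
+SET hoodie.schema.on.read.enable=true;
+
+-- Add column
+ALTER TABLE my_ai_table_mor ADD COLUMNS (source STRING);
+
+-- Rename column
+ALTER TABLE my_ai_table_mor RENAME COLUMN metadata TO doc_metadata;
+
+-- Drop column
+ALTER TABLE my_ai_table_mor DROP COLUMN source;
+```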
 
 ## Vector Search with Lance
 
-The `hudi_vector_search` TVF leverages Lance's built-in IVF-PQ index for 
approximate nearest neighbor search:
+Use the `hudi_vector_search` TVF to run vector similarity queries against 
VECTOR columns on a
+Lance-backed table:
 
 ```sql
 SELECT id, metadata, _hudi_distance
@@ -107,10 +176,38 @@ See [Vector Search](vector_search.md) for full 
documentation on the TVF and dist
 
 ## Configuration Reference
 
-| Property | Description | Default |
-|:---------|:------------|:--------|
-| `hoodie.datasource.write.base.file.format` | Set to `lance` to use Lance as 
the base file format | `parquet` |
-| `hoodie.record.merger.impls` | Must be 
`org.apache.hudi.DefaultSparkRecordMerger` for Lance | — |
+| Property | Default | Description |
+|:---------|:--------|:------------|
+| `hoodie.table.base.file.format` | `parquet` | Set to `lance` to use Lance as 
the base file format. |
+| `hoodie.record.merger.impls` | — | Must be 
`org.apache.hudi.DefaultSparkRecordMerger` for Lance. |
+| `hoodie.lance.max.file.size` | `125829120` (120 MiB) | Target file size in 
bytes for Lance base files. |
+| `hoodie.lance.write.allocator.size.bytes` | `268435456` (256 MiB) | Maximum 
size of the Arrow child allocator used for buffering in-flight batch data. 
Increase for tables with very large BLOB columns. |
+| `hoodie.lance.write.flush.byte.watermark` | `100663296` (96 MiB) | Byte-size 
threshold at which the current write batch is flushed. Must be less than 
`hoodie.lance.write.allocator.size.bytes`. |
+
+### File Sizing and Memory
+
+The three sizing configs work together:
+
+- **`hoodie.lance.max.file.size`** controls when Hudi rolls over to a new 
Lance file, similar to
+  `hoodie.parquet.max.file.size` for Parquet tables.
+- **`hoodie.lance.write.allocator.size.bytes`** caps the Arrow allocator's 
in-flight memory. Arrow
+  uses power-of-2 buffer doubling; the default 256 MiB accommodates the 128 
MiB doubling step with
+  headroom.
+- **`hoodie.lance.write.flush.byte.watermark`** triggers an early batch flush 
when Arrow buffers
+  approach the cap. The default 96 MiB (≈ 3/8 of the allocator cap) leaves 
room for offset and
+  validity buffers to double without exceeding the allocator limit.
+
+For tables with large BLOB columns, increase both 
`hoodie.lance.write.allocator.size.bytes` and
+`hoodie.lance.write.flush.byte.watermark` proportionally (keep watermark at 
roughly 3/8 of allocator
+size).
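+
+As a worked example of the proportional rule: the defaults have a ratio of 96 MiB / 256 MiB = 3/8, so doubling the allocator to 512 MiB (536870912 bytes) gives a watermark of 3/8 × 512 MiB = 192 MiB (201326592 bytes). The sketch below assumes these writer configs are supplied as Spark SQL session settings; they may also be passed as write options.
+
+```sql
+-- 512 MiB allocator cap (2x the 256 MiB default)
+SET hoodie.lance.write.allocator.size.bytes=536870912;
+-- 192 MiB flush watermark = 3/8 of the allocator cap
+SET hoodie.lance.write.flush.byte.watermark=201326592;
+```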
+
+## Additional Notes
+
+- **`populateMetaFields=false`** is supported. User-defined key generators 
work normally with Lance
+  tables.
+- **Complex types** (struct, array, map) are supported as Lance columns.
+- **VARIANT columns** are **not supported** on Lance. Attempting to write a 
table with VARIANT columns
+  to Lance throws a `HoodieNotSupportedException`. Use Parquet for tables with 
VARIANT columns.
 
 ## Mixed-Format Tables
 
diff --git a/website/docs/metadata.md b/website/docs/metadata.md
index c5a01b42dfa9..9c108c815cdd 100644
--- a/website/docs/metadata.md
+++ b/website/docs/metadata.md
@@ -89,6 +89,15 @@ If you turn off the metadata table after enabling, be sure 
to wait for a few com
 cleaned up, before re-enabling the metadata table again.
 :::
 
+### Auto-Delete of Disabled MDT Partitions
+
+When an index is disabled in the write config, Hudi automatically deletes the 
corresponding metadata table partition.
+Available since Hudi 1.2.0, this behavior is configurable.
+
+| Config Name | Default | Description |
+|---|---|---|
+| `hoodie.metadata.auto.delete.partitions` | `true` | When enabled (default), 
metadata table partitions (indexes) that are disabled in the write config are 
automatically deleted. Set to `false` to prevent accidental deletion in 
multi-writer environments where not all writers may have the same config — 
users must then drop indexes explicitly via Hudi CLI or `DROP INDEX`. |
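+
+For a multi-writer table, the explicit workflow might look like the sketch below. The table and index names are hypothetical, and supplying the config as a Spark SQL session setting is an assumption; it may also be passed as a writer option.
+
+```sql
+-- Keep MDT partitions even if some writer omits the index config
+SET hoodie.metadata.auto.delete.partitions=false;
+
+-- Later, drop the index deliberately
+DROP INDEX idx_rider ON hudi_table;
+```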
+
 ## Leveraging metadata during queries
 
 ### files index
@@ -129,6 +138,28 @@ can bring up the writers sequentially after stopping the 
writers for enabling me
 configurations to only a subset of writers or table services is unsafe and can 
lead to loss of data. So, please ensure you enable 
 metadata table across all writers.
 
+## MDT Cleaner and Compaction
+
+Hudi 1.2.0 introduced a config that lets the metadata table's cleaner derive 
its retention policy directly from the
+data table, rather than requiring a separate configuration.
+
+| Config Name | Default | Description |
+|---|---|---|
+| `hoodie.metadata.derive.from.datatable.clean.policy` | `true` | When 
enabled, the metadata table's cleaner uses the same cleaning policy (retention 
count, hours, etc.) as the data table. See also 
[cleaning](cleaning.md#mdt-cleaner-inherits-data-table-policy). |
+
+The metadata table's compaction and log compaction can also be delegated to an 
external table service platform. See
+[compaction](compaction.md#delegating-mdt-compaction-to-an-external-platform) 
for the full config reference.
+
+## Timeline Archival Controls
+
+Hudi 1.2.0 added two configs in `HoodieArchivalConfig` to fine-tune how the 
timeline manifest and archival interact
+with the most recent clean.
+
+| Config Name | Default | Description |
+|---|---|---|
+| `hoodie.timeline.manifest.retained.versions` | `3` | Number of timeline 
manifest file versions to retain. Older manifest versions are pruned during 
archival. |
+| `hoodie.archive.block.on.latest.clean.ectr` | `false` | When enabled, 
archival stops at the Earliest Commit To Retain (ECTR) from the last completed 
clean. This prevents archiving commits whose data files still exist on storage, 
avoiding inconsistencies between the timeline and actual data. |
+
 ## Related Resources
 <h3>Blogs</h3>
 * [Table service deployment models in Apache 
Hudi](https://medium.com/@simpsons/table-service-deployment-models-in-apache-hudi-9cfa5a44addf)
diff --git a/website/docs/metadata_indexing.md 
b/website/docs/metadata_indexing.md
index 7056a1e02671..dbdd523df60d 100644
--- a/website/docs/metadata_indexing.md
+++ b/website/docs/metadata_indexing.md
@@ -2,7 +2,7 @@
 title: Indexing
 summary: "In this page, we describe how to run metadata indexing 
asynchronously."
 toc: true
-last_modified_at:
+last_modified_at: 2026-05-27T00:00:00-00:00
 ---
 
 Hudi maintains a scalable [metadata](metadata.md) that has some auxiliary data 
about the table.
@@ -36,7 +36,7 @@ For more information on these indexes please refer [metadata 
section](metadata/#
 :::note
 Please note in order to create secondary index:
 1. The table must have a primary key and merge mode should be 
[COMMIT_TIME_ORDERING](record_merger.md#commit_time_ordering).
-2. Record index must be enabled. This can be done by setting 
`hoodie.metadata.record.index.enable=true` and then creating `record_index`. 
Please note the example below.
+2. Record index must be enabled. This can be done by setting 
`hoodie.metadata.global.record.level.index.enable=true` and then creating 
`record_index`. Please note the example below.
 :::
 
 **Examples**
@@ -73,8 +73,8 @@ hoodie.metadata.index.column.stats.enable=true
 -- [Optional Configs] - list of columns to index on. By default all columns 
are indexed
 hoodie.metadata.index.column.stats.column.list=col1,col2,...
 
--- [Required Configs] Record Level Index
-hoodie.metadata.record.index.enable=true
+-- [Required Configs] Record Level Index (Global RLI — single record key 
unique across all partitions)
+hoodie.metadata.global.record.level.index.enable=true
 
 -- [Required Configs] Bloom filter Index
 hoodie.metadata.index.bloom.filter.enable=true
@@ -116,7 +116,7 @@ inserts.write.format("hudi").
   
 // Create record index and secondary index for the table
 spark.sql(s"CREATE TABLE test_table_external USING hudi LOCATION '$basePath'")
-spark.sql(s"SET hoodie.metadata.record.index.enable=true")
+spark.sql(s"SET hoodie.metadata.global.record.level.index.enable=true")
 spark.sql(s"CREATE INDEX record_index ON test_table_external (uuid)")
 spark.sql(s"CREATE INDEX idx_rider ON test_table_external (rider)")
 spark.sql(s"SHOW INDEXES FROM hudi_indexed_table").show(false)
@@ -191,6 +191,24 @@ Enabling the metadata table and configuring a lock 
provider are the prerequisite
 configuration below.
 :::
 
+#### Record-Level Index Configuration Keys
+
+Hudi supports two flavors of the Record Level Index, each with its own enable 
flag and sizing configs:
+
+- **Global RLI** — record key is unique across the entire table (across 
partitions).
+- **Partitioned RLI** — `partition_path + record_key` is unique within each 
partition.
+
+| Config Name | Default | Notes |
+|---|---|---|
+| `hoodie.metadata.global.record.level.index.enable` | `false` | Enables the 
global RLI. |
+| `hoodie.metadata.global.record.level.index.min.filegroup.count` | `10` | Min 
file groups for the global RLI. |
+| `hoodie.metadata.global.record.level.index.max.filegroup.count` | `10000` | 
Max file groups for the global RLI. |
+| `hoodie.metadata.record.level.index.enable` | `false` | Enables the 
partitioned RLI. Independent toggle from the global RLI above. |
+| `hoodie.metadata.record.level.index.min.filegroup.count` | `1` | Min file 
groups for the partitioned RLI. |
+| `hoodie.metadata.record.level.index.max.filegroup.count` | `10` | Max file 
groups for the partitioned RLI. |
+| `hoodie.metadata.record.level.index.defer.init` | `false` | When enabled, 
defers RLI initialization to the second commit on a fresh table so Hudi can 
size file groups based on actual record volume. Applies to both global and 
partitioned RLI. |
+| `hoodie.metadata.record.index.max.filegroup.size` | `1073741824` (1 GB) | 
Maximum size in bytes of a single RLI file group. Larger file groups take 
longer to compact. |
+
 ```
 # ensure that async indexing is enabled
 hoodie.metadata.index.async=true
@@ -284,7 +302,10 @@ indexer logs, we would find that it indeed caught up with 
instant `2022041419542
 
 ### Drop Index
 
-To drop an index, just run the index in `dropindex` mode.
+To drop an index, run the indexer in `dropindex` mode. As of Hudi 1.2.0, when an index is disabled in
+the write config, Hudi automatically drops its metadata table partition by default; see
+[`hoodie.metadata.auto.delete.partitions`](metadata.md#auto-delete-of-disabled-mdt-partitions) to control this
+behavior.
 
 ```
 spark-submit \
diff --git a/website/docs/metrics.md b/website/docs/metrics.md
index ec94d21aa45a..151d27f5d319 100644
--- a/website/docs/metrics.md
+++ b/website/docs/metrics.md
@@ -3,7 +3,7 @@ title: Metrics
 keywords: [ hudi, administration, operation, devops, metrics]
 summary: This section offers an overview of metrics in Hudi
 toc: true
-last_modified_at: 2020-06-20T15:59:57-04:00
+last_modified_at: 2026-05-27T00:00:00-00:00
 ---
 
 In this section, we will introduce the `MetricsReporter` and `HoodieMetrics` 
in Hudi. You can view the metrics-related configurations 
[here](configurations.md#METRICS).
@@ -204,29 +204,46 @@ These `HoodieMetrics` can then be plotted on a standard 
tool like grafana. Below
 
 ## List of metrics:
 
-The below metrics are available in all timeline operations that involves a 
commit such as deltacommit, compaction, clustering and rollback.
+The metrics below are emitted across timeline operations (deltacommit, compaction, clustering, rollback, clean, archival) and post-commit callbacks. When `hoodie.metrics.reporter.metricsname.prefix` is set, every metric is reported as `<prefix>.<name>`.
 
-Name  |  Description
+Name | Description
 --- | ---
-commitFreshnessInMs | Milliseconds from the commit end time and the maximum 
event time of the incoming records
-commitLatencyInMs | Milliseconds from the commit end time and the minimum 
event time of incoming records
-commitTime  | Time of commit in epoch milliseconds
-duration  | Total time taken for the commit/rollback in milliseconds
-numFilesDeleted | Number of files deleted during a clean/rollback
-numFilesFinalized | Number of files finalized in a write
-totalBytesWritten | Bytes written in a HoodieCommit
-totalCompactedRecordsUpdated  | Number of records updated in a compaction 
operation
-totalCreateTime | Time taken for file creation during a Hoodie Insert operation
-totalFilesInsert  | Number of newly written files in a HoodieCommit
-totalFilesUpdate  | Number of files updated in a HoodieCommit
-totalInsertRecordsWritten | Number of records inserted or converted to 
updates(for small file handling) in a HoodieCommit
-totalLogFilesCompacted  | Number of log files under a base file in a file 
group compacted
-totalLogFilesSize | Total size in bytes of all log files under a base file in 
a file group
-totalPartitionsWritten  | Number of partitions that took writes in a 
HoodieCommit
-totalRecordsWritten | Number of records written in a HoodieCommit. For 
inserts, it is the total numbers of records inserted. And for updates, it the 
total number of records in the file.
-totalScanTime | Time taken for reading and merging logblocks in a log file
-totalUpdateRecordsWritten | Number of records that got changed in a 
HoodieCommit
-totalUpsertTime | Time taken for Hoodie Merge
-
-These metrics can be found at org.apache.hudi.metrics.HoodieMetrics and 
referenced from 
-org.apache.hudi.common.model.HoodieCommitMetadata and 
org.apache.hudi.common.model.HoodieWriteStat
+commitFreshnessInMs | Milliseconds between the commit end time and the maximum event time of the incoming records.
+commitLatencyInMs | Milliseconds between the commit end time and the minimum event time of the incoming records.
+commitTime | Time of commit in epoch milliseconds.
+duration | Total time taken for the commit/rollback in milliseconds.
+numFilesDeleted | Number of files deleted during a clean/rollback.
+numFilesFinalized | Number of files finalized in a write.
+totalBytesWritten | Bytes written in a HoodieCommit.
+totalCompactedRecordsUpdated | Number of records updated in a compaction 
operation.
+totalCreateTime | Time taken for file creation during a Hoodie Insert 
operation.
+totalFilesInsert | Number of newly written files in a HoodieCommit.
+totalFilesUpdate | Number of files updated in a HoodieCommit.
+totalInsertRecordsWritten | Number of records inserted or converted to updates 
(for small file handling) in a HoodieCommit.
+totalLogFilesCompacted | Number of log files under a base file in a file group 
compacted.
+totalLogFilesSize | Total size in bytes of all log files under a base file in 
a file group.
+totalPartitionsWritten | Number of partitions that took writes in a 
HoodieCommit.
+totalRecordsWritten | Number of records written in a HoodieCommit. For 
inserts, the total records inserted; for updates, the total records in the file.
+totalScanTime | Time taken for reading and merging log blocks in a log file.
+totalUpdateRecordsWritten | Number of records that got changed in a 
HoodieCommit.
+totalUpsertTime | Time taken for Hoodie Merge.
+clean.duration | Wall-clock time in milliseconds for a clean operation.
+archive.duration | Wall-clock time in milliseconds for an archive operation.
+rollback.failure.counter | Incremented each time a rollback operation fails.
+postCommit.success.counter | Incremented each time all post-commit callbacks 
succeed.
+postCommit.failure.counter | Incremented each time a post-commit callback 
fails (post-commit failures are non-fatal).
+postCommit.duration | Wall-clock time in milliseconds for post-commit callback 
execution.
+archival.archivalNumAllCommits | Total number of instants archived in this 
archival run.
+archival.archivalNumWriteCommits | Number of write instants (commit, 
deltacommit, replacecommit) archived.
+archival.archivalNumCleanCommits | Number of clean instants archived.
+archival.archivalNumRollbackCommits | Number of rollback instants archived.
+archival.archivalStatus | `1` if archival succeeded, `-1` if it failed.
+archival.archivalFailure.\<ExceptionClassName\> | Incremented on archival 
failure; the suffix is the simple class name of the exception thrown.
+archival.archivalOutOfMemory | Incremented when archival fails due to an 
`OutOfMemoryError`.
+\<action\>.totalCorruptedLogBlocks | Number of corrupted log blocks 
encountered during compaction. Reported only when 
`hoodie.metricscompaction.log.blocks.on=true`. `<action>` is the commit action 
type (e.g., `commit`).
+\<action\>.totalRollbackLogBlocks | Number of rollback log blocks encountered 
during compaction. Reported only when 
`hoodie.metricscompaction.log.blocks.on=true`.
+\<action\>.totalLogBlocksCompacted | Total number of log blocks compacted. 
Reported only when `hoodie.metricscompaction.log.blocks.on=true`.
+
+These metrics live in `org.apache.hudi.metrics.HoodieMetrics` (with 
archival-specific names sourced from 
`org.apache.hudi.client.utils.ArchivalMetrics`) and are referenced from 
`org.apache.hudi.common.model.HoodieCommitMetadata` and 
`org.apache.hudi.common.model.HoodieWriteStat`.
+
+In multi-tenant deployments where a single Spark job writes to multiple Hudi 
tables, each table gets its own isolated `MetricRegistry`, scoped as 
`<tableName>.<registryName>` so metrics from different tables do not collide. 
No configuration is required.
diff --git a/website/docs/overview.mdx b/website/docs/overview.mdx
index 46c2d5b4be1c..ed439cef38a1 100644
--- a/website/docs/overview.mdx
+++ b/website/docs/overview.mdx
@@ -59,10 +59,10 @@ If you want to experience Apache Hudi integrated into an 
end to end demo with Ka
 
 Hudi brings first-class support for AI and unstructured data workloads to the 
data lakehouse:
 
-- **[VECTOR type & Similarity Search](vector_search.md)** — Store embeddings 
and run approximate nearest neighbor search directly in Spark SQL
+- **[VECTOR type & Similarity Search](vector_search.md)** — Store embeddings 
and run vector similarity search directly in Spark SQL
 - **[BLOB type for Unstructured Data](blob_unstructured_data.md)** — Store 
images, PDFs, audio, and other binary data with inline or out-of-line storage
 - **[VARIANT type for Semi-Structured Data](variant_type.md)** — Store 
flexible JSON-like data (LLM outputs, model metadata, feature maps) without 
rigid schemas
-- **[Lance File Format](lance_file_format.md)** — Vector-optimized columnar 
format with built-in ANN indexing
+- **[Lance File Format](lance_file_format.md)** — Vector-friendly columnar 
format for AI/ML workloads
 
 See the full [AI-Native Lakehouse Overview](ai_overview.md) for use cases and 
architecture.
 
diff --git a/website/docs/precommit_validator.md 
b/website/docs/precommit_validator.md
index fe0a0dd77605..d67d9ec3a054 100644
--- a/website/docs/precommit_validator.md
+++ b/website/docs/precommit_validator.md
@@ -109,6 +109,73 @@ Hudi offers a [commit notification 
service](platform_services_post_commit_callba
 
 The commit notification service can be combined with pre-commit validators to 
send a notification when a commit fails a validation. This is possible by 
passing details about the validation as a custom value to the HTTP endpoint.
 
+## Notes on Validator Behavior
+
+Hudi 1.2.0 introduced the following behavioral refinements:
+
+**Metadata fields in SQL queries**: Validator SQL can now reference Hudi 
metadata fields (`_hoodie_record_key`, `_hoodie_partition_path`, 
`_hoodie_file_name`, `_hoodie_commit_time`, `_hoodie_commit_seqno`) directly in 
query expressions.
+
+**Empty writes**: Empty write commits no longer cause pre-commit validators to 
error. Validators are skipped gracefully when no records are present in the 
write.
+
+## Failure Policy
+
+Hudi 1.2.0 introduces a configurable failure policy for pre-commit validators:
+
+| Config Key | Default | Description |
+|---|---|---|
+| `hoodie.precommit.validators.failure.policy` | `FAIL` | How to handle 
validator failures. `FAIL`: block the commit with an exception. `WARN_LOG`: 
emit a warning log but allow the commit to proceed (useful for soft 
monitoring). |
+
+## Flink and Streaming-Offset Validators
+
+Available since Hudi 1.2.0. Flink writers now honor 
`hoodie.precommit.validators` using the same configuration key as Spark. 
Validators intended for use with Flink must extend the engine-agnostic 
`org.apache.hudi.client.validator.BasePreCommitValidator` (in `hudi-common`), 
which provides access to commit metadata and timeline information independently 
of Spark.
+
+Two built-in streaming-offset validators are now available for Kafka-sourced 
pipelines:
+
+| Validator Class | Engine | Description |
+|---|---|---|
+| `org.apache.hudi.sink.validator.FlinkKafkaOffsetValidator` | Flink | 
Validates that the number of records written matches the Kafka offset 
difference for the batch |
+| `org.apache.hudi.utilities.streamer.validator.SparkKafkaOffsetValidator` | 
Spark / HoodieStreamer | Same semantics for Spark-based Kafka ingestion 
pipelines |
+
+Both validators use the following configuration:
+
+| Config Key | Default | Description |
+|---|---|---|
+| `hoodie.precommit.validators.streaming.offset.tolerance.percentage` | `0.0` 
| Tolerance percentage for offset-based record-count validation. A value of 
`0.0` requires an exact match between expected records (from Kafka offset 
delta) and actual records written. For upsert workloads with deduplication, set 
a higher tolerance (e.g., `10.0` for 10%). |
+| `hoodie.precommit.validators.failure.policy` | `FAIL` | See [Failure 
Policy](#failure-policy) above. |
+
+Example (Flink):
+```properties
+hoodie.precommit.validators=org.apache.hudi.sink.validator.FlinkKafkaOffsetValidator
+hoodie.precommit.validators.streaming.offset.tolerance.percentage=5.0
+hoodie.precommit.validators.failure.policy=WARN_LOG
+```
+
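The interaction between the offset tolerance and the failure policy can be sketched as follows. This is a minimal, self-contained model and not Hudi's validator code: the class `OffsetToleranceSketch`, its method names, and the choice to measure deviation relative to the Kafka offset delta are assumptions made for illustration, and `IllegalStateException` stands in for Hudi's own validation exception.

```java
import java.util.logging.Logger;

/** Illustrative model only; names and the deviation formula are assumptions. */
final class OffsetToleranceSketch {

  /** Mirrors the two documented values of hoodie.precommit.validators.failure.policy. */
  enum FailurePolicy { FAIL, WARN_LOG }

  private static final Logger LOG = Logger.getLogger("OffsetToleranceSketch");

  /** Deviation of written records from the Kafka offset delta, as a percentage of the delta. */
  static double deviationPercent(long expectedFromOffsets, long actualWritten) {
    if (expectedFromOffsets == 0) {
      // No offsets consumed: any written record is an unbounded deviation.
      return actualWritten == 0 ? 0.0 : Double.POSITIVE_INFINITY;
    }
    return Math.abs(expectedFromOffsets - actualWritten) * 100.0 / expectedFromOffsets;
  }

  /** Returns true when the commit may proceed; throws when the policy is FAIL. */
  static boolean validate(long expected, long actual, double tolerancePercent, FailurePolicy policy) {
    double deviation = deviationPercent(expected, actual);
    if (deviation <= tolerancePercent) {
      return true;
    }
    String message = String.format(
        "Record count %d deviates %.2f%% from offset delta %d (tolerance %.2f%%)",
        actual, deviation, expected, tolerancePercent);
    if (policy == FailurePolicy.WARN_LOG) {
      LOG.warning(message); // soft monitoring: log and let the commit proceed
      return true;
    }
    throw new IllegalStateException(message); // stand-in for a validation exception
  }

  public static void main(String[] args) {
    // 950 of 1000 expected records is a 5% deviation, inside a 5.0 tolerance.
    System.out.println(validate(1000, 950, 5.0, FailurePolicy.FAIL));
    // A 10% deviation only logs a warning under WARN_LOG.
    System.out.println(validate(1000, 900, 5.0, FailurePolicy.WARN_LOG));
  }
}
```

With a tolerance of `0.0`, any mismatch between the offset delta and the written record count fails the check, which is why upsert workloads with deduplication need a non-zero tolerance.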
+## Pre-Write Validators
+
+Introduced in Hudi 1.2.0, pre-write validators run **before** data is written 
to storage, in contrast to pre-commit validators which run **after** data is 
written but before the commit is published to the timeline. This enables 
earlier rejection of invalid operations, avoiding unnecessary I/O.
+
+Configuration:
+
+| Config Key | Default | Description |
+|---|---|---|
+| `hoodie.prewrite.validators` | `""` | Comma-separated list of 
fully-qualified class names implementing 
`org.apache.hudi.client.validator.PreWriteValidator`. |
+
+To implement a custom pre-write validator, implement the 
`org.apache.hudi.client.validator.PreWriteValidator` interface:
+
+```java
+public interface PreWriteValidator {
+  <T> void validate(
+      String instantTime,
+      WriteOperationType writeOperationType,
+      HoodieTableMetaClient metaClient,
+      HoodieWriteConfig writeConfig,
+      HoodieEngineContext engineContext,
+      Option<HoodieData<HoodieRecord<T>>> recordsOpt) throws 
HoodieValidationException;
+}
+```
+
+No built-in pre-write validator implementations are provided yet; this 
framework is designed for custom user extensions. Unlike pre-commit validators, 
pre-write validators have access to the incoming records before any write I/O 
occurs.
+
 ## Related Resources
 
 <h3>Blogs</h3>
diff --git a/website/docs/procedures.md b/website/docs/procedures.md
index 71edf434ed10..94dd41991598 100644
--- a/website/docs/procedures.md
+++ b/website/docs/procedures.md
@@ -2,7 +2,7 @@
 title: SQL Procedures
 summary: "In this page, we introduce how to use SQL procedures with Hudi."
 toc: true
-last_modified_at: 2025-11-24T00:00:00
+last_modified_at: 2026-05-27T00:00:00-00:00
 ---
 import Tabs from '@theme/Tabs';
 import TabItem from '@theme/TabItem';
@@ -298,6 +298,64 @@ call show_archived_commits(table => 'test_hudi_table');
 | 20220216171027021 | 435346              | 1                 | 0              
     | 1                        | 1                     | 0                     
       | 0            |
 | 20220216171019361 | 435349              | 1                 | 0              
     | 1                        | 1                     | 0                     
       | 0            |
 
+### show_timeline
+
+Show timeline entries for a Hudi table. Returns instant-level information for 
all timeline operations (commits, compactions, clustering, clean, rollback, 
etc.) from the active and optionally archived timeline. Results are sorted by 
timestamp descending.
+
+**Input**
+
+| Parameter Name | Type    | Required | Default Value | Description            
                                                          |
+|----------------|---------|----------|---------------|----------------------------------------------------------------------------------|
+| table          | String  | N*       | None          | Hudi table name 
(mutually exclusive with `path`)                                 |
+| path           | String  | N*       | None          | Base path of the Hudi 
table (mutually exclusive with `table`)                    |
+| limit          | Int     | N        | 20            | Max number of timeline 
entries to return (ignored when both `startTime` and `endTime` are set) |
+| showArchived   | Boolean | N        | false         | Whether to include 
archived timeline entries                                     |
+| filter         | String  | N        | ""            | SQL expression to 
filter results on any output column                            |
+| startTime      | String  | N        | ""            | Start timestamp for 
filtering (format: `yyyyMMddHHmmss`, inclusive)              |
+| endTime        | String  | N        | ""            | End timestamp for 
filtering (format: `yyyyMMddHHmmss`, inclusive)                |
+
+\* Either `table` or `path` must be provided.
+
+**Output**
+
+| Output Name    | Type   | Description                                        
                                   |
+|----------------|--------|---------------------------------------------------------------------------------------|
+| instant_time   | String | Requested timestamp of the instant                 
                                   |
+| action         | String | Action type: `commit`, `deltacommit`, 
`compaction`, `clustering`, `clean`, `rollback`, etc. |
+| state          | String | State of the instant: `REQUESTED`, `INFLIGHT`, or 
`COMPLETED`                        |
+| requested_time | String | Wall-clock time when the instant was requested 
(format: `MM-dd HH:mm:ss`)            |
+| inflight_time  | String | Wall-clock time when the instant became inflight 
(format: `MM-dd HH:mm:ss`)          |
+| completed_time | String | Wall-clock time when the instant completed 
(format: `MM-dd HH:mm:ss`), or `null`     |
+| timeline_type  | String | `ACTIVE` or `ARCHIVED`                             
                                   |
+| rollback_info  | String | For rollback instants: what was rolled back; for 
rolled-back instants: which rollback instant rolled them back; otherwise `null` 
|
+
+**Example**
+
+```sql
+-- Show the 20 most recent timeline entries
+call show_timeline(table => 'test_hudi_table');
+
+-- Show up to 50 entries including archived timeline
+call show_timeline(table => 'test_hudi_table', limit => 50, showArchived => 
true);
+
+-- Filter to completed commits in a time range
+call show_timeline(
+  table => 'test_hudi_table',
+  startTime => '20251201000000',
+  endTime => '20251231235959',
+  filter => "action = 'commit' AND state = 'COMPLETED'"
+);
+
+-- Look up by base path instead of table name
+call show_timeline(path => 'hdfs:///user/hive/warehouse/test_hudi_table');
+```
+
+| instant_time      | action | state     | requested_time      | inflight_time 
      | completed_time      | timeline_type | rollback_info |
+|-------------------|--------|-----------|---------------------|---------------------|---------------------|---------------|---------------|
+| 20251205143022001 | commit | COMPLETED | 12-05 14:30:20      | 12-05 
14:30:21      | 12-05 14:30:22      | ACTIVE        | null          |
+| 20251205141510003 | clean  | COMPLETED | 12-05 14:15:09      | 12-05 
14:15:10      | 12-05 14:15:10      | ACTIVE        | null          |
+| 20251205140030002 | commit | COMPLETED | 12-05 14:00:28      | 12-05 
14:00:29      | 12-05 14:00:30      | ACTIVE        | null          |
+
 ### show_commit_files
 
 Show files of a commit.
diff --git a/website/docs/reading_tables_batch_reads.md 
b/website/docs/reading_tables_batch_reads.md
index 0a62f34d55e3..3b1e845e2145 100644
--- a/website/docs/reading_tables_batch_reads.md
+++ b/website/docs/reading_tables_batch_reads.md
@@ -19,6 +19,32 @@ val tripsDF = spark.read.
 tripsDF.where(tripsDF.fare > 20.0).show()
 ```
 
+## Flink Batch (Snapshot) Read
+
+Flink can read a Hudi table as a snapshot (batch) query by leaving 
`read.streaming.enabled` at its default value of `false`.
+
+```sql
+CREATE TABLE hudi_table (
+  uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
+  name VARCHAR(10),
+  age INT,
+  ts TIMESTAMP(3),
+  `partition` VARCHAR(20)
+)
+PARTITIONED BY (`partition`)
+WITH (
+  'connector' = 'hudi',
+  'path' = '${path}',
+  'table.type' = 'MERGE_ON_READ'
+  -- read.streaming.enabled defaults to false → batch/snapshot read
+);
+
+-- Snapshot query
+SELECT * FROM hudi_table WHERE age > 25;
+```
+
+For more Flink read options, see [Using Flink](ingestion_flink.md).
+
 ## Daft
 
 [Daft](https://www.daft.ai/) supports reading Hudi tables using 
`daft.read_hudi()` function.
diff --git a/website/docs/reading_tables_streaming_reads.md 
b/website/docs/reading_tables_streaming_reads.md
index 5055f42c0449..7191cedf0115 100644
--- a/website/docs/reading_tables_streaming_reads.md
+++ b/website/docs/reading_tables_streaming_reads.md
@@ -97,3 +97,47 @@ spark.readStream \
 Spark SQL can be used within ForeachBatch sink to do INSERT, UPDATE, DELETE 
and MERGE INTO.
 Target table must exist before write.
 :::
+
+## Flink Streaming Read
+
+Flink can continuously consume new commits from a Hudi table as a streaming 
source. Enable this by setting `read.streaming.enabled=true` and optionally a 
`read.start-commit`.
+
+```sql
+CREATE TABLE hudi_table (
+  uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
+  name VARCHAR(10),
+  age INT,
+  ts TIMESTAMP(3),
+  `partition` VARCHAR(20)
+)
+PARTITIONED BY (`partition`)
+WITH (
+  'connector' = 'hudi',
+  'path' = '${path}',
+  'table.type' = 'MERGE_ON_READ',
+  'read.streaming.enabled' = 'true',          -- enable streaming read
+  'read.start-commit' = '20210316134557',      -- start from this instant 
(omit for latest)
+  'read.streaming.check-interval' = '60'       -- poll interval in seconds
+);
+
+SELECT * FROM hudi_table;
+```
+
+### Source V2 for Streaming
+
+As of Hudi 1.2.0, the [FLIP-27-based Source 
V2](ingestion_flink.md#flink-source-v2) is available as an opt-in for streaming 
reads. Source V2 participates in Flink's checkpoint protocol for finer-grained 
recovery and supports partition pruning:
+
+```sql
+WITH (
+  'connector' = 'hudi',
+  'path' = '${path}',
+  'read.streaming.enabled' = 'true',
+  'read.source-v2.enabled' = 'true'   -- enable FLIP-27 source (Hudi 1.2.0+)
+)
+```
+
+:::warning
+Savepoints taken with the legacy source are not compatible with Source V2. 
Start a fresh job when switching. See [Flink Source 
V2](ingestion_flink.md#flink-source-v2) for migration details.
+:::
+
+For a full list of Flink streaming read options (rate limiting, commits limit, 
CDC mode, etc.), see [Using Flink](ingestion_flink.md).
diff --git a/website/docs/sql_ddl.md b/website/docs/sql_ddl.md
index d1c5ba865bdb..2fc0d421dab2 100644
--- a/website/docs/sql_ddl.md
+++ b/website/docs/sql_ddl.md
@@ -2,7 +2,7 @@
 title: SQL DDL
 summary: "In this page, we discuss using SQL DDL commands with Hudi"
 toc: true
-last_modified_at: 
+last_modified_at: 2026-05-27T00:00:00-00:00
 ---
 import Tabs from '@theme/Tabs';
 import TabItem from '@theme/TabItem';
@@ -696,6 +696,14 @@ SHOW PARTITIONS hudi_table;
 ALTER TABLE hudi_table DROP PARTITION (dt='2021-12-09', hh='10');
 ```
 
+:::note Slash-separated date partitioning and SHOW PARTITIONS
+When a table is written with 
`hoodie.datasource.write.slash.separated.date.partitioning=true`, the
+physical directory layout uses `yyyy/MM/dd` paths. `SHOW PARTITIONS` correctly 
handles this: it
+returns partition values in the standard `col=yyyy-MM-dd` display format, 
normalizing the `/`
+separators back to `-` for readability. See [Key 
Generation](key_generation.md#slash-separated-date-partitioning)
+for details on configuring slash-separated partitioning.
+:::
+
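The normalization described in the note above can be pictured with a small standalone sketch. It is not Hudi's implementation: the class `PartitionDisplaySketch` and the method `toDisplayValue` are hypothetical names, and the sketch assumes a single date partition column laid out as exactly three path segments.

```java
/** Illustrative only: maps a yyyy/MM/dd directory path to a col=yyyy-MM-dd display value. */
final class PartitionDisplaySketch {

  static String toDisplayValue(String column, String physicalPath) {
    String[] parts = physicalPath.split("/");
    if (parts.length != 3) {
      throw new IllegalArgumentException("Expected yyyy/MM/dd but got: " + physicalPath);
    }
    // Join year, month, and day with '-' and prefix the partition column name.
    return column + "=" + String.join("-", parts);
  }

  public static void main(String[] args) {
    System.out.println(toDisplayValue("dt", "2021/12/09"));
  }
}
```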
 ### Show and drop index
 
 **Syntax**
@@ -837,6 +845,39 @@ WITH (
 );
 ```
 
+### Create Append-Only Table Without Primary Key
+
+Hudi 1.2.0 supports creating a Flink table **without a `PRIMARY KEY`** for 
pure append workloads.
+In this mode, set `write.operation` to `insert`; Hudi will not enforce 
record-level uniqueness and
+the record-key and ordering fields are optional.
+
+```sql
+-- Append-only table: no PRIMARY KEY required
+CREATE TABLE hudi_append_table (
+  id      BIGINT,
+  name    STRING,
+  ts      BIGINT,
+  city    STRING
+)
+PARTITIONED BY (`city`)
+WITH (
+  'connector'       = 'hudi',
+  'path'            = 'file:///tmp/hudi_append_table',
+  'table.type'      = 'COPY_ON_WRITE',
+  'write.operation' = 'insert'
+);
+
+INSERT INTO hudi_append_table VALUES (1, 'Alice', 1695159649, 'sf'), (2, 
'Bob', 1695091554, 'ny');
+```
+
+:::note
+Without a primary key, Hudi uses auto-generated record keys and does **not** 
perform deduplication
+or upsert merging. This is equivalent to `bulk_insert` semantics and is well 
suited for log/event
+ingestion pipelines where every incoming row should be appended as-is.
+If `write.operation` is any value other than `insert` and no `PRIMARY KEY` is 
defined, Hudi will
+throw `"Primary key definition is missing"` at table creation time.
+:::
+
 ### Create Table in Non-Blocking Concurrency Control Mode
 
 The following is an example of creating a Flink table in [Non-Blocking 
Concurrency Control 
mode](concurrency_control.md#non-blocking-concurrency-control).
@@ -967,3 +1008,15 @@ WITH (
 | numeric       |              | not supported |
 | null          |              | not supported |
 | object        |              | not supported |
+
+### AI and Unstructured Data Types
+
+Hudi 1.2.0 introduces two additional column types for AI and unstructured data 
workloads:
+
+- **`VECTOR(dim[, elementType])`** — stores fixed-dimension embedding vectors 
(e.g. `VECTOR(768)`,
+  `VECTOR(768, FLOAT)`, `VECTOR(768, DOUBLE)`). Enables vector similarity 
search via
+  the `hudi_vector_search` TVF. See [Vector Search](vector_search.md) for full 
details.
+
+- **`BLOB`** — stores arbitrary binary objects (images, audio, documents) 
either inline within the
+  base file or as external references. See [BLOB / Unstructured 
Data](blob_unstructured_data.md)
+  for the storage modes, DDL syntax, and read APIs.
diff --git a/website/docs/sql_queries.md b/website/docs/sql_queries.md
index 9310c72fbc62..4b9d6c2dd695 100644
--- a/website/docs/sql_queries.md
+++ b/website/docs/sql_queries.md
@@ -2,7 +2,7 @@
 title: SQL Queries
 summary: "In this page, we go over querying Hudi tables using SQL"
 toc: true
-last_modified_at: 
+last_modified_at: 2026-05-27T00:00:00-00:00
 ---
 import Tabs from '@theme/Tabs';
 import TabItem from '@theme/TabItem';
@@ -14,6 +14,24 @@ This page will show how to issue different queries and 
discuss any specific inst
 ## Spark SQL
 The Spark [quickstart](quick-start-guide.md) provides a good overview of how 
to use Spark SQL to query Hudi tables. This section will go into more advanced 
configurations and functionalities.
 
+:::tip Setting Hudi read options at the session level
+Hudi 1.2.0 supports setting read options at the **Spark session level** using 
the `spark.hoodie.*` prefix.
+Any `spark.hoodie.X` config set via `spark.conf.set` or `--conf` is treated 
equivalently to `hoodie.X`.
+
+Config precedence (low → high):
+1. Global DFS properties
+2. `spark.hoodie.*` session-level configs (normalized to `hoodie.*`)
+3. Explicit `hoodie.*` data source options or per-table `SET` commands
+
+```sql
+-- Apply a Hudi read option for the entire session
+SET spark.hoodie.metadata.column.stats.enable = true;
+SELECT * FROM hudi_table WHERE price BETWEEN 10.0 AND 50.0;
+```
+
+If both `spark.hoodie.X` and `hoodie.X` are set, the explicit `hoodie.X` value 
takes precedence.
+:::
+
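The precedence order in the tip above amounts to layering three maps, lowest priority first. The sketch below models that layering in plain Java. It is an illustration under assumptions, not Hudi's resolution code: `ReadOptionPrecedenceSketch` and `resolve` are hypothetical names, and the three input maps are stand-ins for the global DFS properties, the Spark session conf, and the explicit options.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative model of the documented precedence; not Hudi's actual code. */
final class ReadOptionPrecedenceSketch {

  private static final String SESSION_PREFIX = "spark.hoodie.";

  static Map<String, String> resolve(Map<String, String> globalProps,
                                     Map<String, String> sessionConf,
                                     Map<String, String> explicitOptions) {
    // 1. Lowest priority: global properties.
    Map<String, String> resolved = new LinkedHashMap<>(globalProps);
    // 2. Session-level spark.hoodie.* keys, normalized to hoodie.* by dropping "spark.".
    for (Map.Entry<String, String> entry : sessionConf.entrySet()) {
      if (entry.getKey().startsWith(SESSION_PREFIX)) {
        resolved.put(entry.getKey().substring("spark.".length()), entry.getValue());
      }
    }
    // 3. Highest priority: explicit hoodie.* options.
    resolved.putAll(explicitOptions);
    return resolved;
  }

  public static void main(String[] args) {
    Map<String, String> resolved = resolve(
        Map.of("hoodie.metadata.column.stats.enable", "false"),
        Map.of("spark.hoodie.metadata.column.stats.enable", "true"),
        Map.of());
    System.out.println(resolved);
  }
}
```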
 ### Snapshot Query
 Snapshot queries are the most common query type for Hudi tables. Spark SQL 
supports snapshot queries on both COPY_ON_WRITE and MERGE_ON_READ tables.
 Using session properties, you can specify options around indexing to optimize 
query performance, as shown below.
@@ -332,6 +350,19 @@ also changed to use completion time. To support 
compatiblity, Hudi does a checkp
 time to completion time depending on the source table version.
 :::
 
+### Vector Similarity Search
+
+Hudi 1.2.0 introduces a `hudi_vector_search` table-valued function (TVF) for 
vector
+similarity search over `VECTOR` columns. It follows the same TVF pattern as
+`hudi_table_changes`.
+
+```sql
+-- Find the 10 nearest neighbors to a query vector in the 'embedding' column
+SELECT * FROM hudi_vector_search('db.embeddings_table', 'embedding', 
ARRAY(0.1, 0.2, ...), 10);
+```
+
+See [Vector Search](vector_search.md) for the full API, supported metrics, and 
setup instructions.
+
 ### Query Indexes and Timeline
 
 Hudi also allows users to directly query the metadata partitions and check the 
metadata corresponding to the table
diff --git a/website/docs/syncing_aws_glue_data_catalog.md 
b/website/docs/syncing_aws_glue_data_catalog.md
index 35f43a8af472..4b1e295dc831 100644
--- a/website/docs/syncing_aws_glue_data_catalog.md
+++ b/website/docs/syncing_aws_glue_data_catalog.md
@@ -54,6 +54,10 @@ 
hoodie.datasource.meta.sync.glue.partition_index_fields.enable
 hoodie.datasource.meta.sync.glue.partition_index_fields
 ```
 
+## Writer Version Table Property
+
+Hudi 1.2.0 Glue sync writes the table property `hudi_writer_version` (set to 
the Hudi version that last synced the table) to the Glue Data Catalog entry on 
every sync, consistent with HMS sync behavior.
+
 ## Other references
 
 ### Running AWS Glue Catalog Sync for Spark DataSource
diff --git a/website/docs/syncing_metastore.md 
b/website/docs/syncing_metastore.md
index f260a585c1b6..059f172de003 100644
--- a/website/docs/syncing_metastore.md
+++ b/website/docs/syncing_metastore.md
@@ -297,3 +297,44 @@ While using hive beeline query, you need to enter settings:
 ```bash
 set hive.input.format = 
org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;
 ```
+
+## Spark Catalog Metastore Client
+
+When running Hudi inside a Spark environment that already has Hive support 
enabled (e.g., SparkSQL with `spark.sql.catalogImplementation=hive`), the 
standard `IMetaStoreClient` initialization can conflict with Spark's own Hive 
classloader. Setting
+
+```properties
+hoodie.datasource.hive_sync.use_spark_catalog=true
+```
+
+(default: `false`) makes Hudi use `SparkCatalogMetaStoreClient` — a 
Spark-native `IMetaStoreClient` implementation — instead of creating its own. 
This avoids classloader conflicts in Hive-on-Spark setups. Requires a 
`SparkSession` with Hive support active.
+
+## HMS 4.x Support via JDBC Fallback
+
+HMS 4.x changed several Thrift API method signatures (e.g., `get_table` → 
`get_table_req`), which makes the standard Thrift-based HMS client 
incompatible. Hudi 1.2.0 adds automatic fallback: when a Thrift metadata call 
surfaces a `TApplicationException` anywhere in its cause chain, Hudi flips an 
internal `thriftIncompatible` flag and reroutes the rest of that sync run 
through the JDBC path.
+
+**Requirement:** sync mode must be `jdbc` with a valid JDBC URL so the 
fallback client is available:
+
+```properties
+hoodie.datasource.hive_sync.mode=jdbc
+hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://hiveserver:10000
+hoodie.datasource.hive_sync.username=<username>
+hoodie.datasource.hive_sync.password=<password>
+```
+
+If the table is synced with `mode=hms` or `mode=hiveql` against HMS 4.x, Hudi 
logs `"Thrift API incompatible with HMS but no JDBC fallback available. 
Consider using mode=jdbc with a valid jdbcUrl."` and surfaces the original 
exception — no automatic recovery happens.
+
+**Detection scope.** The flag is per `HoodieHiveSyncClient` instance, not 
global, and only transitions from `false` to `true` (it never resets). In 
practice this means the first Thrift call of each sync run probes once, and the 
rest of that run uses the JDBC fallback. The next sync run starts with a fresh 
probe.
+
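The detection logic described above (scan the cause chain, flip a one-way flag, reroute later calls) can be sketched in isolation. The code below is a simplified model rather than the `HoodieHiveSyncClient` source: the class `ThriftFallbackSketch`, its `call` method, and the nested `TApplicationException` stand-in (used so the sketch compiles without the Thrift library) are assumptions for illustration.

```java
import java.util.function.Supplier;

/** Illustrative model of per-instance Thrift-to-JDBC fallback; not Hudi's code. */
final class ThriftFallbackSketch {

  /** Stand-in for org.apache.thrift.TApplicationException. */
  static class TApplicationException extends RuntimeException {
    TApplicationException(String message) {
      super(message);
    }
  }

  // Per-instance flag that only ever moves from false to true.
  private boolean thriftIncompatible = false;

  /** True when a TApplicationException appears anywhere in the cause chain. */
  static boolean isThriftMismatch(Throwable error) {
    for (Throwable t = error; t != null; t = t.getCause()) {
      if (t instanceof TApplicationException) {
        return true;
      }
    }
    return false;
  }

  /** Tries the Thrift path once; after a mismatch, every call goes to JDBC. */
  <T> T call(Supplier<T> thriftCall, Supplier<T> jdbcCall) {
    if (thriftIncompatible) {
      return jdbcCall.get();
    }
    try {
      return thriftCall.get();
    } catch (RuntimeException e) {
      if (!isThriftMismatch(e)) {
        throw e; // unrelated failure: surface it unchanged
      }
      thriftIncompatible = true;
      return jdbcCall.get();
    }
  }

  boolean isThriftIncompatible() {
    return thriftIncompatible;
  }
}
```

Because the flag lives on the instance, a new object created for the next sync run starts with `thriftIncompatible == false` and probes the Thrift path again.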
+**JDBC connection failures surface separately.** With `mode=jdbc`, Hudi opens 
the JDBC connection eagerly when the sync client is constructed — before any 
Thrift call is attempted. A bad JDBC URL, missing driver, or wrong credentials 
therefore fails at startup with `HoodieHiveSyncException: Failed to create 
HiveMetaStoreClient` and the underlying JDBC exception as the cause in the 
stack trace. This is a configuration-error path, not an HMS API mismatch, and 
is the same behavior as `mode= [...]
+
+## Writer Version Table Property
+
+Hudi 1.2.0 sync writes the table property `hudi_writer_version` (set to the 
Hudi version that last synced the table) to the Hive metastore entry on every 
sync. This allows tooling and metastore administrators to identify which Hudi 
version wrote a given table.
+
+To emit `TOUCH` events to the metastore for partition-level change tracking 
(e.g., for downstream catalog notifications), set:
+
+```properties
+hoodie.meta.sync.touch.partitions.enabled=true
+```
+
+Default is `false`. When enabled, a TOUCH event is issued for each partition 
that was modified in the sync operation.
diff --git a/website/docs/variant_type.md b/website/docs/variant_type.md
index bf42390afe2f..fdcf6f54b54b 100644
--- a/website/docs/variant_type.md
+++ b/website/docs/variant_type.md
@@ -3,7 +3,7 @@ title: "Semi-Structured Data (VARIANT)"
 keywords: [ hudi, variant, semi-structured, json, schemaless, shredding, 
parse_json, flexible schema]
 summary: "Store and query semi-structured JSON-like data in Hudi tables using 
the VARIANT type, with optional shredding for query performance"
 toc: true
-last_modified_at: 2026-04-25T00:00:00-00:00
+last_modified_at: 2026-05-27T00:00:00-00:00
 ---
 
 import Tabs from '@theme/Tabs';
@@ -253,12 +253,25 @@ binary `value` field.
 
 | Engine | VARIANT Support |
 |:-------|:---------------|
-| **Spark 4.0+** | Native `VariantType` — full read/write/query |
+| **Spark 4.0** | Native `VariantType` — full read/write/query for COW and 
MOR; native `df.write` with `VariantType` on the V1 datasource |
+| **Spark 4.1** | Native `VariantType` — full read/write/query for COW and MOR 
|
 | **Spark 3.x** | Reads as `STRUCT<value: BINARY, metadata: BINARY>` — 
backward compatible |
-| **Flink** | Reads as `ROW<metadata BYTES, value BYTES>` — cross-engine 
compatible |
+| **Flink** | Native `VARIANT` operations are not supported. Tables written by 
Spark with VARIANT columns can be read in Flink only as the underlying 
`ROW<metadata BYTES, value BYTES>` struct. |
 
-A VARIANT table written by Spark 4.0 can be read by Spark 3.x or Flink, and 
vice versa. The
-binary encoding is engine-independent.
+A VARIANT table written by Spark 4.0/4.1 can be read by Spark 3.x using the 
underlying binary struct, or by Flink as `ROW<metadata BYTES, value BYTES>`. 
The binary encoding is engine-independent.
+
+## Metastore Sync
+
+When syncing VARIANT column schemas to external catalogs, Hudi maps the binary 
encoding to the
+target catalog's native struct type:
+
+| Catalog | VARIANT representation |
+|:--------|:----------------------|
+| Hive | `STRUCT<metadata:BINARY, value:BINARY>` |
+| BigQuery | `STRUCT` with `metadata` and `value` fields (`BYTES` type) |
+
+Query engines that support VARIANT (Spark 4.0+, Flink 2.1+) read the table 
directly using the
+Parquet VARIANT annotation and do not go through the Hive/BigQuery metastore 
representation.
 
 ## Use Cases for AI Workloads
 
@@ -344,3 +357,6 @@ CREATE TABLE api_responses (
 - Native `VARIANT` keyword in DDL requires Spark 4.0+. On Spark 3.x, use the 
struct representation.
 - VARIANT shredding configuration is determined at write time based on the 
schema definition.
 - Complex path expressions within VARIANT may require casting to STRING and 
then using JSON functions.
+- Native VARIANT operations are not supported on Flink. VARIANT columns 
surface as `ROW<metadata BYTES, value BYTES>` and can be read but not natively 
decoded or queried as a variant.
+- VARIANT columns are **not supported** on Lance-backed tables. Use Parquet as 
the base file format
+  for tables containing VARIANT columns.
diff --git a/website/docs/vector_search.md b/website/docs/vector_search.md
index 9c53443161a7..4e6687057a6c 100644
--- a/website/docs/vector_search.md
+++ b/website/docs/vector_search.md
@@ -1,9 +1,9 @@
 ---
 title: "Vector Search"
-keywords: [ hudi, vector, search, embeddings, similarity, cosine, ANN, nearest 
neighbor, VECTOR type]
-summary: "Store embedding vectors in Hudi tables and run approximate nearest 
neighbor search using the VECTOR type and hudi_vector_search TVF"
+keywords: [ hudi, vector, search, embeddings, similarity, cosine, nearest 
neighbor, VECTOR type]
+summary: "Store embedding vectors in Hudi tables and run vector similarity 
search using the VECTOR type and hudi_vector_search TVF"
 toc: true
-last_modified_at: 2026-04-25T00:00:00-00:00
+last_modified_at: 2026-05-27T00:00:00-00:00
 ---
 
 import Tabs from '@theme/Tabs';
@@ -13,6 +13,20 @@ Hudi's `VECTOR` type and `hudi_vector_search` table-valued 
function (TVF) bring
 to the data lakehouse. Store embeddings alongside your structured data and 
query them with familiar Spark SQL —
 no external vector database required.
 
+## Storage Format
+
+VECTOR columns are stored in Parquet as `FIXED_LEN_BYTE_ARRAY` — a 
fixed-length binary encoding of the
+float array. Hudi stamps `hudi_type` metadata on the column so the Spark 
reader knows to decode the
+bytes back into a typed array.
+
+On **Lance** tables, VECTOR columns are stored natively as Lance 
`FixedSizeList<Float32/Float64, dim>`,
+so embeddings are written without conversion overhead at the file-format 
layer. See
+[Lance File Format](lance_file_format.md) for details.
+
+The `VECTOR(dim[, elementType])` DDL syntax works across Spark 3.4, 3.5, 4.0, 
and 4.1. Hudi's SQL
+parser normalizes `VECTOR(128, FLOAT)` to `VECTOR(128)` (FLOAT is the default 
element type).
+Nesting VECTOR inside STRUCT, ARRAY, or MAP is not supported.
+
 ## VECTOR Type
 
 The `VECTOR(dim[, elementType])` type declares a column that stores 
fixed-dimensional embedding vectors.
@@ -98,8 +112,8 @@ INSERT INTO products VALUES (
 
 ## hudi_vector_search TVF
 
-The `hudi_vector_search` table-valued function performs approximate nearest 
neighbor (ANN) search
-over a VECTOR column.
+The `hudi_vector_search` table-valued function returns the `top_k` rows from a 
Hudi table whose
+VECTOR column is closest to a given query vector under a chosen distance 
metric.
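
Conceptually, a top-k query of this kind ranks rows by their distance to the query vector and keeps the closest `k`. The sketch below shows a brute-force version with cosine distance in plain Java. It only illustrates the semantics and is not how Hudi or Spark executes the TVF: `TopKCosineSketch` and its methods are hypothetical names, and cosine is just one possible metric.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Brute-force top-k by cosine distance; illustrative only. */
final class TopKCosineSketch {

  /** Cosine distance = 1 - cosine similarity; assumes non-zero vectors of equal length. */
  static double cosineDistance(float[] a, float[] b) {
    double dot = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  /** Returns the indexes of the k corpus rows closest to the query, nearest first. */
  static List<Integer> topK(float[][] corpus, float[] query, int k) {
    List<Integer> indexes = new ArrayList<>();
    for (int i = 0; i < corpus.length; i++) {
      indexes.add(i);
    }
    indexes.sort(Comparator.comparingDouble((Integer i) -> cosineDistance(corpus[i], query)));
    return new ArrayList<>(indexes.subList(0, Math.min(k, indexes.size())));
  }

  public static void main(String[] args) {
    float[][] corpus = {{1f, 0f}, {0f, 1f}, {1f, 1f}};
    System.out.println(topK(corpus, new float[] {1f, 0f}, 2));
  }
}
```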
 
 ### Syntax
 
@@ -240,3 +254,21 @@ FROM hudi_vector_search(
 - VECTOR columns must be **top-level fields** — nesting inside STRUCT, ARRAY, 
or MAP is not supported.
 - The query vector's element type must **exactly match** the corpus 
embedding's element type (no implicit casting).
 - VECTOR dimension and element type **cannot be changed** after table creation 
via schema evolution.
+- **Flink cannot read VECTOR columns.** VECTOR data is stored as Parquet 
`FIXED_LEN_BYTE_ARRAY`, which
+  Flink's Parquet reader does not decode back into a typed array. Flink can 
still read all **other**
+  columns in a table that contains a VECTOR column — only the VECTOR column 
itself is inaccessible.
+  Use Spark to query VECTOR columns.
+
+## Metastore Sync
+
+When syncing VECTOR column schemas to external catalogs, Hudi maps the binary 
encoding to the
+target catalog's native binary type, preserving the original VECTOR metadata 
in table properties:
+
+| Catalog | VECTOR representation |
+|:--------|:---------------------|
+| Hive | `BINARY` |
+| BigQuery | `BYTES` |
+
+The `VECTOR(dim, elementType)` dimension and element-type metadata is 
preserved in
+`TBLPROPERTIES`/table descriptions so the table can be correctly reconstructed 
by Spark even after
+a metastore round-trip.
diff --git a/website/docs/writing_data.md b/website/docs/writing_data.md
index 5e4570182692..dcdc5a33bd81 100644
--- a/website/docs/writing_data.md
+++ b/website/docs/writing_data.md
@@ -1,7 +1,7 @@
 ---
 title: Batch Writes
 keywords: [hudi, incremental, batch, processing]
-last_modified_at: 2024-03-13T15:59:57-04:00
+last_modified_at: 2026-05-27T00:00:00-00:00
 ---
 import Tabs from '@theme/Tabs';
 import TabItem from '@theme/TabItem';
@@ -441,5 +441,32 @@ inputDF.write.format("hudi")
        .save(basePath)
 ```
 
+### Rolling Extra Metadata
+
+Rolling extra metadata allows you to automatically carry forward selected 
commit metadata keys to every subsequent commit and clean instant without 
having to walk the full timeline. This is particularly useful for persisting 
checkpoint information such as Kafka offsets or Flink checkpoints across 
commits.
+
+| Config | Default | Description |
+|---|---|---|
+| `hoodie.write.rolling.metadata.keys` | `""` (disabled) | Comma-separated 
list of extra metadata keys to carry forward to each new commit and clean 
instant. Values are read from recent completed instants and written into the 
new commit metadata, so they remain accessible without walking the timeline. 
New values override old ones. Only applies to data table commits and clean 
instants. |
+| `hoodie.write.rolling.metadata.timeline.lookback.commits` | `10` | Maximum 
number of completed instants to walk back when searching for the configured 
rolling metadata keys. Higher values improve resilience at a small performance 
cost. |
+
+**Example:**
+
+```java
+inputDF.write.format("hudi")
+       .option("hoodie.write.rolling.metadata.keys", 
"kafka.offset.partition.0,kafka.offset.partition.1")
+       .option("hoodie.write.rolling.metadata.timeline.lookback.commits", "10")
+       // ... other options
+       .save(basePath)
+```
+
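The carry-forward behavior described by the two configs can be modeled with a short standalone sketch. It is not Hudi's implementation: `RollingMetadataSketch` and `rollForward` are hypothetical names, and the timeline is reduced to a newest-first list of metadata maps purely for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative model of rolling extra metadata; not Hudi's code. */
final class RollingMetadataSketch {

  /**
   * @param rollingKeys          keys configured to be carried forward
   * @param completedNewestFirst extra metadata of completed instants, newest first
   * @param lookbackCommits      maximum number of instants to inspect
   * @param newCommitMetadata    extra metadata supplied by the new commit itself
   */
  static Map<String, String> rollForward(Set<String> rollingKeys,
                                         List<Map<String, String>> completedNewestFirst,
                                         int lookbackCommits,
                                         Map<String, String> newCommitMetadata) {
    Map<String, String> result = new LinkedHashMap<>();
    int limit = Math.min(lookbackCommits, completedNewestFirst.size());
    for (String key : rollingKeys) {
      // Take the most recent value found within the lookback window.
      for (int i = 0; i < limit; i++) {
        String value = completedNewestFirst.get(i).get(key);
        if (value != null) {
          result.put(key, value);
          break;
        }
      }
    }
    // Values written by the new commit override carried-forward ones.
    result.putAll(newCommitMetadata);
    return result;
  }
}
```

A key that does not appear within the lookback window is simply absent from the result, which is the trade-off the lookback config controls.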
+### Advanced Storage Options
+
+The following advanced storage configuration options were added in Hudi 1.2.0:
+
+| Config | Default | Description |
+|---|---|---|
+| `hoodie.parquet.write.config.injector.class` | (none) | Fully-qualified 
class name of a custom `HoodieParquetConfigInjector` implementation. Use this 
to inject custom Parquet writer properties (e.g., disable dictionary encoding, 
set bloom filter sizes) without modifying the Hudi source. The implementing 
class must implement `org.apache.hudi.io.HoodieParquetConfigInjector`. |
+
 ## Java Client
 We can use plain java to write to hudi tables. To use Java client we can 
refere 
[here](https://github.com/apache/hudi/blob/master/hudi-examples/hudi-examples-java/src/main/java/org/apache/hudi/examples/java/HoodieJavaWriteClientExample.java)
