This is an automated email from the ASF dual-hosted git repository.
lzljs3620320 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/paimon.git
The following commit(s) were added to refs/heads/master by this push:
new e3ccae467d [doc] Update data evolution doc
e3ccae467d is described below
commit e3ccae467d0cd95873d8df2bc22dafebc22334fa
Author: JingsongLi <[email protected]>
AuthorDate: Thu Aug 21 15:53:36 2025 +0800
[doc] Update data evolution doc
---
docs/content/append-table/data-evolution.md | 72 +++++++++++++++++++++++++++++
docs/content/append-table/row-tracking.md | 60 +++---------------------
2 files changed, 78 insertions(+), 54 deletions(-)
diff --git a/docs/content/append-table/data-evolution.md b/docs/content/append-table/data-evolution.md
new file mode 100644
index 0000000000..66c8d90aea
--- /dev/null
+++ b/docs/content/append-table/data-evolution.md
@@ -0,0 +1,72 @@
+---
+title: "Data Evolution"
+weight: 6
+type: docs
+aliases:
+- /append-table/data-evolution.html
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Data Evolution
+
+Paimon supports complete schema evolution, allowing you to freely add, modify,
+or delete columns. But how do you backfill newly added columns or update
+existing column data?
+
+Data evolution mode is a new feature for append tables that changes how you
+handle evolving data, particularly when adding new columns. This mode allows
+you to update a subset of columns without rewriting entire data files. Instead,
+it writes the new column data to separate files and intelligently merges them
+with the original data during read operations.
+
+The data evolution mode offers significant advantages for your data lake
+architecture:
+
+* Efficient Partial Column Updates: With this mode, you can use Spark's MERGE
+INTO statement to update a subset of columns. This avoids the high I/O cost of
+rewriting the whole file, as only the updated columns are written.
+
+* Reduced File Rewrites: In scenarios with frequent schema changes, such as
+adding new columns, the traditional method requires constant file rewriting.
+Data evolution mode eliminates this overhead by appending new column data to
+dedicated files. This approach is much more efficient and reduces the burden on
+your storage system.
+
+* Optimized Read Performance: The new mode is designed for seamless data
+retrieval. During query execution, Paimon's engine efficiently combines the
+original data with the new column data, ensuring that read performance remains
+uncompromised. The merge process is highly optimized, so your queries run just
+as fast as they would on a single, consolidated file.
+
+To enable data evolution, set both the `row-tracking.enabled` and
+`data-evolution.enabled` properties to `true` when creating an append table.
+This ensures that the table is ready for efficient schema evolution operations.
+
+Using Spark SQL as an example:
+
+```sql
+CREATE TABLE target (id INT, b INT, c STRING) TBLPROPERTIES (
+ 'row-tracking.enabled' = 'true',
+ 'data-evolution.enabled' = 'true'
+)
+```
+
+Currently, only Spark's `MERGE INTO` statement is supported for updating a
+subset of columns.
+
+```sql
+MERGE INTO target
+USING source
+ON target.id = source.id
+WHEN MATCHED THEN UPDATE SET target.b = source.b
+WHEN NOT MATCHED THEN INSERT (id, b, c) VALUES (source.id, source.b, '11')
+```
+
+This statement updates only the `b` column in the target table `target` based
+on matching records from the source table `source`. The `id` and `c` columns
+remain unchanged, and unmatched source records are inserted with the specified
+values.
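+
+To make the effect concrete, here is a minimal sketch with hypothetical rows
+(the `source` table and its contents are introduced here only for illustration;
+run the `MERGE INTO` above after this setup):
+
+```sql
+-- Hypothetical source table carrying the new values for column b
+CREATE TABLE source (id INT, b INT);
+INSERT INTO target VALUES (1, 10, 'x'), (2, 20, 'y');
+INSERT INTO source VALUES (1, 100), (3, 300);
+
+-- After running the MERGE INTO statement above:
+SELECT * FROM target ORDER BY id;
+-- 1, 100, 'x'   matched: only b is updated, written as new column data
+-- 2, 20,  'y'   no match in source: row unchanged
+-- 3, 300, '11'  not matched: inserted with the literal '11' for c
+```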
+
+Note that:
+* Data evolution tables do not support the `DELETE` statement yet.
+* `MERGE INTO` on data evolution tables does not support the `WHEN NOT MATCHED
+BY SOURCE` clause.
+* Only Spark versions above 3.5.0 are supported for data evolution tables.
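+
+The same mechanism answers the backfill question at the top of this page: add
+a column, then populate it for existing rows with `MERGE INTO`. A minimal
+sketch (the column `d` and the `backfill` table are hypothetical):
+
+```sql
+-- Add a new column; existing data files are not rewritten
+ALTER TABLE target ADD COLUMNS (d INT);
+
+-- Hypothetical table holding the backfill values, keyed by id
+CREATE TABLE backfill (id INT, d INT);
+
+-- Only the new column data is written, to separate files
+MERGE INTO target
+USING backfill
+ON target.id = backfill.id
+WHEN MATCHED THEN UPDATE SET target.d = backfill.d;
+```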
diff --git a/docs/content/append-table/row-tracking.md b/docs/content/append-table/row-tracking.md
index d8d4022155..04aec8bc58 100644
--- a/docs/content/append-table/row-tracking.md
+++ b/docs/content/append-table/row-tracking.md
@@ -24,15 +24,17 @@ specific language governing permissions and limitations
under the License.
-->
-# Use row tracking for Paimon Tables
-
-## What is row tracking
+# Row tracking
Row tracking allows Paimon to track row-level lineage in a Paimon append
table. Once enabled on a table, two more hidden columns are added to the table
schema:
- `_ROW_ID`: BIGINT, a unique identifier for each row in the table. It is used
to track the lineage of the row and can identify the row in case of an update,
merge into, or delete.
- `_SEQUENCE_NUMBER`: BIGINT, indicates which `version` of the record this is.
It is actually the snapshot-id of the snapshot that this row belongs to, and it
is used to track the lineage of the row version.
-## Enable row tracking
+Hidden columns follow these rules:
+- Whenever we read from a table with row tracking enabled, `_ROW_ID` and
+`_SEQUENCE_NUMBER` are `NOT NULL`.
+- When records are first appended to a row-tracking table, these values are not
+actually written to the data file; they are lazily assigned by the committer.
+- If a row is moved from one file to another for **any reason**, the `_ROW_ID`
+column must be copied to the target file. The `_SEQUENCE_NUMBER` field must be
+set to `NULL` if the record has changed; otherwise, it is copied as well.
+- Whenever we read from a row-tracking table, we first read `_ROW_ID` and
+`_SEQUENCE_NUMBER` from the data file, then the value columns. If they are
+`NULL`, we fall back to the lazily assigned values stored in `DataFileMeta`, so
+a reader never observes `NULL`.
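+
+For example, the hidden columns are typically not returned by a plain
+`SELECT *`; query them explicitly (a minimal sketch, assuming the row-tracking
+table `t` created below):
+
+```sql
+-- Hidden columns must be named explicitly in the select list
+SELECT id, data, _ROW_ID, _SEQUENCE_NUMBER FROM t;
+```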
To enable row-tracking, set `row-tracking.enabled` to `true` in the table
options when creating an append table.
Consider an example via Flink SQL:
@@ -49,8 +51,6 @@ Notice that:
- Only Spark supports update, merge into, and delete operations on row-tracking
tables; Flink SQL does not support these operations yet.
- This feature is experimental; this line will be removed once it is stable.
-## How to use row tracking
-
After creating a row-tracking table, you can insert data into it as usual. The
`_ROW_ID` and `_SEQUENCE_NUMBER` columns will be automatically managed by
Paimon.
```sql
CREATE TABLE t (id INT, data STRING) TBLPROPERTIES ('row-tracking.enabled' = 'true');
@@ -121,51 +121,3 @@ You will get:
| 33| c| 2| 3|
+---+---------------+-------+----------------+
```
-
-## Spec
-
-`_ROW_ID` and `_SEQUENCE_NUMBER` fields follows the following rules:
-- Whenever we read from one table with row tracking enabled, the `_ROW_ID` and
`_SEQUENCE_NUMBER` will be `NOT NULL`.
-- If we append records to row-tracking table in the first time, we don't
actually write them to the data file, they are lazy assigned by committer.
-- If one row moved from one file to another file for **any reason**, the
`_ROW_ID` column should be copied to the target file. The `_SEQUENCE_NUMBER`
field should be set to `NULL` if the record is changed, otherwise, copy it too.
-- Whenever we read from a row-tracking table, we firstly read `_ROW_ID` and
`_SEQUENCE_NUMBER` from the data file, then we read the value columns from the
data file. If they found `NULL`, we read from `DataFileMeta` to fall back to
the lazy assigned values. Anyway, it has no way to be `NULL`.
-
-# Data Evolution Mode
-
-## What is data evolution mode
-Data Evolution Mode is a new feature for Paimon's append tables that
revolutionizes how you handle schema evolution, particularly when adding new
columns.
-This mode allows you to update partial columns without rewriting entire data
files.
-Instead, it writes new column data to separate files and intelligently merges
them with the original data during read operations.
-
-
-## Key Features and Benefits
-The data evolution mode offers significant advantages for your data lake
architecture:
-
-* Efficient Partial Column Updates: With this mode, you can use Spark's MERGE
INTO statement to update a subset of columns. This avoids the high I/O cost of
rewriting the whole file, as only the updated columns are written.
-
-* Reduced File Rewrites: In scenarios with frequent schema changes, such as
adding new columns, the traditional method requires constant file rewriting.
Data evolution mode eliminates this overhead by appending new column data to
dedicated files. This approach is much more efficient and reduces the burden on
your storage system.
-
-* Optimized Read Performance: The new mode is designed for seamless data
retrieval. During query execution, Paimon's engine efficiently combines the
original data with the new column data, ensuring that read performance remains
uncompromised. The merge process is highly optimized, so your queries run just
as fast as they would on a single, consolidated file.
-
-## Enabling Data Evolution Mode
-To enable data evolution, you must enable row-tracking and set the
`data-evolution.enabled` property to `true` when creating an append table. This
ensures that the table is ready for efficient schema evolution operations.
-Use Spark Sql as an example:
-```sql
-CREATE TABLE target (a INT, b INT, c STRING) TBLPROPERTIES
('row-tracking.enabled' = 'true', 'data-evolution.enabled' = 'true')
-```
-
-## Partially update columns
-Now we could only support spark 'MERGE INTO' statement to update partial
columns.
-```sql
-MERGE INTO t
-USING s
-ON t.id = s.id
-WHEN MATCHED THEN UPDATE SET t.b = s.b
-WHEN NOT MATCHED THEN INSERT (id, b, c) VALUES (id, b, 11)
-```
-This statement updates only the `b` column in the target table `t` based on
the matching records from the source table `s`. The `id` column and `c` column
remain unchanged, and new records are inserted with the specified values.
-
-Note that:
-* Data Evolution Table does not support 'Delete' statement yet
-* Merge Into for Data Evolution Table does not support 'WHEN NOT MATCHED BY
SOURCE' clause
-* Only Spark version greater than 3.5.0 is supported for Data Evolution Table
\ No newline at end of file