This is an automated email from the ASF dual-hosted git repository.

lzljs3620320 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/paimon.git


The following commit(s) were added to refs/heads/master by this push:
     new 42a6d33a04 [doc][spark] Split out the Structured Streaming page (#6230)
42a6d33a04 is described below

commit 42a6d33a0496208b2f5fed0211989315a466b057
Author: Zouxxyy <[email protected]>
AuthorDate: Wed Sep 10 11:44:11 2025 +0800

    [doc][spark] Split out the Structured Streaming page (#6230)
---
 docs/README.md                                     |  26 +--
 docs/content/spark/sql-query.md                    | 174 ---------------------
 docs/content/spark/sql-write.md                    |  34 +---
 .../{sql-query.md => structured-streaming.md}      | 164 +++----------------
 docs/setup_hugo.sh                                 |  42 +++++
 5 files changed, 86 insertions(+), 354 deletions(-)

diff --git a/docs/README.md b/docs/README.md
index fb96248516..bb1d19b6e4 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -6,24 +6,28 @@ that you always have docs corresponding to your checked-out version.
 
 # Requirements
 
-### Build the site locally
+### Install Hugo
 
 Make sure you have installed Hugo on your system.
-Note: An extended version of Hugo <= 0.124.1 is required. you can Find this at [Hugo](https://github.com/gohugoio/hugo/releases/tag/v0.124.1)
+Note: An extended version of Hugo <= [0.124.1](https://github.com/gohugoio/hugo/releases/tag/v0.124.1) is required.
+
+From this directory:
+
 ```sh
-go install -tags extended,withdeploy github.com/gohugoio/[email protected]
+# Install Hugo
+./setup_hugo.sh
+
+# Fetch the theme submodule
+git submodule update --init --recursive
 ```
 
+### Start Hugo local server
+
 From this directory:
 
-  * Fetch the theme submodule
-       ```sh
-       git submodule update --init --recursive
-       ```
-  * Start local server
-       ```sh
-       hugo -b "" serve
-       ```
+```sh
+hugo -b "" serve
+```
 
 The site can be viewed at http://localhost:1313/
 
diff --git a/docs/content/spark/sql-query.md b/docs/content/spark/sql-query.md
index 62a048bd9a..4217ccc656 100644
--- a/docs/content/spark/sql-query.md
+++ b/docs/content/spark/sql-query.md
@@ -117,180 +117,6 @@ SELECT * FROM paimon_incremental_to_auto_tag('tableName', '2024-12-04');
 In Batch SQL, the `DELETE` records are not allowed to be returned, so records of `-D` will be dropped.
 If you want to see `DELETE` records, you can query the audit_log table.
 
-## Streaming Query
-
-{{< hint info >}}
-
-Paimon currently supports Spark 3.3+ for streaming read.
-
-{{< /hint >}}
-
-Paimon supports a rich set of scan modes for streaming read, listed below:
-<table class="configuration table table-bordered">
-    <thead>
-        <tr>
-            <th class="text-left" style="width: 20%">Scan Mode</th>
-            <th class="text-left" style="width: 60%">Description</th>
-        </tr>
-    </thead>
-    <tbody>
-        <tr>
-            <td><h5>latest</h5></td>
-            <td>For streaming sources, continuously reads the latest changes without producing a snapshot at the beginning.</td>
-        </tr>
-        <tr>
-            <td><h5>latest-full</h5></td>
-            <td>For streaming sources, produces the latest snapshot on the table upon first startup, and continues to read the latest changes.</td>
-        </tr>
-        <tr>
-            <td><h5>from-timestamp</h5></td>
-            <td>For streaming sources, continuously reads changes starting from the timestamp specified by "scan.timestamp-millis", without producing a snapshot at the beginning.</td>
-        </tr>
-        <tr>
-            <td><h5>from-snapshot</h5></td>
-            <td>For streaming sources, continuously reads changes starting from the snapshot specified by "scan.snapshot-id", without producing a snapshot at the beginning.</td>
-        </tr>
-        <tr>
-            <td><h5>from-snapshot-full</h5></td>
-            <td>For streaming sources, produces from snapshot specified by 
"scan.snapshot-id" on the table upon first startup, and continuously reads 
changes.</td>
-        </tr>
-        <tr>
-            <td><h5>default</h5></td>
-            <td>It is equivalent to from-snapshot if "scan.snapshot-id" is 
specified. It is equivalent to from-timestamp if "timestamp-millis" is 
specified. Or, It is equivalent to latest-full.</td>
-        </tr>
-    </tbody>
-</table>
-
-A simple example with the default scan mode:
-
-```scala
-// If no scan-related configs are provided, the latest-full scan mode is used.
-val query = spark.readStream
-  .format("paimon")
-  // by table name
-  .table("table_name") 
-  // or by location
-  // .load("/path/to/paimon/source/table")
-  .writeStream
-  .format("console")
-  .start()
-```
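
For a non-default mode, a minimal sketch (assuming `scan.mode` and `scan.snapshot-id`, the options named in the table above, can be passed as read options like any other):

```scala
// Sketch: continuously read changes starting from snapshot 3 (from-snapshot mode).
// Assumes scan.mode and scan.snapshot-id are accepted as read options here.
val query = spark.readStream
  .format("paimon")
  .option("scan.mode", "from-snapshot")
  .option("scan.snapshot-id", "3")
  .table("table_name")
  .writeStream
  .format("console")
  .start()
```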
-
-Paimon Structured Streaming also supports a variety of streaming read modes: it supports many triggers and many read limits.
-
-These read limits are supported:
-
-<table class="configuration table table-bordered">
-    <thead>
-        <tr>
-            <th class="text-left" style="width: 20%">Key</th>
-            <th class="text-left" style="width: 15%">Default</th>
-            <th class="text-left" style="width: 10%">Type</th>
-            <th class="text-left" style="width: 55%">Description</th>
-        </tr>
-    </thead>
-    <tbody>
-        <tr>
-            <td><h5>read.stream.maxFilesPerTrigger</h5></td>
-            <td style="word-wrap: break-word;">(none)</td>
-            <td>Integer</td>
-            <td>The maximum number of files returned in a single batch.</td>
-        </tr>
-        <tr>
-            <td><h5>read.stream.maxBytesPerTrigger</h5></td>
-            <td style="word-wrap: break-word;">(none)</td>
-            <td>Long</td>
-            <td>The maximum number of bytes returned in a single batch.</td>
-        </tr>
-        <tr>
-            <td><h5>read.stream.maxRowsPerTrigger</h5></td>
-            <td style="word-wrap: break-word;">(none)</td>
-            <td>Long</td>
-            <td>The maximum number of rows returned in a single batch.</td>
-        </tr>
-        <tr>
-            <td><h5>read.stream.minRowsPerTrigger</h5></td>
-            <td style="word-wrap: break-word;">(none)</td>
-            <td>Long</td>
-            <td>The minimum number of rows returned in a single batch, which is used together with read.stream.maxTriggerDelayMs to create MinRowsReadLimit.</td>
-        </tr>
-        <tr>
-            <td><h5>read.stream.maxTriggerDelayMs</h5></td>
-            <td style="word-wrap: break-word;">(none)</td>
-            <td>Long</td>
-            <td>The maximum delay between two adjacent batches, which is used together with read.stream.minRowsPerTrigger to create MinRowsReadLimit.</td>
-        </tr>
-    </tbody>
-</table>
-
-**Example: One**
-
-Use `org.apache.spark.sql.streaming.Trigger.AvailableNow()` and `maxBytesPerTrigger` defined by Paimon.
-
-```scala
-// Trigger.AvailableNow() processes all available data at the start
-// of the query in one or multiple batches, then terminates the query.
-// Setting read.stream.maxBytesPerTrigger to 128M means that each
-// batch processes a maximum of 128 MB of data.
-val query = spark.readStream
-  .format("paimon")
-  .option("read.stream.maxBytesPerTrigger", "134217728")
-  .table("table_name")
-  .writeStream
-  .format("console")
-  .trigger(Trigger.AvailableNow())
-  .start()
-```
-
-**Example: Two**
-
-Use `org.apache.spark.sql.connector.read.streaming.ReadMinRows`.
-
-```scala
-// A batch will not be triggered until there are more than 5,000 new rows,
-// unless the interval since the last batch exceeds 300 seconds.
-val query = spark.readStream
-  .format("paimon")
-  .option("read.stream.minRowsPerTrigger", "5000")
-  .option("read.stream.maxTriggerDelayMs", "300000")
-  .table("table_name")
-  .writeStream
-  .format("console")
-  .start()
-```
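
A similar sketch for the row-count limit (assuming `read.stream.maxRowsPerTrigger` from the table above is set the same way, combined with a standard processing-time trigger):

```scala
// Sketch: cap each micro-batch at 10,000 rows and poll every 10 seconds.
val query = spark.readStream
  .format("paimon")
  .option("read.stream.maxRowsPerTrigger", "10000")
  .table("table_name")
  .writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()
```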
-
-Paimon Structured Streaming supports reading rows in the form of a changelog (a rowkind column is added to each row to represent its change type) in two ways:
-
-- Direct streaming read with the system audit_log table
-- Set `read.changelog` to true (default is false), then streaming read with the table location
-
-**Example:**
-
-```scala
-// Option 1
-val query1 = spark.readStream
-  .format("paimon")
-  .table("`table_name$audit_log`")
-  .writeStream
-  .format("console")
-  .start()
-
-// Option 2
-val query2 = spark.readStream
-  .format("paimon")
-  .option("read.changelog", "true")
-  .table("table_name")
-  .writeStream
-  .format("console")
-  .start()
-
-/*
-+I   1  Hi
-+I   2  Hello
-*/
-```
-
 ## Query Optimization
 
 It is highly recommended to specify partition and primary key filters
diff --git a/docs/content/spark/sql-write.md b/docs/content/spark/sql-write.md
index fb01b22058..fa88446300 100644
--- a/docs/content/spark/sql-write.md
+++ b/docs/content/spark/sql-write.md
@@ -223,35 +223,7 @@ WHEN NOT MATCHED THEN
 INSERT *      -- when not matched, insert this row without any transformation;
 ```
 
-## Streaming Write
-
-{{< hint info >}}
-
-Paimon Structured Streaming only supports the `append` and `complete` output modes.
-
-{{< /hint >}}
-
-```scala
-// Create a paimon table if not exists.
-spark.sql(s"""
-           |CREATE TABLE T (k INT, v STRING)
-           |TBLPROPERTIES ('primary-key'='k', 'bucket'='3')
-           |""".stripMargin)
-
-// Here we use MemoryStream to fake a streaming source.
-val inputData = MemoryStream[(Int, String)]
-val df = inputData.toDS().toDF("k", "v")
-
-// Streaming Write to paimon table.
-val stream = df
-  .writeStream
-  .outputMode("append")
-  .option("checkpointLocation", "/path/to/checkpoint")
-  .format("paimon")
-  .start("/path/to/paimon/sink/table")
-```
-
-## Schema Evolution
+## Write Merge Schema
 
 {{< hint info >}}
 
@@ -259,7 +231,7 @@ Since the table schema may be updated during writing, catalog caching needs to b
 
 {{< /hint >}}
 
-Schema evolution is a feature that allows users to easily modify the current schema of a table to adapt to existing data, or new data that changes over time, while maintaining data integrity and consistency.
+Write merge schema is a feature that allows users to easily modify the current schema of a table to adapt to existing data, or new data that changes over time, while maintaining data integrity and consistency.
 
 Paimon supports automatic schema merging of source data and current table data while data is being written, and uses the merged schema as the latest schema of the table, and it only requires configuring `write.merge-schema`.
 
@@ -277,7 +249,7 @@ When enable `write.merge-schema`, Paimon can allow users to perform the followin
 
 Paimon also supports explicit type conversions between certain types (e.g. String -> Date, Long -> Int); this requires the explicit configuration `write.merge-schema.explicit-cast`.
 
-Schema evolution can be used in streaming mode at the same time.
+Write merge schema can also be used in streaming mode.
 
 ```scala
 val inputData = MemoryStream[(Int, String)]
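
A minimal sketch of that streaming case (assuming `write.merge-schema` and `write.merge-schema.explicit-cast` are passed as write options on the MemoryStream-backed DataFrame `df` from the examples above; the paths are placeholders):

```scala
// Sketch: streaming write with schema merging and explicit casts enabled.
val stream = df
  .writeStream
  .outputMode("append")
  .option("checkpointLocation", "/path/to/checkpoint")
  .option("write.merge-schema", "true")
  .option("write.merge-schema.explicit-cast", "true")
  .format("paimon")
  .start("/path/to/paimon/sink/table")
```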
diff --git a/docs/content/spark/sql-query.md b/docs/content/spark/structured-streaming.md
similarity index 57%
copy from docs/content/spark/sql-query.md
copy to docs/content/spark/structured-streaming.md
index 62a048bd9a..8b7d0495b9 100644
--- a/docs/content/spark/sql-query.md
+++ b/docs/content/spark/structured-streaming.md
@@ -1,9 +1,9 @@
 ---
-title: "SQL Query"
-weight: 4
+title: "Structured Streaming"
+weight: 11
 type: docs
 aliases:
-- /spark/sql-query.html
+- /spark/structured-streaming.html
 ---
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
@@ -24,98 +24,39 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# SQL Query
+# Structured Streaming
 
-Just like all other tables, Paimon tables can be queried with `SELECT` statement.
+Paimon supports streaming data processing with [Spark Structured Streaming](https://spark.apache.org/docs/latest/streaming/index.html), enabling both streaming write and streaming query.
 
-## Batch Query
-
-Paimon's batch read returns all the data in a snapshot of the table. By default, batch reads return the latest snapshot.
-
-```sql
--- read all columns
-SELECT * FROM t;
-```
-
-Paimon also supports reading some hidden metadata columns, currently supporting the following columns:
-
-- `__paimon_file_path`: the file path of the record.
-- `__paimon_partition`: the partition of the record.
-- `__paimon_bucket`: the bucket of the record.
-- `__paimon_row_index`: the row index of the record.
-- `_ROW_ID`: the unique row id of the record (valid only when `row-tracking.enabled` is set to true).
-- `_SEQUENCE_NUMBER`: the sequence number of the record (valid only when `row-tracking.enabled` is set to true).
-
-```sql
--- read all columns and the corresponding file path, partition, bucket, rowIndex of the record
-SELECT *, __paimon_file_path, __paimon_partition, __paimon_bucket, __paimon_row_index FROM t;
-```
+## Streaming Write
 
 {{< hint info >}}
-Note: only append tables or deletion vector tables support querying metadata columns.
-{{< /hint >}}
-
-### Batch Time Travel
-
-Paimon batch reads with time travel can specify a snapshot or a tag and read the corresponding data.
 
-Requires Spark 3.3+.
+Paimon Structured Streaming only supports the `append` and `complete` output modes.
 
-You can use `VERSION AS OF` and `TIMESTAMP AS OF` in queries to do time travel:
-
-```sql
--- read the snapshot with id 1L (use snapshot id as version)
-SELECT * FROM t VERSION AS OF 1;
-
--- read the snapshot from specified timestamp 
-SELECT * FROM t TIMESTAMP AS OF '2023-06-01 00:00:00.123';
-
--- read the snapshot from specified timestamp in unix seconds
-SELECT * FROM t TIMESTAMP AS OF 1678883047;
-
--- read tag 'my-tag'
-SELECT * FROM t VERSION AS OF 'my-tag';
-
--- read the snapshot from the specified watermark. Will match the first snapshot after the watermark
-SELECT * FROM t VERSION AS OF 'watermark-1678883047356';
-
-```
-
-{{< hint warning >}}
-If a tag's name is a number and equals a snapshot id, the VERSION AS OF syntax will consider the tag first. For example, if
-you have a tag named '1' based on snapshot 2, the statement `SELECT * FROM t VERSION AS OF '1'` actually queries snapshot 2
-instead of snapshot 1.
 {{< /hint >}}
 
-### Batch Incremental
-
-Read incremental changes between the start snapshot (exclusive) and the end snapshot.
-
-For example:
-- '5,10' means changes between snapshot 5 and snapshot 10.
-- 'TAG1,TAG3' means changes between TAG1 and TAG3.
-
-By default, changelog files are scanned for tables that produce changelog files; otherwise, newly changed files are scanned.
-You can also force a specific mode by specifying `'incremental-between-scan-mode'`.
-
-Paimon supports using Spark SQL to do incremental queries, implemented by Spark table-valued functions.
-
-```sql
--- read the incremental data between snapshot id 12 and snapshot id 20.
-SELECT * FROM paimon_incremental_query('tableName', 12, 20);
-
--- read the incremental data between ts 1692169000000 and ts 1692169900000.
-SELECT * FROM paimon_incremental_between_timestamp('tableName', '1692169000000', '1692169900000');
-SELECT * FROM paimon_incremental_between_timestamp('tableName', '2025-03-12 00:00:00', '2025-03-12 00:08:00');
-
--- read the incremental data to tag '2024-12-04'.
--- Paimon will find an earlier tag and return changes between them.
--- If the tag doesn't exist or the earlier tag doesn't exist, return empty.
-SELECT * FROM paimon_incremental_to_auto_tag('tableName', '2024-12-04');
+```scala
+// Create a paimon table if not exists.
+spark.sql(s"""
+           |CREATE TABLE T (k INT, v STRING)
+           |TBLPROPERTIES ('primary-key'='k', 'bucket'='3')
+           |""".stripMargin)
+
+// Here we use MemoryStream to fake a streaming source.
+val inputData = MemoryStream[(Int, String)]
+val df = inputData.toDS().toDF("k", "v")
+
+// Streaming Write to paimon table.
+val stream = df
+  .writeStream
+  .outputMode("append")
+  .option("checkpointLocation", "/path/to/checkpoint")
+  .format("paimon")
+  .start("/path/to/paimon/sink/table")
 ```
 
-In Batch SQL, the `DELETE` records are not allowed to be returned, so records of `-D` will be dropped.
-If you want to see `DELETE` records, you can query the audit_log table.
+Streaming write also supports [Write merge schema]({{< ref "spark/sql-write#write-merge-schema" >}}).
 
 ## Streaming Query
 
@@ -290,56 +231,3 @@ val query2 = spark.readStream
 +I   2  Hello
 */
 ```
-
-## Query Optimization
-
-It is highly recommended to specify partition and primary key filters
-along with the query, which will speed up the data skipping of the query.
-
-The filter functions that can accelerate data skipping are:
-- `=`
-- `<`
-- `<=`
-- `>`
-- `>=`
-- `IN (...)`
-- `LIKE 'abc%'`
-- `IS NULL`
-
-Paimon will sort the data by primary key, which speeds up the point queries
-and range queries. When using a composite primary key, it is best for the query
-filters to form a [leftmost prefix](https://dev.mysql.com/doc/refman/5.7/en/multiple-column-indexes.html)
-of the primary key for good acceleration.
-
-Suppose that a table has the following specification:
-
-```sql
-CREATE TABLE orders (
-    catalog_id BIGINT,
-    order_id BIGINT,
-    .....,
-) TBLPROPERTIES (
-    'primary-key' = 'catalog_id,order_id'
-);
-```
-
-The query obtains a good acceleration by specifying a range filter for
-the leftmost prefix of the primary key.
-
-```sql
-SELECT * FROM orders WHERE catalog_id=1025;
-
-SELECT * FROM orders WHERE catalog_id=1025 AND order_id=29495;
-
-SELECT * FROM orders
-  WHERE catalog_id=1025
-  AND order_id>2035 AND order_id<6000;
-```
-
-However, the following filters cannot accelerate the query well.
-
-```sql
-SELECT * FROM orders WHERE order_id=29495;
-
-SELECT * FROM orders WHERE catalog_id=1025 OR order_id=29495;
-```
diff --git a/docs/setup_hugo.sh b/docs/setup_hugo.sh
new file mode 100755
index 0000000000..b9eb37b836
--- /dev/null
+++ b/docs/setup_hugo.sh
@@ -0,0 +1,42 @@
+#!/usr/bin/env bash
+################################################################################
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#  See the License for the specific language governing permissions and
+# limitations under the License.
+################################################################################
+
+# setup hugo
+
+# Detect Operating System
+OS="Linux"
+[[ "$OSTYPE" == "darwin"* ]] && OS="Mac"
+
+# Setup Hugo based on OS
+if [ "$OS" = "Mac" ]; then
+    HUGO_ARTIFACT="hugo_extended_0.124.1_darwin-universal.tar.gz"
+else
+    HUGO_ARTIFACT="hugo_extended_0.124.1_Linux-64bit.tar.gz"
+fi
+
+HUGO_REPO="https://github.com/gohugoio/hugo/releases/download/v0.124.1/${HUGO_ARTIFACT}";
+if ! curl --fail -OL $HUGO_REPO ; then
+       echo "Failed to download Hugo binary"
+       exit 1
+fi
+if [ "$OS" = "Mac" ]; then
+    tar -zxvf $HUGO_ARTIFACT -C /usr/local/bin --include='hugo'
+else
+    tar -zxvf $HUGO_ARTIFACT -C /usr/local/bin --wildcards --no-anchored 'hugo'
+fi
