This is an automated email from the ASF dual-hosted git repository.
lzljs3620320 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/paimon.git
The following commit(s) were added to refs/heads/master by this push:
new 619a206c1c [doc] Document more optimizations for Query Performance
619a206c1c is described below
commit 619a206c1c8269cd2181b92c4e05cf3d3c1840b0
Author: JingsongLi <[email protected]>
AuthorDate: Thu Aug 14 12:03:59 2025 +0800
[doc] Document more optimizations for Query Performance
---
docs/content/append-table/bucketed.md | 21 +--------
docs/content/append-table/query-performance.md | 39 ++++++++++++++++
.../content/primary-key-table/query-performance.md | 54 ++++++++++++----------
3 files changed, 71 insertions(+), 43 deletions(-)
diff --git a/docs/content/append-table/bucketed.md
b/docs/content/append-table/bucketed.md
index c485fe7c7a..5643da0c05 100644
--- a/docs/content/append-table/bucketed.md
+++ b/docs/content/append-table/bucketed.md
@@ -171,23 +171,6 @@ INSERT INTO paimon_table SELECT * FROM kafka_table;
SELECT * FROM paimon_table /*+ OPTIONS('scan.bounded.watermark'='...') */;
```
-## Batch
+## Bucketed Join
-Bucketed table can be used to avoid shuffle if necessary in batch query, for example, you can use the following Spark
-SQL to read a Paimon table:
-
-```sql
-SET spark.sql.sources.v2.bucketing.enabled = true;
-
-CREATE TABLE FACT_TABLE (order_id INT, f1 STRING) TBLPROPERTIES ('bucket'='10', 'bucket-key' = 'order_id');
-
-CREATE TABLE DIM_TABLE (order_id INT, f2 STRING) TBLPROPERTIES ('bucket'='10', 'primary-key' = 'order_id');
-
-SELECT * FROM FACT_TABLE JOIN DIM_TABLE on t1.order_id = t4.order_id;
-```
-
-The `spark.sql.sources.v2.bucketing.enabled` config is used to enable bucketing for V2 data sources. When turned on,
-Spark will recognize the specific distribution reported by a V2 data source through SupportsReportPartitioning, and
-will try to avoid shuffle if necessary.
-
-The costly join shuffle will be avoided if two tables have the same bucketing strategy and same number of buckets.
+A bucketed table can be used to avoid shuffles in batch queries; see [Bucketed Join]({{< ref "append-table/query-performance#bucketed-join" >}}).
diff --git a/docs/content/append-table/query-performance.md
b/docs/content/append-table/query-performance.md
index e2128bbd89..1c3d48468d 100644
--- a/docs/content/append-table/query-performance.md
+++ b/docs/content/append-table/query-performance.md
@@ -26,6 +26,24 @@ under the License.
# Query Performance
+## Aggregate push down
+
+Append Table supports aggregate push down:
+
+```sql
+SELECT COUNT(*) FROM T WHERE DT = '20230101';
+```
+
+This query can be answered from table metadata during compilation and returns very quickly.
+
+For Spark SQL, min/max queries can also be accelerated on tables with the default `metadata.stats-mode`:
+
+```sql
+SELECT MIN(a), MAX(b) FROM T WHERE DT = '20230101';
+```
+
+Min/max queries can also be answered during compilation and return very quickly.
+
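+Min/max push down relies on the file-level statistics controlled by `metadata.stats-mode` (the
+default is `truncate(16)`). A minimal sketch, assuming a hypothetical table `T` and the Flink SQL
+dialect, of setting the stats mode explicitly at table creation:
+
+```sql
+-- 'full' keeps complete min/max statistics; 'none' would disable the
+-- statistics that min/max push down depends on.
+CREATE TABLE T (a INT, b STRING, DT STRING) WITH (
+    'metadata.stats-mode' = 'full'
+);
+```
+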
## Data Skipping By Order
Paimon by default records the maximum and minimum values of each field in the manifest file.
@@ -75,3 +93,24 @@ we use the procedure, you should config appropriate configurations in target tab
`file-index.<filter-type>.columns` to the table.
How to invoke: see [flink procedures]({{< ref "flink/procedures#procedures" >}})
+
+## Bucketed Join
+
+A bucketed table can be used to avoid shuffles in batch queries. For example, you can use the
+following Spark SQL to read a Paimon table:
+
+```sql
+SET spark.sql.sources.v2.bucketing.enabled = true;
+
+CREATE TABLE FACT_TABLE (order_id INT, f1 STRING) TBLPROPERTIES ('bucket'='10', 'bucket-key' = 'order_id');
+
+CREATE TABLE DIM_TABLE (order_id INT, f2 STRING) TBLPROPERTIES ('bucket'='10', 'primary-key' = 'order_id');
+
+SELECT * FROM FACT_TABLE JOIN DIM_TABLE ON FACT_TABLE.order_id = DIM_TABLE.order_id;
+```
+
+The `spark.sql.sources.v2.bucketing.enabled` config enables bucketing for V2 data sources. When
+turned on, Spark recognizes the specific distribution reported by a V2 data source through
+`SupportsReportPartitioning` and tries to avoid the shuffle when possible.
+
+The costly join shuffle is avoided if the two tables have the same bucketing strategy and the same number of buckets.
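+
+To verify that the shuffle is actually removed, inspect the physical plan. A quick, hedged check
+(using the tables defined above) is to run `EXPLAIN` and confirm that no Exchange appears between
+the two table scans:
+
+```sql
+EXPLAIN SELECT * FROM FACT_TABLE JOIN DIM_TABLE
+ON FACT_TABLE.order_id = DIM_TABLE.order_id;
+```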
diff --git a/docs/content/primary-key-table/query-performance.md
b/docs/content/primary-key-table/query-performance.md
index d27c4868fe..2ba19b0d3d 100644
--- a/docs/content/primary-key-table/query-performance.md
+++ b/docs/content/primary-key-table/query-performance.md
@@ -36,40 +36,46 @@ the concurrency of reading data.
For MOW (Deletion Vectors) or COW table or [Read Optimized]({{< ref "concepts/system-tables#read-optimized-table" >}}) table,
there is no limit to the concurrency of reading data, and they can also utilize some filtering conditions for non-primary-key columns.
-## Data Skipping By Primary Key Filter
+## Aggregate push down
-For a regular bucketed table (For example, bucket = 5), the filtering conditions of the primary key will greatly
-accelerate queries and reduce the reading of a large number of files.
+A table with deletion vectors enabled supports aggregate push down:
+
+```sql
+SELECT COUNT(*) FROM T WHERE DT = '20230101';
+```
-## Data Skipping By File Index
+This query can be answered from table metadata during compilation and returns very quickly.
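+
+A minimal sketch of enabling deletion vectors so that aggregate push down applies; the table name,
+columns, and partitioning here are illustrative (Flink SQL dialect):
+
+```sql
+CREATE TABLE T (
+    id INT,
+    a INT,
+    b STRING,
+    DT STRING,
+    PRIMARY KEY (id, DT) NOT ENFORCED
+) PARTITIONED BY (DT) WITH (
+    'deletion-vectors.enabled' = 'true'
+);
+```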
-You can use file index to table with Deletion Vectors enabled, it filters files by index on the read side.
+For Spark SQL, min/max queries can also be accelerated on tables with the default `metadata.stats-mode`:
```sql
-CREATE TABLE <PAIMON_TABLE> WITH (
- 'deletion-vectors.enabled' = 'true',
- 'file-index.bloom-filter.columns' = 'c1,c2',
- 'file-index.bloom-filter.c1.items' = '200'
-);
+SELECT MIN(a), MAX(b) FROM T WHERE DT = '20230101';
```
-Supported filter types:
+Min/max queries can also be answered during compilation and return very quickly.
-`Bloom Filter`:
-* `file-index.bloom-filter.columns`: specify the columns that need bloom filter index.
-* `file-index.bloom-filter.<column_name>.fpp` to config false positive probability.
-* `file-index.bloom-filter.<column_name>.items` to config the expected distinct items in one data file.
+## Data Skipping By Primary Key Filter
-`Bitmap`:
-* `file-index.bitmap.columns`: specify the columns that need bitmap index. See [Index Bitmap]({{< ref "concepts/spec/fileindex#index-bitmap" >}}).
+For a regular bucketed table (for example, `bucket = 5`), filtering on the primary key greatly
+accelerates queries and avoids reading a large number of files.
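+
+For example, assuming a hypothetical table `T` whose primary key contains `id`, a point lookup
+such as the following only needs to read the single bucket that `id = 5` hashes to:
+
+```sql
+SELECT * FROM T WHERE id = 5;
+```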
+
+## Bucketed Join
-`Bit-Slice Index Bitmap`
-* `file-index.bsi.columns`: specify the columns that need bsi index.
+A fixed-bucket table (e.g. `bucket = 10`) can be used to avoid shuffles in batch queries. For
+example, you can use the following Spark SQL to read a Paimon table:
+
+```sql
+SET spark.sql.sources.v2.bucketing.enabled = true;
-More filter types will be supported...
+CREATE TABLE FACT_TABLE (order_id INT, f1 STRING) TBLPROPERTIES ('bucket'='10', 'primary-key' = 'order_id');
+
+CREATE TABLE DIM_TABLE (order_id INT, f2 STRING) TBLPROPERTIES ('bucket'='10', 'primary-key' = 'order_id');
+
+SELECT * FROM FACT_TABLE JOIN DIM_TABLE ON FACT_TABLE.order_id = DIM_TABLE.order_id;
+```
-If you want to add file index to existing table, without any rewrite, you can use `rewrite_file_index` procedure. Before
-we use the procedure, you should config appropriate configurations in target table. You can use ALTER clause to config
-`file-index.<filter-type>.columns` to the table.
+The `spark.sql.sources.v2.bucketing.enabled` config enables bucketing for V2 data sources. When
+turned on, Spark recognizes the specific distribution reported by a V2 data source through
+`SupportsReportPartitioning` and tries to avoid the shuffle when possible.
-How to invoke: see [flink procedures]({{< ref "flink/procedures#procedures" >}})
+The costly join shuffle is avoided if the two tables have the same bucketing strategy and the same number of buckets.
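+
+As in the append table case, a quick sanity check (using the hypothetical tables above) is to run
+`EXPLAIN` on the join and confirm that no Exchange appears between the two table scans:
+
+```sql
+EXPLAIN SELECT * FROM FACT_TABLE JOIN DIM_TABLE
+ON FACT_TABLE.order_id = DIM_TABLE.order_id;
+```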