This is an automated email from the ASF dual-hosted git repository.
lzljs3620320 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/paimon.git
The following commit(s) were added to refs/heads/master by this push:
new 619a206c1c [doc] Document more optimizations for Query Performance
619a206c1c is described below
commit 619a206c1c8269cd2181b92c4e05cf3d3c1840b0
Author: JingsongLi <[email protected]>
AuthorDate: Thu Aug 14 12:03:59 2025 +0800
[doc] Document more optimizations for Query Performance
---
docs/content/append-table/bucketed.md | 21 +--------
docs/content/append-table/query-performance.md | 39 ++++++++++++++++
.../content/primary-key-table/query-performance.md | 54 ++++++++++++----------
3 files changed, 71 insertions(+), 43 deletions(-)
diff --git a/docs/content/append-table/bucketed.md
b/docs/content/append-table/bucketed.md
index c485fe7c7a..5643da0c05 100644
--- a/docs/content/append-table/bucketed.md
+++ b/docs/content/append-table/bucketed.md
@@ -171,23 +171,6 @@ INSERT INTO paimon_table SELECT * FROM kafka_table;
SELECT * FROM paimon_table /*+ OPTIONS('scan.bounded.watermark'='...') */;
```
-## Batch
+## Bucketed Join
-Bucketed table can be used to avoid shuffle if necessary in batch query, for example, you can use the following Spark
-SQL to read a Paimon table:
-
-```sql
-SET spark.sql.sources.v2.bucketing.enabled = true;
-
-CREATE TABLE FACT_TABLE (order_id INT, f1 STRING) TBLPROPERTIES ('bucket'='10', 'bucket-key' = 'order_id');
-
-CREATE TABLE DIM_TABLE (order_id INT, f2 STRING) TBLPROPERTIES ('bucket'='10', 'primary-key' = 'order_id');
-
-SELECT * FROM FACT_TABLE JOIN DIM_TABLE on t1.order_id = t4.order_id;
-```
-
-The `spark.sql.sources.v2.bucketing.enabled` config is used to enable bucketing for V2 data sources. When turned on,
-Spark will recognize the specific distribution reported by a V2 data source through SupportsReportPartitioning, and
-will try to avoid shuffle if necessary.
-
-The costly join shuffle will be avoided if two tables have the same bucketing strategy and same number of buckets.
+A bucketed table can be used to avoid shuffles in batch queries; see [Bucketed Join]({{< ref "append-table/query-performance#bucketed-join" >}}).
diff --git a/docs/content/append-table/query-performance.md
b/docs/content/append-table/query-performance.md
index e2128bbd89..1c3d48468d 100644
--- a/docs/content/append-table/query-performance.md
+++ b/docs/content/append-table/query-performance.md
@@ -26,6 +26,24 @@ under the License.
# Query Performance
+## Aggregate push down
+
+Append Table supports aggregate push down:
+
+```sql
+SELECT COUNT(*) FROM T WHERE DT = '20230101';
+```
+
+This query can be answered from table metadata during compilation and returns very quickly.
+
+For Spark SQL, min/max queries can also be accelerated on tables with the default `metadata.stats-mode`:
+
+```sql
+SELECT MIN(a), MAX(b) FROM T WHERE DT = '20230101';
+```
+
+Min/max queries can also be answered during compilation and return very quickly.
+
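+Min/max push down relies on the file-level statistics controlled by `metadata.stats-mode` (the
+default is `truncate(16)`). A minimal sketch, assuming a hypothetical table `T` and the Flink SQL
+dialect, of setting the stats mode explicitly at table creation:
+
+```sql
+-- 'full' keeps complete min/max statistics; 'none' would disable the
+-- statistics that min/max push down depends on.
+CREATE TABLE T (a INT, b STRING, DT STRING) WITH (
+    'metadata.stats-mode' = 'full'
+);
+```
+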
## Data Skipping By Order
Paimon by default records the maximum and minimum values of each field in the manifest file.
@@ -75,3 +93,24 @@ we use the procedure, you should config appropriate configurations in target tab
`file-index.<filter-type>.columns` to the table.
How to invoke: see [flink procedures]({{< ref "flink/procedures#procedures" >}})
+
+## Bucketed Join
+
+A bucketed table can be used to avoid shuffles in batch queries. For example, you can use the
+following Spark SQL to read a Paimon table:
+
+```sql
+SET spark.sql.sources.v2.bucketing.enabled = true;
+
+CREATE TABLE FACT_TABLE (order_id INT, f1 STRING) TBLPROPERTIES ('bucket'='10', 'bucket-key' = 'order_id');
+
+CREATE TABLE DIM_TABLE (order_id INT, f2 STRING) TBLPROPERTIES ('bucket'='10', 'primary-key' = 'order_id');
+
+SELECT * FROM FACT_TABLE JOIN DIM_TABLE ON FACT_TABLE.order_id = DIM_TABLE.order_id;
+```
+
+The `spark.sql.sources.v2.bucketing.enabled` config enables bucketing for V2 data sources. When
+turned on, Spark recognizes the specific distribution reported by a V2 data source through
+`SupportsReportPartitioning` and tries to avoid the shuffle when possible.
+
+The costly join shuffle is avoided if the two tables have the same bucketing strategy and the same number of buckets.
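+
+To verify that the shuffle is actually removed, inspect the physical plan. A quick, hedged check
+(using the tables defined above) is to run `EXPLAIN` and confirm that no Exchange appears between
+the two table scans:
+
+```sql
+EXPLAIN SELECT * FROM FACT_TABLE JOIN DIM_TABLE
+ON FACT_TABLE.order_id = DIM_TABLE.order_id;
+```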
diff --git a/docs/content/primary-key-table/query-performance.md
b/docs/content/primary-key-table/query-performance.md
index d27c4868fe..2ba19b0d3d 100644
--- a/docs/content/primary-key-table/query-performance.md
+++ b/docs/content/primary-key-table/query-performance.md
@@ -36,40 +36,46 @@ the concurrency of reading data.
For MOW (Deletion Vectors) or COW table or [Read Optimized]({{< ref "concepts/system-tables#read-optimized-table" >}}) table,
there is no limit to the concurrency of reading data, and they can also utilize some filtering conditions for non-primary-key columns.
-## Data Skipping By Primary Key Filter
+## Aggregate push down
-For a regular bucketed table (For example, bucket = 5), the filtering conditions of the primary key will greatly
-accelerate queries and reduce the reading of a large number of files.
+A table with deletion vectors enabled supports aggregate push down:
+
+```sql
+SELECT COUNT(*) FROM T WHERE DT = '20230101';
+```
-## Data Skipping By File Index
+This query can be answered from table metadata during compilation and returns very quickly.
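+
+A minimal sketch of enabling deletion vectors so that aggregate push down applies; the table name,
+columns, and partitioning here are illustrative (Flink SQL dialect):
+
+```sql
+CREATE TABLE T (
+    id INT,
+    a INT,
+    b STRING,
+    DT STRING,
+    PRIMARY KEY (id, DT) NOT ENFORCED
+) PARTITIONED BY (DT) WITH (
+    'deletion-vectors.enabled' = 'true'
+);
+```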
-You can use file index to table with Deletion Vectors enabled, it filters files by index on the read side.
+For Spark SQL, min/max queries can also be accelerated on tables with the default `metadata.stats-mode`:
```sql
-CREATE TABLE <PAIMON_TABLE> WITH (
- 'deletion-vectors.enabled' = 'true',
- 'file-index.bloom-filter.columns' = 'c1,c2',
- 'file-index.bloom-filter.c1.items' = '200'
-);
+SELECT MIN(a), MAX(b) FROM T WHERE DT = '20230101';
```
-Supported filter types:
+Min/max queries can also be answered during compilation and return very quickly.
-`Bloom Filter`:
-* `file-index.bloom-filter.columns`: specify the columns that need bloom filter index.
-* `file-index.bloom-filter.<column_name>.fpp` to config false positive probability.
-* `file-index.bloom-filter.<column_name>.items` to config the expected distinct items in one data file.
+## Data Skipping By Primary Key Filter
-`Bitmap`:
-* `file-index.bitmap.columns`: specify the columns that need bitmap index. See [Index Bitmap]({{< ref "concepts/spec/fileindex#index-bitmap" >}}).
+For a regular bucketed table (for example, `bucket = 5`), filtering on the primary key greatly
+accelerates queries and avoids reading a large number of files.
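+
+For example, assuming a hypothetical table `T` whose primary key contains `id`, a point lookup
+such as the following only needs to read the single bucket that `id = 5` hashes to:
+
+```sql
+SELECT * FROM T WHERE id = 5;
+```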
+
+## Bucketed Join
-`Bit-Slice Index Bitmap`
-* `file-index.bsi.columns`: specify the columns that need bsi index.
+A fixed-bucket table (e.g. `bucket = 10`) can be used to avoid shuffles in batch queries. For
+example, you can use the following Spark SQL to read a Paimon table:
+
+```sql
+SET spark.sql.sources.v2.bucketing.enabled = true;
-More filter types will be supported...
+CREATE TABLE FACT_TABLE (order_id INT, f1 STRING) TBLPROPERTIES ('bucket'='10', 'primary-key' = 'order_id');
+
+CREATE TABLE DIM_TABLE (order_id INT, f2 STRING) TBLPROPERTIES ('bucket'='10', 'primary-key' = 'order_id');
+
+SELECT * FROM FACT_TABLE JOIN DIM_TABLE ON FACT_TABLE.order_id = DIM_TABLE.order_id;
+```
-If you want to add file index to existing table, without any rewrite, you can use `rewrite_file_index` procedure. Before
-we use the procedure, you should config appropriate configurations in target table. You can use ALTER clause to config
-`file-index.<filter-type>.columns` to the table.
+The `spark.sql.sources.v2.bucketing.enabled` config enables bucketing for V2 data sources. When
+turned on, Spark recognizes the specific distribution reported by a V2 data source through
+`SupportsReportPartitioning` and tries to avoid the shuffle when possible.
-How to invoke: see [flink procedures]({{< ref "flink/procedures#procedures" >}})
+The costly join shuffle is avoided if the two tables have the same bucketing strategy and the same number of buckets.
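+
+As in the append table case, a quick sanity check (using the hypothetical tables above) is to run
+`EXPLAIN` on the join and confirm that no Exchange appears between the two table scans:
+
+```sql
+EXPLAIN SELECT * FROM FACT_TABLE JOIN DIM_TABLE
+ON FACT_TABLE.order_id = DIM_TABLE.order_id;
+```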