This is an automated email from the ASF dual-hosted git repository.
morningman pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new 1b13fe5263c [catalog](benchmark) add trino connector doc (#656)
1b13fe5263c is described below
commit 1b13fe5263c13a7bb460023915ffb806e0b82539
Author: Mingyu Chen <[email protected]>
AuthorDate: Thu May 16 16:42:31 2024 +0800
[catalog](benchmark) add trino connector doc (#656)
add tpch and tpcds catalog doc
---
docs/lakehouse/datalake-analytics/tpcds.md | 179 ++++++++++++++++++++
docs/lakehouse/datalake-analytics/tpch.md | 147 ++++++++++++++++
.../current/lakehouse/datalake-analytics/tpcds.md | 187 +++++++++++++++++++++
.../current/lakehouse/datalake-analytics/tpch.md | 156 +++++++++++++++++
sidebars.json | 6 +-
5 files changed, 673 insertions(+), 2 deletions(-)
diff --git a/docs/lakehouse/datalake-analytics/tpcds.md b/docs/lakehouse/datalake-analytics/tpcds.md
new file mode 100644
index 00000000000..4b3144d9bbf
--- /dev/null
+++ b/docs/lakehouse/datalake-analytics/tpcds.md
@@ -0,0 +1,179 @@
+---
+{
+"title": "TPCDS",
+"language": "en"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## Usage Notes
+
+TPCDS Catalog uses the [Trino Connector](https://doris.apache.org/community/how-to-contribute/trino-connector-developer-guide) compatibility framework and the [TPCDS Connector](https://trino.io/docs/current/connector/tpcds.html) to quickly build TPCDS test sets.
+
+:::tip
+This feature is supported starting from Doris version 3.0.0.
+:::
+
+## Compiling the TPCDS Connector
+
+> JDK 17 is required.
+
+```shell
+git clone https://github.com/trinodb/trino.git
+git checkout 435
+cd trino/plugin/trino-tpcds
+mvn clean install -DskipTests
+```
+
+After compiling, you will find the `trino-tpcds-435/` directory under `trino/plugin/trino-tpcds/target/`.
+
+You can also directly download the precompiled [trino-tpcds-435.tar.gz](https://github.com/morningman/trino-connectors/releases/download/trino-connectors/trino-tpcds-435.tar.gz) and extract it.
+
+## Deploying the TPCDS Connector
+
+Place the `trino-tpcds-435/` directory under the `connectors/` directory in the deployment paths of all FE and BE nodes (create the directory manually if it does not exist).
+
+```
+├── bin
+├── conf
+├── connectors
+│ ├── trino-tpcds-435
+...
+```
+
+After deployment, it is recommended to restart the FE and BE nodes to ensure the Connector is loaded correctly.
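
The deployment step above can be sketched as a shell sequence. This is a runnable illustration only: `DORIS_HOME` and the tarball are temporary stand-ins created on the fly, so substitute your real FE/BE deployment path and the downloaded `trino-tpcds-435.tar.gz`:

```shell
# Stand-ins for a real deployment; replace with your actual paths.
DORIS_HOME="$(mktemp -d)"      # e.g. /opt/doris/fe or /opt/doris/be
TARBALL_DIR="$(mktemp -d)"

# Fabricate a trino-tpcds-435.tar.gz so the example runs without a download.
mkdir -p "$TARBALL_DIR/trino-tpcds-435"
touch "$TARBALL_DIR/trino-tpcds-435/trino-tpcds-435.jar"
tar -czf "$TARBALL_DIR/trino-tpcds-435.tar.gz" -C "$TARBALL_DIR" trino-tpcds-435

# The actual deployment: create connectors/ if missing and unpack into it.
mkdir -p "$DORIS_HOME/connectors"
tar -xzf "$TARBALL_DIR/trino-tpcds-435.tar.gz" -C "$DORIS_HOME/connectors"
ls "$DORIS_HOME/connectors"
```

Repeat this on every FE and BE node, then restart them.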
+
+## Creating the TPCDS Catalog
+
+```sql
+CREATE CATALOG `tpcds` PROPERTIES (
+ "type" = "trino-connector",
+ "connector.name" = "tpcds",
+ "tpcds.split-count" = "32"
+);
+```
+
+The `tpcds.split-count` property sets the level of concurrency. It is recommended to set it to twice the number of cores per BE node to achieve optimal concurrency and improve data generation efficiency.
+
+## Using the TPCDS Catalog
+
+The TPCDS Catalog includes pre-configured TPCDS datasets of different scale factors, which can be viewed using the `SHOW DATABASES` and `SHOW TABLES` commands.
+
+```
+mysql> SWITCH tpcds;
+Query OK, 0 rows affected (0.00 sec)
+
+mysql> SHOW DATABASES;
++--------------------+
+| Database |
++--------------------+
+| information_schema |
+| mysql |
+| sf1 |
+| sf100 |
+| sf1000 |
+| sf10000 |
+| sf100000 |
+| sf300 |
+| sf3000 |
+| sf30000 |
+| tiny |
++--------------------+
+11 rows in set (0.00 sec)
+
+mysql> USE sf1;
+mysql> SHOW TABLES;
++------------------------+
+| Tables_in_sf1 |
++------------------------+
+| call_center |
+| catalog_page |
+| catalog_returns |
+| catalog_sales |
+| customer |
+| customer_address |
+| customer_demographics |
+| date_dim |
+| dbgen_version |
+| household_demographics |
+| income_band |
+| inventory |
+| item |
+| promotion |
+| reason |
+| ship_mode |
+| store |
+| store_returns |
+| store_sales |
+| time_dim |
+| warehouse |
+| web_page |
+| web_returns |
+| web_sales |
+| web_site |
++------------------------+
+25 rows in set (0.00 sec)
+```
+
+You can directly query these tables using the SELECT statement.
+
+:::tip
+The data in these pre-configured datasets is not actually stored but generated in real time during queries. Therefore, these datasets are not suitable for direct benchmarking. They are more appropriate for writing to other target tables (such as Doris internal tables, Hive, Iceberg, and other data sources supported by Doris) via `INSERT INTO SELECT`, after which performance tests can be conducted on the target tables.
+:::
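
As a sketch of that workflow (the target table `internal.db1.store_sales_bench` is hypothetical and assumed to be pre-created in a Doris internal catalog with a matching schema):

```sql
-- Materialize one generated table into a (hypothetical) Doris internal table.
INSERT INTO internal.db1.store_sales_bench
SELECT * FROM tpcds.sf1.store_sales;

-- Benchmarks then run against the materialized copy, not the generator.
SELECT COUNT(*) FROM internal.db1.store_sales_bench;
```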
+
+### Best Practices
+
+#### Quickly Build TPCDS Test Dataset
+
+You can quickly build a TPCDS test dataset using the CTAS (Create Table As Select) statement:
+
+```
+CREATE TABLE hive.tpcds100.call_center PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.call_center;
+CREATE TABLE hive.tpcds100.catalog_page PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.catalog_page;
+CREATE TABLE hive.tpcds100.catalog_returns PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.catalog_returns;
+CREATE TABLE hive.tpcds100.catalog_sales PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.catalog_sales;
+CREATE TABLE hive.tpcds100.customer PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.customer;
+CREATE TABLE hive.tpcds100.customer_address PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.customer_address;
+CREATE TABLE hive.tpcds100.customer_demographics PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.customer_demographics;
+CREATE TABLE hive.tpcds100.date_dim PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.date_dim;
+CREATE TABLE hive.tpcds100.dbgen_version PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.dbgen_version;
+CREATE TABLE hive.tpcds100.household_demographics PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.household_demographics;
+CREATE TABLE hive.tpcds100.income_band PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.income_band;
+CREATE TABLE hive.tpcds100.inventory PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.inventory;
+CREATE TABLE hive.tpcds100.item PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.item;
+CREATE TABLE hive.tpcds100.promotion PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.promotion;
+CREATE TABLE hive.tpcds100.reason PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.reason;
+CREATE TABLE hive.tpcds100.ship_mode PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.ship_mode;
+CREATE TABLE hive.tpcds100.store PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.store;
+CREATE TABLE hive.tpcds100.store_returns PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.store_returns;
+CREATE TABLE hive.tpcds100.store_sales PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.store_sales;
+CREATE TABLE hive.tpcds100.time_dim PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.time_dim;
+CREATE TABLE hive.tpcds100.warehouse PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.warehouse;
+CREATE TABLE hive.tpcds100.web_page PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.web_page;
+CREATE TABLE hive.tpcds100.web_returns PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.web_returns;
+CREATE TABLE hive.tpcds100.web_sales PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.web_sales;
+CREATE TABLE hive.tpcds100.web_site PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.web_site;
+```
+
+:::tip
+On a Doris cluster with 3 BE nodes, each with 16 cores, creating a TPCDS 1000 dataset in Hive takes approximately 3 to 4 hours.
+:::
+
diff --git a/docs/lakehouse/datalake-analytics/tpch.md b/docs/lakehouse/datalake-analytics/tpch.md
new file mode 100644
index 00000000000..252193779ff
--- /dev/null
+++ b/docs/lakehouse/datalake-analytics/tpch.md
@@ -0,0 +1,147 @@
+---
+{
+"title": "TPCH",
+"language": "en"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## Usage Notes
+
+TPCH Catalog uses the [Trino Connector](https://doris.apache.org/community/how-to-contribute/trino-connector-developer-guide) compatibility framework and the [TPCH Connector](https://trino.io/docs/current/connector/tpch.html) to quickly build TPCH test sets.
+
+:::tip
+This feature is supported starting from Doris version 3.0.0.
+:::
+
+## Compiling the TPCH Connector
+
+> JDK 17 is required.
+
+```shell
+git clone https://github.com/trinodb/trino.git
+git checkout 435
+cd trino/plugin/trino-tpch
+mvn clean install -DskipTests
+```
+
+After compiling, you will find the `trino-tpch-435/` directory under `trino/plugin/trino-tpch/target/`.
+
+You can also directly download the precompiled [trino-tpch-435.tar.gz](https://github.com/morningman/trino-connectors/releases/download/trino-connectors/trino-tpch-435.tar.gz) and extract it.
+
+## Deploying the TPCH Connector
+
+Place the `trino-tpch-435/` directory under the `connectors/` directory in the deployment paths of all FE and BE nodes (create the directory manually if it does not exist).
+
+```
+├── bin
+├── conf
+├── connectors
+│ ├── trino-tpch-435
+...
+```
+
+After deployment, it is recommended to restart the FE and BE nodes to ensure the Connector is loaded correctly.
+
+## Creating the TPCH Catalog
+
+```sql
+CREATE CATALOG `tpch` PROPERTIES (
+ "type" = "trino-connector",
+ "connector.name" = "tpch",
+ "tpch.column-naming" = "STANDARD",
+ "tpch.splits-per-node" = "32"
+);
+```
+
+The `tpch.splits-per-node` property sets the level of concurrency. It is recommended to set it to twice the number of cores per BE node to achieve optimal concurrency and improve data generation efficiency.
+
+When `"tpch.column-naming" = "STANDARD"`, column names in the TPCH tables are prefixed with the table-name abbreviation, such as `l_orderkey`; otherwise the name is `orderkey`.
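
The effect of the naming mode can be seen directly when querying; the sketch below assumes the catalog created above, and the commented variant shows the unprefixed naming used when the property is omitted:

```sql
-- With "tpch.column-naming" = "STANDARD", lineitem columns carry the l_ prefix:
SELECT l_orderkey, l_quantity FROM tpch.sf1.lineitem LIMIT 5;

-- Without it, the same columns are unprefixed:
-- SELECT orderkey, quantity FROM tpch.sf1.lineitem LIMIT 5;
```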
+
+## Using the TPCH Catalog
+
+The TPCH Catalog includes pre-configured TPCH datasets of different scale factors, which can be viewed using the `SHOW DATABASES` and `SHOW TABLES` commands.
+
+```
+mysql> SWITCH tpch;
+Query OK, 0 rows affected (0.00 sec)
+
+mysql> SHOW DATABASES;
++--------------------+
+| Database |
++--------------------+
+| information_schema |
+| mysql |
+| sf1 |
+| sf100 |
+| sf1000 |
+| sf10000 |
+| sf100000 |
+| sf300 |
+| sf3000 |
+| sf30000 |
+| tiny |
++--------------------+
+11 rows in set (0.00 sec)
+
+mysql> USE sf1;
+mysql> SHOW TABLES;
++---------------+
+| Tables_in_sf1 |
++---------------+
+| customer |
+| lineitem |
+| nation |
+| orders |
+| part |
+| partsupp |
+| region |
+| supplier |
++---------------+
+8 rows in set (0.00 sec)
+```
+
+You can directly query these tables using the SELECT statement.
+
+:::tip
+The data in these pre-configured datasets is not actually stored but generated in real time during queries. Therefore, these datasets are not suitable for direct benchmarking. They are more appropriate for writing to other target tables (such as Doris internal tables, Hive, Iceberg, and other data sources supported by Doris) via `INSERT INTO SELECT`, after which performance tests can be conducted on the target tables.
+:::
+
+### Best Practices
+
+#### Quickly Build TPCH Test Dataset
+
+You can quickly build a TPCH test dataset using the CTAS (Create Table As Select) statement:
+
+```
+CREATE TABLE hive.tpch100.customer PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.customer;
+CREATE TABLE hive.tpch100.lineitem PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.lineitem;
+CREATE TABLE hive.tpch100.nation PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.nation;
+CREATE TABLE hive.tpch100.orders PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.orders;
+CREATE TABLE hive.tpch100.part PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.part;
+CREATE TABLE hive.tpch100.partsupp PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.partsupp;
+CREATE TABLE hive.tpch100.region PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.region;
+CREATE TABLE hive.tpch100.supplier PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.supplier;
+```
+
+:::tip
+On a Doris cluster with 3 BE nodes, each with 16 cores, creating a TPCH 1000 dataset in Hive takes approximately 25 minutes, and TPCH 10000 takes about 4 to 5 hours.
+:::
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/datalake-analytics/tpcds.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/datalake-analytics/tpcds.md
new file mode 100644
index 00000000000..69a0853ef5a
--- /dev/null
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/datalake-analytics/tpcds.md
@@ -0,0 +1,187 @@
+---
+{
+"title": "TPCDS",
+"language": "zh-CN"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## Usage Notes
+
+TPCDS Catalog uses the [Trino Connector](https://doris.apache.org/community/how-to-contribute/trino-connector-developer-guide) compatibility framework and the [TPCDS Connector](https://trino.io/docs/current/connector/tpcds.html) to quickly build TPCDS test sets.
+
+:::tip
+This feature is supported starting from Doris version 3.0.0.
+:::
+
+## Compiling the TPCDS Connector
+
+> JDK 17 is required.
+
+```shell
+git clone https://github.com/trinodb/trino.git
+git checkout 435
+cd trino/plugin/trino-tpcds
+mvn clean install -DskipTests
+```
+
+After compiling, you will find the `trino-tpcds-435/` directory under `trino/plugin/trino-tpcds/target/`.
+
+You can also directly download the precompiled [trino-tpcds-435.tar.gz](https://github.com/morningman/trino-connectors/releases/download/trino-connectors/trino-tpcds-435.tar.gz) and extract it.
+
+## Deploying the TPCDS Connector
+
+Place the `trino-tpcds-435/` directory under the `connectors/` directory in the deployment paths of all FE and BE nodes (create the directory manually if it does not exist).
+
+```
+├── bin
+├── conf
+├── connectors
+│ ├── trino-tpcds-435
+...
+```
+
+After deployment, it is recommended to restart the FE and BE nodes to ensure the Connector is loaded correctly.
+
+## Creating the TPCDS Catalog
+
+```sql
+CREATE CATALOG `tpcds` PROPERTIES (
+ "type" = "trino-connector",
+ "connector.name" = "tpcds",
+ "tpcds.split-count" = "32"
+);
+```
+
+The `tpcds.split-count` property sets the level of concurrency. It is recommended to set it to twice the number of cores per BE node to achieve optimal concurrency and improve data generation efficiency.
+
+## Using the TPCDS Catalog
+
+The TPCDS Catalog includes pre-configured TPCDS datasets of different scale factors, which can be viewed using the `SHOW DATABASES` and `SHOW TABLES` commands.
+
+```
+mysql> SWITCH tpcds;
+Query OK, 0 rows affected (0.00 sec)
+
+mysql> SHOW DATABASES;
++--------------------+
+| Database |
++--------------------+
+| information_schema |
+| mysql |
+| sf1 |
+| sf100 |
+| sf1000 |
+| sf10000 |
+| sf100000 |
+| sf300 |
+| sf3000 |
+| sf30000 |
+| tiny |
++--------------------+
+11 rows in set (0.00 sec)
+
+mysql> USE sf1;
+mysql> SHOW TABLES;
++------------------------+
+| Tables_in_sf1 |
++------------------------+
+| call_center |
+| catalog_page |
+| catalog_returns |
+| catalog_sales |
+| customer |
+| customer_address |
+| customer_demographics |
+| date_dim |
+| dbgen_version |
+| household_demographics |
+| income_band |
+| inventory |
+| item |
+| promotion |
+| reason |
+| ship_mode |
+| store |
+| store_returns |
+| store_sales |
+| time_dim |
+| warehouse |
+| web_page |
+| web_returns |
+| web_sales |
+| web_site |
++------------------------+
+25 rows in set (0.00 sec)
+```
+
+You can directly query these tables using the SELECT statement.
+
+:::tip
+The data in these pre-configured datasets is not actually stored but generated in real time during queries. Therefore, these datasets are not suitable for direct benchmarking. They are more appropriate for writing to other target tables (such as Doris internal tables, Hive, Iceberg, and other data sources supported by Doris) via `INSERT INTO SELECT`, after which performance tests can be conducted on the target tables.
+:::
+
+### Best Practices
+
+#### Quickly Build TPCDS Test Dataset
+
+You can quickly build a TPCDS test dataset using the CTAS (Create Table As Select) statement:
+
+```
+CREATE TABLE hive.tpcds100.call_center PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.call_center;
+CREATE TABLE hive.tpcds100.catalog_page PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.catalog_page;
+CREATE TABLE hive.tpcds100.catalog_returns PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.catalog_returns;
+CREATE TABLE hive.tpcds100.catalog_sales PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.catalog_sales;
+CREATE TABLE hive.tpcds100.customer PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.customer;
+CREATE TABLE hive.tpcds100.customer_address PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.customer_address;
+CREATE TABLE hive.tpcds100.customer_demographics PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.customer_demographics;
+CREATE TABLE hive.tpcds100.date_dim PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.date_dim;
+CREATE TABLE hive.tpcds100.dbgen_version PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.dbgen_version;
+CREATE TABLE hive.tpcds100.household_demographics PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.household_demographics;
+CREATE TABLE hive.tpcds100.income_band PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.income_band;
+CREATE TABLE hive.tpcds100.inventory PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.inventory;
+CREATE TABLE hive.tpcds100.item PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.item;
+CREATE TABLE hive.tpcds100.promotion PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.promotion;
+CREATE TABLE hive.tpcds100.reason PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.reason;
+CREATE TABLE hive.tpcds100.ship_mode PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.ship_mode;
+CREATE TABLE hive.tpcds100.store PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.store;
+CREATE TABLE hive.tpcds100.store_returns PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.store_returns;
+CREATE TABLE hive.tpcds100.store_sales PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.store_sales;
+CREATE TABLE hive.tpcds100.time_dim PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.time_dim;
+CREATE TABLE hive.tpcds100.warehouse PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.warehouse;
+CREATE TABLE hive.tpcds100.web_page PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.web_page;
+CREATE TABLE hive.tpcds100.web_returns PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.web_returns;
+CREATE TABLE hive.tpcds100.web_sales PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.web_sales;
+CREATE TABLE hive.tpcds100.web_site PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.web_site;
+```
+
+:::tip
+On a Doris cluster with 3 BE nodes, each with 16 cores, creating a TPCDS 1000 dataset in Hive takes approximately 3 to 4 hours.
+:::
+
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/datalake-analytics/tpch.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/datalake-analytics/tpch.md
new file mode 100644
index 00000000000..8dc31766fc0
--- /dev/null
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/datalake-analytics/tpch.md
@@ -0,0 +1,156 @@
+---
+{
+"title": "TPCH",
+"language": "zh-CN"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## Usage Notes
+
+TPCH Catalog uses the [Trino Connector](https://doris.apache.org/community/how-to-contribute/trino-connector-developer-guide) compatibility framework and the [TPCH Connector](https://trino.io/docs/current/connector/tpch.html) to quickly build TPCH test sets.
+
+:::tip
+This feature is supported starting from Doris version 3.0.0.
+:::
+
+## Compiling the TPCH Connector
+
+> JDK 17 is required.
+
+```shell
+git clone https://github.com/trinodb/trino.git
+git checkout 435
+cd trino/plugin/trino-tpch
+mvn clean install -DskipTests
+```
+
+After compiling, you will find the `trino-tpch-435/` directory under `trino/plugin/trino-tpch/target/`.
+
+You can also directly download the precompiled [trino-tpch-435.tar.gz](https://github.com/morningman/trino-connectors/releases/download/trino-connectors/trino-tpch-435.tar.gz) and extract it.
+
+## Deploying the TPCH Connector
+
+Place the `trino-tpch-435/` directory under the `connectors/` directory in the deployment paths of all FE and BE nodes (create the directory manually if it does not exist).
+
+```
+├── bin
+├── conf
+├── connectors
+│ ├── trino-tpch-435
+...
+```
+
+After deployment, it is recommended to restart the FE and BE nodes to ensure the Connector is loaded correctly.
+
+## Creating the TPCH Catalog
+
+```sql
+CREATE CATALOG `tpch` PROPERTIES (
+ "type" = "trino-connector",
+ "connector.name" = "tpch",
+ "tpch.column-naming" = "STANDARD",
+ "tpch.splits-per-node" = "32"
+);
+```
+
+The `tpch.splits-per-node` property sets the level of concurrency. It is recommended to set it to twice the number of cores per BE node to achieve optimal concurrency and improve data generation efficiency.
+
+When `"tpch.column-naming" = "STANDARD"`, column names in the TPCH tables are prefixed with the table-name abbreviation, such as `l_orderkey`; otherwise the name is `orderkey`.
+
+## Using the TPCH Catalog
+
+The TPCH Catalog includes pre-configured TPCH datasets of different scale factors, which can be viewed using the `SHOW DATABASES` and `SHOW TABLES` commands.
+
+```
+mysql> SWITCH tpch;
+Query OK, 0 rows affected (0.00 sec)
+
+mysql> SHOW DATABASES;
++--------------------+
+| Database |
++--------------------+
+| information_schema |
+| mysql |
+| sf1 |
+| sf100 |
+| sf1000 |
+| sf10000 |
+| sf100000 |
+| sf300 |
+| sf3000 |
+| sf30000 |
+| tiny |
++--------------------+
+11 rows in set (0.00 sec)
+
+mysql> USE sf1;
+mysql> SHOW TABLES;
++---------------+
+| Tables_in_sf1 |
++---------------+
+| customer |
+| lineitem |
+| nation |
+| orders |
+| part |
+| partsupp |
+| region |
+| supplier |
++---------------+
+8 rows in set (0.00 sec)
+```
+
+You can directly query these tables using the SELECT statement.
+
+:::tip
+The data in these pre-configured datasets is not actually stored but generated in real time during queries. Therefore, these datasets are not suitable for direct benchmarking. They are more appropriate for writing to other target tables (such as Doris internal tables, Hive, Iceberg, and other data sources supported by Doris) via `INSERT INTO SELECT`, after which performance tests can be conducted on the target tables.
+:::
+
+### Best Practices
+
+#### Quickly Build TPCH Test Dataset
+
+You can quickly build a TPCH test dataset using the CTAS (Create Table As Select) statement:
+
+```
+CREATE TABLE hive.tpch100.customer PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.customer;
+CREATE TABLE hive.tpch100.lineitem PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.lineitem;
+CREATE TABLE hive.tpch100.nation PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.nation;
+CREATE TABLE hive.tpch100.orders PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.orders;
+CREATE TABLE hive.tpch100.part PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.part;
+CREATE TABLE hive.tpch100.partsupp PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.partsupp;
+CREATE TABLE hive.tpch100.region PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.region;
+CREATE TABLE hive.tpch100.supplier PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.supplier;
+```
+
+:::tip
+On a Doris cluster with 3 BE nodes, each with 16 cores, creating a TPCH 1000 dataset in Hive takes approximately 25 minutes, and TPCH 10000 takes about 4 to 5 hours.
+:::
+:::
+
diff --git a/sidebars.json b/sidebars.json
index 48ca0ebd687..18e5621608c 100644
--- a/sidebars.json
+++ b/sidebars.json
@@ -265,7 +265,9 @@
"lakehouse/datalake-analytics/hudi",
"lakehouse/datalake-analytics/iceberg",
"lakehouse/datalake-analytics/paimon",
- "lakehouse/datalake-analytics/dlf"
+ "lakehouse/datalake-analytics/dlf",
+ "lakehouse/datalake-analytics/tpch",
+ "lakehouse/datalake-analytics/tpcds"
]
},
{
@@ -1522,4 +1524,4 @@
]
}
]
-}
\ No newline at end of file
+}
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]