This is an automated email from the ASF dual-hosted git repository.
morningman pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new 52df93edd07 [opt](mc) update maxcompute doc (#3474)
52df93edd07 is described below
commit 52df93edd07604dceb78d6bc927a7bdb7eb74b99
Author: Mingyu Chen (Rayner) <[email protected]>
AuthorDate: Tue Mar 17 17:15:01 2026 -0700
[opt](mc) update maxcompute doc (#3474)
---
docs/lakehouse/catalogs/maxcompute-catalog.md | 52 +++++++++++++++++++++-
.../lakehouse/catalogs/maxcompute-catalog.md | 52 +++++++++++++++++++++-
.../lakehouse/catalogs/maxcompute-catalog.md | 52 +++++++++++++++++++++-
.../lakehouse/catalogs/maxcompute-catalog.md | 52 +++++++++++++++++++++-
.../lakehouse/catalogs/maxcompute-catalog.md | 52 +++++++++++++++++++++-
.../lakehouse/catalogs/maxcompute-catalog.md | 52 +++++++++++++++++++++-
6 files changed, 306 insertions(+), 6 deletions(-)
diff --git a/docs/lakehouse/catalogs/maxcompute-catalog.md b/docs/lakehouse/catalogs/maxcompute-catalog.md
index 1bea22a7628..ae73869d6fd 100644
--- a/docs/lakehouse/catalogs/maxcompute-catalog.md
+++ b/docs/lakehouse/catalogs/maxcompute-catalog.md
@@ -63,6 +63,19 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES (
| `mc.datetime_predicate_push_down` | `true` | Whether to allow predicate push-down for `timestamp/timestamp_ntz` types. Doris loses precision (9 -> 6) when syncing these two types. Therefore, if the original data precision is higher than 6 digits, predicate push-down may lead to inaccurate results. | 2.1.9/3.0.5 (inclusive) and later |
| `mc.account_format` | `name` | The account systems of the Alibaba Cloud International and China sites are inconsistent. For International site users, if you encounter errors like `user 'RAM$xxxxxx:xxxxx' is not a valid aliyun account`, you can set this parameter to `id`. | 3.0.9/3.1.1 (inclusive) and later |
| `mc.enable.namespace.schema` | `false` | Whether to support the MaxCompute schema hierarchy. See: https://help.aliyun.com/zh/maxcompute/user-guide/schema-related-operations | 3.1.3 (inclusive) and later |
+ | `mc.max_field_size_bytes` | `8388608` (8 MB) | Maximum number of bytes allowed for a single field in a write session. When writing data that contains large string or binary fields, the write may fail if the field size exceeds this value. You can increase this value based on your actual data. | 4.1.0 (inclusive) and later |
+
+ - `mc.max_field_size_bytes`
+
+ MaxCompute allows a maximum of 8 MB per field by default. If the data being written contains large string or binary fields, the write may fail.
+
+ To adjust this limit, first execute the following command in the MaxCompute console SQL editor:
+
+ `setproject odps.sql.cfile2.field.maxsize=262144;`
+
+ This sets the maximum size of a single field. The unit is KB, and the maximum allowed value is 262144 (i.e., 256 MB).
+
+ Then set `mc.max_field_size_bytes` to 262144 in the Doris catalog properties (this value must not exceed the MaxCompute setting).
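For reference, a catalog definition using this property might look like the following sketch. The connection properties are assumed to match the MaxCompute catalog's standard creation example; all values (project, keys, endpoint) are placeholders, not taken from this commit:

```sql
CREATE CATALOG IF NOT EXISTS mc_ctl PROPERTIES (
    "type" = "max_compute",
    "mc.default.project" = "your_project",   -- placeholder project name
    "mc.access_key" = "<access_key>",        -- placeholder credential
    "mc.secret_key" = "<secret_key>",        -- placeholder credential
    "mc.endpoint" = "<maxcompute_endpoint>", -- placeholder endpoint
    "mc.max_field_size_bytes" = "262144"     -- must not exceed the MaxCompute setting
);
```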
* `{CommonProperties}`
@@ -180,6 +193,31 @@ SELECT * FROM mc_tbl LIMIT 10;
SELECT * FROM mc_ctl.mc_db.mc_tbl LIMIT 10;
```
+### Query Optimization
+
+- LIMIT Query Optimization (Since 4.1.0)
+
+ This parameter is only applicable to scenarios where `LIMIT 1` is frequently used to check whether data exists.
+
+ When querying MaxCompute tables, if the query contains only partition column equality predicates (`=` or `IN`) and a `LIMIT` clause, you can enable the session variable `enable_mc_limit_split_optimization` to optimize the split generation strategy.
+
+ When enabled, the system uses a `row_offset` strategy to read only the required number of rows instead of generating splits for all data. This can reduce the split count from many to exactly one, significantly reducing query latency.
+
+ This optimization applies to queries like:
+
+ ```sql
+ SELECT * FROM mc_tbl WHERE pt = 'value' LIMIT 100;
+ SELECT * FROM mc_tbl WHERE pt IN ('v1', 'v2') LIMIT 100;
+ ```
+
+ To enable:
+
+ ```sql
+ SET enable_mc_limit_split_optimization = true;
+ ```
+
+ > This parameter is disabled by default. The optimization will not take effect when the query contains non-partition-column filters, non-equality predicates (such as `>`, `<`, `!=`), or no `LIMIT` clause.
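For contrast, here is a sketch of queries that, per the conditions in the note above, would not trigger the optimization (`pt` is assumed to be the partition column, `c1` a regular column):

```sql
SELECT * FROM mc_tbl WHERE pt = 'value' AND c1 > 10 LIMIT 100; -- non-partition-column filter
SELECT * FROM mc_tbl WHERE pt > 'value' LIMIT 100;             -- non-equality predicate
SELECT * FROM mc_tbl WHERE pt = 'value';                       -- no LIMIT clause
```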
+
## Write Operations
Starting from version 4.1.0, Doris supports write operations to MaxCompute tables. You can use standard INSERT statements to write data from other data sources directly to MaxCompute tables through Doris.
@@ -310,7 +348,7 @@ For MaxCompute Database, after deletion, all tables under it will also be delete
DROP TABLE [IF EXISTS] mc_tbl;
```
-## Appendix
+## FAQ
### How to Obtain Endpoint and Quota (Applicable for Doris 2.1.7 and Later)
@@ -356,3 +394,15 @@ Note:
1. This method can only control the concurrent request count for a single table within a single Query, and cannot control resource usage across multiple SQL statements.
2. Reducing concurrency means increasing the Query execution time.
+
+### Write Best Practices
+
+- It is recommended to write to a specified partition whenever possible, e.g., `INSERT INTO mc_tbl PARTITION(ds='20250201')`. When no partition is specified, due to limitations of the MaxCompute Storage API, the data for each partition must be written sequentially. As a result, the execution plan will sort the data by the partition columns, which can consume significant memory when the data volume is large and may cause the write to fail.
+
+- When writing without specifying a partition, do not set `enable_strict_consistency_dml=false`. Doing so forcibly removes the sort node, causing partition data to be written out of order, which will ultimately result in an error from MaxCompute.
+
+- Do not add a `LIMIT` clause. When a `LIMIT` clause is present, Doris uses only a single thread for writing in order to guarantee the number of rows written. This is acceptable for small-scale testing, but if the `LIMIT` value is large, write performance will be poor.
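To illustrate the first recommendation, a partition-specified write might look like the following sketch (catalog, database, table, and column names are hypothetical):

```sql
-- Writing a single, explicitly specified partition avoids the
-- partition-column sort step in the execution plan.
INSERT INTO mc_ctl.mc_db.mc_tbl PARTITION(ds='20250201')
SELECT id, name
FROM internal.example_db.src_tbl
WHERE dt = '20250201';
```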
+
+### Write Error: `Data invalid: ODPS-0020041:StringOutOfMaxLength`
+
+Refer to the description of `mc.max_field_size_bytes`.
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/maxcompute-catalog.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/maxcompute-catalog.md
index 856633f9944..0ba3b8d1161 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/maxcompute-catalog.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/maxcompute-catalog.md
@@ -63,6 +63,19 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES (
| `mc.datetime_predicate_push_down` | `true` | 是否允许下推 `timestamp/timestamp_ntz` 类型的谓词条件。Doris 对这两个类型的同步会丢失精度(9 -> 6)。因此如果原数据精度高于 6 位,则条件下推可能导致结果不准确。 | 2.1.9/3.0.5(含)之后 |
| `mc.account_format` | `name` | 阿里云国际站和中国站的账号系统不一致,对于国际站用户,如出现如 `user 'RAM$xxxxxx:xxxxx' is not a valid aliyun account` 的错误,可指定该参数为 `id`。 | 3.0.9/3.1.1(含)之后 |
| `mc.enable.namespace.schema` | `false` | 是否支持 MaxCompute 的 schema 层级。详见:https://help.aliyun.com/zh/maxcompute/user-guide/schema-related-operations | 3.1.3(含)之后 |
+ | `mc.max_field_size_bytes` | `8388608`(8 MB) | 写入会话中单个字段允许的最大字节数。当写入包含大型字符串或二进制字段的数据时,如果字段大小超过该值,可能会导致写入失败。可根据实际数据情况适当调大该值。 | 4.1.0(含)之后 |
+
+ - `mc.max_field_size_bytes`
+
+ MaxCompute 默认允许单个字段最大为 8MB,如果写入的数据中包含大型字符串或二进制字段,可能会导致写入失败。
+
+ 如需调整,需要先在 MaxCompute 控制台的 SQL 编辑器中执行:
+
+ `setproject odps.sql.cfile2.field.maxsize=262144;`
+
+ 以调整单个字段的最大字节数。单位为 KB,最大值为 262144。
+
+ 然后在 Doris 的 catalog 属性中设置 `mc.max_field_size_bytes` 为 262144(该值不能大于 MaxCompute 的设置值)。
* `{CommonProperties}`
@@ -180,6 +193,31 @@ SELECT * FROM mc_tbl LIMIT 10;
SELECT * FROM mc_ctl.mc_db.mc_tbl LIMIT 10;
```
+### 查询优化
+
+- LIMIT 查询优化 (自 4.1.0 起)
+
+ 该参数仅适用于需要频繁使用 `LIMIT 1` 来检测数据是否存在的场景。
+
+ 当查询 MaxCompute 表时,如果查询仅包含分区列的等值条件(`=` 或 `IN`)并且带有 `LIMIT` 子句,可以通过开启 Session Variable `enable_mc_limit_split_optimization` 来优化 Split 的生成策略。
+
+ 开启后,系统会使用 `row_offset` 策略,仅读取所需数量的行数据,而不是为所有数据生成 Split。这可以将 Split 数量从多个减少为一个,从而显著降低查询延迟。
+
+ 该优化适用于如下形式的查询:
+
+ ```sql
+ SELECT * FROM mc_tbl WHERE pt = 'value' LIMIT 100;
+ SELECT * FROM mc_tbl WHERE pt IN ('v1', 'v2') LIMIT 100;
+ ```
+
+ 开启方式:
+
+ ```sql
+ SET enable_mc_limit_split_optimization = true;
+ ```
+
+ > 该参数默认关闭。当查询包含非分区列的过滤条件、非等值条件(如 `>`、`<`、`!=`)、或不带 `LIMIT` 子句时,该优化不会生效。
+
## 写入操作
自 4.1.0 版本开始,Doris 支持对 MaxCompute 表的写入操作。您可以通过标准的 INSERT 语句,将其他数据源的数据通过 Doris 直接写入 MaxCompute 表。
@@ -310,7 +348,7 @@ DROP DATABASE [IF EXISTS] mc.mc_schema;
DROP TABLE [IF EXISTS] mc_tbl;
```
-## 附录
+## 常见问题
### 如何获取 Endpoint 和 Quota(适用于 Doris 2.1.7 之后)
@@ -356,3 +394,15 @@ MaxCompute Endpoint 和 Tunnel Endpoint 的配置请参见[各地域及不同网
1. 该方法只能控制单个 Query 中单张表的并发请求数量,无法控制多个 SQL 的资源使用量。
2. 降低并发数量意味着会提高 Query 的查询时间。
+
+### 写入最佳实践
+
+- 建议优先选择指定分区写入,如 `INSERT INTO mc_tbl PARTITION(ds='20250201')`。当不指定分区时,由于 MaxCompute Storage API 的限制,各个分区的数据需要顺序写入,所以在执行计划中会基于 Partition 字段进行排序,当数据量较大时,对内存资源消耗较大,可能导致写入失败。
+
+- 当不指定分区写入时,不要设置 `enable_strict_consistency_dml=false`。该设置会强制取消排序节点,导致分区数据乱序写入,最终 MaxCompute 会报错。
+
+- 不要添加 `LIMIT` 子句。当添加 `LIMIT` 子句时,Doris 仅会使用单线程写出,以保证写入的数量。可以用于小数据量测试,如果 `LIMIT` 数量较大,写入性能不佳。
+
+### 写入报错:`Data invalid: ODPS-0020041:StringOutOfMaxLength`
+
+参考 `mc.max_field_size_bytes` 的说明。
\ No newline at end of file
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/lakehouse/catalogs/maxcompute-catalog.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/lakehouse/catalogs/maxcompute-catalog.md
index 856633f9944..0ba3b8d1161 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/lakehouse/catalogs/maxcompute-catalog.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/lakehouse/catalogs/maxcompute-catalog.md
@@ -63,6 +63,19 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES (
| `mc.datetime_predicate_push_down` | `true` | 是否允许下推 `timestamp/timestamp_ntz` 类型的谓词条件。Doris 对这两个类型的同步会丢失精度(9 -> 6)。因此如果原数据精度高于 6 位,则条件下推可能导致结果不准确。 | 2.1.9/3.0.5(含)之后 |
| `mc.account_format` | `name` | 阿里云国际站和中国站的账号系统不一致,对于国际站用户,如出现如 `user 'RAM$xxxxxx:xxxxx' is not a valid aliyun account` 的错误,可指定该参数为 `id`。 | 3.0.9/3.1.1(含)之后 |
| `mc.enable.namespace.schema` | `false` | 是否支持 MaxCompute 的 schema 层级。详见:https://help.aliyun.com/zh/maxcompute/user-guide/schema-related-operations | 3.1.3(含)之后 |
+ | `mc.max_field_size_bytes` | `8388608`(8 MB) | 写入会话中单个字段允许的最大字节数。当写入包含大型字符串或二进制字段的数据时,如果字段大小超过该值,可能会导致写入失败。可根据实际数据情况适当调大该值。 | 4.1.0(含)之后 |
+
+ - `mc.max_field_size_bytes`
+
+ MaxCompute 默认允许单个字段最大为 8MB,如果写入的数据中包含大型字符串或二进制字段,可能会导致写入失败。
+
+ 如需调整,需要先在 MaxCompute 控制台的 SQL 编辑器中执行:
+
+ `setproject odps.sql.cfile2.field.maxsize=262144;`
+
+ 以调整单个字段的最大字节数。单位为 KB,最大值为 262144。
+
+ 然后在 Doris 的 catalog 属性中设置 `mc.max_field_size_bytes` 为 262144(该值不能大于 MaxCompute 的设置值)。
* `{CommonProperties}`
@@ -180,6 +193,31 @@ SELECT * FROM mc_tbl LIMIT 10;
SELECT * FROM mc_ctl.mc_db.mc_tbl LIMIT 10;
```
+### 查询优化
+
+- LIMIT 查询优化 (自 4.1.0 起)
+
+ 该参数仅适用于需要频繁使用 `LIMIT 1` 来检测数据是否存在的场景。
+
+ 当查询 MaxCompute 表时,如果查询仅包含分区列的等值条件(`=` 或 `IN`)并且带有 `LIMIT` 子句,可以通过开启 Session Variable `enable_mc_limit_split_optimization` 来优化 Split 的生成策略。
+
+ 开启后,系统会使用 `row_offset` 策略,仅读取所需数量的行数据,而不是为所有数据生成 Split。这可以将 Split 数量从多个减少为一个,从而显著降低查询延迟。
+
+ 该优化适用于如下形式的查询:
+
+ ```sql
+ SELECT * FROM mc_tbl WHERE pt = 'value' LIMIT 100;
+ SELECT * FROM mc_tbl WHERE pt IN ('v1', 'v2') LIMIT 100;
+ ```
+
+ 开启方式:
+
+ ```sql
+ SET enable_mc_limit_split_optimization = true;
+ ```
+
+ > 该参数默认关闭。当查询包含非分区列的过滤条件、非等值条件(如 `>`、`<`、`!=`)、或不带 `LIMIT` 子句时,该优化不会生效。
+
## 写入操作
自 4.1.0 版本开始,Doris 支持对 MaxCompute 表的写入操作。您可以通过标准的 INSERT 语句,将其他数据源的数据通过 Doris 直接写入 MaxCompute 表。
@@ -310,7 +348,7 @@ DROP DATABASE [IF EXISTS] mc.mc_schema;
DROP TABLE [IF EXISTS] mc_tbl;
```
-## 附录
+## 常见问题
### 如何获取 Endpoint 和 Quota(适用于 Doris 2.1.7 之后)
@@ -356,3 +394,15 @@ MaxCompute Endpoint 和 Tunnel Endpoint 的配置请参见[各地域及不同网
1. 该方法只能控制单个 Query 中单张表的并发请求数量,无法控制多个 SQL 的资源使用量。
2. 降低并发数量意味着会提高 Query 的查询时间。
+
+### 写入最佳实践
+
+- 建议优先选择指定分区写入,如 `INSERT INTO mc_tbl PARTITION(ds='20250201')`。当不指定分区时,由于 MaxCompute Storage API 的限制,各个分区的数据需要顺序写入,所以在执行计划中会基于 Partition 字段进行排序,当数据量较大时,对内存资源消耗较大,可能导致写入失败。
+
+- 当不指定分区写入时,不要设置 `enable_strict_consistency_dml=false`。该设置会强制取消排序节点,导致分区数据乱序写入,最终 MaxCompute 会报错。
+
+- 不要添加 `LIMIT` 子句。当添加 `LIMIT` 子句时,Doris 仅会使用单线程写出,以保证写入的数量。可以用于小数据量测试,如果 `LIMIT` 数量较大,写入性能不佳。
+
+### 写入报错:`Data invalid: ODPS-0020041:StringOutOfMaxLength`
+
+参考 `mc.max_field_size_bytes` 的说明。
\ No newline at end of file
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/lakehouse/catalogs/maxcompute-catalog.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/lakehouse/catalogs/maxcompute-catalog.md
index 856633f9944..0ba3b8d1161 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/lakehouse/catalogs/maxcompute-catalog.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/lakehouse/catalogs/maxcompute-catalog.md
@@ -63,6 +63,19 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES (
| `mc.datetime_predicate_push_down` | `true` | 是否允许下推 `timestamp/timestamp_ntz` 类型的谓词条件。Doris 对这两个类型的同步会丢失精度(9 -> 6)。因此如果原数据精度高于 6 位,则条件下推可能导致结果不准确。 | 2.1.9/3.0.5(含)之后 |
| `mc.account_format` | `name` | 阿里云国际站和中国站的账号系统不一致,对于国际站用户,如出现如 `user 'RAM$xxxxxx:xxxxx' is not a valid aliyun account` 的错误,可指定该参数为 `id`。 | 3.0.9/3.1.1(含)之后 |
| `mc.enable.namespace.schema` | `false` | 是否支持 MaxCompute 的 schema 层级。详见:https://help.aliyun.com/zh/maxcompute/user-guide/schema-related-operations | 3.1.3(含)之后 |
+ | `mc.max_field_size_bytes` | `8388608`(8 MB) | 写入会话中单个字段允许的最大字节数。当写入包含大型字符串或二进制字段的数据时,如果字段大小超过该值,可能会导致写入失败。可根据实际数据情况适当调大该值。 | 4.1.0(含)之后 |
+
+ - `mc.max_field_size_bytes`
+
+ MaxCompute 默认允许单个字段最大为 8MB,如果写入的数据中包含大型字符串或二进制字段,可能会导致写入失败。
+
+ 如需调整,需要先在 MaxCompute 控制台的 SQL 编辑器中执行:
+
+ `setproject odps.sql.cfile2.field.maxsize=262144;`
+
+ 以调整单个字段的最大字节数。单位为 KB,最大值为 262144。
+
+ 然后在 Doris 的 catalog 属性中设置 `mc.max_field_size_bytes` 为 262144(该值不能大于 MaxCompute 的设置值)。
* `{CommonProperties}`
@@ -180,6 +193,31 @@ SELECT * FROM mc_tbl LIMIT 10;
SELECT * FROM mc_ctl.mc_db.mc_tbl LIMIT 10;
```
+### 查询优化
+
+- LIMIT 查询优化 (自 4.1.0 起)
+
+ 该参数仅适用于需要频繁使用 `LIMIT 1` 来检测数据是否存在的场景。
+
+ 当查询 MaxCompute 表时,如果查询仅包含分区列的等值条件(`=` 或 `IN`)并且带有 `LIMIT` 子句,可以通过开启 Session Variable `enable_mc_limit_split_optimization` 来优化 Split 的生成策略。
+
+ 开启后,系统会使用 `row_offset` 策略,仅读取所需数量的行数据,而不是为所有数据生成 Split。这可以将 Split 数量从多个减少为一个,从而显著降低查询延迟。
+
+ 该优化适用于如下形式的查询:
+
+ ```sql
+ SELECT * FROM mc_tbl WHERE pt = 'value' LIMIT 100;
+ SELECT * FROM mc_tbl WHERE pt IN ('v1', 'v2') LIMIT 100;
+ ```
+
+ 开启方式:
+
+ ```sql
+ SET enable_mc_limit_split_optimization = true;
+ ```
+
+ > 该参数默认关闭。当查询包含非分区列的过滤条件、非等值条件(如 `>`、`<`、`!=`)、或不带 `LIMIT` 子句时,该优化不会生效。
+
## 写入操作
自 4.1.0 版本开始,Doris 支持对 MaxCompute 表的写入操作。您可以通过标准的 INSERT 语句,将其他数据源的数据通过 Doris 直接写入 MaxCompute 表。
@@ -310,7 +348,7 @@ DROP DATABASE [IF EXISTS] mc.mc_schema;
DROP TABLE [IF EXISTS] mc_tbl;
```
-## 附录
+## 常见问题
### 如何获取 Endpoint 和 Quota(适用于 Doris 2.1.7 之后)
@@ -356,3 +394,15 @@ MaxCompute Endpoint 和 Tunnel Endpoint 的配置请参见[各地域及不同网
1. 该方法只能控制单个 Query 中单张表的并发请求数量,无法控制多个 SQL 的资源使用量。
2. 降低并发数量意味着会提高 Query 的查询时间。
+
+### 写入最佳实践
+
+- 建议优先选择指定分区写入,如 `INSERT INTO mc_tbl PARTITION(ds='20250201')`。当不指定分区时,由于 MaxCompute Storage API 的限制,各个分区的数据需要顺序写入,所以在执行计划中会基于 Partition 字段进行排序,当数据量较大时,对内存资源消耗较大,可能导致写入失败。
+
+- 当不指定分区写入时,不要设置 `enable_strict_consistency_dml=false`。该设置会强制取消排序节点,导致分区数据乱序写入,最终 MaxCompute 会报错。
+
+- 不要添加 `LIMIT` 子句。当添加 `LIMIT` 子句时,Doris 仅会使用单线程写出,以保证写入的数量。可以用于小数据量测试,如果 `LIMIT` 数量较大,写入性能不佳。
+
+### 写入报错:`Data invalid: ODPS-0020041:StringOutOfMaxLength`
+
+参考 `mc.max_field_size_bytes` 的说明。
\ No newline at end of file
diff --git a/versioned_docs/version-3.x/lakehouse/catalogs/maxcompute-catalog.md b/versioned_docs/version-3.x/lakehouse/catalogs/maxcompute-catalog.md
index 1bea22a7628..ae73869d6fd 100644
--- a/versioned_docs/version-3.x/lakehouse/catalogs/maxcompute-catalog.md
+++ b/versioned_docs/version-3.x/lakehouse/catalogs/maxcompute-catalog.md
@@ -63,6 +63,19 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES (
| `mc.datetime_predicate_push_down` | `true` | Whether to allow predicate push-down for `timestamp/timestamp_ntz` types. Doris loses precision (9 -> 6) when syncing these two types. Therefore, if the original data precision is higher than 6 digits, predicate push-down may lead to inaccurate results. | 2.1.9/3.0.5 (inclusive) and later |
| `mc.account_format` | `name` | The account systems of the Alibaba Cloud International and China sites are inconsistent. For International site users, if you encounter errors like `user 'RAM$xxxxxx:xxxxx' is not a valid aliyun account`, you can set this parameter to `id`. | 3.0.9/3.1.1 (inclusive) and later |
| `mc.enable.namespace.schema` | `false` | Whether to support the MaxCompute schema hierarchy. See: https://help.aliyun.com/zh/maxcompute/user-guide/schema-related-operations | 3.1.3 (inclusive) and later |
+ | `mc.max_field_size_bytes` | `8388608` (8 MB) | Maximum number of bytes allowed for a single field in a write session. When writing data that contains large string or binary fields, the write may fail if the field size exceeds this value. You can increase this value based on your actual data. | 4.1.0 (inclusive) and later |
+
+ - `mc.max_field_size_bytes`
+
+ MaxCompute allows a maximum of 8 MB per field by default. If the data being written contains large string or binary fields, the write may fail.
+
+ To adjust this limit, first execute the following command in the MaxCompute console SQL editor:
+
+ `setproject odps.sql.cfile2.field.maxsize=262144;`
+
+ This sets the maximum size of a single field. The unit is KB, and the maximum allowed value is 262144 (i.e., 256 MB).
+
+ Then set `mc.max_field_size_bytes` to 262144 in the Doris catalog properties (this value must not exceed the MaxCompute setting).
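For reference, a catalog definition using this property might look like the following sketch. The connection properties are assumed to match the MaxCompute catalog's standard creation example; all values (project, keys, endpoint) are placeholders, not taken from this commit:

```sql
CREATE CATALOG IF NOT EXISTS mc_ctl PROPERTIES (
    "type" = "max_compute",
    "mc.default.project" = "your_project",   -- placeholder project name
    "mc.access_key" = "<access_key>",        -- placeholder credential
    "mc.secret_key" = "<secret_key>",        -- placeholder credential
    "mc.endpoint" = "<maxcompute_endpoint>", -- placeholder endpoint
    "mc.max_field_size_bytes" = "262144"     -- must not exceed the MaxCompute setting
);
```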
* `{CommonProperties}`
@@ -180,6 +193,31 @@ SELECT * FROM mc_tbl LIMIT 10;
SELECT * FROM mc_ctl.mc_db.mc_tbl LIMIT 10;
```
+### Query Optimization
+
+- LIMIT Query Optimization (Since 4.1.0)
+
+ This parameter is only applicable to scenarios where `LIMIT 1` is frequently used to check whether data exists.
+
+ When querying MaxCompute tables, if the query contains only partition column equality predicates (`=` or `IN`) and a `LIMIT` clause, you can enable the session variable `enable_mc_limit_split_optimization` to optimize the split generation strategy.
+
+ When enabled, the system uses a `row_offset` strategy to read only the required number of rows instead of generating splits for all data. This can reduce the split count from many to exactly one, significantly reducing query latency.
+
+ This optimization applies to queries like:
+
+ ```sql
+ SELECT * FROM mc_tbl WHERE pt = 'value' LIMIT 100;
+ SELECT * FROM mc_tbl WHERE pt IN ('v1', 'v2') LIMIT 100;
+ ```
+
+ To enable:
+
+ ```sql
+ SET enable_mc_limit_split_optimization = true;
+ ```
+
+ > This parameter is disabled by default. The optimization will not take effect when the query contains non-partition-column filters, non-equality predicates (such as `>`, `<`, `!=`), or no `LIMIT` clause.
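For contrast, here is a sketch of queries that, per the conditions in the note above, would not trigger the optimization (`pt` is assumed to be the partition column, `c1` a regular column):

```sql
SELECT * FROM mc_tbl WHERE pt = 'value' AND c1 > 10 LIMIT 100; -- non-partition-column filter
SELECT * FROM mc_tbl WHERE pt > 'value' LIMIT 100;             -- non-equality predicate
SELECT * FROM mc_tbl WHERE pt = 'value';                       -- no LIMIT clause
```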
+
## Write Operations
Starting from version 4.1.0, Doris supports write operations to MaxCompute tables. You can use standard INSERT statements to write data from other data sources directly to MaxCompute tables through Doris.
@@ -310,7 +348,7 @@ For MaxCompute Database, after deletion, all tables under it will also be delete
DROP TABLE [IF EXISTS] mc_tbl;
```
-## Appendix
+## FAQ
### How to Obtain Endpoint and Quota (Applicable for Doris 2.1.7 and Later)
@@ -356,3 +394,15 @@ Note:
1. This method can only control the concurrent request count for a single table within a single Query, and cannot control resource usage across multiple SQL statements.
2. Reducing concurrency means increasing the Query execution time.
+
+### Write Best Practices
+
+- It is recommended to write to a specified partition whenever possible, e.g., `INSERT INTO mc_tbl PARTITION(ds='20250201')`. When no partition is specified, due to limitations of the MaxCompute Storage API, the data for each partition must be written sequentially. As a result, the execution plan will sort the data by the partition columns, which can consume significant memory when the data volume is large and may cause the write to fail.
+
+- When writing without specifying a partition, do not set `enable_strict_consistency_dml=false`. Doing so forcibly removes the sort node, causing partition data to be written out of order, which will ultimately result in an error from MaxCompute.
+
+- Do not add a `LIMIT` clause. When a `LIMIT` clause is present, Doris uses only a single thread for writing in order to guarantee the number of rows written. This is acceptable for small-scale testing, but if the `LIMIT` value is large, write performance will be poor.
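To illustrate the first recommendation, a partition-specified write might look like the following sketch (catalog, database, table, and column names are hypothetical):

```sql
-- Writing a single, explicitly specified partition avoids the
-- partition-column sort step in the execution plan.
INSERT INTO mc_ctl.mc_db.mc_tbl PARTITION(ds='20250201')
SELECT id, name
FROM internal.example_db.src_tbl
WHERE dt = '20250201';
```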
+
+### Write Error: `Data invalid: ODPS-0020041:StringOutOfMaxLength`
+
+Refer to the description of `mc.max_field_size_bytes`.
diff --git a/versioned_docs/version-4.x/lakehouse/catalogs/maxcompute-catalog.md b/versioned_docs/version-4.x/lakehouse/catalogs/maxcompute-catalog.md
index 1bea22a7628..ae73869d6fd 100644
--- a/versioned_docs/version-4.x/lakehouse/catalogs/maxcompute-catalog.md
+++ b/versioned_docs/version-4.x/lakehouse/catalogs/maxcompute-catalog.md
@@ -63,6 +63,19 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES (
| `mc.datetime_predicate_push_down` | `true` | Whether to allow predicate push-down for `timestamp/timestamp_ntz` types. Doris loses precision (9 -> 6) when syncing these two types. Therefore, if the original data precision is higher than 6 digits, predicate push-down may lead to inaccurate results. | 2.1.9/3.0.5 (inclusive) and later |
| `mc.account_format` | `name` | The account systems of the Alibaba Cloud International and China sites are inconsistent. For International site users, if you encounter errors like `user 'RAM$xxxxxx:xxxxx' is not a valid aliyun account`, you can set this parameter to `id`. | 3.0.9/3.1.1 (inclusive) and later |
| `mc.enable.namespace.schema` | `false` | Whether to support the MaxCompute schema hierarchy. See: https://help.aliyun.com/zh/maxcompute/user-guide/schema-related-operations | 3.1.3 (inclusive) and later |
+ | `mc.max_field_size_bytes` | `8388608` (8 MB) | Maximum number of bytes allowed for a single field in a write session. When writing data that contains large string or binary fields, the write may fail if the field size exceeds this value. You can increase this value based on your actual data. | 4.1.0 (inclusive) and later |
+
+ - `mc.max_field_size_bytes`
+
+ MaxCompute allows a maximum of 8 MB per field by default. If the data being written contains large string or binary fields, the write may fail.
+
+ To adjust this limit, first execute the following command in the MaxCompute console SQL editor:
+
+ `setproject odps.sql.cfile2.field.maxsize=262144;`
+
+ This sets the maximum size of a single field. The unit is KB, and the maximum allowed value is 262144 (i.e., 256 MB).
+
+ Then set `mc.max_field_size_bytes` to 262144 in the Doris catalog properties (this value must not exceed the MaxCompute setting).
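For reference, a catalog definition using this property might look like the following sketch. The connection properties are assumed to match the MaxCompute catalog's standard creation example; all values (project, keys, endpoint) are placeholders, not taken from this commit:

```sql
CREATE CATALOG IF NOT EXISTS mc_ctl PROPERTIES (
    "type" = "max_compute",
    "mc.default.project" = "your_project",   -- placeholder project name
    "mc.access_key" = "<access_key>",        -- placeholder credential
    "mc.secret_key" = "<secret_key>",        -- placeholder credential
    "mc.endpoint" = "<maxcompute_endpoint>", -- placeholder endpoint
    "mc.max_field_size_bytes" = "262144"     -- must not exceed the MaxCompute setting
);
```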
* `{CommonProperties}`
@@ -180,6 +193,31 @@ SELECT * FROM mc_tbl LIMIT 10;
SELECT * FROM mc_ctl.mc_db.mc_tbl LIMIT 10;
```
+### Query Optimization
+
+- LIMIT Query Optimization (Since 4.1.0)
+
+ This parameter is only applicable to scenarios where `LIMIT 1` is frequently used to check whether data exists.
+
+ When querying MaxCompute tables, if the query contains only partition column equality predicates (`=` or `IN`) and a `LIMIT` clause, you can enable the session variable `enable_mc_limit_split_optimization` to optimize the split generation strategy.
+
+ When enabled, the system uses a `row_offset` strategy to read only the required number of rows instead of generating splits for all data. This can reduce the split count from many to exactly one, significantly reducing query latency.
+
+ This optimization applies to queries like:
+
+ ```sql
+ SELECT * FROM mc_tbl WHERE pt = 'value' LIMIT 100;
+ SELECT * FROM mc_tbl WHERE pt IN ('v1', 'v2') LIMIT 100;
+ ```
+
+ To enable:
+
+ ```sql
+ SET enable_mc_limit_split_optimization = true;
+ ```
+
+ > This parameter is disabled by default. The optimization will not take effect when the query contains non-partition-column filters, non-equality predicates (such as `>`, `<`, `!=`), or no `LIMIT` clause.
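For contrast, here is a sketch of queries that, per the conditions in the note above, would not trigger the optimization (`pt` is assumed to be the partition column, `c1` a regular column):

```sql
SELECT * FROM mc_tbl WHERE pt = 'value' AND c1 > 10 LIMIT 100; -- non-partition-column filter
SELECT * FROM mc_tbl WHERE pt > 'value' LIMIT 100;             -- non-equality predicate
SELECT * FROM mc_tbl WHERE pt = 'value';                       -- no LIMIT clause
```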
+
## Write Operations
Starting from version 4.1.0, Doris supports write operations to MaxCompute tables. You can use standard INSERT statements to write data from other data sources directly to MaxCompute tables through Doris.
@@ -310,7 +348,7 @@ For MaxCompute Database, after deletion, all tables under it will also be delete
DROP TABLE [IF EXISTS] mc_tbl;
```
-## Appendix
+## FAQ
### How to Obtain Endpoint and Quota (Applicable for Doris 2.1.7 and Later)
@@ -356,3 +394,15 @@ Note:
1. This method can only control the concurrent request count for a single table within a single Query, and cannot control resource usage across multiple SQL statements.
2. Reducing concurrency means increasing the Query execution time.
+
+### Write Best Practices
+
+- It is recommended to write to a specified partition whenever possible, e.g., `INSERT INTO mc_tbl PARTITION(ds='20250201')`. When no partition is specified, due to limitations of the MaxCompute Storage API, the data for each partition must be written sequentially. As a result, the execution plan will sort the data by the partition columns, which can consume significant memory when the data volume is large and may cause the write to fail.
+
+- When writing without specifying a partition, do not set `enable_strict_consistency_dml=false`. Doing so forcibly removes the sort node, causing partition data to be written out of order, which will ultimately result in an error from MaxCompute.
+
+- Do not add a `LIMIT` clause. When a `LIMIT` clause is present, Doris uses only a single thread for writing in order to guarantee the number of rows written. This is acceptable for small-scale testing, but if the `LIMIT` value is large, write performance will be poor.
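To illustrate the first recommendation, a partition-specified write might look like the following sketch (catalog, database, table, and column names are hypothetical):

```sql
-- Writing a single, explicitly specified partition avoids the
-- partition-column sort step in the execution plan.
INSERT INTO mc_ctl.mc_db.mc_tbl PARTITION(ds='20250201')
SELECT id, name
FROM internal.example_db.src_tbl
WHERE dt = '20250201';
```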
+
+### Write Error: `Data invalid: ODPS-0020041:StringOutOfMaxLength`
+
+Refer to the description of `mc.max_field_size_bytes`.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]