Copilot commented on code in PR #3442:
URL: https://github.com/apache/doris-website/pull/3442#discussion_r2903919384
##########
docs/gettingStarted/must-read-before-poc.md:
##########
@@ -0,0 +1,214 @@
+---
+{
+ "title": "Must Read Before the POC",
+ "language": "en",
+ "description": "Understand the four table design decisions in Apache Doris
— data model, key columns, partitioning, and bucketing — and why each one
matters for your POC.",
+ "sidebar_label": "Must Read Before the POC"
+}
+---
+
+# Must Read Before the POC
+
+Creating a table in Doris involves four decisions that affect load and query
performance. Some of them — like the data model — cannot be changed later.
Understanding **why** each decision exists helps you get it right the first
time.
+
+:::tip The simplest valid CREATE TABLE
+
+```sql
+CREATE TABLE my_table
+(
+ id INT,
+ name VARCHAR(100),
+ created_at DATETIME,
+ amount DECIMAL(10,2)
+);
+```
+
+This is the simplest syntax — Doris defaults to Duplicate Key, a single
partition, and Random bucketing. It works, but **it won't give you good
performance for most POC scenarios.** Read the four decisions below to
understand what to tune and why.
+
+:::
+
+## 1. Data Model
+
+**Why it matters:** The data model controls whether Doris keeps every row,
keeps only the latest row per key, or pre-aggregates rows at write time.
+
+**How to choose:** Ask yourself one question — *do I need to update rows?*
+
+| If your data is... | Use | Why |
+|---|---|---|
+| Append-only (logs, events, facts) | **[Duplicate
Key](../table-design/data-model/duplicate)** (default — just omit it) | Keeps
all rows. Best query performance. Safest default. |
+| Updated by primary key (CDC sync, user profiles) | **[Unique
Key](../table-design/data-model/unique)** | New rows replace old rows with the
same key. |
+| Pre-aggregated metrics (PV, UV, revenue sums) | **[Aggregate
Key](../table-design/data-model/aggregate)** | Rows are merged with SUM/MAX/MIN
during ingestion. |
+
+For a POC, **Duplicate Key works for most scenarios**. Switch only if you have
a clear need for upsert or pre-aggregation. For a detailed comparison, see
[Data Model Overview](../table-design/data-model/overview).
+
+## 2. Key Columns
+
+**Why it matters:** Key columns determine the **physical sort order** on disk.
Doris builds a [prefix index](../table-design/index/prefix-index) on the first
36 bytes of key columns, so queries that filter on these columns run
significantly faster. However, when a `VARCHAR` column is encountered, the
prefix index stops immediately — no subsequent columns are included. So place
fixed-size columns (INT, BIGINT, DATE) before VARCHAR to maximize index
coverage.
+
+**How to choose:** Put the column you filter on most frequently first, with
fixed-size types before VARCHAR types. You can add [inverted
indexes](../table-design/index/inverted-index) later for any column that needs
fast filtering.
+
+## 3. Partitioning
+
+**Why it matters:** Partitioning splits data into independent units. When a
query includes a partition column in its WHERE clause, Doris only scans the
relevant partitions — this is called **partition pruning** and it can skip the
vast majority of data.
+
+**How to choose:**
+
+- **Have a time column?** → Use `AUTO PARTITION BY RANGE(date_trunc(time_col,
'day'))`. Partitions are created automatically during import, no manual
management needed.
+- **No time column, data < 50 GB?** → Skip partitioning entirely. Doris
creates a single partition by default.
+- **No time column, data > 50 GB?** → Consider `AUTO PARTITION BY
LIST(category_col)` on a categorical dimension.
+
+For full syntax and advanced options, see [Auto
Partition](../table-design/data-partitioning/auto-partitioning).
+
+## 4. Bucketing
+
+**Why it matters:** Each bucket is stored as one or more **tablets** (one per
replica). A tablet lives on a single BE node, so scanning a tablet can only use
that one BE. For a single query, parallelism is determined by `partitions ×
buckets` — replicas are not used simultaneously. For concurrent queries,
different replicas can serve different queries, so the total tablet count
`partitions × buckets × replicas` determines cluster-wide throughput.
+
+**Partitions first, then buckets.** Both partitioning and bucketing increase
tablet count, but partitions also enable pruning and are easier to manage
(add/drop). When you need more parallelism, prefer adding partitions before
increasing bucket count.
+
+**Default is Random bucketing** — you can omit the `DISTRIBUTED BY` clause
entirely. For Duplicate Key tables, Random bucketing is recommended because it
enables `load_to_single_tablet` for lower memory usage and higher load
throughput.
+
+**When to specify Hash bucketing:** If you frequently filter or join on a
specific column, `DISTRIBUTED BY HASH(that_column)` enables **bucket pruning**
— Doris skips irrelevant buckets, which is faster than scanning all of them.
+
+```sql
+-- Default: random bucketing (omit the clause, or write explicitly)
+DISTRIBUTED BY RANDOM BUCKETS 10
+
+-- Better for queries that filter on a specific column
+DISTRIBUTED BY HASH(user_id) BUCKETS 10
+```
+
+For details on choosing between Hash and Random bucketing, see [Data
Bucketing](../table-design/data-partitioning/data-bucketing).
+
+## Critical Points
+
+Things that surprise new users. Read these before you create your first table.
+
+:::caution
+
+**Data model is permanent.** You cannot change from Duplicate to Unique or
Aggregate after table creation. If you choose wrong, the only fix is to create
a new table and re-import data.
+
+:::
+
+**STRING type cannot be a key or partition column.** Use `VARCHAR` instead.
`STRING` is only for value columns storing large text content. For key columns,
`VARCHAR(65533)` has no performance penalty compared to `VARCHAR(255)` — they
perform the same when storing identical data, so use a generous length. See
[Data Types](../table-design/data-type) for the full type reference.
+
+**Aggregate Key tables don't support `count(*)` well.** Because values are
pre-aggregated, `count(*)` cannot simply count rows. The workaround is to add a
column like `row_count BIGINT SUM DEFAULT '1'` and query `SELECT
SUM(row_count)` instead.
+
+**Bucket count on existing partitions cannot be changed.** You can only adjust
bucket count for **new** partitions. Keep each tablet between **1 GB and 20
GB** compressed data (excluding index), or under **10 GB** for Unique Key
tables — check with `SHOW TABLETS FROM your_table`. If data per partition is
under 1 GB, a single bucket is fine. Otherwise, use multiple buckets so each
tablet stays within the range. Too few buckets limits query parallelism; too
many creates excessive small files.
+
+## Typical Use Cases
+
+Ready-to-use templates for the most common POC scenarios.
+
+### Log / Event Analytics
+
+Append-only data, queried by time range and keyword.
+
+```sql
+CREATE TABLE app_logs
+(
+ log_time DATETIME NOT NULL,
+ log_level VARCHAR(10),
+ service_name VARCHAR(50),
+ trace_id VARCHAR(64),
+ message STRING,
+ INDEX idx_message (message) USING INVERTED PROPERTIES("parser" = "unicode")
+)
+AUTO PARTITION BY RANGE(date_trunc(`log_time`, 'day'))
+()
+DISTRIBUTED BY RANDOM BUCKETS 10;
+```
+
+- Default **Duplicate Key** — logs are never updated, Random bucketing for
best load throughput
+- **AUTO PARTITION by day** — time-range queries skip irrelevant days
+- **Inverted index on message** — enables full-text search
([details](../table-design/index/inverted-index))
+
+### Real-Time Dashboard with Upsert (CDC)
+
+Sync from MySQL/PostgreSQL, keep latest state per primary key.
+
+```sql
+CREATE TABLE user_profiles
+(
+ user_id BIGINT NOT NULL,
+ username VARCHAR(50),
+ email VARCHAR(100),
+ status TINYINT,
+ updated_at DATETIME
+)
+UNIQUE KEY(user_id)
+DISTRIBUTED BY HASH(user_id) BUCKETS 10;
+```
+
+- **Unique Key** — new rows replace old by `user_id`, enabling [CDC
sync](../data-operate/import/data-source/doris-other-oltp)
Review Comment:
This internal link appears to be broken:
`../data-operate/import/data-source/doris-other-oltp` does not exist in
`docs/data-operate/import/data-source/` (closest match is
`migrate-data-from-other-oltp.md`). Please update the link target to an
existing doc so it resolves on the site.
```suggestion
- **Unique Key** — new rows replace old by `user_id`, enabling [CDC
sync](../data-operate/import/data-source/migrate-data-from-other-oltp)
```
##########
sidebars.ts:
##########
@@ -1180,6 +1181,7 @@ const sidebars: SidebarsConfig = {
'sql-manual/basic-element/object-identifiers',
'sql-manual/basic-element/reserved-keywords',
'sql-manual/basic-element/variables',
+ 'sql-manual/basic-element/session-variables',
Review Comment:
The sidebar entry `sql-manual/basic-element/session-variables` points to a
doc that does not exist under `docs/sql-manual/basic-element/` (and also not
under the versioned docs). This will break the docs build/navigation. Either
add the corresponding doc file (and versioned/i18n copies as needed), or
remove/replace this sidebar item with the correct existing doc ID.
```suggestion
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]