Re: [PR] [docs] Add POC guide and update tablet sizing recommendation [doris-website]

via GitHub Mon, 09 Mar 2026 01:23:35 -0700


Copilot commented on code in PR #3442:
URL: https://github.com/apache/doris-website/pull/3442#discussion_r2903919384



##########
docs/gettingStarted/must-read-before-poc.md:
##########
@@ -0,0 +1,214 @@
+---
+{
+    "title": "Must Read Before the POC",
+    "language": "en",
+    "description": "Understand the four table design decisions in Apache Doris 
— data model, key columns, partitioning, and bucketing — and why each one 
matters for your POC.",
+    "sidebar_label": "Must Read Before the POC"
+}
+---
+
+# Must Read Before the POC
+
+Creating a table in Doris involves four decisions that affect load and query 
performance. Some of them — like the data model — cannot be changed later. 
Understanding **why** each decision exists helps you get it right the first 
time.
+
+:::tip The simplest valid CREATE TABLE
+
+```sql
+CREATE TABLE my_table
+(
+    id          INT,
+    name        VARCHAR(100),
+    created_at  DATETIME,
+    amount      DECIMAL(10,2)
+);
+```
+
+This is the simplest syntax — Doris defaults to Duplicate Key, a single 
partition, and Random bucketing. It works, but **it won't give you good 
performance for most POC scenarios.** Read the four decisions below to 
understand what to tune and why.
+
+:::
+
+## 1. Data Model
+
+**Why it matters:** The data model controls whether Doris keeps every row, 
keeps only the latest row per key, or pre-aggregates rows at write time.
+
+**How to choose:** Ask yourself one question — *do I need to update rows?*
+
+| If your data is... | Use | Why |
+|---|---|---|
+| Append-only (logs, events, facts) | **[Duplicate 
Key](../table-design/data-model/duplicate)** (default — just omit it) | Keeps 
all rows. Best query performance. Safest default. |
+| Updated by primary key (CDC sync, user profiles) | **[Unique 
Key](../table-design/data-model/unique)** | New rows replace old rows with the 
same key. |
+| Pre-aggregated metrics (PV, UV, revenue sums) | **[Aggregate 
Key](../table-design/data-model/aggregate)** | Rows are merged with SUM/MAX/MIN 
during ingestion. |
+
+For a POC, **Duplicate Key works for most scenarios**. Switch only if you have 
a clear need for upsert or pre-aggregation. For a detailed comparison, see 
[Data Model Overview](../table-design/data-model/overview).
+
+## 2. Key Columns
+
+**Why it matters:** Key columns determine the **physical sort order** on disk. 
Doris builds a [prefix index](../table-design/index/prefix-index) on the first 
36 bytes of key columns, so queries that filter on these columns run 
significantly faster. However, when a `VARCHAR` column is encountered, the 
prefix index stops immediately — no subsequent columns are included. So place 
fixed-size columns (INT, BIGINT, DATE) before VARCHAR to maximize index 
coverage.
+
+**How to choose:** Put the column you filter on most frequently first, with 
fixed-size types before VARCHAR types. You can add [inverted 
indexes](../table-design/index/inverted-index) later for any column that needs 
fast filtering.
+
+## 3. Partitioning
+
+**Why it matters:** Partitioning splits data into independent units. When a 
query includes a partition column in its WHERE clause, Doris only scans the 
relevant partitions — this is called **partition pruning** and it can skip the 
vast majority of data.
+
+**How to choose:**
+
+- **Have a time column?** → Use `AUTO PARTITION BY RANGE(date_trunc(time_col, 
'day'))`. Partitions are created automatically during import, no manual 
management needed.
+- **No time column, data < 50 GB?** → Skip partitioning entirely. Doris 
creates a single partition by default.
+- **No time column, data > 50 GB?** → Consider `AUTO PARTITION BY 
LIST(category_col)` on a categorical dimension.
+
+For full syntax and advanced options, see [Auto 
Partition](../table-design/data-partitioning/auto-partitioning).
+
+## 4. Bucketing
+
+**Why it matters:** Each bucket is stored as one or more **tablets** (one per 
replica). A tablet lives on a single BE node, so scanning a tablet can only use 
that one BE. For a single query, parallelism is determined by `partitions × 
buckets` — replicas are not used simultaneously. For concurrent queries, 
different replicas can serve different queries, so the total tablet count 
`partitions × buckets × replicas` determines cluster-wide throughput.
+
+**Partitions first, then buckets.** Both partitioning and bucketing increase 
tablet count, but partitions also enable pruning and are easier to manage 
(add/drop). When you need more parallelism, prefer adding partitions before 
increasing bucket count.
+
+**Default is Random bucketing** — you can omit the `DISTRIBUTED BY` clause 
entirely. For Duplicate Key tables, Random bucketing is recommended because it 
enables `load_to_single_tablet` for lower memory usage and higher load 
throughput.
+
+**When to specify Hash bucketing:** If you frequently filter or join on a 
specific column, `DISTRIBUTED BY HASH(that_column)` enables **bucket pruning** 
— Doris skips irrelevant buckets, which is faster than scanning all of them.
+
+```sql
+-- Default: random bucketing (omit the clause, or write explicitly)
+DISTRIBUTED BY RANDOM BUCKETS 10
+
+-- Better for queries that filter on a specific column
+DISTRIBUTED BY HASH(user_id) BUCKETS 10
+```
+
+For details on choosing between Hash and Random bucketing, see [Data 
Bucketing](../table-design/data-partitioning/data-bucketing).
+
+## Critical Points
+
+Things that surprise new users. Read these before you create your first table.
+
+:::caution
+
+**Data model is permanent.** You cannot change from Duplicate to Unique or 
Aggregate after table creation. If you choose wrong, the only fix is to create 
a new table and re-import data.
+
+:::
+
+**STRING type cannot be a key or partition column.** Use `VARCHAR` instead. 
`STRING` is only for value columns storing large text content. For key columns, 
`VARCHAR(65533)` has no performance penalty compared to `VARCHAR(255)` — they 
perform the same when storing identical data, so use a generous length. See 
[Data Types](../table-design/data-type) for the full type reference.
+
+**Aggregate Key tables don't support `count(*)` well.** Because values are 
pre-aggregated, `count(*)` cannot simply count rows. The workaround is to add a 
column like `row_count BIGINT SUM DEFAULT '1'` and query `SELECT 
SUM(row_count)` instead.
+
+**Bucket count on existing partitions cannot be changed.** You can only adjust 
bucket count for **new** partitions. Keep each tablet between **1 GB and 20 
GB** compressed data (excluding index), or under **10 GB** for Unique Key 
tables — check with `SHOW TABLETS FROM your_table`. If data per partition is 
under 1 GB, a single bucket is fine. Otherwise, use multiple buckets so each 
tablet stays within the range. Too few buckets limits query parallelism; too 
many creates excessive small files.
+
+## Typical Use Cases
+
+Ready-to-use templates for the most common POC scenarios.
+
+### Log / Event Analytics
+
+Append-only data, queried by time range and keyword.
+
+```sql
+CREATE TABLE app_logs
+(
+    log_time      DATETIME    NOT NULL,
+    log_level     VARCHAR(10),
+    service_name  VARCHAR(50),
+    trace_id      VARCHAR(64),
+    message       STRING,
+    INDEX idx_message (message) USING INVERTED PROPERTIES("parser" = "unicode")
+)
+AUTO PARTITION BY RANGE(date_trunc(`log_time`, 'day'))
+()
+DISTRIBUTED BY RANDOM BUCKETS 10;
+```
+
+- Default **Duplicate Key** — logs are never updated, Random bucketing for 
best load throughput
+- **AUTO PARTITION by day** — time-range queries skip irrelevant days
+- **Inverted index on message** — enables full-text search 
([details](../table-design/index/inverted-index))
+
+### Real-Time Dashboard with Upsert (CDC)
+
+Sync from MySQL/PostgreSQL, keep latest state per primary key.
+
+```sql
+CREATE TABLE user_profiles
+(
+    user_id       BIGINT      NOT NULL,
+    username      VARCHAR(50),
+    email         VARCHAR(100),
+    status        TINYINT,
+    updated_at    DATETIME
+)
+UNIQUE KEY(user_id)
+DISTRIBUTED BY HASH(user_id) BUCKETS 10;
+```
+
+- **Unique Key** — new rows replace old by `user_id`, enabling [CDC 
sync](../data-operate/import/data-source/doris-other-oltp)

Review Comment:
   This internal link appears to be broken: 
`../data-operate/import/data-source/doris-other-oltp` does not exist in 
`docs/data-operate/import/data-source/` (closest match is 
`migrate-data-from-other-oltp.md`). Please update the link target to an 
existing doc so it resolves on the site.
   ```suggestion
   - **Unique Key** — new rows replace old by `user_id`, enabling [CDC 
sync](../data-operate/import/data-source/migrate-data-from-other-oltp)
   ```



##########
sidebars.ts:
##########
@@ -1180,6 +1181,7 @@ const sidebars: SidebarsConfig = {
                         'sql-manual/basic-element/object-identifiers',
                         'sql-manual/basic-element/reserved-keywords',
                         'sql-manual/basic-element/variables',
+                        'sql-manual/basic-element/session-variables',

Review Comment:
   The sidebar entry `sql-manual/basic-element/session-variables` points to a 
doc that does not exist under `docs/sql-manual/basic-element/` (and also not 
under the versioned docs). This will break the docs build/navigation. Either 
add the corresponding doc file (and versioned/i18n copies as needed), or 
remove/replace this sidebar item with the correct existing doc ID.
   ```suggestion
   
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [docs] Add POC guide and update tablet sizing recommendation [doris-website]

Reply via email to