(doris-website) branch master updated: Add product quantization compression information (#3037)

yiguolei Sun, 07 Dec 2025 18:33:37 -0800

This is an automated email from the ASF dual-hosted git repository.

yiguolei pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git



The following commit(s) were added to refs/heads/master by this push:
     new d63bd2e3435 Add product quantization compression information (#3037)
d63bd2e3435 is described below

commit d63bd2e343508cbd0f9313bc7d4b079e7aa79076
Author: ivin <[email protected]>
AuthorDate: Mon Dec 8 10:33:26 2025 +0800

    Add product quantization compression information (#3037)
    
    ## Versions
    
    - [x] dev
    - [x] 4.x
    - [ ] 3.x
    - [ ] 2.1
    
    ## Languages
    
    - [x] Chinese
    - [x] English
    
    ## Docs Checklist
    
    - [ ] Checked by AI
    - [ ] Test Cases Built
---
 docs/ai/vector-search/overview.md                                    | 5 ++++-
 .../current/ai/vector-search/overview.md                             | 5 ++++-
 .../version-4.x/ai/vector-search/overview.md                         | 5 ++++-
 versioned_docs/version-4.x/ai/vector-search/overview.md              | 5 ++++-
 4 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/docs/ai/vector-search/overview.md b/docs/ai/vector-search/overview.md
index ec75452d22c..5ecfbcf5fb5 100644
--- a/docs/ai/vector-search/overview.md
+++ b/docs/ai/vector-search/overview.md
@@ -311,6 +311,7 @@ On 768-D Cohere-MEDIUM-1M and Cohere-LARGE-10M datasets, SQ8 reduces index size
 |---------|-----|----------------------|------------|-----------|------------|-------|
 | Cohere-MEDIUM-1M | 768D | Doris (FLAT) | 5.647 GB (2.533 + 3.114) | 2.533 GB | 3.114 GB | 1M vectors |
 | Cohere-MEDIUM-1M | 768D | Doris SQ INT8 | 3.501 GB (2.533 + 0.992) | 2.533 GB | 0.992 GB | INT8 symmetric quantization |
+| Cohere-MEDIUM-1M | 768D | Doris PQ(pq_m=384,pq_nbits=8)   | 3.149 GB (2.535 + 0.614) | 2.535 GB | 0.614 GB | product quantization |
 | Cohere-LARGE-10M | 768D | Doris (FLAT) | 56.472 GB (25.328 + 31.145) | 25.328 GB | 31.145 GB | 10M vectors |
 | Cohere-LARGE-10M | 768D | Doris SQ INT8 | 35.016 GB (25.329 + 9.687) | 25.329 GB | 9.687 GB | INT8 quantization |
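
To put the new PQ row in perspective, the index sizes for Cohere-MEDIUM-1M can be compared against the FLAT baseline. The sketch below is only illustrative arithmetic on the figures in the table; it assumes the last size column of each row is the index size, and the names `index_size_gb` and `reduction_vs_flat` are hypothetical.

```python
# Index sizes in GB for Cohere-MEDIUM-1M, copied from the table rows above.
# Assumption: the last size column of each row is the index size.
index_size_gb = {
    "FLAT": 3.114,
    "SQ INT8": 0.992,
    "PQ(pq_m=384,pq_nbits=8)": 0.614,
}


def reduction_vs_flat(sizes: dict[str, float]) -> dict[str, float]:
    """Return how many times smaller each non-FLAT index is than the FLAT index."""
    flat = sizes["FLAT"]
    return {
        name: round(flat / size, 2)
        for name, size in sizes.items()
        if name != "FLAT"
    }


ratios = reduction_vs_flat(index_size_gb)
for name, ratio in ratios.items():
    print(f"{name}: index is about {ratio}x smaller than FLAT")
```

With these figures, the PQ index comes out roughly 5x smaller than FLAT and the SQ INT8 index roughly 3x smaller.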
 
@@ -319,7 +320,9 @@ Quantization introduces extra build-time overhead because each distance computat
 Similarly, Doris also supports product quantization, but note that when using PQ, additional parameters need to be provided:
 
 - `pq_m`: Indicates how many sub-vectors to split the original high-dimensional vector into (vector dimension dim must be divisible by pq_m).
-- `pq_nbits`: Indicates the number of bits for each sub-vector quantization, which determines the size of each subspace codebook (k = 2 ^ pq_nbits), in faiss pq_nbits is generally required to be no greater than 24.
+- `pq_nbits`: The number of bits used to quantize each sub-vector, which determines the size of each subspace codebook. In faiss, pq_nbits is generally required to be no greater than 24.
+
+Note that PQ needs enough data for training: the number of training points must be at least the number of clusters in each subspace codebook (n >= 2 ^ pq_nbits).
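
The constraints above (dim divisible by pq_m, pq_nbits no greater than 24, and n >= 2 ^ pq_nbits training points) can be checked before an index is built. The following is a minimal sketch and not part of Doris or faiss; `check_pq_params` and `pq_code_bytes` are hypothetical helper names, and the per-vector code size assumes the standard PQ encoding of pq_m codes with pq_nbits bits each, excluding codebooks and any graph structure.

```python
def check_pq_params(dim: int, pq_m: int, pq_nbits: int, n_train: int) -> list[str]:
    """Return a list of violated constraints (empty if the settings look valid)."""
    problems = []
    # dim must split evenly into pq_m sub-vectors.
    if pq_m <= 0 or dim % pq_m != 0:
        problems.append(f"dim={dim} is not divisible by pq_m={pq_m}")
    # Upper bound of 24 taken from the faiss limit mentioned in the text.
    if not 1 <= pq_nbits <= 24:
        problems.append(f"pq_nbits={pq_nbits} is outside 1..24")
    # Each subspace codebook has 2 ** pq_nbits clusters to train.
    elif n_train < 2 ** pq_nbits:
        problems.append(
            f"need at least {2 ** pq_nbits} training points, got {n_train}"
        )
    return problems


def pq_code_bytes(pq_m: int, pq_nbits: int) -> int:
    """Bytes for one encoded vector: pq_m codes of pq_nbits bits, rounded up."""
    return (pq_m * pq_nbits + 7) // 8


# Settings from the benchmark row: 768-D vectors, pq_m=384, pq_nbits=8.
print(check_pq_params(dim=768, pq_m=384, pq_nbits=8, n_train=1_000_000))
print(check_pq_params(dim=768, pq_m=384, pq_nbits=8, n_train=100))
print(pq_code_bytes(pq_m=384, pq_nbits=8))
```

For the benchmark settings, each vector is encoded in 384 bytes, compared with 768 values in the original vector. The on-disk index also holds codebooks and other index structures, so reported index sizes will be larger than the code size alone.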
 
 ```sql
 CREATE TABLE sift_1M (
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/overview.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/overview.md
index f4cd532936e..b06d595a502 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/overview.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/overview.md
@@ -288,6 +288,7 @@ PROPERTIES (
 |--------|----------|---------------|------------|----------|----------|------|
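 | Cohere-MEDIUM-1M | 768D | Doris (FLAT)    | 5.647 GB (2.533 + 3.114) | 2.533 GB | 3.114 GB | 1M vectors, raw data + HNSW FLAT index |
 | Cohere-MEDIUM-1M | 768D | Doris SQ INT8   | 3.501 GB (2.533 + 0.992) | 2.533 GB | 0.992 GB | INT8 symmetric quantization |
+| Cohere-MEDIUM-1M | 768D | Doris PQ(pq_m=384,pq_nbits=8)   | 3.149 GB (2.535 + 0.614) | 2.535 GB | 0.614 GB | product quantization |
 | Cohere-LARGE-10M | 768D | Doris (FLAT)    | 56.472 GB (25.328 + 31.145) | 25.328 GB | 31.145 GB | 10M vectors |
 | Cohere-LARGE-10M | 768D | Doris SQ INT8   | 35.016 GB (25.329 + 9.687) | 25.329 GB | 9.687 GB | INT8 quantization, index significantly smaller |
 
@@ -296,7 +297,9 @@ PROPERTIES (
 Similarly, Doris also supports product quantization, but note that additional parameters must be provided when using PQ:
 
 - `pq_m`: The number of sub-vectors the original high-dimensional vector is split into (the vector dimension dim must be divisible by pq_m).
-- `pq_nbits`: The number of bits used to quantize each sub-vector, which determines the size of each subspace codebook (k = 2 ^ pq_nbits). In faiss, pq_nbits is generally required to be no greater than 24.
+- `pq_nbits`: The number of bits used to quantize each sub-vector, which determines the size of each subspace codebook. In faiss, pq_nbits is generally required to be no greater than 24.
+
+Note in particular that PQ places a requirement on the amount of training data: there must be at least as many training points as cluster centers (that is, number of training points n >= 2 ^ pq_nbits).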
 
 ```sql
 CREATE TABLE sift_1M (
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/overview.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/overview.md
index b68a25fd303..39b1da157d3 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/overview.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/overview.md
@@ -287,6 +287,7 @@ PROPERTIES (
 |--------|----------|---------------|------------|----------|----------|------|
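 | Cohere-MEDIUM-1M | 768D | Doris (FLAT)    | 5.647 GB (2.533 + 3.114) | 2.533 GB | 3.114 GB | 1M vectors, raw data + HNSW FLAT index |
 | Cohere-MEDIUM-1M | 768D | Doris SQ INT8   | 3.501 GB (2.533 + 0.992) | 2.533 GB | 0.992 GB | INT8 symmetric quantization |
+| Cohere-MEDIUM-1M | 768D | Doris PQ(pq_m=384,pq_nbits=8)   | 3.149 GB (2.535 + 0.614) | 2.535 GB | 0.614 GB | product quantization |
 | Cohere-LARGE-10M | 768D | Doris (FLAT)    | 56.472 GB (25.328 + 31.145) | 25.328 GB | 31.145 GB | 10M vectors |
 | Cohere-LARGE-10M | 768D | Doris SQ INT8   | 35.016 GB (25.329 + 9.687) | 25.329 GB | 9.687 GB | INT8 quantization, index significantly smaller |
 
@@ -295,7 +296,9 @@ PROPERTIES (
 Similarly, Doris also supports product quantization, but note that additional parameters must be provided when using PQ:
 
 - `pq_m`: The number of sub-vectors the original high-dimensional vector is split into (the vector dimension dim must be divisible by pq_m).
-- `pq_nbits`: The number of bits used to quantize each sub-vector, which determines the size of each subspace codebook (k = 2 ^ pq_nbits). In faiss, pq_nbits is generally required to be no greater than 24.
+- `pq_nbits`: The number of bits used to quantize each sub-vector, which determines the size of each subspace codebook. In faiss, pq_nbits is generally required to be no greater than 24.
+
+Note in particular that PQ places a requirement on the amount of training data: there must be at least as many training points as cluster centers (that is, number of training points n >= 2 ^ pq_nbits).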
 
 ```sql
 CREATE TABLE sift_1M (
diff --git a/versioned_docs/version-4.x/ai/vector-search/overview.md b/versioned_docs/version-4.x/ai/vector-search/overview.md
index ec75452d22c..5ecfbcf5fb5 100644
--- a/versioned_docs/version-4.x/ai/vector-search/overview.md
+++ b/versioned_docs/version-4.x/ai/vector-search/overview.md
@@ -311,6 +311,7 @@ On 768-D Cohere-MEDIUM-1M and Cohere-LARGE-10M datasets, SQ8 reduces index size
 |---------|-----|----------------------|------------|-----------|------------|-------|
 | Cohere-MEDIUM-1M | 768D | Doris (FLAT) | 5.647 GB (2.533 + 3.114) | 2.533 GB | 3.114 GB | 1M vectors |
 | Cohere-MEDIUM-1M | 768D | Doris SQ INT8 | 3.501 GB (2.533 + 0.992) | 2.533 GB | 0.992 GB | INT8 symmetric quantization |
+| Cohere-MEDIUM-1M | 768D | Doris PQ(pq_m=384,pq_nbits=8)   | 3.149 GB (2.535 + 0.614) | 2.535 GB | 0.614 GB | product quantization |
 | Cohere-LARGE-10M | 768D | Doris (FLAT) | 56.472 GB (25.328 + 31.145) | 25.328 GB | 31.145 GB | 10M vectors |
 | Cohere-LARGE-10M | 768D | Doris SQ INT8 | 35.016 GB (25.329 + 9.687) | 25.329 GB | 9.687 GB | INT8 quantization |
 
@@ -319,7 +320,9 @@ Quantization introduces extra build-time overhead because each distance computat
 Similarly, Doris also supports product quantization, but note that when using PQ, additional parameters need to be provided:
 
 - `pq_m`: Indicates how many sub-vectors to split the original high-dimensional vector into (vector dimension dim must be divisible by pq_m).
-- `pq_nbits`: Indicates the number of bits for each sub-vector quantization, which determines the size of each subspace codebook (k = 2 ^ pq_nbits), in faiss pq_nbits is generally required to be no greater than 24.
+- `pq_nbits`: The number of bits used to quantize each sub-vector, which determines the size of each subspace codebook. In faiss, pq_nbits is generally required to be no greater than 24.
+
+Note that PQ needs enough data for training: the number of training points must be at least the number of clusters in each subspace codebook (n >= 2 ^ pq_nbits).
 
 ```sql
 CREATE TABLE sift_1M (

