This is an automated email from the ASF dual-hosted git repository.
yiguolei pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new d63bd2e3435 Add product quantization compression information (#3037)
d63bd2e3435 is described below
commit d63bd2e343508cbd0f9313bc7d4b079e7aa79076
Author: ivin <[email protected]>
AuthorDate: Mon Dec 8 10:33:26 2025 +0800
Add product quantization compression information (#3037)
## Versions
- [x] dev
- [x] 4.x
- [ ] 3.x
- [ ] 2.1
## Languages
- [x] Chinese
- [x] English
## Docs Checklist
- [ ] Checked by AI
- [ ] Test Cases Built
---
docs/ai/vector-search/overview.md | 5 ++++-
.../current/ai/vector-search/overview.md | 5 ++++-
.../version-4.x/ai/vector-search/overview.md | 5 ++++-
versioned_docs/version-4.x/ai/vector-search/overview.md | 5 ++++-
4 files changed, 16 insertions(+), 4 deletions(-)
diff --git a/docs/ai/vector-search/overview.md b/docs/ai/vector-search/overview.md
index ec75452d22c..5ecfbcf5fb5 100644
--- a/docs/ai/vector-search/overview.md
+++ b/docs/ai/vector-search/overview.md
@@ -311,6 +311,7 @@ On 768-D Cohere-MEDIUM-1M and Cohere-LARGE-10M datasets, SQ8 reduces index size
|---------|-----|----------------------|------------|-----------|------------|-------|
| Cohere-MEDIUM-1M | 768D | Doris (FLAT) | 5.647 GB (2.533 + 3.114) | 2.533 GB | 3.114 GB | 1M vectors |
| Cohere-MEDIUM-1M | 768D | Doris SQ INT8 | 3.501 GB (2.533 + 0.992) | 2.533 GB | 0.992 GB | INT8 symmetric quantization |
+| Cohere-MEDIUM-1M | 768D | Doris PQ(pq_m=384,pq_nbits=8) | 3.149 GB (2.535 + 0.614) | 2.535 GB | 0.614 GB | product quantization |
| Cohere-LARGE-10M | 768D | Doris (FLAT) | 56.472 GB (25.328 + 31.145) | 25.328 GB | 31.145 GB | 10M vectors |
| Cohere-LARGE-10M | 768D | Doris SQ INT8 | 35.016 GB (25.329 + 9.687) | 25.329 GB | 9.687 GB | INT8 quantization |
@@ -319,7 +320,9 @@ Quantization introduces extra build-time overhead because each distance computat
Similarly, Doris also supports product quantization, but note that when using PQ, additional parameters need to be provided:
- `pq_m`: Indicates how many sub-vectors to split the original high-dimensional vector into (vector dimension dim must be divisible by pq_m).
-- `pq_nbits`: Indicates the number of bits for each sub-vector quantization, which determines the size of each subspace codebook (k = 2 ^ pq_nbits), in faiss pq_nbits is generally required to be no greater than 24.
+- `pq_nbits`: Indicates the number of bits for each sub-vector quantization, which determines the size of each subspace codebook, in faiss pq_nbits is generally required to be no greater than 24.
+
+Note that PQ quantization requires sufficient data during the training, the number of training points needing to be at least as large as the number of clusters (n >= 2 ^ pq_nbits).
```sql
CREATE TABLE sift_1M (
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/overview.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/overview.md
index f4cd532936e..b06d595a502 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/overview.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/overview.md
@@ -288,6 +288,7 @@ PROPERTIES (
|--------|----------|---------------|------------|----------|----------|------|
| Cohere-MEDIUM-1M | 768D | Doris (FLAT) | 5.647 GB (2.533 + 3.114) | 2.533 GB | 3.114 GB | 1M 向量,原始 + HNSW FLAT 索引 |
| Cohere-MEDIUM-1M | 768D | Doris SQ INT8 | 3.501 GB (2.533 + 0.992) | 2.533 GB | 0.992 GB | INT8 对称量化 |
+| Cohere-MEDIUM-1M | 768D | Doris PQ(pq_m=384,pq_nbits=8) | 3.149 GB (2.535 + 0.614) | 2.535 GB | 0.614 GB | 乘积量化 |
| Cohere-LARGE-10M | 768D | Doris (FLAT) | 56.472 GB (25.328 + 31.145) | 25.328 GB | 31.145 GB | 10M 向量 |
| Cohere-LARGE-10M | 768D | Doris SQ INT8 | 35.016 GB (25.329 + 9.687) | 25.329 GB | 9.687 GB | INT8 量化,索引显著减小 |
@@ -296,7 +297,9 @@ PROPERTIES (
类似的, Doris也支持乘积量化, 不过需要注意的是在使用PQ时需要提供额外的参数:
- `pq_m`: 表示将原始的高维向量分割成多少个子向量(向量维度 dim 必须能被 pq_m 整除)。
-- `pq_nbits`: 表示每个子向量量化的比特数, 它决定了每个子空间码本的大小(k = 2 ^ pq_nbits), 在faiss中pq_nbits值一般要求不大于24。
+- `pq_nbits`: 表示每个子向量量化的比特数, 它决定了每个子空间码本的大小, 在faiss中pq_nbits值一般要求不大于24。
+
+特别需要注意的是, pq量化在训练阶段对训练的数据量有要求, 至少需要与每一个聚类中心数量一样多(即 训练点个数 n >= 2 ^ pq_nbits)。
```sql
CREATE TABLE sift_1M (
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/overview.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/overview.md
index b68a25fd303..39b1da157d3 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/overview.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/overview.md
@@ -287,6 +287,7 @@ PROPERTIES (
|--------|----------|---------------|------------|----------|----------|------|
| Cohere-MEDIUM-1M | 768D | Doris (FLAT) | 5.647 GB (2.533 + 3.114) | 2.533 GB | 3.114 GB | 1M 向量,原始 + HNSW FLAT 索引 |
| Cohere-MEDIUM-1M | 768D | Doris SQ INT8 | 3.501 GB (2.533 + 0.992) | 2.533 GB | 0.992 GB | INT8 对称量化 |
+| Cohere-MEDIUM-1M | 768D | Doris PQ(pq_m=384,pq_nbits=8) | 3.149 GB (2.535 + 0.614) | 2.535 GB | 0.614 GB | 乘积量化 |
| Cohere-LARGE-10M | 768D | Doris (FLAT) | 56.472 GB (25.328 + 31.145) | 25.328 GB | 31.145 GB | 10M 向量 |
| Cohere-LARGE-10M | 768D | Doris SQ INT8 | 35.016 GB (25.329 + 9.687) | 25.329 GB | 9.687 GB | INT8 量化,索引显著减小 |
@@ -295,7 +296,9 @@ PROPERTIES (
类似的, Doris也支持乘积量化, 不过需要注意的是在使用PQ时需要提供额外的参数:
- `pq_m`: 表示将原始的高维向量分割成多少个子向量(向量维度 dim 必须能被 pq_m 整除)。
-- `pq_nbits`: 表示每个子向量量化的比特数, 它决定了每个子空间码本的大小(k = 2 ^ pq_nbits), 在faiss中pq_nbits值一般要求不大于24。
+- `pq_nbits`: 表示每个子向量量化的比特数, 它决定了每个子空间码本的大小, 在faiss中pq_nbits值一般要求不大于24。
+
+特别需要注意的是, pq量化在训练阶段对训练的数据量有要求, 至少需要与每一个聚类中心数量一样多(即 训练点个数 n >= 2 ^ pq_nbits)。
```sql
CREATE TABLE sift_1M (
diff --git a/versioned_docs/version-4.x/ai/vector-search/overview.md b/versioned_docs/version-4.x/ai/vector-search/overview.md
index ec75452d22c..5ecfbcf5fb5 100644
--- a/versioned_docs/version-4.x/ai/vector-search/overview.md
+++ b/versioned_docs/version-4.x/ai/vector-search/overview.md
@@ -311,6 +311,7 @@ On 768-D Cohere-MEDIUM-1M and Cohere-LARGE-10M datasets, SQ8 reduces index size
|---------|-----|----------------------|------------|-----------|------------|-------|
| Cohere-MEDIUM-1M | 768D | Doris (FLAT) | 5.647 GB (2.533 + 3.114) | 2.533 GB | 3.114 GB | 1M vectors |
| Cohere-MEDIUM-1M | 768D | Doris SQ INT8 | 3.501 GB (2.533 + 0.992) | 2.533 GB | 0.992 GB | INT8 symmetric quantization |
+| Cohere-MEDIUM-1M | 768D | Doris PQ(pq_m=384,pq_nbits=8) | 3.149 GB (2.535 + 0.614) | 2.535 GB | 0.614 GB | product quantization |
| Cohere-LARGE-10M | 768D | Doris (FLAT) | 56.472 GB (25.328 + 31.145) | 25.328 GB | 31.145 GB | 10M vectors |
| Cohere-LARGE-10M | 768D | Doris SQ INT8 | 35.016 GB (25.329 + 9.687) | 25.329 GB | 9.687 GB | INT8 quantization |
@@ -319,7 +320,9 @@ Quantization introduces extra build-time overhead because each distance computat
Similarly, Doris also supports product quantization, but note that when using PQ, additional parameters need to be provided:
- `pq_m`: Indicates how many sub-vectors to split the original high-dimensional vector into (vector dimension dim must be divisible by pq_m).
-- `pq_nbits`: Indicates the number of bits for each sub-vector quantization, which determines the size of each subspace codebook (k = 2 ^ pq_nbits), in faiss pq_nbits is generally required to be no greater than 24.
+- `pq_nbits`: Indicates the number of bits for each sub-vector quantization, which determines the size of each subspace codebook, in faiss pq_nbits is generally required to be no greater than 24.
+
+Note that PQ quantization requires sufficient data during the training, the number of training points needing to be at least as large as the number of clusters (n >= 2 ^ pq_nbits).
```sql
CREATE TABLE sift_1M (
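Editor's sketch (not part of the commit): the PQ constraints described in the diff above can be checked with a few lines of Python. The helper names `pq_code_bytes` and `min_training_points` are hypothetical, and the 4-bytes-per-dimension baseline assumes float32 vectors, which the commit does not state. The result counts PQ codes only, so it is not expected to match the index sizes reported in the table.

```python
def pq_code_bytes(dim: int, pq_m: int, pq_nbits: int) -> int:
    """Bytes needed to store the PQ codes of one vector (codes only)."""
    if dim % pq_m != 0:
        # The docs require the vector dimension to be divisible by pq_m.
        raise ValueError(f"dim={dim} is not divisible by pq_m={pq_m}")
    if not 1 <= pq_nbits <= 24:
        # The docs note that faiss generally requires pq_nbits <= 24.
        raise ValueError(f"pq_nbits={pq_nbits} is outside 1..24")
    # pq_m sub-vectors, each encoded with pq_nbits bits, rounded up to whole bytes.
    return (pq_m * pq_nbits + 7) // 8


def min_training_points(pq_nbits: int) -> int:
    """Minimum training points: at least one per cluster (n >= 2 ^ pq_nbits)."""
    return 2 ** pq_nbits


if __name__ == "__main__":
    # Parameters taken from the benchmark row added by this commit.
    dim, pq_m, pq_nbits = 768, 384, 8
    code_bytes = pq_code_bytes(dim, pq_m, pq_nbits)
    raw_bytes = dim * 4  # assumption: float32 storage for the raw vector
    print(f"PQ codes per vector: {code_bytes} bytes (raw float32: {raw_bytes} bytes)")
    print(f"Minimum training points: {min_training_points(pq_nbits)}")
```

With `pq_m=384` and `pq_nbits=8`, each sub-vector covers two dimensions and is stored in one byte, and training needs at least 256 points.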