This is an automated email from the ASF dual-hosted git repository.
jiafengzheng pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new d5d0a9a5bd5 fix_en_doc (#49)
d5d0a9a5bd5 is described below
commit d5d0a9a5bd53a6f18e336e5675d0e5dd4af8457b
Author: Liqf <[email protected]>
AuthorDate: Wed Aug 17 08:59:23 2022 +0800
fix_en_doc (#49)
---
docs/data-table/data-model.md | 36 +-
docs/data-table/data-partition.md | 419 +++++++++++----------
docs/get-starting/get-starting.md | 2 +-
docs/install/install-deploy.md | 136 +------
.../compilation-with-ldb-toolchain.md | 2 +-
docs/install/source-install/compilation.md | 17 +-
.../current/data-table/data-model.md | 8 +-
.../current/get-starting/get-starting.md | 1 +
8 files changed, 258 insertions(+), 363 deletions(-)
diff --git a/docs/data-table/data-model.md b/docs/data-table/data-model.md
index 3df284c933b..662e7f405b3 100644
--- a/docs/data-table/data-model.md
+++ b/docs/data-table/data-model.md
@@ -26,7 +26,7 @@ under the License.
# Data Model
-This document describes Doris's data model, ROLLUP and prefix index concepts
at the logical level to help users better use Doris to cope with different
business scenarios.
+This document describes Doris's data model at the logical level to help users
better use Doris to cope with different business scenarios.
## Basic concepts
@@ -38,7 +38,7 @@ Columns can be divided into two categories: Key and Value.
From a business persp
Doris's data model is divided into three main categories:
* Aggregate
-* Uniq
+* Unique
* Duplicate
Let's introduce them separately.
@@ -234,9 +234,9 @@ Data aggregation occurs in Doris in the following three
stages:
Data may be aggregated to varying degrees at different times. For example,
when a batch of data is just imported, it may not be aggregated with the
existing data. But for users, users **can only query aggregated data**. That is,
different degrees of aggregation are transparent to user queries. Users should
always assume that data exists in terms of the degree of aggregation that
**ultimately completes**, and **should not assume that some aggregation has not
yet occurred**. (See the section [...]
-## Uniq Model
+## Unique Model
-In some multi-dimensional analysis scenarios, users are more concerned with
how to ensure the uniqueness of Key, that is, how to obtain the Primary Key
uniqueness constraint. Therefore, we introduce Uniq's data model. This model is
essentially a special case of aggregation model and a simplified representation
of table structure. Let's give an example.
+In some multi-dimensional analysis scenarios, users are more concerned with
how to ensure the uniqueness of the Key, that is, how to obtain a Primary Key
uniqueness constraint. Therefore, we introduce the Unique data model. This
model is essentially a special case of the Aggregate model and a simplified
representation of table structure. Let's give an example.
|ColumnName|Type|IsKey|Comment|
|---|---|---|---|
@@ -303,7 +303,7 @@ PROPERTIES (
);
```
-That is to say, Uniq model can be completely replaced by REPLACE in
aggregation model. Its internal implementation and data storage are exactly the
same. No further examples will be given here.
+That is to say, the Unique model can be completely replaced by the REPLACE
aggregation type in the Aggregate model. Its internal implementation and data
storage are exactly the same. No further examples will be given here.
## Duplicate Model
@@ -336,14 +336,14 @@ PROPERTIES (
);
```
-This data model is different from Aggregate and Uniq models. Data is stored
entirely in accordance with the data in the imported file, without any
aggregation. Even if the two rows of data are identical, they will be retained.
-The DUPLICATE KEY specified in the table building statement is only used to
specify which columns the underlying data is sorted according to. (The more
appropriate name should be "Sorted Column", where the name "DUPLICATE KEY" is
used to specify the data model used. For more explanations of "Sorted Column",
see the section **Prefix Index**.) On the choice of DUPLICATE KEY, we recommend
that the first 2-4 columns be selected appropriately.
+This data model is different from the Aggregate and Unique models. Data is
stored entirely as it appears in the imported file, without any aggregation.
Even if two rows of data are identical, both will be retained.
+The DUPLICATE KEY specified in the table creation statement is only used to
indicate which columns the underlying data is sorted by. (A more appropriate
name would be "Sorted Column"; the name "DUPLICATE KEY" simply indicates the
data model used. For more explanation of "Sorted Column", see the section
[Prefix Index](./index/prefix-index.md).) For the choice of DUPLICATE KEY, we
recommend selecting the first 2-4 columns as appropriate.
-This data model is suitable for storing raw data without aggregation
requirements and primary key uniqueness constraints. For more usage scenarios,
see the **Limitations of the Aggregation Model** section.
+This data model is suitable for storing raw data without aggregation
requirements and primary key uniqueness constraints. For more usage scenarios,
see the [Limitations of the Aggregation
Model](./data-model.md#limitations-of-aggregation-model) section.
## Limitations of aggregation model
-Here we introduce the limitations of Aggregate model (including Uniq model).
+Here we introduce the limitations of the Aggregate model (including the Unique model).
In the aggregation model, what the model presents is the aggregated data. That
is to say, any data that has not yet been aggregated (for example, two
different imported batches) must be presented in some way to ensure
consistency. Let's give an example.
@@ -385,7 +385,9 @@ We add aggregation operator to query engine to ensure data
consistency.
In addition, on the aggregated columns (Value), when executing aggregate
queries whose aggregation is inconsistent with the column's aggregation type,
attention must be paid to the semantics. For example, in the example above, we
execute the following query:
-`SELECT MIN(cost) FROM table;`
+```
+SELECT MIN(cost) FROM table;
+```
The result is 5, not 1.
@@ -393,7 +395,9 @@ At the same time, this consistency guarantee will greatly
reduce the query effic
Let's take the most basic count (*) query as an example:
-`SELECT COUNT(*) FROM table;`
+```
+SELECT COUNT(*) FROM table;
+```
In other databases, such queries return results quickly. Because in the
implementation, we can get the query result by counting rows at the time of
import and saving count statistics information, or by scanning only a column of
data to get count value at the time of query, with very little overhead. But in
Doris's aggregation model, the overhead of this query is **very large**.
@@ -425,9 +429,9 @@ Because the final aggregation result is:
So for `select count (*) from table;`, the correct result should be **4**. But
if we only scan the `user_id` column and add query-time aggregation, the final
result is **3** (10001, 10002, 10003). Without query-time aggregation, the
result is **5** (a total of five rows in two batches). It can be seen that both
results are wrong.
-In order to get the correct result, we must read the data of `user_id` and
`date`, and **together with aggregate** when querying, to return the correct
result of **4**. That is to say, in the count (*) query, Doris must scan all
AGGREGATE KEY columns (here are `user_id` and `date`) and aggregate them to get
the semantically correct results. When aggregated columns are large, count (*)
queries need to scan a large amount of data.
+In order to get the correct result, we must read the data of both `user_id`
and `date`, and **aggregate them while querying**, to return the correct
result of **4**. That is to say, in a `count (*)` query, Doris must scan all
AGGREGATE KEY columns (here `user_id` and `date`) and aggregate them to get
the semantically correct result. When there are many aggregate key columns,
`count (*)` queries need to scan a large amount of data.
-Therefore, when there are frequent count (*) queries in the business, we
recommend that users simulate count (*) by adding a column with a value of 1
and aggregation type of SUM. As the table structure in the previous example, we
modify it as follows:
+Therefore, when there are frequent `count (*)` queries in the business, we
recommend that users simulate `count (*)` by adding a column whose value is
always 1 and whose aggregation type is SUM. Taking the table structure in the
previous example, we modify it as follows:
|ColumnName|Type|AggregationType|Comment|
|---|---|---|---|
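The workaround described above can be sketched in SQL. This is a hypothetical example: the table name `example_tbl` is an assumption, not from the original table definition; the key point is the extra `count` column with SUM aggregation, always loaded with the value 1.

```sql
-- Hypothetical sketch: add a `count` column with SUM aggregation,
-- defaulting every imported row's value to 1.
ALTER TABLE example_tbl ADD COLUMN `count` BIGINT SUM DEFAULT "1";

-- An efficient row count then only needs to scan the `count` column:
SELECT SUM(`count`) FROM example_tbl;
```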
@@ -442,15 +446,15 @@ Another way is to **change the aggregation type of the
count column above to REP
### Duplicate Model
-Duplicate model has no limitation of aggregation model. Because the model does
not involve aggregate semantics, when doing count (*) query, we can get the
correct semantics by choosing a column of queries arbitrarily.
+The Duplicate model does not have the limitations of the aggregation model.
Because the model does not involve aggregation semantics, when doing a
`count (*)` query, we can get semantically correct results by arbitrarily
choosing one column to query.
## Suggestions for Choosing Data Model
The data model is established when the table is built and **cannot be modified
afterwards. Therefore, it is very important to select an appropriate data
model**.
-1. Aggregate model can greatly reduce the amount of data scanned and the
amount of query computation by pre-aggregation. It is very suitable for report
query scenarios with fixed patterns. But this model is not very friendly for
count (*) queries. At the same time, because the aggregation method on the
Value column is fixed, semantic correctness should be considered in other types
of aggregation queries.
+1. The Aggregate model can greatly reduce the amount of data scanned and the
amount of query computation through pre-aggregation. It is very suitable for
report query scenarios with fixed patterns. But this model is not friendly to
`count (*)` queries. At the same time, because the aggregation method on the
Value columns is fixed, semantic correctness should be considered when
executing other types of aggregation queries.
-2. Uniq model guarantees the uniqueness of primary key for scenarios requiring
unique primary key constraints. However, the query advantage brought by
pre-aggregation such as ROLLUP cannot be exploited (because the essence is
REPLACE, there is no such aggregation as SUM).
+2. The Unique model guarantees primary key uniqueness for scenarios requiring
a unique primary key constraint. However, it cannot exploit the query
advantages brought by pre-aggregation such as ROLLUP (because its essence is
REPLACE, there is no aggregation such as SUM).
   > **[Note]** The Unique model only supports whole-row updates. If the
user needs a unique key with partial updates (such as loading multiple source
tables into one Doris table), you can consider using the Aggregate model,
setting the aggregation type of the non-primary-key columns to
REPLACE_IF_NOT_NULL. For details, please refer to the [CREATE TABLE
Manual](../sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE.md)
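As a hedged sketch of the note above (the table and column names are hypothetical, not from this document), an Aggregate table using REPLACE_IF_NOT_NULL for partial-update behavior might look like:

```sql
-- Hypothetical sketch: non-key columns use REPLACE_IF_NOT_NULL, so loading
-- a NULL for a column leaves its previously stored value untouched.
CREATE TABLE user_profile
(
    `user_id` LARGEINT,
    `name`    VARCHAR(50) REPLACE_IF_NOT_NULL,
    `age`     INT REPLACE_IF_NOT_NULL
)
AGGREGATE KEY(`user_id`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 8
PROPERTIES ("replication_num" = "1");  -- assumed single-replica setup
```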
diff --git a/docs/data-table/data-partition.md
b/docs/data-table/data-partition.md
index 441e43cee8e..8d4372340b5 100644
--- a/docs/data-table/data-partition.md
+++ b/docs/data-table/data-partition.md
@@ -140,200 +140,200 @@ When defining columns, you can refer to the following
suggestions:
Doris supports two levels of data partitioning. The first layer is Partition,
which supports Range and List partitioning. The second layer is the Bucket
(Tablet), which only supports Hash partitioning.
-It is also possible to use only one layer of partitioning. When using a layer
partition, only Bucket partitioning is supported.
-
-1. Partition
-
- * The Partition column can specify one or more columns. The partition
class must be a KEY column. The use of multi-column partitions is described
later in the **Multi-column partitioning** summary.
- * Regardless of the type of partition column, double quotes are required
when writing partition values.
- * There is no theoretical limit on the number of partitions.
- * When you do not use Partition to build a table, the system will
automatically generate a Partition with the same name as the table name. This
Partition is not visible to the user and cannot be modified.
- * **Do not add partitions with overlapping ranges** when creating
partitions.
-
- #### Range Partition
-
- * Partition columns are usually time columns for easy management of old
and new data.
-
- * Partition supports only the upper bound by `VALUES LESS THAN (...)`, the
system will use the upper bound of the previous partition as the lower bound of
the partition, and generate a left closed right open interval. Passing, also
supports specifying the upper and lower bounds by `VALUES [...)`, and
generating a left closed right open interval.
-
- * It is easier to understand by specifying `VALUES [...)`. Here is an
example of the change in partition range when adding or deleting partitions
using the `VALUES LESS THAN (...)` statement:
- * As in the `example_range_tbl` example above, when the table is
built, the following 3 partitions are automatically generated:
- ```
- P201701: [MIN_VALUE, 2017-02-01)
- P201702: [2017-02-01, 2017-03-01)
- P201703: [2017-03-01, 2017-04-01)
- ```
-
- * When we add a partition p201705 VALUES LESS THAN ("2017-06-01"), the
partition results are as follows:
-
- ```
- P201701: [MIN_VALUE, 2017-02-01)
- P201702: [2017-02-01, 2017-03-01)
- P201703: [2017-03-01, 2017-04-01)
- P201705: [2017-04-01, 2017-06-01)
- ```
-
- * At this point we delete the partition p201703, the partition results
are as follows:
-
- ```
- p201701: [MIN_VALUE, 2017-02-01)
- p201702: [2017-02-01, 2017-03-01)
- p201705: [2017-04-01, 2017-06-01)
- ```
-
- > Note that the partition range of p201702 and p201705 has not
changed, and there is a hole between the two partitions: [2017-03-01,
2017-04-01). That is, if the imported data range is within this hole, it cannot
be imported.
-
- * Continue to delete partition p201702, the partition results are as
follows:
-
- ```
- p201701: [MIN_VALUE, 2017-02-01)
- p201705: [2017-04-01, 2017-06-01)
- ```
-
- > The void range becomes: [2017-02-01, 2017-04-01)
-
- * Now add a partition p201702new VALUES LESS THAN ("2017-03-01"), the
partition results are as follows:
-
- ```
- p201701: [MIN_VALUE, 2017-02-01)
- p201702new: [2017-02-01, 2017-03-01)
- p201705: [2017-04-01, 2017-06-01)
- ```
-
- > You can see that the hole size is reduced to: [2017-03-01,
2017-04-01)
-
- * Now delete partition p201701 and add partition p201612 VALUES LESS
THAN ("2017-01-01"), the partition result is as follows:
-
- ```
- p201612: [MIN_VALUE, 2017-01-01)
- p201702new: [2017-02-01, 2017-03-01)
- p201705: [2017-04-01, 2017-06-01)
- ```
-
- > A new void appeared: [2017-01-01, 2017-02-01)
-
-
- In summary, the deletion of a partition does not change the scope of an
existing partition. There may be holes in deleting partitions. When a partition
is added by the `VALUES LESS THAN` statement, the lower bound of the partition
immediately follows the upper bound of the previous partition.
-
- In addition to the single-column partitioning we have seen above, Range
partition also supports **multi-column partitioning**, examples are as follows:
+It is also possible to use only one layer of partitioning. When using a
single-layer partition, only Bucket partitioning is supported. Let's introduce
partitions and buckets separately:
+
+1. **Partition**
+
+   - The Partition column can specify one or more columns, and the partition
columns must be KEY columns. The use of multi-column partitions is described
later in the **Multi-column partitioning** section.
+ - Regardless of the type of partition column, double quotes are required
when writing partition values.
+ - There is no theoretical limit on the number of partitions.
+ - When you do not use Partition to build a table, the system will
automatically generate a Partition with the same name as the table name. This
Partition is not visible to the user and cannot be modified.
+ - **Do not add partitions with overlapping ranges** when creating
partitions.
+
+ **Range Partition**
+
+ - The partition column is usually a time column to facilitate the
management of old and new data.
+
+   - Partitions support specifying only the upper bound via `VALUES LESS THAN
(...)`; the system will use the upper bound of the previous partition as the
lower bound of the new partition, generating a left-closed, right-open
interval. It also supports specifying both the upper and lower bounds via
`VALUES [...)`, generating a left-closed, right-open interval.
+
+ - It is easier to understand by specifying `VALUES [...)`. Here is an
example of the change in partition range when adding or deleting partitions
using the `VALUES LESS THAN (...)` statement:
+
+ - As in the `example_range_tbl` example above, when the table is built,
the following 3 partitions are automatically generated:
+
+ ```text
+ p201701: [MIN_VALUE, 2017-02-01)
+ p201702: [2017-02-01, 2017-03-01)
+ p201703: [2017-03-01, 2017-04-01)
+ ```
+
+ - When we add a partition p201705 VALUES LESS THAN ("2017-06-01"), the
partition results are as follows:
+
+ ```text
+ p201701: [MIN_VALUE, 2017-02-01)
+ p201702: [2017-02-01, 2017-03-01)
+ p201703: [2017-03-01, 2017-04-01)
+ p201705: [2017-04-01, 2017-06-01)
+ ```
+
+ - At this point we delete the partition p201703, the partition results
are as follows:
+
+ ```text
+ p201701: [MIN_VALUE, 2017-02-01)
+ p201702: [2017-02-01, 2017-03-01)
+ p201705: [2017-04-01, 2017-06-01)
+ ```
+
+ > Note that the partition range of p201702 and p201705 has not
changed, and there is a hole between the two partitions: [2017-03-01,
2017-04-01). That is, if the imported data range is within this hole, it cannot
be imported.
+
+ - Continue to delete partition p201702, the partition results are as
follows:
+
+ ```text
+ p201701: [MIN_VALUE, 2017-02-01)
+ p201705: [2017-04-01, 2017-06-01)
+ ```
+
+ > The void range becomes: [2017-02-01, 2017-04-01)
+
+ - Now add a partition p201702new VALUES LESS THAN ("2017-03-01"), the
partition results are as follows:
+
+ ```text
+ p201701: [MIN_VALUE, 2017-02-01)
+ p201702new: [2017-02-01, 2017-03-01)
+ p201705: [2017-04-01, 2017-06-01)
+ ```
+
+ > You can see that the hole size is reduced to: [2017-03-01, 2017-04-01)
+
+ - Now delete partition p201701 and add partition p201612 VALUES LESS THAN
("2017-01-01"), the partition result is as follows:
+
+ ```text
+ p201612: [MIN_VALUE, 2017-01-01)
+ p201702new: [2017-02-01, 2017-03-01)
+ p201705: [2017-04-01, 2017-06-01)
+ ```
+
+ > A new void appeared: [2017-01-01, 2017-02-01)
+
+ In summary, the deletion of a partition does not change the scope of an
existing partition. There may be holes in deleting partitions. When a partition
is added by the `VALUES LESS THAN` statement, the lower bound of the partition
immediately follows the upper bound of the previous partition.
+
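The partition operations walked through above can be sketched as ALTER statements, assuming the `example_range_tbl` table from the example (only the first two steps are shown):

```sql
-- Hedged sketch of the walkthrough above: add p201705, then drop p201703,
-- which leaves the hole [2017-03-01, 2017-04-01) described in the text.
ALTER TABLE example_range_tbl ADD PARTITION p201705 VALUES LESS THAN ("2017-06-01");
ALTER TABLE example_range_tbl DROP PARTITION p201703;
```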
+ In addition to the single-column partitioning we have seen above, Range
partition also supports **multi-column partitioning**, examples are as follows:
```text
- PARTITION BY RANGE(`date`, `id`)
- (
- PARTITION `p201701_1000` VALUES LESS THAN ("2017-02-01", "1000"),
- PARTITION `p201702_2000` VALUES LESS THAN ("2017-03-01", "2000"),
- PARTITION `p201703_all` VALUES LESS THAN ("2017-04-01")
- )
- ```
-
- In the above example, we specify `date` (DATE type) and `id` (INT type) as
partition columns. The resulting partitions in the above example are as follows:
-
- ```
- *p201701_1000: [(MIN_VALUE, MIN_VALUE), ("2017-02-01", "1000") )
- *p201702_2000: [("2017-02-01", "1000"), ("2017-03-01", "2000") )
- *p201703_all: [("2017-03-01", "2000"), ("2017-04-01", MIN_VALUE))
- ```
-
- Note that the last partition user defaults only the partition value of the
`date` column, so the partition value of the `id` column will be filled with
`MIN_VALUE` by default. When the user inserts data, the partition column values
are compared in order, and the corresponding partition is finally obtained.
Examples are as follows:
-
- ```
- * Data --> Partition
- * 2017-01-01, 200 --> p201701_1000
- * 2017-01-01, 2000 --> p201701_1000
- * 2017-02-01, 100 --> p201701_1000
- * 2017-02-01, 2000 --> p201702_2000
- * 2017-02-15, 5000 --> p201702_2000
- * 2017-03-01, 2000 --> p201703_all
- * 2017-03-10, 1 --> p201703_all
- * 2017-04-01, 1000 --> Unable to import
- * 2017-05-01, 1000 --> Unable to import
- ```
-
- #### List Partition
-
- * The partition column supports the `BOOLEAN, TINYINT, SMALLINT, INT,
BIGINT, LARGEINT, DATE, DATETIME, CHAR, VARCHAR` data type, and the partition
value is an enumeration value. Partitions can be hit only if the data is one of
the target partition enumeration values.
- * Partition supports specifying the number of partitions contained in each
partition via `VALUES IN (...) ` to specify the enumeration values contained in
each partition.
- * The following example illustrates how partitions change when adding or
deleting partitions.
-
- * As in the `example_list_tbl` example above, when the table is built,
the following three partitions are automatically created.
-
- ```
- p_cn: ("Beijing", "Shanghai", "Hong Kong")
- p_usa: ("New York", "San Francisco")
- p_jp: ("Tokyo")
- ```
-
- * When we add a partition p_uk VALUES IN ("London"), the result of the
partition is as follows
-
- ```
- p_cn: ("Beijing", "Shanghai", "Hong Kong")
- p_usa: ("New York", "San Francisco")
- p_jp: ("Tokyo")
- p_uk: ("London")
- ```
-
- * When we delete the partition p_jp, the result of the partition is as
follows.
-
- ```
- p_cn: ("Beijing", "Shanghai", "Hong Kong")
- p_usa: ("New York", "San Francisco")
- p_uk: ("London")
- ```
-
- List partition also supports **multi-column partition**, examples are as
follows:
-
- ```text
- PARTITION BY LIST(`id`, `city`)
- (
- PARTITION `p1_city` VALUES IN (("1", "Beijing"), ("1", "Shanghai")),
- PARTITION `p2_city` VALUES IN (("2", "Beijing"), ("2", "Shanghai")),
- PARTITION `p3_city` VALUES IN (("3", "Beijing"), ("3", "Shanghai"))
- )
- ```
-
- In the above example, we specify `id`(INT type) and `city`(VARCHAR type)
as partition columns. The above example ends up with the following partitions.
-
- ```
- * p1_city: [("1", "Beijing"), ("1", "Shanghai")]
- * p2_city: [("2", "Beijing"), ("2", "Shanghai")]
- * p3_city: [("3", "Beijing"), ("3", "Shanghai")]
- ```
-
- When the user inserts data, the partition column values will be compared
sequentially in order to finally get the corresponding partition. An example is
as follows.
-
- ```
- * Data ---> Partition
- * 1, Beijing ---> p1_city
- * 1, Shanghai ---> p1_city
- * 2, Shanghai ---> p2_city
- * 3, Beijing ---> p3_city
- * 1, Tianjin ---> Unable to import
- * 4, Beijing ---> Unable to import
- ```
-
-2. Bucket
-
- * If a Partition is used, the `DISTRIBUTED ...` statement describes the
division rules for the data in each partition. If you do not use Partition, it
describes the rules for dividing the data of the entire table.
- * The bucket column can be multiple columns, but it must be a Key column.
The bucket column can be the same or different from the Partition column.
- * The choice of bucket column is a trade-off between **query throughput**
and **query concurrency**:
-
- 1. If you select multiple bucket columns, the data is more evenly
distributed. However, if the query condition does not include the equivalent
condition for all bucket columns, a query will scan all buckets. The throughput
of such queries will increase, and the latency of a single query will decrease.
This method is suitable for large throughput and low concurrent query scenarios.
- 2. If you select only one or a few bucket columns, the point query can
query only one bucket. This approach is suitable for high-concurrency point
query scenarios.
-
- * There is no theoretical limit on the number of buckets.
-
-3. Recommendations on the number and amount of data for Partitions and Buckets.
-
- * The total number of tablets in a table is equal to (Partition num *
Bucket num).
- * The number of tablets in a table, which is slightly more than the number
of disks in the entire cluster, regardless of capacity expansion.
- * The data volume of a single tablet does not theoretically have an upper
and lower bound, but is recommended to be in the range of 1G - 10G. If the
amount of data for a single tablet is too small, the aggregation of the data is
not good and the metadata management pressure is high. If the amount of data is
too large, it is not conducive to the migration, completion, and increase the
cost of Schema Change or Rollup operation failure retry (the granularity of
these operations failure [...]
- * When the tablet's data volume principle and quantity principle conflict,
it is recommended to prioritize the data volume principle.
- * When building a table, the number of Buckets for each partition is
uniformly specified. However, when dynamically increasing partitions (`ADD
PARTITION`), you can specify the number of Buckets for the new partition
separately. This feature can be used to easily reduce or expand data.
- * Once the number of Buckets for a Partition is specified, it cannot be
changed. Therefore, when determining the number of Buckets, you need to
consider the expansion of the cluster in advance. For example, there are
currently only 3 hosts, and each host has 1 disk. If the number of Buckets is
only set to 3 or less, then even if you add more machines later, you can't
increase the concurrency.
- * Give some examples: Suppose there are 10 BEs, one for each BE disk. If
the total size of a table is 500MB, you can consider 4-8 shards. 5GB: 8-16.
50GB: 32. 500GB: Recommended partitions, each partition is about 50GB in size,
with 16-32 shards per partition. 5TB: Recommended partitions, each with a size
of around 50GB and 16-32 shards per partition.
-
- > Note: The amount of data in the table can be viewed by the [show
data](../sql-manual/sql-reference/Show-Statements/SHOW-DATA.md) command. The
result is divided by the number of copies, which is the amount of data in the
table.
-
+ PARTITION BY RANGE(`date`, `id`)
+ (
+ PARTITION `p201701_1000` VALUES LESS THAN ("2017-02-01", "1000"),
+ PARTITION `p201702_2000` VALUES LESS THAN ("2017-03-01", "2000"),
+ PARTITION `p201703_all` VALUES LESS THAN ("2017-04-01")
+ )
+ ```
+
+ In the above example, we specify `date` (DATE type) and `id` (INT type) as
partition columns. The resulting partitions in the above example are as follows:
+
+ ```
+ * p201701_1000: [(MIN_VALUE, MIN_VALUE), ("2017-02-01", "1000") )
+ * p201702_2000: [("2017-02-01", "1000"), ("2017-03-01", "2000") )
+ * p201703_all: [("2017-03-01", "2000"), ("2017-04-01", MIN_VALUE))
+ ```
+
+   Note that in the last partition, the user specified only the partition
value of the `date` column, so the partition value of the `id` column is
filled with `MIN_VALUE` by default. When the user inserts data, the partition
column values are compared in order, and the corresponding partition is
finally determined. Examples are as follows:
+
+ ```
+ * Data --> Partition
+ * 2017-01-01, 200 --> p201701_1000
+ * 2017-01-01, 2000 --> p201701_1000
+ * 2017-02-01, 100 --> p201701_1000
+ * 2017-02-01, 2000 --> p201702_2000
+ * 2017-02-15, 5000 --> p201702_2000
+ * 2017-03-01, 2000 --> p201703_all
+ * 2017-03-10, 1 --> p201703_all
+ * 2017-04-01, 1000 --> Unable to import
+ * 2017-05-01, 1000 --> Unable to import
+ ```
+
+ **List Partition**
+
+   - The partition column supports the `BOOLEAN, TINYINT, SMALLINT, INT,
BIGINT, LARGEINT, DATE, DATETIME, CHAR, VARCHAR` data types, and the partition
values are enumeration values. A partition is hit only if the data matches one
of that partition's enumeration values.
+
+   - Partitions support specifying the enumeration values contained in each
partition via `VALUES IN (...)`.
+
+ - The following example illustrates how partitions change when adding or
deleting partitions.
+
+ - As in the `example_list_tbl` example above, when the table is built,
the following three partitions are automatically created.
+
+ ```text
+ p_cn: ("Beijing", "Shanghai", "Hong Kong")
+ p_usa: ("New York", "San Francisco")
+ p_jp: ("Tokyo")
+ ```
+
+ - When we add a partition p_uk VALUES IN ("London"), the result of the
partition is as follows:
+
+ ```text
+ p_cn: ("Beijing", "Shanghai", "Hong Kong")
+ p_usa: ("New York", "San Francisco")
+ p_jp: ("Tokyo")
+ p_uk: ("London")
+ ```
+
+ - When we delete the partition p_jp, the result of the partition is as
follows:
+
+ ```text
+ p_cn: ("Beijing", "Shanghai", "Hong Kong")
+ p_usa: ("New York", "San Francisco")
+ p_uk: ("London")
+ ```
+
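The add/delete steps above can be sketched as ALTER statements, assuming the `example_list_tbl` table from the example:

```sql
-- Hedged sketch of the partition changes in the walkthrough above.
ALTER TABLE example_list_tbl ADD PARTITION p_uk VALUES IN ("London");
ALTER TABLE example_list_tbl DROP PARTITION p_jp;
```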
+ List partition also supports **multi-column partition**, examples are as
follows:
+
+ ```text
+ PARTITION BY LIST(`id`, `city`)
+ (
+ PARTITION `p1_city` VALUES IN (("1", "Beijing"), ("1", "Shanghai")),
+ PARTITION `p2_city` VALUES IN (("2", "Beijing"), ("2", "Shanghai")),
+ PARTITION `p3_city` VALUES IN (("3", "Beijing"), ("3", "Shanghai"))
+ )
+ ```
+
+   In the above example, we specify `id` (INT type) and `city` (VARCHAR type)
as partition columns. The above example ends up with the following partitions:
+
+ ```
+ * p1_city: [("1", "Beijing"), ("1", "Shanghai")]
+ * p2_city: [("2", "Beijing"), ("2", "Shanghai")]
+ * p3_city: [("3", "Beijing"), ("3", "Shanghai")]
+ ```
+
+ When the user inserts data, the partition column values will be compared
sequentially in order to finally get the corresponding partition. An example is
as follows.
+
+ ```
+ * Data ---> Partition
+ * 1, Beijing ---> p1_city
+ * 1, Shanghai ---> p1_city
+ * 2, Shanghai ---> p2_city
+ * 3, Beijing ---> p3_city
+ * 1, Tianjin ---> Unable to import
+ * 4, Beijing ---> Unable to import
+ ```
+
+2. **Bucket**
+
+ - If a Partition is used, the `DISTRIBUTED ...` statement describes the
division rules for the data in each partition. If you do not use Partition, it
describes the rules for dividing the data of the entire table.
+ - The bucket column can be multiple columns, but it must be a Key column.
The bucket column can be the same or different from the Partition column.
+ - The choice of bucket column is a trade-off between **query throughput**
and **query concurrency**:
+     1. If you select multiple bucket columns, the data is distributed more
evenly. However, if the query condition does not include equality conditions
on all bucket columns, the query will scan all buckets. The throughput of such
queries increases, while the latency of a single query decreases accordingly.
This method is suitable for high-throughput, low-concurrency query scenarios.
+ 2. If you select only one or a few bucket columns, the point query can
query only one bucket. This approach is suitable for high-concurrency point
query scenarios.
+ - There is no theoretical limit on the number of buckets.
+
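As a minimal sketch of how the bucketing clause fits into a table creation statement (the table and column names here are hypothetical, not from this document):

```sql
-- Hypothetical sketch: hash-bucket each partition's data on user_id
-- into 16 tablets.
CREATE TABLE example_site_visit
(
    `user_id` LARGEINT,
    `date`    DATE,
    `city`    VARCHAR(20)
)
DUPLICATE KEY(`user_id`, `date`)
PARTITION BY RANGE(`date`)
(
    PARTITION p2017 VALUES LESS THAN ("2018-01-01")
)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16
PROPERTIES ("replication_num" = "1");  -- assumed single-replica setup
```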
+3. **Recommendations on the number and amount of data for Partitions and
Buckets.**
+
+ - The total number of tablets in a table is equal to (Partition num *
Bucket num).
+   - The number of tablets in a table is recommended to be slightly more than
the number of disks in the entire cluster, without considering capacity
expansion.
+   - The data volume of a single tablet theoretically has no upper or lower
bound, but is recommended to be in the range of 1G - 10G. If the data volume
of a single tablet is too small, data aggregation is poor and metadata
management pressure is high. If the data volume is too large, it hinders
replica migration and repair, and increases the cost of retrying failed
Schema Change or Rollup operations (the granularity of these operations
failure r [...]
+ - When the tablet's data volume principle and quantity principle conflict,
it is recommended to prioritize the data volume principle.
+   - When building a table, the number of Buckets for each partition is
specified uniformly. However, when dynamically adding partitions (`ADD
PARTITION`), you can specify the number of Buckets for the new partition
separately. This feature can be used to conveniently cope with data shrinkage
or growth.
+ - Once the number of Buckets for a Partition is specified, it cannot be
changed. Therefore, when determining the number of Buckets, you need to
consider the expansion of the cluster in advance. For example, there are
currently only 3 hosts, and each host has 1 disk. If the number of Buckets is
only set to 3 or less, then even if you add more machines later, you can't
increase the concurrency.
+ - Some examples: Suppose there are 10 BEs, each with one disk. If the total size of a table is 500MB, consider 4-8 tablets. 5GB: 8-16 tablets. 50GB: 32 tablets. 500GB: use partitions, each about 50GB in size, with 16-32 tablets per partition. 5TB: use partitions, each about 50GB in size, with 16-32 tablets per partition.
+
+ > Note: The amount of data in the table can be viewed with the [show data](../sql-manual/sql-reference/Show-Statements/SHOW-DATA.md) command. Dividing the result by the number of replicas gives the actual data volume of the table.
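
A hypothetical sketch of the sizing rules above (database, table, and column names are illustrative, not from the original text):

```sql
-- Roughly 50GB per partition with 16-32 buckets per partition, as suggested above.
CREATE TABLE example_db.site_visits
(
    event_date DATE,
    site_id INT,
    pv BIGINT SUM DEFAULT "0"
)
AGGREGATE KEY(event_date, site_id)
PARTITION BY RANGE(event_date)
(
    PARTITION p202207 VALUES LESS THAN ("2022-08-01")
)
DISTRIBUTED BY HASH(site_id) BUCKETS 16
PROPERTIES ("replication_num" = "3");

-- A dynamically added partition may specify its own bucket count:
ALTER TABLE example_db.site_visits
    ADD PARTITION p202208 VALUES LESS THAN ("2022-09-01")
    DISTRIBUTED BY HASH(site_id) BUCKETS 32;
```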
#### Compound Partitions vs Single Partitions
@@ -355,43 +355,44 @@ The user can also use a single partition without using
composite partitions. The
In the last PROPERTIES of the table building statement, for the relevant
parameters that can be set in PROPERTIES, we can check [CREATE
TABLE](../sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE.md)
for a detailed introduction.
### ENGINE
-
In this example, the type of ENGINE is olap, the default ENGINE type. In
Doris, only this ENGINE type is managed and stored by Doris. Other ENGINE
types, such as mysql, broker, es, etc., are essentially mappings to tables in
other external databases or systems to ensure that Doris can read the data. And
Doris itself does not create, manage, and store any tables and data of a
non-olap ENGINE type.
-### Other
+### Other
+```text
`IF NOT EXISTS` indicates that if the table has not been created, it is
created. Note that only the table name is judged here, and it is not determined
whether the new table structure is the same as the existing table structure. So
if there is a table with the same name but different structure, the command
will also return success, but it does not mean that a new table and a new
structure have been created.
+```
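
A hedged illustration of this behavior (the table name `t1` is hypothetical):

```sql
-- The second statement returns success because a table named t1 already
-- exists; it does NOT create the two-column structure.
CREATE TABLE IF NOT EXISTS t1 (k1 INT) DISTRIBUTED BY HASH(k1) BUCKETS 1;
CREATE TABLE IF NOT EXISTS t1 (k1 INT, k2 INT) DISTRIBUTED BY HASH(k1) BUCKETS 1;
-- t1 still has only column k1.
```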
## common problem
-### Build Table Operations FAQ
### Build Table Operations FAQ
1. If a syntax error occurs in a long build statement, the reported syntax error message may be incomplete. Here is a list of possible syntax errors for manual error correction:
- * The syntax is incorrect. Please read `HELP CREATE TABLE;` carefully to
check the relevant syntax structure.
- * Reserved words. When the user-defined name encounters a reserved word,
it needs to be enclosed in the backquote ``. It is recommended that all custom
names be generated using this symbol.
- * Chinese characters or full-width characters. Non-utf8 encoded Chinese
characters, or hidden full-width characters (spaces, punctuation, etc.) can
cause syntax errors. It is recommended to check with a text editor with
invisible characters.
+ - The syntax is incorrect. Please read `HELP CREATE TABLE;` carefully to
check the relevant syntax structure.
+ - Reserved words. When the user-defined name encounters a reserved word, it
needs to be enclosed in the backquote ``. It is recommended that all custom
names be generated using this symbol.
+ - Chinese characters or full-width characters. Non-utf8 encoded Chinese
characters, or hidden full-width characters (spaces, punctuation, etc.) can
cause syntax errors. It is recommended to check with a text editor with
invisible characters.
2. `Failed to create partition [xxx] . Timeout`
- Doris builds are created in order of Partition granularity. This error may
be reported when a Partition creation fails. Even if you don't use Partition,
you will report `Failed to create partition` when there is a problem with the
built table, because as mentioned earlier, Doris will create an unchangeable
default Partition for tables that do not have a Partition specified.
+ Doris builds are created in order of Partition granularity. This error may
be reported when a Partition creation fails. Even if you don't use Partition,
you will report `Failed to create partition` when there is a problem with the
built table, because as mentioned earlier, Doris will create an unchangeable
default Partition for tables that do not have a Partition specified.
- When this error is encountered, it is usually the BE that has encountered
problems creating data fragments. You can follow the steps below to
troubleshoot:
+ When this error is encountered, it is usually the BE that has encountered
problems creating data fragments. You can follow the steps below to
troubleshoot:
- 1. In fe.log, find the `Failed to create partition` log for the
corresponding point in time. In this log, a series of numbers like
`{10001-10010}` will appear. The first number of the pair is the Backend ID and
the second number is the Tablet ID. As for the pair of numbers above, on the
Backend with ID 10001, creating a tablet with ID 10010 failed.
- 2. Go to the be.INFO log corresponding to Backend and find the log related
to the tablet id in the corresponding time period. You can find the error
message.
- 3. Listed below are some common tablet creation failure errors, including
but not limited to:
- * BE did not receive the relevant task, and the tablet id related log
could not be found in be.INFO. Or the BE is created successfully, but the
report fails. For the above questions, see [Deployment and Upgrade
Documentation] to check the connectivity of FE and BE.
- * Pre-allocated memory failed. It may be that the length of a line in
a row in the table exceeds 100KB.
- * `Too many open files`. The number of open file handles exceeds the
Linux system limit. The handle limit of the Linux system needs to be modified.
+ - In fe.log, find the `Failed to create partition` log for the
corresponding point in time. In this log, a series of numbers like
`{10001-10010}` will appear. The first number of the pair is the Backend ID and
the second number is the Tablet ID. As for the pair of numbers above, on the
Backend with ID 10001, creating a tablet with ID 10010 failed.
+ - Go to the be.INFO log corresponding to Backend and find the log related
to the tablet id in the corresponding time period. You can find the error
message.
+ - Listed below are some common tablet creation failure errors, including
but not limited to:
+ - BE did not receive the relevant task, and the tablet id related log
could not be found in be.INFO. Or the BE is created successfully, but the
report fails. For the above questions, see [Deployment and Upgrade
Documentation](../install/install-deploy.md) to check the connectivity of FE
and BE.
+ - Pre-allocated memory failed. It may be that the length of a line in a
row in the table exceeds 100KB.
+ - `Too many open files`. The number of open file handles exceeds the
Linux system limit. The handle limit of the Linux system needs to be modified.
- You can also extend the timeout by setting
`tablet_create_timeout_second=xxx` in fe.conf. The default is 2 seconds.
+ If there is a timeout when creating a data shard, you can also extend the
timeout by setting `tablet_create_timeout_second=xxx` and
`max_create_table_timeout_second=xxx` in fe.conf. Among them,
`tablet_create_timeout_second` defaults to 1 second,
`max_create_table_timeout_second` defaults to 60 seconds, and the overall
timeout time is min(tablet_create_timeout_second * replication_num,
max_create_table_timeout_second). For specific parameter settings, please refer
to [fe-config](../ad [...]
3. The build table command does not return results for a long time.
- Doris's table creation command is a synchronous command. The timeout of
this command is currently set to be relatively simple, ie (tablet num *
replication num) seconds. If you create more data fragments and have fragment
creation failed, it may cause an error to be returned after waiting for a long
timeout.
 Doris's table creation command is a synchronous command. Its timeout is currently set simply, i.e. (tablet num * replication num) seconds. If many data shards are created and some of them fail, the command may wait through a long timeout before returning an error.
- Under normal circumstances, the statement will return in a few seconds or
ten seconds. If it is more than one minute, it is recommended to cancel this
operation directly and go to the FE or BE log to view the related errors.
 Under normal circumstances, the statement returns within a few seconds to tens of seconds. If it takes more than a minute, it is recommended to cancel the operation directly and check the FE or BE logs for related errors.
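
The overall timeout formula quoted in item 2 above can be checked with a quick sketch (`replication_num=3` is an assumed value; the parameter defaults are the ones quoted above):

```shell
# Effective CREATE TABLE timeout:
# min(tablet_create_timeout_second * replication_num, max_create_table_timeout_second)
tablet_create_timeout_second=1       # fe.conf default quoted above: 1 second
max_create_table_timeout_second=60   # fe.conf default quoted above: 60 seconds
replication_num=3                    # assumed replica count for illustration

total=$(( tablet_create_timeout_second * replication_num ))
effective=$(( total < max_create_table_timeout_second ? total : max_create_table_timeout_second ))
echo "$effective"   # 3 -> a 3-replica table gets a 3-second overall timeout
```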
## More help
diff --git a/docs/get-starting/get-starting.md
b/docs/get-starting/get-starting.md
index 0fc753de39c..731c66497f4 100644
--- a/docs/get-starting/get-starting.md
+++ b/docs/get-starting/get-starting.md
@@ -236,7 +236,7 @@ For a complete parameter comparison table, please go to
[Profile parameter analy
- create database
```mysql
- CREATE DATABASE database name;
+ CREATE DATABASE database_name;
````
> For more detailed syntax and best practices used by Create-DataBase, see
[Create-Database](../sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-DATABASE)
command manual.
diff --git a/docs/install/install-deploy.md b/docs/install/install-deploy.md
index 73b4b553f8d..2d294b5104b 100644
--- a/docs/install/install-deploy.md
+++ b/docs/install/install-deploy.md
@@ -142,7 +142,7 @@ This is a representation of
[CIDR](https://en.wikipedia.org/wiki/Classless_Inter
**Note**: When priority_networks is configured, starting FE or BE only ensures the correct IP binding of FE or BE. In ADD BACKEND or ADD FRONTEND statements, you also need to specify an IP matching the priority_networks configuration, otherwise the cluster cannot be established. For example:
-BE is configured as `priority_networks = 10.1.3.0/24'.`.
+BE is configured as `priority_networks = 10.1.3.0/24`.
When you want to ADD BACKEND use: `ALTER SYSTEM ADD BACKEND
"192.168.0.1:9050";`
@@ -152,7 +152,7 @@ At this point, DROP must remove the BE that added errors
and re-use the correct
FE is the same.
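
To make the matching rule concrete, a hypothetical sketch (the 10.1.3.x address is illustrative; 9050 is the BE heartbeat port used elsewhere in this document):

```sql
-- BE configured with priority_networks = 10.1.3.0/24:
-- the IP given to ADD BACKEND must fall inside that subnet.
ALTER SYSTEM ADD BACKEND "10.1.3.20:9050";        -- matches 10.1.3.0/24
-- ALTER SYSTEM ADD BACKEND "192.168.0.1:9050";   -- would not match; DROP and re-add
```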
-BROKER does not currently have, nor does it need, priority\_networks. Broker's
services are bound to 0.0.0 by default. Simply execute the correct accessible
BROKER IP when ADD BROKER is used.
+BROKER does not currently have, nor does it need, priority\_networks. Broker's
services are bound to 0.0.0.0 by default. Simply execute the correct accessible
BROKER IP when ADD BROKER is used.
#### Table Name Case Sensitivity Setting
@@ -201,9 +201,9 @@ See the section on `lower_case_table_names` variables in
[Variables](../advanced
**instructions**
- * 1./home/disk1/doris.HDD : The storage medium is HDD;
- * 2./home/disk2/doris.SSD : The storage medium is SSD;
- * 3./home/disk2/doris : The storage medium is HDD HDD(default)
+ - 1./home/disk1/doris.HDD : The storage medium is HDD;
+ - 2./home/disk2/doris.SSD : The storage medium is SSD;
+ - 3./home/disk2/doris : The storage medium is HDD (default)
eg.2:
@@ -213,8 +213,8 @@ See the section on `lower_case_table_names` variables in
[Variables](../advanced
**instructions**
- * 1./home/disk1/doris,medium:hdd The storage medium is HDD;
- * 2./home/disk2/doris,medium:ssd The storage medium is SSD;
+ - 1./home/disk1/doris,medium:hdd The storage medium is HDD;
+ - 2./home/disk2/doris,medium:ssd The storage medium is SSD;
* BE webserver_port configuration
@@ -274,128 +274,6 @@ Broker is deployed as a plug-in, independent of Doris. If
you need to import dat
**Note: In production environments, daemons should be used to start all
instances to ensure that processes are automatically pulled up after they exit,
such as [Supervisor](http://supervisord.org/). For daemon startup, in 0.9.0 and
previous versions, you need to modify the start_xx.sh scripts to remove the
last & symbol**. Starting with version 0.10.0, call `sh start_xx.sh` directly
to start. Also refer to [here](https://www.cnblogs.com/lenmom/p/9973401.html)
-## Expansion and contraction
-
-Doris can easily expand and shrink FE, BE, Broker instances.
-
-### FE Expansion and Compression
-
-High availability of FE can be achieved by expanding FE to 3 or more nodes.
-
-Users can login to Master FE through MySQL client. By:
-
-`SHOW PROC '/frontends';`
-
-To view the current FE node situation.
-
-You can also view the FE node through the front-end page connection:
``http://fe_hostname:fe_http_port/frontend`` or
```http://fe_hostname:fe_http_port/system?Path=//frontends```.
-
-All of the above methods require Doris's root user rights.
-
-The process of FE node expansion and contraction does not affect the current
system operation.
-
-#### Adding FE nodes
-
-FE is divided into three roles: Leader, Follower and Observer. By default, a
cluster can have only one Leader and multiple Followers and Observers. Leader
and Follower form a Paxos selection group. If the Leader goes down, the
remaining Followers will automatically select a new Leader to ensure high write
availability. Observer synchronizes Leader data, but does not participate in
the election. If only one FE is deployed, FE defaults to Leader.
-
-The first FE to start automatically becomes Leader. On this basis, several
Followers and Observers can be added.
-
-Add Follower or Observer. Connect to the started FE using mysql-client and
execute:
-
-`ALTER SYSTEM ADD FOLLOWER "follower_host:edit_log_port";`
-
-or
-
-`ALTER SYSTEM ADD OBSERVER "observer_host:edit_log_port";`
-
-The follower\_host and observer\_host is the node IP of Follower or Observer,
and the edit\_log\_port in its configuration file fe.conf.
-
-Configure and start Follower or Observer. Follower and Observer are configured
with Leader. The following commands need to be executed at the first startup:
-
-`bin/start_fe.sh --helper host:edit_log_port --daemon`
-
-The host is the node IP of Leader, and the edit\_log\_port in Lead's
configuration file fe.conf. The --helper is only required when
follower/observer is first startup.
-
-View the status of Follower or Observer. Connect to any booted FE using
mysql-client and execute:
-
-```SHOW PROC '/frontends';```
-
-You can view the FE currently joined the cluster and its corresponding roles.
-
-> Notes for FE expansion:
->
-> 1. The number of Follower FEs (including Leaders) must be odd. It is
recommended that a maximum of three constituent high availability (HA) modes be
deployed.
-> 2. When FE is in a highly available deployment (1 Leader, 2 Follower), we
recommend that the reading service capability of FE be extended by adding
Observer FE. Of course, you can continue to add Follower FE, but it's almost
unnecessary.
-> 3. Usually a FE node can handle 10-20 BE nodes. It is suggested that the
total number of FE nodes should be less than 10. Usually three can meet most of
the needs.
-> 4. The helper cannot point to the FE itself, it must point to one or more
existing running Master/Follower FEs.
-
-#### Delete FE nodes
-
-Delete the corresponding FE node using the following command:
-
-```ALTER SYSTEM DROP FOLLOWER[OBSERVER] "fe_host:edit_log_port";```
-
-> Notes for FE contraction:
->
-> 1. When deleting Follower FE, make sure that the remaining Follower
(including Leader) nodes are odd.
-
-### BE Expansion and Compression
-
-Users can login to Leader FE through mysql-client. By:
-
-```SHOW PROC '/backends';```
-
-To see the current BE node situation.
-
-You can also view the BE node through the front-end page connection:
``http://fe_hostname:fe_http_port/backend`` or
``http://fe_hostname:fe_http_port/system?Path=//backends``.
-
-All of the above methods require Doris's root user rights.
-
-The expansion and scaling process of BE nodes does not affect the current
system operation and the tasks being performed, and does not affect the
performance of the current system. Data balancing is done automatically.
Depending on the amount of data available in the cluster, the cluster will be
restored to load balancing in a few hours to a day. For cluster load, see the
[Tablet Load Balancing
Document](../admin-manual/maint-monitor/tablet-meta-tool.md).
-
-#### Add BE nodes
-
-The BE node is added in the same way as in the **BE deployment** section. The
BE node is added by the `ALTER SYSTEM ADD BACKEND` command.
-
-> Notes for BE expansion:
->
-> 1. After BE expansion, Doris will automatically balance the data according
to the load, without affecting the use during the period.
-
-#### Delete BE nodes
-
-There are two ways to delete BE nodes: DROP and DECOMMISSION
-
-The DROP statement is as follows:
-
-```ALTER SYSTEM DROP BACKEND "be_host:be_heartbeat_service_port";```
-
-**Note: DROP BACKEND will delete the BE directly and the data on it will not
be recovered!!! So we strongly do not recommend DROP BACKEND to delete BE
nodes. When you use this statement, there will be corresponding error-proof
operation hints.**
-
-DECOMMISSION clause:
-
-```ALTER SYSTEM DECOMMISSION BACKEND "be_host:be_heartbeat_service_port";```
-
-> DECOMMISSION notes:
->
-> 1. This command is used to safely delete BE nodes. After the command is
issued, Doris attempts to migrate the data on the BE to other BE nodes, and
when all data is migrated, Doris automatically deletes the node.
-> 2. The command is an asynchronous operation. After execution, you can see
that the BE node's `SystemDecommissioned` status is true through ``SHOW PROC
'/backends';` Indicates that the node is offline.
-> 3. The order **does not necessarily carry out successfully**. For example,
when the remaining BE storage space is insufficient to accommodate the data on
the offline BE, or when the number of remaining machines does not meet the
minimum number of replicas, the command cannot be completed, and the BE will
always be in the state of `SystemDecommissioned` as true.
-> 4. The progress of DECOMMISSION can be viewed through `SHOW PROC
'/backends';` Tablet Num, and if it is in progress, Tablet Num will continue to
decrease.
-> 5. The operation can be carried out by:
-> ```CANCEL ALTER SYSTEM DECOMMISSION BACKEND
"be_host:be_heartbeat_service_port";```
-> The order was cancelled. When cancelled, the data on the BE will
maintain the current amount of data remaining. Follow-up Doris re-load balancing
-
-**For expansion and scaling of BE nodes in multi-tenant deployment
environments, please refer to the [Multi-tenant Design
Document](../admin-manual/maint-monitor/multi-tenant.md).**
-
-### Broker Expansion and Shrinkage
-
-There is no rigid requirement for the number of Broker instances. Usually one
physical machine is deployed. Broker addition and deletion can be accomplished
by following commands:
-
-```ALTER SYSTEM ADD BROKER broker_name "broker_host:broker_ipc_port";```
-```ALTER SYSTEM DROP BROKER broker_name "broker_host:broker_ipc_port";```
-```ALTER SYSTEM DROP ALL BROKER broker_name;```
-
-Broker is a stateless process that can be started or stopped at will. Of
course, when it stops, the job running on it will fail. Just try again.
## Common Questions
diff --git a/docs/install/source-install/compilation-with-ldb-toolchain.md
b/docs/install/source-install/compilation-with-ldb-toolchain.md
index d164a42e053..751ae1f588f 100644
--- a/docs/install/source-install/compilation-with-ldb-toolchain.md
+++ b/docs/install/source-install/compilation-with-ldb-toolchain.md
@@ -67,7 +67,7 @@ This works for most Linux distributions (CentOS, Ubuntu,
etc.).
2. [Apache Maven
3.6.3](https://doris-thirdparty-repo.bj.bcebos.com/thirdparty/apache-maven-3.6.3-bin.tar.gz)
3. [Node
v12.13.0](https://doris-thirdparty-repo.bj.bcebos.com/thirdparty/node-v12.13.0-linux-x64.tar.gz)
- If your environment is somehow minimal, additional packages should be
installed before compiling Doris. The following instructions describe how to
setup a minimal CentOS 6 box to compile Doris. Other linux distros should be
similar.
+ Different Linux distributions may include different components by default. Therefore, some additional components may need to be installed. The following takes CentOS 6 as an example; other distributions are similar:
```
# install required system packages
diff --git a/docs/install/source-install/compilation.md
b/docs/install/source-install/compilation.md
index 8c6c9d57b3d..e3184ee9389 100644
--- a/docs/install/source-install/compilation.md
+++ b/docs/install/source-install/compilation.md
@@ -59,7 +59,7 @@ This document focuses on how to code Doris through source
code.
| apache/incubator-doris:build-env-latest | before
[0efef1b](https://github.com/apache/doris/commit/0efef1b332300887ee0473f9df9bdd9d7297d824)
| |
| apache/doris:build-env-for-1.0.0| | 1.0.0 |
| apache/doris:build-env-for-1.1.0| | 1.1.0 |
-| apache/doris:build-env-ldb-toolchain-latest | trunk | trunk |
+| apache/doris:build-env-ldb-toolchain-latest | trunk | |
**note**:
@@ -133,11 +133,15 @@ This document focuses on how to code Doris through source
code.
>
> If you are using `build-env-for-0.15.0` or `build-env-for-1.0.0` for
the first time, use the following command when compiling:
>
- > `sh build.sh --clean --be --fe --ui`
+ >```
+ >sh build.sh --clean --be --fe --ui
+ >```
>
> `build-env-ldb-toolchain-latest` and later, use the following command:
>
- > ` sh build.sh --clean --be --fe`
+ >```
+ >sh build.sh --clean --be --fe
+ >```
>
> This is because from build-env-for-0.15.0, we upgraded thrift (0.9 ->
0.13), you need to use the --clean command to force the use of the new version
of thrift to generate code files, otherwise incompatible code will appear.
@@ -230,6 +234,13 @@ You can try to compile Doris directly in your own Linux
environment.
REPOSITORY_URL contains all third-party library source code packages
and their historical versions.
+3. `fatal error: Killed signal terminated program ...`
+
+ If you encounter the above error when compiling with a Docker image, the memory allocated to the container may be insufficient (by default Docker allocates 2GB of memory, and peak memory usage during compilation can exceed 2GB).
+
+ Try increasing the memory allocated to the container appropriately; 4GB ~ 8GB is recommended.
+
+
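
One way to raise the limit is at container startup (a sketch assuming the Docker CLI; the image tag is one listed in the table above, adjust to the image you actually use):

```
# Give the build container 8GB of memory (and the same swap ceiling):
docker run -it -m 8g --memory-swap 8g \
    apache/doris:build-env-ldb-toolchain-latest
```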
## Special statement
Starting from version 0.13, the dependency on the two third-party libraries
[1] and [2] will be removed in the default compiled output. These two
third-party libraries are under [GNU General Public License
V3](https://www.gnu.org/licenses/gpl-3.0.en.html). This license is incompatible
with [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0), so it
should not appear in the Apache release by default.
diff --git
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-table/data-model.md
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-table/data-model.md
index 9ac5d545737..3a73dbda8c4 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-table/data-model.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-table/data-model.md
@@ -344,7 +344,7 @@ PROPERTIES (
这种数据模型区别于 Aggregate 和 Unique 模型。数据完全按照导入文件中的数据进行存储,不会有任何聚合。即使两行数据完全相同,也都会保留。
而在建表语句中指定的 DUPLICATE KEY,只是用来指明底层数据按照那些列进行排序。(更贴切的名称应该为 “Sorted Column”,这里取名
“DUPLICATE KEY” 只是用以明确表示所用的数据模型。关于 “Sorted
Column”的更多解释,可以参阅[前缀索引](./index/prefix-index.md))。在 DUPLICATE KEY
的选择上,我们建议适当的选择前 2-4 列就可以。
-这种数据模型适用于既没有聚合需求,又没有主键唯一性约束的原始数据的存储。更多使用场景,可参阅**聚合模型的局限性**小节。
+这种数据模型适用于既没有聚合需求,又没有主键唯一性约束的原始数据的存储。更多使用场景,可参阅[聚合模型的局限性](./data-model.md#聚合模型的局限性)小节。
## 聚合模型的局限性
@@ -434,7 +434,7 @@ SELECT COUNT(*) FROM table;
所以,`select count(*) from table;` 的正确结果应该为 **4**。但如果我们只扫描 `user_id`
这一列,如果加上查询时聚合,最终得到的结果是 **3**(10001, 10002, 10003)。而如果不加查询时聚合,则得到的结果是
**5**(两批次一共5行数据)。可见这两个结果都是不对的。
-为了得到正确的结果,我们必须同时读取 `user_id` 和 `date` 这两列的数据,**再加上查询时聚合**,才能返回 **4**
这个正确的结果。也就是说,在 count(*) 查询中,Doris 必须扫描所有的 AGGREGATE KEY 列(这里就是 `user_id` 和
`date`),并且聚合后,才能得到语意正确的结果。当聚合列非常多时,count(*) 查询需要扫描大量的数据。
+为了得到正确的结果,我们必须同时读取 `user_id` 和 `date` 这两列的数据,**再加上查询时聚合**,才能返回 **4**
这个正确的结果。也就是说,在 `count(*)` 查询中,Doris 必须扫描所有的 AGGREGATE KEY 列(这里就是 `user_id` 和
`date`),并且聚合后,才能得到语意正确的结果。当聚合列非常多时,`count(*)` 查询需要扫描大量的数据。
因此,当业务上有频繁的 count(*) 查询时,我们建议用户通过增加一个**值恒为 1 的,聚合类型为 SUM 的列来模拟
count(\*)**。如刚才的例子中的表结构,我们修改如下:
@@ -451,13 +451,13 @@ SELECT COUNT(*) FROM table;
### Duplicate 模型
-Duplicate 模型没有聚合模型的这个局限性。因为该模型不涉及聚合语意,在做 count(*) 查询时,任意选择一列查询,即可得到语意正确的结果。
+Duplicate 模型没有聚合模型的这个局限性。因为该模型不涉及聚合语意,在做 `count(*)` 查询时,任意选择一列查询,即可得到语意正确的结果。
## 数据模型的选择建议
因为数据模型在建表时就已经确定,且**无法修改**。所以,选择一个合适的数据模型**非常重要**。
-1. Aggregate 模型可以通过预聚合,极大地降低聚合查询时所需扫描的数据量和查询的计算量,非常适合有固定模式的报表类查询场景。但是该模型对
count(*) 查询很不友好。同时因为固定了 Value 列上的聚合方式,在进行其他类型的聚合查询时,需要考虑语意正确性。
+1. Aggregate 模型可以通过预聚合,极大地降低聚合查询时所需扫描的数据量和查询的计算量,非常适合有固定模式的报表类查询场景。但是该模型对
`count(*)` 查询很不友好。同时因为固定了 Value 列上的聚合方式,在进行其他类型的聚合查询时,需要考虑语意正确性。
2. Unique 模型针对需要唯一主键约束的场景,可以保证主键唯一性约束。但是无法利用 ROLLUP 等预聚合带来的查询优势(因为本质是
REPLACE,没有 SUM 这种聚合方式)。
diff --git
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/get-starting/get-starting.md
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/get-starting/get-starting.md
index cd619a15fb0..f77b8715100 100644
---
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/get-starting/get-starting.md
+++
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/get-starting/get-starting.md
@@ -468,6 +468,7 @@ FE 将查询计划拆分成为 Fragment 下发到 BE 进行任务执行。BE 在
> 1. FE_HOST 是任一 FE 所在节点 IP,8030 为 fe.conf 中的 http_port。
> 2. 可以使用任一 BE 的 IP,以及 be.conf 中的 webserver_port 进行导入。如:`BE_HOST:8040`
> 3. example_db 是你导入表所在的数据库
+
本地文件 `table1_data` 以 `,` 作为数据之间的分隔,具体内容如下:
```text
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]