This is an automated email from the ASF dual-hosted git repository.
zykkk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git
The following commit(s) were added to refs/heads/master by this push:
new 4512569a3a [docs](releasenote)Update en release note 2.0.0 (#23041)
4512569a3a is described below
commit 4512569a3af522520b08738ecd8e1a72a224c6c8
Author: Hu Yanjun <[email protected]>
AuthorDate: Wed Aug 16 15:13:09 2023 +0800
[docs](releasenote)Update en release note 2.0.0 (#23041)
---
docs/en/docs/releasenotes/release-2.0.0.md | 302 ++++++++++++-----------------
docs/images/release-note-2.0.0-1.png | Bin 0 -> 201175 bytes
docs/images/release-note-2.0.0-2.png | Bin 0 -> 54381 bytes
docs/images/release-note-2.0.0-3.png | Bin 0 -> 159900 bytes
docs/images/release-note-2.0.0-4.png | Bin 0 -> 81409 bytes
docs/images/release-note-2.0.0-5.png | Bin 0 -> 511051 bytes
docs/images/release-note-2.0.0-6.png | Bin 0 -> 235717 bytes
docs/images/release-note-2.0.0-7.png | Bin 0 -> 394611 bytes
docs/images/release-note-2.0.0-8.png | Bin 0 -> 183457 bytes
9 files changed, 122 insertions(+), 180 deletions(-)
diff --git a/docs/en/docs/releasenotes/release-2.0.0.md
b/docs/en/docs/releasenotes/release-2.0.0.md
index 2dde250006..a24f1c2c71 100644
--- a/docs/en/docs/releasenotes/release-2.0.0.md
+++ b/docs/en/docs/releasenotes/release-2.0.0.md
@@ -25,269 +25,211 @@ under the License.
-->
-We are excited to announce the release of Apache Doris 2.0.0. We would like to
extend our heartfelt thanks to the 275 Apache Doris Contributors who have
committed over 4100 bug fixes and optimizations altogether. You are the driving
force behind all the new features and performance leap!
+We are more than excited to announce that, after six months of coding,
testing, and fine-tuning, Apache Doris 2.0.0 is now production-ready. Special
thanks to the 275 contributors who altogether committed over 4100 optimizations
and fixes to the project.
-> Download:
[https://doris.apache.org/download](https://doris.apache.org/download)
->
-> GitHub source code:
[https://github.com/apache/doris/tree/branch-2.0](https://github.com/apache/doris/tree/branch-2.0)
-
-In the middle of 2023, we are half way on our roadmap and many steps closer to
our visions that we put forward on Doris Summit 2022:
-
-> We want to build Apache Doris into an all-in-one platform that can serve
most of our users' needs so as to maximize their productivity while inducing
the least development and maintainence costs. Specifically, it should be
capable of data analytics in multiple scenarios, support both online and
offline workloads, and deliver lightning performance in both high-throughput
interactive analysis and high-concurrency point queries. Also, in response to
industry trends, it should be able to p [...]
+This new version highlights:
-Taking on these great visions means a path full of hurdles. We need to figure
out answers to all these difficult questions:
+- 10 times faster data queries
+- Enhanced log analysis and federated query capabilities
+- More efficient data writing and updates
+- Improved multi-tenant and resource isolation mechanisms
+- Progress in elastic scaling of resources and storage-compute separation
+- Enterprise-facing features for higher usability
-- How to ensure real-time data writing without compromising query service
stability?
-- How to ensure online service continuity during data updates and table schema
changes?
-- How to store and analyze structured and semi-structured data efficiently in
one place?
-- How to handle multiple workloads (point queries, reporting, ad-hoc queries,
ETL/ELT, etc.) at the same time and guarantee isolation of them?
-- How to enable efficient execution of complicated SQLs, stability of big
queries, and observability of execution?
-- How to integrate and access data lakes and many heterogenous data sources
more easily?
-- How to improve query performance while largely reducing storage and
computation costs?
+> Download: https://doris.apache.org/download
+>
+> GitHub source code: https://github.com/apache/doris/releases/tag/2.0.0-rc04
-We believe that life is miserable for either the developers or the users, so
we decided to tackle more challenges so that our users would suffer less. Now
we are happy to announce our progress with Apache Doris 2.0.0. These are what
you can expect from this new version:
+## **A 10 Times Performance Increase**
-# A 10 times Performance Increase
+In SSB-Flat and TPC-H benchmarking, Apache Doris 2.0.0 delivered **over
10 times faster query performance** than an earlier version of Apache Doris.
-High performance is what our users identify us with. It has been repeatedly
proven by the test results of Apache Doris on ClickBench and TPC-H benchmarks
during the past year. But there remain some differences between benchmarking
and real-life application:
+
-- Benchmarking simplifies and abstracts real-world scenarios so it might not
cover the complex query statements that are frequently seen in data analytics.
-- Query statements in benchmarking are often fine-tuned, but in real life,
fine-tuning is just too demanding, exhausting, and time-consuming.
+This is realized by the introduction of a smarter query optimizer, inverted
index, a parallel execution model, and a series of new functionalities to
support high-concurrency point queries.
-That's why we introduced a brand new query optimizer: Nereids. With a richer
statistical base and the advanced Cascades framework, Nereids is capable of
self-tuning in most query scenarios, so users can expect high performance
without any fine-tuning or SQL rewriting. What's more, it supports all 99 SQLs
in TPC-DS.
+### A smarter query optimizer
-Testing on 22 TPC-H SQLs showed that Nereids, with no human intervention,
**brought an over 10-time performance increase compared to the old query
optimizer**. Similar results were reported by dozens of users who have tried
Apache Doris 2.0 Alpha and Beta in their business scenarios. It has really
freed engineers from the burden of fine-tuning.
+The brand new query optimizer, Nereids, has a richer statistical base and
adopts the Cascades framework. It is capable of self-tuning in most query
scenarios and supports all 99 SQLs in TPC-DS, so users can expect high
performance without any fine-tuning or SQL rewriting.
-**Documentation**:
[https://doris.apache.org/docs/dev/query-acceleration/nereids/](https://doris.apache.org/docs/dev/query-acceleration/nereids/)
+TPC-H tests showed that Nereids, with no human intervention, outperformed the
old query optimizer by a wide margin. Over 100 users have tried Apache Doris
2.0.0 in their production environment and the vast majority of them reported
huge speedups in query execution.
-Nerieds is enabled by default in Apache Doris 2.0.0: `SET
enable_nereids_planner=true`.
+
-# Support for a Wider Range of Analytic Scenarios
+**Doc**: https://doris.apache.org/docs/dev/query-acceleration/nereids/
-## A 10 times more cost-effective log analysis solution
+Nereids is enabled by default in Apache Doris 2.0.0: `SET
enable_nereids_planner=true`. Nereids collects statistical data via the
`ANALYZE` command.
-From a simple OLAP engine for real-time reporting to a data warehouse that is
capable of ETL/ELT and more, Apache Doris has been pushing its boundaries. With
version 2.0, we are making breakthroughs in log analysis.
+### Inverted Index
-The common log analytic solutions within the industry are basically different
tradeoffs between high writing throughput, low storage cost, and fast text
retrieval. But Apache Doris 2.0 allows you to have them all. It has inverted
index that allows full-text searches on strings and equivalence/range queries
on numerics/datetime. Comparison tests with the same datasets in the same
hardware environment showed that Apache Doris was 4 times faster than
Elasticsearch in log data writing, 2 tim [...]
+In Apache Doris 2.0.0, we introduced inverted index to better support fuzzy
keyword search, equivalence queries, and range queries.
-We are also making advancements in multi-model data analysis. Apache Doris 2.0
supports two new data types: Map and Struct, as well as the quick writing,
functional analysis, and nesting of them.
+A smartphone manufacturer tested Apache Doris 2.0.0 in their user behavior
analysis scenarios. With inverted index enabled, v2.0.0 was able to finish the
queries within milliseconds and maintain stable performance as the query
concurrency level went up. In this case, it was 5 to 90 times faster than the
older version.
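+As an illustration only (table, index, and column names are hypothetical, and
the exact DDL should be verified against the inverted index docs), building
and querying an inverted index follows this general shape:
+```SQL
+CREATE TABLE app_logs (
+    ts DATETIME,
+    clientip VARCHAR(32),
+    message STRING,
+    INDEX idx_message (message) USING INVERTED PROPERTIES ("parser" = "english")
+)
+DUPLICATE KEY (ts)
+DISTRIBUTED BY RANDOM BUCKETS 10
+PROPERTIES ("replication_num" = "1");
+
+-- Full-text keyword search served by the inverted index
+SELECT count(*) FROM app_logs WHERE message MATCH_ANY 'timeout error';
+```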
-Read more:
[https://doris.apache.org/blog/Inverted%20Index](https://doris.apache.org/blog/Inverted%20Index)
+
-## High-concurrency data serving
+### 20 times higher concurrency capability
-In scenarios such as e-commerce order queries and express tracking, there will
be a huge number of end data users inputing queries for a small piece of data
simultaneously. These are high-concurrency point queries, which can bring huge
pressure on the system. A traditional solution is to introduce Key-Value stores
like Apache HBase for such queries, and Redis as a cache layer to ease the
burden, but that means redundant storage and higher maintainence costs.
+In scenarios like e-commerce order queries and express tracking, a huge number
of end data users search for a certain data record simultaneously. These are
what we call high-concurrency point queries, which can bring huge pressure on
the system. A traditional solution is to introduce Key-Value stores like Apache
HBase for such queries, and Redis as a cache layer to ease the burden, but that
means redundant storage and higher maintenance costs.
-For a column-oriented DBMS like Apache Doris, the I/O usage of point queries
will be multiplied. We need neater execution. Thus, we introduced row storage
format and row cache to increase row reading efficiency, short-circuit plans to
speed up data retrieval, and prepared statements to reduce frontend overheads.
+For a column-oriented DBMS like Apache Doris, the I/O usage of point queries
will be multiplied. We need neater execution. Thus, on the basis of columnar
storage, we added row storage format and row cache to increase row reading
efficiency, short-circuit plans to speed up data retrieval, and prepared
statements to reduce frontend overheads.
-After these optimizations, Apache Doris 2.0 reached a concurrency level of
**30,000 QPS per node** on YCSB on a 16 Core 64G cloud server with 4×1T hard
drives, representing an improvement of **20 times** compared to older versions.
This makes Apache Doris a good alternative to HBase in high-concurrency
scenarios.
+After these optimizations, Apache Doris 2.0 reached a concurrency level of
**30,000 QPS per node** on YCSB on a 16 Core 64G cloud server with 4×1T hard
drives, representing an improvement of **20 times** compared to its older
version. This makes Apache Doris a good alternative to HBase in
high-concurrency scenarios, so that users don't need to endure extra
maintenance costs and redundant storage brought by complicated tech stacks.
-Doc:
[https://doris.apache.org/blog/High_concurrency](https://doris.apache.org/blog/High_concurrency)
+Read more: https://doris.apache.org/blog/High_concurrency
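+Assuming hypothetical table and column names, a point-query-friendly setup
combining row storage and Merge-on-Write might be sketched as:
+```SQL
+CREATE TABLE orders (
+    order_id BIGINT,
+    user_id BIGINT,
+    status VARCHAR(16)
+)
+UNIQUE KEY (order_id)
+DISTRIBUTED BY HASH (order_id) BUCKETS 16
+PROPERTIES (
+    "enable_unique_key_merge_on_write" = "true",
+    "store_row_column" = "true"  -- keep a row-format copy for point queries
+);
+
+-- A primary-key lookup like this is eligible for the short-circuit plan
+SELECT * FROM orders WHERE order_id = 10001;
+```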
-## Enhanced data lakehouse capabilities
+### A self-adaptive parallel execution model
-In Apache Doris 1.2, we introduced Multi-Catalog to support the auto-mapping
and auto-synchronization of data from heterogeneous sources. After
optimizations in data reading, execution engine, and query optimizer, Doris 1.2
delivered a 3~5 times faster query performance than Presto/Trino in standard
tests.
+Apache Doris 2.0 brought in a Pipeline execution model for higher efficiency and
stability in hybrid analytic workloads. In this model, the execution of queries
is driven by data. The blocking operators in all query execution processes are
split into pipelines. Whether a pipeline gets an execution thread depends on
whether its relevant data is ready. This enables asynchronous blocking
operations and more flexible system resource management. Also, this improves
CPU efficiency as the system does [...]
-In Apache Doris 2.0, we extended the list of data sources supported and
conducted optimizations according to the actual production environment of users.
+Doc:
https://doris.apache.org/docs/dev/query-acceleration/pipeline-execution-engine/
-- More data sources supported
- - Supported snapshot queries on Hudi Copy-on-Write tables and read optimized
queries on Hudi Merge-on-Read tables; The support for incremental queries and
snapshot queries on Merge-on-Read tables is under plan.
- - Doc:
[https://doris.apache.org/docs/dev/lakehouse/multi-catalog/hudi/](https://doris.apache.org/docs/dev/lakehouse/multi-catalog/hudi/)
- - Supported connection to Oceanbase via JDBC Catalog, prolonging the list of
supported relational databases which include MySQL, PostgreSQL, Oracle,
SQLServer, Doris, ClickHouse, SAP HANA, and Trino/Presto.
- - Doc:
[https://doris.apache.org/docs/dev/lakehouse/multi-catalog/jdbc/](https://doris.apache.org/docs/dev/lakehouse/multi-catalog/jdbc/)
-- Data Privilege Control
- - Supported authorization for Hive Catalog via Apache Range; Supported
extensible privilege authorization plug-ins to allow user-defined authorization
method on any catalog.
- - Doc:
[https://doris.apache.org/docs/dev/lakehouse/multi-catalog/hive/](https://doris.apache.org/docs/dev/lakehouse/multi-catalog/hive/)
-- Performance improvement
- - Accelerated reading of flat tables and large numbers of small files;
improved query speed by dozens of times; reduced reading overhead by techniques
such as full loading of small files, I/O coalescing, and data pre-fetching.
- - Increased query speed on ORC/Parquet files by 100% compared to version 1.2.
+**How to enable the Pipeline execution model**
- - Supported local caching of lakehouse data. Users can cache data from HDFS
or object storage in local disks to speed up queries involving the same data.
In the case of cache hits, querying lakehouse data will be as quick as querying
internal data in Apache Doris.
- - Doc:
[https://doris.apache.org/docs/dev/lakehouse/filecache/](https://doris.apache.org/docs/dev/lakehouse/filecache/)
- - Supported collection of external table statistics. Users can obtain
statistics about their specified external tables via the Analyze statement so
that Nereids can fine-tune the query plan for complicated SQLs more
efficiently.
- - Doc:
[https://doris.apache.org/docs/dev/lakehouse/multi-catalog/](https://doris.apache.org/docs/dev/lakehouse/multi-catalog/)
- - Improved data writeback speed of JDBC Catalog. By way of PrepareStmt and
other batch methods, users can write data back to relational databases such as
MySQL and Oracle via the INSERT INTO command much faster.
+- The Pipeline execution engine is enabled by default in Apache Doris 2.0:
`Set enable_pipeline_engine = true`.
+- `parallel_pipeline_task_num` represents the number of pipeline tasks that
are executed in parallel in SQL queries. The default value is `0`, which
means Apache Doris will automatically set the concurrency level to half the
number of CPUs in each backend node. Users can change this value as needed.
+- For those who are upgrading to Apache Doris 2.0 from an older version, it is
recommended to set the value of `parallel_pipeline_task_num` to that of
`parallel_fragment_exec_instance_num` in the old version.
-# A Unified Platform for Multiple Analytic Workloads
+## A Unified Platform for Multiple Analytic Workloads
-## A self-adaptive parallel execution model
+Apache Doris has been pushing its boundaries. Starting as an OLAP engine for
reporting, it is now a data warehouse capable of ETL/ELT and more. Version 2.0
is making advancements in its log analysis and data lakehousing capabilities.
-With the expansion of user base, Apache Doris is undertaking more and more
analytic workloads while handling larger and larger data sizes. A big challenge
is to ensure high execution efficiency for all these workloads and avoid
resource preemption.
+### A 10 times more cost-effective log analysis solution
-Older versions of Apache Doris had a volcano-based execution engine. To give
full play to all the machines and cores, users had to set the execution
concurrency themselves (for example, change
`parallel_fragment_exec_instance_num` from the default value 1 to 8 or 16). But
problems existed when Doris had to deal with multiple queries at the same time:
+Apache Doris 2.0.0 provides native support for semi-structured data. In
addition to JSON and Array, it now supports a complex data type: Map. Based on
Light Schema Change, it also supports Schema Evolution, which means you can
adjust the schema as your business changes. You can add or delete fields and
indexes, and change the data types for fields. As we introduced inverted index
and a high-performance text analysis algorithm into it, it can execute
full-text search and dimensional analy [...]
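+Since Schema Evolution rides on Light Schema Change, it boils down to
ordinary DDL; a hypothetical example (table and column names are made up):
+```SQL
+-- Adding a field is a metadata-only change and takes effect almost instantly
+ALTER TABLE app_logs ADD COLUMN trace_id VARCHAR(64) DEFAULT "";
+
+-- Fields can be dropped the same way
+ALTER TABLE app_logs DROP COLUMN trace_id;
+```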
-- Instance operators took up the threads and the query tasks didn't get
executed. Logical deadlocks occurred.
-- Instance threads were scheduled by a system scheduling mechanism and the
switching between threads brought extra overheads.
-- When processing various analytic workloads, instance threads might fight for
CPU resources so queries and tenants might interrupt each other.
+
-Apache 2.0 brought in a Pipeline execution engine to solve these problems. In
this engine, the execution of queries are driven by data. The blocking
operators in all the query execution processes are split into pipelines.
Whether a pipeline gets an execution thread depends on whether its data is
ready. As a result:
+### Enhanced data lakehousing capabilities
-- The Pipeline execution model splits the query execution plan into pipeline
tasks based on the blocking logic and asynchronizes the blocking operations, so
no instance is going to take up a single thread for a long time.
-- It allows users to manage system resources more flexibly. They can take
different scheduling strategies to assign CPU resources to various queries and
tenants.
-- It also pools data from all buckets, so the number of instances will not be
limited by the number of buckets, and the system doesn't have to frequently
create and destroy threads.
+In Apache Doris 1.2, we introduced Multi-Catalog to allow for auto-mapping and
auto-synchronization of data from heterogeneous sources. In version 2.0.0, we
extended the list of supported data sources and optimized Doris based on
users' needs in production environments.
-With the Pipeline execution engine, Apache Doris is going to offer **faster
queries and higher stability in hybrid workload scenarios**.
+
-Doc:[
https://doris.apache.org/docs/dev/query-acceleration/pipeline-execution-engine/](
https://doris.apache.org/docs/dev/query-acceleration/pipeline-execution-engine/)
+Apache Doris 2.0.0 supports dozens of data sources including Hive, Hudi,
Iceberg, Paimon, MaxCompute, Elasticsearch, Trino, ClickHouse, and almost all
open lakehouse formats. It also supports snapshot queries on Hudi Copy-on-Write
tables and read optimized queries on Hudi Merge-on-Read tables. It allows for
authorization of Hive Catalog using Apache Ranger, so users can reuse their
existing privilege control system. Besides, it supports extensible
authorization plug-ins to enable user-de [...]
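+Hooking up an external source goes through a catalog definition; a minimal
sketch assuming a hypothetical Hive Metastore address:
+```SQL
+CREATE CATALOG hive_catalog PROPERTIES (
+    "type" = "hms",
+    "hive.metastore.uris" = "thrift://127.0.0.1:9083"
+);
+
+-- External tables can then be queried like local ones
+SELECT * FROM hive_catalog.tpch.lineitem LIMIT 10;
+```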
-The Pipeline execution engine is enabled by default in Apache Doris 2.0: `Set
enable_pipeline_engine = true`. `parallel_pipeline_task_num` represents the
number of pipeline tasks that are parallelly executed in SQL queries. The
default value of it is `0`, and users can change the value as they need. For
those who are upgrading to Apache Doris 2.0 from an older version, it is
recommended to set the value of `parallel_pipeline_task_num` to that of
`parallel_fragment_exec_instance_num` in t [...]
+TPC-H benchmark tests showed that Apache Doris 2.0.0 is 3~5 times faster than
Presto/Trino in queries on Hive tables. This is realized by the all-around
optimizations finished in this development cycle (in small file reading, flat
table reading, local file caching, ORC/Parquet file reading, Compute Nodes,
and statistics collection for external tables), together with the distributed
execution framework, vectorized execution engine, and query optimizer of
Apache Doris.
-## Workload management
+
-Based on the Pipeline execution engine, Apache Doris 2.0 divides the workloads
into Workload Groups for fine-grained management of memory and CPU resources.
+All this gives Apache Doris 2.0.0 an edge in data lakehousing scenarios. With
Doris, you can do incremental or overall synchronization of multiple upstream
data sources in one place, and expect much higher data query performance than
other query engines. The processed data can be written back to the sources or
provided for downstream systems. In this way, you can make Apache Doris your
unified data analytic gateway.
-By relating a query to a Workload Group, users can limit the percentage of
memory and CPU resources used by one query on the backend nodes and configure
memory soft limits for resource groups. When there is a cluster resource
shortage, the system will kill the largest memory-consuming tasks; when there
is sufficient cluster resources, once the Workload Groups use more resources
than expected, the idle cluster resources will be shared among multiple
Workload Groups and the system memory w [...]
+## Efficient Data Update
-```SQL
-create workload group if not exists etl_group
-properties (
- "cpu_share"="10",
- "memory_limit"="30%",
- "max_concurrency" = "10",
- "max_queue_size" = "20",
- "queue_timeout" = "3000"
-);
-```
-You can check the existing Workload Group via the `Show` command:
+Data update is important in real-time analysis, since users want the latest
data to always be accessible and want to update data flexibly, such as
updating a row or just a few columns, batch updating or deleting their
specified data, or even overwriting a whole data partition.
-## Query queue
+Efficient data updating has been another hill to climb in data analysis.
Apache Hive only supports updates on the partition level, while Hudi and
Iceberg do better in low-frequency batch updates instead of real-time updates
due to their Merge-on-Read and Copy-on-Write implementations.
-When creating a Workload Group, users can set a maximum query number for it.
Queries beyond that limit will wait for execution in the query queue.
+As for data updating, Apache Doris 2.0.0 is capable of:
-- `max_concurrency`: the maximum number of queries allowed by the current
Workload Group
-- `max_queue_size`: the length of query queue. After all spots are filled, any
new queries will be rejected.
-- `queue_timeout`: the waiting time of a query in a queue, measured in
miliseconds. If a queue has been waiting for longer than this limit, it will be
rejected.
+- **Faster data writing**: In the pressure tests with an online payment
platform, under 20 concurrent data writing tasks, Doris reached a writing
throughput of 300,000 records per second and maintained stability throughout
the over 10-hour continuous writing process.
+- **Partial column update**: Older versions of Doris implemented partial
column updates via `replace_if_not_null` in the Aggregate Key model. In 2.0.0,
we enable partial column updates in the Unique Key model. That means you can directly
write data from multiple source tables into a flat table, without having to
concatenate them into one output stream using Flink before writing. This method
avoids a complicated processing pipeline and the extra resource consumption.
You can simply specify t [...]
+- **Conditional update and deletion**: In addition to simple Update and
Delete operations, we support complicated conditional update and delete
operations on the basis of Merge-on-Write.
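+Assuming a hypothetical flat table, a partial column update on the Unique Key
model might be sketched as below (the session variable name is an assumption
and should be checked against the data-update docs for your version):
+```SQL
+SET enable_unique_key_partial_update = true;
+
+-- Only the listed columns are written; the other columns of each matching
+-- key keep their previous values
+INSERT INTO user_profile (user_id, last_login)
+VALUES (42, '2023-08-16 15:13:09');
+```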
-Doc:
[https://doris.apache.org/docs/dev/admin-manual/workload-group/](https://doris.apache.org/docs/dev/admin-manual/workload-group/)
+## Faster, More Stable, and Smarter Data Writing
-# Elastic Scaling and Storage-Compute Separation
+### Higher speed in data writing
-When it comes to computation and storage resources, what do users want?
-
-- **Elastic scaling of computation resources**: Scale up resources quickly in
peak times to increase efficiency and scale down in valley times to reduce
costs.
-- **Lower storage costs**: Use low-cost storage media and separate storage
from computation.
-- **Separation of workloads**: Isolate the computation resources of different
workloads to avoid preemption.
-- **Unified management of data**: Simply manage catalogs and data in one place.
-
-To separate storage and computation is a way to realize elastic scaling of
resources, but it demands more efforts in maintaining storage stability, which
determines the stability and continuity of OLAP services. To ensure storage
stability, we introduced mechanisms including cache management, computation
resource management, and garbage collection.
-
- In this respect, we divide our users into three groups after investigation:
-
-1. Users with no need for resource scaling
-2. Users requiring resource scaling, low storage costs, and workload
separation from Apache Doris
-3. Users who already have a stable large-scale storage system and thus require
an advanced compute-storage-separated architecture for efficient resource
scaling
+As part of our continuing effort to strengthen the real-time analytic
capability of Apache Doris, we have improved the end-to-end real-time data
writing capability of version 2.0.0. Benchmark tests reported higher throughput
in various writing methods:
-Apache Doris 2.0 provides two solutions to address the needs of the first two
types of users.
+- Stream Load, TPC-H 144G lineitem table, 48-bucket Duplicate table,
triple-replica writing: throughput increased by 100%
+- Stream Load, TPC-H 144G lineitem table, 48-bucket Unique Key table,
triple-replica writing: throughput increased by 200%
+- Insert Into Select, TPC-H 144G lineitem table, 48-bucket Duplicate table:
throughput increased by 50%
+- Insert Into Select, TPC-H 144G lineitem table, 48-bucket Unique Key table:
throughput increased by 150%
-1. **Compute nodes**. We introduced stateless compute nodes in version 2.0.
Unlike the mix nodes, the compute nodes do not save any data and are not
involved in workload balancing of data tablets during cluster scaling. Thus,
they are able to quickly join the cluster and share the computing pressure
during peak times. In addition, in data lakehouse analysis, these nodes will be
the first ones to execute queries on remote storage (HDFS/S3) so there will be
no resource competition between [...]
- 1. Doc:
[https://doris.apache.org/docs/dev/advanced/compute_node/](https://doris.apache.org/docs/dev/advanced/compute_node/)
-2. **Hot-cold data separation**. Hot/cold data refers to data that is
frequently/seldom accessed, respectively. Generally, it makes more sense to
store cold data in low-cost storage. Older versions of Apache Doris support
lifecycle management of table partitions: As hot data cooled down, it would be
moved from SSD to HDD. However, data was stored with multiple replicas on HDD,
which was still a waste. Now, in Apache Doris 2.0, cold data can be stored in
object storage, which is even chea [...]
- 1. Doc:
[https://doris.apache.org/docs/dev/advanced/cold_hot_separation/](https://doris.apache.org/docs/dev/advanced/cold_hot_separation/)
+### Greater stability in high-concurrency data writing
- 2. Read more:
[https://doris.apache.org/blog/HCDS/](https://doris.apache.org/blog/HCDS/)
+The sources of system instability often include small file merging, write
amplification, and the consequential disk I/O and CPU overheads. Hence, we
introduced Vertical Compaction and Segment Compaction in version 2.0.0 to
eliminate OOM errors in compaction and avoid the generation of too many segment
files during data writing. After such improvements, Apache Doris can write data
50% faster while **using only 10% of the memory that it previously used**.
-For the third type of users, the SelectDB team is going to contribute the
SelectDB Cloud Compute-Storage-Separation solution to the Apache Doris project.
The performance and stability of this solution has stood the test of hundreds
of companies in their production environment. The merging of the solution to
Apache Doris is underway.
+Read more: https://doris.apache.org/blog/Compaction
-# Faster, Stabler, and Smarter Data Writing
+### Auto-synchronization of table schema
-## Higher speed in data writing
+The latest Flink-Doris-Connector allows users to synchronize an entire
database (such as MySQL or Oracle) to Apache Doris in one simple step.
According to our test results, one single synchronization task can support the
real-time concurrent writing of thousands of tables. Users no longer need to go
through a complicated synchronization procedure because Apache Doris has
automated the process. Changes in the upstream data schema will be
automatically captured and dynamically updated to [...]
-As part of our continuing effort to strengthen the real-time analytic
capability of Apache Doris, we have improved the end-to-end real-time data
writing of version 2.0:
+Read more: https://doris.apache.org/blog/FDC
-- When tested with the TPC-H 100G standard dataset, Apache Doris 2.0 reached a
data loading speed of over 550MB/s for a single node with its `insert into
select` method, which was a **200% increase**. In triple-replica import of 144G
data, it delivered a single-node data loading speed of 121MB/s via the Stream
Load method, **up 400%** in system throughput.
-- We have introduced single-replica data loading into version 2.0. Apache
Doris guarantees high data reliability and system availability by its
multi-replica mechanism, but multi-replica writing also multiplies the CPU and
memory usage. Now Apache Doris only writes one data copy to the memory, and
then it synchronizes the storage file to other copies, so it can save a lot of
the computation resources. In large data ingestion, the single-replica loading
method can accelerate data ingestio [...]
+## A New Multi-Tenant Resource Isolation Solution
-## Greater stability in high-concurrency data writing
+The purpose of multi-tenant resource isolation is to avoid resource preemption
under heavy loads. To that end, older versions of Apache Doris adopted a hard
isolation plan based on Resource Groups: backend nodes of the same Doris
cluster would be tagged, and those with the same tag formed a Resource Group.
As data was ingested into the database, different data replicas would be
written into different Resource Groups, which would be responsible for different
workloads. For examp [...]
-The merging of small files, write amplification, and the consequential disk
I/O and CPU overheads are often the sources of system instability. Hence, we
introduced Vertical Compaction and Segment Compaction in version 2.0 to
eliminate OOM errors in compaction and avoid the generation of too many segment
files during data writing. After such improvements, Apache Doris can write data
50% faster while using only 10% of the memory that it previously used.
+
-Read more:
[https://doris.apache.org/blog/Compaction](https://doris.apache.org/blog/Compaction)
+This is an effective solution, but in practice, it happens that some Resource
Groups are heavily occupied while others sit idle. We wanted a more flexible
way to reduce the vacancy rate of resources. Thus, in 2.0.0, we introduced the
Workload Group resource soft limit.
-## Auto-synchronization of table schema
+
-The latest Flink-Doris-Connector allows users to synchronize the whole
database (such as MySQL) to Apache Doris by one simple step. According to our
test results, one single synchronization task can undertake the real-time
concurrent writing of thousands of tables. Apache Doris has automated the
updating of upstream table schema and data so users no longer need to go
through a complicated synchronization procedure. Also, changes in the upstream
data schema will be automatically captured [...]
+The idea is to divide workloads into groups to allow for flexible management
of CPU and memory resources. Apache Doris associates a query with a Workload
Group, and limits the percentage of CPU and memory that a single query can use
on a backend node. The memory soft limit can be configured and enabled by the
user.
-# Support for Partial Column Update in the Unique Key Model
+When there is a cluster-wide resource shortage, the system kills the query
tasks that consume the most memory; when cluster resources are sufficient, a
Workload Group that needs more resources than its limit can use the idle
cluster resources, giving full play to the system memory and ensuring stable
execution of queries. You can also prioritize the Workload Groups in terms of
resource allocation. In other words, you can decide
which tasks can be assign [...]
-Apache Doris 1.2 realized real-time writing and quick query execution at the
same time with the Merge-on-Write mechanism in the Unique Key Model. Now in
version 2.0, we have further improved the Unique Key Model. It supports partial
column update so when ingesting multiple source tables, users don't have to
merge them into one flat table in advance.
+Meanwhile, we introduced Query Queue in 2.0.0. Upon Workload Group creation,
you can set a maximum number of concurrent queries. Queries beyond that limit
will wait in the queue for execution. This reduces the system burden under
heavy workloads.
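
As a sketch of how the soft limit and Query Queue described above might be
configured (the group name `g1` is illustrative, and the exact property names
should be verified against the Workload Group documentation for your version):

```sql
-- Create a Workload Group with a CPU share and a memory limit;
-- enable_memory_overcommit turns the memory limit into a soft limit.
CREATE WORKLOAD GROUP IF NOT EXISTS g1
PROPERTIES (
    "cpu_share" = "10",
    "memory_limit" = "30%",
    "enable_memory_overcommit" = "true",
    -- Query Queue settings: at most 10 concurrent queries; up to 20 more
    -- wait in the queue, each for at most 3000 ms before being rejected.
    "max_concurrency" = "10",
    "max_queue_size" = "20",
    "queue_timeout" = "3000"
);

-- Bind the current session to the group.
SET workload_group = "g1";
```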
-On this basis, we have also enhanced the capability of Merge-on-Write. Apache
Doris 2.0 is 50% faster than Apache Doris 1.2 in large data writing and 10
times faster in high-concurrency data writing. A parallel processing mechanism
is available to avoid "publish timeout" (E-3115), and a more efficient
compaction mechanism is in place to prevent "too many versions" (E-235). All
this allows users to replace Merge-on-Read with Merge-on-Write in more
scenarios. Plus, partial column update ma [...]
+## Elastic Scaling and Storage-Compute Separation
-The execution of partial column update is simple.
+When it comes to computation and storage resources, what do users want?
-**Example (Stream Load):**
+- **Elastic scaling of computation resources**: Scale up resources quickly in
peak times to increase efficiency and scale down in valley times to reduce
costs.
+- **Lower storage costs**: Use low-cost storage media and separate storage
from computation.
+- **Separation of workloads**: Isolate the computation resources of different
workloads to avoid preemption.
+- **Unified management of data**: Simply manage catalogs and data in one place.
-Suppose that you have the following table schema:
+Separating storage and computation is a way to realize elastic scaling of
resources, but it demands more effort in maintaining storage stability, which
determines the stability and continuity of OLAP services. To ensure storage
stability, we introduced mechanisms including cache management, computation
resource management, and garbage collection.
-```Python
-mysql> desc user_profile;
-+------------------+-----------------+------+-------+---------+-------+
-| Field | Type | Null | Key | Default | Extra |
-+------------------+-----------------+------+-------+---------+-------+
-| id | INT | Yes | true | NULL | |
-| name | VARCHAR(10) | Yes | false | NULL | NONE |
-| age | INT | Yes | false | NULL | NONE |
-| city | VARCHAR(10) | Yes | false | NULL | NONE |
-| balance | DECIMALV3(9, 0) | Yes | false | NULL | NONE |
-| last_access_time | DATETIME | Yes | false | NULL | NONE |
-+------------------+-----------------+------+-------+---------+-------+
-```
+ In this respect, our investigation divides users into three groups:
-If you need to batch update the "balance" and "last access time" fields for
the last 10 seconds, you can put the updates in a CSV file as follows:
+1. Users with no need for resource scaling
+2. Users requiring resource scaling, low storage costs, and workload
separation from Apache Doris
+3. Users who already have a stable large-scale storage system and thus require
an advanced compute-storage-separated architecture for efficient resource
scaling
-```Python
-1,500,2023-07-03 12:00:01
-3,23,2023-07-03 12:00:02
-18,9999999,2023-07-03 12:00:03
-```
+Apache Doris 2.0 provides two solutions to address the needs of the first two
types of users.
-Then, add a header `partial_columns:true` and specify the relevant column
names in the the Stream Load command:
+1. **Compute nodes**. We introduced stateless compute nodes in version 2.0.
Unlike the mix nodes, the compute nodes do not save any data and are not
involved in workload balancing of data tablets during cluster scaling. Thus,
they are able to quickly join the cluster and share the computing pressure
during peak times. In addition, in data lakehouse analysis, these nodes will be
the first ones to execute queries on remote storage (HDFS/S3) so there will be
no resource competition between [...]
+ 1. Doc: https://doris.apache.org/docs/dev/advanced/compute_node/
+2. **Hot-cold data separation**. Hot/cold data refers to data that is
frequently/seldom accessed, respectively. Generally, it makes more sense to
store cold data in low-cost storage. Older versions of Apache Doris support
lifecycle management of table partitions: As hot data cooled down, it would be
moved from SSD to HDD. However, data was stored with multiple replicas on HDD,
which was still a waste. Now, in Apache Doris 2.0, cold data can be stored in
object storage, which is even chea [...]
+ 1. Read more: https://doris.apache.org/blog/HCDS/
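
As an illustrative sketch of the hot-cold separation above (the resource and
policy names are hypothetical, the S3 endpoint and credentials are
placeholders, and property spellings may vary between versions; check the
linked docs):

```sql
-- Register an S3 bucket as a remote storage resource for cold data.
CREATE RESOURCE "remote_s3"
PROPERTIES (
    "type" = "s3",
    "AWS_ENDPOINT" = "s3.us-east-1.amazonaws.com",
    "AWS_REGION" = "us-east-1",
    "AWS_BUCKET" = "doris-cold-data",
    "AWS_ROOT_PATH" = "cold",
    "AWS_ACCESS_KEY" = "your_ak",
    "AWS_SECRET_KEY" = "your_sk"
);

-- Data cools down and moves to S3 one day after ingestion.
CREATE STORAGE POLICY cold_1d
PROPERTIES (
    "storage_resource" = "remote_s3",
    "cooldown_ttl" = "1d"
);

-- Apply the policy at table creation time.
CREATE TABLE IF NOT EXISTS sales (
    id BIGINT,
    created DATETIME
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 10
PROPERTIES (
    "storage_policy" = "cold_1d"
);
```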
-```Python
-curl --location-trusted -u root: -H "partial_columns:true" -H
"column_separator:," -H "columns:id,balance,last_access_time" -T /tmp/test.csv
http://127.0.0.1:48037/api/db1/user_profile/_stream_load
-```
+For a neater separation of computation and storage, the VeloDB team is going
to contribute its Cloud Compute-Storage Separation solution to the Apache
Doris project. Its performance and stability have stood the test of hundreds
of companies in their production environments. The code merge is expected to
be finished by October this year, and all Apache Doris users will be able to
get an early taste of it in September.
-# Farewell to OOM
+## Enhanced Usability
-Memory management might not be on the priority list of users until there is a
memory problem. However, real-life analytics is full of extreme cases that are
challenging memory stability. In large computation tasks, OOM errors often
cause queries to fail or even result in a backend downtime.
+Apache Doris 2.0.0 also highlights some enterprise-facing functionalities.
-To solve this, we have improved the memory data structures, reconstructed the
MemTrackers, and introduced soft memory limits for queries and a GC mechansim
to cope with process memory overflow. The new memory management mechanism
allocates, caculates, and monitors memory more efficiently. According to
benchmark tests, pressure tests, and user feedback, it eliminates most memory
hotspots and backend downtime. Even if there is an OOM error, users can locate
the question spot based on the l [...]
+### Support for Kubernetes Deployment
-In a word, Apache Doris 2.0 is able to handle complicated computation and
large ETL/ELT operations with greater system stability.
+Older versions of Apache Doris communicate based on IP, so any host failure in
Kubernetes deployment that causes a POD IP drift will lead to cluster
unavailability. Now, version 2.0 supports FQDN. That means the failed Doris
nodes can recover automatically without human intervention, which lays the
foundation for Kubernetes deployment and elastic scaling.
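
To our understanding, FQDN mode is enabled via a config switch set before the
nodes first start (the switch name below is taken from the deployment docs and
should be verified for your version):

```
# fe.conf and be.conf — enable hostname-based (FQDN) communication
enable_fqdn_mode = true
```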
-Read more:
[https://doris.apache.org/blog/Memory_Management/](https://doris.apache.org/blog/Memory_Management/)
+### Support for Cross-Cluster Replication (CCR)
-# Support for Kubernetes Deployment
+Apache Doris 2.0.0 supports cross-cluster replication (CCR). Data changes at
the database/table level in the source cluster will be synchronized to the
target cluster. You can choose to replicate the incremental data or the overall
data.
-Older versions of Apache Doris communicate based on IP, so any host failure in
Kubernetes deployment that causes a POD IP drift will lead to cluster
unavailability. Now, version 2.0 supports FQDN. That means the failed Doris
nodes can recover automatically without human intervention, which lays the
foundation for Kubernetes deployment and elastic scaling.
+It also supports synchronization of DDL, which means DDL statements executed
in the source cluster can also be automatically replicated to the target
cluster.
-Doc:
[https://doris.apache.org/docs/dev/install/k8s-deploy/](https://doris.apache.org/docs/dev/install/k8s-deploy/)
+It is simple to configure and use CCR in Doris. Leveraging this
functionality, you can implement read-write separation and multi-datacenter
replication.
-# Support for Cross-Cluster Replication
+This feature allows for higher data availability, read/write workload
separation, and more efficient cross-datacenter replication.
-For data synchronization across multiple clusters, Apache Doris used to
require regular data backup and restoration via the Backup/Restore command. The
process required intermediate storage and came with high latency. Apache Doris
2.0.0 supports cross-cluster replication (CCR), which automates this process.
Data changes at the database/table level in the source cluster will be
synchronized to the target cluster. This feature allows for higher availability
of data, read/write workload sep [...]
+## Behavior Change
-# Behavior Change
+- Use rolling upgrade from 1.2-LTS to 2.0.0, and restart upgrade from preview
versions of 2.0 to 2.0.0;
+- The new query optimizer (Nereids) is enabled by default:
`enable_nereids_planner=true`;
+- All non-vectorized code has been removed from the system, so the
`enable_vectorized_engine` parameter no longer works;
+- A new parameter `enable_single_replica_compaction` has been added;
+- datev2, datetimev2, and decimalv3 are the default data types in table
creation; datev1, datetimev1, and decimalv2 are not supported in table
creation;
+- decimalv3 is the default data type for JDBC and Iceberg Catalog;
+- A new data type `AGG_STATE` has been added;
+- The cluster column has been removed from backend tables;
+- For better compatibility with BI tools, datev2 and datetimev2 are displayed
as date and datetime in the results of `show create table`;
+- max_openfiles and swap checks have been added to the backend startup
script, so the backend may fail to start if the system configuration is
inappropriate;
+- Password-free login is not allowed when accessing frontend on localhost;
+- If there is a Multi-Catalog in the system, by default, only data of the
internal catalog will be displayed when querying information schema;
+- A limit has been imposed on the depth of the expression tree. The default
value is 200;
+- Single quotes in returned array strings have been changed to double quotes;
+- The Doris processes are renamed to DorisFE and DorisBE.
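
For example, the new optimizer default listed above can be checked or
temporarily reverted per session via the named variable:

```sql
-- Check whether the new query optimizer is on (default: true in 2.0.0).
SHOW VARIABLES LIKE 'enable_nereids_planner';

-- Fall back to the legacy planner for the current session if needed.
SET enable_nereids_planner = false;
```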
-- 1.2-lts requires downtime to upgrade to 2.0.0, 2.0-alpha requires downtime
to upgrade to 2.0.0
-- Query optimizer switch default on ` enable_ Nereids_ Planner=true `;
-- Non vectorized code has been removed from the system, so 'enable'_
Vectorized_ The 'engine' parameter will no longer be effective;
-- Add Parameter ` enable_ Single_ Replica_ Compaction `;
-- By default, use datev2, datetimev2, and decimalv3 to create tables, but do
not support creating tables with datev1, datetimev1, and decimalv2;
-- Decimalv3 is used by default in JDBC and Iceberg Catalog;
-- Add AGG for date type_ State;
-- Remove the cluster column from the backend table;
-- For better compatibility with BI tools, when displaying create tables,
display date v2 and date timev2 as date and date time.
-- Added max in BE startup script_ Check for openfiles and swap, so if the
system configuration is not reasonable, be may fail to start;
-- Prohibit logging in without a password when accessing FE from localhost;
-- When there is a Multi Catalog in the system, querying the information schema
only displays data from the internal catalog by default;
-- Limited the depth of the expression tree, defaulting to 200;
-- The array string returns a single quotation mark to a double quotation mark;
-- Rename Doris' process names to DorisFE and DorisBE;
+## Embarking on the 2.0.0 Journey
+To make Apache Doris 2.0.0 production-ready, we invited hundreds of enterprise
users to engage in the testing and optimized it for better performance,
stability, and usability. In the next phase, we will continue responding to
user needs with agile release planning. We plan to launch 2.0.1 in late August
and 2.0.2 in September, as we keep fixing bugs and adding new features. We also
plan to release an early version of 2.1 in September to bring a few
long-requested capabilities to you. Fo [...]
-# Known Issues
-- There may be performance degradation using pooling of pipeline. It can be
worked around by `ADMIN SET FRONTEND CONFIG ("disable_shared_scan" = "true")`;
-- Can not perform deletion on table by non-key column:
https://github.com/apache/doris/pull/22673. Will be fixed in 2.0.1;
-- Wrong result for function time_to_sec:
https://github.com/apache/doris/pull/22656. Will be fixed in 2.0.1
+If you have any questions or ideas when investigating, testing, and deploying
Apache Doris, please find us on [Slack](https://t.co/ZxJuNJHXb2). Our
developers will be happy to hear them and provide targeted support.
diff --git a/docs/images/release-note-2.0.0-1.png
b/docs/images/release-note-2.0.0-1.png
new file mode 100644
index 0000000000..b858a76b0d
Binary files /dev/null and b/docs/images/release-note-2.0.0-1.png differ
diff --git a/docs/images/release-note-2.0.0-2.png
b/docs/images/release-note-2.0.0-2.png
new file mode 100644
index 0000000000..0c2e4bee61
Binary files /dev/null and b/docs/images/release-note-2.0.0-2.png differ
diff --git a/docs/images/release-note-2.0.0-3.png
b/docs/images/release-note-2.0.0-3.png
new file mode 100644
index 0000000000..1dc48b9b1d
Binary files /dev/null and b/docs/images/release-note-2.0.0-3.png differ
diff --git a/docs/images/release-note-2.0.0-4.png
b/docs/images/release-note-2.0.0-4.png
new file mode 100644
index 0000000000..91d1241eed
Binary files /dev/null and b/docs/images/release-note-2.0.0-4.png differ
diff --git a/docs/images/release-note-2.0.0-5.png
b/docs/images/release-note-2.0.0-5.png
new file mode 100644
index 0000000000..be43f32a11
Binary files /dev/null and b/docs/images/release-note-2.0.0-5.png differ
diff --git a/docs/images/release-note-2.0.0-6.png
b/docs/images/release-note-2.0.0-6.png
new file mode 100644
index 0000000000..8c968101e0
Binary files /dev/null and b/docs/images/release-note-2.0.0-6.png differ
diff --git a/docs/images/release-note-2.0.0-7.png
b/docs/images/release-note-2.0.0-7.png
new file mode 100644
index 0000000000..46c5c00ebb
Binary files /dev/null and b/docs/images/release-note-2.0.0-7.png differ
diff --git a/docs/images/release-note-2.0.0-8.png
b/docs/images/release-note-2.0.0-8.png
new file mode 100644
index 0000000000..8d209adf85
Binary files /dev/null and b/docs/images/release-note-2.0.0-8.png differ
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]