This is an automated email from the ASF dual-hosted git repository.
luzhijing pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new 087aec0dfe0 add blogs & update dead links (#319)
087aec0dfe0 is described below
commit 087aec0dfe00019e782ed3633fe1521f8256adba
Author: Hu Yanjun <[email protected]>
AuthorDate: Thu Oct 12 10:48:30 2023 +0800
add blogs & update dead links (#319)
---
...-Why-We-Went-from-ClickHouse-to-Apache-Doris.md | 2 +-
blog/Tencent-LLM.md | 2 +-
...ambda-Architecture-for-40-Faster-Performance.md | 97 ------
...s-a-next-generation-real-time-data-warehouse.md | 172 ++++++++++
blog/log-analysis-elasticsearch-vs-apache-doris.md | 352 +++++++++++++++++++++
...rom-clickhouse-to-apache-doris-what-happened.md | 161 ++++++++++
blog/release-note-2.0.0.md | 4 +-
static/images/Introduction_3.png | Bin 0 -> 149120 bytes
static/images/Introduction_4.png | Bin 0 -> 143564 bytes
static/images/LAS_1.png | Bin 0 -> 96696 bytes
static/images/LAS_2.png | Bin 0 -> 104517 bytes
static/images/LAS_3.png | Bin 0 -> 73001 bytes
static/images/LAS_4.png | Bin 0 -> 67765 bytes
static/images/LAS_5.png | Bin 0 -> 49078 bytes
static/images/introduction_1.png | Bin 0 -> 321089 bytes
static/images/introduction_10.png | Bin 0 -> 141948 bytes
static/images/introduction_11.png | Bin 0 -> 77773 bytes
static/images/introduction_2.png | Bin 0 -> 326940 bytes
static/images/introduction_5.png | Bin 0 -> 138772 bytes
static/images/introduction_6.png | Bin 0 -> 58618 bytes
static/images/introduction_7.png | Bin 0 -> 194930 bytes
static/images/introduction_8.png | Bin 0 -> 277049 bytes
static/images/introduction_9.png | Bin 0 -> 173567 bytes
static/images/youzan-1.png | Bin 0 -> 96929 bytes
static/images/youzan-2.png | Bin 0 -> 86410 bytes
static/images/youzan-3.png | Bin 0 -> 385862 bytes
static/images/youzan-4.png | Bin 0 -> 382306 bytes
static/images/youzan-5.png | Bin 0 -> 143511 bytes
static/images/youzan-6.png | Bin 0 -> 402306 bytes
29 files changed, 689 insertions(+), 101 deletions(-)
diff --git
a/blog/Tencent-Data-Engineers-Why-We-Went-from-ClickHouse-to-Apache-Doris.md
b/blog/Tencent-Data-Engineers-Why-We-Went-from-ClickHouse-to-Apache-Doris.md
index 8f5aae89c27..e93ab50fd45 100644
--- a/blog/Tencent-Data-Engineers-Why-We-Went-from-ClickHouse-to-Apache-Doris.md
+++ b/blog/Tencent-Data-Engineers-Why-We-Went-from-ClickHouse-to-Apache-Doris.md
@@ -1,6 +1,6 @@
---
{
- 'title': 'Tencent Data Engineer: Why We G from ClickHouse to Apache
Doris?',
+ 'title': 'Tencent Data Engineer: Why We Went from ClickHouse to Apache
Doris?',
'summary': "Evolution of the data processing architecture of Tencent Music
Entertainment towards better performance and simpler maintenance.",
'date': '2023-03-07',
'author': 'Jun Zhang & Kai Dai',
diff --git a/blog/Tencent-LLM.md b/blog/Tencent-LLM.md
index 3775c5a82d0..d8c39cb847c 100644
--- a/blog/Tencent-LLM.md
+++ b/blog/Tencent-LLM.md
@@ -28,7 +28,7 @@ specific language governing permissions and limitations
under the License.
-->
-Six months ago, I wrote about [why we replaced ClickHouse with Apache Doris as
an OLAP engine](https://doris.apache.org/blog/Tencent%20Music/) for our data
management system. Back then, we were struggling with the auto-generation of
SQL statements. As days pass, we have made progresses big enough to be
references for you (I think), so here I am again.
+Six months ago, I wrote about [why we replaced ClickHouse with Apache Doris as
an OLAP
engine](https://doris.apache.org/blog/Tencent-Data-Engineers-Why-We-Went-from-ClickHouse-to-Apache-Doris)
for our data management system. Back then, we were struggling with the
auto-generation of SQL statements. As days pass, we have made progresses big
enough to be references for you (I think), so here I am again.
We have adopted Large Language Models (LLM) to empower our Doris-based OLAP
services.
diff --git
a/blog/Zipping-up-the-Lambda-Architecture-for-40-Faster-Performance.md
b/blog/Zipping-up-the-Lambda-Architecture-for-40-Faster-Performance.md
deleted file mode 100644
index d38a2f7902b..00000000000
--- a/blog/Zipping-up-the-Lambda-Architecture-for-40-Faster-Performance.md
+++ /dev/null
@@ -1,97 +0,0 @@
----
-{
- 'title': 'Zipping up the Lambda Architecture for 40% Faster Performance',
- 'summary': "Instead of pooling real-time and offline data after they are
fully ready for queries, Douyu engineers use Apache Doris to share part of the
pre-query computation burden.",
- 'date': '2023-05-05',
- 'author': 'Tongyang Han',
- 'tags': ['Best Practice'],
-}
-
----
-
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements. See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership. The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied. See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-Author: Tongyang Han, Senior Data Engineer at Douyu
-
-The Lambda architecture has been common practice in big data processing. The
concept is to separate stream (real time data) and batch (offline data)
processing, and that's exactly what we did. These two types of data of ours
were processed in two isolated tubes before they were pooled together and ready
for searches and queries.
-
-
-
-Then we run into a few problems:
-
-1. **Isolation of real-time and offline data warehouses**
- 1. I know this is kind of the essence of Lambda architecture, but that
means we could not reuse real-time data since it was not layered as offline
data, so further customized development was required.
-2. **Complex Pipeline from Data Sources to Data Application**
- 1. Data had to go through multi-step processing before it reached our data
users. As our architecture involved too many components, navigating and
maintaining these tech stacks was a lot of work.
-3. **Lack of management of real-time data sources**
- 1. In extreme cases, this worked like a data silo and we had no way to
find out whether the ingested data was duplicated or reusable.
-
-So we decided to "zip up" the Lambda architecture a little bit. By "zipping
up", I mean to introduce an OLAP engine that is capable of processing, storing,
and analyzing data, so real-time data and offline data converge a little
earlier than they used to. It is not a revolution of Lambda, but a minor change
in the choice of components, which made our real-time data processing 40%
faster.
-
-## **Zipping up Lambda Architecture**
-
-I am going to elaborate on how this is done using our data tagging process as
an example.
-
-Previously, our offline tags were produced by the data warehouse, put into a
flat table, and then written in **HBase**, while real-time tags were produced
by **Flink**, and put into **HBase** directly. Then **Spark** would work as the
computing engine.
-
-
-
-The problem with this stemmed from the low computation efficiency of **Flink**
and **Spark**.
-
-- **Real-time tag production**: When computing real-time tags that involve
data within a long time range, Flink did not deliver stable performance and
consumed more resources than expected. And when a task failed, it would take a
really long time for checkpoint recovery.
-- **Tag query**: As a tag query engine, Spark could be slow.
-
-As a solution, we replaced **HBase** and **Spark** with **Apache Doris**, a
real-time analytic database, and moved part of the computational logic of the
foregoing wide-time-range real-time tags from **Flink** to **Apache Doris**.
-
-
-
-Instead of putting our flat tables in HBase, we place them in Apache Doris.
These tables are divided into partitions based on time sensitivity. Offline
tags will be updated daily while real-time tags will be updated in real time.
We organize these tables in the Aggregate Model of Apache Doris, which allows
partial update of data.
-
-Instead of using Spark for queries, we parse the query rules into SQL for
execution in Apache Doris. For pattern matching, we use Redis to cache the hot
data from Apache Doris, so the system can respond to such queries much faster.
-
-
-
-## **Computational Pipeline of Wide-Time-Range Real-Time Tags**
-
-In some cases, the computation of wide-time-range real-time tags entails the
aggregation of historical (offline) data with real-time data. The following
figure shows our old computational pipeline for these tags.
-
-
-
-As you can see, it required multiple tasks to finish computing one real-time
tag. Also, in complicated aggregations that involve a collection of aggregation
operations, any improper resource allocation could lead to back pressure or
waste of resources. This adds to the difficulty of task scheduling. The
maintenance and stability guarantee of such a long pipeline could be an issue,
too.
-
-To improve on that, we decided to move such aggregation workload to Apache
Doris.
-
-
-
-We have around 400 million customer tags in our system, and each customer is
attached with over 300 tags. We divide customers into more than 10,000 groups,
and we have to update 5000 of them on a daily basis. The above improvement has
sped up the computation of our wide-time-range real-time queries by **40%**.
-
-## Overwrite
-
-To atomically replace data tables and partitions in Apache Doris, we
customized the
[Doris-Spark-Connector](https://github.com/apache/doris-spark-connector), and
added an "Overwrite" mode to the Connector.
-
-When a Spark job is submitted, Apache Doris will call an interface to fetch
information of the data tables and partitions.
-
-- If it is a non-partitioned table, we create a temporary table for the target
table, ingest data into it, and then perform atomic replacement. If the data
ingestion fails, we clear the temporary table;
-- If it is a dynamic partitioned table, we create a temporary partition for
the target partition, ingest data into it, and then perform atomic replacement.
If the data ingestion fails, we clear the temporary partition;
-- If it is a non-dynamic partitioned table, we need to extend the
Doris-Spark-Connector parameter configuration first. Then we create a temporary
partition and take steps as above.
-
-## Conclusion
-
-One prominent advantage of Lambda architecture is the stability it provides.
However, in our practice, the processing of real-time data and offline data
sometimes intertwines. For example, the computation of certain real-time tags
requires historical (offline) data. Such interaction becomes a root cause of
instability. Thus, instead of pooling real-time and offline data after they are
fully ready for queries, we use an OLAP engine to share part of the pre-query
computation burden and mak [...]
diff --git
a/blog/introduction-to-apache-doris-a-next-generation-real-time-data-warehouse.md
b/blog/introduction-to-apache-doris-a-next-generation-real-time-data-warehouse.md
new file mode 100644
index 00000000000..7018172cdee
--- /dev/null
+++
b/blog/introduction-to-apache-doris-a-next-generation-real-time-data-warehouse.md
@@ -0,0 +1,172 @@
+---
+{
+ 'title': 'Introduction to Apache Doris: A Next-Generation Real-Time Data
Warehouse',
+ 'summary': "This is a technical overview of Apache Doris, introducing how
it enables fast query performance with its architectural design, features, and
mechanisms.",
+ 'date': '2023-10-03',
+ 'author': 'Apache Doris',
+ 'tags': ['Tech Sharing'],
+}
+
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## What is Apache Doris?
+
+[Apache Doris](https://doris.apache.org/) is an open-source real-time data
warehouse. It can collect data from various data sources, including relational
databases (MySQL, PostgreSQL, SQL Server, Oracle, etc.), logs, and time series
data from IoT devices. It is capable of reporting, ad-hoc analysis, federated
queries, and log analysis, so it can be used to support dashboarding,
self-service BI, A/B testing, user behavior analysis and the like.
+
+Apache Doris supports both batch import and stream writing. It can be well
integrated with Apache Spark, Apache Hive, Apache Flink, Airbyte, DBT, and
Fivetran. It can also connect to data lakes such as Apache Hive, Apache Hudi,
Apache Iceberg, Delta Lake, and Apache Paimon.
+
+
+
+## Performance
+
+As a real-time OLAP engine, Apache Doris has a competitive edge in query
speed. According to the TPC-H and SSB-Flat benchmarking results, Doris can
deliver much faster performance than Presto, Greenplum, and ClickHouse.
+
+As for its self-evolution, it has increased its query speed by over 10 times
in the past two years, in both complex queries and flat table analysis.
+
+
+
+## Architectural Design
+
+Behind the fast speed of Apache Doris are the architectural design, features,
and mechanisms described below.
+
+First of all, Apache Doris has a cost-based optimizer (CBO) that can figure
out the most efficient execution plan for complicated big queries. It has a
fully vectorized execution engine so it can reduce virtual function calls and
cache misses. It is MPP-based (Massively Parallel Processing) so it can give
full play to the user's machines and cores. In Doris, query execution is
data-driven, which means whether a query gets executed is determined by whether
its relevant data is ready, and [...]
+
+## Fast Point Queries for a Column-Oriented Database
+
+Apache Doris is a column-oriented database so it can make data compression and
data sharding easier and faster. But this might not be suitable for cases such
as customer-facing services. In these cases, a data platform will have to
handle requests from a large number of users concurrently (these are called
"high-concurrency point queries"), and having a columnar storage engine will
amplify I/O operations per second, especially when data is arranged in flat
tables.
+
+To fix that, Apache Doris enables hybrid storage, which means it maintains
row storage and columnar storage at the same time.
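
As a hedged sketch of how hybrid storage might be switched on (the table,
schema, and bucket count are hypothetical; `store_row_column` is the assumed
table property for keeping a row-format copy):

```sql
-- Hypothetical table: the columnar copy serves analytics, while the
-- row-format copy (enabled by "store_row_column") serves point queries
-- that fetch a whole row by key.
CREATE TABLE user_profile (
    user_id BIGINT,
    user_name VARCHAR(64),
    city VARCHAR(32)
)
UNIQUE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 8
PROPERTIES (
    "store_row_column" = "true"
);

-- A high-concurrency point query that benefits from the row copy:
SELECT * FROM user_profile WHERE user_id = 1001;
```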
+
+
+
+In addition, since point queries are all simple queries, it would be
unnecessary and wasteful to invoke the full query planner, so Doris executes a
short-circuit plan for them to reduce overhead.
+
+Another big source of overhead in high-concurrency point queries is SQL
parsing. For that, Doris provides prepared statements. The idea is to parse
SQL statements in advance and cache them, so they can be reused for similar
queries.
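
A minimal sketch of the idea in MySQL-protocol terms (table and variable
names are hypothetical; whether the textual PREPARE/EXECUTE form or a JDBC
client option is used depends on the Doris version):

```sql
-- The statement is parsed once and cached; repeated executions with new
-- parameters skip SQL parsing entirely.
PREPARE point_query FROM 'SELECT * FROM user_profile WHERE user_id = ?';
SET @uid = 1001;
EXECUTE point_query USING @uid;
```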
+
+
+
+## Data Ingestion
+
+Apache Doris provides a range of methods for data ingestion.
+
+**Real-Time stream writing**:
+
+- **[Stream
Load](https://doris.apache.org/docs/dev/data-operate/import/import-way/stream-load-manual?_highlight=stream&_highlight=loa)**:
You can apply this method to write local files or data streams via HTTP. It is
linearly scalable and can reach a throughput of 10 million records per second
in some use cases.
+-
**[Flink-Doris-Connector](https://doris.apache.org/docs/1.2/ecosystem/flink-doris-connector/)**:
With built-in Flink CDC, this Connector ingests data from OLTP databases to
Doris. So far, we have realized auto-synchronization of data from MySQL and
Oracle to Doris.
+- **[Routine
Load](https://doris.apache.org/docs/dev/data-operate/import/import-way/routine-load-manual)**:
This subscribes to data from Kafka message queues.
+- **[Insert
Into](https://doris.apache.org/docs/dev/data-operate/import/import-way/insert-into-manual)**:
This is especially useful when you try to do ETL in Doris internally, like
writing data from one Doris table to another.
+
+**Batch writing**:
+
+- **[Spark
Load](https://doris.apache.org/docs/dev/data-operate/import/import-way/spark-load-manual)**:
With this method, you can leverage Spark resources to pre-process data from
HDFS and object storage before writing to Doris.
+- **[Broker
Load](https://doris.apache.org/docs/dev/data-operate/import/import-way/broker-load-manual)**:
This supports HDFS and S3 protocol.
+- `insert into <internal table> select from <external table>`: This simple
statement allows you to connect Doris to various storage systems, data lakes,
and databases.
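
To illustrate one of the methods above, a hypothetical Routine Load job that
subscribes to a Kafka topic might look like this (database, table, topic, and
broker addresses are placeholders):

```sql
-- Continuously consumes messages from Kafka and writes them into the
-- user_events table in small batches.
CREATE ROUTINE LOAD example_db.load_user_events ON user_events
COLUMNS TERMINATED BY ","
PROPERTIES (
    "max_batch_interval" = "20"
)
FROM KAFKA (
    "kafka_broker_list" = "broker1:9092",
    "kafka_topic" = "user_events"
);
```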
+
+## Data Update
+
+For data updates, Apache Doris supports both Merge on Read and Merge on
Write: the former for low-frequency batch updates and the latter for real-time
writing. With Merge on Write, the latest data is ready by the time you execute
queries, which is why it can improve query speed by 5 to 10 times compared to
Merge on Read.
+
+From an implementation perspective, these are a few common data update
operations, and Doris supports them all:
+
+- **Upsert**: to replace or update a whole row
+- **Partial column update**: to update just a few columns in a row
+- **Conditional updating**: to filter out some data by combining a few
conditions in order to replace or delete it
+- **Insert Overwrite**: to rewrite a table or partition
+
+In some cases, data updates happen concurrently, which means numerous new
records arrive and try to modify the same existing data record, so the
updating order matters a lot. That's why Doris allows you to decide the order,
either by the order of transaction commit or by a sequence column (something
that you specify in the table in advance). Doris also supports data deletion
based on a specified predicate, and that's how conditional updating is done.
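
A sketch of these mechanics under stated assumptions (the table is
hypothetical; `enable_unique_key_merge_on_write` and
`function_column.sequence_col` are the assumed property names for Merge on
Write and the sequence column):

```sql
-- Merge on Write: the latest version of each order_id is materialized at
-- write time; update_time decides which concurrent update wins.
CREATE TABLE orders (
    order_id BIGINT,
    status VARCHAR(16),
    update_time DATETIME
)
UNIQUE KEY(order_id)
DISTRIBUTED BY HASH(order_id) BUCKETS 8
PROPERTIES (
    "enable_unique_key_merge_on_write" = "true",
    "function_column.sequence_col" = "update_time"
);

-- Upsert: a row with an existing order_id replaces the older version.
INSERT INTO orders VALUES (1, 'shipped', '2023-10-01 12:00:00');

-- Conditional updating via predicate-based deletion.
DELETE FROM orders WHERE status = 'cancelled';
```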
+
+## Service Availability & Data Reliability
+
+Apart from fast performance in queries and data ingestion, Apache Doris also
provides service availability guarantee, and this is how:
+
+Architecturally, Doris has two processes: frontend and backend, both of which
are easily scalable. The frontend nodes manage the clusters and metadata and
handle user requests; the backend nodes execute the queries and are capable of
auto data balancing and auto-restoration. Doris supports cluster upgrading and
scaling without interrupting services.
+
+
+
+## Cross Cluster Replication
+
+Enterprise users, especially those in finance or e-commerce, need to back up
their clusters or their entire data center, just in case of force majeure. So
Doris 2.0 provides Cross Cluster Replication (CCR). With CCR, users can do a
lot:
+
+- **Disaster recovery**: for quick restoration of data services
+- **Read-write separation**: master cluster + slave cluster; one for reading,
one for writing
+- **Isolated upgrade of clusters**: For cluster scaling, CCR allows users to
pre-create a backup cluster for a trial run so they can clear out the possible
incompatibility issues and bugs.
+
+Tests show that Doris CCR can reach a data latency of minutes. In the best
case, it can reach the upper speed limit of the hardware environment.
+
+
+
+## Multi-Tenant Management
+
+Apache Doris has sophisticated Role-Based Access Control, and it allows
fine-grained privilege control on the level of databases, tables, rows, and
columns.
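
As a hedged sketch (role, user, and policy names are hypothetical, and the
row-policy syntax is an assumption based on Doris 2.0):

```sql
-- Database/table-level privileges through a role...
CREATE ROLE analyst;
GRANT SELECT_PRIV ON example_db.orders TO ROLE 'analyst';
CREATE USER 'alice' IDENTIFIED BY 'password' DEFAULT ROLE 'analyst';

-- ...and a row-level policy restricting which rows 'alice' can read.
CREATE ROW POLICY region_filter ON example_db.orders
AS RESTRICTIVE TO 'alice' USING (city = 'Beijing');
```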
+
+
+
+For resource isolation, Doris used to implement a hard isolation plan: the
backend nodes were divided into Resource Groups, and the Resource Groups were
assigned to different workloads. It was simple and neat, but sometimes users
could not make the most of their computing resources because some Resource
Groups sat idle.
+
+
+
+Thus, instead of Resource Groups, Doris 2.0 introduces Workload Groups. A
soft limit is set for each Workload Group on how many resources it can use.
When that soft limit is hit while there are idle resources available, the idle
resources are shared across the workload groups. Users can also prioritize the
workload groups in terms of their access to idle resources.
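
A sketch of defining such a group (the group name and values are
hypothetical; the property keys are assumptions based on Doris 2.0):

```sql
-- cpu_share is a soft limit: the group can borrow idle CPU from other
-- groups and is throttled back when they become busy.
CREATE WORKLOAD GROUP IF NOT EXISTS etl_group
PROPERTIES (
    "cpu_share" = "10",
    "memory_limit" = "30%",
    "enable_memory_overcommit" = "true"
);
```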
+
+
+
+## Easy to Use
+
+As many capabilities as Apache Doris provides, it is also easy to use. It
supports standard SQL and is compatible with MySQL protocol and most BI tools
on the market.
+
+Another effort that we've made to improve usability is a feature called Light
Schema Change: if users need to add or delete some columns in a table, they
only need to update the metadata in the frontend rather than modify all the
data files. Light Schema Change can be done within milliseconds. It also
allows changes to indexes and to the data types of columns. Combined with the
Flink-Doris-Connector, it enables synchronization of upstream table schema
changes within milliseconds.
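
In practice, a light schema change is just an ordinary ALTER statement (table
and column names are hypothetical); since only frontend metadata is touched,
it returns in milliseconds:

```sql
-- Adds and drops columns without rewriting any data files.
ALTER TABLE user_profile ADD COLUMN age INT DEFAULT "0";
ALTER TABLE user_profile DROP COLUMN city;
```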
+
+## Semi-Structured Data Analysis
+
+Common examples of semi-structured data include logs, observability data, and
time series data. These cases require schema-free support, lower cost, and
capabilities in multi-dimensional analysis and full-text search.
+
+In text analysis, people mostly use the LIKE operator, so we put a lot of
effort into improving its performance, including pushing the LIKE operator
down to the storage layer (to reduce data scanning), and introducing the NGram
BloomFilter, the Hyperscan regex matching library, and the Volnitsky algorithm
(for sub-string matching).
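
A sketch of declaring such an index (the table is hypothetical; the
`NGRAM_BF` index syntax and its properties are assumptions):

```sql
-- The NGram BloomFilter index prunes data blocks that cannot contain the
-- searched substring, reducing scanning for LIKE '%...%' predicates.
CREATE TABLE app_log (
    ts DATETIME,
    message STRING,
    INDEX idx_msg (message) USING NGRAM_BF
        PROPERTIES ("gram_size" = "3", "bf_size" = "256")
)
DUPLICATE KEY(ts)
DISTRIBUTED BY HASH(ts) BUCKETS 8;

SELECT * FROM app_log WHERE message LIKE '%timeout%';
```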
+
+
+
+We have also introduced inverted indexes for text tokenization. They are a
powerful tool for fuzzy keyword search, full-text search, equivalence queries,
and range queries.
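
A sketch of how an inverted index with tokenization might be declared and
queried (hypothetical table; `USING INVERTED` and the `MATCH_ANY`
keyword-search predicate are the assumed Doris 2.0 syntax):

```sql
-- The "parser" property enables tokenization, turning the index into a
-- full-text search structure.
CREATE TABLE service_log (
    ts DATETIME,
    message STRING,
    INDEX idx_inv (message) USING INVERTED PROPERTIES ("parser" = "english")
)
DUPLICATE KEY(ts)
DISTRIBUTED BY HASH(ts) BUCKETS 8;

-- Keyword search: matches rows whose message contains any of the terms.
SELECT * FROM service_log WHERE message MATCH_ANY 'error timeout';
```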
+
+## Data Lakehouse
+
+For users to build a high-performing data lakehouse and a unified query
gateway, Doris can map, cache, and auto-refresh the metadata from external
sources. It supports Hive Metastore and almost all open data lakehouse
formats. You can connect it to relational databases, Elasticsearch, and many
other sources. And it allows you to reuse your own authentication systems,
like Kerberos and Apache Ranger, on the external tables.
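
A minimal sketch of mapping an external source (the metastore URI and
database/table names are placeholders):

```sql
-- Once the catalog is created, Hive tables are queryable like internal
-- ones, with metadata cached and auto-refreshed by Doris.
CREATE CATALOG hive_catalog PROPERTIES (
    "type" = "hms",
    "hive.metastore.uris" = "thrift://hms-host:9083"
);

SELECT * FROM hive_catalog.sales_db.orders LIMIT 10;
```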
+
+Benchmark results show that Apache Doris is 3~5 times faster than Trino in
queries on Hive tables. It is the joint result of a few features:
+
+1. Efficient query engine
+2. Hot data caching mechanism
+3. Compute nodes
+4. Views in Doris
+
+The [Compute Nodes](https://doris.apache.org/docs/dev/advanced/compute-node)
are a solution newly introduced in version 2.0 for data lakehousing. Unlike
normal backend nodes, Compute Nodes are stateless and do not store any data,
nor are they involved in data balancing during cluster scaling. Thus, they
can join the cluster flexibly and easily during computation peak times.
+
+Also, Doris allows you to write the computation results of external tables
into Doris to form a view. The thinking is similar to materialized views:
trading space for speed. After a query on external tables is executed, the
results can be stored in Doris internally. When similar queries follow up, the
system can directly read the results of previous queries from Doris, and that
speeds things up.
+
+## Tiered Storage
+
+The main purpose of tiered storage is to save money. [Tiered
storage](https://doris.apache.org/docs/dev/advanced/cold-hot-separation?_highlight=cold)
means separating hot data and cold data into different storage, with hot data
being the data that is frequently accessed and cold data the data that isn't.
It allows users to put hot data on quick but expensive disks (such as SSD and
HDD), and cold data in object storage.
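
A sketch of wiring this up (resource, policy, endpoint, and credential values
are placeholders; the statement shapes are assumptions based on the Doris 2.0
cold-hot separation feature):

```sql
-- An object-storage resource for cold data...
CREATE RESOURCE "remote_s3" PROPERTIES (
    "type" = "s3",
    "s3.endpoint" = "s3.us-east-1.amazonaws.com",
    "s3.region" = "us-east-1",
    "s3.bucket" = "doris-cold-data",
    "s3.root.path" = "cold",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk"
);

-- ...and a policy that moves data there after it has cooled down.
CREATE STORAGE POLICY cold_policy PROPERTIES (
    "storage_resource" = "remote_s3",
    "cooldown_ttl" = "2592000"
);
-- Tables opt in via PROPERTIES ("storage_policy" = "cold_policy").
```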
+
+
+
+Roughly speaking, for a data asset consisting of 80% cold data, tiered storage
will reduce your storage cost by 70%.
+
+## The Apache Doris Community
+
+This is an overview of Apache Doris, an open-source real-time data warehouse.
It is actively evolving with an agile release schedule, and the
[community](https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-1t3wfymur-0soNPATWQ~gbU8xutFOLog)
embraces any questions, ideas, and feedback.
diff --git a/blog/log-analysis-elasticsearch-vs-apache-doris.md
b/blog/log-analysis-elasticsearch-vs-apache-doris.md
new file mode 100644
index 00000000000..28c57378f85
--- /dev/null
+++ b/blog/log-analysis-elasticsearch-vs-apache-doris.md
@@ -0,0 +1,352 @@
+---
+{
+ 'title': 'Log Analysis: Elasticsearch VS Apache Doris',
+ 'summary': "As a major part of a company's data asset, logs bring value
to businesses in three aspects: system observability, cyber security, and data
analysis. They are your first resort for troubleshooting, your reference for
improving system security, and your data mine where you can extract
information that points to business growth.",
+ 'date': '2023-09-28',
+ 'author': 'Apache Doris',
+ 'tags': ['Tech Sharing'],
+}
+
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+As a major part of a company's data asset, logs bring value to businesses in
three aspects: system observability, cyber security, and data analysis. They
are your first resort for troubleshooting, your reference for improving system
security, and your data mine where you can extract information that points to
business growth.
+
+Logs are the sequential records of events in the computer system. If you think
about how logs are generated and used, you will know what an ideal log analysis
system should look like:
+
+- **It should have schema-free support.** Raw logs are unstructured free
texts that are basically impossible to aggregate or calculate on, so you had
to turn them into structured tables (the process is called "ETL") before
putting them into a database or data warehouse for analysis. If there was a
schema change, lots of complicated adjustments had to be made to the ETL
pipeline and the structured tables. Therefore, semi-structured logs, mostly in
JSON format, emerged. You can add or delete fields i [...]
+- **It should be low-cost.** Logs are huge and they are generated
continuously. A fairly big company produces 10~100 TBs of log data. For
business or compliance reasons, it should keep the logs around for half a year
or longer. That means storing log data measured in petabytes, so the cost is
considerable.
+- **It should be capable of real-time processing.** Logs should be written in
real time, otherwise engineers won't be able to catch the latest events in
troubleshooting and security tracking. Plus, a good log system should provide
full-text searching capabilities and respond to interactive queries quickly.
+
+## The Elasticsearch-Based Log Analysis Solution
+
+A popular log processing solution within the data industry is the **ELK stack:
Elasticsearch, Logstash, and Kibana**. The pipeline can be split into five
modules:
+
+- **Log collection**: Filebeat collects local log files and writes them to a
Kafka message queue.
+- **Log transmission**: Kafka message queue gathers and caches logs.
+- **Log transfer**: Logstash filters and transfers log data in Kafka.
+- **Log storage**: Logstash writes logs in JSON format into Elasticsearch for
storage.
+- **Log query**: Users search for logs via Kibana visualization or send a
query request via Elasticsearch DSL API.
+
+
+
+The ELK stack has outstanding real-time processing capabilities, but frictions
exist.
+
+### Inadequate Schema-Free Support
+
+The Index Mapping in Elasticsearch defines the table schema, which includes
the field names, data types, and whether to enable index creation.
+
+
+
+Elasticsearch also boasts a Dynamic Mapping mechanism that automatically adds
fields to the Mapping according to the input JSON data. This provides some sort
of schema-free support, but it's not enough because:
+
+- Dynamic Mapping often creates too many fields when processing dirty data,
which interrupts the whole system.
+- The data type of fields is immutable. To ensure compatibility, users often
configure "text" as the data type, but that results in much slower query
performance than binary data types such as integer.
+- The index of fields is immutable, too. Users cannot add or delete indexes
for a certain field, so they often create indexes for all fields to facilitate
data filtering in queries. But too many indexes require extra storage space and
slow down data ingestion.
+
+### Inadequate Analytic Capability
+
+Elasticsearch has its unique Domain Specific Language (DSL), which is very
different from the tech stack that most data engineers and analysts are
familiar with, so there is a steep learning curve. Moreover, Elasticsearch has
a relatively closed ecosystem so there might be strong resistance in
integration with BI tools. Most importantly, Elasticsearch only supports
single-table analysis and is lagging behind the modern OLAP demands for
multi-table join, sub-query, and views.
+
+
+
+### High Cost & Low Stability
+
+Elasticsearch users have been complaining about the computation and storage
costs. The root reason lies in the way Elasticsearch works.
+
+- **Computation cost**: In data writing, Elasticsearch also executes
compute-intensive operations including inverted index creation, tokenization,
and inverted index ranking. Under these circumstances, data is written into
Elasticsearch at a speed of around 2MB/s per core. When CPU resources are
tight, data writing requests often get rejected during peak times, which
further leads to higher latency.
+- **Storage cost**: To speed up retrieval, Elasticsearch stores the forward
indexes, inverted indexes, and docvalues of the original data, consuming a lot
more storage space. The compression ratio of a single data copy is only 1.5:1,
compared to the 5:1 in most log solutions.
+
+As data and cluster size grows, maintaining stability can be another issue:
+
+- **During data writing peaks**: Clusters are prone to overload during data
writing peaks.
+
+- **During queries**: Since all queries are processed in the memory, big
queries can easily lead to JVM OOM.
+
+- **Slow recovery**: For a cluster failure, Elasticsearch should reload
indexes, which is resource-intensive, so it will take many minutes to recover.
That challenges service availability guarantee.
+
+
+
+## A More Cost-Effective Option
+
+Reflecting on the strengths and limitations of the Elasticsearch-based
solution, the Apache Doris developers have optimized Apache Doris for log
processing.
+
+- **Increase writing throughput**: The performance of Elasticsearch is
bottlenecked by data parsing and inverted index creation, so we improved
Apache Doris in these respects: we accelerated data parsing and index creation
using SIMD CPU vector instructions, and we removed data structures unnecessary
for log analysis scenarios, such as forward indexes, to simplify index
creation.
+- **Reduce storage costs**: We removed forward indexes, which represented 30%
of index data. We adopted columnar storage and the ZSTD compression algorithm,
and thus achieved a compression ratio of 5:1 to 10:1. Given that a large part
of the historical logs are rarely accessed, we introduced tiered storage to
separate hot and cold data. Logs that are older than a specified time period
will be moved to object storage, which is much less expensive. This can reduce
storage costs by around 70%.
+
+Benchmark tests with ES Rally, the official testing tool for Elasticsearch,
showed that Apache Doris was around 5 times as fast as Elasticsearch in data
writing, 2.3 times as fast in queries, and it consumed only 1/5 of the storage
space that Elasticsearch used. On the test dataset of HTTP logs, it achieved a
writing speed of 550 MB/s and a compression ratio of 10:1.
+
+
+
+The figure below shows what a typical Doris-based log processing system looks like. It is more inclusive and allows for more flexible usage across data ingestion, analysis, and application:
+
+- **Ingestion**: Apache Doris supports various ingestion methods for log data. You can push logs to Doris via HTTP Output using Logstash, you can use Flink to pre-process the logs before you write them into Doris, or you can load logs from Kafka or object storage to Doris via Routine Load and S3 Load.
+- **Analysis**: You can put log data in Doris and conduct join queries across
logs and other data in the data warehouse.
+- **Application**: Apache Doris is compatible with MySQL protocol, so you can
integrate a wide variety of data analytic tools and clients to Doris, such as
Grafana and Tableau. You can also connect applications to Doris via JDBC and
ODBC APIs. We are planning to build a Kibana-like system to visualize logs.
+
+
+
+Moreover, Apache Doris has better schema-free support and a more user-friendly analytic engine.
+
+### Native Support for Semi-Structured Data
+
+**Firstly, we worked on the data types.** We optimized string search and regular expression matching for "text" through vectorization and brought a performance increase of 2~10 times. For JSON strings, Apache Doris parses and stores them in a more compact and efficient binary format, which can speed up queries by 4 times. We also added new data types for complicated data: Array and Map. They can structure concatenated strings to allow for higher compression ratios and faster queries.
+
+**Secondly, Apache Doris supports schema evolution.** This means you can
adjust the schema as your business changes. You can add or delete fields and
indexes, and change the data types for fields.
+
+Apache Doris provides Light Schema Change capabilities, so you can add or
delete fields within milliseconds:
+
+```SQL
+-- Add a column. Result will be returned in milliseconds.
+ALTER TABLE lineitem ADD COLUMN l_new_column INT;
+```
+
+
+
+You can also add indexes only for your target fields, so you can avoid the overhead of unnecessary index creation. After you add an index, by default, the system generates the index for all incremental data, and you can specify which historical data partitions need the index.
+
+```SQL
+-- Add inverted index. Doris will generate inverted index for all new data
afterward.
+ALTER TABLE table_name ADD INDEX index_name(column_name) USING INVERTED;
+
+-- Build index for the specified historical data partitions.
+BUILD INDEX index_name ON table_name PARTITIONS(partition_name1,
partition_name2);
+```
+
+### SQL-Based Analytic Engine
+
+The SQL-based analytic engine makes sure that data engineers and analysts can smoothly grasp Apache Doris in a short time and bring their experience with SQL to this OLAP engine. Building on the rich features of SQL, users can execute data retrieval, aggregation, multi-table joins, sub-queries, UDFs, logical views, and materialized views to serve their own needs.
+
+With MySQL compatibility, Apache Doris can be integrated with most GUI and BI
tools in the big data ecosystem, so users can realize more complex and
diversified data analysis.
+
+### Performance in Use Case
+
+A gaming company has transitioned from the ELK stack to the Apache Doris
solution. Their Doris-based log system used 1/6 of the storage space that they
previously needed.
+
+A cybersecurity company built their log analysis system utilizing the inverted index in Apache Doris. They supported a data writing speed of 300,000 rows per second with 1/5 of the server resources that they formerly used.
+
+## Hands-On Guide
+
+Now let's go through the three steps of building a log analysis system with
Apache Doris.
+
+Before you start, [download](https://doris.apache.org/download/) Apache Doris
2.0 or newer versions from the website and
[deploy](https://doris.apache.org/docs/dev/install/standard-deployment/)
clusters.
+
+### Step 1: Create Tables
+
+This is an example of table creation.
+
+Explanations for the configurations:
+
+- The DATETIMEV2 time field is specified as the Key in order to speed up
queries for the latest N log records.
+- Indexes are created for the frequently accessed fields, and fields that
require full-text search are specified with Parser parameters.
+- "PARTITION BY RANGE" means to partition the data by RANGE based on time
fields, [Dynamic
Partition](https://doris.apache.org/docs/dev/advanced/partition/dynamic-partition/)
is enabled for auto-management.
+- "DISTRIBUTED BY RANDOM BUCKETS AUTO" means to distribute the data into
buckets randomly and the system will automatically decide the number of buckets
based on the cluster size and data volume.
+- "log_policy_1day" and "log_s3" means to move logs older than 1 day to S3
storage.
+
+```SQL
+CREATE DATABASE log_db;
+USE log_db;
+
+CREATE RESOURCE "log_s3"
+PROPERTIES
+(
+ "type" = "s3",
+ "s3.endpoint" = "your_endpoint_url",
+ "s3.region" = "your_region",
+ "s3.bucket" = "your_bucket",
+ "s3.root.path" = "your_path",
+ "s3.access_key" = "your_ak",
+ "s3.secret_key" = "your_sk"
+);
+
+CREATE STORAGE POLICY log_policy_1day
+PROPERTIES(
+ "storage_resource" = "log_s3",
+ "cooldown_ttl" = "86400"
+);
+
+CREATE TABLE log_table
+(
+ `ts` DATETIMEV2,
+ `clientip` VARCHAR(20),
+ `request` TEXT,
+ `status` INT,
+ `size` INT,
+ INDEX idx_size (`size`) USING INVERTED,
+ INDEX idx_status (`status`) USING INVERTED,
+ INDEX idx_clientip (`clientip`) USING INVERTED,
+ INDEX idx_request (`request`) USING INVERTED PROPERTIES("parser" = "english")
+)
+ENGINE = OLAP
+DUPLICATE KEY(`ts`)
+PARTITION BY RANGE(`ts`) ()
+DISTRIBUTED BY RANDOM BUCKETS AUTO
+PROPERTIES (
+"replication_num" = "1",
+"storage_policy" = "log_policy_1day",
+"deprecated_dynamic_schema" = "true",
+"dynamic_partition.enable" = "true",
+"dynamic_partition.time_unit" = "DAY",
+"dynamic_partition.start" = "-3",
+"dynamic_partition.end" = "7",
+"dynamic_partition.prefix" = "p",
+"dynamic_partition.buckets" = "AUTO",
+"dynamic_partition.replication_num" = "1"
+);
+```
+
+### Step 2: Ingest the Logs
+
+Apache Doris supports various ingestion methods. For real-time logs, we
recommend the following three methods:
+
+- Pull logs from Kafka message queue: Routine Load
+- Logstash: write logs into Doris via HTTP API
+- Self-defined writing program: write logs into Doris via HTTP API
+
+**Ingest from Kafka**
+
+For JSON logs that are written into Kafka message queues, create [Routine
Load](https://doris.apache.org/docs/dev/data-operate/import/import-way/routine-load-manual/)
so Doris will pull data from Kafka. The following is an example. The
`property.*` configurations are optional:
+
+```SQL
+-- Prepare the Kafka cluster and topic ("log_topic")
+
+-- Create Routine Load, load data from Kafka log_topic to "log_table"
+CREATE ROUTINE LOAD load_log_kafka ON log_db.log_table
+COLUMNS(ts, clientip, request, status, size)
+PROPERTIES (
+ "max_batch_interval" = "10",
+ "max_batch_rows" = "1000000",
+ "max_batch_size" = "109715200",
+ "strict_mode" = "false",
+ "format" = "json"
+)
+FROM KAFKA (
+ "kafka_broker_list" = "host:port",
+ "kafka_topic" = "log_topic",
+ "property.group.id" = "your_group_id",
+ "property.security.protocol"="SASL_PLAINTEXT",
+ "property.sasl.mechanism"="GSSAPI",
+ "property.sasl.kerberos.service.name"="kafka",
+ "property.sasl.kerberos.keytab"="/path/to/xxx.keytab",
+ "property.sasl.kerberos.principal"="[email protected]"
+);
+```
+
+You can check how the Routine Load runs via the `SHOW ROUTINE LOAD` command.
+
+**Ingest via Logstash**
+
+Configure HTTP Output for Logstash, and then data will be sent to Doris via
HTTP Stream Load.
+
+1. Specify the batch size and batch delay in `logstash.yml` to improve data
writing performance.
+
+```Plain
+pipeline.batch.size: 100000
+pipeline.batch.delay: 10000
+```
+
+2. Add HTTP Output to the log collection configuration file `testlog.conf`, with `url` set to the Stream Load address in Doris.
+
+- Since Logstash does not support HTTP redirection, you should use a backend
address instead of a FE address.
+- Authorization in the headers is `http basic auth`. It is computed with `echo
-n 'username:password' | base64`.
+- The `load_to_single_tablet` in the headers can reduce the number of small
files in data ingestion.
+
+```Plain
+output {
+ http {
+ follow_redirects => true
+ keepalive => false
+ http_method => "put"
+ url => "http://172.21.0.5:8640/api/logdb/logtable/_stream_load"
+ headers => [
+ "format", "json",
+ "strip_outer_array", "true",
+ "load_to_single_tablet", "true",
+ "Authorization", "Basic cm9vdDo=",
+ "Expect", "100-continue"
+ ]
+ format => "json_batch"
+ }
+}
+```
+
+**Ingest via self-defined program**
+
+This is an example of ingesting data to Doris via HTTP Stream Load.
+
+Notes:
+
+- Use `basic auth` for HTTP authorization, computed with `echo -n 'username:password' | base64`
+- `http header "format:json"`: the data type is specified as JSON
+- `http header "read_json_by_line:true"`: each line is a JSON record
+- `http header "load_to_single_tablet:true"`: write to one tablet each time
+- For the data writing clients, we recommend a batch size of 100MB~1GB. Future
versions will enable Group Commit at the server end and reduce batch size from
clients.
+
+```Bash
+curl \
+--location-trusted \
+-u username:password \
+-H "format:json" \
+-H "read_json_by_line:true" \
+-H "load_to_single_tablet:true" \
+-T logfile.json \
+http://fe_host:fe_http_port/api/log_db/log_table/_stream_load
+```
+
+### Step 3: Execute Queries
+
+Apache Doris supports standard SQL, so you can connect to Doris via a MySQL client or JDBC and then execute SQL queries.
+
+```Bash
+mysql -h fe_host -P fe_mysql_port -u root -Dlog_db
+```
+
+**A few common queries in log analysis:**
+
+- Check the latest 10 records.
+
+```SQL
+SELECT * FROM log_table ORDER BY ts DESC LIMIT 10;
+```
+
+- Check the latest 10 records of Client IP "8.8.8.8".
+
+```SQL
+SELECT * FROM log_table WHERE clientip = '8.8.8.8' ORDER BY ts DESC LIMIT 10;
+```
+
+- Retrieve the latest 10 records with "error" or "404" in the "request" field.
**MATCH_ANY** is a SQL syntax keyword for full-text search in Doris. It means
to find the records that include any one of the specified keywords.
+
+```SQL
+SELECT * FROM log_table WHERE request MATCH_ANY 'error 404' ORDER BY ts DESC
LIMIT 10;
+```
+
+- Retrieve the latest 10 records with "image" and "faq" in the "request"
field. **MATCH_ALL** is also a SQL syntax keyword for full-text search in
Doris. It means to find the records that include all of the specified keywords.
+
+```SQL
+SELECT * FROM log_table WHERE request MATCH_ALL 'image faq' ORDER BY ts DESC
LIMIT 10;
+```
+
+## Conclusion
+
+If you are looking for an efficient log analytic solution, Apache Doris is friendly to anyone equipped with SQL knowledge; and if you find friction with the ELK stack, try Apache Doris: it provides better schema-free support, enables faster data writing and queries, and brings much less storage burden.
+
+But we won't stop here. We are going to provide more features to facilitate log analysis. We plan to support more complicated data types in the inverted index, and support BKD index to make Apache Doris a fit for geo data analysis. We also plan to expand our capabilities in semi-structured data analysis, such as working on complex data types (Array, Map, Struct, JSON) and high-performance string matching algorithms. And we welcome any [user feedback and development advice](https://t.co/ZxJuNJHXb2).
diff --git a/blog/migrating-from-clickhouse-to-apache-doris-what-happened.md
b/blog/migrating-from-clickhouse-to-apache-doris-what-happened.md
new file mode 100644
index 00000000000..d5e63c95caf
--- /dev/null
+++ b/blog/migrating-from-clickhouse-to-apache-doris-what-happened.md
@@ -0,0 +1,161 @@
+---
+{
+ 'title': 'Migrating from ClickHouse to Apache Doris: What Happened?',
+ 'summary': "A user of Apache Doris has written down their migration
process from ClickHouse to Doris, including why they need the change, what
needs to be taken care of, and how they compare the performance of the two
databases in their environment. ",
+ 'date': '2023-10-11',
+ 'author': 'Chuang Li',
+ 'tags': ['Best Practice'],
+}
+
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+Migrating from one OLAP database to another is a huge undertaking. Even if you're unhappy with your current data tool and have found a promising candidate, you might still hesitate to do the big surgery on your data architecture, because you're uncertain about how things will work out. So you need experience shared by someone who has walked the path.
+
+Luckily, a user of Apache Doris has written down their migration process from
ClickHouse to Doris, including why they need the change, what needs to be taken
care of, and how they compare the performance of the two databases in their
environment.
+
+To decide whether you want to continue reading, check if you tick one of the
following boxes:
+
+- You need your join queries to be executed faster.
+- You need flexible data updates.
+- You need real-time data analysis.
+- You need to minimize your components.
+
+If you do, this post might be of some help to you.
+
+## Replacing Kylin, ClickHouse, and Druid with Apache Doris
+
+The user undergoing this change is an e-commerce SaaS provider. Its data system serves real-time and offline reporting, customer segmentation, and log analysis. Initially, they used different OLAP engines for these various purposes:
+
+- **Apache Kylin for offline reporting**: The system provides offline reporting services for over 5 million sellers. The big ones among them have more than 10 million registered members and 100,000 SKUs, and the detailed information is put into over 400 data cubes on the platform.
+- **ClickHouse for customer segmentation and Top-N log queries**: This entails
high-frequency updates, high QPS, and complicated SQL.
+- **Apache Druid for real-time reporting**: Sellers extract data they need by
combining different dimensions, and such real-time reporting requires quick
data updates, quick query response, and strong stability of the system.
+
+
+
+The three components have their own sore spots.
+
+- **Apache Kylin** runs well with a fixed table schema, but every time you
want to add a dimension, you need to create a new data cube and refill the
historical data in it.
+- **ClickHouse** is not designed for multi-table processing, so you might need an extra solution for federated queries and multi-table join queries. And in this case, it fell below expectations in high-concurrency scenarios.
+- **Apache Druid** implements idempotent writing so it does not support data
updating or deletion itself. That means when there is something wrong at the
upstream, you will need a full data replacement. And such data fixing is a
multi-step process if you think it all the way through, because of all the data
backups and movements. Plus, newly ingested data will not be accessible for
queries until it is put in segments in Druid. That means a longer window such
that data inconsistency betwe [...]
+
+As they work together, this architecture might be too demanding to navigate
because it requires knowledge of all these components in terms of development,
monitoring, and maintenance. Also, every time the user scales a cluster, they
must stop the current cluster and migrate all databases and tables, which is
not only a big undertaking but also a huge interruption to business.
+
+
+
+Apache Doris fills these gaps.
+
+- **Query performance**: Doris is good at high-concurrency queries and join
queries, and it is now equipped with inverted index to speed up searches in
logs.
+- **Data update**: The Unique Key model of Doris supports both large-volume updates and high-frequency real-time writing, and the Duplicate Key model and Unique Key model support partial column update. It also provides an exactly-once guarantee in data writing and ensures consistency between base tables, materialized views, and replicas.
+- **Maintenance**: Doris is MySQL-compatible. It supports easy scaling and
light schema change. It comes with its own integration tools such as
Flink-Doris-Connector and Spark-Doris-Connector.
+
+So they plan on the migration.
+
+## The Replacement Surgery
+
+ClickHouse was the main performance bottleneck in the old data architecture and the reason why the user wanted the change in the first place, so they started with ClickHouse.
+
+### Changes in SQL statements
+
+**Table creation statements**
+
+
+
+The user built their own SQL rewriting tool that can convert a ClickHouse
table creation statement into a Doris table creation statement. The tool can
automate the following changes:
+
+- **Mapping the field types**: It converts ClickHouse field types into the
corresponding ones in Doris. For example, it converts String as a Key into
Varchar, and String as a partitioning field into Date V2.
+- **Setting the number of historical partitions in dynamic partitioning
tables**: Some tables have historical partitions and the number of partitions
should be specified upon table creation in Doris, otherwise a "No Partition"
error will be thrown.
+- **Determining the number of buckets**: It decides the number of buckets
based on the data volume of historical partitions; for non-partitioned tables,
it decides the bucketing configurations based on the historical data volume.
+- **Determining TTL**: It decides the time to live of partitions in dynamic
partitioning tables.
+- **Setting the import sequence**: For the Unique Key model of Doris, it can
specify the data import order based on the Sequence column to ensure
orderliness in data ingestion.
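The field-type mapping step can be pictured as a small lookup table. Only the String rules come from this article; the other entries and the function name are illustrative assumptions, not the user's actual tool:

```python
# Illustrative ClickHouse-to-Doris type mapping. Only the String rules
# are from the article; the numeric entries are assumed for the sketch.
CH_TO_DORIS = {
    "String": "VARCHAR",
    "Int32": "INT",
    "Int64": "BIGINT",
    "Float64": "DOUBLE",
}

def map_type(ch_type: str, is_partition_field: bool = False) -> str:
    # Per the article: String as a partitioning field becomes Date V2.
    if ch_type == "String" and is_partition_field:
        return "DATEV2"
    # Unknown types pass through unchanged for manual review.
    return CH_TO_DORIS.get(ch_type, ch_type)

print(map_type("String"))        # VARCHAR
print(map_type("String", True))  # DATEV2
```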
+
+
+
+**Query statements**
+
+Similarly, they have their own tool to transform the ClickHouse query
statements into Doris query statements. This is to prepare for the comparison
test between ClickHouse and Doris. The key considerations in the conversions
include:
+
+- **Conversion of table names**: This is simple given the mapping rules in
table creation statements.
+- **Conversion of functions**: For example, the `COUNTIF` function in ClickHouse is equivalent to `SUM(CASE WHEN ... THEN 1 ELSE 0)`, `Array Join` is equivalent to `Explode` and `Lateral View`, and `ORDER BY` and `GROUP BY` should be converted to window functions.
+- **Difference in semantics**: ClickHouse goes by its own protocol while Doris is MySQL-compatible, so subqueries need aliases. In this use case, subqueries are common in customer segmentation, so they use `sqlparse`.
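As a flavor of what such function conversions look like, here is a minimal regex-based sketch of the `COUNTIF` rewrite. It is illustrative only (the user's tool is built on `sqlparse` and handles far more cases, including nested expressions this regex would miss):

```python
import re

# Rewrite ClickHouse countIf(cond) into the ANSI-SQL equivalent
# SUM(CASE WHEN cond THEN 1 ELSE 0 END). Handles only simple,
# non-nested argument expressions; a sketch, not a full rewriter.
def rewrite_countif(sql: str) -> str:
    return re.sub(
        r"countIf\(([^()]*)\)",
        r"SUM(CASE WHEN \1 THEN 1 ELSE 0 END)",
        sql,
        flags=re.IGNORECASE,
    )

print(rewrite_countif("SELECT countIf(status = 404) FROM logs"))
# SELECT SUM(CASE WHEN status = 404 THEN 1 ELSE 0 END) FROM logs
```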
+
+### Changes in data ingestion methods
+
+
+
+Apache Doris provides broad options of data writing methods. For the real-time
link, the user adopts Stream Load to ingest data from NSQ and Kafka.
+
+For the sizable offline data, the user tested different methods and here are the takeaways:
+
+1. **Insert Into**
+
+Using Multi-Catalog to read external data sources and ingesting with Insert
Into can serve most needs in this use case.
+
+2. **Stream Load**
+
+The Spark-Doris-Connector is a more general method. It can handle large data
volumes and ensure writing stability. The key is to find the right writing pace
and parallelism.
+
+The Spark-Doris-Connector also supports Bitmap. It allows you to offload the computation workload of Bitmap data to Spark clusters.
+
+Both the Spark-Doris-Connector and the Flink-Doris-Connector rely on Stream
Load. CSV is the recommended format choice. Tests on the user's billions of
rows showed that CSV was 40% faster than JSON.
+
+3. **Spark Load**
+
+The Spark Load method utilizes Spark resources for data shuffling and sorting. The computation results are put in HDFS, and then Doris reads the files from HDFS directly (via Broker Load). This approach is ideal for huge data ingestion: the more data there is, the faster and more resource-efficient the ingestion is.
+
+## Pressure Test
+
+The user compared the performance of the two components in their SQL and join query scenarios, and measured the CPU and memory consumption of Apache Doris.
+
+### SQL query performance
+
+Apache Doris outperformed ClickHouse in 10 of the 16 SQL queries, and the
biggest performance gap was a ratio of almost 30. Overall, Apache Doris was 2~3
times faster than ClickHouse.
+
+
+
+### Join query performance
+
+For join query tests, the user used different sizes of main tables and
dimension tables.
+
+- **Primary tables**: user activity table (4 billion rows), user attribute
table (25 billion rows), and user attribute table (96 billion rows)
+- **Dimension tables**: 1 million rows, 10 million rows, 50 million rows, 100
million rows, 500 million rows, 1 billion rows, and 2.5 billion rows.
+
+The tests include **full join queries** and **filtering join queries**. Full
join queries join all rows of the primary table and dimension tables, while
filtering join queries retrieve data of a certain seller ID with a `WHERE`
filter. The results are concluded as follows:
+
+**Primary table (4 billion rows):**
+
+- Full join queries: Doris outperformed ClickHouse in full join queries with all dimension tables. The performance gap widened as the dimension tables got larger. The largest difference was a ratio of 5.
+- Filtering join queries: Based on the seller ID, the filter screened out 41 million rows from the primary table. With small dimension tables, Doris was 2~3 times faster than ClickHouse; with large dimension tables, Doris was over 10 times faster; with dimension tables larger than 100 million rows, ClickHouse threw an OOM error while Doris functioned normally.
+
+**Primary table (25 billion rows):**
+
+- Full join queries: Doris outperformed ClickHouse in full join queries with all dimension tables. ClickHouse produced an OOM error with dimension tables larger than 50 million rows.
+- Filtering join queries: The filter screened out 570 million rows from the primary table. Doris responded within seconds, while ClickHouse took minutes and broke down when joining large dimension tables.
+
+**Primary table (96 billion rows):**
+
+Doris delivered relatively quick performance in all queries, while ClickHouse was unable to execute them.
+
+In terms of CPU and memory consumption, Apache Doris maintained stable cluster
loads in all sizes of join queries.
+
+## Future Directions
+
+As the migration goes on, the user works closely with the [Doris community](https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-1t3wfymur-0soNPATWQ~gbU8xutFOLog), and their feedback has contributed to the making of [Apache Doris 2.0.0](https://doris.apache.org/docs/dev/releasenotes/release-2.0.0/). We will continue assisting them in their migration from Kylin and Druid to Doris, and we look forward to seeing their Doris-based unified data platform come into being.
diff --git a/blog/release-note-2.0.0.md b/blog/release-note-2.0.0.md
index 60fed524577..626f2ba7f50 100644
--- a/blog/release-note-2.0.0.md
+++ b/blog/release-note-2.0.0.md
@@ -188,9 +188,9 @@ To separate storage and computation is a way to realize
elastic scaling of resou
Apache Doris 2.0 provides two solutions to address the needs of the first two
types of users.
1. **Compute nodes**. We introduced stateless compute nodes in version 2.0.
Unlike the mix nodes, the compute nodes do not save any data and are not
involved in workload balancing of data tablets during cluster scaling. Thus,
they are able to quickly join the cluster and share the computing pressure
during peak times. In addition, in data lakehouse analysis, these nodes will be
the first ones to execute queries on remote storage (HDFS/S3) so there will be
no resource competition between [...]
- 1. Doc: https://doris.apache.org/docs/dev/advanced/compute_node/
+ 1. Doc: https://doris.apache.org/docs/dev/advanced/compute-node
2. **Hot-cold data separation**. Hot/cold data refers to data that is
frequently/seldom accessed, respectively. Generally, it makes more sense to
store cold data in low-cost storage. Older versions of Apache Doris support
lifecycle management of table partitions: As hot data cooled down, it would be
moved from SSD to HDD. However, data was stored with multiple replicas on HDD,
which was still a waste. Now, in Apache Doris 2.0, cold data can be stored in
object storage, which is even chea [...]
- 1. Read more:
https://doris.apache.org/blog/Tiered-Storage-for-Hot-and-Cold-Data:-What,-Why,-and-How?/
+ 1. Read more:
https://doris.apache.org/blog/Tiered-Storage-for-Hot-and-Cold-Data-What-Why-and-How
For neater separate of computation and storage, the VeloDB team is going to
contribute the Cloud Compute-Storage-Separation solution to the Apache Doris
project. The performance and stability of it has stood the test of hundreds of
companies in their production environment. The merging of code will be finished
by October this year, and all Apache Doris users will be able to get an early
taste of it in September.
diff --git a/static/images/Introduction_3.png b/static/images/Introduction_3.png
new file mode 100644
index 00000000000..7e3e112c235
Binary files /dev/null and b/static/images/Introduction_3.png differ
diff --git a/static/images/Introduction_4.png b/static/images/Introduction_4.png
new file mode 100644
index 00000000000..4d7c893cf8e
Binary files /dev/null and b/static/images/Introduction_4.png differ
diff --git a/static/images/LAS_1.png b/static/images/LAS_1.png
new file mode 100644
index 00000000000..c7f75990173
Binary files /dev/null and b/static/images/LAS_1.png differ
diff --git a/static/images/LAS_2.png b/static/images/LAS_2.png
new file mode 100644
index 00000000000..c86d90a3c1a
Binary files /dev/null and b/static/images/LAS_2.png differ
diff --git a/static/images/LAS_3.png b/static/images/LAS_3.png
new file mode 100644
index 00000000000..6d3194f1e11
Binary files /dev/null and b/static/images/LAS_3.png differ
diff --git a/static/images/LAS_4.png b/static/images/LAS_4.png
new file mode 100644
index 00000000000..e298b60940e
Binary files /dev/null and b/static/images/LAS_4.png differ
diff --git a/static/images/LAS_5.png b/static/images/LAS_5.png
new file mode 100644
index 00000000000..eedf9bc8ab1
Binary files /dev/null and b/static/images/LAS_5.png differ
diff --git a/static/images/introduction_1.png b/static/images/introduction_1.png
new file mode 100644
index 00000000000..0568bba99ec
Binary files /dev/null and b/static/images/introduction_1.png differ
diff --git a/static/images/introduction_10.png
b/static/images/introduction_10.png
new file mode 100644
index 00000000000..da0624ae8bb
Binary files /dev/null and b/static/images/introduction_10.png differ
diff --git a/static/images/introduction_11.png
b/static/images/introduction_11.png
new file mode 100644
index 00000000000..e98f99cd01e
Binary files /dev/null and b/static/images/introduction_11.png differ
diff --git a/static/images/introduction_2.png b/static/images/introduction_2.png
new file mode 100644
index 00000000000..ecaa8adec66
Binary files /dev/null and b/static/images/introduction_2.png differ
diff --git a/static/images/introduction_5.png b/static/images/introduction_5.png
new file mode 100644
index 00000000000..5616299f505
Binary files /dev/null and b/static/images/introduction_5.png differ
diff --git a/static/images/introduction_6.png b/static/images/introduction_6.png
new file mode 100644
index 00000000000..71a83250a1b
Binary files /dev/null and b/static/images/introduction_6.png differ
diff --git a/static/images/introduction_7.png b/static/images/introduction_7.png
new file mode 100644
index 00000000000..d56b89fb340
Binary files /dev/null and b/static/images/introduction_7.png differ
diff --git a/static/images/introduction_8.png b/static/images/introduction_8.png
new file mode 100644
index 00000000000..b7de900ceb3
Binary files /dev/null and b/static/images/introduction_8.png differ
diff --git a/static/images/introduction_9.png b/static/images/introduction_9.png
new file mode 100644
index 00000000000..ef2d4401594
Binary files /dev/null and b/static/images/introduction_9.png differ
diff --git a/static/images/youzan-1.png b/static/images/youzan-1.png
new file mode 100644
index 00000000000..a966e51cd81
Binary files /dev/null and b/static/images/youzan-1.png differ
diff --git a/static/images/youzan-2.png b/static/images/youzan-2.png
new file mode 100644
index 00000000000..cf64e1e3808
Binary files /dev/null and b/static/images/youzan-2.png differ
diff --git a/static/images/youzan-3.png b/static/images/youzan-3.png
new file mode 100644
index 00000000000..e90fb32534a
Binary files /dev/null and b/static/images/youzan-3.png differ
diff --git a/static/images/youzan-4.png b/static/images/youzan-4.png
new file mode 100644
index 00000000000..9dda8649452
Binary files /dev/null and b/static/images/youzan-4.png differ
diff --git a/static/images/youzan-5.png b/static/images/youzan-5.png
new file mode 100644
index 00000000000..48b3ef45a92
Binary files /dev/null and b/static/images/youzan-5.png differ
diff --git a/static/images/youzan-6.png b/static/images/youzan-6.png
new file mode 100644
index 00000000000..46f46a68a5a
Binary files /dev/null and b/static/images/youzan-6.png differ
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]