This is an automated email from the ASF dual-hosted git repository.
luzhijing pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new 087aec0dfe0 add blogs & update dead links (#319)
087aec0dfe0 is described below
commit 087aec0dfe00019e782ed3633fe1521f8256adba
Author: Hu Yanjun <[email protected]>
AuthorDate: Thu Oct 12 10:48:30 2023 +0800
add blogs & update dead links (#319)
---
...-Why-We-Went-from-ClickHouse-to-Apache-Doris.md | 2 +-
blog/Tencent-LLM.md | 2 +-
...ambda-Architecture-for-40-Faster-Performance.md | 97 ------
...s-a-next-generation-real-time-data-warehouse.md | 172 ++++++++++
blog/log-analysis-elasticsearch-vs-apache-doris.md | 352 +++++++++++++++++++++
...rom-clickhouse-to-apache-doris-what-happened.md | 161 ++++++++++
blog/release-note-2.0.0.md | 4 +-
static/images/Introduction_3.png | Bin 0 -> 149120 bytes
static/images/Introduction_4.png | Bin 0 -> 143564 bytes
static/images/LAS_1.png | Bin 0 -> 96696 bytes
static/images/LAS_2.png | Bin 0 -> 104517 bytes
static/images/LAS_3.png | Bin 0 -> 73001 bytes
static/images/LAS_4.png | Bin 0 -> 67765 bytes
static/images/LAS_5.png | Bin 0 -> 49078 bytes
static/images/introduction_1.png | Bin 0 -> 321089 bytes
static/images/introduction_10.png | Bin 0 -> 141948 bytes
static/images/introduction_11.png | Bin 0 -> 77773 bytes
static/images/introduction_2.png | Bin 0 -> 326940 bytes
static/images/introduction_5.png | Bin 0 -> 138772 bytes
static/images/introduction_6.png | Bin 0 -> 58618 bytes
static/images/introduction_7.png | Bin 0 -> 194930 bytes
static/images/introduction_8.png | Bin 0 -> 277049 bytes
static/images/introduction_9.png | Bin 0 -> 173567 bytes
static/images/youzan-1.png | Bin 0 -> 96929 bytes
static/images/youzan-2.png | Bin 0 -> 86410 bytes
static/images/youzan-3.png | Bin 0 -> 385862 bytes
static/images/youzan-4.png | Bin 0 -> 382306 bytes
static/images/youzan-5.png | Bin 0 -> 143511 bytes
static/images/youzan-6.png | Bin 0 -> 402306 bytes
29 files changed, 689 insertions(+), 101 deletions(-)
diff --git
a/blog/Tencent-Data-Engineers-Why-We-Went-from-ClickHouse-to-Apache-Doris.md
b/blog/Tencent-Data-Engineers-Why-We-Went-from-ClickHouse-to-Apache-Doris.md
index 8f5aae89c27..e93ab50fd45 100644
--- a/blog/Tencent-Data-Engineers-Why-We-Went-from-ClickHouse-to-Apache-Doris.md
+++ b/blog/Tencent-Data-Engineers-Why-We-Went-from-ClickHouse-to-Apache-Doris.md
@@ -1,6 +1,6 @@
---
{
- 'title': 'Tencent Data Engineer: Why We G from ClickHouse to Apache
Doris?',
+ 'title': 'Tencent Data Engineer: Why We Went from ClickHouse to Apache
Doris?',
'summary': "Evolution of the data processing architecture of Tencent Music
Entertainment towards better performance and simpler maintenance.",
'date': '2023-03-07',
'author': 'Jun Zhang & Kai Dai',
diff --git a/blog/Tencent-LLM.md b/blog/Tencent-LLM.md
index 3775c5a82d0..d8c39cb847c 100644
--- a/blog/Tencent-LLM.md
+++ b/blog/Tencent-LLM.md
@@ -28,7 +28,7 @@ specific language governing permissions and limitations
under the License.
-->
-Six months ago, I wrote about [why we replaced ClickHouse with Apache Doris as
an OLAP engine](https://doris.apache.org/blog/Tencent%20Music/) for our data
management system. Back then, we were struggling with the auto-generation of
SQL statements. As days pass, we have made progresses big enough to be
references for you (I think), so here I am again.
+Six months ago, I wrote about [why we replaced ClickHouse with Apache Doris as
an OLAP
engine](https://doris.apache.org/blog/Tencent-Data-Engineers-Why-We-Went-from-ClickHouse-to-Apache-Doris)
for our data management system. Back then, we were struggling with the
auto-generation of SQL statements. As days pass, we have made progresses big
enough to be references for you (I think), so here I am again.
We have adopted Large Language Models (LLM) to empower our Doris-based OLAP
services.
diff --git
a/blog/Zipping-up-the-Lambda-Architecture-for-40-Faster-Performance.md
b/blog/Zipping-up-the-Lambda-Architecture-for-40-Faster-Performance.md
deleted file mode 100644
index d38a2f7902b..00000000000
--- a/blog/Zipping-up-the-Lambda-Architecture-for-40-Faster-Performance.md
+++ /dev/null
@@ -1,97 +0,0 @@
----
-{
- 'title': 'Zipping up the Lambda Architecture for 40% Faster Performance',
- 'summary': "Instead of pooling real-time and offline data after they are
fully ready for queries, Douyu engineers use Apache Doris to share part of the
pre-query computation burden.",
- 'date': '2023-05-05',
- 'author': 'Tongyang Han',
- 'tags': ['Best Practice'],
-}
-
----
-
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements. See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership. The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied. See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-Author: Tongyang Han, Senior Data Engineer at Douyu
-
-The Lambda architecture has been common practice in big data processing. The
concept is to separate stream (real time data) and batch (offline data)
processing, and that's exactly what we did. These two types of data of ours
were processed in two isolated tubes before they were pooled together and ready
for searches and queries.
-
-
-
-Then we run into a few problems:
-
-1. **Isolation of real-time and offline data warehouses**
- 1. I know this is kind of the essence of Lambda architecture, but that
means we could not reuse real-time data since it was not layered as offline
data, so further customized development was required.
-2. **Complex Pipeline from Data Sources to Data Application**
- 1. Data had to go through multi-step processing before it reached our data
users. As our architecture involved too many components, navigating and
maintaining these tech stacks was a lot of work.
-3. **Lack of management of real-time data sources**
- 1. In extreme cases, this worked like a data silo and we had no way to
find out whether the ingested data was duplicated or reusable.
-
-So we decided to "zip up" the Lambda architecture a little bit. By "zipping
up", I mean to introduce an OLAP engine that is capable of processing, storing,
and analyzing data, so real-time data and offline data converge a little
earlier than they used to. It is not a revolution of Lambda, but a minor change
in the choice of components, which made our real-time data processing 40%
faster.
-
-## **Zipping up Lambda Architecture**
-
-I am going to elaborate on how this is done using our data tagging process as
an example.
-
-Previously, our offline tags were produced by the data warehouse, put into a
flat table, and then written in **HBase**, while real-time tags were produced
by **Flink**, and put into **HBase** directly. Then **Spark** would work as the
computing engine.
-
-
-
-The problem with this stemmed from the low computation efficiency of **Flink**
and **Spark**.
-
-- **Real-time tag production**: When computing real-time tags that involve
data within a long time range, Flink did not deliver stable performance and
consumed more resources than expected. And when a task failed, it would take a
really long time for checkpoint recovery.
-- **Tag query**: As a tag query engine, Spark could be slow.
-
-As a solution, we replaced **HBase** and **Spark** with **Apache Doris**, a
real-time analytic database, and moved part of the computational logic of the
foregoing wide-time-range real-time tags from **Flink** to **Apache Doris**.
-
-
-
-Instead of putting our flat tables in HBase, we place them in Apache Doris.
These tables are divided into partitions based on time sensitivity. Offline
tags will be updated daily while real-time tags will be updated in real time.
We organize these tables in the Aggregate Model of Apache Doris, which allows
partial update of data.
-
-Instead of using Spark for queries, we parse the query rules into SQL for
execution in Apache Doris. For pattern matching, we use Redis to cache the hot
data from Apache Doris, so the system can respond to such queries much faster.
-
-
-
-## **Computational Pipeline of Wide-Time-Range Real-Time Tags**
-
-In some cases, the computation of wide-time-range real-time tags entails the
aggregation of historical (offline) data with real-time data. The following
figure shows our old computational pipeline for these tags.
-
-
-
-As you can see, it required multiple tasks to finish computing one real-time
tag. Also, in complicated aggregations that involve a collection of aggregation
operations, any improper resource allocation could lead to back pressure or
waste of resources. This adds to the difficulty of task scheduling. The
maintenance and stability guarantee of such a long pipeline could be an issue,
too.
-
-To improve on that, we decided to move such aggregation workload to Apache
Doris.
-
-
-
-We have around 400 million customer tags in our system, and each customer is
attached with over 300 tags. We divide customers into more than 10,000 groups,
and we have to update 5000 of them on a daily basis. The above improvement has
sped up the computation of our wide-time-range real-time queries by **40%**.
-
-## Overwrite
-
-To atomically replace data tables and partitions in Apache Doris, we
customized the
[Doris-Spark-Connector](https://github.com/apache/doris-spark-connector), and
added an "Overwrite" mode to the Connector.
-
-When a Spark job is submitted, Apache Doris will call an interface to fetch
information of the data tables and partitions.
-
-- If it is a non-partitioned table, we create a temporary table for the target
table, ingest data into it, and then perform atomic replacement. If the data
ingestion fails, we clear the temporary table;
-- If it is a dynamic partitioned table, we create a temporary partition for
the target partition, ingest data into it, and then perform atomic replacement.
If the data ingestion fails, we clear the temporary partition;
-- If it is a non-dynamic partitioned table, we need to extend the
Doris-Spark-Connector parameter configuration first. Then we create a temporary
partition and take steps as above.
-
-## Conclusion
-
-One prominent advantage of Lambda architecture is the stability it provides.
However, in our practice, the processing of real-time data and offline data
sometimes intertwines. For example, the computation of certain real-time tags
requires historical (offline) data. Such interaction becomes a root cause of
instability. Thus, instead of pooling real-time and offline data after they are
fully ready for queries, we use an OLAP engine to share part of the pre-query
computation burden and mak [...]
diff --git
a/blog/introduction-to-apache-doris-a-next-generation-real-time-data-warehouse.md
b/blog/introduction-to-apache-doris-a-next-generation-real-time-data-warehouse.md
new file mode 100644
index 00000000000..7018172cdee
--- /dev/null
+++
b/blog/introduction-to-apache-doris-a-next-generation-real-time-data-warehouse.md
@@ -0,0 +1,172 @@
+---
+{
+ 'title': 'Introduction to Apache Doris: A Next-Generation Real-Time Data
Warehouse',
+ 'summary': "This is a technical overview of Apache Doris, introducing how
it enables fast query performance with its architectural design, features, and
mechanisms.",
+ 'date': '2023-10-03',
+ 'author': 'Apache Doris',
+ 'tags': ['Tech Sharing'],
+}
+
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## What is Apache Doris?
+
+[Apache Doris](https://doris.apache.org/) is an open-source real-time data
warehouse. It can collect data from various data sources, including relational
databases (MySQL, PostgreSQL, SQL Server, Oracle, etc.), logs, and time series
data from IoT devices. It is capable of reporting, ad-hoc analysis, federated
queries, and log analysis, so it can be used to support dashboarding,
self-service BI, A/B testing, user behavior analysis and the like.
+
+Apache Doris supports both batch import and stream writing. It can be well
integrated with Apache Spark, Apache Hive, Apache Flink, Airbyte, DBT, and
Fivetran. It can also connect to data lakes such as Apache Hive, Apache Hudi,
Apache Iceberg, Delta Lake, and Apache Paimon.
+
+
+
+## Performance
+
+As a real-time OLAP engine, Apache Doris has a competitive edge in query
speed. According to the TPC-H and SSB-Flat benchmarking results, Doris can
deliver much faster performance than Presto, Greenplum, and ClickHouse.
+
+As for its self-evolution, it has increased its query speed by over 10 times
in the past two years, in both complex queries and flat table analysis.
+
+
+
+## Architectural Design
+
+Behind the fast speed of Apache Doris are the architectural design, features,
and mechanisms described below.
+
+First of all, Apache Doris has a cost-based optimizer (CBO) that can figure
out the most efficient execution plan for complicated big queries. It has a
fully vectorized execution engine so it can reduce virtual function calls and
cache misses. It is MPP-based (Massively Parallel Processing) so it can give
full play to the user's machines and cores. In Doris, query execution is
data-driven, which means whether a query gets executed is determined by whether
its relevant data is ready, and [...]
+
+## Fast Point Queries for a Column-Oriented Database
+
+Apache Doris is a column-oriented database so it can make data compression and
data sharding easier and faster. But this might not be suitable for cases such
as customer-facing services. In these cases, a data platform will have to
handle requests from a large number of users concurrently (these are called
"high-concurrency point queries"), and having a columnar storage engine will
amplify I/O operations per second, especially when data is arranged in flat
tables.
+
+To fix that, Apache Doris enables hybrid storage, which means it maintains
row storage and columnar storage at the same time.
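
As a hedged sketch of how hybrid storage might be switched on (the table,
schema, and bucket count are hypothetical; `store_row_column` is the assumed
table property for keeping a row-format copy):

```sql
-- Hypothetical table: the columnar copy serves analytics, while the
-- row-format copy (enabled by "store_row_column") serves point queries
-- that fetch a whole row by key.
CREATE TABLE user_profile (
    user_id BIGINT,
    user_name VARCHAR(64),
    city VARCHAR(32)
)
UNIQUE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 8
PROPERTIES (
    "store_row_column" = "true"
);

-- A high-concurrency point query that benefits from the row copy:
SELECT * FROM user_profile WHERE user_id = 1001;
```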
+
+
+
+In addition, since point queries are all simple queries, it would be
unnecessary and wasteful to invoke the full query planner, so Doris executes a
short-circuit plan for them to reduce overhead.
+
+Another big source of overhead in high-concurrency point queries is SQL
parsing. For that, Doris provides prepared statements. The idea is to parse
SQL statements in advance and cache them, so they can be reused for similar
queries.
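
A minimal sketch of the idea in MySQL-protocol terms (table and variable
names are hypothetical; whether the textual PREPARE/EXECUTE form or a JDBC
client option is used depends on the Doris version):

```sql
-- The statement is parsed once and cached; repeated executions with new
-- parameters skip SQL parsing entirely.
PREPARE point_query FROM 'SELECT * FROM user_profile WHERE user_id = ?';
SET @uid = 1001;
EXECUTE point_query USING @uid;
```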
+
+
+
+## Data Ingestion
+
+Apache Doris provides a range of methods for data ingestion.
+
+**Real-Time stream writing**:
+
+- **[Stream
Load](https://doris.apache.org/docs/dev/data-operate/import/import-way/stream-load-manual?_highlight=stream&_highlight=loa)**:
You can apply this method to write local files or data streams via HTTP. It is
linearly scalable and can reach a throughput of 10 million records per second
in some use cases.
+-
**[Flink-Doris-Connector](https://doris.apache.org/docs/1.2/ecosystem/flink-doris-connector/)**:
With built-in Flink CDC, this Connector ingests data from OLTP databases to
Doris. So far, we have realized auto-synchronization of data from MySQL and
Oracle to Doris.
+- **[Routine
Load](https://doris.apache.org/docs/dev/data-operate/import/import-way/routine-load-manual)**:
This subscribes to data from Kafka message queues.
+- **[Insert
Into](https://doris.apache.org/docs/dev/data-operate/import/import-way/insert-into-manual)**:
This is especially useful when you try to do ETL in Doris internally, like
writing data from one Doris table to another.
+
+**Batch writing**:
+
+- **[Spark
Load](https://doris.apache.org/docs/dev/data-operate/import/import-way/spark-load-manual)**:
With this method, you can leverage Spark resources to pre-process data from
HDFS and object storage before writing to Doris.
+- **[Broker
Load](https://doris.apache.org/docs/dev/data-operate/import/import-way/broker-load-manual)**:
This supports HDFS and S3 protocol.
+- `insert into <internal table> select from <external table>`: This simple
statement allows you to connect Doris to various storage systems, data lakes,
and databases.
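
To illustrate one of the methods above, a hypothetical Routine Load job that
subscribes to a Kafka topic might look like this (database, table, topic, and
broker addresses are placeholders):

```sql
-- Continuously consumes messages from Kafka and writes them into the
-- user_events table in small batches.
CREATE ROUTINE LOAD example_db.load_user_events ON user_events
COLUMNS TERMINATED BY ","
PROPERTIES (
    "max_batch_interval" = "20"
)
FROM KAFKA (
    "kafka_broker_list" = "broker1:9092",
    "kafka_topic" = "user_events"
);
```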
+
+## Data Update
+
+For data updates, Apache Doris supports both Merge on Read and Merge on
Write: the former for low-frequency batch updates and the latter for real-time
writing. With Merge on Write, the latest data is ready by the time you execute
queries, which is why it can improve query speed by 5 to 10 times compared to
Merge on Read.
+
+From an implementation perspective, these are a few common data update
operations, and Doris supports them all:
+
+- **Upsert**: to replace or update a whole row
+- **Partial column update**: to update just a few columns in a row
+- **Conditional updating**: to filter out some data by combining a few
conditions in order to replace or delete it
+- **Insert Overwrite**: to rewrite a table or partition
+
+In some cases, data updates happen concurrently, which means numerous new
records arrive and try to modify the same existing data record, so the
updating order matters a lot. That's why Doris allows you to decide the order,
either by the order of transaction commit or by a sequence column (something
that you specify in the table in advance). Doris also supports data deletion
based on a specified predicate, and that's how conditional updating is done.
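
A sketch of these mechanics under stated assumptions (the table is
hypothetical; `enable_unique_key_merge_on_write` and
`function_column.sequence_col` are the assumed property names for Merge on
Write and the sequence column):

```sql
-- Merge on Write: the latest version of each order_id is materialized at
-- write time; update_time decides which concurrent update wins.
CREATE TABLE orders (
    order_id BIGINT,
    status VARCHAR(16),
    update_time DATETIME
)
UNIQUE KEY(order_id)
DISTRIBUTED BY HASH(order_id) BUCKETS 8
PROPERTIES (
    "enable_unique_key_merge_on_write" = "true",
    "function_column.sequence_col" = "update_time"
);

-- Upsert: a row with an existing order_id replaces the older version.
INSERT INTO orders VALUES (1, 'shipped', '2023-10-01 12:00:00');

-- Conditional updating via predicate-based deletion.
DELETE FROM orders WHERE status = 'cancelled';
```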
+
+## Service Availability & Data Reliability
+
+Apart from fast performance in queries and data ingestion, Apache Doris also
provides service availability guarantee, and this is how:
+
+Architecturally, Doris has two processes: frontend and backend, both of which
are easily scalable. The frontend nodes manage the clusters and metadata and
handle user requests; the backend nodes execute the queries and are capable of
auto data balancing and auto-restoration. Doris supports cluster upgrading and
scaling without interrupting services.
+
+
+
+## Cross Cluster Replication
+
+Enterprise users, especially those in finance or e-commerce, need to back up
their clusters or their entire data center, just in case of force majeure. So
Doris 2.0 provides Cross Cluster Replication (CCR). With CCR, users can do a
lot:
+
+- **Disaster recovery**: for quick restoration of data services
+- **Read-write separation**: master cluster + slave cluster; one for reading,
one for writing
+- **Isolated upgrade of clusters**: For cluster scaling, CCR allows users to
pre-create a backup cluster for a trial run so they can clear out the possible
incompatibility issues and bugs.
+
+Tests show that Doris CCR can reach a data latency of minutes. In the best
case, it can reach the upper speed limit of the hardware environment.
+
+
+
+## Multi-Tenant Management
+
+Apache Doris has sophisticated Role-Based Access Control, and it allows
fine-grained privilege control on the level of databases, tables, rows, and
columns.
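
As a hedged sketch (role, user, and policy names are hypothetical, and the
row-policy syntax is an assumption based on Doris 2.0):

```sql
-- Database/table-level privileges through a role...
CREATE ROLE analyst;
GRANT SELECT_PRIV ON example_db.orders TO ROLE 'analyst';
CREATE USER 'alice' IDENTIFIED BY 'password' DEFAULT ROLE 'analyst';

-- ...and a row-level policy restricting which rows 'alice' can read.
CREATE ROW POLICY region_filter ON example_db.orders
AS RESTRICTIVE TO 'alice' USING (city = 'Beijing');
```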
+
+
+
+For resource isolation, Doris used to implement a hard isolation plan: the
backend nodes were divided into Resource Groups, and the Resource Groups were
assigned to different workloads. It was simple and neat, but sometimes users
could not make the most of their computing resources because some Resource
Groups sat idle.
+
+
+
+Thus, instead of Resource Groups, Doris 2.0 introduces Workload Groups. A
soft limit is set for each Workload Group on how many resources it can use.
When that soft limit is hit while there are idle resources available, the idle
resources are shared across the workload groups. Users can also prioritize the
workload groups in terms of their access to idle resources.
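
A sketch of defining such a group (the group name and values are
hypothetical; the property keys are assumptions based on Doris 2.0):

```sql
-- cpu_share is a soft limit: the group can borrow idle CPU from other
-- groups and is throttled back when they become busy.
CREATE WORKLOAD GROUP IF NOT EXISTS etl_group
PROPERTIES (
    "cpu_share" = "10",
    "memory_limit" = "30%",
    "enable_memory_overcommit" = "true"
);
```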
+
+
+
+## Easy to Use
+
+As many capabilities as Apache Doris provides, it is also easy to use. It
supports standard SQL and is compatible with MySQL protocol and most BI tools
on the market.
+
+Another effort that we've made to improve usability is a feature called Light
Schema Change: if users need to add or delete some columns in a table, they
only need to update the metadata in the frontend rather than modify all the
data files. Light Schema Change can be done within milliseconds. It also
allows changes to indexes and to the data types of columns. Combined with the
Flink-Doris-Connector, it enables synchronization of upstream table schema
changes within milliseconds.
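
In practice, a light schema change is just an ordinary ALTER statement (table
and column names are hypothetical); since only frontend metadata is touched,
it returns in milliseconds:

```sql
-- Adds and drops columns without rewriting any data files.
ALTER TABLE user_profile ADD COLUMN age INT DEFAULT "0";
ALTER TABLE user_profile DROP COLUMN city;
```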
+
+## Semi-Structured Data Analysis
+
+Common examples of semi-structured data include logs, observability data, and
time series data. These cases require schema-free support, lower cost, and
capabilities in multi-dimensional analysis and full-text search.
+
+In text analysis, people mostly use the LIKE operator, so we put a lot of
effort into improving its performance, including pushing the LIKE operator
down to the storage layer (to reduce data scanning), and introducing the NGram
BloomFilter, the Hyperscan regex matching library, and the Volnitsky algorithm
(for sub-string matching).
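
A sketch of declaring such an index (the table is hypothetical; the
`NGRAM_BF` index syntax and its properties are assumptions):

```sql
-- The NGram BloomFilter index prunes data blocks that cannot contain the
-- searched substring, reducing scanning for LIKE '%...%' predicates.
CREATE TABLE app_log (
    ts DATETIME,
    message STRING,
    INDEX idx_msg (message) USING NGRAM_BF
        PROPERTIES ("gram_size" = "3", "bf_size" = "256")
)
DUPLICATE KEY(ts)
DISTRIBUTED BY HASH(ts) BUCKETS 8;

SELECT * FROM app_log WHERE message LIKE '%timeout%';
```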
+
+
+
+We have also introduced inverted indexes for text tokenization. They are a
powerful tool for fuzzy keyword search, full-text search, equivalence queries,
and range queries.
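
A sketch of how an inverted index with tokenization might be declared and
queried (hypothetical table; `USING INVERTED` and the `MATCH_ANY`
keyword-search predicate are the assumed Doris 2.0 syntax):

```sql
-- The "parser" property enables tokenization, turning the index into a
-- full-text search structure.
CREATE TABLE service_log (
    ts DATETIME,
    message STRING,
    INDEX idx_inv (message) USING INVERTED PROPERTIES ("parser" = "english")
)
DUPLICATE KEY(ts)
DISTRIBUTED BY HASH(ts) BUCKETS 8;

-- Keyword search: matches rows whose message contains any of the terms.
SELECT * FROM service_log WHERE message MATCH_ANY 'error timeout';
```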
+
+## Data Lakehouse
+
+For users to build a high-performing data lakehouse and a unified query
gateway, Doris can map, cache, and auto-refresh the metadata from external
sources. It supports Hive Metastore and almost all open data lakehouse
formats. You can connect it to relational databases, Elasticsearch, and many
other sources. And it allows you to reuse your own authentication systems,
like Kerberos and Apache Ranger, on the external tables.
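
A minimal sketch of mapping an external source (the metastore URI and
database/table names are placeholders):

```sql
-- Once the catalog is created, Hive tables are queryable like internal
-- ones, with metadata cached and auto-refreshed by Doris.
CREATE CATALOG hive_catalog PROPERTIES (
    "type" = "hms",
    "hive.metastore.uris" = "thrift://hms-host:9083"
);

SELECT * FROM hive_catalog.sales_db.orders LIMIT 10;
```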
+
+Benchmark results show that Apache Doris is 3~5 times faster than Trino in
queries on Hive tables. It is the joint result of a few features:
+
+1. Efficient query engine
+2. Hot data caching mechanism
+3. Compute nodes
+4. Views in Doris
+
+The [Compute Nodes](https://doris.apache.org/docs/dev/advanced/compute-node)
are a solution newly introduced in version 2.0 for data lakehousing. Unlike
normal backend nodes, Compute Nodes are stateless and do not store any data,
nor are they involved in data balancing during cluster scaling. Thus, they
can join the cluster flexibly and easily during computation peak times.
+
+Also, Doris allows you to write the computation results of external tables
into Doris to form a view. The thinking is similar to materialized views:
trading space for speed. After a query on external tables is executed, the
results can be stored in Doris internally. When similar queries follow up, the
system can directly read the results of previous queries from Doris, and that
speeds things up.
+
+## Tiered Storage
+
+The main purpose of tiered storage is to save money. [Tiered
storage](https://doris.apache.org/docs/dev/advanced/cold-hot-separation?_highlight=cold)
means separating hot data and cold data into different storage, with hot data
being the data that is frequently accessed and cold data the data that isn't.
It allows users to put hot data on quick but expensive disks (such as SSD and
HDD), and cold data in object storage.
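
A sketch of wiring this up (resource, policy, endpoint, and credential values
are placeholders; the statement shapes are assumptions based on the Doris 2.0
cold-hot separation feature):

```sql
-- An object-storage resource for cold data...
CREATE RESOURCE "remote_s3" PROPERTIES (
    "type" = "s3",
    "s3.endpoint" = "s3.us-east-1.amazonaws.com",
    "s3.region" = "us-east-1",
    "s3.bucket" = "doris-cold-data",
    "s3.root.path" = "cold",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk"
);

-- ...and a policy that moves data there after it has cooled down.
CREATE STORAGE POLICY cold_policy PROPERTIES (
    "storage_resource" = "remote_s3",
    "cooldown_ttl" = "2592000"
);
-- Tables opt in via PROPERTIES ("storage_policy" = "cold_policy").
```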
+
+
+
+Roughly speaking, for a data asset consisting of 80% cold data, tiered storage
will reduce your storage cost by 70%.
+
+## The Apache Doris Community
+
+This is an overview of Apache Doris, an open-source real-time data warehouse.
It is actively evolving with an agile release schedule, and the
[community](https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-1t3wfymur-0soNPATWQ~gbU8xutFOLog)
embraces any questions, ideas, and feedback.
diff --git a/blog/log-analysis-elasticsearch-vs-apache-doris.md
b/blog/log-analysis-elasticsearch-vs-apache-doris.md
new file mode 100644
index 00000000000..28c57378f85
--- /dev/null
+++ b/blog/log-analysis-elasticsearch-vs-apache-doris.md
@@ -0,0 +1,352 @@
+---
+{
+ 'title': 'Log Analysis: Elasticsearch VS Apache Doris',
+ 'summary': "As a major part of a company's data asset, logs bring value
to businesses in three aspects: system observability, cyber security, and data
analysis. They are your first resort for troubleshooting, your reference for
improving system security, and your data mine where you can extract
information that points to business growth.",
+ 'date': '2023-09-28',
+ 'author': 'Apache Doris',
+ 'tags': ['Tech Sharing'],
+}
+
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+As a major part of a company's data asset, logs bring value to businesses in
three aspects: system observability, cyber security, and data analysis. They
are your first resort for troubleshooting, your reference for improving system
security, and your data mine where you can extract information that points to
business growth.
+
+Logs are the sequential records of events in the computer system. If you think
about how logs are generated and used, you will know what an ideal log analysis
system should look like:
+
+- **It should have schema-free support.** Raw logs are unstructured free
texts that are basically impossible to aggregate or calculate on, so you had
to turn them into structured tables (the process is called "ETL") before
putting them into a database or data warehouse for analysis. If there was a
schema change, lots of complicated adjustments had to be made to the ETL
pipeline and the structured tables. Therefore, semi-structured logs, mostly in
JSON format, emerged. You can add or delete fields i [...]
+- **It should be low-cost.** Logs are huge and they are generated
continuously. A fairly big company produces 10~100 TBs of log data. For
business or compliance reasons, it should keep the logs around for half a year
or longer. That means storing log data measured in petabytes, so the cost is
considerable.
+- **It should be capable of real-time processing.** Logs should be written in
real time, otherwise engineers won't be able to catch the latest events in
troubleshooting and security tracking. Plus, a good log system should provide
full-text searching capabilities and respond to interactive queries quickly.
+
+## The Elasticsearch-Based Log Analysis Solution
+
+A popular log processing solution within the data industry is the **ELK stack:
Elasticsearch, Logstash, and Kibana**. The pipeline can be split into five
modules:
+
+- **Log collection**: Filebeat collects local log files and writes them to a
Kafka message queue.
+- **Log transmission**: Kafka message queue gathers and caches logs.
+- **Log transfer**: Logstash filters and transfers log data in Kafka.
+- **Log storage**: Logstash writes logs in JSON format into Elasticsearch for
storage.
+- **Log query**: Users search for logs via Kibana visualization or send a
query request via Elasticsearch DSL API.
+
+
+
+The ELK stack has outstanding real-time processing capabilities, but frictions
exist.
+
+### Inadequate Schema-Free Support
+
+The Index Mapping in Elasticsearch defines the table schema, which includes
the field names, data types, and whether to enable index creation.
+
+
+
+Elasticsearch also boasts a Dynamic Mapping mechanism that automatically adds
fields to the Mapping according to the input JSON data. This provides some sort
of schema-free support, but it's not enough because:
+
+- Dynamic Mapping often creates too many fields when processing dirty data,
which interrupts the whole system.
+- The data type of fields is immutable. To ensure compatibility, users often
configure "text" as the data type, but that results in much slower query
performance than binary data types such as integer.
+- The index of fields is immutable, too. Users cannot add or delete indexes
for a certain field, so they often create indexes for all fields to facilitate
data filtering in queries. But too many indexes require extra storage space and
slow down data ingestion.
+
+### Inadequate Analytic Capability
+
+Elasticsearch has its unique Domain Specific Language (DSL), which is very
different from the tech stack that most data engineers and analysts are
familiar with, so there is a steep learning curve. Moreover, Elasticsearch has
a relatively closed ecosystem so there might be strong resistance in
integration with BI tools. Most importantly, Elasticsearch only supports
single-table analysis and is lagging behind the modern OLAP demands for
multi-table join, sub-query, and views.
+
+
+
+### High Cost & Low Stability
+
+Elasticsearch users have been complaining about the computation and storage
costs. The root reason lies in the way Elasticsearch works.
+
+- **Computation cost**: In data writing, Elasticsearch also executes
compute-intensive operations including inverted index creation, tokenization,
and inverted index ranking. Under these circumstances, data is written into
Elasticsearch at a speed of around 2MB/s per core. When CPU resources are
tight, data writing requests often get rejected during peak times, which
further leads to higher latency.
+- **Storage cost**: To speed up retrieval, Elasticsearch stores the forward
indexes, inverted indexes, and docvalues of the original data, consuming a lot
more storage space. The compression ratio of a single data copy is only 1.5:1,
compared to the 5:1 in most log solutions.
+
+As data and cluster size grows, maintaining stability can be another issue:
+
+- **During data writing peaks**: Clusters are prone to overload during data
writing peaks.
+
+- **During queries**: Since all queries are processed in the memory, big
queries can easily lead to JVM OOM.
+
+- **Slow recovery**: For a cluster failure, Elasticsearch should reload
indexes, which is resource-intensive, so it will take many minutes to recover.
That challenges service availability guarantee.
+
+
+
+## A More Cost-Effective Option
+
+Reflecting on the strengths and limitations of the Elasticsearch-based
solution, the Apache Doris developers have optimized Apache Doris for log
processing.
+
+- **Increase writing throughput**: The performance of Elasticsearch is
bottlenecked by data parsing and inverted index creation, so we improved
Apache Doris in these respects: we accelerated data parsing and index creation
using SIMD CPU vector instructions, and we removed data structures unnecessary
for log analysis scenarios, such as forward indexes, to simplify index
creation.
+- **Reduce storage costs**: We removed forward indexes, which represented 30%
of index data. We adopted columnar storage and the ZSTD compression algorithm,
and thus achieved a compression ratio of 5:1 to 10:1. Given that a large part
of the historical logs are rarely accessed, we introduced tiered storage to
separate hot and cold data. Logs that are older than a specified time period
will be moved to object storage, which is much less expensive. This can reduce
storage costs by around 70%.
+
+Benchmark tests with ES Rally, the official testing tool for Elasticsearch,
showed that Apache Doris was around 5 times as fast as Elasticsearch in data
writing, 2.3 times as fast in queries, and it consumed only 1/5 of the storage
space that Elasticsearch used. On the test dataset of HTTP logs, it achieved a
writing speed of 550 MB/s and a compression ratio of 10:1.
+
+
+
+The figure below shows what a typical Doris-based log processing system looks like. It is more inclusive and allows for more flexible usage across data ingestion, analysis, and application:
+
+- **Ingestion**: Apache Doris supports various ingestion methods for log data. You can push logs to Doris via HTTP Output using Logstash, you can use Flink to pre-process the logs before you write them into Doris, or you can load logs from Kafka or object storage to Doris via Routine Load and S3 Load.
+- **Analysis**: You can put log data in Doris and conduct join queries across
logs and other data in the data warehouse.
+- **Application**: Apache Doris is compatible with MySQL protocol, so you can
integrate a wide variety of data analytic tools and clients to Doris, such as
Grafana and Tableau. You can also connect applications to Doris via JDBC and
ODBC APIs. We are planning to build a Kibana-like system to visualize logs.
+
+
+
+Moreover, Apache Doris has better schema-free support and a more user-friendly analytic engine.
+
+### Native Support for Semi-Structured Data
+
+**Firstly, we worked on the data types.** We optimized string search and regular expression matching for "text" through vectorization and brought a performance increase of 2~10 times. For JSON strings, Apache Doris parses and stores them in a more compact and efficient binary format, which can speed up queries by 4 times. We also added new data types for complicated data: Array and Map. They can structure concatenated strings to allow for higher compression ratios and faster queries.
+
+**Secondly, Apache Doris supports schema evolution.** This means you can
adjust the schema as your business changes. You can add or delete fields and
indexes, and change the data types for fields.
+
+Apache Doris provides Light Schema Change capabilities, so you can add or
delete fields within milliseconds:
+
+```SQL
+-- Add a column. Result will be returned in milliseconds.
+ALTER TABLE lineitem ADD COLUMN l_new_column INT;
+```
+
+
+
+You can also add indexes only for your target fields, so you can avoid the overhead of unnecessary index creation. After you add an index, by default, the system generates the index for all incremental data, and you can specify which historical data partitions need the index.
+
+```SQL
+-- Add inverted index. Doris will generate inverted index for all new data
afterward.
+ALTER TABLE table_name ADD INDEX index_name(column_name) USING INVERTED;
+
+-- Build index for the specified historical data partitions.
+BUILD INDEX index_name ON table_name PARTITIONS(partition_name1,
partition_name2);
+```
+
+### SQL-Based Analytic Engine
+
+The SQL-based analytic engine makes sure that data engineers and analysts can smoothly grasp Apache Doris in a short time and bring their experience with SQL to this OLAP engine. Building on the rich features of SQL, users can execute data retrieval, aggregation, multi-table joins, sub-queries, UDFs, logical views, and materialized views to serve their own needs.
+
+With MySQL compatibility, Apache Doris can be integrated with most GUI and BI
tools in the big data ecosystem, so users can realize more complex and
diversified data analysis.
+
+### Performance in Use Case
+
+A gaming company has transitioned from the ELK stack to the Apache Doris
solution. Their Doris-based log system used 1/6 of the storage space that they
previously needed.
+
+A cybersecurity company built their log analysis system utilizing the inverted index in Apache Doris. They supported a data writing speed of 300,000 rows per second with 1/5 of the server resources that they formerly used.
+
+## Hands-On Guide
+
+Now let's go through the three steps of building a log analysis system with
Apache Doris.
+
+Before you start, [download](https://doris.apache.org/download/) Apache Doris
2.0 or newer versions from the website and
[deploy](https://doris.apache.org/docs/dev/install/standard-deployment/)
clusters.
+
+### Step 1: Create Tables
+
+This is an example of table creation.
+
+Explanations for the configurations:
+
+- The DATETIMEV2 time field is specified as the Key in order to speed up
queries for the latest N log records.
+- Indexes are created for the frequently accessed fields, and fields that
require full-text search are specified with Parser parameters.
+- "PARTITION BY RANGE" means to partition the data by RANGE based on time
fields, [Dynamic
Partition](https://doris.apache.org/docs/dev/advanced/partition/dynamic-partition/)
is enabled for auto-management.
+- "DISTRIBUTED BY RANDOM BUCKETS AUTO" means to distribute the data into
buckets randomly and the system will automatically decide the number of buckets
based on the cluster size and data volume.
+- "log_policy_1day" and "log_s3" means to move logs older than 1 day to S3
storage.
+
+```SQL
+CREATE DATABASE log_db;
+USE log_db;
+
+CREATE RESOURCE "log_s3"
+PROPERTIES
+(
+ "type" = "s3",
+ "s3.endpoint" = "your_endpoint_url",
+ "s3.region" = "your_region",
+ "s3.bucket" = "your_bucket",
+ "s3.root.path" = "your_path",
+ "s3.access_key" = "your_ak",
+ "s3.secret_key" = "your_sk"
+);
+
+CREATE STORAGE POLICY log_policy_1day
+PROPERTIES(
+ "storage_resource" = "log_s3",
+ "cooldown_ttl" = "86400"
+);
+
+CREATE TABLE log_table
+(
+ `ts` DATETIMEV2,
+ `clientip` VARCHAR(20),
+ `request` TEXT,
+ `status` INT,
+ `size` INT,
+ INDEX idx_size (`size`) USING INVERTED,
+ INDEX idx_status (`status`) USING INVERTED,
+ INDEX idx_clientip (`clientip`) USING INVERTED,
+ INDEX idx_request (`request`) USING INVERTED PROPERTIES("parser" = "english")
+)
+ENGINE = OLAP
+DUPLICATE KEY(`ts`)
+PARTITION BY RANGE(`ts`) ()
+DISTRIBUTED BY RANDOM BUCKETS AUTO
+PROPERTIES (
+"replication_num" = "1",
+"storage_policy" = "log_policy_1day",
+"deprecated_dynamic_schema" = "true",
+"dynamic_partition.enable" = "true",
+"dynamic_partition.time_unit" = "DAY",
+"dynamic_partition.start" = "-3",
+"dynamic_partition.end" = "7",
+"dynamic_partition.prefix" = "p",
+"dynamic_partition.buckets" = "AUTO",
+"dynamic_partition.replication_num" = "1"
+);
+```
+
+### Step 2: Ingest the Logs
+
+Apache Doris supports various ingestion methods. For real-time logs, we
recommend the following three methods:
+
+- Pull logs from Kafka message queue: Routine Load
+- Logstash: write logs into Doris via HTTP API
+- Self-defined writing program: write logs into Doris via HTTP API
+
+**Ingest from Kafka**
+
+For JSON logs that are written into Kafka message queues, create [Routine
Load](https://doris.apache.org/docs/dev/data-operate/import/import-way/routine-load-manual/)
so Doris will pull data from Kafka. The following is an example. The
`property.*` configurations are optional:
+
+```SQL
+-- Prepare the Kafka cluster and topic ("log_topic")
+
+-- Create Routine Load, load data from Kafka log_topic to "log_table"
+CREATE ROUTINE LOAD load_log_kafka ON log_db.log_table
+COLUMNS(ts, clientip, request, status, size)
+PROPERTIES (
+ "max_batch_interval" = "10",
+ "max_batch_rows" = "1000000",
+ "max_batch_size" = "109715200",
+ "strict_mode" = "false",
+ "format" = "json"
+)
+FROM KAFKA (
+ "kafka_broker_list" = "host:port",
+ "kafka_topic" = "log_topic",
+ "property.group.id" = "your_group_id",
+ "property.security.protocol"="SASL_PLAINTEXT",
+ "property.sasl.mechanism"="GSSAPI",
+ "property.sasl.kerberos.service.name"="kafka",
+ "property.sasl.kerberos.keytab"="/path/to/xxx.keytab",
+ "property.sasl.kerberos.principal"="[email protected]"
+);
+```
+
+You can check how the Routine Load runs via the `SHOW ROUTINE LOAD` command.
+
+**Ingest via Logstash**
+
+Configure HTTP Output for Logstash, and then data will be sent to Doris via
HTTP Stream Load.
+
+1. Specify the batch size and batch delay in `logstash.yml` to improve data
writing performance.
+
+```Plain
+pipeline.batch.size: 100000
+pipeline.batch.delay: 10000
+```
+
+2. Add HTTP Output to the log collection configuration file `testlog.conf`, with `url` set to the Stream Load address in Doris.
+
+- Since Logstash does not support HTTP redirection, you should use a backend
address instead of a FE address.
+- Authorization in the headers is `http basic auth`. It is computed with `echo
-n 'username:password' | base64`.
+- The `load_to_single_tablet` in the headers can reduce the number of small
files in data ingestion.
+
+```Plain
+output {
+ http {
+ follow_redirects => true
+ keepalive => false
+ http_method => "put"
+ url => "http://172.21.0.5:8640/api/logdb/logtable/_stream_load"
+ headers => [
+ "format", "json",
+ "strip_outer_array", "true",
+ "load_to_single_tablet", "true",
+ "Authorization", "Basic cm9vdDo=",
+ "Expect", "100-continue"
+ ]
+ format => "json_batch"
+ }
+}
+```
+
+**Ingest via self-defined program**
+
+This is an example of ingesting data to Doris via HTTP Stream Load.
+
+Notes:
+
+- Use `basic auth` for HTTP authorization, computed with `echo -n 'username:password' | base64`
+- `http header "format:json"`: the data type is specified as JSON
+- `http header "read_json_by_line:true"`: each line is a JSON record
+- `http header "load_to_single_tablet:true"`: write to one tablet each time
+- For the data writing clients, we recommend a batch size of 100MB~1GB. Future
versions will enable Group Commit at the server end and reduce batch size from
clients.
+
+```Bash
+curl \
+--location-trusted \
+-u username:password \
+-H "format:json" \
+-H "read_json_by_line:true" \
+-H "load_to_single_tablet:true" \
+-T logfile.json \
+http://fe_host:fe_http_port/api/log_db/log_table/_stream_load
+```
+
+### Step 3: Execute Queries
+
+Apache Doris supports standard SQL, so you can connect to Doris via a MySQL client or JDBC and then execute SQL queries.
+
+```Bash
+mysql -h fe_host -P fe_mysql_port -u root -Dlog_db
+```
+
+**A few common queries in log analysis:**
+
+- Check the latest 10 records.
+
+```SQL
+SELECT * FROM log_table ORDER BY ts DESC LIMIT 10;
+```
+
+- Check the latest 10 records of Client IP "8.8.8.8".
+
+```SQL
+SELECT * FROM log_table WHERE clientip = '8.8.8.8' ORDER BY ts DESC LIMIT 10;
+```
+
+- Retrieve the latest 10 records with "error" or "404" in the "request" field.
**MATCH_ANY** is a SQL syntax keyword for full-text search in Doris. It means
to find the records that include any one of the specified keywords.
+
+```SQL
+SELECT * FROM log_table WHERE request MATCH_ANY 'error 404' ORDER BY ts DESC
LIMIT 10;
+```
+
+- Retrieve the latest 10 records with "image" and "faq" in the "request"
field. **MATCH_ALL** is also a SQL syntax keyword for full-text search in
Doris. It means to find the records that include all of the specified keywords.
+
+```SQL
+SELECT * FROM log_table WHERE request MATCH_ALL 'image faq' ORDER BY ts DESC
LIMIT 10;
+```
+
+## Conclusion
+
+If you are looking for an efficient log analytic solution, Apache Doris is friendly to anyone equipped with SQL knowledge; and if you find friction with the ELK stack, try Apache Doris: it provides better schema-free support, enables faster data writing and queries, and brings much less storage burden.
+
+But we won't stop here. We are going to provide more features to facilitate log analysis. We plan to support more complicated data types in the inverted index, and support BKD index to make Apache Doris a fit for geo data analysis. We also plan to expand our capabilities in semi-structured data analysis, such as working on complex data types (Array, Map, Struct, JSON) and high-performance string matching algorithms. And we welcome any [user feedback and development advice](https://t.co/ZxJuNJHXb2).
diff --git a/blog/migrating-from-clickhouse-to-apache-doris-what-happened.md
b/blog/migrating-from-clickhouse-to-apache-doris-what-happened.md
new file mode 100644
index 00000000000..d5e63c95caf
--- /dev/null
+++ b/blog/migrating-from-clickhouse-to-apache-doris-what-happened.md
@@ -0,0 +1,161 @@
+---
+{
+ 'title': 'Migrating from ClickHouse to Apache Doris: What Happened?',
+ 'summary': "A user of Apache Doris has written down their migration
process from ClickHouse to Doris, including why they need the change, what
needs to be taken care of, and how they compare the performance of the two
databases in their environment. ",
+ 'date': '2023-10-11',
+ 'author': 'Chuang Li',
+ 'tags': ['Best Practice'],
+}
+
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+Migrating from one OLAP database to another is a huge undertaking. Even if you're unhappy with your current data tool and have found a promising candidate, you might still hesitate to do the big surgery on your data architecture, because you're uncertain about how things will work out. So you need experience shared by someone who has walked the path.
+
+Luckily, a user of Apache Doris has written down their migration process from
ClickHouse to Doris, including why they need the change, what needs to be taken
care of, and how they compare the performance of the two databases in their
environment.
+
+To decide whether you want to continue reading, check if you tick one of the
following boxes:
+
+- You need your join queries to be executed faster.
+- You need flexible data updates.
+- You need real-time data analysis.
+- You need to minimize your components.
+
+If you do, this post might be of some help to you.
+
+## Replacing Kylin, ClickHouse, and Druid with Apache Doris
+
+The user undergoing this change is an e-commerce SaaS provider. Its data system serves real-time and offline reporting, customer segmentation, and log analysis. Initially, they used different OLAP engines for these various purposes:
+
+- **Apache Kylin for offline reporting**: The system provides offline reporting services for over 5 million sellers. The big ones among them have more than 10 million registered members and 100,000 SKUs, and the detailed information is put into over 400 data cubes on the platform.
+- **ClickHouse for customer segmentation and Top-N log queries**: This entails
high-frequency updates, high QPS, and complicated SQL.
+- **Apache Druid for real-time reporting**: Sellers extract data they need by
combining different dimensions, and such real-time reporting requires quick
data updates, quick query response, and strong stability of the system.
+
+
+
+The three components have their own sore spots.
+
+- **Apache Kylin** runs well with a fixed table schema, but every time you
want to add a dimension, you need to create a new data cube and refill the
historical data in it.
+- **ClickHouse** is not designed for multi-table processing, so you might need an extra solution for federated queries and multi-table join queries. And in this case, it fell below expectations in high-concurrency scenarios.
+- **Apache Druid** implements idempotent writing so it does not support data
updating or deletion itself. That means when there is something wrong at the
upstream, you will need a full data replacement. And such data fixing is a
multi-step process if you think it all the way through, because of all the data
backups and movements. Plus, newly ingested data will not be accessible for
queries until it is put in segments in Druid. That means a longer window such
that data inconsistency betwe [...]
+
+As they work together, this architecture might be too demanding to navigate
because it requires knowledge of all these components in terms of development,
monitoring, and maintenance. Also, every time the user scales a cluster, they
must stop the current cluster and migrate all databases and tables, which is
not only a big undertaking but also a huge interruption to business.
+
+
+
+Apache Doris fills these gaps.
+
+- **Query performance**: Doris is good at high-concurrency queries and join
queries, and it is now equipped with inverted index to speed up searches in
logs.
+- **Data update**: The Unique Key model of Doris supports both large-volume updates and high-frequency real-time writing, and the Duplicate Key model and Unique Key model support partial column update. It also provides an exactly-once guarantee in data writing and ensures consistency between base tables, materialized views, and replicas.
+- **Maintenance**: Doris is MySQL-compatible. It supports easy scaling and
light schema change. It comes with its own integration tools such as
Flink-Doris-Connector and Spark-Doris-Connector.
+
+So they plan on the migration.
+
+## The Replacement Surgery
+
+ClickHouse was the main performance bottleneck in the old data architecture and the reason why the user wanted the change in the first place, so they started with ClickHouse.
+
+### Changes in SQL statements
+
+**Table creation statements**
+
+
+
+The user built their own SQL rewriting tool that can convert a ClickHouse
table creation statement into a Doris table creation statement. The tool can
automate the following changes:
+
+- **Mapping the field types**: It converts ClickHouse field types into the
corresponding ones in Doris. For example, it converts String as a Key into
Varchar, and String as a partitioning field into Date V2.
+- **Setting the number of historical partitions in dynamic partitioning
tables**: Some tables have historical partitions and the number of partitions
should be specified upon table creation in Doris, otherwise a "No Partition"
error will be thrown.
+- **Determining the number of buckets**: It decides the number of buckets
based on the data volume of historical partitions; for non-partitioned tables,
it decides the bucketing configurations based on the historical data volume.
+- **Determining TTL**: It decides the time to live of partitions in dynamic
partitioning tables.
+- **Setting the import sequence**: For the Unique Key model of Doris, it can
specify the data import order based on the Sequence column to ensure
orderliness in data ingestion.
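The field-type mapping step can be pictured as a small lookup table. Only the String rules come from this article; the other entries and the function name are illustrative assumptions, not the user's actual tool:

```python
# Illustrative ClickHouse-to-Doris type mapping. Only the String rules
# are from the article; the numeric entries are assumed for the sketch.
CH_TO_DORIS = {
    "String": "VARCHAR",
    "Int32": "INT",
    "Int64": "BIGINT",
    "Float64": "DOUBLE",
}

def map_type(ch_type: str, is_partition_field: bool = False) -> str:
    # Per the article: String as a partitioning field becomes Date V2.
    if ch_type == "String" and is_partition_field:
        return "DATEV2"
    # Unknown types pass through unchanged for manual review.
    return CH_TO_DORIS.get(ch_type, ch_type)

print(map_type("String"))        # VARCHAR
print(map_type("String", True))  # DATEV2
```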
+
+
+
+**Query statements**
+
+Similarly, they have their own tool to transform the ClickHouse query
statements into Doris query statements. This is to prepare for the comparison
test between ClickHouse and Doris. The key considerations in the conversions
include:
+
+- **Conversion of table names**: This is simple given the mapping rules in
table creation statements.
+- **Conversion of functions**: For example, the `COUNTIF` function in ClickHouse is equivalent to `SUM(CASE WHEN ... THEN 1 ELSE 0)`, `Array Join` is equivalent to `Explode` and `Lateral View`, and `ORDER BY` and `GROUP BY` should be converted to window functions.
+- **Difference in semantics**: ClickHouse goes by its own protocol while Doris is MySQL-compatible, so subqueries need aliases. In this use case, subqueries are common in customer segmentation, so they use `sqlparse`.
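As a flavor of what such function conversions look like, here is a minimal regex-based sketch of the `COUNTIF` rewrite. It is illustrative only (the user's tool is built on `sqlparse` and handles far more cases, including nested expressions this regex would miss):

```python
import re

# Rewrite ClickHouse countIf(cond) into the ANSI-SQL equivalent
# SUM(CASE WHEN cond THEN 1 ELSE 0 END). Handles only simple,
# non-nested argument expressions; a sketch, not a full rewriter.
def rewrite_countif(sql: str) -> str:
    return re.sub(
        r"countIf\(([^()]*)\)",
        r"SUM(CASE WHEN \1 THEN 1 ELSE 0 END)",
        sql,
        flags=re.IGNORECASE,
    )

print(rewrite_countif("SELECT countIf(status = 404) FROM logs"))
# SELECT SUM(CASE WHEN status = 404 THEN 1 ELSE 0 END) FROM logs
```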
+
+### Changes in data ingestion methods
+
+
+
+Apache Doris provides broad options of data writing methods. For the real-time
link, the user adopts Stream Load to ingest data from NSQ and Kafka.
+
+For the sizable offline data, the user tested different methods and here are the takeaways:
+
+1. **Insert Into**
+
+Using Multi-Catalog to read external data sources and ingesting with Insert
Into can serve most needs in this use case.
+
+2. **Stream Load**
+
+The Spark-Doris-Connector is a more general method. It can handle large data
volumes and ensure writing stability. The key is to find the right writing pace
and parallelism.
+
+The Spark-Doris-Connector also supports Bitmap. It allows you to offload the computation workload of Bitmap data to Spark clusters.
+
+Both the Spark-Doris-Connector and the Flink-Doris-Connector rely on Stream
Load. CSV is the recommended format choice. Tests on the user's billions of
rows showed that CSV was 40% faster than JSON.
+
+3. **Spark Load**
+
+The Spark Load method utilizes Spark resources for data shuffling and sorting. The computation results are put in HDFS, and then Doris reads the files from HDFS directly (via Broker Load). This approach is ideal for huge data ingestion: the more data there is, the faster and more resource-efficient the ingestion is.
+
+## Pressure Test
+
+The user compared the performance of the two components in their SQL and join query scenarios, and measured the CPU and memory consumption of Apache Doris.
+
+### SQL query performance
+
+Apache Doris outperformed ClickHouse in 10 of the 16 SQL queries, and the
biggest performance gap was a ratio of almost 30. Overall, Apache Doris was 2~3
times faster than ClickHouse.
+
+
+
+### Join query performance
+
+For join query tests, the user used different sizes of main tables and
dimension tables.
+
+- **Primary tables**: user activity table (4 billion rows), user attribute
table (25 billion rows), and user attribute table (96 billion rows)
+- **Dimension tables**: 1 million rows, 10 million rows, 50 million rows, 100
million rows, 500 million rows, 1 billion rows, and 2.5 billion rows.
+
+The tests include **full join queries** and **filtering join queries**. Full
join queries join all rows of the primary table and dimension tables, while
filtering join queries retrieve data of a certain seller ID with a `WHERE`
filter. The results are concluded as follows:
+
+**Primary table (4 billion rows):**
+
+- Full join queries: Doris outperformed ClickHouse in full join queries with all dimension tables. The performance gap widened as the dimension tables got larger. The largest difference was a ratio of 5.
+- Filtering join queries: Based on the seller ID, the filter screened out 41 million rows from the primary table. With small dimension tables, Doris was 2~3 times faster than ClickHouse; with large dimension tables, Doris was over 10 times faster; with dimension tables larger than 100 million rows, ClickHouse threw an OOM error while Doris functioned normally.
+
+**Primary table (25 billion rows):**
+
+- Full join queries: Doris outperformed ClickHouse in full join queries with all dimension tables. ClickHouse produced an OOM error with dimension tables larger than 50 million rows.
+- Filtering join queries: The filter screened out 570 million rows from the primary table. Doris responded within seconds, while ClickHouse took minutes and broke down when joining large dimension tables.
+
+**Primary table (96 billion rows):**
+
+Doris delivered relatively quick performance in all queries, while ClickHouse was unable to execute them.
+
+In terms of CPU and memory consumption, Apache Doris maintained stable cluster
loads in all sizes of join queries.
+
+## Future Directions
+
+As the migration goes on, the user works closely with the [Doris community](https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-1t3wfymur-0soNPATWQ~gbU8xutFOLog), and their feedback has contributed to the making of [Apache Doris 2.0.0](https://doris.apache.org/docs/dev/releasenotes/release-2.0.0/). We will continue assisting them in their migration from Kylin and Druid to Doris, and we look forward to seeing their Doris-based unified data platform come into being.
diff --git a/blog/release-note-2.0.0.md b/blog/release-note-2.0.0.md
index 60fed524577..626f2ba7f50 100644
--- a/blog/release-note-2.0.0.md
+++ b/blog/release-note-2.0.0.md
@@ -188,9 +188,9 @@ To separate storage and computation is a way to realize
elastic scaling of resou
Apache Doris 2.0 provides two solutions to address the needs of the first two
types of users.
1. **Compute nodes**. We introduced stateless compute nodes in version 2.0.
Unlike the mix nodes, the compute nodes do not save any data and are not
involved in workload balancing of data tablets during cluster scaling. Thus,
they are able to quickly join the cluster and share the computing pressure
during peak times. In addition, in data lakehouse analysis, these nodes will be
the first ones to execute queries on remote storage (HDFS/S3) so there will be
no resource competition between [...]
- 1. Doc: https://doris.apache.org/docs/dev/advanced/compute_node/
+ 1. Doc: https://doris.apache.org/docs/dev/advanced/compute-node
2. **Hot-cold data separation**. Hot/cold data refers to data that is
frequently/seldom accessed, respectively. Generally, it makes more sense to
store cold data in low-cost storage. Older versions of Apache Doris support
lifecycle management of table partitions: As hot data cooled down, it would be
moved from SSD to HDD. However, data was stored with multiple replicas on HDD,
which was still a waste. Now, in Apache Doris 2.0, cold data can be stored in
object storage, which is even chea [...]
- 1. Read more:
https://doris.apache.org/blog/Tiered-Storage-for-Hot-and-Cold-Data:-What,-Why,-and-How?/
+ 1. Read more:
https://doris.apache.org/blog/Tiered-Storage-for-Hot-and-Cold-Data-What-Why-and-How
For neater separate of computation and storage, the VeloDB team is going to
contribute the Cloud Compute-Storage-Separation solution to the Apache Doris
project. The performance and stability of it has stood the test of hundreds of
companies in their production environment. The merging of code will be finished
by October this year, and all Apache Doris users will be able to get an early
taste of it in September.
diff --git a/static/images/Introduction_3.png b/static/images/Introduction_3.png
new file mode 100644
index 00000000000..7e3e112c235
Binary files /dev/null and b/static/images/Introduction_3.png differ
diff --git a/static/images/Introduction_4.png b/static/images/Introduction_4.png
new file mode 100644
index 00000000000..4d7c893cf8e
Binary files /dev/null and b/static/images/Introduction_4.png differ
diff --git a/static/images/LAS_1.png b/static/images/LAS_1.png
new file mode 100644
index 00000000000..c7f75990173
Binary files /dev/null and b/static/images/LAS_1.png differ
diff --git a/static/images/LAS_2.png b/static/images/LAS_2.png
new file mode 100644
index 00000000000..c86d90a3c1a
Binary files /dev/null and b/static/images/LAS_2.png differ
diff --git a/static/images/LAS_3.png b/static/images/LAS_3.png
new file mode 100644
index 00000000000..6d3194f1e11
Binary files /dev/null and b/static/images/LAS_3.png differ
diff --git a/static/images/LAS_4.png b/static/images/LAS_4.png
new file mode 100644
index 00000000000..e298b60940e
Binary files /dev/null and b/static/images/LAS_4.png differ
diff --git a/static/images/LAS_5.png b/static/images/LAS_5.png
new file mode 100644
index 00000000000..eedf9bc8ab1
Binary files /dev/null and b/static/images/LAS_5.png differ
diff --git a/static/images/introduction_1.png b/static/images/introduction_1.png
new file mode 100644
index 00000000000..0568bba99ec
Binary files /dev/null and b/static/images/introduction_1.png differ
diff --git a/static/images/introduction_10.png
b/static/images/introduction_10.png
new file mode 100644
index 00000000000..da0624ae8bb
Binary files /dev/null and b/static/images/introduction_10.png differ
diff --git a/static/images/introduction_11.png
b/static/images/introduction_11.png
new file mode 100644
index 00000000000..e98f99cd01e
Binary files /dev/null and b/static/images/introduction_11.png differ
diff --git a/static/images/introduction_2.png b/static/images/introduction_2.png
new file mode 100644
index 00000000000..ecaa8adec66
Binary files /dev/null and b/static/images/introduction_2.png differ
diff --git a/static/images/introduction_5.png b/static/images/introduction_5.png
new file mode 100644
index 00000000000..5616299f505
Binary files /dev/null and b/static/images/introduction_5.png differ
diff --git a/static/images/introduction_6.png b/static/images/introduction_6.png
new file mode 100644
index 00000000000..71a83250a1b
Binary files /dev/null and b/static/images/introduction_6.png differ
diff --git a/static/images/introduction_7.png b/static/images/introduction_7.png
new file mode 100644
index 00000000000..d56b89fb340
Binary files /dev/null and b/static/images/introduction_7.png differ
diff --git a/static/images/introduction_8.png b/static/images/introduction_8.png
new file mode 100644
index 00000000000..b7de900ceb3
Binary files /dev/null and b/static/images/introduction_8.png differ
diff --git a/static/images/introduction_9.png b/static/images/introduction_9.png
new file mode 100644
index 00000000000..ef2d4401594
Binary files /dev/null and b/static/images/introduction_9.png differ
diff --git a/static/images/youzan-1.png b/static/images/youzan-1.png
new file mode 100644
index 00000000000..a966e51cd81
Binary files /dev/null and b/static/images/youzan-1.png differ
diff --git a/static/images/youzan-2.png b/static/images/youzan-2.png
new file mode 100644
index 00000000000..cf64e1e3808
Binary files /dev/null and b/static/images/youzan-2.png differ
diff --git a/static/images/youzan-3.png b/static/images/youzan-3.png
new file mode 100644
index 00000000000..e90fb32534a
Binary files /dev/null and b/static/images/youzan-3.png differ
diff --git a/static/images/youzan-4.png b/static/images/youzan-4.png
new file mode 100644
index 00000000000..9dda8649452
Binary files /dev/null and b/static/images/youzan-4.png differ
diff --git a/static/images/youzan-5.png b/static/images/youzan-5.png
new file mode 100644
index 00000000000..48b3ef45a92
Binary files /dev/null and b/static/images/youzan-5.png differ
diff --git a/static/images/youzan-6.png b/static/images/youzan-6.png
new file mode 100644
index 00000000000..46f46a68a5a
Binary files /dev/null and b/static/images/youzan-6.png differ
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]