This is an automated email from the ASF dual-hosted git repository.
luzhijing pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new c70bf3ad4f3 add blog (#343)
c70bf3ad4f3 is described below
commit c70bf3ad4f369b767b99eb9f69ae938e3e8bbfc2
Author: Hu Yanjun <[email protected]>
AuthorDate: Wed Nov 22 11:15:28 2023 +0800
add blog (#343)
---
...instead-of-clickhouse-mysql-presto-and-hbase.md | 101 +++++++++++++++++++++
...arehouse-mysql-clickhouse-hbase-hive-presto.png | Bin 0 -> 235723 bytes
...fied-data-warehouse-kafka-apache-doris-hive.png | Bin 0 -> 246774 bytes
3 files changed, 101 insertions(+)
diff --git
a/blog/less-components-higher-performance-apache-doris-instead-of-clickhouse-mysql-presto-and-hbase.md
b/blog/less-components-higher-performance-apache-doris-instead-of-clickhouse-mysql-presto-and-hbase.md
new file mode 100644
index 00000000000..0ce5fd0ef5f
--- /dev/null
+++
b/blog/less-components-higher-performance-apache-doris-instead-of-clickhouse-mysql-presto-and-hbase.md
@@ -0,0 +1,101 @@
+---
+{
+ 'title': 'Less Components, Higher Performance: Apache Doris Instead of
ClickHouse, MySQL, Presto, and HBase',
+ 'summary': "This post is about building a unified OLAP platform. An
insurance company tries to build a data warehouse that can undertake all their
customer-facing, analyst-facing, and management-facing data analysis
workloads.",
+ 'date': '2023-11-22',
+ 'author': 'Big Data Platform R&D Team of CIGNA&CMB',
+ 'tags': ['Best Practice'],
+}
+
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+This post is about building a unified OLAP platform. An insurance company
tries to build a data warehouse that can undertake all their customer-facing,
analyst-facing, and management-facing data analysis workloads. The main tasks
include:
+
+- **Self-service insurance contract query**: This is for insurance customers
to check their contract details by their contract ID. It should also support
filters such as coverage period, insurance types, and claim amount.
+- **Multi-dimensional analysis**: Analysts develop reports based on whatever data dimensions they need, so they can extract insights that facilitate product innovation and anti-fraud efforts.
+- **Dashboarding**: This is to create a visual overview of insurance sales trends and horizontal and vertical comparisons of different metrics.
+
+## Component-Heavy Data Architecture
+
+The user started with the Lambda architecture, splitting their data pipeline into a batch processing link and a stream processing link. For real-time data streaming, they apply Flink CDC; for batch imports, they combine Sqoop, Python, and DataX into a self-built data integration tool named Hisen.
+
+![Multi-component data warehouse with MySQL, ClickHouse, HBase, Hive, and Presto](/images/multi-component-data-warehouse-mysql-clickhouse-hbase-hive-presto.png)
+
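+For illustration, here is a minimal sketch of what the stream-processing link could look like, assuming Flink SQL with the `mysql-cdc` connector; the hostname, credentials, and table schema below are made up for this example:
+
+```sql
+-- Hypothetical Flink SQL source: capture row changes from a MySQL table.
+-- Connection details and columns are assumptions for this sketch.
+CREATE TABLE contract_src (
+    contract_id   BIGINT,
+    customer_id   BIGINT,
+    claim_amount  DECIMAL(16, 2),
+    coverage_end  DATE,
+    PRIMARY KEY (contract_id) NOT ENFORCED
+) WITH (
+    'connector'     = 'mysql-cdc',
+    'hostname'      = 'mysql-host',
+    'port'          = '3306',
+    'username'      = 'flink_user',
+    'password'      = '******',
+    'database-name' = 'insurance',
+    'table-name'    = 'contract'
+);
+```
+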
+Then, the real-time and offline data meet in the data warehousing layer, which is made up of five components.
+
+**ClickHouse**
+
+The data warehouse uses a flat-table design, and ClickHouse is superb at reading flat tables. But as the business evolves, things become challenging in two ways:
+
+- To support cross-table joins and point queries, the user requires the star
schema, but that's difficult to implement in ClickHouse.
+- Changes in insurance contracts need to be updated in the data warehouse in
real time. In ClickHouse, that is done by recreating a flat table to overwrite
the old one, which is not fast enough.
+
+**MySQL**
+
+After calculation, data metrics are stored in MySQL, but as the data size grows, MySQL starts to struggle, with problems such as prolonged execution times and query errors.
+
+**Apache Hive + Presto**
+
+Hive is the main executor in the batch processing link. It transforms, aggregates, and queries offline data. Presto complements Hive for interactive analysis.
+
+**Apache HBase**
+
+HBase handles primary key queries. It reads customer status data from MySQL and Hive, including customer credits, coverage periods, and sums insured. However, since HBase does not support secondary indexes, its ability to read non-primary-key columns is limited. Plus, as a NoSQL database, HBase does not support SQL statements.
+
+These components have to work in conjunction to serve all needs, which makes the data warehouse a lot to take care of. It is not easy to get started with, because engineers must be trained on all of these components, and the complexity of the architecture adds to the risk of latency.
+
+So the user looked for a tool that ticks more boxes in fulfilling their requirements. The first thing they need is real-time capabilities, including real-time writing, real-time updating, and real-time responses to data queries. Secondly, they need more flexibility in data analysis to support customer-facing self-service queries, such as multi-dimensional analysis, join queries on large tables, primary key indexes, roll-ups, and drill-downs. Then, for batch processing, they also want [...]
+
+They eventually made up their mind with [Apache
Doris](https://doris.apache.org/).
+
+## Replacing Four Components with Apache Doris
+
+Apache Doris is capable of both real-time and offline data analysis, and it supports both high-throughput interactive analysis and high-concurrency point queries. That's why it can replace ClickHouse, MySQL, Presto, and Apache HBase and work as the unified query gateway for the entire data system.
+
+![Unified data warehouse with Kafka, Apache Doris, and Hive](/images/unified-data-warehouse-kafka-apache-doris-hive.png)
+
+The improved data pipeline is a much cleaner Lambda architecture.
+
+Apache Doris provides a wide range of data ingestion methods and is quick in data writing. On top of this, it also implements Merge-on-Write to improve its performance on concurrent point queries.
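+
+As a minimal sketch, a Unique Key table with the Merge-on-Write property enabled could look like this; the table name, columns, and bucket count are hypothetical:
+
+```sql
+-- Hypothetical Doris table: Unique Key model with Merge-on-Write,
+-- so updates are resolved at write time and point queries stay fast.
+CREATE TABLE contract (
+    contract_id    BIGINT,
+    customer_id    BIGINT,
+    insurance_type VARCHAR(64),
+    claim_amount   DECIMAL(16, 2),
+    coverage_end   DATE
+)
+UNIQUE KEY(contract_id)
+DISTRIBUTED BY HASH(contract_id) BUCKETS 16
+PROPERTIES (
+    "replication_num" = "1",  -- single replica, just for this sketch
+    "enable_unique_key_merge_on_write" = "true"
+);
+```
+
+With Merge-on-Write, each key's versions are merged as data is written, so reads don't pay the merge cost at query time.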
+
+**Reduced Cost**
+
+The new architecture has reduced the user's labor costs. For one thing, the much simpler data architecture leads to much easier maintenance; for another, developers no longer need to join the real-time and offline data in the data serving API.
+
+The user can also save money with Doris because it supports tiered storage: it allows the user to put their huge amount of rarely accessed historical data in object storage, which is a much cheaper place to keep it.
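+
+A rough sketch of how this could be configured, assuming an S3-compatible bucket; the resource name, endpoint, bucket, and cooldown TTL below are all hypothetical:
+
+```sql
+-- Hypothetical setup: cool rarely accessed data down to object storage.
+CREATE RESOURCE "remote_s3"
+PROPERTIES (
+    "type"          = "s3",
+    "s3.endpoint"   = "s3.example.com",
+    "s3.region"     = "us-east-1",
+    "s3.bucket"     = "doris-cold-data",
+    "s3.root.path"  = "history/",
+    "s3.access_key" = "...",
+    "s3.secret_key" = "..."
+);
+
+CREATE STORAGE POLICY cold_data_policy
+PROPERTIES (
+    "storage_resource" = "remote_s3",
+    "cooldown_ttl"     = "1d"  -- TTL is arbitrary for this sketch
+);
+
+ALTER TABLE contract SET ("storage_policy" = "cold_data_policy");
+```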
+
+**Higher Efficiency**
+
+Apache Doris can reach tens of thousands of QPS and respond to point queries on billions of rows within milliseconds, so the customer-facing queries are easy for it to handle. Tiered storage, which separates hot data from cold, also increases query efficiency.
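+
+For instance, the self-service contract query from earlier reduces to a primary-key point query plus a few filters (the table and columns are the hypothetical ones sketched above):
+
+```sql
+-- A customer-facing point query: look up one contract by its key,
+-- then filter on coverage period and claim amount.
+SELECT contract_id, insurance_type, coverage_end, claim_amount
+FROM contract
+WHERE contract_id = 1234567890
+  AND coverage_end >= '2023-01-01'
+  AND claim_amount > 0;
+```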
+
+**Service Availability**
+
+As a unified data warehouse for storage, computation, and data services, Apache Doris allows for easy disaster recovery. With fewer components, they don't have to worry about data loss or duplication.
+
+An important guarantee of service availability for the user is the
Cross-Cluster Replication (CCR) capability of Apache Doris. It can synchronize
data from cluster to cluster within minutes or even seconds, and it implements
two mechanisms to ensure data reliability:
+
+- **Binlog**: This mechanism automatically logs data changes and generates a LogID for each data modification operation, and the incremental LogIDs make sure that data changes are traceable and ordered. (A sketch of how binlog is enabled follows this list.)
+- **Data persistence**: In the case of a system meltdown or other emergency, data is persisted to disk.
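+
+As a minimal sketch, using the hypothetical names from above, binlog is switched on per table or per database so that a CCR job can replay the changes on the target cluster:
+
+```sql
+-- Hypothetical: enable binlog so CCR can replicate this table's changes.
+ALTER TABLE contract SET ("binlog.enable" = "true");
+
+-- Or enable it for a whole database:
+ALTER DATABASE insurance_dwh SET PROPERTIES ("binlog.enable" = "true");
+```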
+
+## A Deeper Look into Apache Doris
+
+Apache Doris can replace ClickHouse, MySQL, Presto, and HBase because it has a comprehensive collection of capabilities all along the data processing pipeline. In data ingestion, it enables low-latency real-time writing based on its support for Flink CDC and Merge-on-Write. It guarantees exactly-once writing with its Label mechanism and transactional loading. In data queries, it supports both the star schema and flat table aggregation, so it can provide high performance in both multi-tab [...]
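+
+The Label mechanism can be illustrated with a small sketch (the label and table names are hypothetical): every load job carries a user-defined label, and Doris rejects a repeated label, so a retried job cannot write the same data twice:
+
+```sql
+-- Hypothetical labeled load: if this statement is retried with the same
+-- label, Doris rejects the duplicate, keeping the write exactly-once.
+INSERT INTO contract WITH LABEL contract_load_20231122_001
+SELECT * FROM contract_staging;
+```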
\ No newline at end of file
diff --git
a/static/images/multi-component-data-warehouse-mysql-clickhouse-hbase-hive-presto.png
b/static/images/multi-component-data-warehouse-mysql-clickhouse-hbase-hive-presto.png
new file mode 100644
index 00000000000..1cb5389631b
Binary files /dev/null and
b/static/images/multi-component-data-warehouse-mysql-clickhouse-hbase-hive-presto.png
differ
diff --git a/static/images/unified-data-warehouse-kafka-apache-doris-hive.png
b/static/images/unified-data-warehouse-kafka-apache-doris-hive.png
new file mode 100644
index 00000000000..428837ba026
Binary files /dev/null and
b/static/images/unified-data-warehouse-kafka-apache-doris-hive.png differ
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]