docs...

lidong Tue, 06 Jul 2021 00:51:13 -0700

Modified: kylin/site/feed.xml
URL: 
http://svn.apache.org/viewvc/kylin/site/feed.xml?rev=1891303&r1=1891302&r2=1891303&view=diff
==============================================================================
--- kylin/site/feed.xml (original)
+++ kylin/site/feed.xml Tue Jul  6 07:50:56 2021
@@ -19,153 +19,100 @@
     <description>Apache Kylin Home</description>
     <link>http://kylin.apache.org/</link>
     <atom:link href="http://kylin.apache.org/feed.xml" rel="self" 
type="application/rss+xml"/>
-    <pubDate>Wed, 30 Jun 2021 19:37:02 -0700</pubDate>
-    <lastBuildDate>Wed, 30 Jun 2021 19:37:02 -0700</lastBuildDate>
+    <pubDate>Tue, 06 Jul 2021 00:25:02 -0700</pubDate>
+    <lastBuildDate>Tue, 06 Jul 2021 00:25:02 -0700</lastBuildDate>
     <generator>Jekyll v2.5.3</generator>
     
       <item>
-        <title>Why did Youzan choose Kylin4</title>
-        <description>&lt;p&gt;At the QCon Global Software Developers 
Conference held on May 29, 2021, Zheng Shengjun, head of Youzan's data 
infrastructure platform, shared Youzan's internal experience with and 
optimization of Kylin 4.0 in the session on open source big data 
frameworks and applications. &lt;br /&gt;
-For many users of Kylin 2/3 (Kylin on HBase), this is also a chance to learn how 
and why to upgrade to Kylin 4.&lt;/p&gt;
-
-&lt;p&gt;This sharing is mainly divided into the following parts:&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;The reason for choosing Kylin 4&lt;/li&gt;
-  &lt;li&gt;Introduction to Kylin 4&lt;/li&gt;
-  &lt;li&gt;How to optimize performance of Kylin 4&lt;/li&gt;
-  &lt;li&gt;Practice of Kylin 4 in Youzan&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;h2 id=&quot;the-reason-for-choosing-kylin-4&quot;&gt;01 The reason for 
choosing Kylin 4&lt;/h2&gt;
-
-&lt;h3 id=&quot;introduction-to-youzan&quot;&gt;Introduction to 
Youzan&lt;/h3&gt;
-&lt;p&gt;China Youzan Co., Ltd. (stock code 08083.HK) is an enterprise mainly 
engaged in retail technology services.&lt;br /&gt;
-At present, it owns several tools and solutions to provide SaaS software 
products and talent services to help merchants operate mobile social e-commerce 
and new retail channels in an all-round way. &lt;br /&gt;
-Currently Youzan has hundreds of millions of consumers and 6 million existing 
merchants.&lt;/p&gt;
-
-&lt;h3 id=&quot;history-of-kylin-in-youzan&quot;&gt;History of Kylin in 
Youzan&lt;/h3&gt;
-&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/1 
history_of_youzan_OLAP.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;First of all, I would like to share why Youzan chose to upgrade to 
Kylin 4. Here, let me briefly review the history of Youzan's OLAP 
infrastructure.&lt;/p&gt;
-
-&lt;p&gt;In the early days of Youzan, in order to iterate on development 
quickly, we chose the method of pre-computation + MySQL; in 2018, Druid was 
introduced because of query flexibility and development efficiency, but there 
were problems such as low pre-aggregation and no support for precise count 
distinct measures. In this situation, Youzan introduced Apache Kylin and 
ClickHouse. Kylin supports high aggregation, precise count distinct measures, 
and the lowest RT, while ClickHouse is quite flexible in usage (ad hoc 
queries).&lt;/p&gt;
-
-&lt;p&gt;From the introduction of Kylin in 2018 to now, Youzan has used Kylin 
for more than three years. With the continuous enrichment of business scenarios 
and the continuous accumulation of data volume, Youzan currently has 6 million 
existing merchants, GMV in 2020 is 107.3 billion, and the daily build data 
volume is 10 billion +. At present, Kylin has basically covered all the 
business scenarios of Youzan.&lt;/p&gt;
-
-&lt;h3 id=&quot;the-challenges-of-kylin-3&quot;&gt;The challenges of Kylin 
3&lt;/h3&gt;
-&lt;p&gt;With Youzan's rapid development and in-depth use of Kylin, we also 
encountered some challenges:&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;First of all, the build performance of Kylin on HBase cannot meet 
our expectations, and build performance affects the user's 
failure recovery time and perceived stability;&lt;/li&gt;
-  &lt;li&gt;Secondly, onboarding more large merchants (tens of 
millions of members in a single store, with hundreds of thousands of goods for 
each store) brings great challenges to our OLAP system. Kylin on HBase 
is limited by the single-point query of Query Server, and cannot support these 
complex scenarios well;&lt;/li&gt;
-  &lt;li&gt;Finally, because HBase is not a cloud-native system, it is 
difficult to scale up and down flexibly. With the continuous 
growth of data volume, business load has peaks and valleys, so 
the average resource utilization rate is not high enough.&lt;/li&gt;
-&lt;/ul&gt;
+        <title>Apache Kylin4 - A new storage and compute architecture</title>
+        <description>&lt;p&gt;This article will discuss three aspects of 
Apache Kylin: First, we will briefly introduce query principles of Apache 
Kylin. Next, we will introduce Apache Parquet Storage, a project our team has 
been involved in that Kyligence is contributing back to the open source 
software community by the end of this year (2020). Finally, we will introduce 
the extensive use of precise count distinct by community users as well as its 
implementation in Kylin and some extensions.&lt;/p&gt;
 
-&lt;p&gt;Faced with these challenges, Youzan chose to move closer and upgrade 
to the more cloud-native Apache Kylin 4.&lt;/p&gt;
+&lt;h2 id=&quot;introduction-to-apache-kylin&quot;&gt;01 Introduction to 
Apache Kylin&lt;/h2&gt;
+&lt;p&gt;Apache Kylin is an open source distributed analysis engine that 
provides SQL query interfaces above Hadoop/Spark and OLAP capabilities to 
support extremely large data. It was initially developed at eBay Inc. and 
contributed to the open source software community. It can query massive 
relational tables with sub-second response times. &lt;br /&gt;
+&lt;img src=&quot;/images/blog/kylin4/1 apache_kylin_introduction.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
 
-&lt;h2 id=&quot;introduction-to-kylin-4&quot;&gt;02 Introduction to Kylin 
4&lt;/h2&gt;
-&lt;p&gt;First of all, let's introduce the main advantages of Kylin 4. 
Apache Kylin 4 completely depends on Spark for cubing jobs and queries. It can 
make full use of Spark's parallelization, vectorization, and global 
dynamic code generation technologies to improve the efficiency of large 
queries.&lt;br /&gt;
-Here is a brief introduction to the principles of Kylin 4: the storage 
engine, build engine and query engine.&lt;/p&gt;
+&lt;p&gt;As a SQL acceleration layer, Kylin can connect with various data 
sources such as Hive and Kafka, and can connect with commonly used BI systems 
such as Tableau and Power BI. It can also be queried directly (ad hoc) using 
standard SQL tools.&lt;/p&gt;
 
-&lt;h3 id=&quot;storage-engine&quot;&gt;Storage engine&lt;/h3&gt;
-&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/2 kylin4_storage.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+&lt;p&gt;If you find yourself confronted by unhappy BI users for any of the 
following reasons, you should consider using Apache Kylin:  &lt;br /&gt;
+- Their batch queries are too slow &lt;br /&gt;
+- Query or user concurrency should be higher &lt;br /&gt;
+- Resource usage should be lower &lt;br /&gt;
+- The system doesn't fully support SQL syntax &lt;br /&gt;
+- The system doesn't seamlessly integrate with their favorite BI 
tools&lt;/p&gt;
 
-&lt;p&gt;First of all, let's take a look at the new storage engine and compare 
Kylin on HBase with Kylin on Parquet. The cuboid data of 
Kylin on HBase is stored in HBase tables. A single segment corresponds to 
one HBase table. Aggregation is pushed down to the HBase coprocessor.&lt;/p&gt;
+&lt;h2 id=&quot;apache-kylin-rationale&quot;&gt;02 Apache Kylin 
Rationale&lt;/h2&gt;
+&lt;p&gt;Kylin's core idea is the precomputation of result sets, meaning it 
calculates all possible query results in advance according to the specified 
dimensions and measures, trading space for time to speed up OLAP queries with 
fixed query patterns. &lt;br /&gt;
+&lt;img src=&quot;/images/blog/kylin4/2 cube_vs_cuboid.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
 
-&lt;p&gt;But as we know,  HBase is not a real Columnar Storage and its 
throughput is not enough for OLAP System. Kylin 4 replaces HBase with Parquet, 
all the data is stored in files. Each segment will have a corresponding HDFS 
directory. All queries and cubing jobs read and write files without HBase . 
Although there will be a certain loss of performance for simple queries, the 
improvement brought about by complex queries is more considerable and 
worthwhile.&lt;/p&gt;
+&lt;p&gt;Kylin's design is based on cube theory. Each combination of 
dimensions is called a cuboid and the set of all cuboids is a cube. The cuboid 
composed of all dimensions is called the base cuboid; the cuboid (time, item, 
location, supplier) shown in the figure is an example of this. All cuboids 
can be calculated from the base cuboid. A cuboid can be understood as a wide 
table after precomputation. During the query, Kylin will automatically select 
the most suitable cuboid that meets the query requirements. &lt;br /&gt;
+&lt;img src=&quot;/images/blog/kylin4/3 cuboid_selected_for_query.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
 
-&lt;h3 id=&quot;build-engine&quot;&gt;Build engine&lt;/h3&gt;
-&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/3 kylin4_build_engine.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+&lt;p&gt;For example, the query in the above figure will look for the cuboid 
(time, item, location). Compared with the calculation from the user's 
original table, the calculation from the cuboid can greatly reduce the amount 
of scanned data and calculation.&lt;/p&gt;
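
The cuboid selection described above can be sketched as follows. This is only an illustration, not Kylin's actual planner: the `pick_cuboid` helper and the row counts are hypothetical, and the sketch assumes that the most suitable cuboid is the one with the fewest rows among those that contain every dimension the query needs.

```python
from typing import Dict, FrozenSet, Iterable


def pick_cuboid(cuboids: Dict[FrozenSet[str], int],
                query_dims: Iterable[str]) -> FrozenSet[str]:
    """Return the smallest cuboid whose dimensions cover the query (hypothetical helper)."""
    needed = frozenset(query_dims)
    # A cuboid is usable only if it contains every dimension the query refers to.
    candidates = {dims: rows for dims, rows in cuboids.items() if needed <= dims}
    if not candidates:
        raise LookupError(f"no cuboid covers dimensions {sorted(needed)}")
    # Fewer precomputed rows means less data to scan and re-aggregate.
    return min(candidates, key=candidates.get)


# Made-up row counts for a cube over (time, item, location, supplier).
cuboids = {
    frozenset({"time", "item", "location", "supplier"}): 1_000_000,  # base cuboid
    frozenset({"time", "item", "location"}): 200_000,
    frozenset({"time", "item"}): 50_000,
    frozenset({"item"}): 1_000,
}

chosen = pick_cuboid(cuboids, ["time", "item", "location"])
print(sorted(chosen))
```

The base cuboid also covers the query, but the narrower (time, item, location) cuboid wins because it holds fewer rows.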
 
-&lt;p&gt;The second is the new build engine. Based on our test, the build 
speed of Kylin on Parquet has been optimized from 82 minutes to 15 minutes. 
There are several reasons:&lt;/p&gt;
+&lt;h2 id=&quot;apache-kylin-basic-query-process&quot;&gt;03 Apache Kylin 
Basic Query Process&lt;/h2&gt;
+&lt;p&gt;Let's look briefly at the rationale of Kylin queries. The first 
three steps are the routine operations of all query engines. We use the Apache 
Calcite framework to complete these steps. We will not go into great detail 
here but, should you wish to learn more, there is plenty of related material 
online.  &lt;br /&gt;
+&lt;img src=&quot;/images/blog/kylin4/4 apache_kylin_query_process.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
 
-&lt;ul&gt;
-  &lt;li&gt;Kylin 4 removes the encoding of the dimension, eliminating a 
building step of encoding;&lt;/li&gt;
-  &lt;li&gt;Removed the HBase File generation step;&lt;/li&gt;
-  &lt;li&gt;Kylin on Parquet changes the granularity of cubing to cuboid 
level, which is conducive to further improving parallelism of cubing 
job.&lt;/li&gt;
-  &lt;li&gt;Enhanced implementation for global dictionary. In the new 
algorithm, dictionary and source data are hashed into the same buckets, making 
it possible for loading only piece of dictionary bucket to encode source 
data.&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;p&gt;As you can see on the right, after upgradation to Kylin 4, cubing job 
changes from ten steps to two steps, the performance improvement of the 
construction is very obvious.&lt;/p&gt;
+&lt;p&gt;The introduction here focuses on the last two steps: Kylin adaptation 
and query execution. Why do we need to do Kylin adaptation? Because the query 
plan we obtained earlier is converted directly from the user's query, 
so it cannot directly query the precomputed data. Here, a 
rewrite is needed to create an execution plan that can query the 
precomputed data (i.e. cube data). Let's look at the following example: 
&lt;br /&gt;
+&lt;img src=&quot;/images/blog/kylin4/5 query_using_precomputed_data.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
 
-&lt;h3 id=&quot;query-engine&quot;&gt;Query engine&lt;/h3&gt;
-&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/4 kylin4_query.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;Next is the new query engine of Kylin 4. As you can see, the 
calculation of Kylin on HBase is completely dependent on the coprocessor of 
HBase and the query server process. When the data is read from HBase into the query 
server to do aggregation, sorting, etc., the single query server becomes the 
bottleneck. Kylin 4, by contrast, uses a fully 
distributed query mechanism based on Spark; what's more, it is able to do 
configuration tuning automatically in the Spark query step!&lt;/p&gt;
-
-&lt;h2 id=&quot;how-to-optimize-performance-of-kylin-4&quot;&gt;03 How to 
optimize performance of Kylin 4&lt;/h2&gt;
-&lt;p&gt;Next, I'd like to share some performance optimizations made by 
Youzan in Kylin 4.&lt;/p&gt;
-
-&lt;h3 id=&quot;optimization-of-query-engine&quot;&gt;Optimization of query 
engine&lt;/h3&gt;
-&lt;p&gt;#### 1.Cache Calcite physical plan&lt;br /&gt;
-&lt;img src=&quot;/images/blog/youzan/5 cache_calcite_plan.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;In Kylin4, SQL is parsed, optimized and compiled (code generation) in 
Calcite. This step takes about 150ms for some queries. We have added 
PreparedStatementCache support in Kylin4 to cache the Calcite plan, so that SQL with the 
same structure doesn't have to repeat this step. This optimization saves 
about 150ms per query.&lt;/p&gt;
-
-&lt;h4 id=&quot;tunning-spark-configuration&quot;&gt;2.Tuning Spark 
configuration&lt;/h4&gt;
-&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/6 
tuning_spark_configuration.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;Kylin4 uses Spark as its query engine. As Spark is a distributed engine 
designed for massive data processing, it's inevitable to lose some 
performance for small queries. We have done some tuning to catch up with 
the latency of Kylin on HBase for small queries.&lt;/p&gt;
+&lt;p&gt;The user has a stock of goods. Item and user_id indicate which item 
has been accessed and the user wants to analyze the Page View (PV) of the 
goods. The user defines a cube where the dimension is item and the measure is 
COUNT (user_id). If the user wants to analyze the PV of the goods, he will 
issue the following SQL:&lt;/p&gt;
 
-&lt;p&gt;Our first optimization is to make more calculations finish in memory. 
The key is to avoid data spill during aggregation, shuffle and sort. Tuning the 
following configuration is helpful.&lt;/p&gt;
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT item, COUNT (user_id) FROM 
stock GROUP BY item;  
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
 
-&lt;ul&gt;
-  &lt;li&gt;1.set &lt;code 
class=&quot;highlighter-rouge&quot;&gt;spark.sql.objectHashAggregate.sortBased.fallbackThreshold&lt;/code&gt;
 to a larger value to prevent HashAggregate from falling back to Sort Based Aggregate, 
which really hurts performance when it happens.&lt;/li&gt;
-  &lt;li&gt;2.set &lt;code 
class=&quot;highlighter-rouge&quot;&gt;spark.shuffle.spill.initialMemoryThreshold&lt;/code&gt;
 to a large value to avoid too many spills during shuffle.&lt;/li&gt;
-&lt;/ul&gt;
+&lt;p&gt;After this SQL is sent to Kylin, Kylin cannot directly use its 
original semantics to query our cube data. This is because after the data is 
precomputed, there is only one row of data for each item key. The 
rows with the same item key in the original table have been aggregated in 
advance, generating a new measure column (M_C) that stores how many user_id accesses 
each item key has, so the rewritten SQL will be similar to this:&lt;/p&gt;
 
-&lt;p&gt;Secondly, we route small queries to Query Server which run spark in 
local mode. Because the overhead of task schedule, shuffle read and variable 
broadcast is enlarged for small queries on YARN/Standalone mode.&lt;/p&gt;
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt; SELECT item, SUM (M_C) FROM stock 
GROUP BY item;  
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
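
The rewrite from COUNT to SUM can be checked on toy data. The sketch below is an illustration only: the rows, the extra `location` dimension, and the variable names are made up, with `M_C` standing for the precomputed COUNT(user_id) measure. It assumes the query lands on an (item, location) cuboid, so the partial counts still have to be summed per item.

```python
from collections import Counter, defaultdict

# Hypothetical raw "stock" rows: (item, location, user_id).
raw = [
    ("apple", "east", "u1"), ("apple", "east", "u2"), ("apple", "west", "u3"),
    ("pear", "east", "u1"), ("pear", "west", "u4"), ("pear", "west", "u5"),
]

# Original query on the raw table:
#   SELECT item, COUNT(user_id) FROM stock GROUP BY item
direct = Counter(item for item, _location, _user in raw)

# Precomputation: build the (item, location) cuboid once,
# storing M_C = COUNT(user_id) for each dimension combination.
cuboid = Counter((item, location) for item, location, _user in raw)

# Rewritten query on the cuboid:
#   SELECT item, SUM(M_C) FROM stock GROUP BY item
rewritten = defaultdict(int)
for (item, _location), m_c in cuboid.items():
    rewritten[item] += m_c

print(dict(rewritten))
print(len(raw), "raw rows vs", len(cuboid), "cuboid rows")
```

Both forms give the same per-item totals, but the rewritten one reads the smaller, pre-aggregated cuboid instead of the raw rows.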
 
-&lt;p&gt;Thirdly, we use RAM disk to enhance shuffle performance. Mount RAM 
disk as TMPFS and set spark.local.dir to directory using RAM disk.&lt;/p&gt;
+&lt;p&gt;Why is there another SUM/GROUP BY operation here instead of directly 
fetching the data and returning it? Because the cuboid hit by the 
query may contain more dimensions than just item, meaning it is not an exact match. 
The data needs to be aggregated again over the extra dimensions, but the 
partially aggregated data is still much smaller, and requires much less 
calculation, than the data in the user's original table. If the query 
hits the cuboid exactly, we can skip the Agg/GROUP BY step entirely, 
as shown in the following figure: &lt;br /&gt;
+&lt;img src=&quot;/images/blog/kylin4/6 on-site-computation.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
 
-&lt;p&gt;Lastly, we disabled Spark's whole-stage code generation for small 
queries, because it costs about 100ms~200ms 
and is not beneficial for small queries that are simple 
projections.&lt;/p&gt;
+&lt;p&gt;The above figure is a scenario without precomputation, which requires 
on-site calculation. Agg and Join will involve shuffle, so the performance will 
be poor and more resources will be occupied with large amounts of data, which 
will affect the concurrency of queries. &lt;br /&gt;
+&lt;img src=&quot;/images/blog/kylin4/7 on-site-computation.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
 
-&lt;h4 id=&quot;parquet-optimization&quot;&gt;3.Parquet optimization&lt;/h4&gt;
-&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/7 
parquet_optimization.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
+&lt;p&gt;After the precomputation, the previously most time-consuming two-step 
operation (Agg/Join) disappeared from the rewritten execution plan, showing a 
cuboid precise match. Additionally, when defining the cube we can choose to 
order by column so the Sort operation does not need to be calculated. The whole 
calculation is a single stage without the expense of a shuffle. The calculation 
can be completed with only a few tasks therefore improving the concurrency of 
the query.&lt;/p&gt;
 
-&lt;p&gt;Optimizing parquet is also important for queries.&lt;/p&gt;
+&lt;h2 id=&quot;apache-kylin-on-hbase&quot;&gt;04 Apache Kylin on 
HBase&lt;/h2&gt;
+&lt;p&gt;In the current open source version, the built data is stored in 
HBase. From the section above, we have a logical execution plan that can query cube data. 
The Calcite framework will generate the corresponding physical 
execution plan according to this logical execution plan and, finally, each 
operator will generate its own executable code through code generation.  &lt;br 
/&gt;
+&lt;img src=&quot;/images/blog/kylin4/8 on-site-computation.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
 
-&lt;p&gt;The first principle is that we'd better always include the shard-by 
column in our filter condition: parquet files are sharded by the shard-by column, 
so filtering on it reduces the data files to read.&lt;/p&gt;
+&lt;p&gt;This process is an iterator model. Data flows from the lowest 
TableScan operator to the upstream operators. The whole process is like a 
volcanic eruption, so it is also called the Volcano Iterator Model. The code 
generated by this TableScan will fetch cube data from HBase, and when the data 
is returned to Kylin Query Server, it will be consumed layer by layer by the 
upper operators.&lt;/p&gt;
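
The iterator model can be illustrated with Python generators. This is a generic sketch of the Volcano style, not the code Calcite generates for Kylin: the operator functions and sample rows are hypothetical, and an in-memory list stands in for the cube data fetched from HBase.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def table_scan(rows: Iterable[Row]) -> Iterator[Row]:
    """Lowest operator: yields stored rows one at a time."""
    for row in rows:
        yield row


def filter_op(child: Iterator[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Pulls rows from its child and passes on only the matching ones."""
    for row in child:
        if predicate(row):
            yield row


def project_op(child: Iterator[Row], columns: List[str]) -> Iterator[Row]:
    """Pulls rows from its child and keeps only the requested columns."""
    for row in child:
        yield {column: row[column] for column in columns}


# Stand-in for precomputed cube rows.
cube_rows = [
    {"item": "apple", "m_c": 3},
    {"item": "pear", "m_c": 3},
    {"item": "plum", "m_c": 0},
]

# Each upper operator consumes the output of the operator beneath it.
plan = project_op(
    filter_op(table_scan(cube_rows), lambda row: row["m_c"] > 0),
    ["item"],
)
result = list(plan)  # pulling from the top drives the whole pipeline
print(result)
```

Nothing runs until the top of the plan is consumed, and every row passes through each layer in a single process, which is why the work concentrates on one server.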
 
-&lt;p&gt;Next, looking inside the parquet files, data within files is sorted by rowkey 
columns; that is to say, prefix match in queries is as important as in Kylin on 
HBase. When a query condition satisfies prefix match, it can filter row groups 
with the column's max/min index. Furthermore, we can reduce row group size to 
make the index granularity finer, but be aware that the compression rate will be 
lower if we set row group size smaller.&lt;/p&gt;
+&lt;h2 id=&quot;bottlenecks-with-kylin-on-hbase&quot;&gt;05 Bottlenecks with 
Kylin on HBase&lt;/h2&gt;
+&lt;p&gt;This scenario is not a big problem with simple SQL because, in the 
case of a precise matching cuboid, minimal computing will be done on Kylin 
Query Server after retrieving the data from HBase. However, for some more 
complex queries, Kylin Query Server will not only pull back a large amount of 
data from HBase but also compute very resource-intensive operations such as 
Joins and Aggregates. &lt;br /&gt;
+&lt;img src=&quot;/images/blog/kylin4/9 
diagram_of_bottleneck_on_HBase.png&quot; alt=&quot;&quot; /&gt;&lt;br /&gt;
+For example, a query joins two subqueries, each subquery hits its own cube and 
then does some more complicated aggregate operations at the outermost layer 
such as COUNT DISTINCT. When the amount of data becomes large, Kylin Query 
Server may be out of memory (OOM). The solution is to simply increase the 
memory of the Kylin Query Server.&lt;/p&gt;
 
-&lt;h4 
id=&quot;dynamic-elimination-of-partitioning-dimensions&quot;&gt;4.Dynamic 
elimination of partitioning dimensions&lt;/h4&gt;
-&lt;p&gt;Kylin4 has a new ability that older versions lack, 
which can reduce data reading and computing by dozens of times for some 
big queries. It's often the case that the partition column is used to filter data 
but not used as a group dimension. For those cases Kylin would always choose a 
cuboid with the partition column, but now it is able to use a different cuboid in 
that query to reduce IO and computing.&lt;/p&gt;
+&lt;p&gt;However, this is a vertical expansion process that becomes a 
bottleneck. We know from experience that bottlenecks in big data can be 
difficult to diagnose and can lead to the abandonment of a critical technology 
when selecting an architecture. In addition, there are many other limitations 
when using this system. For example, the operation and maintenance of HBase is 
notoriously difficult. It is safe to assume that once the performance of HBase 
is not good, the performance of Kylin will also suffer.&lt;/p&gt;
 
-&lt;p&gt;The key of this optimization is to split a query into two parts: one 
part uses all of a segment's data, so the partition column doesn't have 
to be included in the cuboid; the other part, which uses only part of a segment's data, 
chooses a cuboid with the partition dimension to filter the data.&lt;/p&gt;
-
-&lt;p&gt;We have tested that in some situations the response time reduced from 
20s to 6s, 10s to 3s.&lt;/p&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/8 
Dynamic_elimination_of_partitioning_dimensions.png&quot; alt=&quot;&quot; 
/&gt;&lt;/p&gt;
-
-&lt;h3 id=&quot;optimization-of-build-engine&quot;&gt;Optimization of build 
engine&lt;/h3&gt;
-&lt;p&gt;#### 1.cache parent dataset&lt;br /&gt;
-&lt;img src=&quot;/images/blog/youzan/9 cache_parent_dataset.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+&lt;p&gt;The resource isolation capabilities of HBase are also relatively 
weak. When there is a large load at a given moment, other applications using 
HBase will also be affected. This may cause Kylin to have unstable query 
performance which can be difficult to troubleshoot. All data stored in HBase 
are encoded Byte Array types and the overhead of serialization and 
deserialization cannot be ignored.&lt;/p&gt;
 
-&lt;p&gt;Kylin build cube layer by layer. For a parent layer with multi 
cuboids to build, we can choose to cache parent dataset by setting 
kylin.engine.spark.parent-dataset.max.persist.count to a number greater than 0. 
But notice that if you set this value too small, it will affect the parallelism 
of build job, as the build granularity is at cuboid level.&lt;/p&gt;
+&lt;h2 id=&quot;apache-kylin-with-spark--parquet&quot;&gt;06 Apache Kylin with 
Spark + Parquet&lt;/h2&gt;
+&lt;p&gt;Due to the limitations of the Kylin-on-HBase solution mentioned 
above, Kyligence has developed a new generation of Spark + Parquet-based 
solutions for the commercial version of Kylin. This was done early on to update 
and enhance the open source software solution for enterprise use.&lt;/p&gt;
 
-&lt;h2 id=&quot;practice-of-kylin-4-in-youzan&quot;&gt;04 Practice of Kylin 4 
in Youzan&lt;/h2&gt;
-&lt;p&gt;After introducing Youzan's experience of performance optimization, 
let's share the results: Kylin 4's practice in Youzan, 
including the upgrade process and the performance of the online system.&lt;/p&gt;
+&lt;p&gt;The following is an introduction to the overall framework of this new 
system. &lt;br /&gt;
+&lt;img src=&quot;/images/blog/kylin4/10 spark_parquet_solution.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
 
-&lt;h3 id=&quot;upgrade-metadata-to-adapt-to-kylin-4&quot;&gt;Upgrade metadata 
to adapt to Kylin 4&lt;/h3&gt;
-&lt;p&gt;First of all, for Kylin 3 metadata, which is stored in HBase, we have 
developed a tool for seamless upgrading of metadata. 
We export the metadata in HBase into 
local files, and then use the tool to transform it and write the new metadata 
into MySQL. We also updated the operation documents and general principles in 
the official wiki of Apache Kylin. For more details, you can refer to: &lt;a 
href=&quot;https://wiki.apache.org/confluence/display/KYLIN/How+to+migrate+metadata+to+Kylin+4&quot;&gt;How
 to migrate metadata to Kylin 4&lt;/a&gt;.&lt;/p&gt;
+&lt;p&gt;In fact, the new design is very simple. The visitor pattern is used to 
traverse the previously generated logical execution plan tree that can query 
cube data. Each node of the execution plan tree represents an operator, which 
actually stores nothing more than some information such as which table to scan, 
which columns to filter/project, etc. Each operator in the original tree is 
translated into a Spark operation on a DataFrame (DF). Each upstream node asks its 
downstream node for a DF, all the way down to the most downstream TableScan node, 
which generates the initial DF; this can be simply 
understood as cuboidDF = spark.read.parquet(path). The TableScan node then 
returns the initial DF to its upstream. The upstream node applies its own operation on 
the downstream DF and returns to its upstream. Finally, the top node collects 
the DF to trigger the whole calculation process.&lt;/p&gt;
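
The traversal above can be sketched without a Spark installation. Everything in this example is hypothetical: `ListDF` is a tiny lazy stand-in for a Spark DataFrame, `FAKE_STORAGE` replaces Parquet files, and `PlanNode` and `to_df` are illustrative names, not Kylin classes. The scan branch plays the role of cuboidDF = spark.read.parquet(path).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]


@dataclass
class PlanNode:
    op: str                            # "scan", "filter", or "project"
    arg: object                        # path, predicate, or column list
    child: Optional["PlanNode"] = None


class ListDF:
    """Hypothetical stand-in for a DataFrame: operations are recorded lazily."""

    def __init__(self, source: Callable[[], List[Row]]):
        self._source = source

    def filter(self, predicate: Callable[[Row], bool]) -> "ListDF":
        return ListDF(lambda: [r for r in self._source() if predicate(r)])

    def select(self, columns: List[str]) -> "ListDF":
        return ListDF(lambda: [{c: r[c] for c in columns} for r in self._source()])

    def collect(self) -> List[Row]:
        return self._source()          # only here is anything computed


# Stand-in for cuboid files on storage (made-up path and rows).
FAKE_STORAGE = {
    "/cube/item_cuboid": [{"item": "apple", "m_c": 3}, {"item": "pear", "m_c": 1}],
}


def to_df(node: PlanNode) -> ListDF:
    """Ask the downstream node for its DF, then apply this node's operation."""
    if node.op == "scan":
        return ListDF(lambda: list(FAKE_STORAGE[node.arg]))
    child_df = to_df(node.child)
    if node.op == "filter":
        return child_df.filter(node.arg)
    if node.op == "project":
        return child_df.select(node.arg)
    raise ValueError(f"unsupported operator: {node.op}")


plan = PlanNode("project", ["item"],
                PlanNode("filter", lambda r: r["m_c"] >= 3,
                         PlanNode("scan", "/cube/item_cuboid")))
rows = to_df(plan).collect()           # the top node triggers the computation
print(rows)
```

The translation itself does no work on the data; it only composes operations, which is what lets a distributed engine decide later how to execute them.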
 
-&lt;p&gt;Let's give a general introduction to compatibility in the 
whole process. The project metadata, table metadata, permission-related 
metadata, and model metadata do not need to be modified. What needs to be modified 
is the cube metadata, including the storage type and query type used by the Cube. 
After updating these two fields, you need to recalculate the Cube signature. 
This signature is designed internally by Kylin to avoid 
problems caused by modifying a Cube after it has been defined.&lt;/p&gt;
+&lt;h2 id=&quot;advantages-of-the-sparkparquet-architecture&quot;&gt;07 
Advantages of the Spark/Parquet Architecture&lt;/h2&gt;
+&lt;p&gt;This Kylin on Parquet plan relies on Spark. All calculations are 
distributed and there is no single point where performance can bottleneck. The 
computing power of the system can be improved through horizontal expansion 
(scale-out). There are various schemes for resource scheduling such as Yarn, 
K8S, or Mesos to meet the needs of enterprises for resource isolation. 
Kylin also benefits naturally from Spark's performance work. The overhead of 
serialization and deserialization of Kylin on HBase mentioned above can be 
reduced by Spark's Tungsten project.  &lt;br /&gt;
+&lt;img src=&quot;/images/blog/kylin4/11 spark_parquet_architecture.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
 
-&lt;h3 
id=&quot;performance-of-kylin-4-on-youzan-online-system&quot;&gt;Performance of 
Kylin 4 on Youzan online system&lt;/h3&gt;
-&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/10 commodity_insight.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+&lt;p&gt;Reducing the dependence upon HBase simplifies operation and 
maintenance. All upstream and downstream dependencies can be handled by Spark 
for us, reducing our dependence and facilitating cloud access.&lt;/p&gt;
 
-&lt;p&gt;After the migration of metadata to Kylin4, let's share the 
qualitative changes and substantial performance improvements seen in 
some promising scenarios. First of all, in a scenario like Commodity 
Insight, there is a large store with several hundred thousand commodities. 
We have to analyze its transactions, traffic, etc. There are more than a 
dozen precise count distinct measures in a single cube. A precise count 
distinct measure is actually very inefficient if it is not optimized through 
pre-calculation and Bitmap. Kylin currently uses Bitmap to support precise 
count distinct measures. In a scenario that requires complex queries to sort 
hundreds of thousands of commodities by various UV (precise count distinct) 
measures, the RT of Kylin 2 is 27 seconds, while the RT of Kylin 4 is 
less than 2 seconds.&lt;/p&gt;
+&lt;p&gt;For developers, the DF generated by each operator can be collected 
directly to observe whether there is any problem with the data at this level, 
and Spark + Parquet is currently a very popular SQL on Hadoop scheme. The open 
source committers at Kyligence are also familiar with these two projects and 
maintain their own Spark and Parquet branch. A lot of performance optimization 
and stability improvements have been done in this area for our specific 
scenarios.&lt;/p&gt;
 
-&lt;p&gt;What I find most appealing about Kylin 4 is that it's like a 
manual transmission car: you can control its query concurrency at will, 
whereas you can't freely change query concurrency in Kylin on HBase, because 
its concurrency is completely tied to the number of regions.&lt;/p&gt;
-
-&lt;h3 id=&quot;plan-for-kylin-4-in-youzan&quot;&gt;Plan for Kylin 4 in 
Youzan&lt;/h3&gt;
-&lt;p&gt;We have tested thoroughly, fixed several bugs and improved Apache Kylin 4 
for several months. Now we are migrating cubes from the older version to the newer 
version. For the cubes already migrated to Kylin 4, small-query 
performance meets our expectations, and complex query and build performance 
came as a pleasant surprise. We are planning to migrate all cubes from the older 
version to Kylin 4.&lt;/p&gt;
+&lt;h2 id=&quot;summary&quot;&gt;08 Summary&lt;/h2&gt;
+&lt;p&gt;Apache Kylin has over 1,000 users worldwide. But, in order for the 
project to ensure its future position as a vital, Cloud-Native technology for 
enterprise analytics, the Kylin community must periodically evaluate and update 
the key architectural assumptions being made to accomplish that goal. The 
removal of legacy connections to the Hadoop ecosystem in favor of Spark and 
Parquet is an important next step to realizing the dream of pervasive analytics 
based on open source technology for organizations of all sizes around the 
world.&lt;/p&gt;
 </description>
-        <pubDate>Thu, 17 Jun 2021 08:00:00 -0700</pubDate>
-        
<link>http://kylin.apache.org/blog/2021/06/17/Why-did-Youzan-choose-Kylin4/</link>
-        <guid 
isPermaLink="true">http://kylin.apache.org/blog/2021/06/17/Why-did-Youzan-choose-Kylin4/</guid>
+        <pubDate>Fri, 02 Jul 2021 08:00:00 -0700</pubDate>
+        
<link>http://kylin.apache.org/blog/2021/07/02/Apache-Kylin4-A-new-storage-and-compute-architecture/</link>
+        <guid 
isPermaLink="true">http://kylin.apache.org/blog/2021/07/02/Apache-Kylin4-A-new-storage-and-compute-architecture/</guid>
         
         
         <category>blog</category>
@@ -332,6 +279,155 @@ Here is a brief introduction to the prin
       </item>
     
       <item>
+        <title>Why did Youzan choose Kylin4</title>
+        <description>&lt;p&gt;At the QCon Global Software Developers 
Conference held on May 29, 2021, Zheng Shengjun, head of Youzan's data 
infrastructure platform, shared Youzan's internal experience with and 
optimization of Kylin 4.0 in the session on open source big data 
frameworks and applications. &lt;br /&gt;
+For many users of Kylin 2/3 (Kylin on HBase), this is also a chance to learn how 
and why to upgrade to Kylin 4.&lt;/p&gt;
+
+&lt;p&gt;This sharing is mainly divided into the following parts:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;The reason for choosing Kylin 4&lt;/li&gt;
+  &lt;li&gt;Introduction to Kylin 4&lt;/li&gt;
+  &lt;li&gt;How to optimize performance of Kylin 4&lt;/li&gt;
+  &lt;li&gt;Practice of Kylin 4 in Youzan&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;h2 id=&quot;the-reason-for-choosing-kylin-4&quot;&gt;01 The reason for 
choosing Kylin 4&lt;/h2&gt;
+
+&lt;h3 id=&quot;introduction-to-youzan&quot;&gt;Introduction to 
Youzan&lt;/h3&gt;
+&lt;p&gt;China Youzan Co., Ltd (stock code 08083.HK). is an enterprise mainly 
engaged in retail technology services.&lt;br /&gt;
+At present, it owns several tools and solutions to provide SaaS software 
products and talent services to help merchants operate mobile social e-commerce 
and new retail channels in an all-round way. &lt;br /&gt;
+Currently Youzan has hundreds of millions of consumers and 6 million existing 
merchants.&lt;/p&gt;
+
+&lt;h3 id=&quot;history-of-kylin-in-youzan&quot;&gt;History of Kylin in 
Youzan&lt;/h3&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/1 
history_of_youzan_OLAP.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;First of all, I would like to share why Youzan chose to upgrade to 
Kylin 4. Here, let me briefly review the history of Youzan OLAP 
infra.&lt;/p&gt;
+
+&lt;p&gt;In the early days of Youzan, in order to iterate on the development 
process quickly, we chose pre-computation + MySQL; in 2018, Druid was 
introduced for its query flexibility and development efficiency, but it had 
problems such as a low degree of pre-aggregation and no support for precise 
count distinct measures. In this situation, Youzan introduced Apache Kylin and 
ClickHouse. Kylin supports high aggregation, precise count distinct measures 
and the lowest RT, while ClickHouse is quite flexible in usage (ad hoc 
query).&lt;/p&gt;
+
+&lt;p&gt;From the introduction of Kylin in 2018 to now, Youzan has used Kylin 
for more than three years. With the continuous enrichment of business scenarios 
and the continuous accumulation of data volume, Youzan currently has 6 million 
existing merchants, GMV in 2020 is 107.3 billion, and the daily build data 
volume is 10 billion +. At present, Kylin has basically covered all the 
business scenarios of Youzan.&lt;/p&gt;
+
+&lt;h3 id=&quot;the-challenges-of-kylin-3&quot;&gt;The challenges of Kylin 
3&lt;/h3&gt;
+&lt;p&gt;With Youzan’s rapid development and in-depth use of Kylin, we also 
encountered some challenges:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;First of all, the build performance of Kylin on HBase cannot meet 
our expectations, and build performance affects the user’s 
failure recovery time and stability experience;&lt;/li&gt;
+  &lt;li&gt;Secondly, with the access of more large merchants (tens of 
millions of members in a single store, with hundreds of thousands of goods for 
each store), it also brings great challenges to our OLAP system. Kylin on HBase 
is limited by the single-point query of Query Server, and cannot support these 
complex scenarios well;&lt;/li&gt;
+  &lt;li&gt;Finally, because HBase is not a cloud-native system, it is 
difficult to scale up and down flexibly. With the continuous 
growth of data volume, business load on this system has peaks and valleys, so 
the average resource utilization rate is not high enough.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;Faced with these challenges, Youzan chose to move closer and upgrade 
to the more cloud-native Apache Kylin 4.&lt;/p&gt;
+
+&lt;h2 id=&quot;introduction-to-kylin-4&quot;&gt;02 Introduction to Kylin 
4&lt;/h2&gt;
+&lt;p&gt;First of all, let’s introduce the main advantages of Kylin 4. 
Apache Kylin 4 completely depends on Spark for cubing jobs and queries. It can 
make full use of Spark’s parallelization, vectorization, and global 
dynamic code generation technologies to improve the efficiency of large 
queries.&lt;br /&gt;
+Here is a brief introduction to the principle of Kylin 4, that is storage 
engine, build engine and query engine.&lt;/p&gt;
+
+&lt;h3 id=&quot;storage-engine&quot;&gt;Storage engine&lt;/h3&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/2 kylin4_storage.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;First of all, let’s take a look at the new storage engine by 
comparing Kylin on HBase with Kylin on Parquet. The cuboid data of 
Kylin on HBase is stored in HBase tables. A single segment corresponds to 
one HBase table. Aggregation is pushed down to the HBase coprocessor.&lt;/p&gt;
+
+&lt;p&gt;But as we know, HBase is not a true columnar store, and its 
throughput is not enough for an OLAP system. Kylin 4 replaces HBase with Parquet, 
so all the data is stored in files. Each segment has a corresponding HDFS 
directory. All queries and cubing jobs read and write files without HBase. 
Although there is a certain loss of performance for simple queries, the 
improvement for complex queries is more considerable and 
worthwhile.&lt;/p&gt;
+
+&lt;h3 id=&quot;build-engine&quot;&gt;Build engine&lt;/h3&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/3 kylin4_build_engine.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;The second is the new build engine. Based on our test, the build 
time of Kylin on Parquet dropped from 82 minutes to 15 minutes. 
There are several reasons:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Kylin 4 removes the encoding of the dimension, eliminating a 
building step of encoding;&lt;/li&gt;
+  &lt;li&gt;Removed the HBase File generation step;&lt;/li&gt;
+  &lt;li&gt;Kylin on Parquet changes the granularity of cubing to cuboid 
level, which is conducive to further improving parallelism of cubing 
job.&lt;/li&gt;
+  &lt;li&gt;Enhanced implementation for global dictionary. In the new 
algorithm, dictionary and source data are hashed into the same buckets, making 
it possible to load only a piece of the dictionary (one bucket) to encode source 
data.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;As you can see on the right, after upgrading to Kylin 4, the cubing job 
changes from ten steps to two steps, and the build performance improvement 
is very obvious.&lt;/p&gt;
+
+&lt;h3 id=&quot;query-engine&quot;&gt;Query engine&lt;/h3&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/4 kylin4_query.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Next is the new query engine of Kylin 4. As you can see, the 
calculation of Kylin on HBase is completely dependent on the coprocessor of 
HBase and the query server process. When the data is read from HBase into the query 
server to do aggregation, sorting, etc., the query is bottlenecked by 
the single point of the query server. Kylin 4, by contrast, uses a fully 
distributed query mechanism based on Spark; what’s more, it’s able to do 
configuration tuning automatically in the Spark query step!&lt;/p&gt;
+
+&lt;h2 id=&quot;how-to-optimize-performance-of-kylin-4&quot;&gt;03 How to 
optimize performance of Kylin 4&lt;/h2&gt;
+&lt;p&gt;Next, I’d like to share some performance optimizations made by 
Youzan in Kylin 4.&lt;/p&gt;
+
+&lt;h3 id=&quot;optimization-of-query-engine&quot;&gt;Optimization of query 
engine&lt;/h3&gt;
+&lt;h4 id=&quot;cache-calcite-physical-plan&quot;&gt;1.Cache Calcite physical plan&lt;/h4&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/5 cache_calcite_plan.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;In Kylin4, SQL is analyzed, optimized and compiled (code generation) in 
Calcite. This step takes about 150ms for some queries. We have supported 
PreparedStatementCache in Kylin4 to cache the Calcite plan, so that queries with 
the same structure don’t have to repeat this step. This optimization saves 
about 150ms for such queries.&lt;/p&gt;
+
+&lt;h4 id=&quot;tunning-spark-configuration&quot;&gt;2.Tuning spark 
configuration&lt;/h4&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/6 
tuning_spark_configuration.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Kylin4 uses Spark as its query engine. As Spark is a distributed engine 
designed for massive data processing, it’s inevitable to lose some 
performance for small queries. We have tried to do some tuning to catch up with 
the latency in Kylin on HBase for small queries.&lt;/p&gt;
+
+&lt;p&gt;Our first optimization is to make more calculations finish in memory. 
The key is to avoid data spill during aggregation, shuffle and sort. Tuning the 
following configuration is helpful.&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;1.set &lt;code 
class=&quot;highlighter-rouge&quot;&gt;spark.sql.objectHashAggregate.sortBased.fallbackThreshold&lt;/code&gt;
 to a larger value to keep HashAggregate from falling back to sort-based aggregation, 
which really hurts performance when it happens.&lt;/li&gt;
+  &lt;li&gt;2.set &lt;code 
class=&quot;highlighter-rouge&quot;&gt;spark.shuffle.spill.initialMemoryThreshold&lt;/code&gt;
 to a large value to avoid too many spills during shuffle.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;Secondly, we route small queries to a Query Server which runs Spark in 
local mode, because the overhead of task scheduling, shuffle read and variable 
broadcast is magnified for small queries in YARN/Standalone mode.&lt;/p&gt;
+
+&lt;p&gt;Thirdly, we use a RAM disk to enhance shuffle performance. Mount a RAM 
disk as tmpfs and set spark.local.dir to a directory on the RAM disk.&lt;/p&gt;
+
+&lt;p&gt;Lastly, we disabled Spark’s whole-stage code generation for small 
queries, because Spark’s whole-stage code generation costs about 100ms~200ms 
and it’s not beneficial to small queries that are a simple 
projection.&lt;/p&gt;
+
+&lt;h4 id=&quot;parquet-optimization&quot;&gt;3.Parquet optimization&lt;/h4&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/7 
parquet_optimization.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Optimizing parquet is also important for queries.&lt;/p&gt;
+
+&lt;p&gt;The first principle is that we’d better always include the shard-by 
column in our filter condition: Parquet files are sharded by the shard-by column, 
so filtering on the shard-by column reduces the data files to read.&lt;/p&gt;
+
+&lt;p&gt;Then, looking into the Parquet files, data within files is sorted by rowkey 
columns; that is to say, prefix match in a query is as important as in Kylin on 
HBase. When a query condition satisfies prefix match, it can filter row groups 
with the column’s max/min index. Furthermore, we can reduce the row group size to 
make the index granularity finer, but be aware that the compression rate will be 
lower if we set the row group size smaller.&lt;/p&gt;
+
+&lt;h4 
id=&quot;dynamic-elimination-of-partitioning-dimensions&quot;&gt;4.Dynamic 
elimination of partitioning dimensions&lt;/h4&gt;
+&lt;p&gt;Kylin4 has a new ability that older versions lack, 
which can reduce data reading and computing by dozens of times for some 
big queries. It’s often the case that the partition column is used to filter data 
but not used as a group-by dimension. For those cases Kylin would previously always 
choose a cuboid with the partition column, but now it is able to use a different 
cuboid in that query to reduce IO and computing.&lt;/p&gt;
+
+&lt;p&gt;The key of this optimization is to split a query into two parts: one 
part covers segments whose data is used in full, so the partition column doesn’t have 
to be included in the cuboid; the other part, which uses only part of a segment’s data, 
chooses a cuboid with the partition dimension to do the data filter.&lt;/p&gt;
+
+&lt;p&gt;In our tests, in some situations the response time dropped from 
20s to 6s, and from 10s to 3s.&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/8 
Dynamic_elimination_of_partitioning_dimensions.png&quot; alt=&quot;&quot; 
/&gt;&lt;/p&gt;
+
+&lt;h3 id=&quot;optimization-of-build-engine&quot;&gt;Optimization of build 
engine&lt;/h3&gt;
+&lt;h4 id=&quot;cache-parent-dataset&quot;&gt;1.cache parent dataset&lt;/h4&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/9 cache_parent_dataset.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Kylin builds a cube layer by layer. For a parent layer with multiple 
cuboids to build, we can choose to cache the parent dataset by setting 
kylin.engine.spark.parent-dataset.max.persist.count to a number greater than 0. 
But notice that if you set this value too small, it will affect the parallelism 
of the build job, as the build granularity is at cuboid level.&lt;/p&gt;
+
+&lt;h2 id=&quot;practice-of-kylin-4-in-youzan&quot;&gt;04 Practice of Kylin 4 
in Youzan&lt;/h2&gt;
+&lt;p&gt;After introducing Youzan’s experience of performance optimization, 
let’s share the optimization effect, that is, Kylin 4’s practice in Youzan, 
including the upgrade process and the performance of the online system.&lt;/p&gt;
+
+&lt;h3 id=&quot;upgrade-metadata-to-adapt-to-kylin-4&quot;&gt;Upgrade metadata 
to adapt to Kylin 4&lt;/h3&gt;
+&lt;p&gt;First of all, the metadata of Kylin 3 (Kylin on HBase) is stored in 
HBase, and we have developed a tool for seamless upgrading of metadata. 
We export the metadata in HBase into 
local files, and then use tools to transform and write back the new metadata 
into MySQL. We also updated the operation documents and general principles in 
the official wiki of Apache Kylin. For more details, you can refer to: &lt;a 
href=&quot;https://wiki.apache.org/confluence/display/KYLIN/How+to+migrate+metadata+to+Kylin+4&quot;&gt;How
 to migrate metadata to Kylin 4&lt;/a&gt;.&lt;/p&gt;
+
+&lt;p&gt;Let’s give a general introduction to compatibility in the 
whole process. The project metadata, tables metadata, permission-related 
metadata, and model metadata do not need to be modified. What needs to be modified 
is the cube metadata, including the type of storage and query used by Cube. 
After updating these two fields, you need to recalculate the Cube signature. 
This signature is designed internally by Kylin to avoid some 
problems caused by modifying a Cube after it is determined.&lt;/p&gt;
+
+&lt;h3 
id=&quot;performance-of-kylin-4-on-youzan-online-system&quot;&gt;Performance of 
Kylin 4 on Youzan online system&lt;/h3&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/10 commodity_insight.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;After the migration of metadata to Kylin4, let’s share the 
qualitative changes and substantial performance improvements in 
some of the promising scenarios. First of all, in a scenario like Commodity 
Insight, there is a large store with several hundred thousand commodities. 
We have to analyze its transactions and traffic, etc. There are more than a 
dozen precise count distinct measures in a single cube. A precise count 
distinct measure is actually very inefficient if it is not optimized through 
pre-calculation and Bitmap. Kylin currently uses Bitmap to support precise 
count distinct measures. In a scenario that requires complex queries to sort 
hundreds of thousands of commodities by various UV (precise count distinct) 
measures, the RT of Kylin 2 is 27 seconds, while the RT of Kylin 4 is 
less than 2 seconds.&lt;/p&gt;
+
+&lt;p&gt;What I find most appealing about Kylin 4 is that it’s like a 
manual transmission car: you can control its query concurrency at will, 
whereas you can’t change query concurrency freely in Kylin on HBase, because 
its concurrency is completely tied to the number of regions.&lt;/p&gt;
+
+&lt;h3 id=&quot;plan-for-kylin-4-in-youzan&quot;&gt;Plan for Kylin 4 in 
Youzan&lt;/h3&gt;
+&lt;p&gt;We have made full test, fixed several bugs and improved apache KYLIN4 
for several months. Now we are migrating cubes from older version to newer 
version. For the cubes already migrated to KYLIN4, its small queries’ 
performance meet our expectations, its complex query and build performance did 
bring us a big surprise. We are planning to migrate all cubes from older 
version to Kylin4.&lt;/p&gt;
+</description>
+        <pubDate>Thu, 17 Jun 2021 08:00:00 -0700</pubDate>
+        
<link>http://kylin.apache.org/blog/2021/06/17/Why-did-Youzan-choose-Kylin4/</link>
+        <guid 
isPermaLink="true">http://kylin.apache.org/blog/2021/06/17/Why-did-Youzan-choose-Kylin4/</guid>
+        
+        
+        <category>blog</category>
+        
+      </item>
+    
+      <item>
         <title>You are just one Kylin + Davinci away from a cool visualization dashboard</title>
         <description>&lt;p&gt;Kylin integrates with BI tools such as 
Tableau, PowerBI/Excel, MSTR, QlikSense, Hue and 
SuperSet. As far as visualization tools go, however, Davinci’s 
good interactivity and personalized dashboard presentation mean that combining it with Kylin 
gives most users a better visual analytics experience.&lt;/p&gt;
 
@@ -1647,61 +1743,5 @@ Security: (depend on your security setti
         
       </item>
     
-      <item>
-        <title>Apache Kylin v3.0.0-alpha Release Announcement</title>
-        <description>&lt;p&gt;The Apache Kylin community is pleased to 
announce the release of Apache Kylin v3.0.0-alpha.&lt;/p&gt;
-
-&lt;p&gt;Apache Kylin is an open source Distributed Analytics Engine designed 
to provide SQL interface and multi-dimensional analysis (OLAP) on Big Data 
supporting extremely large datasets.&lt;/p&gt;
-
-&lt;p&gt;This is the first release of the new generation v3.x, the main 
feature introduced is the Real-time OLAP. All of the changes can be found in 
the &lt;a href=&quot;/docs/release_notes.html&quot;&gt;release 
notes&lt;/a&gt;. Here we just highlight the main features.&lt;/p&gt;
-
-&lt;h1 id=&quot;important-features&quot;&gt;Important features&lt;/h1&gt;
-
-&lt;h3 id=&quot;kylin-3654---real-time-olap&quot;&gt;KYLIN-3654 - Real-time 
OLAP&lt;/h3&gt;
-&lt;p&gt;With the newly introduced Kylin real-time receiver and coordinator 
components, Kylin can implement a millisecond-level data preparation delay for 
streaming data from sources like Apache Kafka. This means since v3.0 on,  Kylin 
can support sub-second level OLAP over historical batch data, near real-time 
streaming as well as real-time streaming. The user can use one OLAP platform to 
serve different scenarios. This solution has been deployed and verified in 
early adopters like eBay since 2018. For how to enable it, please refer to 
&lt;a href=&quot;/docs30/tutorial/realtime_olap.html&quot;&gt;this 
tutorial&lt;/a&gt;.&lt;/p&gt;
-
-&lt;h3 
id=&quot;kylin-3795---submit-spark-jobs-via-apache-livy&quot;&gt;KYLIN-3795 - 
Submit Spark jobs via Apache Livy&lt;/h3&gt;
-&lt;p&gt;This feature allows the administrator to configure Kylin to integrate 
with Apache Livy (incubating) for Spark job submissions. The Spark job is 
submitted to the Livy Server through Livy’s REST API, instead of starting the 
Spark Driver process in local, which facilitates the management and monitoring 
of the Spark resources, and also releases the pressure of the nodes where the 
Kylin job server is running.&lt;/p&gt;
-
-&lt;h3 id=&quot;kylin-3820---a-curator-based-job-scheduler&quot;&gt;KYLIN-3820 
- A curator-based job scheduler&lt;/h3&gt;
-&lt;p&gt;A new job scheduler is added to automatically discover the Kylin 
nodes and do an automatic leader selection among them (only the leader will 
submit jobs). With this feature, you can easily deploy and scale out Kylin 
nodes without manually update the node address in &lt;code 
class=&quot;highlighter-rouge&quot;&gt;kylin.properties&lt;/code&gt; and 
restart Kylin to take effective.&lt;/p&gt;
-
-&lt;h1 id=&quot;other-enhancements&quot;&gt;Other enhancements&lt;/h1&gt;
-
-&lt;h3 
id=&quot;kylin-3716---fastthreadlocal-replaces-threadlocal&quot;&gt;KYLIN-3716 
- FastThreadLocal replaces ThreadLocal&lt;/h3&gt;
-&lt;p&gt;Using FastThreadLocal instead of ThreadLocal can improve Kylin’s 
overall performance to some extent.&lt;/p&gt;
-
-&lt;h3 
id=&quot;kylin-3867---enable-jdbc-to-use-key-store--trust-store-for-https-connection&quot;&gt;KYLIN-3867
 - Enable JDBC to use key store &amp;amp; trust store for https 
connection&lt;/h3&gt;
-&lt;p&gt;By using HTTPS, the authentication information used by JDBC is 
protected, making Kylin more secure.&lt;/p&gt;
-
-&lt;h3 
id=&quot;kylin-3905---enable-shrunken-dictionary-default&quot;&gt;KYLIN-3905 - 
Enable shrunken dictionary default&lt;/h3&gt;
-&lt;p&gt;By default, the shrunken dictionary is enabled, and the precise 
counting scene for high cardinal dimensions can significantly reduce the build 
time.&lt;/p&gt;
-
-&lt;h3 
id=&quot;kylin-3839---storage-clean-up-after-the-refreshing-and-deleting-a-segment&quot;&gt;KYLIN-3839
 - Storage clean up after the refreshing and deleting a segment&lt;/h3&gt;
-&lt;p&gt;Clear unnecessary data files in a timely manner&lt;/p&gt;
-
-&lt;p&gt;&lt;strong&gt;Download&lt;/strong&gt;&lt;/p&gt;
-
-&lt;p&gt;To download Apache Kylin v3.0.0-alpha source code or binary package, 
visit the &lt;a 
href=&quot;http://kylin.apache.org/download&quot;&gt;download&lt;/a&gt; 
page.&lt;/p&gt;
-
-&lt;p&gt;&lt;strong&gt;Upgrade&lt;/strong&gt;&lt;/p&gt;
-
-&lt;p&gt;Follow the &lt;a 
href=&quot;/docs/howto/howto_upgrade.html&quot;&gt;upgrade 
guide&lt;/a&gt;.&lt;/p&gt;
-
-&lt;p&gt;&lt;strong&gt;Feedback&lt;/strong&gt;&lt;/p&gt;
-
-&lt;p&gt;If you face issue or question, please send mail to Apache Kylin dev 
or user mailing list: d...@kylin.apache.org , u...@kylin.apache.org; Before 
sending, please make sure you have subscribed the mailing list by dropping an 
email to dev-subscr...@kylin.apache.org or 
user-subscr...@kylin.apache.org.&lt;/p&gt;
-
-&lt;p&gt;&lt;em&gt;Great thanks to everyone who 
contributed!&lt;/em&gt;&lt;/p&gt;
-</description>
-        <pubDate>Fri, 19 Apr 2019 13:00:00 -0700</pubDate>
-        
<link>http://kylin.apache.org/blog/2019/04/19/release-v3.0.0-alpha/</link>
-        <guid 
isPermaLink="true">http://kylin.apache.org/blog/2019/04/19/release-v3.0.0-alpha/</guid>
-        
-        
-        <category>blog</category>
-        
-      </item>
-    
   </channel>
 </rss>


Added: kylin/site/images/blog/kylin4/1 apache_kylin_introduction.png
URL: 
http://svn.apache.org/viewvc/kylin/site/images/blog/kylin4/1%20apache_kylin_introduction.png?rev=1891303&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/kylin4/1 apache_kylin_introduction.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/kylin4/10 spark_parquet_solution.png
URL: 
http://svn.apache.org/viewvc/kylin/site/images/blog/kylin4/10%20spark_parquet_solution.png?rev=1891303&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/kylin4/10 spark_parquet_solution.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/kylin4/11 spark_parquet_architecture.png
URL: 
http://svn.apache.org/viewvc/kylin/site/images/blog/kylin4/11%20spark_parquet_architecture.png?rev=1891303&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/kylin4/11 spark_parquet_architecture.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/kylin4/2 cube_vs_cuboid.png
URL: 
http://svn.apache.org/viewvc/kylin/site/images/blog/kylin4/2%20cube_vs_cuboid.png?rev=1891303&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/kylin4/2 cube_vs_cuboid.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/kylin4/3 cuboid_selected_for_query.png
URL: 
http://svn.apache.org/viewvc/kylin/site/images/blog/kylin4/3%20cuboid_selected_for_query.png?rev=1891303&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/kylin4/3 cuboid_selected_for_query.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/kylin4/4 apache_kylin_query_process.png
URL: 
http://svn.apache.org/viewvc/kylin/site/images/blog/kylin4/4%20apache_kylin_query_process.png?rev=1891303&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/kylin4/4 apache_kylin_query_process.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/kylin4/5 query_using_precomputed_data.png
URL: 
http://svn.apache.org/viewvc/kylin/site/images/blog/kylin4/5%20query_using_precomputed_data.png?rev=1891303&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/kylin4/5 query_using_precomputed_data.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/kylin4/6 on-site-computation.png
URL: 
http://svn.apache.org/viewvc/kylin/site/images/blog/kylin4/6%20on-site-computation.png?rev=1891303&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/kylin4/6 on-site-computation.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/kylin4/7 precomputation_using_aggregated_data.png
URL: 
http://svn.apache.org/viewvc/kylin/site/images/blog/kylin4/7%20precomputation_using_aggregated_data.png?rev=1891303&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/kylin4/7 
precomputation_using_aggregated_data.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/kylin4/8 diagram_of_calcite_executions.png
URL: 
http://svn.apache.org/viewvc/kylin/site/images/blog/kylin4/8%20diagram_of_calcite_executions.png?rev=1891303&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/kylin4/8 diagram_of_calcite_executions.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/kylin4/9 diagram_of_bottleneck_on_HBase.png
URL: 
http://svn.apache.org/viewvc/kylin/site/images/blog/kylin4/9%20diagram_of_bottleneck_on_HBase.png?rev=1891303&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/kylin4/9 diagram_of_bottleneck_on_HBase.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/develop40/debug_tomcat_config.png
URL: 
http://svn.apache.org/viewvc/kylin/site/images/develop40/debug_tomcat_config.png?rev=1891303&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/develop40/debug_tomcat_config.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/tutorial/4.0/overview/build_duration_ssb.png
URL: 
http://svn.apache.org/viewvc/kylin/site/images/tutorial/4.0/overview/build_duration_ssb.png?rev=1891303&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/tutorial/4.0/overview/build_duration_ssb.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/tutorial/4.0/overview/query_response_ssb.png
URL: 
http://svn.apache.org/viewvc/kylin/site/images/tutorial/4.0/overview/query_response_ssb.png?rev=1891303&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/tutorial/4.0/overview/query_response_ssb.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/tutorial/4.0/overview/query_response_tpch.png
URL: 
http://svn.apache.org/viewvc/kylin/site/images/tutorial/4.0/overview/query_response_tpch.png?rev=1891303&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/tutorial/4.0/overview/query_response_tpch.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/tutorial/4.0/overview/result_size_ssb.png
URL: 
http://svn.apache.org/viewvc/kylin/site/images/tutorial/4.0/overview/result_size_ssb.png?rev=1891303&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/tutorial/4.0/overview/result_size_ssb.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/website.iml
URL: http://svn.apache.org/viewvc/kylin/site/website.iml?rev=1891303&view=auto
==============================================================================
--- kylin/site/website.iml (added)
+++ kylin/site/website.iml Tue Jul  6 07:50:56 2021
@@ -0,0 +1,9 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<module type="WEB_MODULE" version="4">
+  <component name="NewModuleRootManager" inherit-compiler-output="true">
+    <exclude-output />
+    <content url="file://$MODULE_DIR$" />
+    <orderEntry type="inheritedJdk" />
+    <orderEntry type="sourceFolder" forTests="false" />
+  </component>
+</module>
\ No newline at end of file
