Modified: kylin/site/feed.xml URL: http://svn.apache.org/viewvc/kylin/site/feed.xml?rev=1894464&r1=1894463&r2=1894464&view=diff ============================================================================== --- kylin/site/feed.xml (original) +++ kylin/site/feed.xml Fri Oct 22 05:11:34 2021 @@ -19,11 +19,162 @@ <description>Apache Kylin Home</description> <link>http://kylin.apache.org/</link> <atom:link href="http://kylin.apache.org/feed.xml" rel="self" type="application/rss+xml"/> - <pubDate>Wed, 08 Sep 2021 00:12:30 -0700</pubDate> - <lastBuildDate>Wed, 08 Sep 2021 00:12:30 -0700</lastBuildDate> + <pubDate>Thu, 21 Oct 2021 22:00:29 -0700</pubDate> + <lastBuildDate>Thu, 21 Oct 2021 22:00:29 -0700</lastBuildDate> <generator>Jekyll v2.5.3</generator> <item> + <title>Kylin4 cloud performance optimization: local cache and soft affinity scheduling</title> + <description><h2 id="section">01 Background</h2> +<p>Recently, the Apache Kylin community released Kylin 4.0 with a brand-new architecture. The Kylin 4.0 architecture supports the separation of storage and compute, which lets Kylin users run Kylin 4.0 in a more flexible cloud deployment whose compute resources can scale elastically. With cloud infrastructure, users can choose cheap and reliable object storage, such as S3, to store cube data. However, under an architecture that separates storage from compute, we need to consider that reading data from remote storage over the network is still a costly operation for compute nodes and often causes a performance loss.<br /> +To improve the query performance of Kylin 4.0 when cloud object storage is used as the storage, we tried introducing a local cache (Local Cache) mechanism into the Kylin 4.0 query engine. When a query runs, frequently used data is cached on local disk, which reduces the latency of pulling data from remote object storage and gives faster query responses. In addition, to avoid wasting disk space by caching the same data on a large number of spark executors at once, and to let compute nodes read more of the data they need from the local cache, we introduced a soft affinity (Soft Affinity) scheduling strategy. The soft affinity strategy uses some method to establish a mapping between spark executors and data files so that, in most cases, the same data is always read on the same executor, which improves the cache hit rate.</p> + +<h2 id="section-1">02 Implementation Principle</h2> + +<h4 id="section-2">1. Local cache</h4> +<p>When Kylin 4.0 executes a query, it mainly goes through the following stages. The stages where a local cache can improve performance are marked with dashed lines:</p> +
+<p><img src="/images/blog/local-cache/Local_cache_stage.png" alt="" /></p> + +<ul> + <li>File list cache: caches file status on the spark driver side. When a query runs, the spark driver needs to read the file list and obtain some file information for subsequent scheduling and execution. The file status information is cached locally here to avoid frequently reading remote file directories.</li> + <li>Data cache: caches data on the spark executor side. Users can cache data in memory or on disk. If data is cached in memory, executor memory needs to be increased appropriately so that the executor has enough memory for the data cache. If data is cached on disk, users need to set a data cache directory, preferably on an SSD. In addition, the maximum capacity of the cached data, the number of replicas, and similar settings can all be configured by users.</li> +</ul> + +<p>Based on the above design, the driver side and the executor side of sparder, the Kylin 4.0 query engine, each keep a different type of cache. The basic architecture is as follows:</p> + +<p><img src="/images/blog/local-cache/kylin4_local_cache.png" alt="" /></p> + +<h4 id="section-3">2. Soft affinity scheduling</h4> +<p>When caching data on the executor side, if every executor caches all of the data, the cached data becomes very large, wastes a great deal of disk space, and is likely to be evicted frequently. To maximize the cache hit rate of spark executors, the spark driver needs to schedule tasks for the same file to the same executors as far as possible when resource conditions are met. This ensures that the data of a given file is cached on one or a few specific executors and can be read from the cache the next time it is needed.<br /> +To this end, we compute the target executor list by hashing the file name and taking the result modulo executors num. The number of executors that hold the cache is determined by the number of cache replicas configured by the user. In general, the larger the number of cache replicas, the higher the probability of a cache hit. When none of the target executors is reachable or has resources available for scheduling, the scheduler falls back to spark's random scheduling mechanism. This approach is called the soft affinity scheduling strategy. Although it cannot guarantee a 100% cache hit rate, it effectively improves the hit rate and avoids the large amount of disk space wasted by a full cache, while losing as little performance as possible.</p> + +<h2 id="section-4">03 Related Configuration</h2> +<p>Based on the principles above, we implemented the basic functionality of local cache + soft affinity scheduling in Kylin 4.0 and ran query performance tests on the ssb and tpch data sets.<br /> +Several of the more important configuration items are listed here for reference. The configuration actually used is given in the link at the end:<br /> +- Whether to enable the soft affinity scheduling strategy: kylin.query.spark-conf.spark.kylin.soft-affinity.enabled<br /> +- Whether to enable the local cache: kylin.query.spark-conf.spark.hadoop.spark.kylin.local-cache.enabled<br />
+- The number of data cache replicas, that is, how many executors cache the same data file: kylin.query.spark-conf.spark.kylin.soft-affinity.replications.num<br /> +- Whether to cache in memory or in a local directory; set to BUFF to cache in memory, or LOCAL to cache locally: kylin.query.spark-conf.spark.hadoop.alluxio.user.client.cache.store.type<br /> +- Maximum cache capacity: kylin.query.spark-conf.spark.hadoop.alluxio.user.client.cache.size</p> + +<h2 id="section-5">04 Performance Comparison</h2> +<p>We ran performance tests for 3 scenarios in an AWS EMR environment. With scale factor = 10, we ran a single-concurrency query test on the ssb data set, and single-concurrency and 4-concurrency query tests on the tpch data set. Both the experimental group and the control group were configured with s3 as storage. Local cache and soft affinity scheduling were enabled in the experimental group and disabled in the control group. In addition, we compared the experimental group's results with results from the same environment using hdfs as storage, so that users can see directly how much local cache + soft affinity scheduling improves a cloud deployment of Kylin 4.0 that uses object storage.</p> + +<p><img src="/images/blog/local-cache/local_cache_benchmark_result_ssb.png" alt="" /></p> + +<p><img src="/images/blog/local-cache/local_cache_benchmark_result_tpch1.png" alt="" /></p> + +<p><img src="/images/blog/local-cache/local_cache_benchmark_result_tpch4.png" alt="" /></p> + +<p>The results above show:<br /> +1. In the single-concurrency scenario on the ssb 10 data set with s3 as storage, enabling local cache and soft affinity scheduling gives roughly a 3x performance improvement, matching the performance with hdfs as storage and even improving on it by about 5%.<br />
+2. On the tpch 10 data set with s3 as storage, for both single-concurrency and multi-concurrency queries, enabling local cache and soft affinity scheduling gives a large performance improvement in nearly all queries.</p> + +<p>However, in the Q21 comparison under the 4-concurrency test on the tpch 10 data set, we observed that the result with local cache and soft affinity scheduling enabled was actually worse than with s3 alone as storage. The data may not have been read through the cache for some reason. The underlying cause was not analyzed further in this test, and we will improve this step by step in later optimization work. Because tpch queries are relatively complex and the SQL types vary, some sql still performs slightly worse than with hdfs as storage, but overall the results are already fairly close to those of hdfs.<br /> +The results of this test are a preliminary verification of the performance gains of local cache + soft affinity scheduling. Overall, local cache + soft affinity scheduling gives a clear performance improvement for both simple and complex queries, but there is some performance loss in high-concurrency query scenarios.<br /> +If users use cloud object storage as the storage for Kylin 4.0, they can get a very good performance experience with local cache + soft affinity scheduling enabled, which provides a performance guarantee for running Kylin 4.0 in the cloud with an architecture that separates compute and storage.</p> + +<h2 id="section-6">05 Code Implementation</h2> +<p>The current code implementation is still at a fairly basic stage, and many details remain to be improved, such as implementing consistent hashing and handling the existing cache when the number of executors changes. The author has therefore not yet submitted a PR to the community code base. Developers who want an early preview can view the source code through the link below:<br /> +<a href="https://github.com/zzcclp/kylin/commit/4e75b7fa4059dd2eaed24061fda7797fecaf2e35">Kylin4.0 local cache + soft affinity scheduling code implementation</a></p> + +<h2 id="section-7">06 Related Links</h2> +<p>The performance test result data and the specific configuration are available through this link:<br /> +<a href="https://github.com/Kyligence/kylin-tpch/issues/9">Kylin4.0 local cache + soft affinity scheduling test</a></p> +</description> + <pubDate>Thu, 21 Oct 2021 04:00:00 -0700</pubDate> + <link>http://kylin.apache.org/cn_blog/2021/10/21/Local-Cache-and-Soft-Affinity-Scheduling/</link> + <guid isPermaLink="true">http://kylin.apache.org/cn_blog/2021/10/21/Local-Cache-and-Soft-Affinity-Scheduling/</guid> + + + <category>cn_blog</category> + + </item> + + <item> + <title>Performance optimization of Kylin 4.0 in cloud -- local cache and soft affinity scheduling</title> + <description><h2 id="background-introduction">01 Background Introduction</h2> +<p>Recently, the Apache
Kylin community released Kylin 4.0.0 with a new architecture. The architecture of Kylin 4.0 supports the separation of storage and computing, which enables Kylin users to run Kylin 4.0 in a more flexible cloud deployment mode with elastically scalable computing resources. With cloud infrastructure, users can choose cheap and reliable object storage, such as S3, to store cube data. However, in an architecture that separates storage and computing, we need to consider that reading data from remote storage over the network is still a costly operation for computing nodes, which often leads to performance loss.<br /> +To improve the query performance of Kylin 4.0 when cloud object storage is used as the storage, we tried introducing a local cache mechanism into the Kylin 4.0 query engine. When a query is executed, frequently used data is cached on the local disk to reduce the latency of pulling data from remote object storage and achieve faster query response. In addition, to avoid wasting disk space by caching the same data on a large number of spark executors at the same time, and to let computing nodes read more of the required data from the local cache, we introduced the soft affinity scheduling strategy. The soft affinity strategy establishes a correspondence between spark executors and data files by some method, so that in most cases the same data is always read on the same executor, which improves the cache hit rate.</p> + +<h2 id="implementation-principle">02 Implementation Principle</h2> + +<h4 id="local-cache">1. Local Cache</h4> + +<p>When Kylin 4.0 executes a query, it mainly goes through the following stages. The stages where local cache can be used to improve performance are marked with dotted lines:</p> + +<p><img src="/images/blog/local-cache/Local_cache_stage.png" alt="" /></p> + +<ul> + <li>File list cache: Cache the file status on the spark driver side.
When executing the query, the spark driver needs to read the file list and obtain some file information for subsequent scheduling and execution. Here, the file status information is cached locally to avoid frequent reads of remote file directories.</li> + <li>Data cache: Cache the data on the spark executor side. You can set the data cache to memory or disk. If it is set to cache to memory, you need to increase the executor memory appropriately to ensure that the executor has enough memory for the data cache. If it is set to cache to disk, you need to set the data cache directory, preferably an SSD directory.</li> +</ul> + +<p>Based on the above design, different types of caches are kept on the driver side and the executor side of the Kylin 4.0 query engine. The basic architecture is as follows:</p> + +<p><img src="/images/blog/local-cache/kylin4_local_cache.png" alt="" /></p> + +<h4 id="soft-affinity-scheduling">2. Soft Affinity Scheduling</h4> + +<p>When caching data on the executor side, if all data is cached on all executors, the cached data becomes very large, wastes a great deal of disk space, and is likely to be evicted frequently. To maximize the cache hit rate of the spark executors, the spark driver needs to schedule the tasks of the same file to the same executors as far as possible when the resource conditions are met. This ensures that the data of the same file is cached on one or a few specific executors and can be read through the cache when it is read again.<br /> +To this end, we calculate the target executor list by hashing the file name and taking the result modulo the number of executors. The number of executors that cache a file is determined by the number of data cache replications configured by the user. Generally, the larger the number of cache replications, the higher the probability of hitting the cache.
When the target executors are unreachable or have no resources for scheduling, the scheduler falls back to the random scheduling mechanism of spark. This scheduling method is called the soft affinity scheduling strategy. Although it cannot guarantee a 100% cache hit rate, it effectively improves the cache hit rate and avoids the large amount of disk space wasted by a full cache, while keeping performance loss to a minimum.</p> + +<h2 id="related-configuration">03 Related Configuration</h2> + +<p>According to the above principles, we implemented the basic function of local cache + soft affinity scheduling in Kylin 4.0, and tested the query performance on the SSB data set and the TPCH data set respectively.<br /> +Several important configuration items are listed here for reference. The actual configuration is given in the link at the end:</p> + +<ul> + <li>Enable soft affinity scheduling: kylin.query.spark-conf.spark.kylin.soft-affinity.enabled</li> + <li>Enable local cache: kylin.query.spark-conf.spark.hadoop.spark.kylin.local-cache.enabled</li> + <li>The number of data cache replications, that is, how many executors cache the same data file: kylin.query.spark-conf.spark.kylin.soft-affinity.replications.num</li> + <li>Cache to memory or to a local directory. Set it to BUFF to cache to memory, or to LOCAL to cache to a local directory: kylin.query.spark-conf.spark.hadoop.alluxio.user.client.cache.store.type</li> + <li>Maximum cache capacity: kylin.query.spark-conf.spark.hadoop.alluxio.user.client.cache.size</li> +</ul> + +<h2 id="performance-benchmark">04 Performance Benchmark</h2> + +<p>We conducted performance tests in three scenarios in an AWS EMR environment. With scale factor = 10, we ran a single-concurrency query test on the SSB dataset, and a single-concurrency query test and a 4-concurrency query test on the TPCH dataset. S3 was configured as storage in both the experimental group and the control group.
Local cache and soft affinity scheduling were enabled in the experimental group, but not in the control group. In addition, we compared the results of the experimental group with the results when HDFS is used as storage in the same environment, so that users can see directly the optimization effect of local cache + soft affinity scheduling when Kylin 4.0 is deployed on the cloud with object storage as storage.</p> + +<p><img src="/images/blog/local-cache/local_cache_benchmark_result_ssb.png" alt="" /></p> + +<p><img src="/images/blog/local-cache/local_cache_benchmark_result_tpch1.png" alt="" /></p> + +<p><img src="/images/blog/local-cache/local_cache_benchmark_result_tpch4.png" alt="" /></p> + +<p>As can be seen from the above results:</p> + +<ol> + <li>In the single-concurrency scenario on the SSB data set, when S3 is used as storage, enabling local cache and soft affinity scheduling improves performance about three times, which matches the performance with HDFS as storage or even improves on it.</li> + <li>On the TPCH data set, when S3 is used as storage, for both single-concurrency and multi-concurrency queries, enabling local cache and soft affinity scheduling greatly improves the performance of almost all queries.</li> +</ol> + +<p>However, in the comparison results of Q21 under the 4-concurrency test on the TPCH dataset, we observed that the result with local cache and soft affinity scheduling enabled is worse than that with S3 alone as storage. The data may not have been read through the cache for some reason. The underlying cause was not analyzed further in this test, and we will improve it gradually in subsequent optimization work.
Moreover, because TPCH queries are complex and the SQL types vary, compared with the results of HDFS, the performance of some SQL is improved while the performance of some SQL is slightly worse, but generally speaking, it is very close to the results with HDFS as storage.<br /> +The result of this performance test is a preliminary verification of the performance improvement of local cache + soft affinity scheduling. On the whole, local cache + soft affinity scheduling achieves a significant performance improvement for both simple queries and complex queries, but there is a certain performance loss in high-concurrency query scenarios.<br /> +If users use cloud object storage as Kylin 4.0 storage, they can get a good performance experience when local cache + soft affinity scheduling is enabled, which provides a performance guarantee for Kylin 4.0 to use an architecture that separates computing and storage in the cloud.</p> + +<h2 id="code-implementation">05 Code Implementation</h2> + +<p>Since the current code implementation is still at a basic stage, there are still many details to be improved, such as implementing consistent hashing and handling the existing cache when the number of executors changes, so the author has not yet submitted a PR to the community code base.
Developers who want to preview in advance can view the source code through the following link:</p> + +<p><a href="https://github.com/zzcclp/kylin/commit/4e75b7fa4059dd2eaed24061fda7797fecaf2e35">The code implementation of local cache and soft affinity scheduling</a></p> + +<h2 id="related-link">06 Related Link</h2> + +<p>You can view the performance test result data and specific configuration through the link:<br /> +<a href="https://github.com/Kyligence/kylin-tpch/issues/9">The benchmark of Kylin4.0 with local cache and soft affinity scheduling</a></p> +</description> + <pubDate>Thu, 21 Oct 2021 04:00:00 -0700</pubDate> + <link>http://kylin.apache.org/blog/2021/10/21/Local-Cache-and-Soft-Affinity-Scheduling/</link> + <guid isPermaLink="true">http://kylin.apache.org/blog/2021/10/21/Local-Cache-and-Soft-Affinity-Scheduling/</guid> + + + <category>blog</category> + + </item> + + <item> <title>Practice and optimization of Kylin in Meituan's in-store dining business</title> <description><p>Since 2016, the Meituan in-store dining technology team has used Apache Kylin as its OLAP engine. However, with the rapid growth of the business, efficiency problems appeared in both building and querying. The technical team therefore started from an interpretation of the underlying principles, then broke the process down layer by layer and drew up an implementation roadmap that moves from individual points to the whole. This article summarizes some experience and lessons learned, in the hope of helping more technical teams in the industry improve the efficiency of data production.</p> @@ -577,155 +728,6 @@ For example, a query joins two subquerie </item> <item> - <title>Why did Youzan choose Kylin4</title> - <description><p>At the QCon Global Software Developers Conference held on May 29, 2021, Zheng Shengjun, head of Youzan's data infrastructure platform, shared Youzan's internal experience with Kylin 4.0 and its optimization practice in the session on open source big data frameworks and applications.
<br /> -For many users of Kylin2/3 (Kylin on HBase), this is also a chance to learn how and why to upgrade to Kylin 4.</p> - -<p>This talk is divided into the following parts:</p> - -<ul> - <li>The reason for choosing Kylin 4</li> - <li>Introduction to Kylin 4</li> - <li>How to optimize performance of Kylin 4</li> - <li>Practice of Kylin 4 in Youzan</li> -</ul> - -<h2 id="the-reason-for-choosing-kylin-4">01 The reason for choosing Kylin 4</h2> - -<h3 id="introduction-to-youzan">Introduction to Youzan</h3> -<p>China Youzan Co., Ltd (stock code 08083.HK) is an enterprise mainly engaged in retail technology services.<br /> -At present, it owns several tools and solutions that provide SaaS software products and talent services to help merchants operate mobile social e-commerce and new retail channels in an all-round way. <br /> -Currently Youzan has hundreds of millions of consumers and 6 million existing merchants.</p> - -<h3 id="history-of-kylin-in-youzan">History of Kylin in Youzan</h3> -<p><img src="/images/blog/youzan/1 history_of_youzan_OLAP.png" alt="" /></p> - -<p>First of all, I would like to share why Youzan chose to upgrade to Kylin 4. Here, let me briefly review the history of Youzan's OLAP infrastructure.</p> - -<p>In the early days of Youzan, in order to iterate quickly on development, we chose pre-computation + MySQL. In 2018, Druid was introduced for its query flexibility and development efficiency, but it had problems such as a low degree of pre-aggregation and no support for precise count distinct measures. In this situation, Youzan introduced Apache Kylin and ClickHouse. Kylin supports a high degree of aggregation, precise count distinct measures and the lowest RT, while ClickHouse is quite flexible in usage (ad hoc query).</p> - -<p>From the introduction of Kylin in 2018 to now, Youzan has used Kylin for more than three years.
With the continuous enrichment of business scenarios and the continuous accumulation of data volume, Youzan currently has 6 million existing merchants, GMV in 2020 was 107.3 billion, and the daily build data volume is more than 10 billion. At present, Kylin covers basically all the business scenarios of Youzan.</p> - -<h3 id="the-challenges-of-kylin-3">The challenges of Kylin 3</h3> -<p>With Youzan's rapid development and in-depth use of Kylin, we also encountered some challenges:</p> - -<ul> - <li>First of all, the build performance of Kylin on HBase cannot meet our expectations, and the build performance affects the user's failure recovery time and stability experience;</li> - <li>Secondly, the onboarding of more large merchants (tens of millions of members in a single store, with hundreds of thousands of goods for each store) also brings great challenges to our OLAP system. Kylin on HBase is limited by the single-point query of Query Server, and cannot support these complex scenarios well;</li> - <li>Finally, because HBase is not a cloud-native system, it is difficult to scale up and down flexibly. With the continuous growth of data volume, the system load has business peaks and valleys, so the average resource utilization rate is not high enough.</li> -</ul> - -<p>Faced with these challenges, Youzan chose to upgrade to the more cloud-native Apache Kylin 4.</p> - -<h2 id="introduction-to-kylin-4">02 Introduction to Kylin 4</h2> -<p>First of all, let's introduce the main advantages of Kylin 4. Apache Kylin 4 depends entirely on Spark for cubing jobs and queries.
It can make full use of Spark's parallelization, vectorization, and global dynamic code generation technologies to improve the efficiency of large queries.<br /> -Here is a brief introduction to the principle of Kylin 4, that is, the storage engine, build engine and query engine.</p> - -<h3 id="storage-engine">Storage engine</h3> -<p><img src="/images/blog/youzan/2 kylin4_storage.png" alt="" /></p> - -<p>First of all, let's take a look at the new storage engine and compare Kylin on HBase with Kylin on Parquet. The cuboid data of Kylin on HBase is stored in HBase tables. A single Segment corresponds to one HBase table. Aggregation is pushed down to the HBase coprocessor.</p> - -<p>But as we know, HBase is not a true columnar storage and its throughput is not enough for an OLAP system. Kylin 4 replaces HBase with Parquet, and all the data is stored in files. Each segment has a corresponding HDFS directory. All queries and cubing jobs read and write files without HBase. Although there is a certain loss of performance for simple queries, the improvement for complex queries is more considerable and worthwhile.</p> - -<h3 id="build-engine">Build engine</h3> -<p><img src="/images/blog/youzan/3 kylin4_build_engine.png" alt="" /></p> - -<p>The second is the new build engine. Based on our test, the build time has been reduced from 82 minutes to 15 minutes with Kylin on Parquet. There are several reasons:</p> - -<ul> - <li>Kylin 4 removes the encoding of dimensions, eliminating an encoding step from the build;</li> - <li>The HBase File generation step is removed;</li> - <li>Kylin on Parquet changes the granularity of cubing to cuboid level, which helps to further improve the parallelism of cubing jobs.</li> - <li>Enhanced implementation for global dictionary.
In the new algorithm, dictionary and source data are hashed into the same buckets, making it possible to load only a piece of the dictionary bucket to encode source data.</li> -</ul> - -<p>As you can see on the right, after upgrading to Kylin 4, the cubing job changes from ten steps to two steps, and the build performance improvement is very obvious.</p> - -<h3 id="query-engine">Query engine</h3> -<p><img src="/images/blog/youzan/4 kylin4_query.png" alt="" /></p> - -<p>Next is the new query engine of Kylin 4. As you can see, the calculation of Kylin on HBase depends completely on the HBase coprocessor and the query server process. When data is read from HBase into the query server for aggregation, sorting, etc., the single query server becomes the bottleneck. Kylin 4, in contrast, uses a fully distributed query mechanism based on Spark. What's more, it is able to tune configuration automatically in the spark query step!</p> - -<h2 id="how-to-optimize-performance-of-kylin-4">03 How to optimize performance of Kylin 4</h2> -<p>Next, I'd like to share some performance optimizations made by Youzan in Kylin 4.</p> - -<h3 id="optimization-of-query-engine">Optimization of query engine</h3> -<p>#### 1.Cache Calcite physical plan<br /> -<img src="/images/blog/youzan/5 cache_calcite_plan.png" alt="" /></p> - -<p>In Kylin4, SQL is analyzed, optimized and turned into generated code in calcite. This step takes about 150ms for some queries. We have added PreparedStatementCache support in Kylin4 to cache the calcite plan, so that SQL with the same structure doesn't have to go through the same step again. This optimization saves about 150ms.</p> - -<h4 id="tunning-spark-configuration">2.Tuning spark configuration</h4> -<p><img src="/images/blog/youzan/6 tuning_spark_configuration.png" alt="" /></p> - -<p>Kylin4 uses spark as query engine.
As spark is a distributed engine designed for massive data processing, it inevitably loses some performance on small queries. We have done some tuning to catch up with the latency of Kylin on HBase for small queries.</p> - -<p>Our first optimization is to make more calculations finish in memory. The key is to avoid data spills during aggregation, shuffle and sort. Tuning the following configuration is helpful.</p> - -<ul> - <li>1.set <code class="highlighter-rouge">spark.sql.objectHashAggregate.sortBased.fallbackThreshold</code> to a larger value to keep HashAggregate from falling back to Sort Based Aggregate, which really hurts performance when it happens.</li> - <li>2.set <code class="highlighter-rouge">spark.shuffle.spill.initialMemoryThreshold</code> to a large value to avoid too many spills during shuffle.</li> -</ul> - -<p>Secondly, we route small queries to a Query Server that runs spark in local mode, because the overhead of task scheduling, shuffle read and variable broadcast is magnified for small queries in YARN/Standalone mode.</p> - -<p>Thirdly, we use a RAM disk to enhance shuffle performance: mount the RAM disk as TMPFS and set spark.local.dir to a directory on the RAM disk.</p> - -<p>Lastly, we disabled spark's whole stage code generation for small queries, because it costs about 100ms~200ms and does not benefit small queries that are a simple projection.</p> - -<h4 id="parquet-optimization">3.Parquet optimization</h4> -<p><img src="/images/blog/youzan/7 parquet_optimization.png" alt="" /></p> - -<p>Optimizing parquet is also important for queries.</p> - -<p>The first principle is that we'd better always include the shard by column in our filter condition. Parquet files are sharded by the shard by column, so filtering on it reduces the number of data files to read.</p> - -<p>Then, looking into the parquet files, data within a file is sorted by rowkey columns. That is to say, prefix match in queries is as important as in Kylin on HBase.
When a query condition satisfies prefix match, it can filter row groups with the column's max/min index. Furthermore, we can reduce the row group size to get finer index granularity, but be aware that the compression rate will be lower if we make the row group size smaller.</p> - -<h4 id="dynamic-elimination-of-partitioning-dimensions">4.Dynamic elimination of partitioning dimensions</h4> -<p>Kylin4 has a new ability that older versions lack, which can reduce data reading and computing by dozens of times for some big queries. It's often the case that the partition column is used to filter data but not used as a group dimension. For those cases Kylin would previously always choose a cuboid with the partition column, but now it is able to use a different cuboid in that query to reduce IO reads and computing.</p> - -<p>The key to this optimization is to split a query into two parts. One part uses all of a segment's data, so the partition column doesn't have to be included in the cuboid; the other part, which uses only part of a segment's data, chooses a cuboid with the partition dimension to filter the data.</p> - -<p>We have tested that in some situations the response time is reduced from 20s to 6s, or from 10s to 3s.</p> - -<p><img src="/images/blog/youzan/8 Dynamic_elimination_of_partitioning_dimensions.png" alt="" /></p> - -<h3 id="optimization-of-build-engine">Optimization of build engine</h3> -<p>#### 1.cache parent dataset<br /> -<img src="/images/blog/youzan/9 cache_parent_dataset.png" alt="" /></p> - -<p>Kylin builds a cube layer by layer. For a parent layer with multiple cuboids to build, we can choose to cache the parent dataset by setting kylin.engine.spark.parent-dataset.max.persist.count to a number greater than 0.
But notice that if you set this value too small, it will affect the parallelism of the build job, as the build granularity is at cuboid level.</p> - -<h2 id="practice-of-kylin-4-in-youzan">04 Practice of Kylin 4 in Youzan</h2> -<p>After introducing Youzan's experience of performance optimization, let's share the optimization effect, that is, Kylin 4's practice in Youzan, including the upgrade process and the performance of the online system.</p> - -<h3 id="upgrade-metadata-to-adapt-to-kylin-4">Upgrade metadata to adapt to Kylin 4</h3> -<p>First of all, for the Kylin 3 metadata stored in HBase, we have developed a tool for seamless metadata upgrade. We export the metadata from HBase into local files, and then use the tool to transform the metadata and write the new metadata into MySQL. We also updated the operation documents and general principles in the official wiki of Apache Kylin. For more details, you can refer to: <a href="https://wiki.apache.org/confluence/display/KYLIN/How+to+migrate+metadata+to+Kylin+4">How to migrate metadata to Kylin 4</a>.</p> - -<p>Let's give a general introduction to compatibility in the whole process. The project metadata, table metadata, permission-related metadata, and model metadata do not need to be modified. What needs to be modified is the cube metadata, including the storage type and query type used by the Cube. After updating these two fields, you need to recalculate the Cube signature. This signature is used internally by Kylin to avoid problems caused by modifying a Cube after it has been defined.</p> - -<h3 id="performance-of-kylin-4-on-youzan-online-system">Performance of Kylin 4 on Youzan online system</h3> -<p><img src="/images/blog/youzan/10 commodity_insight.png" alt="" /></p> - -<p>After the migration of metadata to Kylin4, let's share the qualitative changes and substantial performance improvements seen in some of the promising scenarios.
First of all, in a scenario like Commodity Insight, there is a large store with several hundred thousand commodities. We have to analyze its transactions, traffic, and so on. There are more than a dozen precise count distinct measures in a single cube. A precise count distinct measure is actually very inefficient if it is not optimized through pre-calculation and Bitmap. Kylin currently uses Bitmap to support precise count distinct measures. In a scenario that requires complex queries to sort hundreds of thousands of commodities by various UV metrics (precise count distinct measures), the RT of Kylin 2 is 27 seconds, while the RT of Kylin 4 is less than 2 seconds.</p> - -<p>What I find most appealing about Kylin 4 is that it's like a manual transmission car: you can control its query concurrency at will, whereas you can't freely change query concurrency in Kylin on HBase, because its concurrency is completely tied to the number of regions.</p> - -<h3 id="plan-for-kylin-4-in-youzan">Plan for Kylin 4 in Youzan</h3> -<p>We have spent several months thoroughly testing Apache Kylin 4, fixing several bugs and improving it. Now we are migrating cubes from the older version to the newer version. For the cubes already migrated to Kylin 4, the performance of small queries meets our expectations, and the complex query and build performance was a pleasant surprise.
We are planning to migrate all cubes from the older version to Kylin4.</p> -</description> - <pubDate>Thu, 17 Jun 2021 08:00:00 -0700</pubDate> - <link>http://kylin.apache.org/blog/2021/06/17/Why-did-Youzan-choose-Kylin4/</link> - <guid isPermaLink="true">http://kylin.apache.org/blog/2021/06/17/Why-did-Youzan-choose-Kylin4/</guid> - - - <category>blog</category> - - </item> - - <item> <title>Why did Youzan choose Kylin4</title> <description><p>At the QCon Global Software Development Conference held on May 29, 2021, Zheng Shengjun, head of Youzan's data infrastructure platform, shared Youzan's internal experience of using and optimizing Kylin 4.0 in the track on open source big data frameworks and applications. For many long-time Kylin users, this is also a practical guide to upgrading to Kylin 4.</p> @@ -885,6 +887,155 @@ Here is a brief introduction to the prin </item> <item> + <title>Why did Youzan choose Kylin4</title> + <description><p>At the QCon Global Software Developers Conference held on May 29, 2021, Zheng Shengjun, head of Youzan's data infrastructure platform, shared Youzan's internal experience of using and optimizing Kylin 4.0 in the track on open source big data frameworks and applications. <br /> +For many users of Kylin2/3 (Kylin on HBase), this is also a chance to learn how and why to upgrade to Kylin 4.</p> + +<p>This sharing is mainly divided into the following parts:</p> + +<ul> + <li>The reason for choosing Kylin 4</li> + <li>Introduction to Kylin 4</li> + <li>How to optimize performance of Kylin 4</li> + <li>Practice of Kylin 4 in Youzan</li> +</ul> + +<h2 id="the-reason-for-choosing-kylin-4">01 The reason for choosing Kylin 4</h2> + +<h3 id="introduction-to-youzan">Introduction to Youzan</h3> +<p>China Youzan Co., Ltd (stock code 08083.HK) is an enterprise mainly engaged in retail technology services.<br /> +At present, it owns several tools and solutions that provide SaaS software products and talent services to help merchants operate mobile social e-commerce and new retail channels in an all-round way. 
<br /> +Currently Youzan has hundreds of millions of consumers and 6 million existing merchants.</p> + +<h3 id="history-of-kylin-in-youzan">History of Kylin in Youzan</h3> +<p><img src="/images/blog/youzan/1 history_of_youzan_OLAP.png" alt="" /></p> + +<p>First of all, I would like to share why Youzan chose to upgrade to Kylin 4. Here, let me briefly review the history of Youzan's OLAP infrastructure.</p> + +<p>In the early days of Youzan, in order to iterate quickly on development, we chose pre-computation + MySQL. In 2018, Druid was introduced for its query flexibility and development efficiency, but it had problems such as a low degree of pre-aggregation and no support for precise count distinct measures. In this situation, Youzan introduced Apache Kylin and ClickHouse. Kylin supports a high degree of aggregation, precise count distinct measures and the lowest RT, while ClickHouse is quite flexible in usage (ad hoc queries).</p> + +<p>From the introduction of Kylin in 2018 to now, Youzan has used Kylin for more than three years. With the continuous enrichment of business scenarios and the continuous accumulation of data volume, Youzan currently has 6 million existing merchants, GMV in 2020 was 107.3 billion, and the daily build data volume is more than 10 billion. At present, Kylin basically covers all the business scenarios of Youzan.</p> + +<h3 id="the-challenges-of-kylin-3">The challenges of Kylin 3</h3> +<p>With Youzan's rapid development and in-depth use of Kylin, we also encountered some challenges:</p> + +<ul> + <li>First of all, the build performance of Kylin on HBase cannot meet Youzan's expectations, and the build performance affects the user's failure recovery time and stability experience;</li> + <li>Secondly, as more large merchants come on board (tens of millions of members in a single store, with hundreds of thousands of goods for each store), they bring great challenges to our OLAP system. 
Kylin on HBase is limited by the single-point query of Query Server, and cannot support these complex scenarios well;</li> + <li>Finally, because HBase is not a cloud-native system, it is difficult to scale up and down flexibly. With the continuous growth of data volume, the business load on this system has peaks and valleys, so the average resource utilization rate is not high enough.</li> +</ul> + +<p>Faced with these challenges, Youzan chose to upgrade to the more cloud-native Apache Kylin 4.</p> + +<h2 id="introduction-to-kylin-4">02 Introduction to Kylin 4</h2> +<p>First of all, let's introduce the main advantages of Kylin 4. Apache Kylin 4 depends completely on Spark for cubing jobs and queries. It can make full use of Spark's parallelization, vectorization, and global dynamic code generation technologies to improve the efficiency of large queries.<br /> +Here is a brief introduction to the principle of Kylin 4, that is, the storage engine, build engine and query engine.</p> + +<h3 id="storage-engine">Storage engine</h3> +<p><img src="/images/blog/youzan/2 kylin4_storage.png" alt="" /></p> + +<p>First of all, let's take a look at the new storage engine by comparing Kylin on HBase with Kylin on Parquet. The cuboid data of Kylin on HBase is stored in HBase tables. A single segment corresponds to one HBase table. Aggregation is pushed down to the HBase coprocessor.</p> + +<p>But as we know, HBase is not true columnar storage and its throughput is not enough for an OLAP system. Kylin 4 replaces HBase with Parquet, and all the data is stored in files. Each segment has a corresponding HDFS directory. All queries and cubing jobs read and write files without HBase. 
Although there will be a certain loss of performance for simple queries, the improvement for complex queries is more considerable and worthwhile.</p> + +<h3 id="build-engine">Build engine</h3> +<p><img src="/images/blog/youzan/3 kylin4_build_engine.png" alt="" /></p> + +<p>The second is the new build engine. Based on our test, the build time of Kylin on Parquet has been reduced from 82 minutes to 15 minutes. There are several reasons:</p> + +<ul> + <li>Kylin 4 removes the encoding of dimensions, eliminating a build step;</li> + <li>The HBase file generation step is removed;</li> + <li>Kylin on Parquet changes the granularity of cubing to the cuboid level, which helps to further improve the parallelism of cubing jobs;</li> + <li>The global dictionary implementation is enhanced. In the new algorithm, the dictionary and source data are hashed into the same buckets, making it possible to load only part of the dictionary (a single bucket) to encode source data.</li> +</ul> + +<p>As you can see on the right, after upgrading to Kylin 4, the cubing job changes from ten steps to two steps, and the improvement in build performance is very obvious.</p> + +<h3 id="query-engine">Query engine</h3> +<p><img src="/images/blog/youzan/4 kylin4_query.png" alt="" /></p> + +<p>Next is the new query engine of Kylin 4. As you can see, the calculation of Kylin on HBase depends completely on the HBase coprocessor and the query server process. When the data is read from HBase into the query server to do aggregation, sorting, etc., the single query server becomes the bottleneck. 
But Kylin 4 switches to a fully distributed query mechanism based on Spark; what's more, it's able to tune configuration automatically in the Spark query step!</p> + +<h2 id="how-to-optimize-performance-of-kylin-4">03 How to optimize performance of Kylin 4</h2> +<p>Next, I'd like to share some performance optimizations made by Youzan in Kylin 4.</p> + +<h3 id="optimization-of-query-engine">Optimization of query engine</h3> +<p>#### 1.Cache Calcite physical plan<br /> +<img src="/images/blog/youzan/5 cache_calcite_plan.png" alt="" /></p> + +<p>In Kylin4, SQL is analyzed, optimized and turned into generated code in Calcite. This step takes about 150ms for some queries. We have added PreparedStatementCache support in Kylin4 to cache the Calcite plan, so that SQL with the same structure doesn't have to go through this step again. This optimization saves about 150ms.</p> + +<h4 id="tunning-spark-configuration">2.Tuning spark configuration</h4> +<p><img src="/images/blog/youzan/6 tuning_spark_configuration.png" alt="" /></p> + +<p>Kylin4 uses Spark as its query engine. As Spark is a distributed engine designed for massive data processing, it inevitably loses some performance for small queries. We have done some tuning to catch up with the latency of Kylin on HBase for small queries.</p> + +<p>Our first optimization is to make more calculations finish in memory. The key is to avoid data spill during aggregation, shuffle and sort. 
Tuning the following configuration is helpful.</p> + +<ul> + <li>1.Set <code class="highlighter-rouge">spark.sql.objectHashAggregate.sortBased.fallbackThreshold</code> to a larger value to keep HashAggregate from falling back to sort-based aggregation, which really kills performance when it happens.</li> + <li>2.Set <code class="highlighter-rouge">spark.shuffle.spill.initialMemoryThreshold</code> to a large value to avoid too many spills during shuffle.</li> +</ul> + +<p>Secondly, we route small queries to a Query Server which runs Spark in local mode, because the overhead of task scheduling, shuffle read and variable broadcast is amplified for small queries in YARN/Standalone mode.</p> + +<p>Thirdly, we use a RAM disk to enhance shuffle performance: mount the RAM disk as TMPFS and set spark.local.dir to a directory on the RAM disk.</p> + +<p>Lastly, we disabled Spark's whole-stage code generation for small queries, because it costs about 100ms~200ms and brings no benefit to small queries that are simple projections.</p> + +<h4 id="parquet-optimization">3.Parquet optimization</h4> +<p><img src="/images/blog/youzan/7 parquet_optimization.png" alt="" /></p> + +<p>Optimizing Parquet is also important for queries.</p> + +<p>The first principle is that we should always include the shard-by column in the filter condition: Parquet files are sharded by the shard-by column, so filtering on it reduces the number of data files to read.</p> + +<p>Next, looking inside Parquet files, data within a file is sorted by rowkey columns; that is to say, prefix match in queries is as important as in Kylin on HBase. When a query condition satisfies prefix match, it can filter row groups with the column's max/min index. 
Furthermore, we can reduce the row group size to make the index granularity finer, but be aware that the compression rate will be lower if we make the row group size smaller.</p> + +<h4 id="dynamic-elimination-of-partitioning-dimensions">4.Dynamic elimination of partitioning dimensions</h4> +<p>Kylin4 has a new ability that older versions lack, which can reduce data reading and computing by dozens of times for some big queries. It's often the case that the partition column is used to filter data but not used as a group-by dimension. In those cases Kylin used to always choose a cuboid with the partition column, but now it is able to use different cuboids in that query to reduce IO and computing.</p> + +<p>The key of this optimization is to split a query into two parts: one part covers the segments whose data is used in full, so the partition column doesn't have to be included in the cuboid; the other part covers the segments whose data is only partly used, and chooses a cuboid with the partition dimension to filter the data.</p> + +<p>In our tests, the response time in some situations dropped from 20s to 6s, and from 10s to 3s.</p> + +<p><img src="/images/blog/youzan/8 Dynamic_elimination_of_partitioning_dimensions.png" alt="" /></p> + +<h3 id="optimization-of-build-engine">Optimization of build engine</h3> +<p>#### 1.cache parent dataset<br /> +<img src="/images/blog/youzan/9 cache_parent_dataset.png" alt="" /></p> + +<p>Kylin builds cubes layer by layer. For a parent layer with multiple cuboids to build, we can choose to cache the parent dataset by setting kylin.engine.spark.parent-dataset.max.persist.count to a number greater than 0. But note that if you set this value too small, it will affect the parallelism of the build job, because the build granularity is at the cuboid level.</p> + +<h2 id="practice-of-kylin-4-in-youzan">04 Practice of Kylin 4 in Youzan</h2> +<p>After introducing Youzan's experience of performance optimization, let's share the optimization effect. 
That is, Kylin 4's practice in Youzan includes the upgrade process and the performance of the online system.</p> + +<h3 id="upgrade-metadata-to-adapt-to-kylin-4">Upgrade metadata to adapt to Kylin 4</h3> +<p>First of all, we have developed a tool for seamlessly upgrading the Kylin 3 metadata, which is stored in HBase. We export the metadata in HBase into local files, and then use the tool to transform it and write the new metadata into MySQL. We also updated the operation documents and general principles in the official wiki of Apache Kylin. For more details, you can refer to: <a href="https://wiki.apache.org/confluence/display/KYLIN/How+to+migrate+metadata+to+Kylin+4">How to migrate metadata to Kylin 4</a>.</p> + +<p>Let's give a general introduction to compatibility in the whole process. The project metadata, table metadata, permission-related metadata, and model metadata do not need to be modified. What needs to be modified is the cube metadata, including the storage type and query type used by the cube. After updating these two fields, you need to recalculate the cube signature. This signature is designed internally by Kylin to avoid some problems caused by the cube after the cube is determined.</p> + +<h3 id="performance-of-kylin-4-on-youzan-online-system">Performance of Kylin 4 on Youzan online system</h3> +<p><img src="/images/blog/youzan/10 commodity_insight.png" alt="" /></p> + +<p>After the migration of metadata to Kylin4, let's share the qualitative changes and substantial performance improvements seen in some of the promising scenarios. First of all, in a scenario like Commodity Insight, there is a large store with several hundred thousand commodities, and we have to analyze its transactions, traffic, and so on. A single cube contains more than a dozen precise count distinct measures. 
A precise count distinct measure is actually very inefficient if it is not optimized through pre-calculation and Bitmap. Kylin currently uses Bitmap to support precise count distinct measures. In a scenario that requires complex queries to sort hundreds of thousands of commodities by various UV (precise count distinct) measures, the RT of Kylin 2 is 27 seconds, while the RT of Kylin 4 is less than 2 seconds.</p> + +<p>What I find most appealing about Kylin 4 is that it's like a manual transmission car: you can control its query concurrency at will, whereas you can't freely change query concurrency in Kylin on HBase, because its concurrency is completely tied to the number of regions.</p> + +<h3 id="plan-for-kylin-4-in-youzan">Plan for Kylin 4 in Youzan</h3> +<p>We have spent several months fully testing Apache Kylin 4, fixing several bugs and improving it. Now we are migrating cubes from the older version to the newer version. For the cubes already migrated to Kylin 4, small-query performance meets our expectations, and complex-query and build performance came as a big surprise. 
We are planning to migrate all cubes from the older version to Kylin4.</p> +</description> + <pubDate>Thu, 17 Jun 2021 08:00:00 -0700</pubDate> + <link>http://kylin.apache.org/blog/2021/06/17/Why-did-Youzan-choose-Kylin4/</link> + <guid isPermaLink="true">http://kylin.apache.org/blog/2021/06/17/Why-did-Youzan-choose-Kylin4/</guid> + + + <category>blog</category> + + </item> + + <item> <title>You are just one Kylin + Davinci away from a cool visualization dashboard</title> <description><p>Kylin provides integration with BI tools such as Tableau, PowerBI/Excel, MSTR, QlikSense, Hue and SuperSet. As a visualization tool, however, Davinci offers good interactivity and customizable large-screen dashboards, so combining it with Kylin gives most users a better visual analysis experience.</p> @@ -1030,730 +1181,6 @@ You should be able to see the tables/cub <category>blog</category> - - </item> - - <item> - <title>Detailed Analysis of refine query cache</title> - <description><hr /> - -<h2 id="part-i-basic-introduction">Part-I Basic Introduction</h2> - -<h3 id="backgroud">Background</h3> -<p>In the past, the query cache was not used efficiently in Kylin for two reasons: a <strong>coarse-grained cache expiration strategy</strong> and the <strong>lack of external cache</strong>. Because of the aggressive cache expiration strategy, useful caches were often cleaned up unnecessarily. Because query caches were stored on local servers, they could not be shared between servers. And because of the size limitation of the local cache, not all useful query results could be cached.</p> - -<p>To deal with these shortcomings, we changed the query cache expiration strategy to signature checking and introduced memcached as Kylin's distributed cache, so that Kylin servers are able to share the cache between servers. It is also easy to add memcached servers to scale out the distributed cache.</p> - -<p>These features were proposed and developed by the eBay Kylin team. 
Thanks so much for their contribution.</p> - -<h3 id="related-jira">Related JIRA</h3> - -<ul> - <li><a href="https://issues.apache.org/jira/browse/KYLIN-2895">KYLIN-2895 Refine Query Cache</a> - <ul> - <li><a href="https://issues.apache.org/jira/browse/KYLIN-2899">KYLIN-2899 Introduce segment level query cache</a></li> - <li><a href="https://issues.apache.org/jira/browse/KYLIN-2898">KYLIN-2898 Introduce memcached as a distributed cache for queries</a></li> - <li><a href="https://issues.apache.org/jira/browse/KYLIN-2894">KYLIN-2894 Change the query cache expiration strategy by signature checking</a></li> - <li><a href="https://issues.apache.org/jira/browse/KYLIN-2897">KYLIN-2897 Improve the query execution for a set of duplicate queries in a short period</a></li> - <li><a href="https://issues.apache.org/jira/browse/KYLIN-2896">KYLIN-2896 Refine query exception cache</a></li> - </ul> - </li> -</ul> - -<hr /> - -<h2 id="part-ii-deep-dive">Part-II Deep Dive</h2> - -<ul> - <li>Introduce memcached as a Distributed Query Cache</li> - <li>Segment Level Cache</li> - <li>Query Cache Expiration Strategy by Signature Checking</li> - <li>Other Enhancement</li> -</ul> - -<h3 id="introduce-memcached-as-a-distributed-query-cache">Introduce memcached as a Distributed Query Cache</h3> - -<p><strong>Memcached</strong> is a free and open source, high-performance, distributed memory object caching system. It is an in-memory key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering. It is simple yet powerful. Its simple design promotes quick deployment, ease of development, and solves many problems facing large data caches. Its API is available for most popular languages.</p> - -<p>With KYLIN-2898, Kylin uses <strong>Memcached</strong> as the distributed cache service and <strong>EhCache</strong> as the local cache service. 
When <code class="highlighter-rouge">RemoteLocalFailOverCacheManager</code> is configured in <code class="highlighter-rouge">applicationContext.xml</code>, for each cache put/get action, Kylin first checks whether the remote cache service is available; the local cache service is used only if the remote cache service is unavailable.</p> - -<p>Firstly, multiple query servers can share the query cache, and each Kylin server occupies less JVM memory, which helps to reduce GC pressure. Secondly, since memcached is centralized, duplicated cache entries across several Kylin processes are avoided. Thirdly, memcached has a larger capacity and is easy to scale out, which reduces the chance that useful cache entries have to be dropped due to limited memory capacity.</p> - -<p>To handle node failure and to scale out the memcached cluster, the author has introduced a consistent hash strategy. Ketama is an implementation of a consistent hashing algorithm, meaning you can add or remove servers from the memcached pool without causing a complete remap of all keys. Details can be found at <a href="https://www.last.fm/user/RJ/journal/2007/04/10/rz_libketama_-_a_consistent_hashing_algo_for_memcache_clients">Ketama consistent hash strategy</a>.</p> - -<p><img src="/images/blog/refine-query-cache/consistent-hashing.png" alt="consistent hashing" /></p> - -<h3 id="segment-level-cache">Segment level Cache</h3> - -<p>Currently Kylin uses the SQL as the cache key: when a query comes in, if its result exists in the cache, Kylin directly returns the cached result and doesn't need to query HBase. When a new segment is built or an existing segment is refreshed, all related cached results need to be evicted. 
For frequently built cubes such as streaming cubes (NRT Streaming or Real-time OLAP), cache misses will increase dramatically, which may decrease query performance.</p> - -<p>Since most historical segments of a Kylin cube are immutable, the same query against historical segments should always return the same result, which does not need to be evicted when a new segment is built. So we decided to implement the segment level cache. It is a complement to the existing front-end cache, and the idea is similar to the level1/level2 cache in an operating system.</p> - -<p><img src="/images/blog/refine-query-cache/l1-l2-cache.png" alt="l1-l2-cache" /></p> - -<h3 id="query-cache-expiration-strategy-by-signature-checking">Query Cache Expiration Strategy by Signature Checking</h3> - -<p>Currently, to invalidate the query cache, <code class="highlighter-rouge">CacheService</code> will either invoke <code class="highlighter-rouge">cleanDataCache</code> or <code class="highlighter-rouge">cleanAllDataCache</code>. Both methods clear all of the query cache, which is very inefficient and unnecessary. In a production environment, there are around hundreds of cubing jobs per day, which means the query cache is cleared every few minutes. So we introduced a signature to improve the cache invalidation strategy.</p> - -<p>The basic idea is as follows:<br /> -When putting a SQLResponse into the cache, we add a signature to it. To calculate the signature for a SQLResponse, we use the cube's last build time and its segments as input of <code class="highlighter-rouge">SignatureCalculator</code>.<br /> -When fetching a <code class="highlighter-rouge">SQLResponse</code> from the cache, we first check whether the signature is consistent. If not, the cached value is out of date and will be invalidated.</p> - -<p>The signature is calculated as follows:<br /> -1. 
<code class="highlighter-rouge">toString</code> of <code class="highlighter-rouge">ComponentSignature</code> concatenates its member variables into one large String; if a <code class="highlighter-rouge">ComponentSignature</code> has other <code class="highlighter-rouge">ComponentSignature</code> as members, toString is calculated recursively<br /> -2. The return value of <code class="highlighter-rouge">toString</code> is the input of <code class="highlighter-rouge">SignatureCalculator</code>;<br /> -<code class="highlighter-rouge">SignatureCalculator</code> encodes the string using MD5 as the identifier of the query cache signature</p> - -<p><img src="/images/blog/refine-query-cache/cache-signature.png" alt="cache-signature" /></p> - -<h3 id="other-enhancement">Other Enhancement</h3> - -<h4 id="improve-the-query-execution-for-a-set-of-duplicate-queries-in-a-short-period">Improve the query execution for a set of duplicate queries in a short period</h4> - -<p>If the same query enters Kylin at the same time from different clients, none of them can find a cached result, so each must be calculated separately. Even worse, if these queries are complex, they usually take a long time, so Kylin has less chance to use the query cache; they also consume large computation resources, which degrades query server performance and harms the HBase cluster.</p> - -<p>To reduce the impact of duplicated complex queries, it may be a good idea to block the queries that come later and let them wait for the first one to return its result as far as possible. This lazy strategy is especially useful if duplicated complex queries arrive at the same time. To enable it, set <code class="highlighter-rouge">kylin.query.lazy-query-enabled</code> to <code class="highlighter-rouge">true</code>. 
Optionally, you may set <code class="highlighter-rouge">kylin.query.lazy-query-waiting-timeout-milliseconds</code> to the maximum time that later duplicated queries should wait, according to your situation.</p> - -<h4 id="remove-exception-cache">Remove exception cache</h4> -<p>Formerly, the query cache was divided into two parts, one for storing successful query results and another for failed query results, and they were invalidated separately. This is not a good classification criterion because it is not fine-grained enough. After the query cache signature was introduced, there was no reason to keep them apart, so the exception cache was removed.</p> - -<hr /> - -<h2 id="part-iii-how-to-use">Part-III How to Use</h2> - -<p>To get prepared, you need to install memcached; you may refer to https://github.com/memcached/memcached/wiki/Install. Then you should modify <code class="highlighter-rouge">kylin.properties</code> and <code class="highlighter-rouge">applicationContext.xml</code>.</p> - -<ul> - <li>kylin.properties</li> -</ul> - -<div class="highlight"><pre><code class="language-groff" data-lang="groff">kylin.cache.memcached.hosts=10.1.2.42:11211 -kylin.query.cache-signature-enabled=true -kylin.query.lazy-query-enabled=true -kylin.metrics.memcached.enabled=true -kylin.query.segment-cache-enabled=true</code></pre></div> - -<ul> - <li>applicationContext.xml</li> -</ul> - -<div class="highlight"><pre><code class="language-groff" data-lang="groff">&lt;cache:annotation-driven/&gt; - -&lt;bean id="ehcache" class="org.springframework.cache.ehcache.EhCacheManagerFactoryBean" - p:configLocation="classpath:ehcache-test.xml" p:shared="true"/&gt; - -&lt;bean id="remoteCacheManager" class="org.apache.kylin.cache.cachemanager.MemcachedCacheManager"/&gt; -&lt;bean id="localCacheManager" class="org.apache.kylin.cache.cachemanager.InstrumentedEhCacheCacheManager" - p:cacheManager-ref="ehcache"/&gt; -&lt;bean id="cacheManager" 
class="org.apache.kylin.cache.cachemanager.RemoteLocalFailOverCacheManager"/&gt; - -&lt;bean id="memcachedCacheConfig" class="org.apache.kylin.cache.memcached.MemcachedCacheConfig"&gt; - &lt;property name="timeout" value="500"/&gt; - &lt;property name="hosts" value="${kylin.cache.memcached.hosts}"/&gt; -&lt;/bean&gt;</code></pre></div> - -<h3 id="configuration-for-query-cache">Configuration for query cache</h3> - -<h4 id="general-part">General part</h4> - -<table> - <thead> - <tr> - <th style="text-align: left">Conf Key</th> - <th style="text-align: left">Conf value</th> - <th style="text-align: left">Explanation</th> - </tr> - </thead> - <tbody> - <tr> - <td style="text-align: left">kylin.query.cache-enabled</td> - <td style="text-align: left">boolean, default true</td> - <td style="text-align: left">whether to enable query cache</td> - </tr> - <tr> - <td style="text-align: left">kylin.query.cache-threshold-duration</td> - <td style="text-align: left">long, in milliseconds, default is 2000</td> - <td style="text-align: left">query duration threshold</td> - </tr> - <tr> - <td style="text-align: left">kylin.query.cache-threshold-scan-count</td> - <td style="text-align: left">long, default is 10240</td> - <td style="text-align: left">query scan row count threshold</td> - </tr> - <tr> - <td style="text-align: left">kylin.query.cache-threshold-scan-bytes</td> - <td style="text-align: left">long, default is 1024 * 1024 (1MB)</td> - <td style="text-align: left">query scan byte threshold</td> - </tr> - </tbody> -</table> - -<h4 id="memcached-part">Memcached part</h4> - -<table> - <thead> - <tr> - <th style="text-align: left">Conf Key</th> - <th style="text-align: left">Conf value</th> - <th style="text-align: left">Explanation</th> - </tr> - </thead> - <tbody> - <tr> - <td style="text-align: left">kylin.cache.memcached.hosts</td> - <td style="text-align: left">host1:port1,host2:port2</td> - <td style="text-align: left">host list of memcached host</td> - </tr> - <tr> - <td 
style="text-align: left">kylin.query.segment-cache-enabled</td> - <td style="text-align: left">default false</td> - <td style="text-align: left">whether to enable</td> - </tr> - <tr> - <td style="text-align: left">kylin.query.segment-cache-timeout</td> - <td style="text-align: left">default 2000</td> - <td style="text-align: left">timeout of memcached</td> - </tr> - <tr> - <td style="text-align: left">kylin.query.segment-cache-max-size</td> - <td style="text-align: left">200 (MB)</td> - <td style="text-align: left">max size put into memcached</td> - </tr> - </tbody> -</table> - -<h4 id="cache-signature-part">Cache signature part</h4> - -<table> - <thead> - <tr> - <th style="text-align: left">Conf Key</th> - <th style="text-align: left">Conf value</th> - <th style="text-align: left">Explanation</th> - </tr> - </thead> - <tbody> - <tr> - <td style="text-align: left">kylin.query.cache-signature-enabled</td> - <td style="text-align: left">default false</td> - <td style="text-align: left">whether to use signature for query cache</td> - </tr> - <tr> - <td style="text-align: left">kylin.query.signature-class</td> - <td style="text-align: left">default is org.apache.kylin.rest.signature.FactTableRealizationSetCalculator</td> - <td style="text-align: left">use which class to calculate signature of query cache</td> - </tr> - </tbody> -</table> - -<h4 id="other-optimize-part">Other optimize part</h4> - -<table> - <thead> - <tr> - <th style="text-align: left">Conf Key</th> - <th style="text-align: left">Conf value</th> - <th style="text-align: left">Explanation</th> - </tr> - </thead> - <tbody> - <tr> - <td style="text-align: left">kylin.query.lazy-query-enabled</td> - <td style="text-align: left">default false</td> - <td style="text-align: left">whether to block duplicated sql query</td> - </tr> - <tr> - <td style="text-align: left">kylin.query.lazy-query-waiting-timeout-milliseconds</td> - <td style="text-align: left">long , in milliseconds, default is 60000</td> - <td 
style="text-align: left">max duration for blocking duplicated sql query</td> - </tr> - </tbody> -</table> - -<h4 id="metrics-part">Metrics part</h4> - -<table> - <thead> - <tr> - <th style="text-align: left">Conf Key</th> - <th style="text-align: left">Conf value</th> - <th style="text-align: left">Explanation</th> - </tr> - </thead> - <tbody> - <tr> - <td style="text-align: left">kylin.metrics.memcached.enabled</td> - <td style="text-align: left">true</td> - <td style="text-align: left">Enable memcached metrics in memcached.</td> - </tr> - <tr> - <td style="text-align: left">kylin.metrics.memcached.metricstype</td> - <td style="text-align: left">off/performance/debug</td> - <td style="text-align: left">refer to net.spy.memcached.metrics.MetricType</td> - </tr> - </tbody> -</table> -</description> - <pubDate>Tue, 30 Jul 2019 03:30:00 -0700</pubDate> - <link>http://kylin.apache.org/blog/2019/07/30/detailed-analysis-of-refine-query-cache/</link> - <guid isPermaLink="true">http://kylin.apache.org/blog/2019/07/30/detailed-analysis-of-refine-query-cache/</guid> - - - <category>blog</category> - - </item> - - <item> - <title>Deep dive into Kylin's Real-time OLAP</title> - <description><h2 id="preface">Preface</h2> - -<p>At the beginning of Apache Kylin, the main purpose was to solve the need for interactive data analysis on massive data. The data source mainly comes from the data warehouse (Hive), and the data is mostly historical rather than real-time. Streaming data processing is a brand-new field of big data development that requires data to be queried as soon as it enters the system (second-level latency). 
Until now (the latest release is v2.6), Apache Kylin's main capabilities are still in the field of historical data analysis. Even with the NRT (Near Real-time Streaming) feature introduced in v1.6, there are still several minutes of delay, which makes it difficult to meet real-time query requirements.</p> - -<p>To keep up with the trend of big data development, <strong>eBay</strong>'s Kylin development team (<a href="https://github.com/allenma">allenma</a>, <a href="https://github.com/mingmwang">mingmwang</a>, <a href="Https://github.com/sanjulian">sanjulian</a>, <a href="https://github.com/wangshisan">wangshisan</a>, etc.) developed the Real-time OLAP feature on top of Kylin to implement Kylin's real-time query of Kafka streaming data. This feature has been used at <strong>eBay</strong> in production and has been running stably for more than one year. It was contributed to the community in December 2018.</p> - -<p>In this article, we will focus on introducing and analyzing Apache Kylin's Real-time OLAP feature, usage, benchmarking, etc. In <strong>What is Real-time OLAP</strong>, we will introduce the architecture, concepts and features. In <strong>How to use Real-time OLAP</strong>, we will introduce the deployment, enabling and monitoring aspects of the Receiver cluster. Finally, in the <strong>Real-time OLAP FAQ</strong>, we will introduce the answers to some common questions. 
These cover the meaning of important configuration entries, usage restrictions, and future development plans.</p> - -<ul> - <li> - <p>What is Real-time OLAP</p> - - <ul> - <li>The importance of streaming data processing</li> - <li>Introduction to Real-time OLAP</li> - <li>Real-time OLAP concepts and roles</li> - <li>Real-time OLAP architecture</li> - <li>Real-time OLAP features</li> - <li>Real-time OLAP metadata</li> - <li>Real-time OLAP Local Segment Cache</li> - <li>The status of Streaming Segment and its transformation</li> - <li>Real-time OLAP build process analysis</li> - <li>Real-time OLAP query process analysis</li> - <li>Real-time OLAP Rebalance process analysis</li> - </ul> - </li> - <li> - <p>How to use Real-time OLAP</p> - - <ul> - <li>Deploy Coordinator and Receiver</li> - <li>Configuring Streaming Table</li> - <li>Add and modify Replica Set</li> - <li>Design model and cube</li> - <li>Enable and stop Cube</li> - <li>Monitor consumption status</li> - <li>Coordinator Rest API Description</li> - </ul> - </li> - <li> - <p>Frequently Asked Questions for Real-time OLAP</p> - - <ul> - <li>There is a "Lambda" checkbox when configuring the Kafka data source. What does it do?</li> - <li>In addition to the base cuboid, can I build other cuboids on the receiver side?</li> - <li>How should I scale out my receiver cluster? How do I deal with a partition increase for a Kafka topic?</li> - <li>What is the benchmark result? What is the approximate query latency? What is the approximate data ingest rate of a single Receiver?</li> - <li>Which is more suitable for my needs: Real-time OLAP or Kylin's NRT Streaming?</li> - <li>What are the main limitations of Real-time OLAP? What are the future development plans?</li> - </ul> - </li> -</ul> - -<h2 id="part-i-what-is-real-time-olap-for-kylin">Part-I. 
What is Real-time OLAP for Kylin</h2> - -<hr /> - -<h3 id="streaming-data-processing-and-real-time-olap">1.1 Streaming Data Processing and Real-time OLAP</h3> -<p>Many commercial companies analyze user messages in order to make better business decisions and plan their markets better. The earlier a message enters the data analysis platform, the faster decision makers can respond, which reduces wasted time and money. Streaming data processing means faster feedback, so decision makers can adjust their plans more frequently and flexibly.</p> - -<p>A company has various types of data sources, including servers, mobile devices such as mobile phones, and IoT devices. Messages from different sources are often distinguished by topic and aggregated into a message queue (Message Queue/Message Bus) for data analysis. Traditional data analysis tools use batch tools such as MapReduce, which leads to large data delays, typically hours to days. As you can see from the figure below, the main data latency comes from two processes: extracting data from the message queue into the data warehouse through the ETL process, and extracting data from the data warehouse for precomputation to save the results as cube data. Since both of these steps use batch programs, the computation takes a long time, which makes real-time queries difficult to achieve. We think that solving this problem requires bypassing these processes by building a bridge between data collection and the OLAP platform, so that the data goes directly to the OLAP platform.</p> - -<p><img src="/images/blog/deep-dive-realtime-olap/pic-1.png" alt="diagram1" /></p> - -<p>There are already some mature real-time OLAP solutions, such as Druid, that provide lower data latency by combining query results from the real-time and historical parts. Kylin is already well established in analyzing massive historical data. 
In order to take a step toward real-time OLAP, Kylin developers have developed Real-time OLAP.</p> - -<hr /> - -<h3 id="introduction-to-real-time-olap">1.2 Introduction to Real-time OLAP</h3> -
[... 383 lines stripped ...] Added: kylin/site/images/blog/local-cache/Local_cache_stage.png URL: http://svn.apache.org/viewvc/kylin/site/images/blog/local-cache/Local_cache_stage.png?rev=1894464&view=auto ============================================================================== Binary file - no diff available.