[
https://issues.apache.org/jira/browse/KYLIN-5946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870175#comment-17870175
]
Shuai Li commented on KYLIN-5946:
---------------------------------
h3. Design
Vectorized execution engines are now widely used across databases. Because of
language limitations (Spark runs on the JVM), native Spark cannot fully exploit
some low-level CPU instructions. To improve Kylin's query performance, this
change introduces the Spark plug-in
[Gluten|https://github.com/apache/incubator-gluten] with its ClickHouse (CH)
backend to make the execution engine more efficient.
Specific process changes:
Query
Gluten replaces the Spark execution plan and pushes it down to the CH Backend
for execution. Operators and functions customized by Kylin are reimplemented
(overridden) so that they can also run in the backend. Operators or functions
that Gluten does not yet support fall back to Spark: Spark computes them and
then passes the result back to the Backend.
To check whether a query fell back, inspect the 'glutenFallback' field returned
by the query interface.
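The fallback rule above can be illustrated with a small toy model. This is only
a sketch: the operator names, the supported set and the response layout are
hypothetical, and only the field name 'glutenFallback' comes from this comment.

```python
import json

# Hypothetical set of operators the backend can execute natively.
BACKEND_SUPPORTED = {"Scan", "Filter", "Project", "HashAggregate"}


def assign_engines(plan):
    """Map each operator of a linear plan to 'backend' or 'spark'."""
    return [(op, "backend" if op in BACKEND_SUPPORTED else "spark") for op in plan]


def build_response(plan):
    """Build a toy query response; the layout is an assumption for illustration."""
    assignment = assign_engines(plan)
    return {
        # True as soon as at least one operator had to run on Spark.
        "glutenFallback": any(engine == "spark" for _, engine in assignment),
        "assignment": assignment,
    }


def fell_back(response_json):
    """Client-side check: read the 'glutenFallback' field from a JSON response."""
    return bool(json.loads(response_json).get("glutenFallback", False))
```

A client would call fell_back() on the response body and, for example, log the
query for later analysis when it returns True.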
Build
After a build, data is stored as CH-native MergeTree instead of traditional
Parquet. Writing to the MergeTree table is enabled by configuring Delta and
ClickHouseSparkCatalog in Gluten.
To address the large number of small files that MergeTree produces, the Gluten
community introduced data merging: each part is ultimately merged into two
files, one holding the data and the other holding the metadata (index
information and so on).
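The idea of collapsing many small files into one data file plus one metadata
file can be sketched as follows. The layout used here (a byte blob plus an
offset/length index) is purely hypothetical and is not the actual on-disk
format used by Gluten or ClickHouse.

```python
def pack_part(files):
    """Pack small files into one data blob plus a metadata index.

    files maps a file name to its bytes. The returned meta maps each name to
    an (offset, length) pair pointing into the data blob.
    """
    data = bytearray()
    meta = {}
    for name in sorted(files):
        meta[name] = (len(data), len(files[name]))
        data += files[name]
    return bytes(data), meta


def read_file(data, meta, name):
    """Recover one original file from the packed blob using the index."""
    offset, length = meta[name]
    return data[offset:offset + length]
```

With this kind of layout a reader fetches the small metadata file first and
then issues ranged reads against the single data file, which suits remote
object stores better than opening many tiny files.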
When the storage is S3 or HDFS, remote-file caching and soft affinity are
provided to keep queries fast.
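Soft affinity means that scans of the same remote file are preferably sent to
the same executors, so that their local cache is reused. A minimal sketch of
one common way to do this, hashing the file path onto the executor list, is
shown below. The function name and the replicas parameter are hypothetical and
this is not Gluten's actual implementation.

```python
import hashlib


def preferred_executors(file_path, executors, replicas=2):
    """Return a stable list of preferred executors for a file path."""
    if not executors:
        return []
    # Sort so the result does not depend on the order executors are listed in.
    ordered = sorted(executors)
    digest = hashlib.sha256(file_path.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(ordered)
    count = min(replicas, len(ordered))
    return [ordered[(start + i) % len(ordered)] for i in range(count)]
```

The affinity is "soft" because the result is only a scheduling preference: if
the preferred executors are busy or gone, the task can still run elsewhere and
simply misses the cache.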
Feature alignment
* When queries access S3, the cache expiration time is aligned with Kylin's;
the Backend does not natively support this
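To make the cache-expiration alignment concrete, here is a minimal sketch of a
cache whose entries expire after a fixed time-to-live. The class is
hypothetical, and because the comment does not name the Kylin setting that
holds the expiration time, the value is passed in as a plain parameter.

```python
import time


class TtlCache:
    """Toy cache whose entries expire ttl_seconds after they are written."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}

    def put(self, key, value):
        self._entries[key] = (value, self._clock() + self._ttl)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry so the caller re-reads from remote storage.
            del self._entries[key]
            return None
        return value
```

Aligning with Kylin then amounts to constructing the backend-side cache with
the same expiration value that Kylin itself uses.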
Implemented operators
* KylinFileSourceScanExec
Implemented functions
* PreciseCardinality
* PreciseCountDistinctDecode
* ReusePreciseCountDistinct
* PreciseCountDistinctAndValue
* PreciseCountDistinctAndArray
* PreciseCountDistinct
* KylinTimestampAdd
> Integration with gluten
> -----------------------
>
> Key: KYLIN-5946
> URL: https://issues.apache.org/jira/browse/KYLIN-5946
> Project: Kylin
> Issue Type: New Feature
> Components: Job Engine, Query Engine
> Affects Versions: 5.0.0
> Reporter: pengfei.zhan
> Priority: Major
> Fix For: 5.0.0
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)