[jira] [Updated] (KYLIN-5564) Introduce Bloom Filter to optimize data scanning based on Spark

Guangyuan Feng (Jira) Wed, 07 Jun 2023 02:33:37 -0700


     [ 
https://issues.apache.org/jira/browse/KYLIN-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Guangyuan Feng updated KYLIN-5564:
----------------------------------
    Description: 
Currently, all data generated by Kylin is saved as *Parquet* files through 
Spark, but Kylin does not make full use of Parquet's features when scanning 
data. Among these features, BloomFilter deserves emphasis, because it is the 
most common tool for helping *READERs* skip useless data.
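
As a rough illustration of how a Bloom filter lets a reader skip data, the 
sketch below keeps one small filter per row group and scans only the row groups 
whose filter might contain the searched value. It is a minimal sketch, not 
Parquet's or Kylin's implementation: the class, the function names, the filter 
size and the hashing scheme are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value: str):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_filters(row_groups):
    """Build one filter per row group (hypothetical layout: lists of values)."""
    filters = []
    for values in row_groups:
        bf = BloomFilter()
        for value in values:
            bf.add(value)
        filters.append(bf)
    return filters


def row_groups_to_scan(filters, value):
    """Return indexes of row groups that cannot be ruled out for `value`."""
    return [i for i, bf in enumerate(filters) if bf.might_contain(value)]
```

A row group whose filter answers "definitely absent" is never read, which is 
where the saving comes from; a false positive only costs one unnecessary read.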

Therefore, we introduce an approach that builds *BloomFilter* automatically, 
conditionally and smartly when constructing segments, on the desired columns, 
chosen in particular according to the query history.
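
One way such a selection could look is sketched below: count how often each 
column appears in the filters of past queries, keep the most frequent ones, and 
turn them into Parquet writer options. This is a hedged sketch, not Kylin's 
actual logic. The function names, the thresholds, the shape of the query 
history and the SSB-style column names are assumptions; the option keys 
`parquet.bloom.filter.enabled#<column>` and 
`parquet.bloom.filter.expected.ndv#<column>` are the per-column Parquet 
properties that Spark's Parquet writer accepts.

```python
from collections import Counter


def pick_bloom_columns(query_history, min_hits=2, max_columns=3):
    """Pick the columns filtered on most often.

    query_history is assumed to be a list of queries, each given as the
    list of columns used in its filter predicates (hypothetical format).
    """
    counts = Counter(col for query in query_history for col in query)
    return [col for col, hits in counts.most_common(max_columns)
            if hits >= min_hits]


def bloom_writer_options(columns, expected_ndv=1_000_000):
    """Build per-column Parquet Bloom filter options as a plain dict."""
    options = {}
    for col in columns:
        options[f"parquet.bloom.filter.enabled#{col}"] = "true"
        # expected_ndv is an assumed estimate of distinct values per column.
        options[f"parquet.bloom.filter.expected.ndv#{col}"] = str(expected_ndv)
    return options


# Illustrative history only; not taken from a real workload.
history = [
    ["lo_custkey"],
    ["lo_custkey", "lo_partkey"],
    ["lo_partkey"],
    ["lo_custkey"],
    ["lo_suppkey"],
]
columns = pick_bloom_columns(history)
options = bloom_writer_options(columns)
```

With PySpark, a dict like this could presumably be passed to the writer as 
`df.write.options(**options).parquet(path)`.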

With BloomFilter introduced, Spark achieves a good performance improvement in 
most cases.

 

_For benchmarks and performance tests, please read the attached PDF, which is 
the report of testing on SSB._

 

  was:
Currently, all data generated by Kylin is saved as *Parquet* files through 
Spark, but Kylin does not make full use of Parquet's features when scanning 
data. Among these features, BloomFilter deserves emphasis, because it is the 
most common tool for helping *READERs* skip useless data.

Therefore, we introduce an approach that builds *BloomFilter* automatically, 
conditionally and smartly when constructing segments, on the desired columns, 
chosen in particular according to the query history.

 

_With BloomFilter introduced, Spark achieves a good performance improvement 
in most cases._

 

For benchmarks and performance tests, please read the attached PDF, which is 
the report of testing on SSB.

 


> Introduce Bloom Filter to optimize data scanning based on Spark
> ---------------------------------------------------------------
>
>                 Key: KYLIN-5564
>                 URL: https://issues.apache.org/jira/browse/KYLIN-5564
>             Project: Kylin
>          Issue Type: Improvement
>          Components: Query Engine
>    Affects Versions: 5.0-alpha
>            Reporter: Guangyuan Feng
>            Assignee: Guangyuan Feng
>            Priority: Major
>             Fix For: 5.0-alpha
>
>
> Currently, all data generated by Kylin is saved as *Parquet* files through 
> Spark, but Kylin does not make full use of Parquet's features when scanning 
> data. Among these features, BloomFilter deserves emphasis, because it is 
> the most common tool for helping *READERs* skip useless data.
> Therefore, we introduce an approach that builds *BloomFilter* automatically, 
> conditionally and smartly when constructing segments, on the desired columns, 
> chosen in particular according to the query history.
> With BloomFilter introduced, Spark achieves a good performance improvement 
> in most cases.
>  
> _For benchmarks and performance tests, please read the attached PDF, which 
> is the report of testing on SSB._
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
