[GitHub] [spark] LuciferYang opened a new pull request #30483: [SPARK-33449][SQL] Add File Metadata cache support for Parquet and Orc

GitBox Fri, 06 Aug 2021 16:25:50 -0700


LuciferYang opened a new pull request #30483:
URL: https://github.com/apache/spark/pull/30483



   ### What changes were proposed in this pull request?
   This PR introduces a File Meta Cache mechanism for Spark SQL, together with 
a basic File Meta Cache implementation for Parquet and ORC.
   
   The main changes in this PR are as follows:
   
   - Defines a `FileMetaCacheManager` that caches the mapping from 
`FileMetaKey` to `FileMeta`. `FileMetaKey` is the cache key; by default, its 
`equals` is determined by the file path. `FileMeta` represents the cached value 
and is generated by the `FileMetaKey#getFileMeta` method.
   
   - Currently, the `FileMetaCacheManager` supports a simple expiration-based 
eviction mechanism; the expiration time is set by the new config 
`FILE_META_CACHE_TTL_SINCE_LAST_ACCESS`.
   
   - For the Parquet file format, this PR adds `ParquetFileMetaKey` and 
`ParquetFileMeta` to cache the Parquet file footer. The footer cache can be 
used by vectorized reads in both DS API V1 and V2, and is enabled when 
`FILE_META_CACHE_PARQUET_ENABLED` is true.
   
   - For the ORC file format, this PR adds `OrcFileMetaKey` and `OrcFileMeta` 
to cache the ORC file tail. The tail cache can be used by vectorized reads in 
both DS API V1 and V2, and is enabled when `FILE_META_CACHE_ORC_ENABLED` is 
true.
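   
   As a rough illustration of the key/value/manager relationship described in 
the list above, here is a minimal, self-contained sketch. It is not the code in 
this PR: the trait bodies, the `SimpleFileMetaCache` class, and the injectable 
`clock` parameter are all assumptions made for the example. The actual 
`FileMetaCacheManager` may differ, for example by delegating expiration to a 
caching library. The sketch only checks expiry lazily, on access.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for the PR's types; the real signatures may differ.
trait FileMeta

trait FileMetaKey {
  def path: String

  // Loads the metadata on a cache miss (for example, reads a Parquet footer).
  def getFileMeta: FileMeta

  // By default, equality is determined by the file path only.
  override def equals(other: Any): Boolean = other match {
    case that: FileMetaKey => path == that.path
    case _ => false
  }

  override def hashCode(): Int = path.hashCode
}

// Illustrative cache with time-to-live measured since the last access.
// `clock` returns the current time in seconds and is injectable for testing.
final class SimpleFileMetaCache(ttlSeconds: Long, clock: () => Long) {
  // Maps each key to its cached value and the time of its last access.
  private val entries = mutable.Map.empty[FileMetaKey, (FileMeta, Long)]

  def get(key: FileMetaKey): FileMeta = synchronized {
    val now = clock()
    val meta = entries.get(key) match {
      // Hit: the entry was accessed within the TTL window.
      case Some((cached, lastAccess)) if now - lastAccess <= ttlSeconds => cached
      // Miss or expired: reload through the key.
      case _ => key.getFileMeta
    }
    // Record this access so the TTL window restarts.
    entries(key) = (meta, now)
    meta
  }
}
```

   Two keys with the same path hit the same entry, so a second query over the 
same file reuses the cached metadata instead of calling `getFileMeta` again, as 
long as the entry was accessed within the TTL.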
   
   Currently, the file meta cache mechanism cannot be used by `RowBasedReader`; 
supporting it requires 
[PARQUET-1965](https://issues.apache.org/jira/browse/PARQUET-1965) and 
[ORC-746](https://issues.apache.org/jira/browse/ORC-746) to be completed first.
   
   ### Why are the changes needed?
   Allow the Parquet and ORC data sources to use the File Meta Cache mechanism 
to reduce the number of metadata reads when multiple queries are performed on 
the same dataset.
   
   ### Does this PR introduce _any_ user-facing change?
   Adds 3 new configs:
   
   - `FILE_META_CACHE_PARQUET_ENABLED(spark.sql.fileMetaCache.parquet.enabled)` 
indicates whether the Parquet file meta cache mechanism is enabled.
   - `FILE_META_CACHE_ORC_ENABLED(spark.sql.fileMetaCache.orc.enabled)` 
indicates whether the ORC file meta cache mechanism is enabled.
   - `FILE_META_CACHE_TTL_SINCE_LAST_ACCESS(spark.sql.fileMetaCache.ttlSinceLastAccess)` 
sets the time-to-live of a file metadata cache entry after its last access, in 
seconds.
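   
   For illustration, the three configs could be set when building a session, 
roughly as sketched below. This assumes a Spark build that includes this PR; on 
other builds the keys are simply unrecognized options with no effect. The TTL 
value `3600`, the application name, and the dataset path are placeholders chosen 
for the example, not defaults taken from the PR.

```scala
import org.apache.spark.sql.SparkSession

object FileMetaCacheDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("file-meta-cache-demo") // placeholder name
      .master("local[*]")
      // Config keys as listed in this PR description.
      .config("spark.sql.fileMetaCache.parquet.enabled", "true")
      .config("spark.sql.fileMetaCache.orc.enabled", "true")
      // Time-to-live since last access, in seconds (illustrative value).
      .config("spark.sql.fileMetaCache.ttlSinceLastAccess", "3600")
      .getOrCreate()

    // Hypothetical dataset path. Repeated queries over the same files are the
    // case the cache is meant to help with.
    val path = "/tmp/example-dataset.parquet"
    spark.read.parquet(path).count()
    spark.read.parquet(path).count()

    spark.stop()
  }
}
```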
   
   ### How was this patch tested?
   
   - Passes the Jenkins or GitHub Actions checks
   - Adds new tests to `OrcQuerySuite` and `ParquetQuerySuite`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


