[ https://issues.apache.org/jira/browse/SPARK-30616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-30616:
----------------------------------
    Affects Version/s:     (was: 3.0.0)
                       3.1.0

> Introduce TTL config option for SQL Parquet Metadata Cache
> ----------------------------------------------------------
>
>                 Key: SPARK-30616
>                 URL: https://issues.apache.org/jira/browse/SPARK-30616
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Yaroslav Tkachenko
>            Priority: Major
>
> From 
> [documentation|https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing]:
> {quote}Spark SQL caches Parquet metadata for better performance. When Hive 
> metastore Parquet table conversion is enabled, metadata of those converted 
> tables are also cached. If these tables are updated by Hive or other external 
> tools, you need to refresh them manually to ensure consistent metadata.
> {quote}
> Unfortunately, simply submitting "REFRESH TABLE" commands can be very 
> cumbersome. With new Parquet files generated frequently, hundreds of 
> tables, and dozens of users querying the data (and expecting up-to-date 
> results), manually refreshing the metadata for each table is not a 
> practical solution. This is also a very common use case for streaming 
> ingestion of data.
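> For context, today's manual workaround looks like the sketch below; the 
> table name is just a placeholder, and a spark-shell / SparkSession context 
> is assumed:
> {code:scala}
> // Invalidate the cached metadata (and cached data) for a single table.
> // This has to be repeated for every affected table, in every session
> // that needs up-to-date results.
> spark.catalog.refreshTable("db.events")
>
> // Equivalent SQL form:
> spark.sql("REFRESH TABLE db.events")
> {code}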
> I propose introducing a new option in Spark (something like 
> "spark.sql.parquet.metadataCache.refreshInterval") that controls the TTL of 
> this metadata cache. Its default value could be fairly high (an hour? a few 
> hours?) so that it barely changes the existing behavior. Setting it to 0 
> would effectively disable the cache (which could be useful for testing or 
> some edge cases).
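> A hypothetical usage sketch, assuming the proposed option is exposed as a 
> regular SQL conf (the name and semantics come from this proposal, not from 
> any existing Spark setting):
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder()
>   .appName("parquet-metadata-ttl-sketch")
>   // Proposed (not yet in Spark): expire cached Parquet metadata
>   // entries after one hour.
>   .config("spark.sql.parquet.metadataCache.refreshInterval", "1h")
>   .getOrCreate()
>
> // Proposed: a value of 0 would effectively disable the cache.
> // spark.conf.set("spark.sql.parquet.metadataCache.refreshInterval", "0")
> {code}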


