[jira] [Updated] (SPARK-30616) Introduce TTL config option for SQL Parquet Metadata Cache

Yaroslav Tkachenko (Jira) Thu, 11 Jun 2020 12:04:13 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-30616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Yaroslav Tkachenko updated SPARK-30616:
---------------------------------------
    Description: 
>From 
>[documentation|https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing]:
{quote}Spark SQL caches Parquet metadata for better performance. When Hive 
metastore Parquet table conversion is enabled, metadata of those converted 
tables are also cached. If these tables are updated by Hive or other external 
tools, you need to refresh them manually to ensure consistent metadata.
{quote}
Unfortunately simply submitting "REFRESH TABLE"  commands could be very 
cumbersome. Assuming frequently generated new Parquet files, hundreds of tables 
and dozens of users querying the data (and expecting up-to-date results), 
manually refreshing metadata for each table is not an optimal solution. And 
this is a pretty common use-case for streaming ingestion of data.    

I propose to introduce a new option in Spark (something like 
"spark.sql.hive.filesourcePartitionFileCacheTTL") that controls the TTL of this 
metadata cache. Its default value can be pretty high (an hour? a few hours?), 
so it doesn't alter the existing behavior much. When it's set to 0 the cache is 
effectively disabled (could be useful for testing or some edge cases). 

  was:
>From 
>[documentation|https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing]:
{quote}Spark SQL caches Parquet metadata for better performance. When Hive 
metastore Parquet table conversion is enabled, metadata of those converted 
tables are also cached. If these tables are updated by Hive or other external 
tools, you need to refresh them manually to ensure consistent metadata.
{quote}
Unfortunately simply submitting "REFRESH TABLE"  commands could be very 
cumbersome. Assuming frequently generated new Parquet files, hundreds of tables 
and dozens of users querying the data (and expecting up-to-date results), 
manually refreshing metadata for each table is not an optimal solution. And 
this is a pretty common use-case for streaming ingestion of data.    

I propose to introduce a new option in Spark (something like 
"spark.sql.parquet.metadataCache.refreshInterval") that controls the TTL of 
this metadata cache. Its default value can be pretty high (an hour? a few 
hours?), so it doesn't alter the existing behavior much. When it's set to 0 the 
cache is effectively disabled (could be useful for testing or some edge cases). 


> Introduce TTL config option for SQL Parquet Metadata Cache
> ----------------------------------------------------------
>
>                 Key: SPARK-30616
>                 URL: https://issues.apache.org/jira/browse/SPARK-30616
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Yaroslav Tkachenko
>            Priority: Major
>
> From 
> [documentation|https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing]:
> {quote}Spark SQL caches Parquet metadata for better performance. When Hive 
> metastore Parquet table conversion is enabled, metadata of those converted 
> tables are also cached. If these tables are updated by Hive or other external 
> tools, you need to refresh them manually to ensure consistent metadata.
> {quote}
> Unfortunately simply submitting "REFRESH TABLE"  commands could be very 
> cumbersome. Assuming frequently generated new Parquet files, hundreds of 
> tables and dozens of users querying the data (and expecting up-to-date 
> results), manually refreshing metadata for each table is not an optimal 
> solution. And this is a pretty common use-case for streaming ingestion of 
> data.    
> I propose to introduce a new option in Spark (something like 
> "spark.sql.hive.filesourcePartitionFileCacheTTL") that controls the TTL of 
> this metadata cache. Its default value can be pretty high (an hour? a few 
> hours?), so it doesn't alter the existing behavior much. When it's set to 0 
> the cache is effectively disabled (could be useful for testing or some edge 
> cases). 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-30616) Introduce TTL config option for SQL Parquet Metadata Cache

Reply via email to