Yaroslav Tkachenko created SPARK-30616:
------------------------------------------

             Summary: Introduce TTL config option for SQL Parquet Metadata Cache
                 Key: SPARK-30616
                 URL: https://issues.apache.org/jira/browse/SPARK-30616
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.4, 3.0.0
            Reporter: Yaroslav Tkachenko


From the [documentation|https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing]:
{quote}Spark SQL caches Parquet metadata for better performance. When Hive 
metastore Parquet table conversion is enabled, metadata of those converted 
tables are also cached. If these tables are updated by Hive or other external 
tools, you need to refresh them manually to ensure consistent metadata.
{quote}
Unfortunately, submitting "REFRESH TABLE" commands by hand can be very 
cumbersome. With new Parquet files generated frequently, hundreds of tables, 
and dozens of users querying the data (and expecting up-to-date results), 
manually refreshing metadata for each table is not an optimal solution. This 
is a pretty common use case for streaming ingestion of data.
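
To illustrate the manual approach, here is a minimal sketch of what a periodic "refresh everything" job might look like in PySpark. It assumes an existing SparkSession is passed in as `spark`; the helper name `refresh_all_tables` is hypothetical, while `spark.catalog.listTables` and `spark.catalog.refreshTable` are the existing Catalog API calls.

```python
from typing import Any, List, Optional


def refresh_all_tables(spark: Any, database: Optional[str] = None) -> List[str]:
    """Refresh cached metadata for every table in `database`.

    `spark` is assumed to be a pyspark.sql.SparkSession. This helper is a
    hypothetical illustration, not part of Spark.
    """
    refreshed = []
    for table in spark.catalog.listTables(database):
        # Temporary views report no database, so use the bare name for them.
        name = f"{table.database}.{table.name}" if table.database else table.name
        # Same effect as running `REFRESH TABLE <name>` in SQL.
        spark.catalog.refreshTable(name)
        refreshed.append(name)
    return refreshed
```

Something like this would have to run on a schedule in every long-lived session, for every database, which is the overhead described above.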

I propose introducing a new option in Spark (something like 
"spark.sql.parquet.metadataCache.refreshInterval") that controls the TTL of 
this metadata cache. Its default value can be pretty high (an hour? a few 
hours?), so it doesn't alter the existing behavior much. When it's set to 0, 
the cache is effectively disabled (could be useful for testing or some edge 
cases).
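
The intended semantics can be sketched roughly as follows. This is only an illustration of a TTL-bounded cache, not Spark's actual metadata cache: the class `TtlMetadataCache`, its `loader` callback, and the injectable `clock` are all hypothetical names, and the TTL is given in seconds purely for simplicity.

```python
import time
from typing import Callable, Dict, Generic, Hashable, Tuple, TypeVar

V = TypeVar("V")


class TtlMetadataCache(Generic[V]):
    """Hypothetical TTL cache; entries older than `ttl_seconds` are reloaded."""

    def __init__(
        self,
        ttl_seconds: float,
        loader: Callable[[Hashable], V],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if ttl_seconds < 0:
            raise ValueError("ttl_seconds must be >= 0")
        self._ttl = ttl_seconds
        self._loader = loader  # e.g. reads Parquet footers for a table
        self._clock = clock    # injectable so expiry can be tested
        self._entries: Dict[Hashable, Tuple[float, V]] = {}

    def get(self, key: Hashable) -> V:
        # A TTL of 0 disables caching: always reload, never store.
        if self._ttl == 0:
            return self._loader(key)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None:
            loaded_at, value = entry
            if now - loaded_at < self._ttl:
                return value  # still fresh
        # Missing or expired: reload and remember when it was loaded.
        value = self._loader(key)
        self._entries[key] = (now, value)
        return value
```

With a large TTL this behaves almost like the current cache, since entries are rarely reloaded, while a small TTL bounds how stale the metadata seen by a query can be.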



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
