[ https://issues.apache.org/jira/browse/SPARK-30616 ]
Dongjoon Hyun updated SPARK-30616:
----------------------------------
    Affects Version/s:     (was: 3.0.0)
                           3.1.0

> Introduce TTL config option for SQL Parquet Metadata Cache
> -----------------------------------------------------------
>
>                 Key: SPARK-30616
>                 URL: https://issues.apache.org/jira/browse/SPARK-30616
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Yaroslav Tkachenko
>            Priority: Major
>
> From [documentation|https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing]:
> {quote}Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table conversion is enabled, metadata of those converted tables are also cached. If these tables are updated by Hive or other external tools, you need to refresh them manually to ensure consistent metadata.
> {quote}
> Unfortunately, simply submitting "REFRESH TABLE" commands can be very cumbersome. With new Parquet files generated frequently, hundreds of tables, and dozens of users querying the data (and expecting up-to-date results), manually refreshing the metadata for each table is not a workable solution. This is a common scenario for streaming ingestion of data.
> I propose introducing a new option in Spark (something like "spark.sql.parquet.metadataCache.refreshInterval") that controls the TTL of this metadata cache. Its default value can be fairly high (an hour? a few hours?) so that it barely changes the existing behavior. When it is set to 0, the cache is effectively disabled, which could be useful for testing or some edge cases.
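>
> For illustration, the manual workaround available today looks something like the following (a minimal Scala sketch; the table name "events" is just a placeholder):
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder()
>   .appName("manual-refresh-example")
>   .enableHiveSupport()
>   .getOrCreate()
>
> // Invalidate the cached metadata and file listing for a single table.
> spark.sql("REFRESH TABLE events")
>
> // Equivalent programmatic call on the catalog API.
> spark.catalog.refreshTable("events")
> {code}
> With hundreds of tables, one of these calls has to be issued (and scheduled) per table, which is exactly the overhead described above.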
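>
> And a sketch of how the proposed option might be used (hypothetical: the "spark.sql.parquet.metadataCache.refreshInterval" key is only the suggestion above and does not exist in Spark today):
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder()
>   .appName("ttl-metadata-cache-example")
>   // Proposed (not yet implemented) option: expire cached Parquet
>   // metadata after one hour instead of keeping it until an explicit
>   // REFRESH TABLE is issued.
>   .config("spark.sql.parquet.metadataCache.refreshInterval", "1h")
>   .getOrCreate()
>
> // Setting the interval to 0 would effectively disable the cache,
> // forcing metadata to be re-read on every query.
> {code}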