Alexey Kudinkin created HUDI-5697:
-------------------------------------

             Summary: Spark SQL re-lists Hudi table after every SQL operations
                 Key: HUDI-5697
                 URL: https://issues.apache.org/jira/browse/HUDI-5697
             Project: Apache Hudi
          Issue Type: Bug
          Components: spark, spark-sql
            Reporter: Alexey Kudinkin
            Assignee: Alexey Kudinkin
             Fix For: 0.13.1

Currently, after most DML operations in Spark SQL, Hudi invokes `Catalog.refreshTable`.

Prior to Spark 3.2, this essentially did the following:
# Invalidated the relation cache (forcing the relation to be re-resolved on next access: creating a new FileIndex, listing files, etc.)
# Triggered cascading invalidation (re-caching) of the cached data (in CacheManager)

As of Spark 3.2, it additionally calls `LogicalRelation.refresh` for ALL tables (previously this was done only for temporary views). This triggers `FileIndex.refresh` and so re-lists the whole table, which can be a costly operation.

We should revert to the preceding behavior from Spark 3.1.
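
To make the cost difference concrete, here is a minimal toy model in Python. It is not Spark's or Hudi's actual code: `ToyCatalog`, `refresh_table`, `resolve`, `listings_after`, and the table name `hudi_tbl` are all hypothetical names, and the model makes the simplifying assumption that a DML operation itself needs no listing, so only the refresh call and the next read are counted.

```python
class ToyCatalog:
    """Toy stand-in for a session catalog; NOT Spark's or Hudi's real API."""

    def __init__(self, eager_refresh):
        # True models the Spark 3.2+ behavior described above,
        # False models the Spark 3.1 behavior (invalidate only).
        self.eager_refresh = eager_refresh
        self.relation_cache = {}  # table name -> cached file listing
        self.listings = 0         # number of full table listings performed

    def _list_files(self, table):
        # Stand-in for a full (potentially expensive) table listing.
        self.listings += 1
        return [f"{table}/part-0.parquet"]  # placeholder file name

    def resolve(self, table):
        # Lazily re-resolve the relation: list files only on a cache miss.
        if table not in self.relation_cache:
            self.relation_cache[table] = self._list_files(table)
        return self.relation_cache[table]

    def refresh_table(self, table):
        if self.eager_refresh:
            # Models the extra eager file-index refresh on every call.
            self._list_files(table)
        # Both variants invalidate the cached relation.
        self.relation_cache.pop(table, None)


def listings_after(dml_count, eager_refresh):
    """Count listings for `dml_count` DML operations followed by one read."""
    catalog = ToyCatalog(eager_refresh)
    for _ in range(dml_count):
        # Each DML operation ends with a table refresh.
        catalog.refresh_table("hudi_tbl")
    catalog.resolve("hudi_tbl")  # the next query actually needs the files
    return catalog.listings


if __name__ == "__main__":
    print("eager refresh:", listings_after(3, eager_refresh=True))
    print("invalidate only:", listings_after(3, eager_refresh=False))
```

In this model the eager variant lists the table once per DML operation plus once for the read, while the invalidate-only variant lists it only when the table is next read.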