Dear Apache Iceberg Community,

I hope this message finds you well. I’m writing to discuss a proposed improvement to caching strategies in the Iceberg-Spark integration, as outlined in Issue #14417.

Background and Problem:

The current caching behaviour in Iceberg’s Spark integration uses expireAfterAccess semantics. Because every read resets the expiration timer, this can prevent periodic refreshes of Iceberg metadata or table data in long-running structured streaming jobs. This poses challenges for workloads that require updated reference data, such as stream-to-static joins. In such cases, frequently accessed data remains in cache indefinitely, reflecting stale snapshots. Disabling caching entirely is the only workaround, which leads to significant overhead due to frequent metadata reloads during micro-batches.

For example, in a Spark Structured Streaming job with continuous Kafka input and joins against slowly evolving reference data, updates to Iceberg tables are not reflected unless caching is disabled. While this ensures data freshness, it introduces performance bottlenecks due to repeated table reloads.

Proposed Solution:

To address this issue, the proposal suggests making the cache expiration strategy configurable. This could include:

1. Allowing users to choose between expireAfterAccess and expireAfterWrite for both catalogue and executor caches.
2. Implementing a smarter refresh mechanism that detects changes in table metadata or snapshots and refreshes the cache.

This flexibility would enable users to balance performance and data freshness, aligning Iceberg’s caching capabilities with other data lake formats like Delta Lake.

Benefits:

- Improved support for long-running structured streaming jobs that rely on up-to-date reference data.
- Reduced overhead from unnecessary metadata fetches.
- Greater flexibility to meet diverse caching requirements.
- Enhanced user experience by addressing caching limitations in the current implementation.

I’ve also submitted a corresponding pull request (#14440) to implement the cachePolicy feature, which introduces support for `EXPIRE_AFTER_WRITE` and `EXPIRE_AFTER_ACCESS` strategies. I kindly invite the community to review and provide feedback on both the issue and the pull request.
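To make the difference between the two strategies concrete, here is a minimal, self-contained Python sketch. It is a toy model only, not Iceberg or Caffeine code: `ToyCache`, `CachePolicy`, `simulate`, the table name `db.ref_table`, the 30-second TTL, and the 10-second read interval are all illustrative assumptions chosen to mimic a micro-batch that reads a reference table more often than the cache expires.

```python
from enum import Enum


class CachePolicy(Enum):
    EXPIRE_AFTER_ACCESS = "access"
    EXPIRE_AFTER_WRITE = "write"


class ToyCache:
    """Toy TTL cache (hypothetical). `now` is passed in so the timeline is deterministic."""

    def __init__(self, policy, ttl):
        self.policy = policy
        self.ttl = ttl
        self.entries = {}  # key -> (value, timestamp the TTL is measured from)

    def get(self, key, loader, now):
        entry = self.entries.get(key)
        if entry is not None:
            value, stamp = entry
            if now - stamp < self.ttl:
                if self.policy is CachePolicy.EXPIRE_AFTER_ACCESS:
                    # A read resets the timer, so a hot key never expires.
                    self.entries[key] = (value, now)
                return value
        # Missing or expired: reload and measure the TTL from this write.
        value = loader()
        self.entries[key] = (value, now)
        return value


def simulate(policy):
    """Read one table every 10s with a 30s TTL; the table changes at t=15."""
    cache = ToyCache(policy, ttl=30)
    seen = []
    for now in range(0, 100, 10):
        current_snapshot = "snapshot-1" if now < 15 else "snapshot-2"
        seen.append(cache.get("db.ref_table", lambda: current_snapshot, now))
    return seen


if __name__ == "__main__":
    for policy in CachePolicy:
        print(policy.name, simulate(policy))
```

Under these assumed numbers, the access-based cache keeps returning `snapshot-1` for the whole run because each read pushes the expiry further out, while the write-based cache reloads at t=30 and returns `snapshot-2` from then on.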
Your insights and guidance would be invaluable in refining and advancing this feature. Feel free to join the discussion on the GitHub issue or reply to this email. I look forward to collaborating with you!

Kind regards,
Hossein
