anchovYu opened a new pull request, #45990: URL: https://github.com/apache/spark/pull/45990
### What changes were proposed in this pull request?

This PR adds a debug log for the Dataframe cache, turned on through a SQL conf. It logs the relevant information on:

* cache hits during cache application (which happens on essentially every query)
* cache misses
* adding new cache entries
* removing cache entries

Because every query applies the cache, this log can be huge. It should be turned on only while debugging and should not be enabled by default in production.

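Since the log is meant to be on only while debugging, one way to keep it scoped is to wrap the code under investigation in a small helper that sets the conf and restores the previous value afterwards. The following is a minimal sketch, not part of this PR: the helper name `withDataframeCacheLog` is hypothetical, and it assumes the conf can be changed at runtime on an existing `SparkSession`, as in the PR's own example.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper (not part of this PR): enables the Dataframe cache log
// for the duration of `body`, then restores the value that was visible before.
def withDataframeCacheLog[T](spark: SparkSession, level: String = "warn")(body: => T): T = {
  val key = "spark.sql.dataframeCache.logLevel" // conf name taken from the PR description
  val previous = spark.conf.getOption(key)      // value visible before the change, if any
  spark.conf.set(key, level)
  try {
    body
  } finally {
    previous match {
      case Some(value) => spark.conf.set(key, value) // put the earlier value back
      case None        => spark.conf.unset(key)      // nothing was visible before; clear it
    }
  }
}

// Usage in a Spark shell, assuming `spark` is the active session:
// withDataframeCacheLog(spark) { spark.range(1, 10).cache().count() }
```

Restoring in a `finally` block keeps the verbose log from staying on if the query under investigation throws.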
Example:

```
spark.conf.set("spark.sql.dataframeCache.logLevel", "warn")
val df = spark.range(1, 10)

df.collect()
24/03/30 06:28:45 WARN CacheManager: Dataframe cache miss for input plan:
Range (1, 10, step=1, splits=Some(32))
24/03/30 06:28:45 WARN CacheManager: Last 20 Dataframe cache entry logical plans:
[]

df.cache()
24/03/30 06:36:50 WARN CacheManager: Dataframe cache miss for input plan:
Range (1, 10, step=1, splits=Some(32))
24/03/30 06:36:50 WARN CacheManager: Last 20 Dataframe cache entry logical plans:
[]
24/03/30 06:36:50 WARN CacheManager: Added Dataframe cache entry:
CachedData(
session UUID=None,
Range (1, 10, step=1, splits=Some(32))
InMemoryRelation [id#0L], StorageLevel(disk, memory, deserialized, 1 replicas)
   +- *(1) Range (1, 10, step=1, splits=32)
)

df.count()
24/03/30 06:37:30 WARN CacheManager: Dataframe cache hit for input plan:
Range (1, 10, step=1, splits=Some(32))
matched with cache entry:
CachedData(
session UUID=None,
Range (1, 10, step=1, splits=Some(32))
InMemoryRelation [id#0L], StorageLevel(disk, memory, deserialized, 1 replicas)
   +- *(1) Range (1, 10, step=1, splits=32)
)
24/03/30 06:37:30 WARN CacheManager: Dataframe cache hit plan change summary:
 Aggregate [count(1) AS count#11L]              Aggregate [count(1) AS count#11L]
!+- Range (1, 10, step=1, splits=Some(32))      +- InMemoryRelation [id#0L], StorageLevel(disk, memory, deserialized, 1 replicas)
!                                                  +- *(1) Range (1, 10, step=1, splits=32)

df.unpersist()
24/03/30 06:41:00 WARN CacheManager: Removed 1 Dataframe cache entries, with logical plans being
[Range (1, 10, step=1, splits=Some(32))
]
```

### Why are the changes needed?

Easier debugging.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested manually in a local Spark shell.

### Was this patch authored or co-authored using generative AI tooling?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org