[PR] [SPARK-47804] Add Dataframe cache debug log [spark]

via GitHub Wed, 10 Apr 2024 16:24:11 -0700


anchovYu opened a new pull request, #45990:
URL: https://github.com/apache/spark/pull/45990


   ### What changes were proposed in this pull request?
   This PR adds a debug log for the Dataframe cache, controlled by a SQL conf. It logs the relevant information on:
   * cache hits during cache application (which happens on essentially every query)
   * cache misses
   * newly added cache entries
   * removed cache entries
   
   Because every query goes through cache application, this log can be very large. It should be turned on only while debugging and should not be enabled by default in production.
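   
   Since the log is meant for debugging sessions only, one way to keep it scoped is a small wrapper that sets the conf, runs a block, and restores the previous value. This is a minimal sketch, not part of this PR: `CacheDebugLog` and `withCacheDebugLog` are hypothetical names, and the conf key is the one used in the example in this description.
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   object CacheDebugLog {
     // Conf key as it appears in this PR's example.
     val Key = "spark.sql.dataframeCache.logLevel"
   
     // Hypothetical helper: enable the cache log for `body` only,
     // then restore whatever was configured before.
     def withCacheDebugLog[T](spark: SparkSession, level: String = "warn")(body: => T): T = {
       val previous = spark.conf.getOption(Key)
       spark.conf.set(Key, level)
       try {
         body
       } finally {
         previous match {
           case Some(v) => spark.conf.set(Key, v)   // put the old value back
           case None    => spark.conf.unset(Key)    // it was not set before
         }
       }
     }
   }
   ```
   
   In a Spark shell this could be used as `CacheDebugLog.withCacheDebugLog(spark) { df.count() }`, so the extra logging applies only to the query under investigation.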
   
   Example:
   ```
   spark.conf.set("spark.sql.dataframeCache.logLevel", "warn")
   val df = spark.range(1, 10)
   
   
   df.collect()
   24/03/30 06:28:45 WARN CacheManager: Dataframe cache miss for input plan:
   Range (1, 10, step=1, splits=Some(32))
   24/03/30 06:28:45 WARN CacheManager: Last 20 Dataframe cache entry logical plans:
   []
   
   
   df.cache()
   24/03/30 06:36:50 WARN CacheManager: Dataframe cache miss for input plan:
   Range (1, 10, step=1, splits=Some(32))
   24/03/30 06:36:50 WARN CacheManager: Last 20 Dataframe cache entry logical plans:
   []
   24/03/30 06:36:50 WARN CacheManager: Added Dataframe cache entry:
   CachedData(
   session UUID=None,
   Range (1, 10, step=1, splits=Some(32))
   
   InMemoryRelation [id#0L], StorageLevel(disk, memory, deserialized, 1 replicas)
      +- *(1) Range (1, 10, step=1, splits=32)
   )
   
   
   df.count()
   24/03/30 06:37:30 WARN CacheManager: Dataframe cache hit for input plan:
   Range (1, 10, step=1, splits=Some(32))
   matched with cache entry:
   CachedData(
   session UUID=None,
   Range (1, 10, step=1, splits=Some(32))
   
   InMemoryRelation [id#0L], StorageLevel(disk, memory, deserialized, 1 replicas)
      +- *(1) Range (1, 10, step=1, splits=32)
   )
   24/03/30 06:37:30 WARN CacheManager: Dataframe cache hit plan change summary:
    Aggregate [count(1) AS count#11L]           Aggregate [count(1) AS count#11L]
   !+- Range (1, 10, step=1, splits=Some(32))   +- InMemoryRelation [id#0L], StorageLevel(disk, memory, deserialized, 1 replicas)
   !                                                  +- *(1) Range (1, 10, step=1, splits=32)
   
   
   df.unpersist()
   24/03/30 06:41:00 WARN CacheManager: Removed 1 Dataframe cache entries, with logical plans being
   [Range (1, 10, step=1, splits=Some(32))
   ]
   ```
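   
   The four log points seen in the example (miss, add, hit, remove) can be illustrated with a toy model. This is only a sketch under simplifying assumptions: it is not Spark's `CacheManager`, plans are plain strings, and `Entry`, `ToyCacheManager`, and `ToyCacheDemo` are hypothetical names. The `log` function stands in for a logger whose level would come from the conf.
   
   ```scala
   import scala.collection.mutable
   
   // Toy model only, NOT Spark's CacheManager; all names are hypothetical.
   final case class Entry(plan: String, cached: String)
   
   class ToyCacheManager(log: String => Unit) {
     private val entries = mutable.ArrayBuffer.empty[Entry]
   
     private def plans(es: Iterable[Entry]): String =
       es.map(_.plan).mkString("[", ", ", "]")
   
     // Cache application: logs a hit, or a miss plus the most recent entries.
     def lookup(plan: String): Option[Entry] = {
       val found = entries.find(_.plan == plan)
       found match {
         case Some(e) =>
           log(s"Dataframe cache hit for input plan: $plan, matched with cache entry: $e")
         case None =>
           log(s"Dataframe cache miss for input plan: $plan")
           log("Last 20 Dataframe cache entry logical plans: " + plans(entries.takeRight(20)))
       }
       found
     }
   
     // Adding an entry first looks the plan up, so a miss is logged before the add.
     def add(plan: String): Unit =
       if (lookup(plan).isEmpty) {
         val e = Entry(plan, "InMemoryRelation(" + plan + ")")
         entries += e
         log(s"Added Dataframe cache entry: $e")
       }
   
     // Removing entries logs how many were dropped and their plans.
     def remove(plan: String): Unit = {
       val removed = entries.filter(_.plan == plan)
       if (removed.nonEmpty) {
         entries --= removed
         log(s"Removed ${removed.size} Dataframe cache entries, with logical plans being " + plans(removed))
       }
     }
   }
   
   object ToyCacheDemo {
     def main(args: Array[String]): Unit = {
       val cm = new ToyCacheManager(println)
       val plan = "Range (1, 10, step=1)"
       cm.lookup(plan) // like df.collect() before caching: miss
       cm.add(plan)    // like df.cache(): miss, then added
       cm.lookup(plan) // like df.count(): hit
       cm.remove(plan) // like df.unpersist(): removed
     }
   }
   ```
   
   The sequence in `main` follows the same order as the shell session above, which makes it easy to compare which message each step is expected to produce.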
   
   
   ### Why are the changes needed?
   Easier debugging.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Manually, in a local Spark shell.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

