Github user maryannxue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21594#discussion_r196899266

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ---
@@ -107,22 +107,35 @@ class CacheManager extends Logging {
   /**
    * Un-cache all the cache entries that refer to the given plan.
    */
-  def uncacheQuery(query: Dataset[_], blocking: Boolean = true): Unit = writeLock {
-    uncacheQuery(query.sparkSession, query.logicalPlan, blocking)
+  def uncacheQuery(query: Dataset[_],
+      cascade: Boolean, blocking: Boolean = true): Unit = writeLock {
+    uncacheQuery(query.sparkSession, query.logicalPlan, cascade, blocking)
   }

   /**
    * Un-cache all the cache entries that refer to the given plan.
    */
-  def uncacheQuery(spark: SparkSession, plan: LogicalPlan, blocking: Boolean): Unit = writeLock {
+  def uncacheQuery(spark: SparkSession, plan: LogicalPlan,
+      cascade: Boolean, blocking: Boolean): Unit = writeLock {
     val it = cachedData.iterator()
+    val needToRecache = scala.collection.mutable.ArrayBuffer.empty[CachedData]
     while (it.hasNext) {
       val cd = it.next()
       if (cd.plan.find(_.sameResult(plan)).isDefined) {
-        cd.cachedRepresentation.cacheBuilder.clearCache(blocking)
         it.remove()
+        if (cascade || cd.plan.sameResult(plan)) {
+          cd.cachedRepresentation.cacheBuilder.clearCache(blocking)
+        } else {
+          val plan = spark.sessionState.executePlan(cd.plan).executedPlan
+          val newCache = InMemoryRelation(
--- End diff --

Yes, you are right, although it wouldn't lead to an error, just like all the other compiled DataFrames that refer to the old InMemoryRelation. I'll change this piece of code. But you've raised another interesting question. Consider a scenario similar to the one you mentioned:

```
df1 = ...
df2 = df1.filter(...)
df2.cache()
df1.cache()
df1.collect()
```

Here we cache the dependent plan (df2) first, and the plan it depends on (df1) afterwards.
Ideally, when you call df2.collect(), you would want df2 to read the cached data of df1, but that is not what happens today: df2's execution plan was already generated before df1.cache() was called, so it never picks up the new cache entry. It might be worth revisiting the existing cache entries and updating their plans, if necessary, whenever cacheQuery() is called.
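To make the ordering issue concrete, here is a rough spark-shell sketch of the scenario above (the sample data and column name are made up for illustration; the behavior described in the comments reflects the plan-generation ordering discussed here, not anything this PR changes):

```scala
// Illustrative spark-shell session; assumes an active SparkSession `spark`.
import spark.implicits._

val df1 = Seq(1, 2, 3, 4).toDF("id")
val df2 = df1.filter($"id" > 2)

df2.cache()   // df2's cache entry is registered against its own plan;
              // at this point there is no df1 cache for it to reuse
df1.cache()   // caching df1 afterwards does not rewrite df2's existing entry

df1.collect() // materializes df1's in-memory cache
df2.collect() // recomputes from the source rather than reading df1's cache
```

So the later df1.cache() call is effectively invisible to the already-registered df2 entry, which is why revisiting and re-planning existing entries inside cacheQuery() could help.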