Github user CodingCat commented on a diff in the pull request: https://github.com/apache/spark/pull/19864#discussion_r156718763

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ---
@@ -80,6 +80,14 @@ class CacheManager extends Logging {
     cachedData.isEmpty
   }

+  private def extractStatsOfPlanForCache(plan: LogicalPlan): Option[Statistics] = {
+    if (plan.stats.rowCount.isDefined) {
--- End diff --

The current logic is that only when we have enough stats for the table of interest do we put the value here. I agree that we should choose a better value for formats like Parquet, where the actual in-memory size would be much larger than sizeInBytes (i.e. the on-disk size). My questions are:

1. What is the expected value if we have a HadoopFsRelation in Parquet format?
2. Do we want to do that in this PR?
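For context, the guard being discussed can be sketched standalone as below. This is a minimal illustration with stub case classes standing in for Spark's `LogicalPlan` and `Statistics` (the real classes live in `org.apache.spark.sql.catalyst`); it only models the "return stats when a row count is available" logic, not the actual planner.

```scala
// Hypothetical stub types; the real Spark classes are far richer.
case class Statistics(sizeInBytes: BigInt, rowCount: Option[BigInt])
case class LogicalPlan(stats: Statistics)

// Keep the plan's statistics for the cache only when a row count
// has actually been computed; otherwise fall back to None.
def extractStatsOfPlanForCache(plan: LogicalPlan): Option[Statistics] =
  if (plan.stats.rowCount.isDefined) Some(plan.stats) else None
```

The open question in the review is what `sizeInBytes` should mean in this path for a Parquet-backed `HadoopFsRelation`, since the on-disk (compressed, encoded) size understates the in-memory footprint after scanning.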