Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19864#discussion_r156718225

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala ---
    @@ -71,9 +74,10 @@ case class InMemoryRelation(
       override def computeStats(): Statistics = {
         if (batchStats.value == 0L) {
    -      // Underlying columnar RDD hasn't been materialized, no useful statistics information
    -      // available, return the default statistics.
    -      Statistics(sizeInBytes = child.sqlContext.conf.defaultSizeInBytes)
    +      // Underlying columnar RDD hasn't been materialized, use the stats from the plan to cache when
    +      // applicable
    +      statsOfPlanToCache.getOrElse(Statistics(sizeInBytes =
    +        child.sqlContext.conf.defaultSizeInBytes))
    --- End diff --

    If we don't trust the size in bytes for parquet, then we should fix that in the datasource, not here. The old version did not use proper statistics at all; now that we do, we should use them.
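The pattern under discussion can be sketched as follows. This is a hypothetical, self-contained illustration (not the actual Spark code): when the cached relation has not been materialized yet, fall back via `Option.getOrElse` to the stats of the plan being cached, and only use the configured default size when no such stats exist. The names `computeStatsSketch`, `materialized`, and the default size value are assumptions for illustration only.

```scala
// Hypothetical sketch of the fallback logic in the diff above.
case class Statistics(sizeInBytes: BigInt)

// Assumed stand-in for spark.sql.defaultSizeInBytes (Long.MaxValue by default).
val defaultSizeInBytes: BigInt = BigInt(Long.MaxValue)

// If the columnar RDD is not materialized, prefer the stats of the plan
// being cached; fall back to the default size only when those are absent.
def computeStatsSketch(
    statsOfPlanToCache: Option[Statistics],
    materialized: Boolean,
    materializedSize: BigInt): Statistics =
  if (!materialized) {
    statsOfPlanToCache.getOrElse(Statistics(sizeInBytes = defaultSizeInBytes))
  } else {
    Statistics(sizeInBytes = materializedSize)
  }
```

This matches the reviewer's point: the fallback chain should trust whatever statistics the datasource reports, and any mistrust of, say, parquet's size estimate belongs in the datasource itself, not in this branch.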