Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19864#discussion_r156718225

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala ---
    @@ -71,9 +74,10 @@ case class InMemoryRelation(
       override def computeStats(): Statistics = {
         if (batchStats.value == 0L) {
    -      // Underlying columnar RDD hasn't been materialized, no useful statistics information
    -      // available, return the default statistics.
    -      Statistics(sizeInBytes = child.sqlContext.conf.defaultSizeInBytes)
    +      // Underlying columnar RDD hasn't been materialized, use the stats from the plan to cache when
    +      // applicable
    +      statsOfPlanToCache.getOrElse(Statistics(sizeInBytes =
    +        child.sqlContext.conf.defaultSizeInBytes))
    --- End diff --

    If we don't trust the size in bytes for parquet, then we should fix that in the datasource, not here. The old version did not use proper statistics at all; now that we do, we should use them.
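The pattern under discussion can be sketched as follows. This is a hypothetical, self-contained illustration (not the actual Spark code): when the cached relation has not been materialized yet, fall back via `Option.getOrElse` to the stats of the plan being cached, and only use the configured default size when no such stats exist. The names `computeStatsSketch`, `materialized`, and the default size value are assumptions for illustration only.

```scala
// Hypothetical sketch of the fallback logic in the diff above.
case class Statistics(sizeInBytes: BigInt)

// Assumed stand-in for spark.sql.defaultSizeInBytes (Long.MaxValue by default).
val defaultSizeInBytes: BigInt = BigInt(Long.MaxValue)

// If the columnar RDD is not materialized, prefer the stats of the plan
// being cached; fall back to the default size only when those are absent.
def computeStatsSketch(
    statsOfPlanToCache: Option[Statistics],
    materialized: Boolean,
    materializedSize: BigInt): Statistics =
  if (!materialized) {
    statsOfPlanToCache.getOrElse(Statistics(sizeInBytes = defaultSizeInBytes))
  } else {
    Statistics(sizeInBytes = materializedSize)
  }
```

This matches the reviewer's point: the fallback chain should trust whatever statistics the datasource reports, and any mistrust of, say, parquet's size estimate belongs in the datasource itself, not in this branch.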