[GitHub] spark pull request #20394: [SPARK-23214][SQL] cached data should not carry e...

gatorsmile Fri, 26 Jan 2018 16:44:44 -0800

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20394#discussion_r164254321
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala ---
    @@ -73,11 +73,16 @@ case class InMemoryRelation(
       @transient val partitionStatistics = new PartitionStatistics(output)
     
       override def computeStats(): Statistics = {
    -    if (batchStats.value == 0L) {
    -      // Underlying columnar RDD hasn't been materialized, use the stats from the plan to cache
    -      statsOfPlanToCache
    +    if (sizeInBytesStats.value == 0L) {
    +      // Underlying columnar RDD hasn't been materialized, use the stats from the plan to cache.
    +      // Note that we should drop the hint info here. We may cache a plan whose root node is a hint
    +      // node. When we lookup the cache with a semantically same plan without hint info, the plan
    +      // returned by cache lookup should not have hint info. If we lookup the cache with a
    +      // semantically same plan with a different hint info, `CacheManager.useCachedData` will take
    +      // care of it and retain the hint info in the lookup input plan.
    +      statsOfPlanToCache.copy(hints = HintInfo())
    --- End diff --
    
    This is new behavior that we introduced in 2.3. For now, I will keep the behavior unchanged and merge this to 2.3.
    
    We can discuss it further for the next release.
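
To illustrate the hint-dropping logic in the diff above, here is a minimal sketch. `HintDropSketch`, its `Statistics` and `HintInfo` case classes, and the `materializedSize` parameter are simplified, hypothetical stand-ins rather than Spark's actual classes. The `else` branch is also an assumption, since the diff does not show it.

```scala
// Simplified, hypothetical stand-ins for Spark's Statistics and HintInfo.
// They only model the fields needed to show the hint being dropped.
object HintDropSketch {
  final case class HintInfo(broadcast: Boolean = false)
  final case class Statistics(sizeInBytes: BigInt, hints: HintInfo = HintInfo())

  // Mirrors the shape of computeStats() in the diff: while nothing has been
  // materialized (size accumulator still 0), fall back to the stats of the
  // plan being cached, but reset the hints to the default HintInfo().
  // The else branch is an assumption; the diff does not show it.
  def computeStats(materializedSize: Long, statsOfPlanToCache: Statistics): Statistics =
    if (materializedSize == 0L) statsOfPlanToCache.copy(hints = HintInfo())
    else Statistics(sizeInBytes = materializedSize)

  def main(args: Array[String]): Unit = {
    // A plan whose root node carried a broadcast hint when it was cached.
    val hinted = Statistics(sizeInBytes = 1024, hints = HintInfo(broadcast = true))
    // Size estimate is kept, hint info is dropped.
    println(computeStats(0L, hinted))
  }
}
```

In this sketch, `copy(hints = HintInfo())` keeps the size estimate from the plan to cache and discards only the hint, which is the behavior the new code comment describes.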
