maryannxue opened a new pull request #23644: [SPARK-26708][SQL] Incorrect 
result caused by inconsistency between a SQL cache's cached RDD and its 
physical plan
URL: https://github.com/apache/spark/pull/23644
 
 
   ## What changes were proposed in this pull request?
   
   When performing non-cascading cache invalidation, `recache` is called on the 
other cache entries that depend on the cache being invalidated. This causes 
the physical plans of those cache entries to be re-compiled. If the cached RDD 
of such an entry has already been persisted, the persisted data may be 
inconsistent with the new plan. This becomes a correctness issue when the new 
plan's `outputPartitioning` or `outputOrdering` differs from that of the 
actual data, and the cache is then used by another query that requires a 
specific `outputPartitioning` or `outputOrdering` which happens to match the 
new plan but not the actual data.
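
   To illustrate why such a mismatch produces wrong results, here is a minimal 
toy model in plain Python. It is not Spark code: `hash_partition`, `count_by`, 
and the `claimed_key` parameter are hypothetical names invented for this 
sketch. It assumes a consumer that skips its shuffle whenever the claimed 
partitioning of the cached data matches the grouping key it needs.

```python
from collections import defaultdict

def hash_partition(rows, key, n):
    """Physically distribute rows into n partitions by hash of row[key]."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts

def count_by(parts, key, claimed_key, n):
    """Group-by count that trusts the claimed partitioning of its input.

    If the input claims to be partitioned by `key`, the shuffle is skipped
    and each partition is aggregated independently.
    """
    if claimed_key != key:
        # Claimed partitioning does not satisfy the requirement: reshuffle.
        parts = hash_partition([r for p in parts for r in p], key, n)
    out = []
    for p in parts:
        counts = defaultdict(int)
        for r in p:
            counts[r[key]] += 1
        out.extend(counts.items())
    return sorted(out)

rows = [{"a": i % 2, "b": i % 3} for i in range(6)]

# The cached data was materialized under the old plan: partitioned by "a".
cached = hash_partition(rows, "a", 2)

# Consistent metadata: the consumer reshuffles by "b" and counts correctly.
correct = count_by(cached, "b", claimed_key="a", n=2)

# Stale data, new plan: the metadata now claims partitioning by "b", so the
# consumer skips the shuffle and emits the same group from several partitions.
wrong = count_by(cached, "b", claimed_key="b", n=2)
```

In this sketch `correct` contains one row per value of `b`, while `wrong` 
contains each group twice with a partial count, because rows with the same 
`b` still live in different partitions.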
   
   The fix is to keep the cache entry as it is if the data has been loaded; 
otherwise, rebuild the cache entry with a new plan and an empty cache buffer.
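
   The rule above can be sketched as follows. This is a simplified 
illustration in plain Python, not the actual change in this pull request: 
`CacheEntry`, `is_loaded`, `recache`, and the `compile_plan` callback are 
hypothetical names, and plans are represented as plain strings.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class CacheEntry:
    logical_plan: str
    physical_plan: str
    buffer: Optional[list] = None  # None means the data is not loaded yet

    @property
    def is_loaded(self) -> bool:
        return self.buffer is not None

def recache(entries: List[CacheEntry],
            is_dependent: Callable[[CacheEntry], bool],
            compile_plan: Callable[[str], str]) -> List[CacheEntry]:
    """Apply the keep-or-rebuild rule to entries that depend on an invalidated cache."""
    result = []
    for entry in entries:
        if is_dependent(entry) and not entry.is_loaded:
            # Nothing persisted yet: safe to rebuild with a freshly compiled
            # plan and an empty buffer.
            result.append(CacheEntry(entry.logical_plan,
                                     compile_plan(entry.logical_plan)))
        else:
            # Data already loaded (or entry unaffected): keep the entry as it
            # is, so the plan keeps describing the data that actually exists.
            result.append(entry)
    return result

loaded = CacheEntry("agg(t)", "plan-v1", buffer=[1, 2, 3])
pending = CacheEntry("join(t, u)", "plan-v1")
after = recache([loaded, pending],
                is_dependent=lambda e: True,
                compile_plan=lambda logical: "plan-v2")
```

Here `after[0]` is the original loaded entry with its old plan, while 
`after[1]` is a new entry with the re-compiled plan and no data.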
   
   ## How was this patch tested?
   
   Added a unit test.

