Github user sujith71955 commented on the issue:

    https://github.com/apache/spark/pull/22758
  
    @cloud-fan Please find my understanding of the flow below; it is a bit tricky :)
    Let me elaborate the flow, maybe we will get more suggestions.
    
    Step 1 : insert command
    
    insert command --> In DetermineTableStats the condition will not be met, since it is a OneTableRelation in HiveStrategies (so CatalogTable stats is None) --> In convertToLogicalRelation(), the relation will be cached with None stats, as per the HiveMetastoreCatalog flow --> Hive stats will not be set, since CatalogTable stats is None, as per the InsertIntoHadoopFsRelationCommand flow.
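    The guard I am referring to in the insert path, paraphrased from memory (a rough sketch, not the exact source):

    ```scala
    // Rough sketch, paraphrased from the Spark code base (not exact source).
    // InsertIntoHadoopFsRelationCommand.run() only asks CommandUtils to update
    // table stats when a CatalogTable is attached:
    if (catalogTable.nonEmpty) {
      CommandUtils.updateTableStats(sparkSession, catalogTable.get)
    }

    // ...and CommandUtils.updateTableStats is a no-op when that CatalogTable
    // carries no stats, which is the case in this insert flow, so nothing is
    // written to the metastore here:
    def updateTableStats(sparkSession: SparkSession, table: CatalogTable): Unit = {
      if (table.stats.nonEmpty) {
        // recompute the size and call alterTableStats, or clear the stale stats
      }
    }
    ```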
    
    Step 2 : Execute select statement
    
    Select flow --> In DetermineTableStats we get a HiveTableRelation, the condition is met, so the stats are updated with the default value --> In convertToLogicalRelation(), the LogicalRelation will be fetched from the cache with stats as None (set in the insert command flow) --> While computing the LogicalRelation stats, since CatalogTable stats is None, the stats will be calculated from HadoopFsRelation based on the file source.
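    The fallback I mean is in LogicalRelation.computeStats, roughly (paraphrased from memory, not the exact source):

    ```scala
    // Rough sketch of LogicalRelation.computeStats (paraphrased, not exact source):
    // when the attached CatalogTable has no stats, the size reported by the
    // file-source relation (the real size of the files on disk) is used instead,
    // which here keeps the table small enough to broadcast.
    override def computeStats(): Statistics = {
      catalogTable
        .flatMap(_.stats.map(_.toPlanStats(output, conf.cboEnabled)))
        .getOrElse(Statistics(sizeInBytes = relation.sizeInBytes))
    }
    ```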
    
    Result : we get a plan with Broadcast join, as expected.
    
    Step 3 : 
    
    exit (come out of spark-shell)
    
    Step 4: Execute select statement
    
    Select flow --> In DetermineTableStats we get a HiveTableRelation, the condition is met, so the stats are updated with the default value --> Now no cached LogicalRelation is available, as per the convertToLogicalRelation() flow, so the CatalogTable with the updated stats (set by DetermineTableStats) will be added to the cache --> While computing the LogicalRelation stats, since no Hive stats are available, it will consider the default stats (set by DetermineTableStats).
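    For reference, the part of DetermineTableStats (HiveStrategies.scala) that stamps the default stats, roughly (paraphrased from memory, not the exact source):

    ```scala
    // Rough sketch of DetermineTableStats (paraphrased, not exact source):
    // a HiveTableRelation with empty stats gets spark.sql.defaultSizeInBytes
    // (Long.MaxValue unless configured) stamped onto its CatalogTable, unless
    // spark.sql.statistics.fallBackToHdfs is enabled.
    case relation: HiveTableRelation
        if DDLUtils.isHiveTable(relation.tableMeta) && relation.tableMeta.stats.isEmpty =>
      val sizeInBytes = conf.defaultSizeInBytes
      relation.copy(tableMeta = relation.tableMeta.copy(
        stats = Some(CatalogStatistics(sizeInBytes = BigInt(sizeInBytes)))))
    ```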
    
    Result : we get a plan with SortMergeJoin, which is wrong.
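    The wrong plan follows from the broadcast check in JoinSelection, roughly (paraphrased from memory, not the exact source):

    ```scala
    // Rough sketch of the broadcast check in SparkStrategies.JoinSelection
    // (paraphrased, not exact source): with the default size stamped above
    // (Long.MaxValue), this returns false, so we fall back to SortMergeJoin.
    private def canBroadcast(plan: LogicalPlan): Boolean = {
      plan.stats.sizeInBytes >= 0 &&
        plan.stats.sizeInBytes <= conf.autoBroadcastJoinThreshold
    }
    ```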
    
    Step 5: Again Run insert command in same session
    
    insert command --> In DetermineTableStats the condition will not be met, since it is a OneTableRelation in HiveStrategies (so CatalogTable stats is None) --> In convertToLogicalRelation(), the relation is already present in the cache with the default value (set in the select flow of the previous step), as per the HiveMetastoreCatalog flow --> This time the Hive stats will be recorded, since CatalogTable stats is not None (it already has the default value set), as per the InsertIntoHadoopFsRelationCommand flow.
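    This can be checked from the shell right after the insert; a hedged sketch, assuming a hypothetical table name `t` (note that spark.sessionState is an internal/unstable API):

    ```scala
    import org.apache.spark.sql.catalyst.TableIdentifier

    // Per the flow above, the catalog now hands back non-empty stats for the
    // table, whereas the same check right after step 1 still printed None.
    val meta = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t"))
    println(meta.stats)
    ```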
    
    Step 6 : 
    
    exit (come out of spark-shell).
    
    Step 7 : 
    
    Run select --> In DetermineTableStats we get a HiveTableRelation, the condition is met, so the stats are updated with the default value --> Now no cached LogicalRelation is available, as per the convertToLogicalRelation() flow, so the CatalogTable with the updated stats (set by DetermineTableStats) will be added to the cache --> While computing the LogicalRelation stats, since Hive stats are available (set by the insert command in step 5), it will consider the Hive stats.
    
    Result : this time we get a plan with Broadcast join, as expected.
    
    So overall I can see there is an inconsistency in the flow, which happens because the DetermineTableStats flow sets default stats in the CatalogTable even for convertible relations.
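    A quick way to see the inconsistency from spark-shell (a hedged sketch; the table name `t` is hypothetical) is to print the size estimate the planner ends up using at each step:

    ```scala
    // Prints the size estimate the planner uses for the scan of `t`; comparing
    // it with spark.sql.autoBroadcastJoinThreshold shows whether a broadcast
    // join can be chosen. Per the flow above, the value differs between step 2
    // (file-source size), step 4 (defaultSizeInBytes) and step 7 (Hive stats).
    val scanStats = spark.table("t").queryExecution.optimizedPlan.stats
    println(scanStats.sizeInBytes)
    println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))
    ```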

