GitHub user sujith71955 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22758#discussion_r226192210
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -193,6 +193,16 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
               None)
             val logicalRelation = cached.getOrElse {
               val updatedTable = inferIfNeeded(relation, options, fileFormat)
     +          // Initialize the catalogTable stats if it's not defined. An initial value has to be defined.
    --- End diff ---
    
    Thanks for your valuable feedback.
    My observations:
    1) In the insert flow we always try to update the Hive stats, as per the statement below in InsertIntoHadoopFsRelationCommand:
    ```
          if (catalogTable.nonEmpty) {
            CommandUtils.updateTableStats(sparkSession, catalogTable.get)
          }
    ```
    However, after a CREATE TABLE command, when we run an INSERT command within the same session, the Hive statistics are not updated, because the validation below expects the stats to be non-empty:
    ```
    def updateTableStats(sparkSession: SparkSession, table: CatalogTable): Unit = {
      if (table.stats.nonEmpty) {
        // ...
    ```
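    To illustrate the scenario, here is a minimal spark-shell sketch (the table name and data are made up for illustration):
    ```
    // Hypothetical reproduction of the behaviour described above.
    spark.sql("CREATE TABLE t1 (id INT) STORED AS PARQUET")
    spark.sql("INSERT INTO t1 VALUES (1), (2), (3)")
    // Within the same session, updateTableStats returns early because
    // table.stats is empty, so totalSize is not written to the metastore.
    spark.sql("DESCRIBE FORMATTED t1").show(100, false)
    ```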
    But if we re-launch spark-shell and run the insert command again, the Hive statistics will be saved, and from then on the stats will be taken from the Hive stats; the flow will never try to estimate the data size from the files.
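    For example, relaunching spark-shell and checking the plan statistics (a sketch; t1 is the hypothetical table from the snippet above) should show the size coming from the metastore entry rather than from scanning the files:
    ```
    // Hypothetical check in a freshly launched spark-shell: the optimized
    // plan's sizeInBytes now reflects the stats recorded in the metastore.
    val sizeFromStats = spark.table("t1").queryExecution.optimizedPlan.stats.sizeInBytes
    println(sizeFromStats)
    ```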
    
    2) Currently the system does not always try to estimate the data size from the files when we execute the insert command; as I mentioned above, if we launch the query from a new context, the system will read the stats from Hive. I think there is a behavior-consistency problem here, and also, if we can always get the stats from Hive, do we need to recalculate the stats from the files every time?
    
     >> I think we may need to update the flow so that it always reads the data size from the files and never depends on the Hive stats,
     >> Or, if we are recording the Hive stats, then it should read the Hive stats every time.
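    For reference, a minimal sketch of the initialization suggested in the diff above (using defaultSizeInBytes as the initial value is my assumption, not necessarily the final change):
    ```
    import org.apache.spark.sql.catalyst.catalog.CatalogStatistics

    // Hypothetical sketch: give the catalog table a defined stats value so
    // that updateTableStats does not skip the first insert in a session.
    val tableWithStats = if (updatedTable.stats.isEmpty) {
      // defaultSizeInBytes is an assumed placeholder initial value
      updatedTable.copy(stats = Some(
        CatalogStatistics(sizeInBytes = sparkSession.sessionState.conf.defaultSizeInBytes)))
    } else {
      updatedTable
    }
    ```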
    Please let me know whether I am going in the right direction, and let me know if you need any clarifications.


