[GitHub] spark pull request: [SPARK-7097][SQL]: Partitioned tables should o...

chenghao-intel Wed, 21 Oct 2015 19:10:29 -0700

Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5668#discussion_r42705230
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
    @@ -253,4 +253,80 @@ private[hive] trait HiveStrategies {
           case _ => Nil
         }
       }
    +
    +  /**
    +   * Uses the ExtractEquiJoinKeys pattern to find joins where at least some of the predicates can be
    +   * evaluated by matching hash keys.
    +   *
    +   * This strategy applies a simple optimization based on the estimates of the physical sizes of
    +   * the two join sides.  When planning a [[joins.BroadcastHashJoin]], if one side has an
    +   * estimated physical size smaller than the user-settable threshold
    +   * [[org.apache.spark.sql.SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]], the planner would mark it as the
    +   * ''build'' relation and mark the other relation as the ''stream'' side.  The build table will be
    +   * ''broadcasted'' to all of the executors involved in the join, as a
    +   * [[org.apache.spark.broadcast.Broadcast]] object.  If both estimates exceed the threshold, they
    +   * will instead be used to decide the build side in a [[joins.ShuffledHashJoin]].
    +   * Works similar to HashJoin strategy in SparkStrategies, but applies a better estimate of join
    +   * size in case of partitioned tables by computing sizes of referred partitions of the table.
    +   */
    +  object HiveHashJoin extends Strategy with PredicateHelper {
    --- End diff --
    
    My concern is that we probably don't want to duplicate the optimizer rules here.
    A better idea is to reflect the real statistics in `MetastoreRelation`, as @navis
    did in #6767, so that the default optimizer handles the rest for us.
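
To illustrate the suggestion, here is a minimal Scala sketch. It is not Spark's actual code: `Partition`, `Relation`, `Planner`, the table names and the sizes are all hypothetical stand-ins, and the join selection is deliberately simplified to "the smaller side is the build side, and it is broadcast if it fits under the threshold". The idea it tries to show is that once a relation reports the size of only the partitions it still references, a generic size-based rule can choose the broadcast join without a Hive-specific strategy.

```scala
// Hypothetical, simplified model. None of these types are Spark's real classes.
final case class Partition(spec: Map[String, String], sizeInBytes: Long)

final case class Relation(name: String, partitions: Seq[Partition]) {
  // Keep only the partitions whose spec satisfies the pruning predicate.
  def prune(keep: Map[String, String] => Boolean): Relation =
    copy(partitions = partitions.filter(p => keep(p.spec)))

  // The statistic exposed to the planner: total size of the partitions
  // that are still referenced, not the size of the whole table.
  def sizeInBytes: Long = partitions.map(_.sizeInBytes).sum
}

sealed trait JoinPlan
final case class BroadcastHashJoin(build: String) extends JoinPlan
final case class ShuffledHashJoin(build: String) extends JoinPlan

object Planner {
  // Generic rule (an assumption for this sketch): it only looks at
  // sizeInBytes and knows nothing about partitions or Hive.
  def plan(left: Relation, right: Relation, threshold: Long): JoinPlan = {
    val smaller = if (right.sizeInBytes <= left.sizeInBytes) right else left
    if (smaller.sizeInBytes <= threshold) BroadcastHashJoin(smaller.name)
    else ShuffledHashJoin(smaller.name)
  }
}

object Demo extends App {
  val mb = 1024L * 1024L
  val sales = Relation("sales", Seq(
    Partition(Map("day" -> "2015-10-20"), 8 * mb),
    Partition(Map("day" -> "2015-10-21"), 6 * mb)))
  val events = Relation("events", Seq(Partition(Map.empty, 500 * mb)))
  val threshold = 10 * mb

  // Whole table: 14 MB exceeds the 10 MB threshold.
  println(Planner.plan(sales, events, threshold))
  // One partition: 6 MB fits under the threshold.
  println(Planner.plan(sales.prune(_.get("day").contains("2015-10-21")), events, threshold))
}
```

In this sketch the planner code is identical for both calls; only the statistic reported by the relation changes, which is the separation of concerns the comment argues for.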



