[ 
https://issues.apache.org/jira/browse/SPARK-25403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-25403:
-----------------------------------

    Assignee: Yuming Wang

> Broadcast join changes to sort merge join after the spark-beeline session 
> restarts.
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-25403
>                 URL: https://issues.apache.org/jira/browse/SPARK-25403
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1, 2.3.1
>         Environment: Spark 2.3.1
> Hadoop 2.7.2
>            Reporter: Ayush Anubhava
>            Assignee: Yuming Wang
>            Priority: Major
>
> *Issue 1*: {color:#ff0000}Broadcast join changes to sort merge join after 
> the spark-beeline session restarts{color}.
> *Precondition*: The JDBC/Thrift server is running continuously. 
> *Steps:*
>  
> {code:java}
> 0: jdbc:hive2://10.18.18.214:23040/default> use x1;
> +---------+--+
> | Result |
> +---------+--+
> +---------+--+
> 0: jdbc:hive2://10.18.18.214:23040/default> create table cv (a int, b string) 
> stored as parquet;
> +---------+--+
> | Result |
> +---------+--+
> +---------+--+
> 0: jdbc:hive2://10.18.18.214:23040/default> create table c (a int, b string) 
> stored as parquet;
> +---------+--+
> | Result |
> +---------+--+
> +---------+--+
> 0: jdbc:hive2://10.18.18.214:23040/default> insert into table c values 
> (1,'a');
> +---------+--+
> | Result |
> +---------+--+
> +---------+--+
> 0: jdbc:hive2://10.18.18.214:23040/default> insert into table cv values 
> (1,'a');
> +---------+--+
> | Result |
> +---------+--+
> +---------+--+
> 0: jdbc:hive2://10.18.18.214:23040/default> select * from c , cv where c.a = 
> cv.a;
> +----+----+----+----+--+
> | a | b | a | b |
> +----+----+----+----+--+
> | 1 | a | 1 | a |
> +----+----+----+----+--+
> {code}
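> Not part of the original report, but one way to compare what the planner thinks each side of the restart costs: the broadcast decision is driven by the table's estimated {{sizeInBytes}}, which can be refreshed and inspected from the same beeline session (Spark 2.2+ syntax):
> {code:sql}
> -- recompute table-level statistics; sizeInBytes drives the broadcast decision
> ANALYZE TABLE cv COMPUTE STATISTICS;
> -- the "Statistics" row in the extended description shows the estimated size
> DESC EXTENDED cv;
> {code}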
>  
> *Before restarting the session (spark-beeline):*
> *{color:#d04437}explain select * from c , cv where c.a = cv.a;{color}*
> {code:java}
> == Physical Plan ==
> *(2) BroadcastHashJoin [a#3284], [a#3286], Inner, BuildRight
> :- *(2) Project [a#3284, b#3285]
> :  +- *(2) Filter isnotnull(a#3284)
> :     +- *(2) FileScan parquet x1.c[a#3284,b#3285] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://hacluster/user/sparkhive/warehouse/x1.db/c], PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct<a:int,b:string>
> +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)))
>    +- *(1) Project [a#3286, b#3287]
>       +- *(1) Filter isnotnull(a#3286)
>          +- *(1) FileScan parquet x1.cv[a#3286,b#3287] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://hacluster/user/sparkhive/warehouse/x1.db/cv], PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct<a:int,b:string>
> {code}
> *After the session restarts (spark-beeline):*
> {color:#d04437} *explain select * from c , cv where c.a = cv.a;*{color}
> {code:java}
> == Physical Plan ==
> *(5) SortMergeJoin [a#3312], [a#3314], Inner
> :- *(2) Sort [a#3312 ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(a#3312, 200)
> :     +- *(1) Project [a#3312, b#3313]
> :        +- *(1) Filter isnotnull(a#3312)
> :           +- *(1) FileScan parquet x1.c[a#3312,b#3313] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://hacluster/user/sparkhive/warehouse/x1.db/c], PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct<a:int,b:string>
> +- *(4) Sort [a#3314 ASC NULLS FIRST], false, 0
>    +- Exchange hashpartitioning(a#3314, 200)
>       +- *(3) Project [a#3314, b#3315]
>          +- *(3) Filter isnotnull(a#3314)
>             +- *(3) FileScan parquet x1.cv[a#3314,b#3315] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://hacluster/user/sparkhive/warehouse/x1.db/cv], PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct<a:int,b:string>
> {code}
> +_*Note: The JDBC server is running continuously at the time of the session 
> restart, i.e. the application is not restarted and the driver remains the same.*_+
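> For what it's worth, a plausible diagnostic angle (an assumption, not a confirmed root cause): the planner chooses BroadcastHashJoin only when the estimated size of one side is below {{spark.sql.autoBroadcastJoinThreshold}} (10 MB by default), so a lost or inflated size estimate after the restart would flip the plan to SortMergeJoin. A quick way to check the threshold and force the broadcast from beeline (the BROADCAST hint is available since Spark 2.2):
> {code:sql}
> -- show the current threshold (default 10485760 bytes; -1 disables broadcast joins)
> SET spark.sql.autoBroadcastJoinThreshold;
> -- force a broadcast of cv regardless of the size estimate
> EXPLAIN SELECT /*+ BROADCAST(cv) */ * FROM c, cv WHERE c.a = cv.a;
> {code}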



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
