[ https://issues.apache.org/jira/browse/SPARK-13900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15198099#comment-15198099 ]
Xiao Li commented on SPARK-13900:
---------------------------------

BroadcastNestedLoopJoin is much slower than BroadcastHashJoin. This is the major reason why your first plan is faster than the second one.

> Spark SQL queries with OR condition is not optimized properly
> -------------------------------------------------------------
>
>                 Key: SPARK-13900
>                 URL: https://issues.apache.org/jira/browse/SPARK-13900
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Ashok kumar Rajendran
>
> I have a large table with a few billion rows and a very small table with 4
> dimension values. All the data is stored in Parquet format. I would like to
> get the rows that match any of these dimensions. For example:
>
> SELECT field1, field2 FROM A, B WHERE A.dimension1 = B.dimension1 OR
> A.dimension2 = B.dimension2 OR A.dimension3 = B.dimension3 OR
> A.dimension4 = B.dimension4;
>
> The query planner executes this as a BroadcastNestedLoopJoin, which runs for
> a very long time.
>
> If I instead execute it as UNION ALL queries, each branch takes around 1.5
> minutes per dimension, and each query internally uses a BroadcastHashJoin:
>
> SELECT field1, field2 FROM A, B WHERE A.dimension1 = B.dimension1
> UNION ALL
> SELECT field1, field2 FROM A, B WHERE A.dimension2 = B.dimension2
> UNION ALL
> SELECT field1, field2 FROM A, B WHERE A.dimension3 = B.dimension3
> UNION ALL
> SELECT field1, field2 FROM A, B WHERE A.dimension4 = B.dimension4;
>
> This is obviously not an optimal solution, since it scans the same table
> multiple times, but it still performs far better than the OR condition.
>
> It seems the SQL optimizer is not handling this case properly, which causes
> a huge performance impact on this type of OR query.
>
> Please correct me if I missed anything here.
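The cost gap the comment describes can be illustrated without Spark. Below is a minimal sketch in plain Python, where `big` stands in for table A and `small` for the 4-row dimension table B; the row contents, the `id` field, and both function names are hypothetical, chosen only to mimic the two join strategies:

```python
# Hypothetical sample data: "small" mimics the 4-row dimension table B,
# "big" mimics a slice of the large table A.
small = [{"id": "b1", "dimension1": 1, "dimension2": 2,
          "dimension3": 3, "dimension4": 4}]
big = [{"id": "a1", "dimension1": 1, "dimension2": 0,
        "dimension3": 0, "dimension4": 0},
       {"id": "a2", "dimension1": 0, "dimension2": 0,
        "dimension3": 0, "dimension4": 0}]

def nested_loop_or_join(big, small):
    # Mimics BroadcastNestedLoopJoin: the OR predicate has no single
    # equi-join key, so every big row is checked against every broadcast
    # row -- O(len(big) * len(small)) predicate evaluations.
    out = []
    for a in big:
        for b in small:
            if any(a[f"dimension{i}"] == b[f"dimension{i}"]
                   for i in range(1, 5)):
                out.append((a["id"], b["id"]))
    return out

def union_of_hash_joins(big, small):
    # Mimics four BroadcastHashJoins glued with UNION ALL: build a hash
    # table on the small side for each dimension, then every big row does
    # four O(1) probes instead of scanning the whole broadcast side.
    out = []
    for i in range(1, 5):
        key = f"dimension{i}"
        built = {}
        for b in small:
            built.setdefault(b[key], []).append(b)
        for a in big:
            for b in built.get(a[key], []):
                out.append((a["id"], b["id"]))
    return out

print(nested_loop_or_join(big, small))   # [('a1', 'b1')]
print(union_of_hash_joins(big, small))   # [('a1', 'b1')]
```

One semantic caveat with the rewrite itself: a big row that matches B on two dimensions appears once in the OR join but once per matching dimension in the UNION ALL version, so a duplicate-removing UNION (or a later dedup step) may be needed if exact OR semantics matter.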