Actually this may not be a bug. It just the Optimizer decides to do a nested loop join over Hash Join when more that two OR joins are involved
With one equality predicate Hash JOin is chosen 4> SELECT COUNT(SALES.PROD_ID) from SALES, SALES2 5> WHERE SALES.CUST_ID = SALES2.CUST_ID 6> go QUERY PLAN FOR STATEMENT 1 (at line 4). Optimized using Parallel Mode STEP 1 The type of query is SELECT. 4 operator(s) under root |ROOT:EMIT Operator (VA = 4) | | |SCALAR AGGREGATE Operator (VA = 3) | | Evaluate Ungrouped COUNT AGGREGATE. | | | | |*HASH JOIN Operator* (VA = 2) (Join Type: Inner Join) | | | Using Worktable1 for internal storage. | | | Key Count: 1 | | | | | | |SCAN Operator (VA = 0) | | | | FROM TABLE | | | | SALES2 | | | | Table Scan. | | | | Forward Scan. | | | | Positioning at start of table. | | | | Using I/O Size 64 Kbytes for data pages. | | | | With LRU Buffer Replacement Strategy for data pages. | | | | | | |SCAN Operator (VA = 1) | | | | FROM TABLE | | | | SALES | | | | Table Scan. | | | | Forward Scan. | | | | Positioning at start of table. | | | | Using I/O Size 64 Kbytes for data pages. | | | | With MRU Buffer Replacement Strategy for data pages. Total estimated I/O cost for statement 1 (at line 4): 783206. Now if I chose two predicates it reverts to Nested Loop Join 4> SELECT COUNT(SALES.PROD_ID) from SALES, SALES2 5> WHERE SALES.CUST_ID = SALES2.CUST_ID 6> OR 7> SALES.TIME_ID = SALES2.TIME_ID 8> go STEP 1 The type of query is SET OPTION ON. Total estimated I/O cost for statement 3 (at line 3): 0. QUERY PLAN FOR STATEMENT 4 (at line 4). Optimized using Parallel Mode STEP 1 The type of query is SELECT. 5 operator(s) under root |ROOT:EMIT Operator (VA = 5) | | |SCALAR AGGREGATE Operator (VA = 4) | | Evaluate Ungrouped COUNT AGGREGATE. | | | | |*NESTED LOOP JOIN* Operator (VA = 3) (Join Type: Inner Join) | | | | | | |SCAN Operator (VA = 0) | | | | FROM TABLE | | | | SALES | | | | Table Scan. | | | | Forward Scan. | | | | Positioning at start of table. | | | | Using I/O Size 64 Kbytes for data pages. | | | | With MRU Buffer Replacement Strategy for data pages. | | | | | | |RESTRICT Operator (VA = 2)(0)(0)(0)(7)(0) | | | | | | | | |SCAN Operator (VA = 1) | | | | | FROM TABLE | | | | | SALES2 | | | | | Table Scan. | | | | | Forward Scan. | | | | | Positioning at start of table. | | | | | Using I/O Size 64 Kbytes for data pages. | | | | | With LRU Buffer Replacement Strategy for data pages. Total estimated I/O cost for statement 4 (at line 4): 2147483647. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com On 1 April 2016 at 13:19, ashokkumar rajendran < ashokkumar.rajend...@gmail.com> wrote: > I agree with Hemant's comment. But it does not give good results for > simple usecases like 2 OR conditions. Ultimately we need good results from > Spark for end users. shall we consider this as a request to support SQL > hints then? Is there any plan to support SQL hint in Spark in upcoming > release? > > Regards > Ashok > > On Fri, Apr 1, 2016 at 5:04 PM, Robin East <robin.e...@xense.co.uk> wrote: > >> Yes and even today CBO (e.g. in Oracle) will still require hints in some >> cases so I think it is more like: >> >> RBO -> RBO + Hints -> CBO + Hints. Most relational databases meet >> significant numbers of corner cases where CBO plans simply don’t do what >> you would want. I don’t know enough about Spark SQL to comment on whether >> the same problems would afflict Spark. >> >> >> >> >> On 31 Mar 2016, at 15:54, Yong Zhang <java8...@hotmail.com> wrote: >> >> I agree that there won't be a generic solution for these kind of cases. >> >> Without the CBO from Spark or Hadoop ecosystem in short future, maybe >> Spark DataFrame/SQL should support more hints from the end user, as in >> these cases, end users will be smart enough to tell the engine what is the >> correct way to do. >> >> Weren't the relational DBs doing exactly same path? RBO -> RBO + Hints -> >> CBO? >> >> Yong >> >> ------------------------------ >> Date: Thu, 31 Mar 2016 16:07:14 +0530 >> Subject: Re: SPARK-13900 - Join with simple OR conditions take too long >> From: hemant9...@gmail.com >> To: ashokkumar.rajend...@gmail.com >> CC: user@spark.apache.org >> >> Hi Ashok, >> >> That's interesting. >> >> As I understand, on table A and B, a nested loop join (that will produce >> m X n rows) is performed and than each row is evaluated to see if any of >> the condition is met. You are asking that Spark should instead do a >> BroadcastHashJoin on the equality conditions in parallel and then union the >> results like you are doing in a different query. >> >> If we leave aside parallelism for a moment, theoretically, time taken for >> nested loop join would vary little when the number of conditions are >> increased while the time taken for the solution that you are suggesting >> would increase linearly with number of conditions. So, when number of >> conditions are too many, nested loop join would be faster than the solution >> that you suggest. Now the question is, how should Spark decide when to do >> what? >> >> >> Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811> >> www.snappydata.io >> >> On Thu, Mar 31, 2016 at 2:28 PM, ashokkumar rajendran < >> ashokkumar.rajend...@gmail.com> wrote: >> >> Hi, >> >> I have filed ticket SPARK-13900. There was an initial reply from a >> developer but did not get any reply on this. How can we do multiple hash >> joins together for OR conditions based joins? Could someone please guide on >> how can we fix this? >> >> Regards >> Ashok >> >> >> >