Re: SPARK-13900 - Join with simple OR conditions take too long

Mich Talebzadeh Mon, 04 Apr 2016 13:03:55 -0700

Actually this may not be a bug. It just the Optimizer decides to do a
nested loop join over Hash Join when more that two OR joins are involved


With one equality predicate Hash JOin is chosen

4> SELECT COUNT(SALES.PROD_ID) from SALES, SALES2
5> WHERE SALES.CUST_ID = SALES2.CUST_ID
6> go
QUERY PLAN FOR STATEMENT 1 (at line 4).
Optimized using Parallel Mode

    STEP 1
        The type of query is SELECT.
        4 operator(s) under root
       |ROOT:EMIT Operator (VA = 4)
       |
       |   |SCALAR AGGREGATE Operator (VA = 3)
       |   |  Evaluate Ungrouped COUNT AGGREGATE.
       |   |
       |   |   |*HASH JOIN Operator* (VA = 2) (Join Type: Inner Join)
       |   |   | Using Worktable1 for internal storage.
       |   |   |  Key Count: 1
       |   |   |
       |   |   |   |SCAN Operator (VA = 0)
       |   |   |   |  FROM TABLE
       |   |   |   |  SALES2
       |   |   |   |  Table Scan.
       |   |   |   |  Forward Scan.
       |   |   |   |  Positioning at start of table.
       |   |   |   |  Using I/O Size 64 Kbytes for data pages.
       |   |   |   |  With LRU Buffer Replacement Strategy for data pages.
       |   |   |
       |   |   |   |SCAN Operator (VA = 1)
       |   |   |   |  FROM TABLE
       |   |   |   |  SALES
       |   |   |   |  Table Scan.
       |   |   |   |  Forward Scan.
       |   |   |   |  Positioning at start of table.
       |   |   |   |  Using I/O Size 64 Kbytes for data pages.
       |   |   |   |  With MRU Buffer Replacement Strategy for data pages.

Total estimated I/O cost for statement 1 (at line 4): 783206.

Now if I chose two predicates it reverts to Nested Loop Join

4> SELECT COUNT(SALES.PROD_ID) from SALES, SALES2
5> WHERE SALES.CUST_ID = SALES2.CUST_ID
6> OR
7> SALES.TIME_ID = SALES2.TIME_ID
8> go

    STEP 1
        The type of query is SET OPTION ON.
Total estimated I/O cost for statement 3 (at line 3): 0.

QUERY PLAN FOR STATEMENT 4 (at line 4).
Optimized using Parallel Mode

    STEP 1
        The type of query is SELECT.
        5 operator(s) under root
       |ROOT:EMIT Operator (VA = 5)
       |
       |   |SCALAR AGGREGATE Operator (VA = 4)
       |   |  Evaluate Ungrouped COUNT AGGREGATE.
       |   |
       |   |   |*NESTED LOOP JOIN* Operator (VA = 3) (Join Type: Inner Join)
       |   |   |
       |   |   |   |SCAN Operator (VA = 0)
       |   |   |   |  FROM TABLE
       |   |   |   |  SALES
       |   |   |   |  Table Scan.
       |   |   |   |  Forward Scan.
       |   |   |   |  Positioning at start of table.
       |   |   |   |  Using I/O Size 64 Kbytes for data pages.
       |   |   |   |  With MRU Buffer Replacement Strategy for data pages.
       |   |   |
       |   |   |   |RESTRICT Operator (VA = 2)(0)(0)(0)(7)(0)
       |   |   |   |
       |   |   |   |   |SCAN Operator (VA = 1)
       |   |   |   |   |  FROM TABLE
       |   |   |   |   |  SALES2
       |   |   |   |   |  Table Scan.
       |   |   |   |   |  Forward Scan.
       |   |   |   |   |  Positioning at start of table.
       |   |   |   |   |  Using I/O Size 64 Kbytes for data pages.
       |   |   |   |   |  With LRU Buffer Replacement Strategy for data
pages.

Total estimated I/O cost for statement 4 (at line 4): 2147483647.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 1 April 2016 at 13:19, ashokkumar rajendran <
ashokkumar.rajend...@gmail.com> wrote:

> I agree with Hemant's comment. But it does not give good results for
> simple usecases like 2 OR conditions. Ultimately we need good results from
> Spark for end users. shall we consider this as a request to support SQL
> hints then? Is there any plan to support SQL hint in Spark in upcoming
> release?
>
> Regards
> Ashok
>
> On Fri, Apr 1, 2016 at 5:04 PM, Robin East <robin.e...@xense.co.uk> wrote:
>
>> Yes and even today CBO (e.g. in Oracle) will still require hints in some
>> cases so I think it is more like:
>>
>> RBO -> RBO + Hints -> CBO + Hints. Most relational databases meet
>> significant numbers of corner cases where CBO plans simply don’t do what
>> you would want. I don’t know enough about Spark SQL to comment on whether
>> the same problems would afflict Spark.
>>
>>
>>
>>
>> On 31 Mar 2016, at 15:54, Yong Zhang <java8...@hotmail.com> wrote:
>>
>> I agree that there won't be a generic solution for these kind of cases.
>>
>> Without the CBO from Spark or Hadoop ecosystem in short future, maybe
>> Spark DataFrame/SQL should support more hints from the end user, as in
>> these cases, end users will be smart enough to tell the engine what is the
>> correct way to do.
>>
>> Weren't the relational DBs doing exactly same path? RBO -> RBO + Hints ->
>> CBO?
>>
>> Yong
>>
>> ------------------------------
>> Date: Thu, 31 Mar 2016 16:07:14 +0530
>> Subject: Re: SPARK-13900 - Join with simple OR conditions take too long
>> From: hemant9...@gmail.com
>> To: ashokkumar.rajend...@gmail.com
>> CC: user@spark.apache.org
>>
>> Hi Ashok,
>>
>> That's interesting.
>>
>> As I understand, on table A and B, a nested loop join (that will produce
>> m X n rows) is performed and than each row is evaluated to see if any of
>> the condition is met. You are asking that Spark should instead do a
>> BroadcastHashJoin on the equality conditions in parallel and then union the
>> results like you are doing in a different query.
>>
>> If we leave aside parallelism for a moment, theoretically, time taken for
>> nested loop join would vary little when the number of conditions are
>> increased while the time taken for the solution that you are suggesting
>> would increase linearly with number of conditions. So, when number of
>> conditions are too many, nested loop join would be faster than the solution
>> that you suggest. Now the question is, how should Spark decide when to do
>> what?
>>
>>
>> Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811>
>> www.snappydata.io
>>
>> On Thu, Mar 31, 2016 at 2:28 PM, ashokkumar rajendran <
>> ashokkumar.rajend...@gmail.com> wrote:
>>
>> Hi,
>>
>> I have filed ticket SPARK-13900. There was an initial reply from a
>> developer but did not get any reply on this. How can we do multiple hash
>> joins together for OR conditions based joins? Could someone please guide on
>> how can we fix this?
>>
>> Regards
>> Ashok
>>
>>
>>
>

Re: SPARK-13900 - Join with simple OR conditions take too long

Reply via email to