I agree with Hemant's comment, but the current behavior does not give good
results even for simple use cases with just two OR conditions, and ultimately
end users need good results from Spark. Shall we consider this a request to
support SQL hints then? Is there any plan to support SQL hints in an upcoming
Spark release?
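
For concreteness, here is a minimal sketch of the two query shapes being
compared (the table and column names are hypothetical, not the ones from the
ticket):

    -- OR-condition join: with no single equi-join key, Spark falls back to a
    -- nested-loop-style join that evaluates the predicate for every row pair.
    SELECT a.*, b.*
    FROM tableA a JOIN tableB b
      ON a.key1 = b.key1 OR a.key2 = b.key2;

    -- Manual rewrite: each disjunct becomes its own equi-join (eligible for a
    -- hash join), and UNION removes pairs that satisfy both conditions so
    -- they are not returned twice.
    SELECT a.*, b.* FROM tableA a JOIN tableB b ON a.key1 = b.key1
    UNION
    SELECT a.*, b.* FROM tableA a JOIN tableB b ON a.key2 = b.key2;

Note that UNION also collapses duplicate rows coming from the base tables, so
the rewrite is only strictly equivalent when the inputs contain no duplicates.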

Regards
Ashok

On Fri, Apr 1, 2016 at 5:04 PM, Robin East <robin.e...@xense.co.uk> wrote:

> Yes, and even today a CBO (e.g. in Oracle) will still require hints in some
> cases, so I think it is more like:
>
> RBO -> RBO + Hints -> CBO + Hints. Most relational databases run into a
> significant number of corner cases where CBO plans simply don’t do what
> you would want. I don’t know enough about Spark SQL to comment on whether
> the same problems would afflict Spark.
>
>
>
>
> On 31 Mar 2016, at 15:54, Yong Zhang <java8...@hotmail.com> wrote:
>
> I agree that there won't be a generic solution for these kinds of cases.
>
> Without a CBO in Spark or the Hadoop ecosystem in the near future, maybe
> Spark DataFrame/SQL should support more hints from the end user, since in
> these cases end users are smart enough to tell the engine the correct way
> to do it.
>
> Didn't the relational DBs follow exactly the same path? RBO -> RBO + Hints
> -> CBO?
>
> Yong
>
> ------------------------------
> Date: Thu, 31 Mar 2016 16:07:14 +0530
> Subject: Re: SPARK-13900 - Join with simple OR conditions take too long
> From: hemant9...@gmail.com
> To: ashokkumar.rajend...@gmail.com
> CC: user@spark.apache.org
>
> Hi Ashok,
>
> That's interesting.
>
> As I understand it, a nested loop join is performed on tables A and B (which
> produces m X n candidate rows) and then each row is evaluated to see whether
> any of the conditions is met. You are asking that Spark should instead do a
> BroadcastHashJoin on each equality condition in parallel and then union the
> results, as you are doing in a separate query.
>
> If we leave parallelism aside for a moment, the time taken by the nested
> loop join would, in theory, vary little as the number of conditions
> increases, while the time taken by the solution you are suggesting would
> grow linearly with the number of conditions. So, when there are many
> conditions, the nested loop join would be faster than the solution you
> suggest. Now the question is: how should Spark decide when to do what?
>
>
> Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811>
> www.snappydata.io
>
> On Thu, Mar 31, 2016 at 2:28 PM, ashokkumar rajendran <
> ashokkumar.rajend...@gmail.com> wrote:
>
> Hi,
>
> I have filed ticket SPARK-13900. There was an initial reply from a
> developer, but I have not received any further response since then. How can
> we do multiple hash joins together for OR-condition-based joins? Could
> someone please guide me on how we can fix this?
>
> Regards
> Ashok
>
>
>
