I agree with Hemant's comment, but the current behaviour does not give good results for simple use cases like two OR conditions, and ultimately end users need good results from Spark. Shall we consider this a request to support SQL hints then? Is there any plan to support SQL hints in Spark in an upcoming release?
Regards
Ashok

On Fri, Apr 1, 2016 at 5:04 PM, Robin East <robin.e...@xense.co.uk> wrote:

> Yes, and even today a CBO (e.g. in Oracle) will still require hints in some
> cases, so I think it is more like: RBO -> RBO + Hints -> CBO + Hints. Most
> relational databases hit significant numbers of corner cases where CBO
> plans simply don’t do what you would want. I don’t know enough about Spark
> SQL to comment on whether the same problems would afflict Spark.
>
> On 31 Mar 2016, at 15:54, Yong Zhang <java8...@hotmail.com> wrote:
>
> I agree that there won't be a generic solution for these kinds of cases.
>
> Without a CBO from Spark or the Hadoop ecosystem in the short term, maybe
> Spark DataFrame/SQL should support more hints from the end user, as in
> these cases end users will be smart enough to tell the engine the correct
> way to do it.
>
> Weren't the relational DBs on exactly the same path? RBO -> RBO + Hints ->
> CBO?
>
> Yong
>
> ------------------------------
> Date: Thu, 31 Mar 2016 16:07:14 +0530
> Subject: Re: SPARK-13900 - Join with simple OR conditions take too long
> From: hemant9...@gmail.com
> To: ashokkumar.rajend...@gmail.com
> CC: user@spark.apache.org
>
> Hi Ashok,
>
> That's interesting.
>
> As I understand it, a nested loop join (which produces m x n rows) is
> performed on tables A and B, and then each row is evaluated to see if any
> of the conditions is met. You are asking that Spark should instead do a
> BroadcastHashJoin on each equality condition in parallel and then union
> the results, as you are doing in a separate query.
>
> If we leave parallelism aside for a moment, theoretically the time taken
> by a nested loop join would vary little as the number of conditions
> increases, while the time taken by the solution you suggest would grow
> linearly with the number of conditions. So, when there are too many
> conditions, the nested loop join would be faster than the solution you
> suggest.
>
> Now the question is: how should Spark decide when to do what?
>
> Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811>
> www.snappydata.io
>
> On Thu, Mar 31, 2016 at 2:28 PM, ashokkumar rajendran <
> ashokkumar.rajend...@gmail.com> wrote:
>
> Hi,
>
> I have filed ticket SPARK-13900. There was an initial reply from a
> developer, but I did not get any further reply on this. How can we do
> multiple hash joins together for OR-condition-based joins? Could someone
> please guide us on how we can fix this?
>
> Regards
> Ashok
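To make the trade-off Hemant describes concrete, here is a minimal sketch in plain Python (not Spark; table layout and names are illustrative only) of the two strategies: a nested-loop join evaluating the OR predicate on every pair, versus one hash equi-join per OR branch followed by a dedup'ing union. It shows the two produce the same result set, which is why the rewrite is valid at all; the question the thread raises is only which one is cheaper for a given number of conditions.

```python
# Rows are (id, k1, k2) tuples; the join predicate is
#   left.k1 == right.k1 OR left.k2 == right.k2
left  = [(1, "a", "x"), (2, "b", "y"), (3, "c", "z")]
right = [(10, "a", "q"), (11, "d", "y"), (12, "e", "r")]

def nested_loop_or_join(l, r):
    # Strategy 1: evaluate the OR predicate on all m x n pairs.
    return {(lr, rr) for lr in l for rr in r
            if lr[1] == rr[1] or lr[2] == rr[2]}

def union_of_hash_joins(l, r):
    # Strategy 2: one hash equi-join per OR branch; the set union
    # dedups pairs that satisfy more than one branch (like SQL UNION).
    out = set()
    for key in (1, 2):                      # join on k1, then on k2
        index = {}
        for rr in r:                        # build hash index on right
            index.setdefault(rr[key], []).append(rr)
        for lr in l:                        # probe with left rows
            for rr in index.get(lr[key], []):
                out.add((lr, rr))
    return out

# Both strategies agree on the result set.
assert nested_loop_or_join(left, right) == union_of_hash_joins(left, right)
print(len(nested_loop_or_join(left, right)))  # prints 2
```

Note how the sketch also reflects the cost argument from the thread: the nested loop does m x n predicate evaluations regardless of how many OR branches there are, while strategy 2 repeats a build-and-probe pass per branch, so its cost grows linearly with the number of conditions.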