Another possibility is the parallelism: it is probably 1 or some other small
value, since the input data is not that big.

If that is the case, you could try something like:

df1.repartition(10).registerTempTable("hospitals");
df2.repartition(10).registerTempTable("counties");
…
And then do the join.
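An alternative to repartitioning each DataFrame is to lower the shuffle parallelism globally. This is only a sketch for Spark 1.x: the `spark.sql.shuffle.partitions` setting is real (it defaults to 200), but the value 10 here is just an illustration for small inputs, not a recommendation.

```java
// Configuration sketch (Spark 1.x SQLContext API): shrink the number of
// post-shuffle partitions so a small join does not fan out into 200 tiny tasks.
sqlCtx.setConf("spark.sql.shuffle.partitions", "10"); // default is "200"
```

Either way, the goal is the same: match the task count to the (small) data size.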


From: Raghu Ganti [mailto:raghuki...@gmail.com]
Sent: Thursday, January 28, 2016 3:06 AM
To: Ted Yu; Дмитро Попович
Cc: user
Subject: Re: Spark SQL joins taking too long

The problem is with the way the Spark query plan is created, IMO. What was
happening before is that the order of the tables mattered: when the larger
table was given first, the query took a very long time (~53 min to complete).
I changed the order so that the smaller table comes first (and replaced the
one-row table with the full one) and modified the query to look like this:

SELECT c.NAME, h.name FROM counties c, hospitals h WHERE c.NAME
= 'Dutchess' AND ST_Intersects(c.shape, h.location)
With the above query, things worked like a charm (<1 min to finish the entire
execution and join 3141 polygons with 6.5k points).
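One plausible explanation, assuming Spark 1.x planning behavior: a join whose only condition is a UDF gives the optimizer no equality key to hash or sort on, so it likely falls back to a cartesian (nested-loop) join and evaluates the UDF once per pair of rows. A back-of-the-envelope count using the figures above:

```java
// Rough cost of a nested-loop join over the row counts quoted in this
// thread: every county polygon is paired with every hospital point,
// and the spatial UDF runs once per pair.
public class JoinCost {
    public static void main(String[] args) {
        long counties = 3141;  // polygons
        long hospitals = 6500; // points (~6.5k)
        long udfCalls = counties * hospitals;
        System.out.println(udfCalls + " UDF evaluations"); // 20416500
    }
}
```

An extra equality predicate such as c.NAME = 'Dutchess' lets the planner prune one side before the pairing, which would be consistent with the <1 min runtime.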
Do let me know if you need more info in order to pinpoint the issue.
Regards,
Raghu

On Tue, Jan 26, 2016 at 5:13 PM, Ted Yu 
<yuzhih...@gmail.com> wrote:
What's the type of shape column ?

Can you disclose what SomeUDF does (by showing the code) ?

Cheers

On Tue, Jan 26, 2016 at 12:41 PM, raghukiran 
<raghuki...@gmail.com> wrote:
Hi,

I create two tables: counties, with just one row (it actually has 2k
rows, but I used only one), and hospitals, which has 6k rows. The
join command I use is as follows; it takes way too long to run and has
never finished successfully (even after nearly 10 min). The following is
what I have:

DataFrame df1 = ...
df1.registerTempTable("hospitals");
DataFrame df2 = ...
df2.registerTempTable("counties"); //has only one row right now
DataFrame joinDf = sqlCtx.sql("SELECT h.name, c.name FROM hospitals h JOIN
counties c ON SomeUDF(c.shape, h.location)");
long count = joinDf.count(); //this takes too long!

//whereas the following, which is the exact equivalent of the above, gets done
//very quickly!
DataFrame joinDf = sqlCtx.sql("SELECT h.name FROM hospitals WHERE
SomeUDF('c.shape as string', h.location)");
long count = joinDf.count(); //gives me the correct answer of 8

Any suggestions on what I can do to optimize and debug this piece of code?

Regards,
Raghu



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-joins-taking-too-long-tp26078.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

