Another possibility is the parallelism setting: it may be 1 or some other small value, since the input data size is not that big.
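If that is the issue, the relevant knobs can also be set via configuration; a minimal sketch, assuming Spark 1.x property names (spark.sql.shuffle.partitions defaults to 200, which is often far too many for small inputs; the values below are illustrative, not tuned):

```
# spark-defaults.conf fragment (assumed Spark 1.x property names)
spark.default.parallelism      10
spark.sql.shuffle.partitions   10
```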
If that is the case, you can try something like:

Df1.repartition(10).registerTempTable("hospitals");
Df2.repartition(10).registerTempTable("counties");

and then do the join.

From: Raghu Ganti [mailto:raghuki...@gmail.com]
Sent: Thursday, January 28, 2016 3:06 AM
To: Ted Yu; Дмитро Попович
Cc: user
Subject: Re: Spark SQL joins taking too long

The problem is with the way the Spark query plan is being created. IMO, what was happening before is that the order of the tables mattered: when the larger table was given first, the query took a very long time (~53 minutes to complete). I changed the order of the tables to put the smaller one first (and replaced the one-row table with the full one) and modified the query to look like this:

SELECT c.NAME, h.name FROM counties c, hospitals h WHERE c.NAME = 'Dutchess' AND ST_Intersects(c.shape, h.location)

With the above query, things worked like a charm (<1 minute to finish the entire execution and join 3141 polygons with 6.5k points). Do let me know if you need more info in order to pinpoint the issue.

Regards, Raghu

On Tue, Jan 26, 2016 at 5:13 PM, Ted Yu <yuzhih...@gmail.com> wrote:

What's the type of the shape column? Can you disclose what SomeUDF does (by showing the code)?

Cheers

On Tue, Jan 26, 2016 at 12:41 PM, raghukiran <raghuki...@gmail.com> wrote:

Hi,

I create two tables: counties, with just one row (it actually has 2k rows, but I used only one), and hospitals, which has 6k rows. The join command I use is as follows; it takes way too long to run and has never finished successfully (even after nearly 10 minutes). The following is what I have:

DataFrame df1 = ...
df1.registerTempTable("hospitals");
DataFrame df2 = ...
df2.registerTempTable("counties"); // has only one row right now

// This takes too long!
DataFrame joinDf = sqlCtx.sql("SELECT h.name, c.name FROM hospitals h JOIN counties c ON SomeUDF(c.shape, h.location)");
long count = joinDf.count();

// Whereas the following, which is the exact equivalent of the above, finishes
// very quickly and gives the correct answer of 8:
DataFrame joinDf = sqlCtx.sql("SELECT h.name FROM hospitals h WHERE SomeUDF('c.shape as string', h.location)");
long count = joinDf.count();

Any suggestions on what I can do to optimize and debug this piece of code?

Regards, Raghu

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-joins-taking-too-long-tp26078.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
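The slowdown described in the thread is consistent with how a join on an opaque UDF predicate must be evaluated: the planner cannot turn `ON SomeUDF(c.shape, h.location)` into a hash or sort-merge join, so every (county, hospital) pair is tested. The toy model below (plain Java, not Spark; row counts taken from the thread, the predicate itself is an arbitrary stand-in) counts UDF evaluations to show why applying the cheap `c.NAME = 'Dutchess'` equality filter before the UDF makes such a difference:

```java
// Toy cost model (not Spark) of a join whose only condition is an opaque UDF:
// the optimizer cannot index or partition on it, so evaluation degenerates
// into a nested loop over all pairs.
public class UdfJoinCost {

    // Stand-in for SomeUDF / ST_Intersects; the planner treats it as a black box.
    static boolean someUdf(int countyShape, int hospitalLocation) {
        return countyShape == hospitalLocation % 2000; // arbitrary toy predicate
    }

    // Counts UDF evaluations for a nested-loop join of the two "tables".
    static long udfCallsForJoin(int nCounties, int nHospitals) {
        long calls = 0;
        for (int c = 0; c < nCounties; c++) {
            for (int h = 0; h < nHospitals; h++) {
                someUdf(c, h);
                calls++;
            }
        }
        return calls;
    }

    public static void main(String[] args) {
        // Full join with the thread's row counts: ~2k counties x ~6k hospitals.
        System.out.println(udfCallsForJoin(2000, 6000)); // 12000000 UDF calls
        // With the cheap c.NAME = 'Dutchess' filter applied first, only one
        // county reaches the UDF, as in the rewritten query above.
        System.out.println(udfCallsForJoin(1, 6000));    // 6000 UDF calls
    }
}
```

This is only a sketch of the evaluation cost, but it matches the observed behavior: the rewritten query that lets Spark prune counties with an equality predicate before calling the UDF finishes in under a minute, while the UDF-only join condition does not.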