Re: Spark SQL and Skewed Joins

2015-06-17 Thread Yin Huai
Hi John, Did you also set spark.sql.planner.externalSort to true? Probably you will not see executor lost with this conf. For now, maybe you can manually split the query to two parts, one for skewed keys and one for other records. Then, you union then results of these two parts together. Thanks,

Re: Spark SQL and Skewed Joins

2015-06-17 Thread Koert Kuipers
could it be composed maybe? a general version and then a sql version that exploits the additional info/abilities available there and uses the general version internally... i assume the sql version can benefit from the logical phase optimization to pick join details. or is there more? On Tue, Jun

Re: Spark SQL and Skewed Joins

2015-06-16 Thread Koert Kuipers
a skew join (where the dominant key is spread across multiple executors) is pretty standard in other frameworks, see for example in scalding: https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/JoinAlgorithms.scala this would be a great addition to

Re: Spark SQL and Skewed Joins

2015-06-16 Thread Jon Walton
On Fri, Jun 12, 2015 at 9:43 PM, Michael Armbrust mich...@databricks.com wrote: 2. Does 1.3.2 or 1.4 have any enhancements that can help? I tried to use 1.3.1 but SPARK-6967 prohibits me from doing so.Now that 1.4 is available, would any of the JOIN enhancements help this situation? I

Re: Spark SQL and Skewed Joins

2015-06-16 Thread Michael Armbrust
this would be a great addition to spark, and ideally it belongs in spark core not sql. I agree with the fact that this would be a great addition, but we would likely want a specialized SQL implementation for performance reasons.

Re: Spark SQL and Skewed Joins

2015-06-12 Thread Michael Armbrust
2. Does 1.3.2 or 1.4 have any enhancements that can help? I tried to use 1.3.1 but SPARK-6967 prohibits me from doing so.Now that 1.4 is available, would any of the JOIN enhancements help this situation? I would try Spark 1.4 after running SET spark.sql.planner.sortMergeJoin=true.

Spark SQL and Skewed Joins

2015-06-12 Thread Jon Walton
Greetings, I am trying to implement a classic star schema ETL pipeline using Spark SQL, 1.2.1. I am running into problems with shuffle joins, for those dimension tables which have skewed keys and are too large to let Spark broadcast them. I have a few questions 1. Can I split my queries so a