This is likely because Spark SQL first shuffles both tables involved in the join, using the join condition as the key.
I have a specific use case in which I always join on the column "id", and I have an optimisation lined up for it: I can cache the data pre-partitioned on the join key "id" and avoid the shuffle by passing the partition information to the in-memory cache. See https://issues.apache.org/jira/browse/SPARK-4849

Thanks,
-Nitin
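For anyone following along, the pre-partitioning idea can be sketched with plain pair RDDs: if both sides of a join are partitioned with the same partitioner and cached, the join becomes a narrow dependency and no shuffle is needed. This is only an illustrative sketch of the general technique, not the SPARK-4849 change itself; the dataset names and partition count below are made up.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Minimal local sketch; "users"/"orders" and the partition count (4) are
// illustrative assumptions, not from the original post.
val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("copartitioned-join"))

val partitioner = new HashPartitioner(4)

// Pre-partition both sides on the join key "id" and cache the result,
// so the partitioning survives across multiple joins.
val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
  .partitionBy(partitioner).cache()
val orders = sc.parallelize(Seq((1, "book"), (2, "pen")))
  .partitionBy(partitioner).cache()

// Both RDDs share the same partitioner, so join() co-locates matching
// partitions as a narrow dependency instead of shuffling either side.
val joined = users.join(orders)
val results = joined.collect().toMap

results.foreach(println)
sc.stop()
```

Repeated joins on "id" then reuse the cached, already-partitioned data, which is exactly the shuffle that the pre-partitioning saves.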