I have been using Spark SQL in cluster mode and I am seeing no distribution or parallelization of the query execution. Performance is very slow compared to native Spark applications and offers no speedup over Hive. I am using Spark 1.1.0 on a cluster of 5 nodes. The query being run consists of 4 subqueries; each subquery groups rows by a bucketed variable, and the results are then unioned (part of the query is attached at the end of this post).
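To make the shape of the query concrete, here is a minimal plain-Python sketch of what each subquery computes. It is not the Spark job itself: the sample rows and the helper names `tile` and `grouped_counts` are made up for illustration, and Python's `random` only approximates the behaviour of `rand(65535)`. The tile boundaries are the ones from the query sample.

```python
from collections import Counter
import random

# Upper bounds taken from the CASE expressions in the query sample;
# values above the last bound fall into the final tile.
AGE_BOUNDS = [19, 35, 51, 63]
SALARY_BOUNDS = [39615, 65740, 117555]


def tile(value, bounds):
    """Return the 1-based tile index, mirroring CASE WHEN value <= bound."""
    for i, bound in enumerate(bounds, start=1):
        if value <= bound:
            return i
    return len(bounds) + 1


def grouped_counts(rows, varname, bounds, rng):
    """Mimic one subquery: count rows per (tile, myrandom) for one variable."""
    counts = Counter()
    for row in rows:
        # Roughly cast((5 * rand(...)) AS int): an integer in 0..4.
        myrandom = int(5 * rng.random())
        counts[(tile(row[varname], bounds), myrandom)] += 1
    return [(varname, t, r, n) for (t, r), n in sorted(counts.items())]


# Hypothetical sample rows standing in for tablea.
rows = [
    {"age": 18, "salary": 30000},
    {"age": 30, "salary": 50000},
    {"age": 45, "salary": 90000},
    {"age": 70, "salary": 200000},
]

rng = random.Random(65535)
# UNION ALL is simply the concatenation of the per-variable results.
unioned = (grouped_counts(rows, "age", AGE_BOUNDS, rng)
           + grouped_counts(rows, "salary", SALARY_BOUNDS, rng))
for record in unioned:
    print(record)
```

Each subquery is an independent group-by over the same table, so I would expect the work to split across partitions the way it does in my native Spark version.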
Implementing the same functionality in native Spark (Scala) runs in a distributed manner as expected, but executing the query through Spark SQL does not engage the slaves (as observed in the master's web UI). Performance is the same whether I run the query on a cluster of one master and 4 slaves or on a single-node cluster (master only). I would greatly appreciate any pointers on the cause of the issue.

Query sample:

SELECT *
FROM (
  SELECT "age" AS varname, a.tile AS catname, a.myrandom, count(*) AS count
  FROM (
    SELECT *,
           cast((5 * rand(65535)) AS int) AS myrandom,
           CASE WHEN (age <= 19) THEN 1
                WHEN (age <= 35) THEN 2
                WHEN (age <= 51) THEN 3
                WHEN (age <= 63) THEN 4
                ELSE 5
           END AS tile
    FROM tablea a) a
  GROUP BY a.tile, a.myrandom
  UNION ALL
  SELECT "salary" AS varname, a.tile AS catname, a.myrandom, count(*) AS count
  FROM (
    SELECT *,
           cast((5 * rand(65535)) AS int) AS myrandom,
           CASE WHEN (salary <= 39615) THEN 1
                WHEN (salary <= 65740) THEN 2
                WHEN (salary <= 117555) THEN 3
                ELSE 4
           END AS tile
    FROM tablea a) a
  GROUP BY a.tile, a.myrandom
) unioned;

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Sql-not-using-cluster-slaves-tp21155.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.