I have been using Spark SQL in cluster mode and I am seeing no distribution
or parallelization of the query execution. Performance is very slow compared
to native Spark applications and offers no speedup over Hive. I am using
Spark 1.1.0 with a cluster of 5 nodes. The query being run consists of 4
subqueries, each of which groups rows by a variable, and the results are all
unioned (I have attached part of the query at the end of this post).

Implementing the same functionality in native Spark (Scala) runs in a
distributed manner as expected, but executing the query in Spark SQL does not
engage the slaves (observed from the master's web UI). The performance is the
same whether I run the query on a cluster of a master and 4 slaves or on a
single-node cluster (master only).
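
For reference, the native version of the "age" subquery has roughly the
following shape. This is a simplified sketch, not the actual job: Person,
TileCounts and ageCounts are illustrative names, the tablea schema is assumed,
and a plain Scala Seq stands in for the RDD so the snippet runs without a
cluster.

```scala
// Hypothetical record type; the real tablea schema is not shown in this post.
case class Person(age: Int, salary: Int)

object TileCounts {
  // Same age buckets as the CASE expression in the query sample.
  def ageTile(age: Int): Int =
    if (age <= 19) 1
    else if (age <= 35) 2
    else if (age <= 51) 3
    else if (age <= 63) 4
    else 5

  // Count rows per (tile, myrandom) pair, like GROUP BY a.tile, a.myrandom.
  // nextInt(5) yields 0..4, the same range as cast((5 * rand(...)) AS int).
  def ageCounts(rows: Seq[Person], seed: Long = 65535L): Map[(Int, Int), Int] = {
    val rng = new scala.util.Random(seed)
    rows
      .map(p => (ageTile(p.age), rng.nextInt(5)))
      .groupBy(identity)
      .map { case (key, group) => key -> group.size }
  }
}
```

On an RDD the same logic would be a map to ((tile, myrandom), 1) pairs
followed by reduceByKey(_ + _), which is the part that runs distributed.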

I would greatly appreciate any pointers on the cause of the issue.


Query Sample:

SELECT *
FROM   (
        SELECT   "age"      AS varname,
                 a.tile     AS catname,
                 a.myrandom,
                 count(*)   AS count
        FROM     (
                  SELECT *,
                         cast((5 * rand(65535)) AS int) AS myrandom,
                         CASE
                           WHEN (age <= 19) THEN 1
                           WHEN (age <= 35) THEN 2
                           WHEN (age <= 51) THEN 3
                           WHEN (age <= 63) THEN 4
                           ELSE 5
                         END AS tile
                  FROM   tablea a) a
        GROUP BY a.tile,
                 a.myrandom
        UNION ALL
        SELECT   "salary"   AS varname,
                 a.tile     AS catname,
                 a.myrandom,
                 count(*)   AS count
        FROM     (
                  SELECT *,
                         cast((5 * rand(65535)) AS int) AS myrandom,
                         CASE
                           WHEN (salary <= 39615) THEN 1
                           WHEN (salary <= 65740) THEN 2
                           WHEN (salary <= 117555) THEN 3
                           ELSE 4
                         END AS tile
                  FROM   tablea a) a
        GROUP BY a.tile,
                 a.myrandom ) unioned;



