Hi, I am a graduate student at Virginia Tech (USA) pursuing my Masters in Computer Science. I have been researching parallel and distributed databases and their performance on range queries involving simple joins and group-by over large datasets. As part of my research, I evaluated the query performance of Spark SQL on my dataset. It would be really great if you could confirm whether the numbers I get from Spark SQL look reasonable. The following is the type of query I am running:
Table 1: 22,000,483 records
Table 2: 10,173,311 records

Query:
SELECT b.x, count(b.y)
FROM Table1 a, Table2 b
WHERE a.y = b.y AND a.z = 'xxxx'
GROUP BY b.x
ORDER BY b.x

Total running time:
4 worker nodes: 177.68 s
8 worker nodes: 186.72 s

I am using Apache Spark 1.3.0 with the default configuration. Are these running times reasonable? Is the lack of indexes what is increasing the query run time? Can you please clarify?

Thanks,
Mani
Graduate Student, Department of Computer Science
Virginia Tech
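For reference, the intended semantics of the query (filter Table1 on z, equi-join on y, then count per b.x, ordered by b.x) can be sketched in plain Python over toy data; the table contents below are made up purely for illustration and are not from the actual dataset:

```python
from collections import Counter

# Toy stand-ins for Table1 (columns y, z) and Table2 (columns x, y);
# values are illustrative only.
table1 = [
    {"y": 1, "z": "xxxx"},
    {"y": 2, "z": "other"},
    {"y": 3, "z": "xxxx"},
]
table2 = [
    {"x": "a", "y": 1},
    {"x": "a", "y": 3},
    {"x": "b", "y": 1},
    {"x": "b", "y": 2},
]

# Multiplicity of each join key y among Table1 rows passing the z filter,
# so duplicate keys on the Table1 side are counted as the SQL join would.
mult = Counter(r["y"] for r in table1 if r["z"] == "xxxx")

# For each Table2 row, add the number of matching Table1 rows to its x group.
counts = Counter()
for r in table2:
    counts[r["x"]] += mult[r["y"]]

# Keep only groups that actually matched (inner join), ordered by x.
result = sorted((x, c) for x, c in counts.items() if c > 0)
print(result)  # [('a', 2), ('b', 1)]
```

This is just a semantic sanity check of the query, not how Spark SQL executes it; with no indexes available, Spark 1.3 would typically perform a shuffle-based join here.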