Hi, I am a graduate student at Virginia Tech (USA) pursuing my Masters in Computer Science. I have been researching parallel and distributed databases and their performance on range queries involving simple joins and group-by over large datasets. As part of my research, I evaluated the query performance of Spark SQL on my dataset. It would be really great if you could confirm whether the numbers I get from Spark SQL look reasonable. The following is the type of query I am running:
Table 1: 22,000,483 records
Table 2: 10,173,311 records

Query:
SELECT b.x, count(b.y)
FROM Table1 a, Table2 b
WHERE a.y = b.y AND a.z = 'xxxx'
GROUP BY b.x
ORDER BY b.x

Total running time:
4 worker nodes: 177.68 s
8 worker nodes: 186.72 s

I am using Apache Spark 1.3.0 with the default configuration. Are these running times reasonable? Is the lack of indexes what is increasing the query run time? Can you please clarify?

Thanks,
Mani
Graduate Student, Department of Computer Science
Virginia Tech
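For reference, the intended semantics of the query (filter Table1 on z, equi-join on y, then count per b.x, ordered by b.x) can be sketched in plain Python over toy data; the table contents below are made up purely for illustration and are not from the actual dataset:

```python
from collections import Counter

# Toy stand-ins for Table1 (columns y, z) and Table2 (columns x, y);
# values are illustrative only.
table1 = [
    {"y": 1, "z": "xxxx"},
    {"y": 2, "z": "other"},
    {"y": 3, "z": "xxxx"},
]
table2 = [
    {"x": "a", "y": 1},
    {"x": "a", "y": 3},
    {"x": "b", "y": 1},
    {"x": "b", "y": 2},
]

# Multiplicity of each join key y among Table1 rows passing the z filter,
# so duplicate keys on the Table1 side are counted as the SQL join would.
mult = Counter(r["y"] for r in table1 if r["z"] == "xxxx")

# For each Table2 row, add the number of matching Table1 rows to its x group.
counts = Counter()
for r in table2:
    counts[r["x"]] += mult[r["y"]]

# Keep only groups that actually matched (inner join), ordered by x.
result = sorted((x, c) for x, c in counts.items() if c > 0)
print(result)  # [('a', 2), ('b', 1)]
```

This is just a semantic sanity check of the query, not how Spark SQL executes it; with no indexes available, Spark 1.3 would typically perform a shuffle-based join here.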