Spark response times for queries seem slow

2015-01-05 Thread Sam Flint
I am running a PySpark job over 4 GB of data that is split into 17 Parquet files on an HDFS cluster. This is all in Cloudera Manager. Here is the query the job is running: parquetFile.registerTempTable(parquetFileone) results = sqlContext.sql(SELECT sum(total_impressions), sum(total_clicks) FROM

Re: Spark response times for queries seem slow

2015-01-05 Thread Cody Koeninger
That sounds slow to me. It looks like your SQL query is grouping by a column that isn't in the projection; I'm a little surprised that even works. But you're getting the same time when reducing manually? Have you looked at the shuffle amounts in the UI for the job? Are you certain there aren't a
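
To illustrate the point about grouping by a column that is not in the projection, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for Spark SQL so that it runs without a cluster. The table name events, the grouping column campaign_id, and the sample rows are all hypothetical; only total_impressions and total_clicks come from the query in the thread. Spark SQL may treat this case differently from SQLite, so take it as an illustration of the query shape only.

```python
import sqlite3

# Stand-in for Spark SQL: the same aggregate shape on an in-memory SQLite table.
# "events" and "campaign_id" are hypothetical names; only total_impressions
# and total_clicks appear in the original query.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "campaign_id TEXT, total_impressions INTEGER, total_clicks INTEGER)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("a", 100, 3), ("a", 50, 1), ("b", 200, 7)],
)

# Grouping column left out of the projection: one row per group,
# but nothing in the row says which group it belongs to.
unlabeled = conn.execute(
    "SELECT sum(total_impressions), sum(total_clicks) "
    "FROM events GROUP BY campaign_id ORDER BY campaign_id"
).fetchall()

# Grouping column included in the projection: each row carries its group key.
labeled = conn.execute(
    "SELECT campaign_id, sum(total_impressions), sum(total_clicks) "
    "FROM events GROUP BY campaign_id ORDER BY campaign_id"
).fetchall()

print(unlabeled)  # [(150, 4), (200, 7)]
print(labeled)    # [('a', 150, 4), ('b', 200, 7)]
```

Both queries do the same aggregation work; the second simply makes the result interpretable by keeping the group key in each row.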