I am running a PySpark job over 4 GB of data that is split into 17 Parquet
files on an HDFS cluster. This is all managed through Cloudera Manager.

Here is the query the job is running:


results = sqlContext.sql(
    "SELECT sum(total_impressions), sum(total_clicks) "
    "FROM parquetFileone GROUP BY hour")

I also ran it this way:

mapped = parquetFile.map(lambda row: (str(row.hour),
    (row.total_impressions, row.total_clicks)))
counts = mapped.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
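
To make clear what both versions are meant to compute, here is a minimal plain-Python sketch of the same per-hour aggregation. The sample rows, the Row shape, and the reduce_by_key helper are made up for illustration; this is not how Spark implements it, only the same fold applied locally.

```python
from collections import namedtuple

# Hypothetical row shape; only the three columns used in the job are modeled.
Row = namedtuple("Row", ["hour", "total_impressions", "total_clicks"])

# Made-up sample data, not taken from the real parquet files.
rows = [
    Row(0, 100, 3),
    Row(1, 250, 7),
    Row(0, 50, 1),
    Row(1, 10, 0),
]

def reduce_by_key(pairs, func):
    """Local stand-in for RDD.reduceByKey: fold values that share a key."""
    out = {}
    for key, value in pairs:
        out[key] = func(out[key], value) if key in out else value
    return out

# Same map and reduce functions as in the job.
mapped = [(str(r.hour), (r.total_impressions, r.total_clicks)) for r in rows]
counts = reduce_by_key(mapped, lambda a, b: (a[0] + b[0], a[1] + b[1]))
print(counts)
```

Each key is the hour as a string, and each value is a pair of (impressions, clicks) summed over all rows for that hour.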

My run times were anywhere from 8 to 10 minutes.

I am wondering if there is a configuration setting that needs to be tweaked,
or if this is the expected response time.
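
For concreteness, the kind of settings in question could be passed as sketched below. The property names are standard Spark ones, but every value, the build_submit_args helper, and the script name job.py are placeholders for illustration, not the actual configuration of this job and not a recommendation.

```python
# Placeholder values only; they would need tuning against the real cluster.
settings = {
    "spark.executor.memory": "8g",
    "spark.executor.cores": "4",
    "spark.sql.shuffle.partitions": "48",
}

def build_submit_args(conf, script):
    """Build a spark-submit argument list with one --conf flag per property."""
    args = ["spark-submit"]
    for key in sorted(conf):
        args += ["--conf", "{}={}".format(key, conf[key])]
    args.append(script)
    return args

# job.py is a hypothetical script name.
cmd = build_submit_args(settings, "job.py")
print(" ".join(cmd))
```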

The machines have 30 GB of RAM and 4 cores each. It seems the CPUs are just
getting pegged, and that is what is taking so long.

Any help on this would be amazing.




