I am running a PySpark job over 4 GB of data that is split into 17 Parquet
files on an HDFS cluster. This is all managed through Cloudera Manager.
Here is the query the job is running:
parquetFile.registerTempTable("parquetFileone")
results = sqlContext.sql("""SELECT sum(total_impressions), sum(total_clicks)
FROM
That sounds slow to me.
It looks like your SQL query is grouping by a column that isn't in the
projection; I'm a little surprised that even works. But are you getting
the same time when you do the reduction manually?
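For what it's worth, grouping by a column that isn't in the select list is legal in standard SQL (it's projecting a *non-grouped* column that's usually disallowed). A minimal sketch of that semantics, using sqlite3 rather than Spark SQL, with hypothetical column names mirroring the query above:

```python
import sqlite3

# Hypothetical table mirroring the impressions/clicks columns in the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE imps (site TEXT, total_impressions INTEGER, total_clicks INTEGER)"
)
conn.executemany(
    "INSERT INTO imps VALUES (?, ?, ?)",
    [("a", 10, 1), ("a", 20, 2), ("b", 5, 1)],
)

# GROUP BY a column that does not appear in the SELECT list: valid SQL.
# You get one aggregated row per site, but the site value itself is not returned.
rows = conn.execute(
    "SELECT sum(total_impressions), sum(total_clicks) FROM imps GROUP BY site"
).fetchall()
print(sorted(rows))  # → [(5, 1), (30, 3)]
```

Spark SQL accepts the same shape of query; the practical catch is just that you can't tell which group each output row belongs to unless you also project the grouping column.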
Have you looked at the shuffle amounts in the UI for the job? Are you
certain there aren't a