If you only need the group by in the same hierarchy logic, then you can group by at the lowest level, and cache it, then use the cached DF to derive to the higher level, so Spark will only scan the originally table once, and reuse the cache in the following.
val df_base = sqlContext.sql("select col1,col2,col3,col4,col5, count(*) from table groupby col1,col2,col3,col4,col5").cache df_base.registerTempTable("df_base") val df1 = sqlContext.sql("select col1, col2, count(*) from df_base group by col1, col2") val df2 = // similar logic Yong ________________________________ From: Patrick <titlibat...@gmail.com> Sent: Saturday, February 18, 2017 4:23 PM To: user Subject: Efficient Spark-Sql queries when only nth Column changes Hi, I have read 5 columns from parquet into data frame. My queries on the parquet table is of below type: val df1 = sqlContext.sql(select col1,col2,count(*) from table groupby col1,col2) val df2 = sqlContext.sql(select col1,col3,count(*) from table groupby col1,col3) val df3 = sqlContext.sql(select col1,col4,count(*) from table groupby col1,col4) val df4 = sqlContext.sql(select col1,col5,count(*) from table groupby col1,col5) And then i require to union the results from df1 to df4 into a single df. So basically, only the second column is changing, Is there any efficient way to write the above queries in Spark-Sql instead of writing 4 different queries(OR in loop) and doing union to get the result. Thanks