If all of your aggregations share the same hierarchy logic, you can group by
at the lowest level (the full set of columns), cache that result, and then
derive the higher-level aggregates from the cached DataFrame. Spark will scan
the original table only once and reuse the cache for everything that follows.


val df_base = sqlContext.sql("select col1, col2, col3, col4, col5, count(*) as cnt " +
  "from table group by col1, col2, col3, col4, col5").cache

df_base.registerTempTable("df_base")

// Sum the pre-aggregated counts; count(*) against df_base would count
// distinct (col3, col4, col5) combinations, not the original rows.
val df1 = sqlContext.sql("select col1, col2, sum(cnt) from df_base group by col1, col2")

val df2 = // similar logic
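To complete the sketch with the queries from Patrick's mail (this assumes the
cnt alias introduced above; unionAll is the Spark 1.x API matching
sqlContext/registerTempTable here, renamed to union in 2.x):

// Each rollup re-aggregates the cached base instead of rescanning the table.
val df2 = sqlContext.sql("select col1, col3, sum(cnt) from df_base group by col1, col3")
val df3 = sqlContext.sql("select col1, col4, sum(cnt) from df_base group by col1, col4")
val df4 = sqlContext.sql("select col1, col5, sum(cnt) from df_base group by col1, col5")

// unionAll matches columns by position, so the differing second columns
// (col2..col5) line up; the result takes df1's column names.
val combined = df1.unionAll(df2).unionAll(df3).unionAll(df4)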

Yong
________________________________
From: Patrick <titlibat...@gmail.com>
Sent: Saturday, February 18, 2017 4:23 PM
To: user
Subject: Efficient Spark-Sql queries when only nth Column changes

Hi,

I have read 5 columns from parquet into a data frame. My queries on the
parquet table are of the type below:

val df1 = sqlContext.sql("select col1, col2, count(*) from table group by col1, col2")
val df2 = sqlContext.sql("select col1, col3, count(*) from table group by col1, col3")
val df3 = sqlContext.sql("select col1, col4, count(*) from table group by col1, col4")
val df4 = sqlContext.sql("select col1, col5, count(*) from table group by col1, col5")

And then I need to union the results from df1 to df4 into a single df.


So basically, only the second column changes. Is there an efficient way to
write the above queries in Spark-SQL instead of writing 4 different queries
(or a loop) and unioning the results? A sketch of such a loop follows below.
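
A minimal sketch of that loop, assuming the table and column names above
(note that without caching, each iteration still scans the parquet table):

val dfs = Seq("col2", "col3", "col4", "col5").map { c =>
  sqlContext.sql(s"select col1, $c, count(*) from table group by col1, $c")
}
val combined = dfs.reduce(_ unionAll _)  // unionAll is positional in Spark 1.x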


Thanks