I have a query like the one below. Could you give me some pointers on how to
start optimizing it with Spark SQL?


sqlContext.sql("
SELECT dw.DAY_OF_WEEK, dw.HOUR, avg(dw.SDP_USAGE) AS AVG_SDP_USAGE
FROM (
        SELECT sdp.WID, DAY_OF_WEEK, HOUR, SUM(INTERVAL_VALUE) AS SDP_USAGE
        FROM (
                SELECT *
                FROM date_d dd JOIN interval_f intf
                ON intf.DATE_WID = dd.WID
                WHERE intf.DATE_WID >= 20141116 AND intf.DATE_WID <= 20141125
                  AND CAST(INTERVAL_END_TIME AS STRING) >= '2014-11-16 00:00:00.000'
                  AND CAST(INTERVAL_END_TIME AS STRING) <= '2014-11-26 00:00:00.000'
                  AND MEAS_WID = 3
             ) test JOIN sdp_d sdp
        ON test.SDP_WID = sdp.WID
        WHERE sdp.UDC_ID = 'SP-1931201848'
        GROUP BY sdp.WID, DAY_OF_WEEK, HOUR, sdp.UDC_ID
     ) dw
GROUP BY dw.DAY_OF_WEEK, dw.HOUR")



Currently the query takes about 15 minutes to execute. The interval_f table
holds roughly 170 GB of raw data, date_d about 170 MB, and sdp_d about 490 MB.
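
As a rough sketch of the kind of restructuring I was wondering about, would it
make sense to pre-filter and cache the two small dimension tables first, so
that interval_f is only joined against the reduced inputs? This is just an
idea, not something I have verified: it assumes the three tables are already
registered with the SQLContext, that DAY_OF_WEEK and HOUR come from date_d,
that SDP_WID is a column of interval_f, and the temp table names sdp_small and
date_small are ones I made up.

// Sketch only (names and schema assumptions as described above):
// pre-filter the small dimension tables and cache them, so the ~170 GB
// interval_f fact table is joined against already-reduced inputs.

// Keep only the single service point of interest (sdp_d is ~490 MB).
sqlContext.sql(
  "SELECT WID, UDC_ID FROM sdp_d WHERE UDC_ID = 'SP-1931201848'")
  .registerTempTable("sdp_small")
sqlContext.cacheTable("sdp_small")

// Keep only the ten days of interest (date_d is ~170 MB).
sqlContext.sql(
  "SELECT WID, DAY_OF_WEEK, HOUR FROM date_d " +
  "WHERE WID >= 20141116 AND WID <= 20141125")
  .registerTempTable("date_small")
sqlContext.cacheTable("date_small")

// The aggregate then references the cached, pre-filtered temp tables
// instead of the full dimension tables.
val result = sqlContext.sql("""
  SELECT dw.DAY_OF_WEEK, dw.HOUR, AVG(dw.SDP_USAGE) AS AVG_SDP_USAGE
  FROM (
    SELECT sdp.WID, dd.DAY_OF_WEEK, dd.HOUR, SUM(intf.INTERVAL_VALUE) AS SDP_USAGE
    FROM interval_f intf
      JOIN date_small dd ON intf.DATE_WID = dd.WID
      JOIN sdp_small sdp ON intf.SDP_WID = sdp.WID
    WHERE intf.MEAS_WID = 3
      AND CAST(intf.INTERVAL_END_TIME AS STRING) >= '2014-11-16 00:00:00.000'
      AND CAST(intf.INTERVAL_END_TIME AS STRING) <= '2014-11-26 00:00:00.000'
    GROUP BY sdp.WID, dd.DAY_OF_WEEK, dd.HOUR
  ) dw
  GROUP BY dw.DAY_OF_WEEK, dw.HOUR""")

Does that kind of approach make sense here, or is there a better first step?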


