Data Size: 167 MB

Schema:

case class GuidSess(
  guid: String,
  sessionKey: String,
  sessionStartDate: String,
  siteId: String,
  eventCount: String,
  browser: String,
  browserVersion: String,
  operatingSystem: String,
  experimentChannel: String,
  deviceName: String)

%sql
select siteid,
       count(distinct guid) as total_visitor,
       count(sessionKey) as total_visits
from guidsess
group by siteid

This started a job with 1 stage, 1 task, and 1 executor. The task failed twice, so I clicked the || (pause) button in Zeppelin and cancelled the job. In the logs I see:

java.lang.OutOfMemoryError: GC overhead limit exceeded

Questions:
1) How can I increase the number of executors? I thought this was dynamic and handled by Zeppelin itself.
2) How can I increase the memory for each executor?

Parallel questions:
1) An analytics user (the one who fires all the SQL-like queries) does not, and will not, have knowledge of the underlying processing framework (Spark/Teradata/Hadoop/XYZ). Once the data is loaded (via code written by a Spark/Scala developer), the analytics person should be able to run any kind of SQL query to analyze the data.
2) Shouldn't Zeppelin handle the Spark optimization/tuning parameters itself?
3) The queries are dynamic and will change. If each query requires tuning Spark, that could become an involved activity.

Please suggest.

-Deepak
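For reference, here is a sketch of the settings I believe control this, set either in Zeppelin's Spark interpreter properties (Interpreter menu) or via SPARK_SUBMIT_OPTIONS in conf/zeppelin-env.sh. The property names are the standard Spark configuration keys; the values are only illustrative:

```properties
# Heap size per executor (Spark's default is 1g).
spark.executor.memory=4g

# Fixed number of executors requested from the cluster manager.
spark.executor.instances=4

# Alternatively, instead of a fixed count, let Spark scale the number of
# executors up and down (requires the external shuffle service to be
# enabled on the cluster manager):
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.maxExecutors=8
```

After changing interpreter properties, the Spark interpreter has to be restarted for them to take effect.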