Data Size: 167 MB

Schema:

 case class GuidSess(
     guid: String,
     sessionKey: String,
     sessionStartDate: String,
     siteId: String,
     eventCount: String,
     browser: String,
     browserVersion: String,
     operatingSystem: String,
     experimentChannel: String,
     deviceName: String)


%sql
select siteid, count(distinct guid) as total_visitor,
       count(sessionKey) as total_visits
from guidsess
group by siteid


This started a job with one stage, one task, and one executor.

The task failed twice, so I clicked the pause (||) button in Zeppelin and
cancelled the job. In the logs I see

java.lang.OutOfMemoryError: GC overhead limit exceeded

Questions
1) How can I increase the number of executors? I thought this was dynamic
and handled by Zeppelin itself.
2) How can I increase the memory for each executor?
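For reference, these are the Spark properties I believe are relevant here (an assumption on my part; in Zeppelin they would go in the spark interpreter settings or spark-defaults.conf, and the values below are only examples, not recommendations):

```
# Example values only -- tune to the cluster.
spark.executor.instances         4      # fixed executor count; ignored when dynamic allocation is on
spark.executor.memory            4g     # heap memory per executor
spark.dynamicAllocation.enabled  true   # alternatively, let Spark scale the executor count itself
```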

Parallel Questions
1) An analytics user (the one who fires all the SQL-like queries) will not
have knowledge of the underlying processing framework
(Spark/Teradata/Hadoop/XYZ). Once the data is loaded (by code written by a
Spark/Scala developer), the analytics person should be able to run any kind
of SQL query to analyze the data.
2) Shouldn't Zeppelin handle the Spark optimization/tuning parameters?
3) The queries are dynamic and will change. If each query requires tuning
Spark, that becomes quite an involved activity.

Please suggest.

-Deepak

-- 
Deepak
