I do not want to change executor/driver cores/memory on the fly within a single Spark job; all I want is to make them cluster-specific. So, I want a formula with which, given the driver and worker specs, I can work out those values before passing them to spark-submit. Something like the sketch below is the kind of thing I mean.
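For reference, a minimal Python sketch of the commonly cited sizing heuristic, assuming a YARN-style cluster. The constants (about 5 cores per executor, roughly 7% memory overhead, one core and 1 GB per node reserved for the OS/daemons, one executor slot reserved for the application master) are rules of thumb, not values validated in this thread:

# Back-of-envelope executor sizing. Assumption: YARN-style cluster;
# the constants below are rules of thumb, not thread-validated values.
def size_executors(num_workers, cores_per_node, mem_per_node_gb,
                   cores_per_executor=5, overhead_fraction=0.07):
    usable_cores = cores_per_node - 1      # leave 1 core for OS/daemons
    usable_mem_gb = mem_per_node_gb - 1    # leave 1 GB for OS/daemons
    executors_per_node = max(1, usable_cores // cores_per_executor)
    # reserve one executor's slot for the application master
    total_executors = max(1, executors_per_node * num_workers - 1)
    mem_per_executor_gb = usable_mem_gb / executors_per_node
    # subtract off-heap overhead from the per-executor share
    heap_gb = int(mem_per_executor_gb * (1 - overhead_fraction))
    return total_executors, cores_per_executor, heap_gb

# Example: the 4-worker, 8-core, 16 GB cluster quoted further down
n, c, m = size_executors(4, 8, 16)
print(f"--num-executors {n} --executor-cores {c} --executor-memory {m}G")

On the 4-worker cluster quoted below this prints "--num-executors 3 --executor-cores 5 --executor-memory 13G", which is in the same ballpark as the hand-tuned values in the original mail but will not reproduce them exactly.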
I more or less know how to achieve the above, as I've done it before. What I need now is to tweak the other Spark confs depending on the data. Is that possible? I mean (just an example), if I have 100+ features, I want to double my default spark.driver.maxResultSize to 2G, and similarly for the other configs. Can that be achieved by any means for an optimal run on that kind of dataset? If yes, how can I do it? (A wrapper sketch follows the quoted thread below.)

On Tue, Jul 3, 2018 at 6:28 PM, Vadim Semenov <va...@datadoghq.com> wrote:
> You can't change the executor/driver cores/memory on the fly once
> you've already started a Spark Context.
>
> On Tue, Jul 3, 2018 at 4:30 AM Aakash Basu <aakash.spark....@gmail.com> wrote:
> >
> > We aren't using Oozie or similar. Moreover, the end-to-end job shall be
> > exactly the same, but the data will be extremely different (number of
> > continuous and categorical columns, vertical size, horizontal size,
> > etc.). Hence, if there were a way to calculate the parameters, so that
> > we could simply look at the data and derive the respective
> > configuration/parameters, it would be great.
> >
> > On Tue, Jul 3, 2018 at 1:09 PM, Jörn Franke <jornfra...@gmail.com> wrote:
> >>
> >> Don't do this in your job. Create different jobs for the different
> >> types of jobs and orchestrate them using Oozie or similar.
> >>
> >> On 3. Jul 2018, at 09:34, Aakash Basu <aakash.spark....@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> Cluster - 5 nodes (1 driver and 4 workers)
> >> Driver config: 16 cores, 32 GB RAM
> >> Worker config: 8 cores, 16 GB RAM
> >>
> >> I'm using the below parameters, of which I know the first chunk is
> >> cluster-dependent and the second chunk is data/code-dependent.
> >>
> >> --num-executors 4
> >> --executor-cores 5
> >> --executor-memory 10G
> >> --driver-cores 5
> >> --driver-memory 25G
> >>
> >> --conf spark.sql.shuffle.partitions=100
> >> --conf spark.driver.maxResultSize=2G
> >> --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC"
> >> --conf spark.scheduler.listenerbus.eventqueue.capacity=20000
> >>
> >> I've arrived at these values through R&D on the properties and on the
> >> issues I faced, hence these handles.
> >>
> >> My ask here is:
> >>
> >> 1) How can I infer, using some formula or code, the values of the
> >> second chunk, which depends on the data/code?
> >> 2) What other properties/configurations can I use to shorten my job
> >> runtime?
> >>
> >> Thanks,
> >> Aakash.
>
> --
> Sent from my iPhone
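On the data-dependent confs: since driver-side settings cannot be changed once the Spark Context is up (as Vadim notes above), one option is a small wrapper script that inspects the input first and only then builds the spark-submit line. A hypothetical Python sketch; the paths (my_job.py, /data/input.csv), the CSV-header assumption, and the thresholds are illustrative placeholders, not recommendations:

#!/usr/bin/env python3
# Hypothetical wrapper: derive Spark confs from the input *before*
# launching spark-submit, since they can't be changed after the
# Spark Context starts. All thresholds below are illustrative.
import os
import subprocess

def derive_confs(path):
    with open(path) as f:
        num_features = len(f.readline().split(","))  # columns in CSV header
    size_gb = os.path.getsize(path) / 1024 ** 3

    return {
        # The rule from the question: 100+ features -> double the 1g default
        "spark.driver.maxResultSize": "2g" if num_features >= 100 else "1g",
        # Roughly one shuffle task per ~128 MB of input (illustrative)
        "spark.sql.shuffle.partitions": str(max(100, int(size_gb * 8))),
    }

def submit(app, data_path):
    cmd = ["spark-submit"]
    for key, value in derive_confs(data_path).items():
        cmd += ["--conf", f"{key}={value}"]
    cmd += [app, data_path]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    submit("my_job.py", "/data/input.csv")  # hypothetical paths

The same pattern extends to any other conf you can tie to a measurable property of the data (row count, column types, file size); the hard part is validating each rule against your own workload rather than writing the wrapper.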