Can you share which API your jobs use: core RDDs, SQL, DStreams, etc.? Also refer to the recommendations at https://spark.apache.org/docs/2.3.0/configuration.html for detailed configuration guidance.

Thanks,
Prem
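P.S. A rough sketch of the kind of formula you're describing (purely illustrative: it assumes a homogeneous worker pool, the common rule of thumb of ~5 cores per executor, and reserving 1 core and 1 GB per node for the OS and cluster daemons; the helper name derive_executor_settings is made up, not a Spark API):

    def derive_executor_settings(worker_nodes, cores_per_node, mem_per_node_gb):
        # Reserve 1 core and 1 GB per node for the OS and cluster daemons.
        usable_cores = cores_per_node - 1
        usable_mem_gb = mem_per_node_gb - 1
        # Rule of thumb: ~5 cores per executor keeps HDFS I/O throughput healthy.
        executor_cores = min(5, usable_cores)
        executors_per_node = usable_cores // executor_cores
        num_executors = worker_nodes * executors_per_node
        # Split node memory across its executors, keeping ~10% for overhead.
        executor_memory_gb = int(usable_mem_gb / executors_per_node * 0.9)
        return num_executors, executor_cores, executor_memory_gb

    # Your 4 workers with 8 cores / 16 GB each give (4, 5, 13), close to the
    # --num-executors 4 --executor-cores 5 --executor-memory 10G you already
    # pass to spark-submit.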
On Wed, Jul 4, 2018 at 12:34 PM, Aakash Basu <aakash.spark....@gmail.com> wrote:

> I do not want to change executor/driver cores/memory on the fly in a
> single Spark job; all I want is to make them cluster specific. So, I want
> a formula with which, given the driver and executor details, I can find
> the values for them before putting them in the spark-submit.
>
> I more or less know how to achieve the above, as I've done it before.
>
> What I also need is to tweak the other Spark confs depending on the data.
> Is that possible? I mean (just an example), if I have 100+ features, I
> want to double my default spark.driver.maxResultSize to 2G, and similarly
> for other configs. Can that be achieved by any means for an optimal run
> on that kind of dataset? If yes, how?
>
> On Tue, Jul 3, 2018 at 6:28 PM, Vadim Semenov <va...@datadoghq.com> wrote:
>
>> You can't change the executor/driver cores/memory on the fly once
>> you've already started a Spark Context.
>>
>> On Tue, Jul 3, 2018 at 4:30 AM Aakash Basu <aakash.spark....@gmail.com> wrote:
>>>
>>> We aren't using Oozie or similar. Moreover, the end-to-end job will be
>>> exactly the same, but the data will be extremely different (number of
>>> continuous and categorical columns, vertical size, horizontal size,
>>> etc.). Hence, if there were a calculation of the parameters that let us
>>> simply take the data and derive the respective configuration/parameters
>>> from it, that would be great.
>>>
>>> On Tue, Jul 3, 2018 at 1:09 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>
>>>> Don't do this in your job. Create different jobs for the different
>>>> types of jobs and orchestrate them using Oozie or similar.
>>>>
>>>> On 3. Jul 2018, at 09:34, Aakash Basu <aakash.spark....@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Cluster - 5 nodes (1 driver and 4 workers)
>>>> Driver config: 16 cores, 32 GB RAM
>>>> Worker config: 8 cores, 16 GB RAM
>>>>
>>>> I'm using the parameters below, of which I know the first chunk is
>>>> cluster dependent and the second chunk is data/code dependent.
>>>>
>>>> --num-executors 4
>>>> --executor-cores 5
>>>> --executor-memory 10G
>>>> --driver-cores 5
>>>> --driver-memory 25G
>>>>
>>>> --conf spark.sql.shuffle.partitions=100
>>>> --conf spark.driver.maxResultSize=2G
>>>> --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC"
>>>> --conf spark.scheduler.listenerbus.eventqueue.capacity=20000
>>>>
>>>> I've come up with these values through R&D on the properties and the
>>>> issues I faced, hence these handles.
>>>>
>>>> My ask here is:
>>>>
>>>> 1) How can I infer, using a formula or code, how to set the second
>>>> chunk depending on the data/code?
>>>> 2) What other usable properties/configurations can I use to shorten
>>>> my job runtime?
>>>>
>>>> Thanks,
>>>> Aakash.
>>
>> --
>> Sent from my iPhone
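For the data-dependent half of the question above, one pattern (again only a sketch; the thresholds and the estimate_confs name are illustrative guesses, not Spark recommendations) is to profile the dataset once and build the --conf list before invoking spark-submit:

    def estimate_confs(num_features, row_count):
        # Start from the job's current defaults.
        confs = {
            "spark.sql.shuffle.partitions": "100",
            "spark.driver.maxResultSize": "1g",
        }
        # Wide data: anything collected to the driver grows with column count.
        if num_features > 100:
            confs["spark.driver.maxResultSize"] = "2g"
        # Tall data: more shuffle partitions keep individual partitions small.
        if row_count > 50_000_000:
            confs["spark.sql.shuffle.partitions"] = "400"
        return [f"--conf {k}={v}" for k, v in confs.items()]

    # estimate_confs(150, 10_000_000) ->
    # ['--conf spark.sql.shuffle.partitions=100',
    #  '--conf spark.driver.maxResultSize=2g']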