Thank you very much for your answers, now I understand better what I have to do. Thank you!
On Wed, 8 Feb 2017 at 22:37, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> Hi,
>
> I am not quite sure of your use case here, but I would use spark-submit
> and submit sequential jobs as steps to an EMR cluster.
>
> Regards,
> Gourav
>
> On Wed, Feb 8, 2017 at 11:10 AM, Cosmin Posteuca <cosmin.poste...@gmail.com> wrote:
>
> I tried to run some tests on EMR in yarn cluster mode.
>
> I have a cluster with 16 cores (8 processors with 2 threads each). If I run
> one job (using 5 cores) it takes 90 seconds; if I run 2 jobs simultaneously,
> both finish in 170 seconds; if I run 3 jobs simultaneously, all three finish
> in 240 seconds.
>
> If I run 6 jobs, I expect the first 3 jobs to finish simultaneously in 240
> seconds, and the next 3 jobs to finish 480 seconds after cluster start. But
> that is not what happened. My first job finished after 120 seconds, the
> second after 180 seconds, the third after 240 seconds, the fourth and fifth
> finished simultaneously after 360 seconds, and the last finished after 400
> seconds.
>
> I expected them to run in FIFO mode, but that didn't happen. It seems to be
> a combination of FIFO and FAIR.
>
> Is this the correct behavior of Spark?
>
> Thank you!
>
> 2017-02-08 9:29 GMT+02:00 Gourav Sengupta <gourav.sengu...@gmail.com>:
>
> Hi,
>
> Michael's answer will solve the problem if you are using only an SQL-based
> solution.
>
> Otherwise please refer to the wonderful details mentioned here:
> https://spark.apache.org/docs/latest/job-scheduling.html. With EMR 5.3.0
> released, Spark 2.1.0 is available in AWS.
>
> (Note that there is an issue with using Zeppelin in it; I have raised it
> with AWS and they are looking into it now.)
>
> Regards,
> Gourav Sengupta
>
> On Tue, Feb 7, 2017 at 10:37 PM, Michael Segel <msegel_had...@hotmail.com> wrote:
>
> Why couldn't you use the Spark thrift server?
>
> On Feb 7, 2017, at 1:28 PM, Cosmin Posteuca <cosmin.poste...@gmail.com> wrote:
>
> Answer for Gourav Sengupta:
>
> I want to use the same Spark application because I want it to work as a
> FIFO scheduler. My problem is that I have many jobs (not so big), and if I
> run a separate application for every job my cluster splits resources like a
> FAIR scheduler (that's what I observe, maybe I'm wrong), and there is the
> possibility of creating a bottleneck effect. The start time isn't a problem
> for me, because it isn't a real-time application.
>
> I need a business solution; that's the reason why I can't use code from
> GitHub.
>
> Thanks!
>
> 2017-02-07 19:55 GMT+02:00 Gourav Sengupta <gourav.sengu...@gmail.com>:
>
> Hi,
>
> May I ask the reason for using the same Spark application? Is it because of
> the time it takes to start a Spark context?
>
> On another note, you may want to look at the number of contributors in a
> GitHub repo before choosing a solution.
>
> Regards,
> Gourav
>
> On Tue, Feb 7, 2017 at 5:26 PM, vincent gromakowski <vincent.gromakow...@gmail.com> wrote:
>
> Spark jobserver or Livy server are the best options for a pure technical
> API. If you want to publish a business API you will probably have to build
> your own app, like the one I wrote a year ago:
> https://github.com/elppc/akka-spark-experiments
> It combines Akka actors and a shared Spark context to serve concurrent
> sub-second jobs.
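>
> A simplified sketch of that shared-context pattern, using plain Scala
> Futures on a shared context instead of the Akka actor layer (the object,
> job group, and job names below are illustrative, not taken from the repo):
>
> import org.apache.spark.{SparkConf, SparkContext}
> import scala.concurrent.{Await, ExecutionContext, Future}
> import scala.concurrent.duration._
>
> object SharedContextServer {
>   def main(args: Array[String]): Unit = {
>     // One long-running application, one context shared by every incoming
>     // job; the master URL is supplied by spark-submit.
>     val sc = new SparkContext(new SparkConf().setAppName("shared-context-server"))
>     implicit val ec: ExecutionContext = ExecutionContext.global
>
>     // Each request becomes a Spark job submitted from its own thread;
>     // within a single application the default scheduler runs jobs FIFO.
>     def runJob(id: Int): Future[Long] = Future {
>       sc.setJobGroup(s"job-$id", s"on-demand job $id")
>       sc.parallelize(1 to 1000000, 5).map(_ * 2L).reduce(_ + _)
>     }
>
>     val results = Future.sequence((1 to 3).map(runJob))
>     println(Await.result(results, 10.minutes))
>     sc.stop()
>   }
> }
>
> Because everything runs inside one application, YARN sees a single set of
> containers and Spark's own scheduler decides the order of the jobs.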
> 2017-02-07 15:28 GMT+01:00 ayan guha <guha.a...@gmail.com>:
>
> I think you are looking for Livy or spark jobserver.
>
> On Wed, 8 Feb 2017 at 12:37 am, Cosmin Posteuca <cosmin.poste...@gmail.com> wrote:
>
> I want to run different jobs on demand with the same Spark context, but I
> don't know exactly how I can do this.
>
> I tried to get the current context, but it seems to create a new Spark
> context (with new executors).
>
> I call spark-submit to add new jobs.
>
> I run the code on Amazon EMR (3 instances, 4 cores & 16 GB RAM per
> instance), with YARN as the resource manager.
>
> My code:
>
> import org.apache.spark.SparkContext
>
> val sparkContext = SparkContext.getOrCreate()
> val content = 1 to 40000
> val result = sparkContext.parallelize(content, 5)
> result.map(value => value.toString).foreach(loop)
>
> // busy loop to simulate work on each element
> def loop(x: String): Unit = {
>   for (a <- 1 to 30000000) {}
> }
>
> spark-submit:
>
> spark-submit --executor-cores 1 \
>   --executor-memory 1g \
>   --driver-memory 1g \
>   --master yarn \
>   --deploy-mode cluster \
>   --conf spark.dynamicAllocation.enabled=true \
>   --conf spark.shuffle.service.enabled=true \
>   --conf spark.dynamicAllocation.minExecutors=1 \
>   --conf spark.dynamicAllocation.maxExecutors=3 \
>   --conf spark.dynamicAllocation.initialExecutors=3 \
>   --conf spark.executor.instances=3 \
>
> If I run spark-submit twice, it creates 6 executors, but I want to run all
> these jobs in the same Spark application.
>
> How can I achieve adding jobs to an existing Spark application?
>
> I don't understand why SparkContext.getOrCreate() doesn't get the existing
> Spark context.
>
> Thanks,
> Cosmin P.
>
> --
> Best Regards,
> Ayan Guha
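
For completeness: within a single Spark application the scheduler defaults to
FIFO, and a FAIR scheduler with pools can be enabled instead, as described on
the job-scheduling page linked earlier in the thread. A minimal sketch of that
configuration (the pool name "ondemand" is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object FairPoolExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("fair-pool-example")
      .set("spark.scheduler.mode", "FAIR") // scheduling within one application; default is FIFO
    val sc = new SparkContext(conf)

    // Jobs submitted from this thread are assigned to the "ondemand" pool.
    sc.setLocalProperty("spark.scheduler.pool", "ondemand")
    val sum = sc.parallelize(1 to 1000, 4).map(_.toLong).reduce(_ + _)
    println(sum)

    sc.stop()
  }
}

Jobs tagged with different pool names are then shared fairly across pools;
pool weights, minShare, and each pool's internal mode can be tuned in a
fairscheduler.xml file referenced by spark.scheduler.allocation.file.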