Thank you very much for your answers, now I understand better what I have to do. Thank you!
On Wed, 8 Feb 2017 at 22:37, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> Hi,
>
> I am not quite sure of your use case here, but I would use spark-submit
> and submit sequential jobs as steps to an EMR cluster.
>
> Regards,
> Gourav
>
> On Wed, Feb 8, 2017 at 11:10 AM, Cosmin Posteuca <cosmin.poste...@gmail.com> wrote:
>
> I tried to run some tests on EMR in yarn cluster mode.
>
> I have a cluster with 16 cores (8 processors with 2 threads each). If I run
> one job (using 5 cores) it takes 90 seconds; if I run 2 jobs simultaneously,
> both finish in 170 seconds; if I run 3 jobs simultaneously, all three finish
> in 240 seconds.
>
> If I run 6 jobs, I expect the first 3 jobs to finish simultaneously in 240
> seconds, and the next 3 jobs to finish 480 seconds after cluster start. But
> that is not what happened. My first job finished after 120 seconds, the
> second after 180 seconds, the third after 240 seconds, the fourth and fifth
> finished simultaneously after 360 seconds, and the last finished after 400
> seconds.
>
> I expected them to run in FIFO mode, but that didn't happen. It seems to be
> a combination of FIFO and FAIR.
>
> Is this the correct behavior of Spark?
>
> Thank you!
>
> 2017-02-08 9:29 GMT+02:00 Gourav Sengupta <gourav.sengu...@gmail.com>:
>
> Hi,
>
> Michael's answer will solve the problem if you are using only an SQL-based
> solution.
>
> Otherwise please refer to the wonderful details mentioned here:
> https://spark.apache.org/docs/latest/job-scheduling.html. With EMR 5.3.0
> released, Spark 2.1.0 is available in AWS.
>
> (Note that there is an issue with using Zeppelin in it; I have raised it
> with AWS and they are looking into it now.)
>
> Regards,
> Gourav Sengupta
>
> On Tue, Feb 7, 2017 at 10:37 PM, Michael Segel <msegel_had...@hotmail.com> wrote:
>
> Why couldn't you use the Spark thrift server?
>
> On Feb 7, 2017, at 1:28 PM, Cosmin Posteuca <cosmin.poste...@gmail.com> wrote:
>
> Answer for Gourav Sengupta:
>
> I want to use the same Spark application because I want it to work as a
> FIFO scheduler. My problem is that I have many jobs (not so big), and if I
> run a separate application for every job my cluster splits resources like a
> FAIR scheduler (that's what I observe, maybe I'm wrong), and there is the
> possibility of creating a bottleneck effect. The start time isn't a problem
> for me, because it isn't a real-time application.
>
> I need a business solution; that's the reason why I can't use code from
> GitHub.
>
> Thanks!
>
> 2017-02-07 19:55 GMT+02:00 Gourav Sengupta <gourav.sengu...@gmail.com>:
>
> Hi,
>
> May I ask the reason for using the same Spark application? Is it because of
> the time it takes to start a Spark context?
>
> On another note, you may want to look at the number of contributors in a
> GitHub repo before choosing a solution.
>
> Regards,
> Gourav
>
> On Tue, Feb 7, 2017 at 5:26 PM, vincent gromakowski <vincent.gromakow...@gmail.com> wrote:
>
> Spark jobserver or Livy server are the best options for a pure technical
> API. If you want to publish a business API you will probably have to build
> your own app, like the one I wrote a year ago:
> https://github.com/elppc/akka-spark-experiments
> It combines Akka actors and a shared Spark context to serve concurrent
> sub-second jobs.
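>
> A simplified sketch of that shared-context pattern, using plain Scala
> Futures on a shared context instead of the Akka actor layer (the object,
> job group, and job names below are illustrative, not taken from the repo):
>
> import org.apache.spark.{SparkConf, SparkContext}
> import scala.concurrent.{Await, ExecutionContext, Future}
> import scala.concurrent.duration._
>
> object SharedContextServer {
>   def main(args: Array[String]): Unit = {
>     // One long-running application, one context shared by every incoming
>     // job; the master URL is supplied by spark-submit.
>     val sc = new SparkContext(new SparkConf().setAppName("shared-context-server"))
>     implicit val ec: ExecutionContext = ExecutionContext.global
>
>     // Each request becomes a Spark job submitted from its own thread;
>     // within a single application the default scheduler runs jobs FIFO.
>     def runJob(id: Int): Future[Long] = Future {
>       sc.setJobGroup(s"job-$id", s"on-demand job $id")
>       sc.parallelize(1 to 1000000, 5).map(_ * 2L).reduce(_ + _)
>     }
>
>     val results = Future.sequence((1 to 3).map(runJob))
>     println(Await.result(results, 10.minutes))
>     sc.stop()
>   }
> }
>
> Because everything runs inside one application, YARN sees a single set of
> containers and Spark's own scheduler decides the order of the jobs.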
> 2017-02-07 15:28 GMT+01:00 ayan guha <guha.a...@gmail.com>:
>
> I think you are looking for Livy or spark jobserver.
>
> On Wed, 8 Feb 2017 at 12:37 am, Cosmin Posteuca <cosmin.poste...@gmail.com> wrote:
>
> I want to run different jobs on demand with the same Spark context, but I
> don't know exactly how I can do this.
>
> I tried to get the current context, but it seems to create a new Spark
> context (with new executors).
>
> I call spark-submit to add new jobs.
>
> I run the code on Amazon EMR (3 instances, 4 cores & 16 GB RAM per
> instance), with YARN as the resource manager.
>
> My code:
>
> import org.apache.spark.SparkContext
>
> val sparkContext = SparkContext.getOrCreate()
> val content = 1 to 40000
> val result = sparkContext.parallelize(content, 5)
> result.map(value => value.toString).foreach(loop)
>
> // busy loop to simulate work on each element
> def loop(x: String): Unit = {
>   for (a <- 1 to 30000000) {}
> }
>
> spark-submit:
>
> spark-submit --executor-cores 1 \
>   --executor-memory 1g \
>   --driver-memory 1g \
>   --master yarn \
>   --deploy-mode cluster \
>   --conf spark.dynamicAllocation.enabled=true \
>   --conf spark.shuffle.service.enabled=true \
>   --conf spark.dynamicAllocation.minExecutors=1 \
>   --conf spark.dynamicAllocation.maxExecutors=3 \
>   --conf spark.dynamicAllocation.initialExecutors=3 \
>   --conf spark.executor.instances=3 \
>
> If I run spark-submit twice, it creates 6 executors, but I want to run all
> these jobs in the same Spark application.
>
> How can I achieve adding jobs to an existing Spark application?
>
> I don't understand why SparkContext.getOrCreate() doesn't get the existing
> Spark context.
>
> Thanks,
> Cosmin P.
>
> --
> Best Regards,
> Ayan Guha
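
For completeness: within a single Spark application the scheduler defaults to
FIFO, and a FAIR scheduler with pools can be enabled instead, as described on
the job-scheduling page linked earlier in the thread. A minimal sketch of that
configuration (the pool name "ondemand" is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object FairPoolExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("fair-pool-example")
      .set("spark.scheduler.mode", "FAIR") // scheduling within one application; default is FIFO
    val sc = new SparkContext(conf)

    // Jobs submitted from this thread are assigned to the "ondemand" pool.
    sc.setLocalProperty("spark.scheduler.pool", "ondemand")
    val sum = sc.parallelize(1 to 1000, 4).map(_.toLong).reduce(_ + _)
    println(sum)

    sc.stop()
  }
}

Jobs tagged with different pool names are then shared fairly across pools;
pool weights, minShare, and each pool's internal mode can be tuned in a
fairscheduler.xml file referenced by spark.scheduler.allocation.file.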