Resource management in yarn cluster mode is YARN's task, so it depends on how 
you have configured the queues and the scheduler there.
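
For example, an application can be pinned to a particular YARN queue at 
submission time. A minimal sketch, assuming a queue named "batch" exists in 
the cluster's scheduler configuration (the application name is also made up):

import org.apache.spark.{SparkConf, SparkContext}

// Submit this application into a dedicated YARN queue; how it then shares
// the cluster is decided by that queue's capacity/fair scheduler settings.
val conf = new SparkConf()
  .setAppName("queued-app")           // hypothetical application name
  .set("spark.yarn.queue", "batch")   // assumed queue defined on the cluster
val sc = SparkContext.getOrCreate(conf)

The same thing can be done on the command line with spark-submit's --queue 
option.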

> On 8 Feb 2017, at 12:10, Cosmin Posteuca <cosmin.poste...@gmail.com> wrote:
> 
> I tried to run some test on EMR on yarn cluster mode.
> 
> I have a cluster with 16 cores (8 processors with 2 threads each). If I run 
> one job (which uses 5 cores), it takes 90 seconds; if I run 2 jobs 
> simultaneously, both finish in 170 seconds; if I run 3 jobs simultaneously, 
> all three finish in 240 seconds.
> 
> If I run 6 jobs, I expect the first 3 jobs to finish simultaneously in 240 
> seconds, and the next 3 jobs to finish 480 seconds after cluster start time. 
> But that isn't what happened. My first job finished after 120 seconds, the 
> second finished after 180 seconds, the third finished after 240 seconds, the 
> fourth and the fifth finished simultaneously after 360 seconds, and the last 
> finished after 400 seconds.
> 
> I expected them to run in FIFO mode, but that isn't what happened. It seems 
> to be a combination of FIFO and FAIR.
> 
> Is this the correct behavior of spark?
> 
> Thank you!
> 
> 
> 2017-02-08 9:29 GMT+02:00 Gourav Sengupta <gourav.sengu...@gmail.com>:
>> Hi,
>> 
>> Michael's answer will solve the problem in case you are using only a 
>> SQL-based solution.
>> 
>> Otherwise please refer to the wonderful details mentioned here: 
>> https://spark.apache.org/docs/latest/job-scheduling.html. With the release 
>> of EMR 5.3.0, Spark 2.1.0 is available in AWS.
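>> 
>> For scheduling within a single application, that page covers the default 
>> FIFO mode and fair scheduler pools. A minimal sketch of using a pool (the 
>> app and pool names are assumptions; the pool itself would be defined in a 
>> fairscheduler.xml file):
>> 
>> import org.apache.spark.{SparkConf, SparkContext}
>> 
>> // Switch the in-application scheduler from FIFO (the default) to FAIR and
>> // route jobs submitted from the current thread into a named pool.
>> val sc = SparkContext.getOrCreate(
>>   new SparkConf().setAppName("shared-app").set("spark.scheduler.mode", "FAIR"))
>> sc.setLocalProperty("spark.scheduler.pool", "production")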
>> 
>> (Note that there is an issue with using Zeppelin in it; I have raised it 
>> with AWS and they are looking into it now.)
>> 
>> Regards,
>> Gourav Sengupta
>> 
>>> On Tue, Feb 7, 2017 at 10:37 PM, Michael Segel <msegel_had...@hotmail.com> 
>>> wrote:
>>> Why couldn’t you use the Spark Thrift Server? 
>>> 
>>> 
>>>> On Feb 7, 2017, at 1:28 PM, Cosmin Posteuca <cosmin.poste...@gmail.com> 
>>>> wrote:
>>>> 
>>>> Answer for Gourav Sengupta:
>>>> 
>>>> I want to use the same Spark application because I want it to work as a 
>>>> FIFO scheduler. My problem is that I have many jobs (not so big), and if I 
>>>> run an application for every job, my cluster will split resources the way 
>>>> a FAIR scheduler would (that's what I observe, maybe I'm wrong), and there 
>>>> is the possibility of creating a bottleneck effect. The start time isn't a 
>>>> problem for me, because it isn't a real-time application.
>>>> 
>>>> I need a business solution; that's the reason why I can't use code from 
>>>> GitHub.
>>>> 
>>>> Thanks!
>>>> 
>>>> 2017-02-07 19:55 GMT+02:00 Gourav Sengupta <gourav.sengu...@gmail.com>:
>>>>> Hi,
>>>>> 
>>>>> May I ask the reason for using the same Spark application? Is it because 
>>>>> of the time it takes to start a Spark context?
>>>>> 
>>>>> On another note, you may want to look at the number of contributors in a 
>>>>> GitHub repo before choosing a solution.
>>>>> 
>>>>> 
>>>>> Regards,
>>>>> Gourav 
>>>>> 
>>>>>> On Tue, Feb 7, 2017 at 5:26 PM, vincent gromakowski 
>>>>>> <vincent.gromakow...@gmail.com> wrote:
>>>>>> Spark jobserver or Livy server are the best options for a pure technical 
>>>>>> API.
>>>>>> If you want to publish a business API you will probably have to build 
>>>>>> your own app, like the one I wrote a year ago: 
>>>>>> https://github.com/elppc/akka-spark-experiments
>>>>>> It combines Akka actors and a shared Spark context to serve concurrent 
>>>>>> sub-second jobs.
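>>>>>> 
>>>>>> A very rough sketch of that shared-context pattern, using a plain thread 
>>>>>> pool instead of Akka actors (the pool size and method name are made up):
>>>>>> 
>>>>>> import java.util.concurrent.Executors
>>>>>> import org.apache.spark.SparkContext
>>>>>> 
>>>>>> // One long-lived context shared by every incoming request.
>>>>>> val sc = SparkContext.getOrCreate()
>>>>>> val pool = Executors.newFixedThreadPool(4)   // assumed number of worker threads
>>>>>> 
>>>>>> // Each request becomes a Spark job on the shared context; jobs submitted
>>>>>> // from different threads can run concurrently inside the one application.
>>>>>> def handleRequest(n: Int): Unit = pool.execute(new Runnable {
>>>>>>   def run(): Unit = println(sc.parallelize(1 to n).sum())
>>>>>> })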
>>>>>> 
>>>>>> 
>>>>>> 2017-02-07 15:28 GMT+01:00 ayan guha <guha.a...@gmail.com>:
>>>>>>> I think you are looking for Livy or Spark jobserver.
>>>>>>> 
>>>>>>>> On Wed, 8 Feb 2017 at 12:37 am, Cosmin Posteuca 
>>>>>>>> <cosmin.poste...@gmail.com> wrote:
>>>>>>>> I want to run different jobs on demand with the same Spark context, 
>>>>>>>> but I don't know exactly how I can do this.
>>>>>>>> 
>>>>>>>> I try to get the current context, but it seems to create a new Spark 
>>>>>>>> context (with new executors).
>>>>>>>> 
>>>>>>>> I call spark-submit to add new jobs.
>>>>>>>> 
>>>>>>>> I run the code on Amazon EMR (3 instances, 4 cores & 16 GB RAM per 
>>>>>>>> instance), with YARN as the resource manager.
>>>>>>>> 
>>>>>>>> My code:
>>>>>>>> 
>>>>>>>> import org.apache.spark.SparkContext
>>>>>>>> 
>>>>>>>> // Reuse the active context if one already exists in this JVM,
>>>>>>>> // otherwise create a new one.
>>>>>>>> val sparkContext = SparkContext.getOrCreate()
>>>>>>>> val content = 1 to 40000
>>>>>>>> // Distribute the range over 5 partitions and run the dummy work on the executors.
>>>>>>>> val result = sparkContext.parallelize(content, 5)
>>>>>>>> result.map(value => value.toString).foreach(loop)
>>>>>>>> 
>>>>>>>> // CPU-bound busy loop used as a stand-in for real work.
>>>>>>>> def loop(x: String): Unit = {
>>>>>>>>   for (a <- 1 to 30000000) {
>>>>>>>>   }
>>>>>>>> }
>>>>>>>> spark-submit:
>>>>>>>> 
>>>>>>>> spark-submit --executor-cores 1 \
>>>>>>>>              --executor-memory 1g \
>>>>>>>>              --driver-memory 1g \
>>>>>>>>              --master yarn \
>>>>>>>>              --deploy-mode cluster \
>>>>>>>>              --conf spark.dynamicAllocation.enabled=true \
>>>>>>>>              --conf spark.shuffle.service.enabled=true \
>>>>>>>>              --conf spark.dynamicAllocation.minExecutors=1 \
>>>>>>>>              --conf spark.dynamicAllocation.maxExecutors=3 \
>>>>>>>>              --conf spark.dynamicAllocation.initialExecutors=3 \
>>>>>>>>              --conf spark.executor.instances=3 \
>>>>>>>> If I run spark-submit twice it creates 6 executors, but I want to run 
>>>>>>>> all these jobs in the same Spark application.
>>>>>>>> 
>>>>>>>> How can I achieve adding jobs to an existing Spark application?
>>>>>>>> 
>>>>>>>> I don't understand why SparkContext.getOrCreate() doesn't get the 
>>>>>>>> existing Spark context.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> Cosmin P.
>>>>>>>> 
>>>>>>> 
>>>>>>> -- 
>>>>>>> Best Regards,
>>>>>>> Ayan Guha
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 
