Spark doesn't run all code when submitted in yarn-cluster mode.

2017-06-15 Thread Cosmin Posteuca
Hi,

I have the following problem:

After the SparkSession is initialized I create a task:

 val task = new Runnable { } in which I make a REST API call and, from its
response, read some data from the internet/ES/Hive.
This task runs every 5 seconds with the Akka scheduler:

scheduler.schedule( Duration(0, TimeUnit.SECONDS), Duration(5,
TimeUnit.SECONDS), task)

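Put together, the setup looks roughly like this (a minimal sketch only; the
object name, the ActorSystem wiring and the REST/ES/Hive work are placeholders):

import java.util.concurrent.TimeUnit
import scala.concurrent.duration.Duration
import akka.actor.ActorSystem
import org.apache.spark.sql.SparkSession

object ScheduledJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("scheduled-job").getOrCreate()

    val system = ActorSystem("scheduler-system")
    import system.dispatcher // execution context used by the scheduler

    val task = new Runnable {
      override def run(): Unit = {
        // call the REST API and read from internet/ES/Hive here (elided)
      }
    }

    // fire immediately, then every 5 seconds
    system.scheduler.schedule(
      Duration(0, TimeUnit.SECONDS),
      Duration(5, TimeUnit.SECONDS),
      task)

    // main() returns here while the scheduler is still supposed to keep firing
  }
}
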
My problem is that when I run the code in yarn-cluster mode, the SparkSession
is closed immediately after init, and it does not wait for the tasks
that will be scheduled later. But in yarn-client mode everything is
fine: the SparkSession is not closed automatically, and the task runs
every 5 seconds.

What is the problem when I start the Spark application in yarn-cluster mode?
How does submitting work when I use yarn-cluster?
How can I resolve this problem? How can I make this work in yarn-cluster?

PS: I use Spark 2.1.1 and akka-actor_2.11 2.4.17.

Thanks,
Cosmin


Re: Spark Worker can't find jar submitted programmatically

2017-02-20 Thread Cosmin Posteuca
Hi Zoran,

I think you are looking for the --jars parameter/argument to spark-submit:

> When using spark-submit, the application jar along with any jars included
> with the --jars option will be automatically transferred to the cluster.
> URLs supplied after --jars must be separated by commas. (
> http://spark.apache.org/docs/latest/submitting-applications.html)


I don't know whether this works in standalone mode, but it works for me in yarn mode.

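If the submission has to stay programmatic, org.apache.spark.launcher.SparkLauncher
does roughly what spark-submit does and ships jars the same way. A rough sketch
(paths, class name and master URL are placeholders, not taken from Zoran's setup):

import org.apache.spark.launcher.SparkLauncher

// Launch the application much as spark-submit would; addJar() plays the role of --jars.
val process = new SparkLauncher()
  .setAppResource("/path/to/sample-spark-maven-one-jar.jar") // application jar
  .setMainClass("com.example.Main")                          // placeholder main class
  .setMaster("spark://zoran-Latitude-E5420:7077")
  .addJar("/path/to/extra-dependency.jar")                   // shipped like --jars
  .setConf("spark.executor.memory", "2g")
  .launch()

process.waitFor() // wait for the submitted application to finish
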
Thanks,
Cosmin

2017-02-17 2:46 GMT+02:00 jeremycod :

> Hi, I'm trying to create application that would programmatically submit
> jar file to Spark standalone cluster running on my local PC. However, I'm
> always getting the error WARN TaskSetManager:66 - Lost task 1.0 in stage
> 0.0 (TID 1, 192.168.2.68, executor 0): java.lang.RuntimeException: Stream
> '/jars/sample-spark-maven-one-jar.jar' was not found. I'm creating the
> SparkContext in the following way:
>
> val sparkConf = new SparkConf()
> sparkConf.setMaster("spark://zoran-Latitude-E5420:7077")
> sparkConf.set("spark.cores_max","2")
> sparkConf.set("spark.executor.memory","2g")
> sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
> sparkConf.setAppName("Test application")
> sparkConf.set("spark.ui.port","4041")
> sparkConf.set("spark.local.ip","192.168.2.68")
> val oneJar="/samplesparkmaven/target/sample-spark-maven-one-jar.jar"
> sparkConf.setJars(List(oneJar))
> val sc = new SparkContext(sparkConf)
>
> I'm using Spark 2.1.0 in standalone mode with master and one worker. Does
> anyone have idea where the problem might be or how to investigate it
> further? Thanks, Zoran


Re: [Spark Launcher] How to launch parallel jobs?

2017-02-14 Thread Cosmin Posteuca
Hi,

Egor is right: for every partition Spark creates a task, and every task runs on
a single core. But with different configurations Spark gives different
results:

1 executor with 4 cores takes 120 seconds
2 executors with 2 cores each took 60 seconds twice and 120 seconds once
4 executors with 1 core each take 60 seconds

Why does this happen? Why is it non-deterministic?

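For reference, this is roughly how I check how many tasks a stage gets (a small
sketch; the RDD is just an example and `spark` is assumed to be the SparkSession):

// Each partition becomes one task, and a task runs on a single core,
// so the partition count caps how many cores a stage can actually use.
val rdd = spark.sparkContext.parallelize(1 to 1000000)
println(s"partitions = ${rdd.getNumPartitions}")                    // tasks in the stage
println(s"default parallelism = ${spark.sparkContext.defaultParallelism}")
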
Thanks

2017-02-14 10:29 GMT+02:00 Cosmin Posteuca <cosmin.poste...@gmail.com>:

> Memory seems to be enough. My cluster has 22.5 GB of total memory and my job
> uses 6.88 GB. If I run this job twice, they use 13.75 GB, but sometimes
> the cluster has a memory spike of 19.5 GB.
>
> Thanks,
> Cosmin
>
> 2017-02-14 10:03 GMT+02:00 Mendelson, Assaf <assaf.mendel...@rsa.com>:
>
>> You should also check your memory usage.
>>
>> Let’s say for example you have 16 cores and 8 GB. And that you use 4
>> executors with 1 core each.
>>
>> When you use an executor, spark reserves it from yarn and yarn allocates
>> the number of cores (e.g. 1 in our case) and the memory. The memory is
>> actually more than you asked for. If you ask for 1GB it will in fact
>> allocate almost 1.5GB with overhead. In addition, it will probably allocate
>> an executor for the driver (probably with 1024MB memory usage).
>>
>> When you run your program and look in port 8080, you should look not only
>> on the VCores used out of the VCores total but also on the Memory used and
>> Memory total. You should also navigate to the executors (e.g.
>> applications->running on the left and then choose your application and
>> navigate all the way down to a single container). You can see there the
>> actual usage.
>>
>>
>>
>> BTW, it doesn’t matter how much memory your program wants but how much it
>> reserves. In your example it will not take the 50MB of the test but the
>> ~1.5GB (after overhead) per executor.
>>
>> Hope this helps,
>>
>> Assaf.
>>
>>
>>
>> *From:* Cosmin Posteuca [mailto:cosmin.poste...@gmail.com]
>> *Sent:* Tuesday, February 14, 2017 9:53 AM
>> *To:* Egor Pahomov
>> *Cc:* user
>> *Subject:* Re: [Spark Launcher] How to launch parallel jobs?
>>
>>
>>
>> Hi Egor,
>>
>>
>>
>> About the first problem, I think you are right; it makes sense.
>>
>>
>>
>> About the second problem, I checked the available resources on port 8088 and
>> it shows 16 available cores. I start my job with 4 executors with 1 core
>> each and 1 GB per executor. My job uses at most 50 MB of memory (just for
>> testing). From my point of view the resources are enough; I think the problem
>> is in the yarn configuration files, but I don't know what is missing.
>>
>>
>>
>> Thank you
>>
>>
>>
>> 2017-02-13 21:14 GMT+02:00 Egor Pahomov <pahomov.e...@gmail.com>:
>>
>> About the second problem: I understand this can happen in two cases: (1) one job
>> prevents the other one from getting resources for executors, or (2) the
>> bottleneck is reading from disk, so you cannot really parallelize that. I
>> have no experience with the second case, but it's easy to verify the first one:
>> just look at your hadoop UI and verify that both jobs get enough resources.
>>
>>
>>
>> 2017-02-13 11:07 GMT-08:00 Egor Pahomov <pahomov.e...@gmail.com>:
>>
>> "But if i increase only executor-cores the finish time is the same".
>> More experienced people can correct me if I'm wrong, but as far as I
>> understand it: one partition is processed by one spark task. A task always
>> runs on 1 core and is not parallelized among cores. So if you have 5
>> partitions and you increased the total number of cores in the cluster from 7 to
>> 10, for example, you have not gained anything. But if you repartition, you
>> give it an opportunity to process things in more threads, so now more tasks can
>> execute in parallel.
>>
>>
>>
>> 2017-02-13 7:05 GMT-08:00 Cosmin Posteuca <cosmin.poste...@gmail.com>:
>>
>> Hi,
>>
>>
>>
>> I don't think I understand well enough how to launch jobs.
>>
>>
>>
>> I have one job which takes 60 seconds to finish. I run it with the following
>> command:
>>
>>
>>
>> spark-submit --executor-cores 1 \
>>
>>  --executor-memory 1g \
>>
>>  --driver-memory 1g \
>>
>>  --master yarn \
>>
>>  --deploy-mode cluster \
>>
>>  --conf spark.dynamicAllocation.enabled=true \
>>
>>  --conf spark

Re: [Spark Launcher] How to launch parallel jobs?

2017-02-14 Thread Cosmin Posteuca
Memory seems to be enough. My cluster has 22.5 GB of total memory and my job
uses 6.88 GB. If I run this job twice, they use 13.75 GB, but sometimes
the cluster has a memory spike of 19.5 GB.

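For what it's worth, a rough check of the per-executor reservation Assaf
describes below, assuming Spark 2.1's default spark.yarn.executor.memoryOverhead
of max(10% of executor memory, 384 MB):

val executorMemoryMb = 1024
val overheadMb       = math.max((executorMemoryMb * 0.10).toInt, 384) // 384
val reservedFromYarn = executorMemoryMb + overheadMb                  // 1408 MB, roughly the ~1.5 GB Assaf mentions
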
Thanks,
Cosmin

2017-02-14 10:03 GMT+02:00 Mendelson, Assaf <assaf.mendel...@rsa.com>:

> You should also check your memory usage.
>
> Let’s say for example you have 16 cores and 8 GB. And that you use 4
> executors with 1 core each.
>
> When you use an executor, spark reserves it from yarn and yarn allocates
> the number of cores (e.g. 1 in our case) and the memory. The memory is
> actually more than you asked for. If you ask for 1GB it will in fact
> allocate almost 1.5GB with overhead. In addition, it will probably allocate
> an executor for the driver (probably with 1024MB memory usage).
>
> When you run your program and look in port 8080, you should look not only
> on the VCores used out of the VCores total but also on the Memory used and
> Memory total. You should also navigate to the executors (e.g.
> applications->running on the left and then choose your application and
> navigate all the way down to a single container). You can see there the
> actual usage.
>
>
>
> BTW, it doesn’t matter how much memory your program wants but how much it
> reserves. In your example it will not take the 50MB of the test but the
> ~1.5GB (after overhead) per executor.
>
> Hope this helps,
>
> Assaf.
>
>
>
> *From:* Cosmin Posteuca [mailto:cosmin.poste...@gmail.com]
> *Sent:* Tuesday, February 14, 2017 9:53 AM
> *To:* Egor Pahomov
> *Cc:* user
> *Subject:* Re: [Spark Launcher] How to launch parallel jobs?
>
>
>
> Hi Egor,
>
>
>
> About the first problem, I think you are right; it makes sense.
>
>
>
> About the second problem, I checked the available resources on port 8088 and
> it shows 16 available cores. I start my job with 4 executors with 1 core
> each and 1 GB per executor. My job uses at most 50 MB of memory (just for
> testing). From my point of view the resources are enough; I think the problem
> is in the yarn configuration files, but I don't know what is missing.
>
>
>
> Thank you
>
>
>
> 2017-02-13 21:14 GMT+02:00 Egor Pahomov <pahomov.e...@gmail.com>:
>
> About the second problem: I understand this can happen in two cases: (1) one job
> prevents the other one from getting resources for executors, or (2) the
> bottleneck is reading from disk, so you cannot really parallelize that. I
> have no experience with the second case, but it's easy to verify the first one:
> just look at your hadoop UI and verify that both jobs get enough resources.
>
>
>
> 2017-02-13 11:07 GMT-08:00 Egor Pahomov <pahomov.e...@gmail.com>:
>
> "But if i increase only executor-cores the finish time is the same". More
> experienced people can correct me if I'm wrong, but as far as I understand
> it: one partition is processed by one spark task. A task always runs on
> 1 core and is not parallelized among cores. So if you have 5 partitions and
> you increased the total number of cores in the cluster from 7 to 10, for example,
> you have not gained anything. But if you repartition, you give it an
> opportunity to process things in more threads, so now more tasks can execute
> in parallel.
>
>
>
> 2017-02-13 7:05 GMT-08:00 Cosmin Posteuca <cosmin.poste...@gmail.com>:
>
> Hi,
>
>
>
> I don't think I understand well enough how to launch jobs.
>
>
>
> I have one job which takes 60 seconds to finish. I run it with the following
> command:
>
>
>
> spark-submit --executor-cores 1 \
>
>  --executor-memory 1g \
>
>  --driver-memory 1g \
>
>  --master yarn \
>
>  --deploy-mode cluster \
>
>  --conf spark.dynamicAllocation.enabled=true \
>
>  --conf spark.shuffle.service.enabled=true \
>
>  --conf spark.dynamicAllocation.minExecutors=1 \
>
>  --conf spark.dynamicAllocation.maxExecutors=4 \
>
>  --conf spark.dynamicAllocation.initialExecutors=4 \
>
>  --conf spark.executor.instances=4 \
>
> If I increase the number of partitions in the code and the number of executors,
> the app finishes faster, which is OK. But if I increase only executor-cores the
> finish time is the same, and I don't understand why. I expect the time to be
> lower than the initial time.
>
> My second problem is that if I launch the above code twice I expect both jobs to
> finish in 60 seconds, but this doesn't happen. Both jobs finish after 120
> seconds and I don't understand why.
>
> I run this code on AWS EMR, on 2 instances (4 CPUs each, and each CPU has 2
> threads). From what I saw in the default EMR configuration, yarn is set to
> FIFO (default) mode with the CapacityScheduler.
>
> What do you think about these problems?
>
> Thanks,
>
> Cosmin
>
>
>
>
>
> --
>
>
> *Sincerely yours Egor Pakhomov*
>
>
>
>
>
> --
>
>
> *Sincerely yours Egor Pakhomov*
>
>
>


Re: [Spark Launcher] How to launch parallel jobs?

2017-02-13 Thread Cosmin Posteuca
Hi Egor,

About the first problem, I think you are right; it makes sense.

About the second problem, I checked the available resources on port 8088 and it
shows 16 available cores. I start my job with 4 executors with 1 core each
and 1 GB per executor. My job uses at most 50 MB of memory (just for testing).
From my point of view the resources are enough; I think the problem is in
the yarn configuration files, but I don't know what is missing.

Thank you

2017-02-13 21:14 GMT+02:00 Egor Pahomov <pahomov.e...@gmail.com>:

> About the second problem: I understand this can happen in two cases: (1) one job
> prevents the other one from getting resources for executors, or (2) the
> bottleneck is reading from disk, so you cannot really parallelize that. I
> have no experience with the second case, but it's easy to verify the first one:
> just look at your hadoop UI and verify that both jobs get enough resources.
>
> 2017-02-13 11:07 GMT-08:00 Egor Pahomov <pahomov.e...@gmail.com>:
>
>> "But if i increase only executor-cores the finish time is the same".
>> More experienced people can correct me if I'm wrong, but as far as I
>> understand it: one partition is processed by one spark task. A task always
>> runs on 1 core and is not parallelized among cores. So if you have 5
>> partitions and you increased the total number of cores in the cluster from 7 to
>> 10, for example, you have not gained anything. But if you repartition, you
>> give it an opportunity to process things in more threads, so now more tasks can
>> execute in parallel.
>>
>> 2017-02-13 7:05 GMT-08:00 Cosmin Posteuca <cosmin.poste...@gmail.com>:
>>
>>> Hi,
>>>
>>> I think i don't understand enough how to launch jobs.
>>>
>>> I have one job which takes 60 seconds to finish. I run it with following
>>> command:
>>>
>>> spark-submit --executor-cores 1 \
>>>  --executor-memory 1g \
>>>  --driver-memory 1g \
>>>  --master yarn \
>>>  --deploy-mode cluster \
>>>  --conf spark.dynamicAllocation.enabled=true \
>>>  --conf spark.shuffle.service.enabled=true \
>>>  --conf spark.dynamicAllocation.minExecutors=1 \
>>>  --conf spark.dynamicAllocation.maxExecutors=4 \
>>>  --conf spark.dynamicAllocation.initialExecutors=4 \
>>>  --conf spark.executor.instances=4 \
>>>
>>> If i increase number of partitions from code and number of executors the 
>>> app will finish faster, which it's ok. But if i increase only 
>>> executor-cores the finish time is the same, and i don't understand why. I 
>>> expect the time to be lower than initial time.
>>>
>>> My second problem is if i launch twice above code i expect that both jobs 
>>> to finish in 60 seconds, but this don't happen. Both jobs finish after 120 
>>> seconds and i don't understand why.
>>>
>>> I run this code on AWS EMR, on 2 instances(4 cpu each, and each cpu has 2 
>>> threads). From what i saw in default EMR configurations, yarn is set on 
>>> FIFO(default) mode with CapacityScheduler.
>>>
>>> What do you think about this problems?
>>>
>>> Thanks,
>>>
>>> Cosmin
>>>
>>>
>>
>>
>> --
>>
>>
>> *Sincerely yours Egor Pakhomov*
>>
>
>
>
> --
>
>
> *Sincerely yours Egor Pakhomov*
>


[Spark Launcher] How to launch parallel jobs?

2017-02-13 Thread Cosmin Posteuca
Hi,

I don't think I understand well enough how to launch jobs.

I have one job which takes 60 seconds to finish. I run it with the following
command:

spark-submit --executor-cores 1 \
 --executor-memory 1g \
 --driver-memory 1g \
 --master yarn \
 --deploy-mode cluster \
 --conf spark.dynamicAllocation.enabled=true \
 --conf spark.shuffle.service.enabled=true \
 --conf spark.dynamicAllocation.minExecutors=1 \
 --conf spark.dynamicAllocation.maxExecutors=4 \
 --conf spark.dynamicAllocation.initialExecutors=4 \
 --conf spark.executor.instances=4 \

If I increase the number of partitions in the code and the number of
executors, the app finishes faster, which is OK. But if I increase only
executor-cores the finish time is the same, and I don't understand
why. I expect the time to be lower than the initial time.

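The partition change I make in code looks roughly like this (illustrative only;
`spark` is assumed to be the SparkSession):

// More partitions -> more tasks that can run at the same time; extra
// executor-cores do not help if the stage only has a few partitions.
val data  = spark.sparkContext.parallelize(1 to 1000000, numSlices = 4) // 4 tasks per stage
val wider = data.repartition(16)                                        // now up to 16 parallel tasks
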
My second problem is that if I launch the above code twice I expect both
jobs to finish in 60 seconds, but this doesn't happen. Both jobs finish
after 120 seconds and I don't understand why.

I run this code on AWS EMR, on 2 instances (4 CPUs each, and each CPU
has 2 threads). From what I saw in the default EMR configuration, yarn is
set to FIFO (default) mode with the CapacityScheduler.

What do you think about these problems?

Thanks,

Cosmin


Re: [Spark Context]: How to add on demand jobs to an existing spark context?

2017-02-10 Thread Cosmin Posteuca
Thank you very much for your answers. Now I understand better what I have
to do! Thank you!

On Wed, 8 Feb 2017 at 22:37, Gourav Sengupta <gourav.sengu...@gmail.com>
wrote:

> Hi,
>
> I am not quite sure of your use case here, but I would use spark-submit
> and submit sequential jobs as steps to an EMR cluster.
>
>
> Regards,
> Gourav
>
> On Wed, Feb 8, 2017 at 11:10 AM, Cosmin Posteuca <
> cosmin.poste...@gmail.com> wrote:
>
> I tried to run some tests on EMR in yarn cluster mode.
>
> I have a cluster with 16 cores (8 processors with 2 threads each). If I run
> one job (using 5 cores) it takes 90 seconds; if I run 2 jobs simultaneously,
> both finish in 170 seconds. If I run 3 jobs simultaneously, all three finish
> in 240 seconds.
>
> If I run 6 jobs, I expect the first 3 jobs to finish simultaneously in 240
> seconds, and the next 3 jobs to finish 480 seconds after cluster start time.
> But that doesn't happen. My first job finished after 120 seconds, the second
> finished after 180 seconds, the third finished after 240 seconds, the fourth
> and the fifth finished simultaneously after 360 seconds, and the last finished
> after 400 seconds.
>
> I expected them to run in FIFO mode, but that didn't happen. It seems to be a
> combination of FIFO and FAIR.
>
> Is this the correct behavior of Spark?
>
> Thank you!
>
> 2017-02-08 9:29 GMT+02:00 Gourav Sengupta <gourav.sengu...@gmail.com>:
>
> Hi,
>
> Michael's answer will solve the problem in case you are using only a SQL-based
> solution.
>
> Otherwise please refer to the wonderful details mentioned here
> https://spark.apache.org/docs/latest/job-scheduling.html. With EMR 5.3.0
> released, Spark 2.1.0 is available in AWS.
>
> (note that there is an issue with using zeppelin in it and I have raised
> it as an issue to AWS and they are looking into it now)
>
> Regards,
> Gourav Sengupta
>
> On Tue, Feb 7, 2017 at 10:37 PM, Michael Segel <msegel_had...@hotmail.com>
> wrote:
>
> Why couldn’t you use the spark thrift server?
>
> On Feb 7, 2017, at 1:28 PM, Cosmin Posteuca <cosmin.poste...@gmail.com>
> wrote:
>
> answer for Gourav Sengupta
>
> I want to use the same spark application because I want it to work as a FIFO
> scheduler. My problem is that I have many jobs (not so big), and if I run an
> application for every job my cluster will split resources like a FAIR
> scheduler (that's what I observe, maybe I'm wrong), and there is the
> possibility of creating a bottleneck. The start time isn't a problem for me,
> because it isn't a real-time application.
>
> I need a business solution; that's the reason why I can't use code from
> GitHub.
>
> Thanks!
>
> 2017-02-07 19:55 GMT+02:00 Gourav Sengupta <gourav.sengu...@gmail.com>:
>
> Hi,
>
> May I ask the reason for using the same spark application? Is it because
> of the time it takes in order to start a spark context?
>
> On another note you may want to look at the number of contributors in a
> github repo before choosing a solution.
>
> Regards,
> Gourav
>
> On Tue, Feb 7, 2017 at 5:26 PM, vincent gromakowski
> <vincent.gromakow...@gmail.com> wrote:
>
> Spark jobserver or Livy server are the best options for a pure technical API.
> If you want to publish a business API you will probably have to build your
> own app like the one I wrote a year ago:
> https://github.com/elppc/akka-spark-experiments
> It combines Akka actors and a shared Spark context to serve concurrent
> subsecond jobs.
>
> 2017-02-07 15:28 GMT+01:00 ayan guha <guha.a...@gmail.com>:
>
> I think you are looking for Livy or the Spark jobserver.
>
> On Wed, 8 Feb 2017 at 12:37 am, Cosmin Posteuca <cosmin.poste...@gmail.com>
> wrote:
>
> I want to run different jobs on demand with the same spark context, but I
> don't know exactly how I can do this.
>
> I try to get the current context, but it seems to create a new spark context
> (with new executors).
>
> I call spark-submit to add new jobs.
>
> I run code on Amazon 

Re: [Spark Context]: How to add on demand jobs to an existing spark context?

2017-02-08 Thread Cosmin Posteuca
I tried to run some tests on EMR in yarn cluster mode.

I have a cluster with 16 cores (8 processors with 2 threads each). If I run
one job (using 5 cores) it takes 90 seconds; if I run 2 jobs simultaneously,
both finish in 170 seconds. If I run 3 jobs simultaneously, all three finish
in 240 seconds.

If I run 6 jobs, I expect the first 3 jobs to finish simultaneously in 240
seconds, and the next 3 jobs to finish 480 seconds after cluster start time.
But that doesn't happen. My first job finished after 120 seconds, the second
finished after 180 seconds, the third finished after 240 seconds, the fourth
and the fifth finished simultaneously after 360 seconds, and the last finished
after 400 seconds.

I expected them to run in FIFO mode, but that didn't happen. It seems to be a
combination of FIFO and FAIR.

Is this the correct behavior of Spark?

Thank you!

2017-02-08 9:29 GMT+02:00 Gourav Sengupta <gourav.sengu...@gmail.com>:

> Hi,
>
> Michael's answer will solve the problem in case you are using only a SQL-based
> solution.
>
> Otherwise please refer to the wonderful details mentioned here
> https://spark.apache.org/docs/latest/job-scheduling.html. With EMR 5.3.0
> released, Spark 2.1.0 is available in AWS.
>
> (note that there is an issue with using zeppelin in it and I have raised
> it as an issue to AWS and they are looking into it now)
>
> Regards,
> Gourav Sengupta
>
> On Tue, Feb 7, 2017 at 10:37 PM, Michael Segel <msegel_had...@hotmail.com>
> wrote:
>
>> Why couldn’t you use the spark thrift server?
>>
>>
>> On Feb 7, 2017, at 1:28 PM, Cosmin Posteuca <cosmin.poste...@gmail.com>
>> wrote:
>>
>> answer for Gourav Sengupta
>>
>> I want to use the same spark application because I want it to work as a FIFO
>> scheduler. My problem is that I have many jobs (not so big), and if I run an
>> application for every job my cluster will split resources like a FAIR
>> scheduler (that's what I observe, maybe I'm wrong), and there is the possibility
>> of creating a bottleneck. The start time isn't a problem for me, because
>> it isn't a real-time application.
>>
>> I need a business solution; that's the reason why I can't use code from
>> GitHub.
>>
>> Thanks!
>>
>> 2017-02-07 19:55 GMT+02:00 Gourav Sengupta <gourav.sengu...@gmail.com>:
>>
>>> Hi,
>>>
>>> May I ask the reason for using the same spark application? Is it because
>>> of the time it takes in order to start a spark context?
>>>
>>> On another note you may want to look at the number of contributors in a
>>> github repo before choosing a solution.
>>>
>>>
>>> Regards,
>>> Gourav
>>>
>>> On Tue, Feb 7, 2017 at 5:26 PM, vincent gromakowski <
>>> vincent.gromakow...@gmail.com> wrote:
>>>
>>>> Spark jobserver or Livy server are the best options for pure technical
>>>> API.
>>>> If you want to publish business API you will probably have to build you
>>>> own app like the one I wrote a year ago https://github.com/elppc/akka-
>>>> spark-experiments
>>>> It combines Akka actors and a shared Spark context to serve concurrent
>>>> subsecond jobs
>>>>
>>>>
>>>> 2017-02-07 15:28 GMT+01:00 ayan guha <guha.a...@gmail.com>:
>>>>
>>>>> I think you are loking for livy or spark  jobserver
>>>>>
>>>>> On Wed, 8 Feb 2017 at 12:37 am, Cosmin Posteuca <
>>>>> cosmin.poste...@gmail.com> wrote:
>>>>>
>>>>>> I want to run different jobs on demand with same spark context, but i
>>>>>> don't know how exactly i can do this.
>>>>>>
>>>>>> I try to get current context, but seems it create a new spark
>>>>>> context(with new executors).
>>>>>>
>>>>>> I call spark-submit to add new jobs.
>>>>>>
>>>>>> I run code on Amazon EMR(3 instances, 4 core & 16GB ram / instance),
>>>>>> with yarn as resource manager.
>>>>>>
>>>>>> My code:
>>>>>>
>>>>>> val sparkContext = SparkContext.getOrCreate()
>>>>>> val content = 1 to 4
>>>>>> val result = sparkContext.parallelize(content, 5)
>>>>>> result.map(value => value.toString).foreach(loop)
>>>>>>
>>>>>> def loop(x: String): Unit = {
>>>>>>for (a <- 1 to 3000) {
>>>>>>
>>>>>>}
>>>>>> }
>

Re: [Spark Context]: How to add on demand jobs to an existing spark context?

2017-02-07 Thread Cosmin Posteuca
answer for Gourav Sengupta

I want to use the same spark application because I want it to work as a FIFO
scheduler. My problem is that I have many jobs (not so big), and if I run an
application for every job my cluster will split resources like a FAIR
scheduler (that's what I observe, maybe I'm wrong), and there is the
possibility of creating a bottleneck. The start time isn't a problem for me,
because it isn't a real-time application.

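What I have in mind is roughly the following (a minimal sketch, not a working
solution; the Job trait, the queue and all names are illustrative): one
long-running driver that owns a single SparkContext and pulls job requests off
a queue, so they execute FIFO:

import java.util.concurrent.LinkedBlockingQueue
import org.apache.spark.{SparkConf, SparkContext}

// A "job" here is just some code that needs the shared context.
trait Job { def run(sc: SparkContext): Unit }

object SharedContextServer {
  private val queue = new LinkedBlockingQueue[Job]()

  // called by whatever receives requests (REST endpoint, Akka actor, ...)
  def submit(job: Job): Unit = queue.put(job)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shared-context"))
    while (true) {
      val job = queue.take() // blocks until a job arrives
      job.run(sc)            // jobs run one after another -> FIFO
    }
  }
}
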
I need a business solution; that's the reason why I can't use code from
GitHub.

Thanks!

2017-02-07 19:55 GMT+02:00 Gourav Sengupta <gourav.sengu...@gmail.com>:

> Hi,
>
> May I ask the reason for using the same spark application? Is it because
> of the time it takes in order to start a spark context?
>
> On another note you may want to look at the number of contributors in a
> github repo before choosing a solution.
>
>
> Regards,
> Gourav
>
> On Tue, Feb 7, 2017 at 5:26 PM, vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
>> Spark jobserver or Livy server are the best options for a pure technical
>> API.
>> If you want to publish a business API you will probably have to build your
>> own app like the one I wrote a year ago:
>> https://github.com/elppc/akka-spark-experiments
>> It combines Akka actors and a shared Spark context to serve concurrent
>> subsecond jobs.
>>
>>
>> 2017-02-07 15:28 GMT+01:00 ayan guha <guha.a...@gmail.com>:
>>
>>> I think you are loking for livy or spark  jobserver
>>>
>>> On Wed, 8 Feb 2017 at 12:37 am, Cosmin Posteuca <
>>> cosmin.poste...@gmail.com> wrote:
>>>
>>>> I want to run different jobs on demand with same spark context, but i
>>>> don't know how exactly i can do this.
>>>>
>>>> I try to get current context, but seems it create a new spark
>>>> context(with new executors).
>>>>
>>>> I call spark-submit to add new jobs.
>>>>
>>>> I run code on Amazon EMR(3 instances, 4 core & 16GB ram / instance),
>>>> with yarn as resource manager.
>>>>
>>>> My code:
>>>>
>>>> val sparkContext = SparkContext.getOrCreate()
>>>> val content = 1 to 4
>>>> val result = sparkContext.parallelize(content, 5)
>>>> result.map(value => value.toString).foreach(loop)
>>>>
>>>> def loop(x: String): Unit = {
>>>>for (a <- 1 to 3000) {
>>>>
>>>>}
>>>> }
>>>>
>>>> spark-submit:
>>>>
>>>> spark-submit --executor-cores 1 \
>>>>  --executor-memory 1g \
>>>>  --driver-memory 1g \
>>>>  --master yarn \
>>>>  --deploy-mode cluster \
>>>>  --conf spark.dynamicAllocation.enabled=true \
>>>>  --conf spark.shuffle.service.enabled=true \
>>>>  --conf spark.dynamicAllocation.minExecutors=1 \
>>>>  --conf spark.dynamicAllocation.maxExecutors=3 \
>>>>  --conf spark.dynamicAllocation.initialExecutors=3 \
>>>>  --conf spark.executor.instances=3 \
>>>>
>>>> If i run twice spark-submit it create 6 executors, but i want to run
>>>> all this jobs on same spark application.
>>>>
>>>> How can achieve adding jobs to an existing spark application?
>>>>
>>>> I don't understand why SparkContext.getOrCreate() don't get existing
>>>> spark context.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Cosmin P.
>>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>


Re: [Spark Context]: How to add on demand jobs to an existing spark context?

2017-02-07 Thread Cosmin Posteuca
Response for vincent:

Thanks for the answer!

Yes, I need a business solution; that's the reason why I can't use the Spark
jobserver or Livy solutions. I will look at your GitHub repo to see how to
build such a system.

But I don't understand why Spark doesn't have a solution for this kind of
problem, and why I can't get the existing context and run some code on it.

Thanks

2017-02-07 19:26 GMT+02:00 vincent gromakowski <
vincent.gromakow...@gmail.com>:

> Spark jobserver or Livy server are the best options for a pure technical API.
> If you want to publish a business API you will probably have to build your
> own app like the one I wrote a year ago:
> https://github.com/elppc/akka-spark-experiments
> It combines Akka actors and a shared Spark context to serve concurrent
> subsecond jobs.
>
>
> 2017-02-07 15:28 GMT+01:00 ayan guha <guha.a...@gmail.com>:
>
>> I think you are looking for Livy or the Spark jobserver.
>>
>> On Wed, 8 Feb 2017 at 12:37 am, Cosmin Posteuca <
>> cosmin.poste...@gmail.com> wrote:
>>
>>> I want to run different jobs on demand with same spark context, but i
>>> don't know how exactly i can do this.
>>>
>>> I try to get current context, but seems it create a new spark
>>> context(with new executors).
>>>
>>> I call spark-submit to add new jobs.
>>>
>>> I run code on Amazon EMR(3 instances, 4 core & 16GB ram / instance),
>>> with yarn as resource manager.
>>>
>>> My code:
>>>
>>> val sparkContext = SparkContext.getOrCreate()
>>> val content = 1 to 4
>>> val result = sparkContext.parallelize(content, 5)
>>> result.map(value => value.toString).foreach(loop)
>>>
>>> def loop(x: String): Unit = {
>>>for (a <- 1 to 3000) {
>>>
>>>}
>>> }
>>>
>>> spark-submit:
>>>
>>> spark-submit --executor-cores 1 \
>>>  --executor-memory 1g \
>>>  --driver-memory 1g \
>>>  --master yarn \
>>>  --deploy-mode cluster \
>>>  --conf spark.dynamicAllocation.enabled=true \
>>>  --conf spark.shuffle.service.enabled=true \
>>>  --conf spark.dynamicAllocation.minExecutors=1 \
>>>  --conf spark.dynamicAllocation.maxExecutors=3 \
>>>  --conf spark.dynamicAllocation.initialExecutors=3 \
>>>  --conf spark.executor.instances=3 \
>>>
>>> If i run twice spark-submit it create 6 executors, but i want to run all
>>> this jobs on same spark application.
>>>
>>> How can achieve adding jobs to an existing spark application?
>>>
>>> I don't understand why SparkContext.getOrCreate() don't get existing
>>> spark context.
>>>
>>>
>>> Thanks,
>>>
>>> Cosmin P.
>>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


[Spark Context]: How to add on demand jobs to an existing spark context?

2017-02-07 Thread Cosmin Posteuca
I want to run different jobs on demand with the same spark context, but I don't
know exactly how I can do this.

I try to get the current context, but it seems to create a new spark context
(with new executors).

I call spark-submit to add new jobs.

I run the code on Amazon EMR (3 instances, 4 cores & 16 GB RAM per instance),
with yarn as the resource manager.

My code:

val sparkContext = SparkContext.getOrCreate()
val content = 1 to 4
val result = sparkContext.parallelize(content, 5)
result.map(value => value.toString).foreach(loop)

def loop(x: String): Unit = {
   for (a <- 1 to 3000) {

   }
}

spark-submit:

spark-submit --executor-cores 1 \
 --executor-memory 1g \
 --driver-memory 1g \
 --master yarn \
 --deploy-mode cluster \
 --conf spark.dynamicAllocation.enabled=true \
 --conf spark.shuffle.service.enabled=true \
 --conf spark.dynamicAllocation.minExecutors=1 \
 --conf spark.dynamicAllocation.maxExecutors=3 \
 --conf spark.dynamicAllocation.initialExecutors=3 \
 --conf spark.executor.instances=3 \

If I run spark-submit twice it creates 6 executors, but I want to run all
these jobs in the same spark application.

How can I add jobs to an existing spark application?

I don't understand why SparkContext.getOrCreate() doesn't get the existing
spark context.


Thanks,

Cosmin P.