It is possible to start multiple concurrent drivers. Spark dynamically
allocates ports per "spark application" on the driver, master, and workers
from a port range. When you collect results back to the driver, they do not
go through the master. The master is mostly there as a coordinator between
the driver and the cluster of worker nodes; otherwise the workers and
driver communicate directly for the underlying workload.
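
If you ever need to control which ports a given driver uses (for example, to
keep two concurrent drivers on the same host in non-overlapping ranges), a
rough sketch looks like this. The property names are standard Spark settings;
the port numbers and app name are just placeholders:

import org.apache.spark.{SparkConf, SparkContext}

// Rough sketch: pin the base ports for one application so a second concurrent
// driver on the same host can use a different, non-overlapping range.
val conf = new SparkConf()
  .setAppName("report-driver-1")              // placeholder name
  .set("spark.driver.port", "40000")          // port executors connect back to on the driver
  .set("spark.blockManager.port", "40010")    // base port for block transfers
  .set("spark.port.maxRetries", "32")         // how many successive ports Spark tries if one is busy

val sc = new SparkContext(conf)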

A "spark application" relates to one instance of a SparkContext
programmatically or to one call to one of the spark submit scripts.
Assuming you don't have dynamic resource allocation setup, each application
takes a fixed amount of the cluster resources to run. So as long as you
subdivide your cluster resources properly you can run multiple concurrent
applications against it. We are doing this in production presently.
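
As a rough sketch of what I mean by subdividing, assuming a standalone
cluster: cap what each application may take so several fit side by side. The
property names are real Spark settings; the numbers are placeholders you
would size to your own cluster:

import org.apache.spark.{SparkConf, SparkContext}

// Rough sketch: cap this one application's share of a standalone cluster
// so other applications can run alongside it.
val conf = new SparkConf()
  .setAppName("report-job-A")             // placeholder name
  .set("spark.cores.max", "8")            // total cores this app may hold across the cluster
  .set("spark.executor.memory", "4g")     // memory per executor

val sc = new SparkContext(conf)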

Alternatively, as Igor suggests, you can share one Spark application and
launch different jobs within it. In that case the jobs share the resources
allocated to the application. One effect of this is that you only have a
finite number of concurrent Spark tasks (roughly, one task executes one
partition of a job at a time). If you launch multiple independent jobs
within the same application you will likely want to enable fair job
scheduling; otherwise the stages of independent jobs will run in FIFO order
instead of interleaving their execution.
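
A minimal sketch of that setup, assuming a single shared SparkContext (the
pool names and the toy jobs are purely illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Turn on FAIR scheduling for the shared application.
val conf = new SparkConf()
  .setAppName("shared-app")
  .set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)

// Submit a job tagged with a scheduler pool. The pool property is thread-local,
// so each submitting thread tags its own job's stages.
def runJob(pool: String, data: Seq[Int]): Long = {
  sc.setLocalProperty("spark.scheduler.pool", pool)
  sc.parallelize(data).map(_ * 2).count()
}

// Launch two independent jobs from separate threads so the fair scheduler
// can interleave their stages instead of running them FIFO.
val jobA = Future { runJob("reportA", 1 to 1000000) }
val jobB = Future { runJob("reportB", 1 to 1000000) }
println(Await.result(Future.sequence(Seq(jobA, jobB)), Duration.Inf))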

Hope this helps,
Richard

On Thu, Jun 4, 2015 at 11:20 AM, Igor Berman <igor.ber...@gmail.com> wrote:

> Hi,
> as far as I understand you shouldn't send the data back to the driver. Suppose your
> input is a file in HDFS/S3 or a set of Cassandra partitions; you should structure your
> job so that every executor/worker of Spark handles part of the input, transforms and
> filters it, and at the end writes back to Cassandra as output (once again, every
> executor/core inside a worker will write its part of the output, in your case its
> part of the report).
>
> In general I find that submitting multiple jobs in the same Spark context (aka
> driver) is more performant (you don't pay the startup/shutdown time). For this,
> some people use a REST server to submit jobs to a long-running Spark
> context (driver).
>
> I'm not sure you can run multiple concurrent drivers because of ports
>
> On 4 June 2015 at 17:30, Giuseppe Sarno <giuseppesa...@fico.com> wrote:
>
>>  Hello,
>>
>> I am relatively new to spark and I am currently trying to understand how
>> to scale large numbers of jobs with spark.
>>
>> I understand that the Spark architecture is split into “Driver”, “Master” and
>> “Workers”. The Master has a standby node in case of failure and the Workers can
>> scale out.
>>
>> All the examples I have seen show Spark being able to distribute the load
>> to the workers and return a small amount of data to the Driver. In my case
>> I would like to explore a scenario where I need to generate a large
>> report on data stored in Cassandra, and understand how the Spark architecture
>> will handle this case when multiple report jobs are running in parallel.
>>
>> According to this presentation
>> https://trongkhoanguyenblog.wordpress.com/2015/01/07/understand-the-spark-deployment-modes/
>> responses from the workers go through the Master and finally to the Driver.
>> Does this mean that the Driver and/or Master is a single point for all the
>> responses coming back from the workers?
>>
>> Is it possible to start multiple concurrent Drivers?
>>
>>
>>
>> Regards,
>>
>> Giuseppe.
>>
>>
>>
>
>
