On 24 Jun 2015, at 05:55, canan chen <ccn...@gmail.com> wrote:

Why do you want it to wait until all the resources are ready? Making it start as 
early as possible should let it complete earlier and increase resource 
utilization.

On Tue, Jun 23, 2015 at 10:34 PM, Arun Luthra <arun.lut...@gmail.com> wrote:
Sometimes, when my Hortonworks YARN-enabled cluster is fairly busy, Spark (via 
spark-submit) will begin processing even though it apparently did not get all of 
the requested resources, and it runs very slowly.

Is there a way to force Spark/YARN to only begin when it has the full set of 
resources that I request?

Thanks,
Arun



The "wait until there's space" launch policy is known as Gang Scheduling, 
https://issues.apache.org/jira/browse/YARN-624 covers what would be needed 
there.

1. It's not in YARN

2. For analytics workloads, it's not clear you'd benefit. You could wait a very 
long time(*) for the requirements to be satisfied. The current YARN scheduling 
and placement algorithms assume that you'd prefer "timely container launch" to 
"extended wait for containers in the right place", and expect the application's 
algorithms to work in a degraded form with a reduced number of workers.

3. Where it really matters is long-lived applications where you need some 
quorum of container-hosted processes, or where performance collapses utterly 
below a threshold. Things like HBase on YARN are an example, but Spark Streaming 
could be another.

In the absence of YARN support, it can be implemented in the application by 
having the YARN-hosted application (here: Spark) get the containers, start up a 
process on each one, but not actually start accepting/performing work until a 
threshold of containers is reached or some timeout has occurred.
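The wait-for-quorum logic described above could be sketched roughly as below. 
This is an illustrative, framework-agnostic sketch, not code from Spark or YARN: 
`get_ready_count` stands in for whatever call reports how many worker 
processes have registered, and the function name and parameters are all 
hypothetical.

```python
import time

def wait_for_quorum(get_ready_count, threshold, timeout_s, poll_interval_s=1.0):
    """Poll until at least `threshold` workers are ready, or until
    `timeout_s` seconds have elapsed. Returns the number of workers
    that were ready when we stopped waiting; the caller decides whether
    to proceed in degraded form or give up."""
    deadline = time.monotonic() + timeout_s
    while True:
        ready = get_ready_count()  # hypothetical hook into the app master
        if ready >= threshold or time.monotonic() >= deadline:
            return ready
        time.sleep(poll_interval_s)
```

The key design point is that the timeout makes the policy a soft gang schedule: 
the application still launches eventually, matching YARN's preference for timely 
container launch, but it gets a window in which to accumulate a quorum first.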

If you wanted to do that in Spark, you could raise the idea on the Spark dev 
list and see what people think.

-Steve

(*) i.e. forever
