On 24 Jun 2015, at 05:55, canan chen <ccn...@gmail.com> wrote:

> Why do you want it to wait to start until all the resources are ready? Making it start as early as possible should make it complete earlier and increase the utilization of resources.
>
> On Tue, Jun 23, 2015 at 10:34 PM, Arun Luthra <arun.lut...@gmail.com> wrote:
>
>> Sometimes if my Hortonworks yarn-enabled cluster is fairly busy, Spark (via spark-submit) will begin its processing even though it apparently did not get all of the requested resources; it is running very slowly.
>>
>> Is there a way to force Spark/YARN to only begin when it has the full set of resources that I request?
>>
>> Thanks,
>> Arun

The "wait until there's space" launch policy is known as Gang Scheduling; https://issues.apache.org/jira/browse/YARN-624 covers what would be needed there.

1. It's not in YARN.

2. For analytics workloads, it's not clear you benefit. You would wait a very long time(*) for the requirements to be satisfied. The current YARN scheduling and placement algorithms assume that you'd prefer "timely container launch" to "extended wait for containers in the right place", and expect algorithms to work in a degraded form with a reduced number of workers.

3. Where it really matters is long-lived applications where you need some quorum of container-hosted processes, or if performance collapses utterly below a threshold. Things like HBase on YARN are an example, but Spark Streaming could be another.

In the absence of YARN support, it can be implemented in the application by having the YARN-hosted application (here: Spark) get the containers, start up a process on each one, but not actually start accepting/performing work until a threshold of containers is reached or some timeout has occurred.

If you wanted to do that in Spark, you could raise the idea on the Spark dev list and see what people think.

-Steve

(*) i.e. forever
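
To make the application-level approach described above a bit more concrete, here is a minimal sketch of a "don't start work until enough workers have shown up, or we give up waiting" gate. It is written in plain Python rather than Spark's own Scala, and it is not Spark or YARN code: `StartGate`, `register_worker`, and `wait_until_ready` are hypothetical names made up for illustration, and the worker "registration" is assumed to be whatever callback the application receives when a container-hosted process comes up.

```python
import threading


class StartGate:
    """Hypothetical gate: hold work back until enough workers have
    registered, or until a timeout expires, whichever comes first."""

    def __init__(self, min_workers, timeout_s):
        self.min_workers = min_workers  # threshold of live workers
        self.timeout_s = timeout_s      # how long we are willing to wait
        self._count = 0
        self._cond = threading.Condition()

    def register_worker(self):
        # Assumed to be called once per container-hosted process
        # when that process has started and is ready for work.
        with self._cond:
            self._count += 1
            self._cond.notify_all()

    def wait_until_ready(self):
        # Blocks the caller. Returns True if the threshold was reached,
        # False if the timeout expired first (degraded start).
        with self._cond:
            return self._cond.wait_for(
                lambda: self._count >= self.min_workers,
                timeout=self.timeout_s,
            )
```

The driver side would call `wait_until_ready()` before handing out any work, and then decide what a `False` result means: start anyway with fewer workers, or fail the job. Without the timeout this turns into the "wait forever" case from the footnote.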