Re: Spark launching without all of the requested YARN resources
Thanks Sandy et al, I will try that. I like that I can choose the minRegisteredResourcesRatio.

On Wed, Jun 24, 2015 at 11:04 AM, Sandy Ryza sandy.r...@cloudera.com wrote:

Hi Arun, You can achieve this by setting spark.scheduler.maxRegisteredResourcesWaitingTime to some really high number and spark.scheduler.minRegisteredResourcesRatio to 1.0. -Sandy
Re: Spark launching without all of the requested YARN resources
On 24 Jun 2015, at 05:55, canan chen ccn...@gmail.com wrote:

Why do you want it to start only when all the resources are ready? Making it start as early as possible should make it complete earlier and increase the utilization of resources.

The "wait until there's space" launch policy is known as Gang Scheduling; https://issues.apache.org/jira/browse/YARN-624 covers what would be needed there.

1. It's not in YARN.

2. For analytics workloads, it's not clear you benefit. You would wait a very long time(*) for the requirements to be satisfied. The current YARN scheduling and placement algorithms assume that you'd prefer a timely container launch to an extended wait for containers in the right place, and expect applications to work in a degraded form with a reduced number of workers.

3. Where it really matters is long-lived applications where you need some quorum of container-hosted processes, or where performance collapses utterly below a threshold. Things like HBase on YARN are an example, but Spark Streaming could be another.

In the absence of YARN support, it can be implemented in the application by having the YARN-hosted application (here: Spark) get the containers, start up a process on each one, but not actually start accepting/performing work until a threshold of containers is reached or some timeout has occurred. If you wanted to do that in Spark, you could raise the idea on the spark dev lists and see what people think.

-Steve

(*) i.e. forever
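The threshold-or-timeout launch gate Steve describes could look roughly like the sketch below on the application side. Everything here (the function name, the polling callback, the fake registration counter) is invented for illustration; it is not Spark or YARN API, just the shape of the gating logic:

```python
import time


def wait_for_quorum(registered_count, required, ratio=1.0,
                    timeout_s=60.0, poll_s=0.5):
    """Block until registered_count() covers ratio * required, or until timeout.

    registered_count: zero-argument callable returning how many container
    processes have registered so far. Returns True once the quorum is
    reached, False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout_s
    needed = ratio * required
    while True:
        if registered_count() >= needed:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# Stand-in for containers coming up and registering, one per poll.
registered = []

def register_and_count():
    registered.append(object())
    return len(registered)

print(wait_for_quorum(register_and_count, required=3,
                      timeout_s=5.0, poll_s=0.01))  # True
```

A real implementation would replace the callback with whatever the application master uses to track live executors, and would decide separately what to do on timeout (proceed degraded, or abort).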
Re: Spark launching without all of the requested YARN resources
Hi Arun,

You can achieve this by setting spark.scheduler.maxRegisteredResourcesWaitingTime to some really high number and spark.scheduler.minRegisteredResourcesRatio to 1.0.

-Sandy
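For concreteness, here is a minimal sketch of passing Sandy's two properties to spark-submit. The 3600s wait is an arbitrary stand-in for "some really high number", not a value recommended in the thread:

```python
# Sandy's suggestion as spark-submit --conf flags: require every requested
# executor to register before task scheduling starts, and wait a long time
# (here, an illustrative one hour) for that to happen.
gang_launch_conf = {
    # Fraction of requested executors that must register before tasks start.
    "spark.scheduler.minRegisteredResourcesRatio": "1.0",
    # Upper bound on how long to wait for that ratio to be reached.
    "spark.scheduler.maxRegisteredResourcesWaitingTime": "3600s",
}

# Render as command-line flags for spark-submit.
flags = " ".join(f"--conf {k}={v}" for k, v in sorted(gang_launch_conf.items()))
print(flags)
```

The same pairs can instead be set on the SparkConf in the driver, or in spark-defaults.conf; note that the executors still launch and occupy containers while waiting, so only task scheduling is held back.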
Spark launching without all of the requested YARN resources
Sometimes if my Hortonworks YARN-enabled cluster is fairly busy, Spark (via spark-submit) will begin its processing even though it apparently did not get all of the requested resources, and it then runs very slowly. Is there a way to force Spark/YARN to only begin when it has the full set of resources that I request?

Thanks,
Arun
Re: Spark launching without all of the requested YARN resources
Why do you want it to start only when all the resources are ready? Making it start as early as possible should make it complete earlier and increase the utilization of resources.