Re: Spark launching without all of the requested YARN resources

2015-07-02 Thread Arun Luthra
Thanks Sandy et al, I will try that. I like that I can choose the
minRegisteredResourcesRatio.


Re: Spark launching without all of the requested YARN resources

2015-06-24 Thread Steve Loughran




The "wait until there's space" launch policy is known as Gang Scheduling; 
https://issues.apache.org/jira/browse/YARN-624 covers what would be needed 
there.

1. It's not in YARN

2. For analytics workloads, it's not clear you benefit. You would wait a very 
long time(*) for the requirements to be satisfied. The current YARN scheduling 
and placement algorithms assume that you'd prefer a timely container launch to 
an extended wait for containers in the right place, and expect algorithms to 
work in a degraded form with a reduced number of workers.

3. Where it really matters is long-lived applications where you need some 
quorum of container-hosted processes, or where performance collapses utterly 
below a threshold. Things like HBase on YARN are one example, but Spark 
Streaming could be another.

In the absence of YARN support, it can be implemented in the application by 
having the YARN-hosted application (here: Spark) get the containers, start up a 
process on each one, but not actually start accepting/performing work until a 
threshold of containers is reached or some timeout has occurred.
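A minimal sketch of that application-side pattern, with hypothetical helper 
names (this is not part of any Spark or YARN API; `registered` stands in for 
whatever callback the application uses to count checked-in containers):

```python
import time


def wait_for_quorum(registered, required, min_ratio=1.0, timeout_s=600.0,
                    poll_s=1.0, clock=time.monotonic, sleep=time.sleep):
    """Block until registered()/required reaches min_ratio or the timeout expires.

    `registered` is a callable returning how many container-hosted processes
    have started and checked in; `required` is the number requested.
    Returns True if the quorum was reached, False on timeout.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        if registered() / required >= min_ratio:
            return True
        sleep(poll_s)
    # One last check in case the quorum arrived right at the deadline.
    return registered() / required >= min_ratio
```

The application would call this once at startup and only begin accepting work 
when it returns True, falling back to degraded operation (or aborting) on False.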

If you wanted to do that in Spark, you could raise the idea on the Spark dev 
list and see what people think.

-Steve

(*) i.e. forever


Re: Spark launching without all of the requested YARN resources

2015-06-24 Thread Sandy Ryza
Hi Arun,

You can achieve this by
setting spark.scheduler.maxRegisteredResourcesWaitingTime to some really
high number and spark.scheduler.minRegisteredResourcesRatio to 1.0.
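For example, both properties can be passed via --conf on the command line; the 
executor count, wait time, and application jar below are placeholders, not 
values from this thread:

```shell
# Wait up to an hour for 100% of the requested executors to register
# before the driver starts scheduling tasks.
spark-submit \
  --master yarn \
  --num-executors 50 \
  --conf spark.scheduler.minRegisteredResourcesRatio=1.0 \
  --conf spark.scheduler.maxRegisteredResourcesWaitingTime=3600s \
  my_app.jar
```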

-Sandy




Spark launching without all of the requested YARN resources

2015-06-23 Thread Arun Luthra
Sometimes when my Hortonworks YARN-enabled cluster is fairly busy, Spark (via
spark-submit) will begin its processing even though it apparently did not get
all of the requested resources, and it then runs very slowly.

Is there a way to force Spark/YARN to only begin when it has the full set
of resources that I request?

Thanks,
Arun


Re: Spark launching without all of the requested YARN resources

2015-06-23 Thread canan chen
Why do you want it to wait until all the resources are ready? Making it start
as early as possible should let it complete earlier and increase the
utilization of resources.
