Hi Hamel,

    Thanks for looking into the issue. What I am not understanding is:
after preemption, what share does the second queue get if the first queue
holds the entire cluster resource without releasing it? Is it the
instantaneous fair share or the steady-state fair share?

     Consider queues A and B (out of 230 queues in total); the total
cluster resource is <10 TB, 3000 cores>. If a job is submitted into queue
A, it will get 10 TB, 3000 cores and it does not release any resource. Now
if a second job is submitted into queue B, preemption will definitely
happen, but what share will queue B get after preemption? Is it
<10 TB, 3000> / 2 or <10 TB, 3000> / 230?
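
For concreteness, the two candidates work out roughly as follows (assuming
1 TB = 1024 GB and that only A and B are active):

    <10 TB, 3000> / 2   =  <5 TB, 1500 cores>   (divided among the 2 active queues)
    <10 TB, 3000> / 230 ~= <45 GB, 13 cores>    (divided among all 230 configured queues)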

We find that, after preemption, queue B gets only <10 TB, 3000> / 230,
because the first job is holding the resource. If the first job releases
the resource, the second queue will get <10 TB, 3000> / 2 based on higher
priority and reservation.

The question is: how much does preemption try to preempt from queue A if it
holds the entire resource without releasing it? We are not able to share
the actual configuration, but the answer to this question will help us.
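
To make the question concrete, here is a minimal, self-contained sketch in
plain Java (hypothetical names, heavily simplified; not the actual
FairScheduler.resToPreempt code you linked below). It only illustrates the
idea that the amount preempted for a starved queue seems to be capped at
min(share, demand) minus what the queue already holds, so whether "share"
means the instantaneous or the steady-state fair share decides between
<10 TB, 3000> / 2 and <10 TB, 3000> / 230.

    // Hypothetical, simplified sketch -- NOT the actual Hadoop source.
    // Resources are modeled as memory (GB) and cores only.
    public class PreemptionSketch {

        // Amount to preempt for a starved queue:
        // at most min(share, demand), minus what it already uses.
        static long[] resToPreempt(long shareMemGb, long shareCores,
                                   long demandMemGb, long demandCores,
                                   long usageMemGb, long usageCores) {
            long targetMem   = Math.min(shareMemGb, demandMemGb);
            long targetCores = Math.min(shareCores, demandCores);
            return new long[] {
                Math.max(0, targetMem - usageMemGb),
                Math.max(0, targetCores - usageCores)
            };
        }

        public static void main(String[] args) {
            // Queue B demands the whole cluster and currently holds nothing.
            long demandMem = 10240, demandCores = 3000;
            long usageMem = 0, usageCores = 0;

            // Share interpreted as steady-state: <10 TB, 3000> / 230
            long[] steady = resToPreempt(10240 / 230, 3000 / 230,
                                         demandMem, demandCores,
                                         usageMem, usageCores);
            // Share interpreted as instantaneous: <10 TB, 3000> / 2
            long[] inst = resToPreempt(10240 / 2, 3000 / 2,
                                       demandMem, demandCores,
                                       usageMem, usageCores);

            System.out.println("steady-state target:  "
                + steady[0] + " GB, " + steady[1] + " cores");
            System.out.println("instantaneous target: "
                + inst[0] + " GB, " + inst[1] + " cores");
        }
    }

Running this gives a target of about <44 GB, 13 cores> in the steady-state
case versus <5120 GB, 1500 cores> in the instantaneous case, which is
exactly the gap we are asking about.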


Thanks,
Prabhu Joseph




On Wed, Feb 24, 2016 at 10:03 PM, Hamel Kothari <hamelkoth...@gmail.com>
wrote:

> If all queues are identical, this behavior should not be happening.
> Preemption as designed in fair scheduler (IIRC) takes place based on the
> instantaneous fair share, not the steady state fair share. The fair
> scheduler docs
> <https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html>
> aren't super helpful on this but it does say in the Monitoring section that
> preemption won't take place if you're less than your instantaneous fair
> share (which might imply that it would occur if you were over your inst.
> fair share and someone had requested resources). The code for
> FairScheduler.resToPreempt
> <http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-yarn-server-resourcemanager/2.7.1/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#FairScheduler.resToPreempt%28org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue%2Clong%29>
> also seems to use getFairShare rather than getSteadyFairShare() for
> preemption so that would imply that it is using instantaneous fair share
> rather than steady state.
>
> Could you share your YARN site/fair-scheduler and Spark configurations?
> Could you also share the YARN Scheduler UI (specifically the top of the
> RM, which shows how many resources are in use)?
>
> Since it's not likely due to steady state fair share, some other possible
> reasons why this might be taking place (this is not remotely conclusive but
> with no information this is what comes to mind):
> - You're not reaching
> yarn.scheduler.fair.preemption.cluster-utilization-threshold. Perhaps due
> to core/memory ratio inconsistency with the cluster.
> - Your second job doesn't have a sufficient level of parallelism to
> request more executors than what it is receiving (perhaps there are fewer
> than 13 tasks at any point in time) and you don't have
> spark.dynamicAllocation.minExecutors set?
>
> -Hamel
>
> On Tue, Feb 23, 2016 at 8:20 PM Prabhu Joseph <prabhujose.ga...@gmail.com>
> wrote:
>
>> Hi All,
>>
>>  A YARN cluster with 352 nodes (10 TB, 3000 cores) has the Fair
>> Scheduler, with a root queue having 230 queues.
>>
>>     Each queue is configured with maxResources equal to the total cluster
>> resource. When a Spark job is submitted into queue A, it is given 10 TB,
>> 3000 cores according to the instantaneous fair share, and it holds the
>> entire resource without releasing it. After some time, when another job is
>> submitted into queue B, it gets the fair share of 45 GB and 13 cores, i.e.
>> (10 TB, 3000 cores)/230, using preemption. Now if some more jobs are
>> submitted into queue B, all the jobs in B have to share the 45 GB and 13
>> cores, whereas the job in queue A holds the entire cluster resource,
>> affecting the other jobs.
>>      This kind of issue often happens when the Spark job submitted first
>> holds the entire cluster resource. What is the best way to fix this issue?
>> Can we make preemption happen based on the instantaneous fair share
>> instead of the steady-state fair share, and will it help?
>>
>> Note:
>>
>> 1. We do not want to give a weight to a particular queue, because all
>> 230 queues are critical.
>> 2. Changing the queues into a nested hierarchy does not solve the issue.
>> 3. Adding maxResources to each queue won't allow the first job to take the
>> entire cluster resource, but configuring the optimal maxResources for 230
>> queues is difficult, and the first job then can't use the entire cluster
>> resource when the cluster is idle.
>> 4. We do not want to handle it in the Spark ApplicationMaster, because
>> then we would need to do the same for every other YARN application type
>> with similar behavior. We want YARN to control this behavior by killing
>> containers held by the first job for a long period.
>>
>>
>> Thanks,
>> Prabhu Joseph
>>
>>
