Having sufficient resources doesn't by itself mean another job can launch
an AM.  The fair scheduler also caps the share of a queue that AMs may use,
so you might want to check maxAMShare and queueMaxAMShareDefault.
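
For reference, here is roughly where those two settings live in the
allocation file.  This is only a sketch: the queue name "myqueue" is a
placeholder, and the 0.5 values are what I understand the stock default to
be, so check them against the Fair Scheduler docs for your release.

```xml
<?xml version="1.0"?>
<allocations>
  <!-- Default AM share for any queue that does not set its own. -->
  <queueMaxAMShareDefault>0.5</queueMaxAMShareDefault>

  <!-- "myqueue" is a placeholder name, not taken from your setup. -->
  <queue name="myqueue">
    <!-- Fraction of this queue's fair share that AMs may occupy.
         A value of -1.0 disables the check. -->
    <maxAMShare>0.5</maxAMShare>
  </queue>
</allocations>
```

If the AMs already running in the queue are at that limit, a new
application can sit in ACCEPTED even though the cluster has free capacity.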

Given that you have sufficient resources, you could be running into
YARN-3491.

I don't know whether you have the option, but CDH 5.3.3 is pretty old at
this point.  CDH 5.3.10/5.4.10/5.5.2 have the latest bug fixes.

-Ray

On Thu, Apr 28, 2016 at 12:03 PM, Matt Cheah <mch...@palantir.com> wrote:

> Hi,
>
> I've been sporadically seeing an issue when using Hadoop YARN. I'm using
> Hadoop 2.5.0, CDH5.3.3.
>
> When I've configured the stack to use the fair scheduler protocol, after some
> period of time of the cluster being alive and running jobs, I'm noticing that
> when I submit a job, the job will be stuck in the ACCEPTED state even though
> the cluster has sufficient resources to spawn an application master container
> as well as the queue I'm submitting to having sufficient resources available.
> Furthermore, all jobs submitted to that queue will be stuck in the ACCEPTED
> state. I can unblock job submission by going into the allocation XML file,
> renaming the queue, and submitting jobs to that renamed queue instead.
> However the queue has only changed name, and all of its other settings have
> been preserved.
>
> It is clearly untenable for me to have to change the queues that I'm using
> sometimes. This appears to happen irrespective of the settings of the queue,
> e.g. its weight or its minimum resource share. The events leading up to this
> occurrence are strictly unpredictable and I have no concrete way to reproduce
> the issue. The logs don't show anything interesting either; the resource
> manager just states that it schedules an attempt for the application
> submitted to the bad queue, but the attempt's application master is never
> allocated to a container anywhere.
>
> I have looked around the YARN bug base and couldn't find any similar issues.
> I've also used jstack to inspect the Resource Manager process, but nothing is
> obviously wrong there. I was wondering if anyone has encountered a similar
> issue before. I apologize that the description is vague, but it's the best
> way I can describe it.
>
> Thanks,
>
> -Matt Cheah
>
>
>
