Hi,

I've been sporadically seeing an issue when using Hadoop YARN. I'm using Hadoop
2.5.0, CDH5.3.3.

With the stack configured to use the Fair Scheduler, after the cluster has been
up and running jobs for some time, I notice that a newly submitted job gets
stuck in the ACCEPTED state. This happens even though both the cluster and the
queue I'm submitting to have sufficient resources available to launch an
application master container. Furthermore, every job submitted to that queue
gets stuck in the ACCEPTED state. I can unblock job submission by editing the
allocation XML file, renaming the queue, and submitting jobs to the renamed
queue instead. Only the queue's name changes; all of its other settings are
preserved.
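
For anyone who wants to check whether they are hitting the same symptom, here
is a minimal sketch that lists ACCEPTED applications per queue. It assumes the
ResourceManager REST API is reachable over HTTP (8088 is the default web port;
the localhost address is a placeholder), and the helper names are my own,
purely for illustration.

```python
import json
import sys
import urllib.request
from collections import defaultdict


def fetch_accepted_apps(rm_url):
    """Ask the ResourceManager REST API for applications in the ACCEPTED state."""
    url = rm_url.rstrip("/") + "/ws/v1/cluster/apps?states=ACCEPTED"
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def group_by_queue(payload):
    """Map queue name -> list of application IDs from an apps response."""
    # The API returns "apps": null when no application matches the filter.
    apps = (payload.get("apps") or {}).get("app") or []
    by_queue = defaultdict(list)
    for app in apps:
        by_queue[app.get("queue", "<unknown>")].append(app["id"])
    return dict(by_queue)


if __name__ == "__main__":
    # Assumption: the RM web address is passed as the first argument.
    rm_url = sys.argv[1] if len(sys.argv) > 1 else "http://localhost:8088"
    grouped = group_by_queue(fetch_accepted_apps(rm_url))
    for queue, ids in sorted(grouped.items()):
        print(f"{queue}: {len(ids)} app(s) in ACCEPTED: {', '.join(ids)}")
```

A queue whose ACCEPTED count keeps growing while the cluster has free capacity
would match what I am describing.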

Having to rename the queues I'm using from time to time is clearly untenable.
The problem appears to occur irrespective of the queue's settings, e.g. its
weight or its minimum resource share. I can't predict when it will happen, and
I have no concrete way to reproduce it. The logs don't show anything
interesting either: the resource manager just states that it schedules an
attempt for the application submitted to the bad queue, but the attempt's
application master is never allocated a container anywhere.

I have searched the YARN bug tracker and couldn't find any similar issues.
I've also used jstack to inspect the Resource Manager process, but nothing is
obviously wrong there. Has anyone encountered a similar issue before? I
apologize that the description is vague, but it's the best I can do with what
I've observed so far.

Thanks,

-Matt Cheah