[jira] [Updated] (YARN-6710) There is a heavy bug in FSLeafQueue#amResourceUsage which will let the fair scheduler not assign container to the queue
[ https://issues.apache.org/jira/browse/YARN-6710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yufei Gu updated YARN-6710: --- Fix Version/s: (was: 2.8.0) > There is a heavy bug in FSLeafQueue#amResourceUsage which will let the fair > scheduler not assign container to the queue > --- > > Key: YARN-6710 > URL: https://issues.apache.org/jira/browse/YARN-6710 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler > Affects Versions: 2.7.2 > Reporter: daemon > Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png, > screenshot-4.png, screenshot-5.png > > > There are over three thousand nodes in my Hadoop production cluster, and we > use the Fair Scheduler. > Although the ResourceManager reports plenty of free resources, 46 applications > are pending. > Those applications still could not run after several hours, and in the end I > had to kill them. > I reproduced the problem in my test environment and found a bug in FSLeafQueue: > in an extreme scenario, FSLeafQueue#amResourceUsage grows beyond the AM > resources actually in use. > When the fair scheduler tries to assign a container to an application attempt, > it performs the following check: > !screenshot-2.png! > !screenshot-3.png! > Because the value of FSLeafQueue#amResourceUsage is invalid (greater than its > real value), it can exceed Resources.multiply(getFairShare(), maxAMShare), in > which case FSLeafQueue#canRunAppAM returns false and the fair scheduler will > not assign a container to the FSAppAttempt. > In this scenario, every application attempt stays pending and never gets any > resources. > I found the reason why so many applications in my leaf queue are pending, and > I will describe it below: > When the fair scheduler first assigns a container to an application attempt, > it does the following: > !screenshot-4.png!
> When the fair scheduler removes the application attempt from the leaf queue, > it does the following: > !screenshot-5.png! > But when the application attempt unregisters itself and all containers in > SchedulerApplicationAttempt#liveContainers have completed, an > APP_ATTEMPT_REMOVED event is sent to the fair scheduler, and it is > asynchronous. > Before the application attempt has been removed from the FSLeafQueue, there > may still be pending requests in the FSAppAttempt. > The fair scheduler will then assign a container to that FSAppAttempt, because > the size of liveContainers equals 1. > So the FSLeafQueue adds that container's resources to > FSLeafQueue#amResourceUsage again, which makes amResourceUsage larger than the > AM resources actually in use. > In the end, the value of FSLeafQueue#amResourceUsage is quite large even > though there is no application in the queue. > When a new application arrives and FSLeafQueue#amResourceUsage is greater than > Resources.multiply(getFairShare(), maxAMShare), the scheduler will never > assign a container to the queue. > All applications in the queue stay pending forever. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
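The accounting leak described above can be illustrated with a minimal sketch: the AM-share check charges amResourceUsage when an attempt's first container is assigned, and the asynchronous APP_ATTEMPT_REMOVED race lets a second "first" container be charged while only one removal ever credits it back. This is not the real FSLeafQueue code; the class, the single-number memory model, and the fixed fair share below are hypothetical simplifications.

```java
// Minimal sketch of the amResourceUsage leak (NOT the actual FSLeafQueue
// implementation; resources are reduced to megabytes and event handling
// to plain method calls).
public class AmShareLeakSketch {
    static final double MAX_AM_SHARE = 0.5;   // analogous to maxAMShare
    static final long FAIR_SHARE_MB = 8192;   // analogous to getFairShare()

    long amResourceUsage = 0; // analogous to FSLeafQueue#amResourceUsage

    // Analogous to FSLeafQueue#canRunAppAM: a new AM may start only while
    // running AMs plus the new one fit within maxAMShare of the fair share.
    boolean canRunAppAM(long amRequestMb) {
        return amResourceUsage + amRequestMb <= FAIR_SHARE_MB * MAX_AM_SHARE;
    }

    // The first container assigned to an attempt is counted as its AM
    // container and charged against the queue's AM share.
    void onFirstContainerAssigned(long amMb) { amResourceUsage += amMb; }

    // APP_ATTEMPT_REMOVED handler: credits the AM resource back.
    void onAppAttemptRemoved(long amMb) { amResourceUsage -= amMb; }

    public static void main(String[] args) {
        AmShareLeakSketch queue = new AmShareLeakSketch();

        // Buggy race: the attempt has unregistered and liveContainers is
        // back to size 1, but APP_ATTEMPT_REMOVED has not been processed
        // yet, so the scheduler assigns another "first" container and
        // charges the AM share a second time.
        queue.onFirstContainerAssigned(2048);  // real AM container
        queue.onFirstContainerAssigned(2048);  // duplicate charge (race)
        queue.onAppAttemptRemoved(2048);       // only one removal fires

        // 2048 MB is now leaked although the queue is empty.
        System.out.println(queue.amResourceUsage);

        // After enough leaks, amResourceUsage alone exceeds
        // fairShare * maxAMShare and no AM can ever start again.
        queue.amResourceUsage = 4096;
        System.out.println(queue.canRunAppAM(1024));
    }
}
```

Each duplicate charge is permanent because the matching credit in onAppAttemptRemoved fires at most once per attempt, which is why the pending state persists even after the queue drains.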
[jira] [Updated] (YARN-6710) There is a heavy bug in FSLeafQueue#amResourceUsage which will let the fair scheduler not assign container to the queue
[ https://issues.apache.org/jira/browse/YARN-6710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] daemon updated YARN-6710: - Description: (revised; full text is quoted in the first message above)
[jira] [Updated] (YARN-6710) There is a heavy bug in FSLeafQueue#amResourceUsage which will let the fair scheduler not assign container to the queue
[ https://issues.apache.org/jira/browse/YARN-6710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] daemon updated YARN-6710: - Attachment: screenshot-5.png
[jira] [Updated] (YARN-6710) There is a heavy bug in FSLeafQueue#amResourceUsage which will let the fair scheduler not assign container to the queue
[ https://issues.apache.org/jira/browse/YARN-6710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] daemon updated YARN-6710: - Attachment: screenshot-4.png
[jira] [Updated] (YARN-6710) There is a heavy bug in FSLeafQueue#amResourceUsage which will let the fair scheduler not assign container to the queue
[ https://issues.apache.org/jira/browse/YARN-6710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] daemon updated YARN-6710: - Description: (revised; full text is quoted in the first message above)
[jira] [Updated] (YARN-6710) There is a heavy bug in FSLeafQueue#amResourceUsage which will let the fair scheduler not assign container to the queue
[ https://issues.apache.org/jira/browse/YARN-6710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] daemon updated YARN-6710: - Description: (revised; full text is quoted in the first message above)
[jira] [Updated] (YARN-6710) There is a heavy bug in FSLeafQueue#amResourceUsage which will let the fair scheduler not assign container to the queue
[ https://issues.apache.org/jira/browse/YARN-6710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] daemon updated YARN-6710: - Attachment: screenshot-3.png
[jira] [Updated] (YARN-6710) There is a heavy bug in FSLeafQueue#amResourceUsage which will let the fair scheduler not assign container to the queue
[ https://issues.apache.org/jira/browse/YARN-6710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] daemon updated YARN-6710: - Attachment: screenshot-2.png
[jira] [Updated] (YARN-6710) There is a heavy bug in FSLeafQueue#amResourceUsage which will let the fair scheduler not assign container to the queue
[ https://issues.apache.org/jira/browse/YARN-6710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] daemon updated YARN-6710: - Description: (revised; full text is quoted in the first message above)
[jira] [Updated] (YARN-6710) There is a heavy bug in FSLeafQueue#amResourceUsage which will let the fair scheduler not assign container to the queue
[ https://issues.apache.org/jira/browse/YARN-6710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] daemon updated YARN-6710: - Description: (initial draft; superseded by the revisions quoted in the first message above)
[jira] [Updated] (YARN-6710) There is a heavy bug in FSLeafQueue#amResourceUsage which will let the fair scheduler not assign container to the queue
[ https://issues.apache.org/jira/browse/YARN-6710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] daemon updated YARN-6710: - Attachment: screenshot-1.png