[jira] [Created] (YARN-8306) Support fine-grained configuration scheduler maximum allocation to sub-queue level.
JackZhou created YARN-8306:
-------------------------------

             Summary: Support fine-grained configuration scheduler maximum allocation to sub-queue level.
                 Key: YARN-8306
                 URL: https://issues.apache.org/jira/browse/YARN-8306
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: scheduler
            Reporter: JackZhou
[jira] [Updated] (YARN-8306) Support fine-grained configuration scheduler maximum allocation to sub-queue level.
     [ https://issues.apache.org/jira/browse/YARN-8306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

JackZhou updated YARN-8306:
---------------------------
    Description: 
We need to support controlling queues' *yarn.scheduler.maximum-allocation-mb* / *yarn.scheduler.maximum-allocation-vcores* configuration at a fine-grained, per-queue level. This will enable us to support large-CPU or large-memory containers based on the characteristics of the label machines bound to a queue. In this way, users can use resources wisely.


> Support fine-grained configuration scheduler maximum allocation to sub-queue
> level.
> ----------------------------------------------------------------------------
>
>                 Key: YARN-8306
>                 URL: https://issues.apache.org/jira/browse/YARN-8306
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: scheduler
>            Reporter: JackZhou
>            Priority: Major
>
> We need to support controlling queues' *yarn.scheduler.maximum-allocation-mb* / *yarn.scheduler.maximum-allocation-vcores* configuration at a fine-grained, per-queue level. This will enable us to support large-CPU or large-memory containers based on the characteristics of the label machines bound to a queue. In this way, users can use resources wisely.
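For illustration, a per-queue override would resolve a queue's maximum allocation by falling back to the cluster-wide setting. A minimal sketch, assuming a hypothetical per-queue property name ("yarn.scheduler.<queue>.maximum-allocation-mb"); no such key exists today, which is exactly what this issue proposes to add.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Sketch of a per-queue maximum-allocation lookup with a cluster-wide
// fallback. The per-queue key is a hypothetical name for this proposal,
// not an existing YARN property.
public class QueueMaxAllocation {
  public static Resource forQueue(Configuration conf, String queuePath) {
    long clusterMaxMb = conf.getLong(
        YarnConfiguration.RM_SCHEDULER_MAXIMUM_ALLOCATION_MB,
        YarnConfiguration.DEFAULT_RM_SCHEDULER_MAXIMUM_ALLOCATION_MB);
    int clusterMaxVcores = conf.getInt(
        YarnConfiguration.RM_SCHEDULER_MAXIMUM_ALLOCATION_VCORES,
        YarnConfiguration.DEFAULT_RM_SCHEDULER_MAXIMUM_ALLOCATION_VCORES);
    // The per-queue override falls back to the cluster-wide value, so queues
    // bound to big-memory/big-CPU label machines can raise their own limit.
    long mb = conf.getLong(
        "yarn.scheduler." + queuePath + ".maximum-allocation-mb", clusterMaxMb);
    int vcores = conf.getInt(
        "yarn.scheduler." + queuePath + ".maximum-allocation-vcores",
        clusterMaxVcores);
    return Resource.newInstance(mb, vcores);
  }
}
{code}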
[jira] [Created] (YARN-8475) Should check the resource of assignment is greater than Resources.none() before submitResourceCommitRequest
JackZhou created YARN-8475:
-------------------------------

             Summary: Should check the resource of assignment is greater than Resources.none() before submitResourceCommitRequest
                 Key: YARN-8475
                 URL: https://issues.apache.org/jira/browse/YARN-8475
             Project: Hadoop YARN
          Issue Type: Bug
          Components: scheduler
            Reporter: JackZhou
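What the proposed check could look like in the CapacityScheduler allocation path. A minimal sketch under assumed names: Resources.greaterThan and CSAssignment are real, but the wrapper method and how it is wired in are illustrative, not the actual patch.

{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.CSAssignment;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;
import org.apache.hadoop.yarn.util.resource.Resources;

// Sketch (inside CapacityScheduler): only submit a resource-commit request
// when the assignment actually carries resources. An empty
// (Resources.none()) assignment is dropped here instead of adding useless
// work for the scheduler's commit thread.
private void maybeSubmit(Resource clusterResource, ResourceCalculator rc,
    CSAssignment assignment) {
  if (Resources.greaterThan(rc, clusterResource,
      assignment.getResource(), Resources.none())) {
    submitResourceCommitRequest(clusterResource, assignment);
  }
}
{code}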
[jira] [Commented] (YARN-3611) Support Docker Containers In LinuxContainerExecutor
    [ https://issues.apache.org/jira/browse/YARN-3611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16583470#comment-16583470 ]

JackZhou commented on YARN-3611:
--------------------------------

Hi [~shaneku...@gmail.com], I would like to use this feature in our production environment; it would be very useful to us. Any suggestions?


> Support Docker Containers In LinuxContainerExecutor
> ---------------------------------------------------
>
>                 Key: YARN-3611
>                 URL: https://issues.apache.org/jira/browse/YARN-3611
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>            Reporter: Sidharta Seethana
>            Assignee: Sidharta Seethana
>            Priority: Major
>              Labels: Docker
>
> Support Docker Containers In LinuxContainerExecutor
> LinuxContainerExecutor provides useful functionality today with respect to
> localization, cgroups based resource management and isolation for CPU,
> network, disk etc. as well as security with a well-defined mechanism to
> execute privileged operations using the container-executor utility. Bringing
> docker support to LinuxContainerExecutor lets us use all of this
> functionality when running docker containers under YARN, while not requiring
> users and admins to configure and use a different ContainerExecutor.
> There are several aspects here that need to be worked through:
> * Mechanism(s) to let clients request docker-specific functionality - we
> could initially implement this via environment variables without impacting
> the client API.
> * Security - both docker daemon as well as application
> * Docker image localization
> * Running a docker container via container-executor as a specified user
> * “Isolate” the docker container in terms of CPU/network/disk/etc
> * Communicating with and/or signaling the running container (ensure correct
> pid handling)
> * Figure out workarounds for certain performance-sensitive scenarios like
> HDFS short-circuit reads
> * All of these need to be achieved without changing the current behavior of
> LinuxContainerExecutor
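As a concrete illustration of the first bullet in the quoted description (requesting docker-specific functionality via environment variables), here is a client-side sketch. It assumes the YARN_CONTAINER_RUNTIME_TYPE and YARN_CONTAINER_RUNTIME_DOCKER_IMAGE variables documented for the feature as it eventually shipped; the image name is just an example.

{code:java}
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.util.Records;

// Sketch: request the Docker runtime purely through container environment
// variables, so the client API itself is unchanged.
public class DockerLaunchContext {
  public static ContainerLaunchContext create() {
    Map<String, String> env = new HashMap<>();
    env.put("YARN_CONTAINER_RUNTIME_TYPE", "docker");
    env.put("YARN_CONTAINER_RUNTIME_DOCKER_IMAGE", "library/centos:7"); // example image
    ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
    ctx.setEnvironment(env);
    // Commands, local resources and tokens are set as for any other container.
    return ctx;
  }
}
{code}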
[jira] [Commented] (YARN-6516) FairScheduler:the algorithm of assignContainer is so slow for it only can assign a thousand containers per second
    [ https://issues.apache.org/jira/browse/YARN-6516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15982319#comment-15982319 ]

JackZhou commented on YARN-6516:
--------------------------------

[~yufeigu] I am testing on a real cluster with about 2500 nodes. I have continuous scheduling enabled, with yarn.scheduler.fair.continuous-scheduling-sleep-ms set to 500, so it runs every 500 ms. There are about 80 parent queues in my scheduler and about 200 queues in total.

I think a thousand containers per second is already a fairly ideal rate, because even when a queue is nearly empty the scheduler takes about 1 ms to assign a container. In my test I have two queues, with the following queue information:

Queue 1:
    Used Resources:
    Num Active Applications: 19
    Num Pending Applications: 1057
    Min Resources:
    Max Resources:
    Max Running Applications: 4000
    Steady Fair Share:
    Instantaneous Fair Share:

Queue 2:
    Used Resources:
    Num Active Applications: 20
    Num Pending Applications: 781
    Min Resources:
    Max Resources:
    Max Running Applications: 4000
    Steady Fair Share:
    Instantaneous Fair Share:

The cost to assign a container goes up to about 3 ms, and the scheduler has only managed to get about 40 containers running. It is so slow!

Cluster metrics:
    Apps Submitted: 10268    Apps Pending: 1838     Apps Running: 39    Apps Completed: 8391
    Containers Running: 39
    Memory Used: 39 GB       Memory Total: 95 TB    Memory Reserved: 0 B
    VCores Used: 39          VCores Total: 97280    VCores Reserved: 0
    Active Nodes: 2432       Decommissioned Nodes: 2    Lost Nodes: 64    Unhealthy Nodes: 0    Rebooted Nodes: 0


> FairScheduler:the algorithm of assignContainer is so slow for it only can
> assign a thousand containers per second
> --------------------------------------------------------------------------
>
>                 Key: YARN-6516
>                 URL: https://issues.apache.org/jira/browse/YARN-6516
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: fairscheduler
>    Affects Versions: 2.7.2
>            Reporter: JackZhou
>
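A back-of-the-envelope check of what 3 ms per assignment implies, using the numbers reported above (the serial-assignment bound and the 1-vcore-per-container assumption are illustrative, not from the comment):

{code:java}
// Upper bound on FairScheduler throughput when container assignments are
// serialized, using the figures from the comment above.
public class SchedulerThroughputBound {
  public static void main(String[] args) {
    double msPerAssignment = 3.0;              // observed cost per assignment
    double maxPerSecond = 1000.0 / msPerAssignment;
    System.out.printf("upper bound: ~%.0f containers/sec%n", maxPerSecond); // ~333/sec

    long idleVcores = 97280 - 39;              // cluster metrics above
    double secondsToFill = idleVcores / maxPerSecond;
    // Roughly 5 minutes even in the best case, ignoring heartbeat limits.
    System.out.printf("~%.0f s to fill the idle cluster with 1-vcore containers%n",
        secondsToFill);
  }
}
{code}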
[jira] [Commented] (YARN-6516) FairScheduler:the algorithm of assignContainer is so slow for it only can assign a thousand containers per second
    [ https://issues.apache.org/jira/browse/YARN-6516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15982543#comment-15982543 ]

JackZhou commented on YARN-6516:
--------------------------------

[~piaoyu zhang] How much of a performance improvement can the two patches bring?


> FairScheduler:the algorithm of assignContainer is so slow for it only can
> assign a thousand containers per second
> --------------------------------------------------------------------------
>
>                 Key: YARN-6516
>                 URL: https://issues.apache.org/jira/browse/YARN-6516
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: fairscheduler
>    Affects Versions: 2.7.2
>            Reporter: JackZhou
>
[jira] [Commented] (YARN-6516) FairScheduler:the algorithm of assignContainer is so slow for it only can assign a thousand containers per second
    [ https://issues.apache.org/jira/browse/YARN-6516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984016#comment-15984016 ]

JackZhou commented on YARN-6516:
--------------------------------

[~yufeigu] Thank you for your answers!

1. I saw "assigning a container took 3ms" in Ganglia, and I think the metric is real. We have about 80 queues, and the two active queues hold about 1000 applications, so walking that hierarchy to assign a container can certainly take 3 ms.

2. You are right that the 39 containers are for AMs; I am stress-testing the RM FairScheduler to improve scheduling performance. But whether a container is an AM container or a normal one is not the point. The point is that the cluster is almost idle, yet the scheduler schedules so slowly.

3. I have seen many JIRAs that discuss removing continuous scheduling, but I have not seen any report of a performance decrease when using it.

In a word, I think there is a lot of room for optimization, especially in the scheduler.


> FairScheduler:the algorithm of assignContainer is so slow for it only can
> assign a thousand containers per second
> --------------------------------------------------------------------------
>
>                 Key: YARN-6516
>                 URL: https://issues.apache.org/jira/browse/YARN-6516
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: fairscheduler
>    Affects Versions: 2.7.2
>            Reporter: JackZhou
>
[jira] [Created] (YARN-6644) The demand of FSAppAttempt may be negative
JackZhou created YARN-6644:
-------------------------------

             Summary: The demand of FSAppAttempt may be negative
                 Key: YARN-6644
                 URL: https://issues.apache.org/jira/browse/YARN-6644
             Project: Hadoop YARN
          Issue Type: Bug
          Components: fairscheduler
    Affects Versions: 2.7.2
         Environment: CentOS release 6.7 (Final)
            Reporter: JackZhou
             Fix For: 2.9.0
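For context, FSAppAttempt#updateDemand recomputes an app's demand as its current consumption plus its pending asks; a negative result means that bookkeeping underflowed. Below is a defensive sketch of that recomputation with a non-negative clamp. It is illustrative only, not the actual fix, and the parameters stand in for FSAppAttempt's internal state.

{code:java}
import java.util.List;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;
import org.apache.hadoop.yarn.util.resource.Resources;

// Sketch of FSAppAttempt-style demand recomputation with a clamp on
// container counts, so a racy decrement cannot drive the aggregate
// demand below zero.
public class DemandSketch {
  static Resource computeDemand(Resource currentConsumption,
      List<ResourceRequest> pendingAsks) {
    Resource demand = Resources.createResource(0, 0);
    Resources.addTo(demand, currentConsumption);
    for (ResourceRequest ask : pendingAsks) {
      int n = Math.max(0, ask.getNumContainers()); // clamp negative counts
      Resources.addTo(demand, Resources.multiply(ask.getCapability(), n));
    }
    return demand;
  }
}
{code}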
[jira] [Commented] (YARN-6644) The demand of FSAppAttempt may be negative
    [ https://issues.apache.org/jira/browse/YARN-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024581#comment-16024581 ]

JackZhou commented on YARN-6644:
--------------------------------

Thank you, yufei. I found that my problem is the same as YARN-6020, so my problem is solved. Thanks a lot.


> The demand of FSAppAttempt may be negative
> -------------------------------------------
>
>                 Key: YARN-6644
>                 URL: https://issues.apache.org/jira/browse/YARN-6644
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.7.2
>         Environment: CentOS release 6.7 (Final)
>            Reporter: JackZhou
>             Fix For: 2.9.0
>
[jira] [Commented] (YARN-6644) The demand of FSAppAttempt may be negative
    [ https://issues.apache.org/jira/browse/YARN-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024582#comment-16024582 ]

JackZhou commented on YARN-6644:
--------------------------------

[~Feng Yuan] Thanks a lot.


> The demand of FSAppAttempt may be negative
> -------------------------------------------
>
>                 Key: YARN-6644
>                 URL: https://issues.apache.org/jira/browse/YARN-6644
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.7.2
>         Environment: CentOS release 6.7 (Final)
>            Reporter: JackZhou
>             Fix For: 2.9.0
>
[jira] [Created] (YARN-6661) Too much CLEANUP event hang ApplicationMasterLauncher thread pool
JackZhou created YARN-6661:
-------------------------------

             Summary: Too much CLEANUP event hang ApplicationMasterLauncher thread pool
                 Key: YARN-6661
                 URL: https://issues.apache.org/jira/browse/YARN-6661
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: fairscheduler
    Affects Versions: 2.7.2
         Environment: hadoop 2.7.2
            Reporter: JackZhou
             Fix For: 2.9.0
[jira] [Commented] (YARN-6661) Too much CLEANUP event hang ApplicationMasterLauncher thread pool
    [ https://issues.apache.org/jira/browse/YARN-6661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16027391#comment-16027391 ]

JackZhou commented on YARN-6661:
--------------------------------

Someone else has already reported a similar problem and fixed it; see YARN-3809 (https://issues.apache.org/jira/browse/YARN-3809) for details. But I think that fix does not solve the problem completely. Below is the problem I encountered:

There are about 1000 nodes in my Hadoop cluster, and I submitted about 1800 apps. I failed over the active RM, and the new RM recovered all 1800 apps. When an application is recovered, the RM waits for its AM container to register itself. But there is a bug in my AM (introduced intentionally), so it never registers. The RM therefore waits about 10 minutes for the AM to expire and then sends a CLEANUP event to the ApplicationMasterLauncher thread pool. Because there are about 1800 apps, these events tie up the ApplicationMasterLauncher thread pool for a long time.

I have already applied the YARN-3809 patch (https://issues.apache.org/jira/secure/attachment/12740804/YARN-3809.03.patch), so a CLEANUP event blocks a thread for 10 * 20 = 200s. But with 1800 apps spread across the 50 threads, each thread handles 1800 / 50 = 36 CLEANUP events, so it is blocked for 36 * 200s = 7200s = 2 hours.

Because the AM did not register itself within 10 minutes, the RM retries and creates a new application attempt. The new attempt is allocated a container from the RM, and a LAUNCH event is sent to the ApplicationMasterLauncher thread pool. But because the 1800 CLEANUP events keep the 50 threads busy for hours, the LAUNCH event is not processed and the AM container does not start within 10 minutes. So that attempt expires as well and sends its own CLEANUP event to the ApplicationMasterLauncher thread pool.

As you can see, none of my applications can actually run. Each of them ends up with 5 application attempts, as follows, and each keeps retrying:
appattempt_1495786030132_4000_05
appattempt_1495786030132_4000_04
appattempt_1495786030132_4000_03
appattempt_1495786030132_4000_02
appattempt_1495786030132_4000_01

So all of my apps hung for several hours, and none of them could actually run. I think this is a bug! We could treat CLEANUP and LAUNCH as different event types and handle LAUNCH events on a separate thread, or find some other way.

Sorry, my English is poor; I am not sure whether I have described it clearly.


> Too much CLEANUP event hang ApplicationMasterLauncher thread pool
> ------------------------------------------------------------------
>
>                 Key: YARN-6661
>                 URL: https://issues.apache.org/jira/browse/YARN-6661
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: fairscheduler
>    Affects Versions: 2.7.2
>         Environment: hadoop 2.7.2
>            Reporter: JackZhou
>             Fix For: 2.9.0
>
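A minimal sketch of the separation suggested above: two independent fixed-size pools, so a backlog of slow CLEANUP RPCs cannot starve LAUNCH events. The class, pool sizes, and String event types are illustrative; the real ApplicationMasterLauncher drains both event types through one shared ThreadPoolExecutor.

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: route LAUNCH and CLEANUP to independent pools so slow CLEANUP
// RPCs cannot block AM launches. Names and sizes are illustrative.
public class SplitAMLauncherPools {
  private final ExecutorService launchPool = Executors.newFixedThreadPool(50);
  private final ExecutorService cleanupPool = Executors.newFixedThreadPool(50);

  /** eventType mirrors AMLauncherEventType (LAUNCH or CLEANUP). */
  public void handle(String eventType, Runnable amLauncher) {
    if ("LAUNCH".equals(eventType)) {
      launchPool.execute(amLauncher);   // launches never queue behind cleanups
    } else {
      cleanupPool.execute(amLauncher);  // cleanups can back up without harm
    }
  }
}
{code}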
[jira] [Updated] (YARN-6661) Too much CLEANUP event hang ApplicationMasterLauncher thread pool
    [ https://issues.apache.org/jira/browse/YARN-6661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

JackZhou updated YARN-6661:
---------------------------
    Description: 
Someone else has already reported a similar problem and fixed it; see YARN-3809 (https://issues.apache.org/jira/browse/YARN-3809) for details. But I think that fix does not solve the problem completely. Below is the problem I encountered:

There are about 1000 nodes in my Hadoop cluster, and I submitted about 1800 apps. I failed over the active RM, and the new RM recovered all 1800 apps. When an application is recovered, the RM waits for its AM container to register itself. But there is a bug in my AM (introduced intentionally), so it never registers. The RM therefore waits about 10 minutes for the AM to expire and then sends a CLEANUP event to the ApplicationMasterLauncher thread pool. Because there are about 1800 apps, these events tie up the ApplicationMasterLauncher thread pool for a long time.

I have already applied the YARN-3809 patch (https://issues.apache.org/jira/secure/attachment/12740804/YARN-3809.03.patch), so a CLEANUP event blocks a thread for 10 * 20 = 200s. But with 1800 apps spread across the 50 threads, each thread handles 1800 / 50 = 36 CLEANUP events, so it is blocked for 36 * 200s = 7200s = 2 hours.

Because the AM did not register itself within 10 minutes, the RM retries and creates a new application attempt. The new attempt is allocated a container from the RM, and a LAUNCH event is sent to the ApplicationMasterLauncher thread pool. But because the 1800 CLEANUP events keep the 50 threads busy for hours, the LAUNCH event is not processed and the AM container does not start within 10 minutes. So that attempt expires as well and sends its own CLEANUP event to the ApplicationMasterLauncher thread pool.

As you can see, none of my applications can actually run. Each of them ends up with 5 application attempts, as follows, and each keeps retrying:
appattempt_1495786030132_4000_05
appattempt_1495786030132_4000_04
appattempt_1495786030132_4000_03
appattempt_1495786030132_4000_02
appattempt_1495786030132_4000_01

So all of my apps hung for several hours, and none of them could actually run. I think this is a bug! We could treat CLEANUP and LAUNCH as different event types and handle LAUNCH events on a separate thread, or find some other way.

Sorry, my English is poor; I am not sure whether I have described it clearly.


> Too much CLEANUP event hang ApplicationMasterLauncher thread pool
> ------------------------------------------------------------------
>
>                 Key: YARN-6661
>                 URL: https://issues.apache.org/jira/browse/YARN-6661
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: fairscheduler
>    Affects Versions: 2.7.2
>         Environment: hadoop 2.7.2
>            Reporter: JackZhou
>             Fix For: 2.9.0
>
> Someone else has already reported a similar problem and fixed it; see
> YARN-3809 (https://issues.apache.org/jira/browse/YARN-3809) for details.
> But I think that fix does not solve the problem completely. Below is the
> problem I encountered:
> There are about 1000 nodes in my Hadoop cluster, and I submitted about 1800
> apps. I failed over the active RM, and the new RM recovered all 1800 apps.
> When an application is recovered, the RM waits for its AM container to
> register itself. But there is a bug in my AM (introduced intentionally), so
> it never registers. The RM therefore waits about 10 minutes for the AM to
> expire and then sends a CLEANUP event to the ApplicationMasterLauncher
> thread pool. Because there are about 1800 apps, these events tie up the
> ApplicationMasterLauncher thread pool for a long time.
> I have already applied the YARN-3809 patch
> (https://issues.apache.org/jira/secure/attachment/12740804/YARN-3809.03.patch),
> so a CLEANUP event blocks a thread for 10 * 20 = 200s. But with 1800 apps
> spread across the 50 threads, each thread handles 1800 / 50 = 36 CLEANUP
> events, so it is blocked for 36 * 200s = 7200s = 2 hours.
> Because the AM did not register itself within 10 minutes, the RM retries and
> creates a new application attempt. The new attempt is allocated a container
> from the RM, and a LAUNCH event is sent to the ApplicationMasterLauncher
> thread pool. But because the 1800 CLEANUP events keep the 50 threads busy
> for hours, the LAUNCH event is not processed and the AM container does not
> start within 10 minutes. So that attempt expires as well and sends its own
> CLEANUP event to the ApplicationMasterLauncher thread pool.
> As you can see, none of my applications can actually run. Each of them ends
> up with 5 application attempts, as follows, and each keeps retrying:
> appattempt_1495786030132_4000_05
> appattempt_1495786030132_4000_04
> appattempt_1495786030132_4000_03
> appattempt_1495786030132_4000_02
> appattempt_1495786030132_4000_01
> So all of my apps hung for several hours, and none of them could actually
> run. I think this is a bug! We could treat CLEANUP and LAUNCH as different
> event types and handle LAUNCH events on a separate thread, or find some
> other way.
> Sorry, my English is poor; I am not sure whether I have described it clearly.
[jira] [Updated] (YARN-6661) Too much CLEANUP event hang ApplicationMasterLauncher thread pool
    [ https://issues.apache.org/jira/browse/YARN-6661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

JackZhou updated YARN-6661:
---------------------------
    Issue Type: Bug  (was: Improvement)

> Too much CLEANUP event hang ApplicationMasterLauncher thread pool
> ------------------------------------------------------------------
>
>                 Key: YARN-6661
>                 URL: https://issues.apache.org/jira/browse/YARN-6661
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.7.2
>         Environment: hadoop 2.7.2
>            Reporter: JackZhou
>             Fix For: 2.9.0
>
> Someone else has already reported a similar problem and fixed it; see
> YARN-3809 (https://issues.apache.org/jira/browse/YARN-3809) for details.
> But I think that fix does not solve the problem completely. Below is the
> problem I encountered:
> There are about 1000 nodes in my Hadoop cluster, and I submitted about 1800
> apps. I failed over the active RM, and the new RM recovered all 1800 apps.
> When an application is recovered, the RM waits for its AM container to
> register itself. But there is a bug in my AM (introduced intentionally), so
> it never registers. The RM therefore waits about 10 minutes for the AM to
> expire and then sends a CLEANUP event to the ApplicationMasterLauncher
> thread pool. Because there are about 1800 apps, these events tie up the
> ApplicationMasterLauncher thread pool for a long time.
> I have already applied the YARN-3809 patch
> (https://issues.apache.org/jira/secure/attachment/12740804/YARN-3809.03.patch),
> so a CLEANUP event blocks a thread for 10 * 20 = 200s. But with 1800 apps
> spread across the 50 threads, each thread handles 1800 / 50 = 36 CLEANUP
> events, so it is blocked for 36 * 200s = 7200s = 2 hours.
> Because the AM did not register itself within 10 minutes, the RM retries and
> creates a new application attempt. The new attempt is allocated a container
> from the RM, and a LAUNCH event is sent to the ApplicationMasterLauncher
> thread pool. But because the 1800 CLEANUP events keep the 50 threads busy
> for hours, the LAUNCH event is not processed and the AM container does not
> start within 10 minutes. So that attempt expires as well and sends its own
> CLEANUP event to the ApplicationMasterLauncher thread pool.
> As you can see, none of my applications can actually run. Each of them ends
> up with 5 application attempts, as follows, and each keeps retrying:
> appattempt_1495786030132_4000_05
> appattempt_1495786030132_4000_04
> appattempt_1495786030132_4000_03
> appattempt_1495786030132_4000_02
> appattempt_1495786030132_4000_01
> So all of my apps hung for several hours, and none of them could actually
> run. I think this is a bug! We could treat CLEANUP and LAUNCH as different
> event types and handle LAUNCH events on a separate thread, or find some
> other way.
> Sorry, my English is poor; I am not sure whether I have described it clearly.