[jira] [Created] (YARN-8306) Support fine-grained configuration scheduler maximum allocation to sub-queue level.
JackZhou created YARN-8306:
-------------------------------

             Summary: Support fine-grained configuration scheduler maximum allocation to sub-queue level.
                 Key: YARN-8306
                 URL: https://issues.apache.org/jira/browse/YARN-8306
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: scheduler
            Reporter: JackZhou
[jira] [Updated] (YARN-8306) Support fine-grained configuration scheduler maximum allocation to sub-queue level.
     [ https://issues.apache.org/jira/browse/YARN-8306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

JackZhou updated YARN-8306:
---------------------------
    Description: 
We need to support controlling queues' *yarn.scheduler.maximum-allocation-mb* / *yarn.scheduler.maximum-allocation-vcores* configuration at a fine-grained, per-queue level. This will enable us to support large-CPU or large-memory containers based on the characteristics of the label machines bound to a queue. In this way, users can use resources wisely.


> Support fine-grained configuration scheduler maximum allocation to sub-queue
> level.
> ----------------------------------------------------------------------------
>
>                 Key: YARN-8306
>                 URL: https://issues.apache.org/jira/browse/YARN-8306
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: scheduler
>            Reporter: JackZhou
>            Priority: Major
>
> We need to support controlling queues' *yarn.scheduler.maximum-allocation-mb* / *yarn.scheduler.maximum-allocation-vcores* configuration at a fine-grained, per-queue level. This will enable us to support large-CPU or large-memory containers based on the characteristics of the label machines bound to a queue. In this way, users can use resources wisely.
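For illustration, a per-queue override would resolve a queue's maximum allocation by falling back to the cluster-wide setting. A minimal sketch, assuming a hypothetical per-queue property name ("yarn.scheduler.<queue>.maximum-allocation-mb"); no such key exists today, which is exactly what this issue proposes to add.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Sketch of a per-queue maximum-allocation lookup with a cluster-wide
// fallback. The per-queue key is a hypothetical name for this proposal,
// not an existing YARN property.
public class QueueMaxAllocation {
  public static Resource forQueue(Configuration conf, String queuePath) {
    long clusterMaxMb = conf.getLong(
        YarnConfiguration.RM_SCHEDULER_MAXIMUM_ALLOCATION_MB,
        YarnConfiguration.DEFAULT_RM_SCHEDULER_MAXIMUM_ALLOCATION_MB);
    int clusterMaxVcores = conf.getInt(
        YarnConfiguration.RM_SCHEDULER_MAXIMUM_ALLOCATION_VCORES,
        YarnConfiguration.DEFAULT_RM_SCHEDULER_MAXIMUM_ALLOCATION_VCORES);
    // The per-queue override falls back to the cluster-wide value, so queues
    // bound to big-memory/big-CPU label machines can raise their own limit.
    long mb = conf.getLong(
        "yarn.scheduler." + queuePath + ".maximum-allocation-mb", clusterMaxMb);
    int vcores = conf.getInt(
        "yarn.scheduler." + queuePath + ".maximum-allocation-vcores",
        clusterMaxVcores);
    return Resource.newInstance(mb, vcores);
  }
}
{code}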
[jira] [Created] (YARN-8475) Should check the resource of assignment is greater than Resources.none() before submitResourceCommitRequest
JackZhou created YARN-8475:
-------------------------------

             Summary: Should check the resource of assignment is greater than Resources.none() before submitResourceCommitRequest
                 Key: YARN-8475
                 URL: https://issues.apache.org/jira/browse/YARN-8475
             Project: Hadoop YARN
          Issue Type: Bug
          Components: scheduler
            Reporter: JackZhou
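What the proposed check could look like in the CapacityScheduler allocation path. A minimal sketch under assumed names: Resources.greaterThan and CSAssignment are real, but the wrapper method and how it is wired in are illustrative, not the actual patch.

{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.CSAssignment;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;
import org.apache.hadoop.yarn.util.resource.Resources;

// Sketch (inside CapacityScheduler): only submit a resource-commit request
// when the assignment actually carries resources. An empty
// (Resources.none()) assignment is dropped here instead of adding useless
// work for the scheduler's commit thread.
private void maybeSubmit(Resource clusterResource, ResourceCalculator rc,
    CSAssignment assignment) {
  if (Resources.greaterThan(rc, clusterResource,
      assignment.getResource(), Resources.none())) {
    submitResourceCommitRequest(clusterResource, assignment);
  }
}
{code}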
[jira] [Commented] (YARN-3611) Support Docker Containers In LinuxContainerExecutor
    [ https://issues.apache.org/jira/browse/YARN-3611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16583470#comment-16583470 ]

JackZhou commented on YARN-3611:
--------------------------------

Hi [~shaneku...@gmail.com], I would like to use this feature in our production environment; it would be very useful to us. Any suggestions?


> Support Docker Containers In LinuxContainerExecutor
> ---------------------------------------------------
>
>                 Key: YARN-3611
>                 URL: https://issues.apache.org/jira/browse/YARN-3611
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>            Reporter: Sidharta Seethana
>            Assignee: Sidharta Seethana
>            Priority: Major
>              Labels: Docker
>
> Support Docker Containers In LinuxContainerExecutor
> LinuxContainerExecutor provides useful functionality today with respect to
> localization, cgroups based resource management and isolation for CPU,
> network, disk etc. as well as security with a well-defined mechanism to
> execute privileged operations using the container-executor utility. Bringing
> docker support to LinuxContainerExecutor lets us use all of this
> functionality when running docker containers under YARN, while not requiring
> users and admins to configure and use a different ContainerExecutor.
> There are several aspects here that need to be worked through:
> * Mechanism(s) to let clients request docker-specific functionality - we
> could initially implement this via environment variables without impacting
> the client API.
> * Security - both docker daemon as well as application
> * Docker image localization
> * Running a docker container via container-executor as a specified user
> * “Isolate” the docker container in terms of CPU/network/disk/etc
> * Communicating with and/or signaling the running container (ensure correct
> pid handling)
> * Figure out workarounds for certain performance-sensitive scenarios like
> HDFS short-circuit reads
> * All of these need to be achieved without changing the current behavior of
> LinuxContainerExecutor
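As a concrete illustration of the first bullet in the quoted description (requesting docker-specific functionality via environment variables), here is a client-side sketch. It assumes the YARN_CONTAINER_RUNTIME_TYPE and YARN_CONTAINER_RUNTIME_DOCKER_IMAGE variables documented for the feature as it eventually shipped; the image name is just an example.

{code:java}
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.util.Records;

// Sketch: request the Docker runtime purely through container environment
// variables, so the client API itself is unchanged.
public class DockerLaunchContext {
  public static ContainerLaunchContext create() {
    Map<String, String> env = new HashMap<>();
    env.put("YARN_CONTAINER_RUNTIME_TYPE", "docker");
    env.put("YARN_CONTAINER_RUNTIME_DOCKER_IMAGE", "library/centos:7"); // example image
    ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
    ctx.setEnvironment(env);
    // Commands, local resources and tokens are set as for any other container.
    return ctx;
  }
}
{code}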
[jira] [Commented] (YARN-6516) FairScheduler:the algorithm of assignContainer is so slow for it only can assign a thousand containers per second
    [ https://issues.apache.org/jira/browse/YARN-6516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15982319#comment-15982319 ]

JackZhou commented on YARN-6516:
--------------------------------

[~yufeigu] I am testing on a real cluster with about 2500 nodes. I have continuous scheduling enabled, with yarn.scheduler.fair.continuous-scheduling-sleep-ms set to 500, so it runs every 500 ms. There are about 80 parent queues in my scheduler and about 200 queues in total.

I think a thousand containers per second is already a fairly ideal rate, because even when a queue is nearly empty the scheduler takes about 1 ms to assign a container. In my test I have two queues, with the following queue information:

Queue 1:
    Used Resources:
    Num Active Applications: 19
    Num Pending Applications: 1057
    Min Resources:
    Max Resources:
    Max Running Applications: 4000
    Steady Fair Share:
    Instantaneous Fair Share:

Queue 2:
    Used Resources:
    Num Active Applications: 20
    Num Pending Applications: 781
    Min Resources:
    Max Resources:
    Max Running Applications: 4000
    Steady Fair Share:
    Instantaneous Fair Share:

The cost to assign a container goes up to about 3 ms, and the scheduler has only managed to get about 40 containers running. It is so slow!

Cluster metrics:
    Apps Submitted: 10268    Apps Pending: 1838     Apps Running: 39    Apps Completed: 8391
    Containers Running: 39
    Memory Used: 39 GB       Memory Total: 95 TB    Memory Reserved: 0 B
    VCores Used: 39          VCores Total: 97280    VCores Reserved: 0
    Active Nodes: 2432       Decommissioned Nodes: 2    Lost Nodes: 64    Unhealthy Nodes: 0    Rebooted Nodes: 0


> FairScheduler:the algorithm of assignContainer is so slow for it only can
> assign a thousand containers per second
> --------------------------------------------------------------------------
>
>                 Key: YARN-6516
>                 URL: https://issues.apache.org/jira/browse/YARN-6516
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: fairscheduler
>    Affects Versions: 2.7.2
>            Reporter: JackZhou
>
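A back-of-the-envelope check of what 3 ms per assignment implies, using the numbers reported above (the serial-assignment bound and the 1-vcore-per-container assumption are illustrative, not from the comment):

{code:java}
// Upper bound on FairScheduler throughput when container assignments are
// serialized, using the figures from the comment above.
public class SchedulerThroughputBound {
  public static void main(String[] args) {
    double msPerAssignment = 3.0;              // observed cost per assignment
    double maxPerSecond = 1000.0 / msPerAssignment;
    System.out.printf("upper bound: ~%.0f containers/sec%n", maxPerSecond); // ~333/sec

    long idleVcores = 97280 - 39;              // cluster metrics above
    double secondsToFill = idleVcores / maxPerSecond;
    // Roughly 5 minutes even in the best case, ignoring heartbeat limits.
    System.out.printf("~%.0f s to fill the idle cluster with 1-vcore containers%n",
        secondsToFill);
  }
}
{code}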
[jira] [Commented] (YARN-6516) FairScheduler:the algorithm of assignContainer is so slow for it only can assign a thousand containers per second
    [ https://issues.apache.org/jira/browse/YARN-6516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15982543#comment-15982543 ]

JackZhou commented on YARN-6516:
--------------------------------

[~piaoyu zhang] How much of a performance improvement can the two patches bring?


> FairScheduler:the algorithm of assignContainer is so slow for it only can
> assign a thousand containers per second
> --------------------------------------------------------------------------
>
>                 Key: YARN-6516
>                 URL: https://issues.apache.org/jira/browse/YARN-6516
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: fairscheduler
>    Affects Versions: 2.7.2
>            Reporter: JackZhou
>
[jira] [Commented] (YARN-6516) FairScheduler:the algorithm of assignContainer is so slow for it only can assign a thousand containers per second
    [ https://issues.apache.org/jira/browse/YARN-6516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984016#comment-15984016 ]

JackZhou commented on YARN-6516:
--------------------------------

[~yufeigu] Thank you for your answers!

1. I saw "assigning a container took 3ms" in Ganglia, and I think the metric is real. We have about 80 queues, and the two active queues hold about 1000 applications, so walking that hierarchy to assign a container can certainly take 3 ms.

2. You are right that the 39 containers are for AMs; I am stress-testing the RM FairScheduler to improve scheduling performance. But whether a container is an AM container or a normal one is not the point. The point is that the cluster is almost idle, yet the scheduler schedules so slowly.

3. I have seen many JIRAs that discuss removing continuous scheduling, but I have not seen any report of a performance decrease when using it.

In a word, I think there is a lot of room for optimization, especially in the scheduler.


> FairScheduler:the algorithm of assignContainer is so slow for it only can
> assign a thousand containers per second
> --------------------------------------------------------------------------
>
>                 Key: YARN-6516
>                 URL: https://issues.apache.org/jira/browse/YARN-6516
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: fairscheduler
>    Affects Versions: 2.7.2
>            Reporter: JackZhou
>
[jira] [Created] (YARN-6644) The demand of FSAppAttempt may be negative
JackZhou created YARN-6644:
-------------------------------

             Summary: The demand of FSAppAttempt may be negative
                 Key: YARN-6644
                 URL: https://issues.apache.org/jira/browse/YARN-6644
             Project: Hadoop YARN
          Issue Type: Bug
          Components: fairscheduler
    Affects Versions: 2.7.2
         Environment: CentOS release 6.7 (Final)
            Reporter: JackZhou
             Fix For: 2.9.0
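For context, FSAppAttempt#updateDemand recomputes an app's demand as its current consumption plus its pending asks; a negative result means that bookkeeping underflowed. Below is a defensive sketch of that recomputation with a non-negative clamp. It is illustrative only, not the actual fix, and the parameters stand in for FSAppAttempt's internal state.

{code:java}
import java.util.List;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;
import org.apache.hadoop.yarn.util.resource.Resources;

// Sketch of FSAppAttempt-style demand recomputation with a clamp on
// container counts, so a racy decrement cannot drive the aggregate
// demand below zero.
public class DemandSketch {
  static Resource computeDemand(Resource currentConsumption,
      List<ResourceRequest> pendingAsks) {
    Resource demand = Resources.createResource(0, 0);
    Resources.addTo(demand, currentConsumption);
    for (ResourceRequest ask : pendingAsks) {
      int n = Math.max(0, ask.getNumContainers()); // clamp negative counts
      Resources.addTo(demand, Resources.multiply(ask.getCapability(), n));
    }
    return demand;
  }
}
{code}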
[jira] [Commented] (YARN-6644) The demand of FSAppAttempt may be negative
    [ https://issues.apache.org/jira/browse/YARN-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024581#comment-16024581 ]

JackZhou commented on YARN-6644:
--------------------------------

Thank you, yufei. I found that my problem is the same as YARN-6020, so my problem is solved. Thanks a lot.


> The demand of FSAppAttempt may be negative
> -------------------------------------------
>
>                 Key: YARN-6644
>                 URL: https://issues.apache.org/jira/browse/YARN-6644
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.7.2
>         Environment: CentOS release 6.7 (Final)
>            Reporter: JackZhou
>             Fix For: 2.9.0
>
[jira] [Commented] (YARN-6644) The demand of FSAppAttempt may be negative
    [ https://issues.apache.org/jira/browse/YARN-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024582#comment-16024582 ]

JackZhou commented on YARN-6644:
--------------------------------

[~Feng Yuan] Thanks a lot.


> The demand of FSAppAttempt may be negative
> -------------------------------------------
>
>                 Key: YARN-6644
>                 URL: https://issues.apache.org/jira/browse/YARN-6644
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.7.2
>         Environment: CentOS release 6.7 (Final)
>            Reporter: JackZhou
>             Fix For: 2.9.0
>
[jira] [Created] (YARN-6661) Too much CLEANUP event hang ApplicationMasterLauncher thread pool
JackZhou created YARN-6661:
-------------------------------

             Summary: Too much CLEANUP event hang ApplicationMasterLauncher thread pool
                 Key: YARN-6661
                 URL: https://issues.apache.org/jira/browse/YARN-6661
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: fairscheduler
    Affects Versions: 2.7.2
         Environment: hadoop 2.7.2
            Reporter: JackZhou
             Fix For: 2.9.0
[jira] [Commented] (YARN-6661) Too much CLEANUP event hang ApplicationMasterLauncher thread pool
    [ https://issues.apache.org/jira/browse/YARN-6661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16027391#comment-16027391 ]

JackZhou commented on YARN-6661:
--------------------------------

Someone else has already reported a similar problem and fixed it; see YARN-3809 (https://issues.apache.org/jira/browse/YARN-3809) for details. But I think that fix does not solve the problem completely. Below is the problem I encountered:

There are about 1000 nodes in my Hadoop cluster, and I submitted about 1800 apps. I failed over the active RM, and the new RM recovered all 1800 apps. When an application is recovered, the RM waits for its AM container to register itself. But there is a bug in my AM (introduced intentionally), so it never registers. The RM therefore waits about 10 minutes for the AM to expire and then sends a CLEANUP event to the ApplicationMasterLauncher thread pool. Because there are about 1800 apps, these events tie up the ApplicationMasterLauncher thread pool for a long time.

I have already applied the YARN-3809 patch (https://issues.apache.org/jira/secure/attachment/12740804/YARN-3809.03.patch), so a CLEANUP event blocks a thread for 10 * 20 = 200s. But with 1800 apps spread across the 50 threads, each thread handles 1800 / 50 = 36 CLEANUP events, so it is blocked for 36 * 200s = 7200s = 2 hours.

Because the AM did not register itself within 10 minutes, the RM retries and creates a new application attempt. The new attempt is allocated a container from the RM, and a LAUNCH event is sent to the ApplicationMasterLauncher thread pool. But because the 1800 CLEANUP events keep the 50 threads busy for hours, the LAUNCH event is not processed and the AM container does not start within 10 minutes. So that attempt expires as well and sends its own CLEANUP event to the ApplicationMasterLauncher thread pool.

As you can see, none of my applications can actually run. Each of them ends up with 5 application attempts, as follows, and each keeps retrying:
appattempt_1495786030132_4000_05
appattempt_1495786030132_4000_04
appattempt_1495786030132_4000_03
appattempt_1495786030132_4000_02
appattempt_1495786030132_4000_01

So all of my apps hung for several hours, and none of them could actually run. I think this is a bug! We could treat CLEANUP and LAUNCH as different event types and handle LAUNCH events on a separate thread, or find some other way.

Sorry, my English is poor; I am not sure whether I have described it clearly.


> Too much CLEANUP event hang ApplicationMasterLauncher thread pool
> ------------------------------------------------------------------
>
>                 Key: YARN-6661
>                 URL: https://issues.apache.org/jira/browse/YARN-6661
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: fairscheduler
>    Affects Versions: 2.7.2
>         Environment: hadoop 2.7.2
>            Reporter: JackZhou
>             Fix For: 2.9.0
>
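A minimal sketch of the separation suggested above: two independent fixed-size pools, so a backlog of slow CLEANUP RPCs cannot starve LAUNCH events. The class, pool sizes, and String event types are illustrative; the real ApplicationMasterLauncher drains both event types through one shared ThreadPoolExecutor.

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: route LAUNCH and CLEANUP to independent pools so slow CLEANUP
// RPCs cannot block AM launches. Names and sizes are illustrative.
public class SplitAMLauncherPools {
  private final ExecutorService launchPool = Executors.newFixedThreadPool(50);
  private final ExecutorService cleanupPool = Executors.newFixedThreadPool(50);

  /** eventType mirrors AMLauncherEventType (LAUNCH or CLEANUP). */
  public void handle(String eventType, Runnable amLauncher) {
    if ("LAUNCH".equals(eventType)) {
      launchPool.execute(amLauncher);   // launches never queue behind cleanups
    } else {
      cleanupPool.execute(amLauncher);  // cleanups can back up without harm
    }
  }
}
{code}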
[jira] [Updated] (YARN-6661) Too much CLEANUP event hang ApplicationMasterLauncher thread pool
    [ https://issues.apache.org/jira/browse/YARN-6661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

JackZhou updated YARN-6661:
---------------------------
    Description: 
Someone else has already reported a similar problem and fixed it; see YARN-3809 (https://issues.apache.org/jira/browse/YARN-3809) for details. But I think that fix does not solve the problem completely. Below is the problem I encountered:

There are about 1000 nodes in my Hadoop cluster, and I submitted about 1800 apps. I failed over the active RM, and the new RM recovered all 1800 apps. When an application is recovered, the RM waits for its AM container to register itself. But there is a bug in my AM (introduced intentionally), so it never registers. The RM therefore waits about 10 minutes for the AM to expire and then sends a CLEANUP event to the ApplicationMasterLauncher thread pool. Because there are about 1800 apps, these events tie up the ApplicationMasterLauncher thread pool for a long time.

I have already applied the YARN-3809 patch (https://issues.apache.org/jira/secure/attachment/12740804/YARN-3809.03.patch), so a CLEANUP event blocks a thread for 10 * 20 = 200s. But with 1800 apps spread across the 50 threads, each thread handles 1800 / 50 = 36 CLEANUP events, so it is blocked for 36 * 200s = 7200s = 2 hours.

Because the AM did not register itself within 10 minutes, the RM retries and creates a new application attempt. The new attempt is allocated a container from the RM, and a LAUNCH event is sent to the ApplicationMasterLauncher thread pool. But because the 1800 CLEANUP events keep the 50 threads busy for hours, the LAUNCH event is not processed and the AM container does not start within 10 minutes. So that attempt expires as well and sends its own CLEANUP event to the ApplicationMasterLauncher thread pool.

As you can see, none of my applications can actually run. Each of them ends up with 5 application attempts, as follows, and each keeps retrying:
appattempt_1495786030132_4000_05
appattempt_1495786030132_4000_04
appattempt_1495786030132_4000_03
appattempt_1495786030132_4000_02
appattempt_1495786030132_4000_01

So all of my apps hung for several hours, and none of them could actually run. I think this is a bug! We could treat CLEANUP and LAUNCH as different event types and handle LAUNCH events on a separate thread, or find some other way.

Sorry, my English is poor; I am not sure whether I have described it clearly.


> Too much CLEANUP event hang ApplicationMasterLauncher thread pool
> ------------------------------------------------------------------
>
>                 Key: YARN-6661
>                 URL: https://issues.apache.org/jira/browse/YARN-6661
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: fairscheduler
>    Affects Versions: 2.7.2
>         Environment: hadoop 2.7.2
>            Reporter: JackZhou
>             Fix For: 2.9.0
>
> Someone else has already reported a similar problem and fixed it; see
> YARN-3809 (https://issues.apache.org/jira/browse/YARN-3809) for details.
> But I think that fix does not solve the problem completely. Below is the
> problem I encountered:
> There are about 1000 nodes in my Hadoop cluster, and I submitted about 1800
> apps. I failed over the active RM, and the new RM recovered all 1800 apps.
> When an application is recovered, the RM waits for its AM container to
> register itself. But there is a bug in my AM (introduced intentionally), so
> it never registers. The RM therefore waits about 10 minutes for the AM to
> expire and then sends a CLEANUP event to the ApplicationMasterLauncher
> thread pool. Because there are about 1800 apps, these events tie up the
> ApplicationMasterLauncher thread pool for a long time.
> I have already applied the YARN-3809 patch
> (https://issues.apache.org/jira/secure/attachment/12740804/YARN-3809.03.patch),
> so a CLEANUP event blocks a thread for 10 * 20 = 200s. But with 1800 apps
> spread across the 50 threads, each thread handles 1800 / 50 = 36 CLEANUP
> events, so it is blocked for 36 * 200s = 7200s = 2 hours.
> Because the AM did not register itself within 10 minutes, the RM retries and
> creates a new application attempt. The new attempt is allocated a container
> from the RM, and a LAUNCH event is sent to the ApplicationMasterLauncher
> thread pool. But because the 1800 CLEANUP events keep the 50 threads busy
> for hours, the LAUNCH event is not processed and the AM container does not
> start within 10 minutes. So that attempt expires as well and sends its own
> CLEANUP event to the ApplicationMasterLauncher thread pool.
> As you can see, none of my applications can actually run. Each of them ends
> up with 5 application attempts, as follows, and each keeps retrying:
> appattempt_1495786030132_4000_05
> appattempt_1495786030132_4000_04
> appattempt_1495786030132_4000_03
> appattempt_1495786030132_4000_02
> appattempt_1495786030132_4000_01
> So all of my apps hung for several hours, and none of them could actually
> run. I think this is a bug! We could treat CLEANUP and LAUNCH as different
> event types and handle LAUNCH events on a separate thread, or find some
> other way.
> Sorry, my English is poor; I am not sure whether I have described it clearly.
[jira] [Updated] (YARN-6661) Too much CLEANUP event hang ApplicationMasterLauncher thread pool
    [ https://issues.apache.org/jira/browse/YARN-6661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

JackZhou updated YARN-6661:
---------------------------
    Issue Type: Bug  (was: Improvement)

> Too much CLEANUP event hang ApplicationMasterLauncher thread pool
> ------------------------------------------------------------------
>
>                 Key: YARN-6661
>                 URL: https://issues.apache.org/jira/browse/YARN-6661
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.7.2
>         Environment: hadoop 2.7.2
>            Reporter: JackZhou
>             Fix For: 2.9.0
>
> Someone else has already reported a similar problem and fixed it; see
> YARN-3809 (https://issues.apache.org/jira/browse/YARN-3809) for details.
> But I think that fix does not solve the problem completely. Below is the
> problem I encountered:
> There are about 1000 nodes in my Hadoop cluster, and I submitted about 1800
> apps. I failed over the active RM, and the new RM recovered all 1800 apps.
> When an application is recovered, the RM waits for its AM container to
> register itself. But there is a bug in my AM (introduced intentionally), so
> it never registers. The RM therefore waits about 10 minutes for the AM to
> expire and then sends a CLEANUP event to the ApplicationMasterLauncher
> thread pool. Because there are about 1800 apps, these events tie up the
> ApplicationMasterLauncher thread pool for a long time.
> I have already applied the YARN-3809 patch
> (https://issues.apache.org/jira/secure/attachment/12740804/YARN-3809.03.patch),
> so a CLEANUP event blocks a thread for 10 * 20 = 200s. But with 1800 apps
> spread across the 50 threads, each thread handles 1800 / 50 = 36 CLEANUP
> events, so it is blocked for 36 * 200s = 7200s = 2 hours.
> Because the AM did not register itself within 10 minutes, the RM retries and
> creates a new application attempt. The new attempt is allocated a container
> from the RM, and a LAUNCH event is sent to the ApplicationMasterLauncher
> thread pool. But because the 1800 CLEANUP events keep the 50 threads busy
> for hours, the LAUNCH event is not processed and the AM container does not
> start within 10 minutes. So that attempt expires as well and sends its own
> CLEANUP event to the ApplicationMasterLauncher thread pool.
> As you can see, none of my applications can actually run. Each of them ends
> up with 5 application attempts, as follows, and each keeps retrying:
> appattempt_1495786030132_4000_05
> appattempt_1495786030132_4000_04
> appattempt_1495786030132_4000_03
> appattempt_1495786030132_4000_02
> appattempt_1495786030132_4000_01
> So all of my apps hung for several hours, and none of them could actually
> run. I think this is a bug! We could treat CLEANUP and LAUNCH as different
> event types and handle LAUNCH events on a separate thread, or find some
> other way.
> Sorry, my English is poor; I am not sure whether I have described it clearly.