
[jira] [Updated] (YARN-10739) GenericEventHandler.printEventQueueDetails cause RM recovery cost too much time

2021-04-26 Thread Qi Zhu (Jira)



 [ 
https://issues.apache.org/jira/browse/YARN-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated YARN-10739:
--
Attachment: YARN-10739.006.patch

> GenericEventHandler.printEventQueueDetails cause RM recovery cost too much 
> time
> ---
>
> Key: YARN-10739
> URL: https://issues.apache.org/jira/browse/YARN-10739
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 3.4.0, 3.3.1, 3.2.3
>Reporter: Zhanqi Cai
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-10739-001.patch, YARN-10739-002.patch, 
> YARN-10739.003.patch, YARN-10739.003.patch, YARN-10739.004.patch, 
> YARN-10739.005.patch, YARN-10739.006.patch
>
>
> YARN-8995 and YARN-10642 added GenericEventHandler.printEventQueueDetails to 
> AsyncDispatcher. When the event queue is very large, 
> printEventQueueDetails takes too long and the RM takes a long time to 
> process events.
> For example:
>  Suppose the cluster has 4K nodes and 4K running apps. When we do an RM 
> switch, the node managers register with the RM, and the RM calls 
> NodesListManager to send RMAppNodeUpdateEvent, with code like the following:
> {code:java}
> // Dispatch one RMAppNodeUpdateEvent per app whose final state is not stored.
> for (RMApp app : rmContext.getRMApps().values()) {
>   if (!app.isAppFinalStateStored()) {
>     this.rmContext
>         .getDispatcher()
>         .getEventHandler()
>         .handle(
>             new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
>                 appNodeUpdateType));
>   }
> }
> {code}
> So the total number of events is 4K * 4K = 16 million. During this window, 
> GenericEventHandler.printEventQueueDetails is called frequently to print the 
> event queue details. Once the event queue size exceeds 1 million, iterating 
> over the queue in printEventQueueDetails becomes very slow, as shown here:
> {code:java}
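> private void printEventQueueDetails() {
>   // Walks the entire queue on every call, so the cost grows with queue size.
>   Iterator<Event> iterator = eventQueue.iterator();
>   Map<Enum, Long> counterMap = new HashMap<>();
>   while (iterator.hasNext()) {
>     Enum eventType = iterator.next().getType();
>     // ... (rest of the method omitted in the original report)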
> {code}
> As a result, RM recovery takes too long.
>  See our log:
> {code:java}
> 2021-04-14 20:35:34,432 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(306)) - Size of event-queue is 1200
> 2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: KILL, Event record counter: 310836
> 2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_UPDATE, Event record counter: 1103
> 2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_REMOVED, Event record counter: 1
> 2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: APP_REMOVED, Event record counter: 1
> {code}
> Between the AsyncDispatcher.handle log line and the printEventQueueDetails 
> output, more than 1s is spent iterating over the queue.
> I have uploaded a patch to ensure that printEventQueueDetails is called at 
> most once per 30s.
>  
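
For illustration, the once-per-30s limit described above could look roughly like the sketch below. It is not taken from the attached patches: the PrintThrottle class and its tryAcquire method are hypothetical names introduced only for this example. The helper records the time of the last permitted print and rejects calls that arrive before the interval has elapsed.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper (not part of Hadoop): permits an action at most once per interval. */
final class PrintThrottle {
  private final long intervalMillis;
  // Time of the last permitted print; Long.MIN_VALUE means "never printed yet".
  private final AtomicLong lastPrintMillis = new AtomicLong(Long.MIN_VALUE);

  PrintThrottle(long intervalMillis) {
    this.intervalMillis = intervalMillis;
  }

  /** Returns true if the caller may print now; false if the interval has not elapsed. */
  boolean tryAcquire(long nowMillis) {
    long last = lastPrintMillis.get();
    // The explicit MIN_VALUE check avoids overflow in the subtraction below.
    if (last != Long.MIN_VALUE && nowMillis - last < intervalMillis) {
      return false;
    }
    // compareAndSet lets only one of several racing threads win the slot.
    return lastPrintMillis.compareAndSet(last, nowMillis);
  }

  public static void main(String[] args) {
    PrintThrottle throttle = new PrintThrottle(30_000L);
    System.out.println(throttle.tryAcquire(0L));      // first call is permitted
    System.out.println(throttle.tryAcquire(1_000L));  // 1s later: rejected
    System.out.println(throttle.tryAcquire(30_000L)); // 30s later: permitted again
  }
}
```

A caller would guard the expensive call with something like `if (throttle.tryAcquire(System.currentTimeMillis())) { printEventQueueDetails(); }`, so that the full-queue iteration runs at most once per interval no matter how quickly events arrive.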



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Updated] (YARN-10739) GenericEventHandler.printEventQueueDetails cause RM recovery cost too much time

2021-04-26 Thread Qi Zhu (Jira)



 [ 
https://issues.apache.org/jira/browse/YARN-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated YARN-10739:
--
Attachment: YARN-10739.005.patch


[jira] [Updated] (YARN-10739) GenericEventHandler.printEventQueueDetails cause RM recovery cost too much time

2021-04-26 Thread Qi Zhu (Jira)



 [ 
https://issues.apache.org/jira/browse/YARN-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated YARN-10739:
--
Attachment: YARN-10739.004.patch


[jira] [Updated] (YARN-10739) GenericEventHandler.printEventQueueDetails cause RM recovery cost too much time

2021-04-21 Thread Qi Zhu (Jira)



 [ 
https://issues.apache.org/jira/browse/YARN-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated YARN-10739:
--
Parent: YARN-10695
Issue Type: Sub-task  (was: Bug)


[jira] [Updated] (YARN-10739) GenericEventHandler.printEventQueueDetails cause RM recovery cost too much time

2021-04-16 Thread Qi Zhu (Jira)



 [ 
https://issues.apache.org/jira/browse/YARN-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated YARN-10739:
--
Attachment: YARN-10739.003.patch


[jira] [Updated] (YARN-10739) GenericEventHandler.printEventQueueDetails cause RM recovery cost too much time

2021-04-16 Thread Zhanqi Cai (Jira)



 [ 
https://issues.apache.org/jira/browse/YARN-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhanqi Cai updated YARN-10739:
--
Attachment: YARN-10739-002.patch


[jira] [Updated] (YARN-10739) GenericEventHandler.printEventQueueDetails cause RM recovery cost too much time

2021-04-16 Thread Zhanqi Cai (Jira)



 [ 
https://issues.apache.org/jira/browse/YARN-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhanqi Cai updated YARN-10739:
--
Description: 
Due to YARN-8995 YARN-10642 add GenericEventHandler.printEventQueueDetails on 
AsyncDispatcher, if the event queue size is too large, the 
printEventQueueDetails will cost too much time and RM  take a long time to 
process.

For example:
 If we have 4K nodes on cluster and 4K apps running, if we do switch and the 
node manager will register with RM, and RM will call NodesListManager to do 
RMAppNodeUpdateEvent, code like below:
{code:java}
for(RMApp app : rmContext.getRMApps().values()) {
  if (!app.isAppFinalStateStored()) {
this.rmContext
.getDispatcher()
.getEventHandler()
.handle(
new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
appNodeUpdateType));
  }
}{code}
So the total event is 4k*4k=16 mil, during this window, the 
GenericEventHandler.printEventQueueDetails will print the event queue detail 
and be called frequently, once the event queue size reaches 1 mil+, the 
Iterator of the queue from printEventQueueDetails will be so slow refer to 
below: 
{code:java}
private void printEventQueueDetails() {
  Iterator iterator = eventQueue.iterator();
  Map counterMap = new HashMap<>();
  while (iterator.hasNext()) {
Enum eventType = iterator.next().getType();
{code}
Then RM recovery will cost too much time.
 Refer to our log:
{code:java}
2021-04-14 20:35:34,432 INFO  event.AsyncDispatcher 
(AsyncDispatcher.java:handle(306)) - Size of event-queue is 1200

2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
(AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: KILL, Event 
record counter: 310836
2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
(AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_UPDATE, 
Event record counter: 1103
2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
(AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_REMOVED, 
Event record counter: 1
2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
(AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: APP_REMOVED, 
Event record counter: 1
{code}
Between AsyncDispatcher.handle and printEventQueueDetails, here is more than 1s 
to do Iterator.

I upload a file to ensure the printEventQueueDetails only be called one-time 
pre-30s.

 

  was:
Due to YARN-8995 YARN-10642 add GenericEventHandler.printEventQueueDetails on 
AsyncDispatcher, if the event queue size is too large, the 
printEventQueueDetails will cost too much time and RM  take a long time to 
process.

For example:
 If we have 4K nodes on cluster and 4K apps running, if we do switch and the 
node manager will register with RM, and RM will call NodesListManager to do 
RMAppNodeUpdateEvent, code like below:
{code:java}
for(RMApp app : rmContext.getRMApps().values()) {
  if (!app.isAppFinalStateStored()) {
this.rmContext
.getDispatcher()
.getEventHandler()
.handle(
new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
appNodeUpdateType));
  }
}{code}
So the total event is 4k*4k=16 mil, during this window, the 
GenericEventHandler.printEventQueueDetails will print the event queue detail 
and be called frequently, once the event queue size reaches 100W+, the Iterator 
of the queue from printEventQueueDetails will be so slow refer to below: 
{code:java}
private void printEventQueueDetails() {
  Iterator iterator = eventQueue.iterator();
  Map counterMap = new HashMap<>();
  while (iterator.hasNext()) {
Enum eventType = iterator.next().getType();
{code}
Then RM recovery will cost too much time.
 Refer to our log:
{code:java}
2021-04-14 20:35:34,432 INFO  event.AsyncDispatcher 
(AsyncDispatcher.java:handle(306)) - Size of event-queue is 1200

2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
(AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: KILL, Event 
record counter: 310836
2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
(AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_UPDATE, 
Event record counter: 1103
2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
(AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_REMOVED, 
Event record counter: 1
2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
(AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: APP_REMOVED, 
Event record counter: 1
{code}
Between AsyncDispatcher.handle and printEventQueueDetails, here is more than 1s 
to do Iterator.

I upload a file to ensure the printEventQueueDetails only be called one-time 
pre-30s.

 



[jira] [Updated] (YARN-10739) GenericEventHandler.printEventQueueDetails cause RM recovery cost too much time

2021-04-16 Thread Zhanqi Cai (Jira)



 [ 
https://issues.apache.org/jira/browse/YARN-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhanqi Cai updated YARN-10739:
--
Description: 
Due to YARN-8995 YARN-10642 add GenericEventHandler.printEventQueueDetails on 
AsyncDispatcher, if the event queue size is too large, the 
printEventQueueDetails will cost too much time and RM  take a long time to 
process.

For example:
 If we have 4K nodes on cluster and 4K apps running, if we do switch and the 
node manager will register with RM, and RM will call NodesListManager to do 
RMAppNodeUpdateEvent, code like below:
{code:java}
for(RMApp app : rmContext.getRMApps().values()) {
  if (!app.isAppFinalStateStored()) {
this.rmContext
.getDispatcher()
.getEventHandler()
.handle(
new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
appNodeUpdateType));
  }
}{code}
So the total event is 4k*4k=16 mil, during this window, the 
GenericEventHandler.printEventQueueDetails will print the event queue detail 
and be called frequently, once the event queue size reaches 100W+, the Iterator 
of the queue from printEventQueueDetails will be so slow refer to below: 
{code:java}
private void printEventQueueDetails() {
  Iterator iterator = eventQueue.iterator();
  Map counterMap = new HashMap<>();
  while (iterator.hasNext()) {
Enum eventType = iterator.next().getType();
{code}
Then RM recovery will cost too much time.
 Refer to our log:
{code:java}
2021-04-14 20:35:34,432 INFO  event.AsyncDispatcher 
(AsyncDispatcher.java:handle(306)) - Size of event-queue is 1200

2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
(AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: KILL, Event 
record counter: 310836
2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
(AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_UPDATE, 
Event record counter: 1103
2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
(AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_REMOVED, 
Event record counter: 1
2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
(AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: APP_REMOVED, 
Event record counter: 1

{code}
Between AsyncDispatcher.handle and printEventQueueDetails, here is more than 1s 
to do Iterator.

I upload a file to ensure the printEventQueueDetails only be called one-time 
pre-30s.

 

  was:
Due to YARN-8995 YARN-10642 add GenericEventHandler.printEventQueueDetails on 
AsyncDispatcher, if the event queue size is too large, the 
printEventQueueDetails will cost too much time and RM  take a long time to 
process.

For example:
 If we have 4K nodes on cluster and 4K apps running, if we do switch and the 
node manager will register with RM, and RM will call NodesListManager to do 
RMAppNodeUpdateEvent, code like below:
{code:java}
for(RMApp app : rmContext.getRMApps().values()) {
  if (!app.isAppFinalStateStored()) {
this.rmContext
.getDispatcher()
.getEventHandler()
.handle(
new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
appNodeUpdateType));
  }
}{code}
So the total event is 4k*4k=1600W, during this window, the 
GenericEventHandler.printEventQueueDetails will print the event queue detail 
and be called frequently, once the event queue size reaches 100W+, the Iterator 
of the queue from printEventQueueDetails will be so slow refer to below: 
{code:java}
private void printEventQueueDetails() {
  Iterator iterator = eventQueue.iterator();
  Map counterMap = new HashMap<>();
  while (iterator.hasNext()) {
Enum eventType = iterator.next().getType();
{code}
Then RM recovery will cost too much time.
 Refer to our log:
{code:java}
2021-04-14 20:35:34,432 INFO  event.AsyncDispatcher 
(AsyncDispatcher.java:handle(306)) - Size of event-queue is 1200

2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
(AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: KILL, Event 
record counter: 310836
2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
(AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_UPDATE, 
Event record counter: 1103
2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
(AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_REMOVED, 
Event record counter: 1
2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
(AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: APP_REMOVED, 
Event record counter: 1

{code}
Between AsyncDispatcher.handle and printEventQueueDetails, here is more than 1s 
to do Iterator.

I upload a file to ensure the printEventQueueDetails only be called one-time 
pre-30s.

 



[jira] [Updated] (YARN-10739) GenericEventHandler.printEventQueueDetails cause RM recovery cost too much time

2021-04-16 Thread Zhanqi Cai (Jira)



 [ 
https://issues.apache.org/jira/browse/YARN-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhanqi Cai updated YARN-10739:
--
Description: 
Due to YARN-8995 YARN-10642 add GenericEventHandler.printEventQueueDetails on 
AsyncDispatcher, if the event queue size is too large, the 
printEventQueueDetails will cost too much time and RM  take a long time to 
process.

For example:
 If we have 4K nodes on cluster and 4K apps running, if we do switch and the 
node manager will register with RM, and RM will call NodesListManager to do 
RMAppNodeUpdateEvent, code like below:
{code:java}
for(RMApp app : rmContext.getRMApps().values()) {
  if (!app.isAppFinalStateStored()) {
this.rmContext
.getDispatcher()
.getEventHandler()
.handle(
new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
appNodeUpdateType));
  }
}{code}
So the total event is 4k*4k=16 mil, during this window, the 
GenericEventHandler.printEventQueueDetails will print the event queue detail 
and be called frequently, once the event queue size reaches 100W+, the Iterator 
of the queue from printEventQueueDetails will be so slow refer to below: 
{code:java}
private void printEventQueueDetails() {
  Iterator iterator = eventQueue.iterator();
  Map counterMap = new HashMap<>();
  while (iterator.hasNext()) {
Enum eventType = iterator.next().getType();
{code}
Then RM recovery will cost too much time.
 Refer to our log:
{code:java}
2021-04-14 20:35:34,432 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(306)) - Size of event-queue is 1200

2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: KILL, Event record counter: 310836
2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_UPDATE, Event record counter: 1103
2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_REMOVED, Event record counter: 1
2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: APP_REMOVED, Event record counter: 1
{code}
Between the AsyncDispatcher.handle log line and the printEventQueueDetails 
log lines, more than 1 s is spent iterating over the queue.
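
The gap can be read off the two timestamps in the log above. A small sketch of that arithmetic follows; the class name LogGapSketch and the gapMillis helper are made up for this example.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class LogGapSketch {
  // Timestamp layout used by the log lines quoted above.
  private static final DateTimeFormatter LOG_TS =
      DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

  // Milliseconds elapsed between two log timestamps.
  static long gapMillis(String earlier, String later) {
    LocalDateTime start = LocalDateTime.parse(earlier, LOG_TS);
    LocalDateTime end = LocalDateTime.parse(later, LOG_TS);
    return Duration.between(start, end).toMillis();
  }

  public static void main(String[] args) {
    // handle(...) line versus the first printEventQueueDetails(...) line
    System.out.println(
        gapMillis("2021-04-14 20:35:34,432", "2021-04-14 20:35:35,818"));
  }
}
```

For the two timestamps above this gives 1386 ms.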

I have uploaded a patch that ensures printEventQueueDetails is called at most 
once every 30 s.
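
The attached patch is not reproduced here. As a minimal sketch of the idea only, assuming a simple time-based guard, the code below scans the queue at most once per 30 s and skips the scan otherwise. ThrottledQueueDetails, its EventType enum, MIN_INTERVAL_MS, and countByTypeIfDue are hypothetical names for this example and are not taken from AsyncDispatcher or from the patch.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Optional;
import java.util.Queue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

public class ThrottledQueueDetails {
  // Hypothetical event types, standing in for the real event enums.
  enum EventType { KILL, NODE_UPDATE }

  static final long MIN_INTERVAL_MS = 30_000L; // assumed 30 s interval
  private static final long NEVER = Long.MIN_VALUE;

  private final Queue<EventType> eventQueue = new LinkedBlockingQueue<>();
  private final AtomicLong lastScanMs = new AtomicLong(NEVER);

  void enqueue(EventType type) {
    eventQueue.add(type);
  }

  // Returns true only if no scan happened in the last MIN_INTERVAL_MS.
  private boolean tryAcquire(long nowMs) {
    long last = lastScanMs.get();
    if (last != NEVER && nowMs - last < MIN_INTERVAL_MS) {
      return false;
    }
    // If another thread won the race, skip this scan as well.
    return lastScanMs.compareAndSet(last, nowMs);
  }

  // Counts queued events per type, but only when a scan is due.
  Optional<Map<EventType, Long>> countByTypeIfDue(long nowMs) {
    if (!tryAcquire(nowMs)) {
      return Optional.empty();
    }
    Map<EventType, Long> counts = new EnumMap<>(EventType.class);
    for (EventType type : eventQueue) { // O(queue size) scan
      counts.merge(type, 1L, Long::sum);
    }
    return Optional.of(counts);
  }

  public static void main(String[] args) {
    ThrottledQueueDetails details = new ThrottledQueueDetails();
    details.enqueue(EventType.KILL);
    details.enqueue(EventType.KILL);
    details.enqueue(EventType.NODE_UPDATE);
    System.out.println(details.countByTypeIfDue(0L));      // first call scans
    System.out.println(details.countByTypeIfDue(10_000L)); // within 30 s: skipped
    System.out.println(details.countByTypeIfDue(30_000L)); // 30 s elapsed: scans
  }
}
```

Passing the current time in as a parameter keeps the guard easy to test; a real caller would pass a monotonic clock reading instead.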

 


[jira] [Updated] (YARN-10739) GenericEventHandler.printEventQueueDetails cause RM recovery cost too much time

2021-04-16 Thread Zhanqi Cai (Jira)



 [ 
https://issues.apache.org/jira/browse/YARN-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhanqi Cai updated YARN-10739:
--
Description: 
YARN-8995 and YARN-10642 added GenericEventHandler.printEventQueueDetails to 
AsyncDispatcher. If the event queue is too large, printEventQueueDetails 
takes too much time and the RM takes a long time to process events.

For example:
 If we have 4K nodes in the cluster and 4K apps running, and we do an RM 
switch, the node managers register with the RM, and the RM calls 
NodesListManager to send an RMAppNodeUpdateEvent to each app, with code like 
below:
{code:java}
for (RMApp app : rmContext.getRMApps().values()) {
  if (!app.isAppFinalStateStored()) {
    this.rmContext
        .getDispatcher()
        .getEventHandler()
        .handle(
            new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
                appNodeUpdateType));
  }
}
{code}
So the total number of events is 4K * 4K = 16 million. During this window, 
GenericEventHandler.printEventQueueDetails is called frequently to print the 
event queue details. Once the event queue size reaches 1 million+, iterating 
over the queue in printEventQueueDetails becomes very slow; see below: 
{code:java}
private void printEventQueueDetails() {
  Iterator<Event> iterator = eventQueue.iterator();
  Map<Enum, Long> counterMap = new HashMap<>();
  while (iterator.hasNext()) {
    Enum eventType = iterator.next().getType();
    // ... rest of the method is not shown in this excerpt
{code}
As a result, RM recovery takes too much time.
 See our log:
{code:java}
2021-04-14 20:35:34,432 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(306)) - Size of event-queue is 1200

2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: KILL, Event record counter: 310836
2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_UPDATE, Event record counter: 1103
2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_REMOVED, Event record counter: 1
2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: APP_REMOVED, Event record counter: 1
{code}
Between the AsyncDispatcher.handle log line and the printEventQueueDetails 
log lines, more than 1 s is spent iterating over the queue.

I have uploaded a patch that ensures printEventQueueDetails is called at most 
once every 30 s.

 


[jira] [Updated] (YARN-10739) GenericEventHandler.printEventQueueDetails cause RM recovery cost too much time

2021-04-16 Thread Zhanqi Cai (Jira)



 [ 
https://issues.apache.org/jira/browse/YARN-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhanqi Cai updated YARN-10739:
--
Description: 
YARN-10642 added GenericEventHandler.printEventQueueDetails to 
AsyncDispatcher. If the event queue is too large, printEventQueueDetails 
takes too much time and the RM takes a long time to process events.

For example:
 If we have 4K nodes in the cluster and 4K apps running, and we do an RM 
switch, the node managers register with the RM, and the RM calls 
NodesListManager to send an RMAppNodeUpdateEvent to each app, with code like 
below:
{code:java}
for (RMApp app : rmContext.getRMApps().values()) {
  if (!app.isAppFinalStateStored()) {
    this.rmContext
        .getDispatcher()
        .getEventHandler()
        .handle(
            new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
                appNodeUpdateType));
  }
}
{code}
So the total number of events is 4K * 4K = 16 million. During this window, 
GenericEventHandler.printEventQueueDetails is called frequently to print the 
event queue details. Once the event queue size reaches 1 million+, iterating 
over the queue in printEventQueueDetails becomes very slow; see below: 
{code:java}
private void printEventQueueDetails() {
  Iterator<Event> iterator = eventQueue.iterator();
  Map<Enum, Long> counterMap = new HashMap<>();
  while (iterator.hasNext()) {
    Enum eventType = iterator.next().getType();
    // ... rest of the method is not shown in this excerpt
{code}
As a result, RM recovery takes too much time.
 See our log:
{code:java}
2021-04-14 20:35:34,432 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(306)) - Size of event-queue is 1200

2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: KILL, Event record counter: 310836
2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_UPDATE, Event record counter: 1103
2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_REMOVED, Event record counter: 1
2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: APP_REMOVED, Event record counter: 1
{code}
Between the AsyncDispatcher.handle log line and the printEventQueueDetails 
log lines, more than 1 s is spent iterating over the queue.

I have uploaded a patch that ensures printEventQueueDetails is called at most 
once every 30 s.

 


[jira] [Updated] (YARN-10739) GenericEventHandler.printEventQueueDetails cause RM recovery cost too much time

2021-04-16 Thread Zhanqi Cai (Jira)



 [ 
https://issues.apache.org/jira/browse/YARN-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhanqi Cai updated YARN-10739:
--
Description: 
YARN-10642 added GenericEventHandler.printEventQueueDetails to 
AsyncDispatcher. If the event queue is too large, printEventQueueDetails 
takes too much time and the RM takes a long time to process events.

For example:
 If we have 4K nodes in the cluster and 4K apps running, and we do an RM 
switch, the node managers register with the RM, and the RM calls 
NodesListManager to send an RMAppNodeUpdateEvent to each app, with code like 
below:
{code:java}
for (RMApp app : rmContext.getRMApps().values()) {
  if (!app.isAppFinalStateStored()) {
    this.rmContext
        .getDispatcher()
        .getEventHandler()
        .handle(
            new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
                appNodeUpdateType));
  }
}
{code}
So the total number of events is 4K * 4K = 16 million. During this window, 
GenericEventHandler.printEventQueueDetails is called frequently to print the 
event queue details. Once the event queue size reaches 1 million+, iterating 
over the queue in printEventQueueDetails becomes very slow; see below:
{code:java}
private void printEventQueueDetails() {
  Iterator<Event> iterator = eventQueue.iterator();
  Map<Enum, Long> counterMap = new HashMap<>();
  while (iterator.hasNext()) {
    Enum eventType = iterator.next().getType();
    // ... rest of the method is not shown in this excerpt
{code}
As a result, RM recovery takes too much time.
See our log:
{code:java}
2021-04-14 20:35:34,432 INFO  event.AsyncDispatcher (AsyncDispatcher.java:handle(306)) - Size of event-queue is 1200

2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: KILL, Event record counter: 310836
2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_UPDATE, Event record counter: 1103
2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_REMOVED, Event record counter: 1
2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: APP_REMOVED, Event record counter: 1
{code}
Between the AsyncDispatcher.handle log line and the printEventQueueDetails 
log lines, more than 1 s is spent iterating over the queue.

I have uploaded a patch that ensures printEventQueueDetails is called at most 
once every 30 s.

 


[jira] [Updated] (YARN-10739) GenericEventHandler.printEventQueueDetails cause RM recovery cost too much time

2021-04-16 Thread Zhanqi Cai (Jira)



 [ 
https://issues.apache.org/jira/browse/YARN-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhanqi Cai updated YARN-10739:
--
Attachment: YARN-10739-001.patch

> GenericEventHandler.printEventQueueDetails cause RM recovery cost too much 
> time
> ---
>
> Key: YARN-10739
> URL: https://issues.apache.org/jira/browse/YARN-10739
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
> Affects Versions: 3.4.0, 3.3.1, 3.2.3
> Reporter: Zhanqi Cai
> Priority: Critical
> Attachments: YARN-10739-001.patch
>
>
> YARN-10642 added GenericEventHandler.printEventQueueDetails to 
> AsyncDispatcher. If the event queue is too large, printEventQueueDetails 
> takes too much time and the RM takes a long time to process events.
> For example:
> If we have 4K nodes in the cluster and 4K apps running, and we do an RM 
> switch, the node managers register with the RM, and the RM calls 
> NodesListManager to send an RMAppNodeUpdateEvent to each app, with code 
> like below:
> for (RMApp app : rmContext.getRMApps().values()) {
>   if (!app.isAppFinalStateStored()) {
>     this.rmContext
>         .getDispatcher()
>         .getEventHandler()
>         .handle(
>             new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
>                 appNodeUpdateType));
>   }
> }
> So the total number of events is 4K * 4K = 16 million. During this window, 
> GenericEventHandler.printEventQueueDetails is called frequently to print 
> the event queue details. Once the event queue size reaches 1 million+, 
> iterating over the queue in printEventQueueDetails becomes very slow; see 
> below:
> private void printEventQueueDetails() {
>   Iterator<Event> iterator = eventQueue.iterator();
>   Map<Enum, Long> counterMap = new HashMap<>();
>   while (iterator.hasNext()) {
>     Enum eventType = iterator.next().getType();
>     // ... rest of the method is not shown in this excerpt
> As a result, RM recovery takes too much time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Updated] (YARN-10739) GenericEventHandler.printEventQueueDetails cause RM recovery cost too much time

2021-04-16 Thread Zhanqi Cai (Jira)



 [ 
https://issues.apache.org/jira/browse/YARN-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhanqi Cai updated YARN-10739:
--
Attachment: (was: Queue_Details.patch)


[jira] [Updated] (YARN-10739) GenericEventHandler.printEventQueueDetails cause RM recovery cost too much time

2021-04-16 Thread Zhanqi Cai (Jira)



 [ 
https://issues.apache.org/jira/browse/YARN-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhanqi Cai updated YARN-10739:
--
Attachment: Queue_Details.patch

