[jira] [Commented] (YARN-11633) [Federation] Improve LoadBasedRouterPolicy To Use Available vcores

2023-12-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799230#comment-17799230
 ] 

ASF GitHub Bot commented on YARN-11633:
---

singer-bin commented on PR #6356:
URL: https://github.com/apache/hadoop/pull/6356#issuecomment-1865382265

   @slfan1989 Thank you for your reply; I will close this PR. How can I 
contact you (for example, on WeChat)? I would like to ask you a few questions.




> [Federation] Improve LoadBasedRouterPolicy To Use Available vcores
> --
>
> Key: YARN-11633
> URL: https://issues.apache.org/jira/browse/YARN-11633
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: federation
>Affects Versions: 3.3.6
>Reporter: yanbin.zhang
>Priority: Major
>  Labels: pull-request-available
>
> When selecting a subcluster, consider not only available memory but also 
> available vcores.
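
The idea in the description can be illustrated with a minimal sketch. This is not the actual LoadBasedRouterPolicy code: SubClusterSelector, SubClusterLoad, and select are hypothetical names introduced here, and the rule used (require both a minimum of free memory and a minimum of free vcores, then prefer the most free memory) is only one plausible way to combine the two resources.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

final class SubClusterSelector {

  /** Hypothetical snapshot of one subcluster's free capacity. */
  static final class SubClusterLoad {
    final String id;
    final long availableMemoryMb;
    final int availableVCores;

    SubClusterLoad(String id, long availableMemoryMb, int availableVCores) {
      this.id = id;
      this.availableMemoryMb = availableMemoryMb;
      this.availableVCores = availableVCores;
    }
  }

  /**
   * Picks the subcluster with the most free memory among those that also
   * have enough free vcores. Ties on memory are broken by free vcores.
   * Returns an empty Optional when no subcluster satisfies both minimums.
   */
  static Optional<String> select(List<SubClusterLoad> loads,
                                 long minMemoryMb, int minVCores) {
    return loads.stream()
        .filter(l -> l.availableMemoryMb >= minMemoryMb
            && l.availableVCores >= minVCores)
        .max(Comparator
            .comparingLong((SubClusterLoad l) -> l.availableMemoryMb)
            .thenComparingInt(l -> l.availableVCores))
        .map(l -> l.id);
  }
}
```

With a memory-only rule, a subcluster that has plenty of free memory but no free vcores could still be chosen; the filter above is what excludes it.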



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11636) App stuck in ACCEPTED state, however Yarn metric thinks there are no pending apps in the queue

2023-12-20 Thread Helen Weng (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Helen Weng updated YARN-11636:
--
Description: 
Hi, I've encountered a case recently when an app gets stuck in ACCEPTED state 
forever in a queue.

The queue is busy for the first 4 hrs that the app is queued, so during this 
time, being stuck in ACCEPTED is expected. However, even as resources become 
available and all other jobs run, this job continues to be stuck. I've checked 
the following states:
1. Resources are available at the leaf queue and cluster level.
2. Other jobs can get the resources to run.
3. Not hitting maxAM limits. There are no other jobs queued or running in the 
queue at this time. However...
4. The JMX metrics seem to think the app is running: AppsRunning says 1 and 
containersRunning says 1, while AppsPending says 0. However, the app is 
staunchly in the "Accepted" state and does not seem to be running.

Is this known or have others encountered this issue before? Or do you have any 
advice on what I can look into to debug it? Thanks very much for the help.

  was:
Hi, I've encountered a case recently when an app gets stuck in ACCEPTED state 
forever in a queue.

The queue is busy for about 4 hrs, so during this time, being stuck in ACCEPTED 
is expected. However, even as resources become available and all other jobs run, 
this job continues to be stuck. I've checked the following states:
1. Resources are available at the leaf queue and cluster level.
2. Other jobs can get the resources to run.
3. Not hitting maxAM limits. There are no other jobs queued or running in the 
queue at this time. However...
4. The JMX metrics seem to think the app is running: AppsRunning says 1 and 
containersRunning says 1, while AppsPending says 0. However, the app is 
staunchly in the "Accepted" state and does not seem to be running.

Is this known or have others encountered this issue before? Or do you have any 
advice on what I can look into to debug it? Thanks very much for the help.


> App stuck in ACCEPTED state, however Yarn metric thinks there are no pending 
> apps in the queue
> --
>
> Key: YARN-11636
> URL: https://issues.apache.org/jira/browse/YARN-11636
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.1
>Reporter: Helen Weng
>Priority: Major
>
> Hi, I've encountered a case recently when an app gets stuck in ACCEPTED state 
> forever in a queue.
> The queue is busy for the first 4 hrs that the app is queued, so during this 
> time, being stuck in ACCEPTED is expected. However, even as resources become 
> available and all other jobs run, this job continues to be stuck. I've 
> checked the following states:
> 1. Resources are available at the leaf queue and cluster level.
> 2. Other jobs can get the resources to run.
> 3. Not hitting maxAM limits. There are no other jobs queued or running in the 
> queue at this time. However...
> 4. The JMX metrics seem to think the app is running: AppsRunning says 1 and 
> containersRunning says 1, while AppsPending says 0. However, the app is 
> staunchly in the "Accepted" state and does not seem to be running.
> Is this known or have others encountered this issue before? Or do you have 
> any advice on what I can look into to debug it? Thanks very much for the help.
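
The mismatch reported above can be framed as a consistency check between the application's state and the queue counters. The sketch below is illustrative only: AcceptedStateCheck, QueueSnapshot, and isInconsistent are hypothetical names, the values are assumed to have been read already (for example from the ResourceManager's JMX output), and it assumes, as in the report, that the stuck app is the only app in the queue. It does not query a real cluster.

```java
final class AcceptedStateCheck {

  /** Hypothetical holder for the two queue counters named in the report. */
  static final class QueueSnapshot {
    final int appsRunning;
    final int appsPending;

    QueueSnapshot(int appsRunning, int appsPending) {
      this.appsRunning = appsRunning;
      this.appsPending = appsPending;
    }
  }

  /**
   * Flags the situation from the report: the app is still ACCEPTED, yet the
   * queue counts it as running (AppsRunning > 0) and counts nothing as
   * pending (AppsPending == 0).
   */
  static boolean isInconsistent(String appState, QueueSnapshot queue) {
    return "ACCEPTED".equals(appState)
        && queue.appsRunning > 0
        && queue.appsPending == 0;
  }
}
```

A check like this only detects the symptom; it says nothing about why the counters and the application state disagree.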






[jira] [Updated] (YARN-11636) App stuck in ACCEPTED state, however Yarn metric thinks there are no pending apps in the queue

2023-12-20 Thread Helen Weng (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Helen Weng updated YARN-11636:
--
Description: 
Hi, I've encountered a case recently when an app gets stuck in ACCEPTED state 
forever in a queue.

The queue is busy for about 4 hrs, so during this time, being stuck in ACCEPTED 
is expected. However, even as resources become available and all other jobs run, 
this job continues to be stuck. I've checked the following states:
1. Resources are available at the leaf queue and cluster level.
2. Other jobs can get the resources to run.
3. Not hitting maxAM limits. There are no other jobs queued or running in the 
queue at this time. However...
4. The JMX metrics seem to think the app is running: AppsRunning says 1 and 
containersRunning says 1, while AppsPending says 0. However, the app is 
staunchly in the "Accepted" state and does not seem to be running.

Is this known or have others encountered this issue before? Or do you have any 
advice on what I can look into to debug it? Thanks very much for the help.

  was:
Hi, I've encountered a case recently when an app gets stuck in ACCEPTED state 
forever in a queue.

The queue is busy for about 4 hrs, so during this time, being stuck in ACCEPTED 
is expected. However, even as resources become available and all other jobs run, 
this job continues to be stuck. I've checked the following states:
1. Resources are available at the leaf queue and cluster level.
2. Other jobs can get the resources to run.
3. Not hitting maxAM limits (there is only 1 other job running during this time 
in the queue and it is using near 0 resources).
4. The JMX metrics seem to think the app is running: AppsRunning says 1 and 
containersRunning says 1, while AppsPending says 0. However, the app is 
staunchly in the "Accepted" state and does not seem to be running.

Is this known or have others encountered this issue before? Or do you have any 
advice on what I can look into to debug it? Thanks very much for the help.


> App stuck in ACCEPTED state, however Yarn metric thinks there are no pending 
> apps in the queue
> --
>
> Key: YARN-11636
> URL: https://issues.apache.org/jira/browse/YARN-11636
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.1
>Reporter: Helen Weng
>Priority: Major
>
> Hi, I've encountered a case recently when an app gets stuck in ACCEPTED state 
> forever in a queue.
> The queue is busy for about 4 hrs, so during this time, being stuck in 
> ACCEPTED is expected. However, even as resources become available and all 
> other jobs run, this job continues to be stuck. I've checked the following 
> states:
> 1. Resources are available at the leaf queue and cluster level.
> 2. Other jobs can get the resources to run.
> 3. Not hitting maxAM limits. There are no other jobs queued or running in the 
> queue at this time. However...
> 4. The JMX metrics seem to think the app is running: AppsRunning says 1 and 
> containersRunning says 1, while AppsPending says 0. However, the app is 
> staunchly in the "Accepted" state and does not seem to be running.
> Is this known or have others encountered this issue before? Or do you have 
> any advice on what I can look into to debug it? Thanks very much for the help.






[jira] [Created] (YARN-11636) App stuck in ACCEPTED state, however Yarn metric thinks there are no pending apps in the queue

2023-12-20 Thread Helen Weng (Jira)
Helen Weng created YARN-11636:
-

 Summary: App stuck in ACCEPTED state, however Yarn metric thinks 
there are no pending apps in the queue
 Key: YARN-11636
 URL: https://issues.apache.org/jira/browse/YARN-11636
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.2.1
Reporter: Helen Weng


Hi, I've encountered a case recently when an app gets stuck in ACCEPTED state 
forever in a queue.

The queue is busy for about 4 hrs, so during this time, being stuck in ACCEPTED 
is expected. However, even as resources become available and all other jobs run, 
this job continues to be stuck. I've checked the following states:
1. Resources are available at the leaf queue and cluster level.
2. Other jobs can get the resources to run.
3. Not hitting maxAM limits (there is only 1 other job running during this time 
in the queue and it is using near 0 resources).
4. The JMX metrics seem to think the app is running: AppsRunning says 1 and 
containersRunning says 1, while AppsPending says 0. However, the app is 
staunchly in the "Accepted" state and does not seem to be running.

Is this known or have others encountered this issue before? Or do you have any 
advice on what I can look into to debug it? Thanks very much for the help.






[jira] [Resolved] (YARN-11634) Speed-up TestTimelineClient

2023-12-20 Thread Benjamin Teke (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Teke resolved YARN-11634.
--
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Speed-up TestTimelineClient
> ---
>
> Key: YARN-11634
> URL: https://issues.apache.org/jira/browse/YARN-11634
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Bence Kosztolnik
>Assignee: Bence Kosztolnik
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> The TimelineConnector.class has a hardcoded 1-minute connection timeout, 
> which makes the TestTimelineClient a long-running test (~15:30 min).
> Decreasing the timeout to 10ms will speed up the test run (~56 sec).






[jira] [Commented] (YARN-11634) Speed-up TestTimelineClient

2023-12-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798919#comment-17798919
 ] 

ASF GitHub Bot commented on YARN-11634:
---

brumi1024 commented on PR #6371:
URL: https://github.com/apache/hadoop/pull/6371#issuecomment-1864288422

   Merging to trunk.




> Speed-up TestTimelineClient
> ---
>
> Key: YARN-11634
> URL: https://issues.apache.org/jira/browse/YARN-11634
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Bence Kosztolnik
>Assignee: Bence Kosztolnik
>Priority: Minor
>  Labels: pull-request-available
>
> The TimelineConnector.class has a hardcoded 1-minute connection timeout, 
> which makes the TestTimelineClient a long-running test (~15:30 min).
> Decreasing the timeout to 10ms will speed up the test run (~56 sec).






[jira] [Commented] (YARN-11634) Speed-up TestTimelineClient

2023-12-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798920#comment-17798920
 ] 

ASF GitHub Bot commented on YARN-11634:
---

brumi1024 merged PR #6371:
URL: https://github.com/apache/hadoop/pull/6371




> Speed-up TestTimelineClient
> ---
>
> Key: YARN-11634
> URL: https://issues.apache.org/jira/browse/YARN-11634
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Bence Kosztolnik
>Assignee: Bence Kosztolnik
>Priority: Minor
>  Labels: pull-request-available
>
> The TimelineConnector.class has a hardcoded 1-minute connection timeout, 
> which makes the TestTimelineClient a long-running test (~15:30 min).
> Decreasing the timeout to 10ms will speed up the test run (~56 sec).






[jira] [Commented] (YARN-11634) Speed-up TestTimelineClient

2023-12-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798918#comment-17798918
 ] 

ASF GitHub Bot commented on YARN-11634:
---

brumi1024 commented on code in PR #6371:
URL: https://github.com/apache/hadoop/pull/6371#discussion_r1432578570


##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java:
##
@@ -78,6 +78,7 @@ public void setup() {
 conf.setBoolean(YarnConfiguration.TIMELINE_SERVICE_ENABLED, true);
 conf.setFloat(YarnConfiguration.TIMELINE_SERVICE_VERSION, 1.0f);
 client = createTimelineClient(conf);
+TimelineConnector.DEFAULT_SOCKET_TIMEOUT = 10;

Review Comment:
   Thanks, in that case LGTM.
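
The hunk above lowers a static default timeout from the test's setup method. A minimal sketch of that pattern, using only the JDK, might look like the following. SketchConnector and its open method are hypothetical stand-ins, not the real TimelineConnector; the millisecond unit, the 60_000 starting value (taken from the "1-minute" wording in the issue), and the localhost URL in the usage note are assumptions for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

final class SketchConnector {

  // Deliberately non-final so a test can shorten it before opening connections.
  // Assumed to be in milliseconds; 60_000 mirrors the "1-minute" default.
  static int DEFAULT_SOCKET_TIMEOUT = 60_000;

  /**
   * Prepares (but does not yet establish) a connection whose connect and read
   * timeouts both come from the static default.
   */
  static URLConnection open(String url) {
    try {
      URLConnection conn = URI.create(url).toURL().openConnection();
      conn.setConnectTimeout(DEFAULT_SOCKET_TIMEOUT);
      conn.setReadTimeout(DEFAULT_SOCKET_TIMEOUT);
      return conn;
    } catch (IOException e) {
      // MalformedURLException is an IOException, so it is covered here too.
      throw new UncheckedIOException(e);
    }
  }
}
```

A test would set `SketchConnector.DEFAULT_SOCKET_TIMEOUT = 10;` in its setup and then call `SketchConnector.open("http://localhost:8188/")`, so every failed connection attempt gives up after roughly 10 ms instead of a minute. Because the field is static, the override leaks into any other test in the same JVM unless it is reset afterwards.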





> Speed-up TestTimelineClient
> ---
>
> Key: YARN-11634
> URL: https://issues.apache.org/jira/browse/YARN-11634
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Bence Kosztolnik
>Assignee: Bence Kosztolnik
>Priority: Minor
>  Labels: pull-request-available
>
> The TimelineConnector.class has a hardcoded 1-minute connection timeout, 
> which makes the TestTimelineClient a long-running test (~15:30 min).
> Decreasing the timeout to 10ms will speed up the test run (~56 sec).


