[jira] [Updated] (YARN-11644) LogAggregationService can't upload log in time when application finished

2024-01-09 Thread Xie YiFan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xie YiFan updated YARN-11644:
-
Affects Version/s: 3.3.6

> LogAggregationService can't upload log in time when application finished
> 
>
> Key: YARN-11644
> URL: https://issues.apache.org/jira/browse/YARN-11644
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 3.3.6
>Reporter: Xie YiFan
>Assignee: Xie YiFan
>Priority: Minor
> Attachments: image-2024-01-10-11-03-57-553.png
>
>
> LogAggregationService is responsible for uploading logs to HDFS. It uses a 
> thread pool to execute the upload tasks.
> The log upload workflow is as follows:
>  # The NM constructs an Application object when the first container of an 
> application launches, then notifies LogAggregationService to initialize an 
> AppLogAggregationImpl.
>  # LogAggregationService submits the AppLogAggregationImpl to the task queue.
>  # An idle worker thread of the pool pulls the AppLogAggregationImpl from the 
> task queue.
>  # The AppLogAggregationImpl spins in a while loop checking the application 
> state, and uploads the logs once the application has finished.
> Suppose the following scenario:
>  * LogAggregationService initializes its thread pool with 4 threads.
>  * 4 long-running applications start on this NM, so all threads are occupied 
> by their aggregators.
>  * A short application then starts on this NM and quickly finishes, but there 
> is no idle thread left to upload its logs.
> As a result, subsequent applications have to wait for the previous 
> applications to finish before their logs can be uploaded.
> !image-2024-01-10-11-03-57-553.png|width=599,height=195!
> h4. Solution
> Change the spin behavior of AppLogAggregationImpl: if the application has not 
> finished yet, return immediately to yield the current thread and resubmit 
> itself to the executor service. This way LogAggregationService keeps rolling 
> the task queue, and the logs of finished applications can be uploaded right 
> away.
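
For illustration, the starvation described above can be sketched with a minimal, 
hypothetical example (class names, application names, and timings are 
assumptions for the sketch, not the actual NodeManager code): four workers are 
pinned by tasks that wait for their long-running applications, so the 
already-finished short application never reaches a worker.

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class LogUploadStarvationSketch {

  // Hypothetical stand-in for AppLogAggregationImpl: the task holds its worker
  // thread until the application finishes, and only then uploads the logs.
  static Runnable spinningAggregator(String appId, AtomicBoolean finished) {
    return () -> {
      while (!finished.get()) {
        try {
          TimeUnit.MILLISECONDS.sleep(200);   // spin-wait, occupying the thread
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
      }
      System.out.println("uploading logs for " + appId);
    };
  }

  public static void main(String[] args) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(4);   // the 4-thread pool

    // Four long-running applications occupy all four workers...
    for (int i = 1; i <= 4; i++) {
      pool.submit(spinningAggregator("longApp" + i, new AtomicBoolean(false)));
    }

    // ...so the short application, although already finished, just waits in the
    // task queue and its logs are never uploaded in this window.
    pool.submit(spinningAggregator("shortApp", new AtomicBoolean(true)));

    TimeUnit.SECONDS.sleep(3);   // nothing is printed for shortApp here
    pool.shutdownNow();
  }
}
{code}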






[jira] [Updated] (YARN-11644) LogAggregationService can't upload log in time when application finished

2024-01-09 Thread Xie YiFan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xie YiFan updated YARN-11644:
-
Description: 
LogAggregationService is responsible for uploading logs to HDFS. It uses a 
thread pool to execute the upload tasks.

The log upload workflow is as follows:
 # The NM constructs an Application object when the first container of an 
application launches, then notifies LogAggregationService to initialize an 
AppLogAggregationImpl.
 # LogAggregationService submits the AppLogAggregationImpl to the task queue.
 # An idle worker thread of the pool pulls the AppLogAggregationImpl from the 
task queue.
 # The AppLogAggregationImpl spins in a while loop checking the application 
state, and uploads the logs once the application has finished.

Suppose the following scenario:
 * LogAggregationService initializes its thread pool with 4 threads.
 * 4 long-running applications start on this NM, so all threads are occupied by 
their aggregators.
 * A short application then starts on this NM and quickly finishes, but there is 
no idle thread left to upload its logs.

As a result, subsequent applications have to wait for the previous applications 
to finish before their logs can be uploaded.

!image-2024-01-10-11-03-57-553.png|width=599,height=195!
h4. Solution

Change the spin behavior of AppLogAggregationImpl: if the application has not 
finished yet, return immediately to yield the current thread and resubmit itself 
to the executor service. This way LogAggregationService keeps rolling the task 
queue, and the logs of finished applications can be uploaded right away.

  was:
LogAggregationService is responsible for uploading logs to HDFS. It uses a 
thread pool to execute the upload tasks.

The log upload workflow is as follows:
 # The NM constructs an Application object when the first container of an 
application launches, then notifies LogAggregationService to initialize an 
AppLogAggregationImpl.
 # LogAggregationService submits the AppLogAggregationImpl to the task queue.

 # An idle worker thread of the pool pulls the AppLogAggregationImpl from the 
task queue.

 # The AppLogAggregationImpl spins in a while loop checking the application 
state, and uploads the logs once the application has finished.

Suppose the following scenario:
 * LogAggregationService initializes its thread pool with 4 threads.

 * 4 long-running applications start on this NM, so all threads are occupied by 
their aggregators.

 * A short application then starts on this NM and quickly finishes, but there is 
no idle thread left to upload its logs.

As a result, subsequent applications have to wait for the previous applications 
to finish before their logs can be uploaded.

!image-2024-01-10-11-03-57-553.png|width=599,height=195!
h4. Solution

Change the spin behavior of AppLogAggregationImpl: if the application has not 
finished yet, return immediately to yield the current thread and resubmit itself 
to the executor service. This way LogAggregationService keeps rolling the task 
queue, and the logs of finished applications can be uploaded right away.


> LogAggregationService can't upload log in time when application finished
> 
>
> Key: YARN-11644
> URL: https://issues.apache.org/jira/browse/YARN-11644
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Reporter: Xie YiFan
>Assignee: Xie YiFan
>Priority: Minor
> Attachments: image-2024-01-10-11-03-57-553.png
>
>
> LogAggregationService is responsible for uploading logs to HDFS. It uses a 
> thread pool to execute the upload tasks.
> The log upload workflow is as follows:
>  # The NM constructs an Application object when the first container of an 
> application launches, then notifies LogAggregationService to initialize an 
> AppLogAggregationImpl.
>  # LogAggregationService submits the AppLogAggregationImpl to the task queue.
>  # An idle worker thread of the pool pulls the AppLogAggregationImpl from the 
> task queue.
>  # The AppLogAggregationImpl spins in a while loop checking the application 
> state, and uploads the logs once the application has finished.
> Suppose the following scenario:
>  * LogAggregationService initializes its thread pool with 4 threads.
>  * 4 long-running applications start on this NM, so all threads are occupied 
> by their aggregators.
>  * A short application then starts on this NM and quickly finishes, but there 
> is no idle thread left to upload its logs.
> As a result, subsequent applications have to wait for the previous 
> applications to finish before their logs can be uploaded.
> !image-2024-01-10-11-03-57-553.png|width=599,height=195!
> h4. Solution
> Change the spin behavior of AppLogAggregationImpl: if the application has not 
> finished yet, return immediately to yield the current thread and resubmit 
> itself to the executor service. This way LogAggregationService keeps rolling 
> the task queue, and the logs of finished applications can be uploaded right 
> away.




[jira] [Created] (YARN-11644) LogAggregationService can't upload log in time when application finished

2024-01-09 Thread Xie YiFan (Jira)
Xie YiFan created YARN-11644:


 Summary: LogAggregationService can't upload log in time when 
application finished
 Key: YARN-11644
 URL: https://issues.apache.org/jira/browse/YARN-11644
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: log-aggregation
Reporter: Xie YiFan
Assignee: Xie YiFan
 Attachments: image-2024-01-10-11-03-57-553.png

LogAggregationService is responsible for uploading logs to HDFS. It uses a 
thread pool to execute the upload tasks.

The log upload workflow is as follows:
 # The NM constructs an Application object when the first container of an 
application launches, then notifies LogAggregationService to initialize an 
AppLogAggregationImpl.
 # LogAggregationService submits the AppLogAggregationImpl to the task queue.
 # An idle worker thread of the pool pulls the AppLogAggregationImpl from the 
task queue.
 # The AppLogAggregationImpl spins in a while loop checking the application 
state, and uploads the logs once the application has finished.

Suppose the following scenario:
 * LogAggregationService initializes its thread pool with 4 threads.
 * 4 long-running applications start on this NM, so all threads are occupied by 
their aggregators.
 * A short application then starts on this NM and quickly finishes, but there is 
no idle thread left to upload its logs.

As a result, subsequent applications have to wait for the previous applications 
to finish before their logs can be uploaded.

!image-2024-01-10-11-03-57-553.png|width=599,height=195!
h4. Solution

Change the spin behavior of AppLogAggregationImpl: if the application has not 
finished yet, return immediately to yield the current thread and resubmit itself 
to the executor service. This way LogAggregationService keeps rolling the task 
queue, and the logs of finished applications can be uploaded right away.
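
A minimal sketch of the proposed behavior (hypothetical names and structure; the 
real change would live in AppLogAggregationImpl and LogAggregationService): 
instead of spinning inside run(), the task re-queues itself and returns, so 
every queued application gets a turn on the worker threads and a finished 
application's upload runs promptly.

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical resubmitting aggregator: if the application is still running,
// yield the worker thread by returning and submit this task again, instead of
// looping inside run() and pinning the thread until the application ends.
class ResubmittingAggregator implements Runnable {
  private final String appId;
  private final AtomicBoolean finished;
  private final ExecutorService pool;

  ResubmittingAggregator(String appId, AtomicBoolean finished, ExecutorService pool) {
    this.appId = appId;
    this.finished = finished;
    this.pool = pool;
  }

  @Override
  public void run() {
    if (!finished.get()) {
      pool.submit(this);   // give other queued applications a turn, retry later
      return;
    }
    System.out.println("uploading logs for " + appId);   // application finished
  }
}
{code}

Note that resubmitting immediately like this cycles through the queue 
aggressively; a real implementation would likely add a small delay between 
checks (for example via a ScheduledExecutorService) to avoid busy re-queuing.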






[jira] [Commented] (YARN-11634) Speed-up TestTimelineClient

2024-01-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17804933#comment-17804933
 ] 

ASF GitHub Bot commented on YARN-11634:
---

slfan1989 commented on code in PR #6419:
URL: https://github.com/apache/hadoop/pull/6419#discussion_r1446781464


##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineConnector.java:
##
@@ -145,14 +145,14 @@ protected void serviceInit(Configuration conf) throws 
Exception {
 @Override
 public HttpURLConnection configure(HttpURLConnection conn)
 throws IOException {
-  setTimeouts(conn, DEFAULT_SOCKET_TIMEOUT);
+  setTimeouts(conn, 60_000);

Review Comment:
   @brumi1024 Thanks for reviewing the code, I will improve it.





> Speed-up TestTimelineClient
> ---
>
> Key: YARN-11634
> URL: https://issues.apache.org/jira/browse/YARN-11634
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Bence Kosztolnik
>Assignee: Bence Kosztolnik
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> The TimelineConnector class has a hardcoded 1-minute connection timeout, 
> which makes TestTimelineClient a long-running test (~15:30 min).
> Decreasing the timeout to 10 ms speeds up the test run to ~56 sec.
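
A generic sketch of the technique (assumed names; not the actual 
TimelineConnector code): keep the 60-second production default but let tests 
override the timeout, so a test can drop it to ~10 ms and fail fast.

{code:java}
import java.io.IOException;
import java.net.HttpURLConnection;

// Illustrative only: a connector that keeps a 60s default timeout but lets
// tests inject a much smaller value instead of hardcoding it at the call site.
public class ConnectorTimeoutSketch {
  public static final int DEFAULT_SOCKET_TIMEOUT_MS = 60_000;   // production default

  private int socketTimeoutMs = DEFAULT_SOCKET_TIMEOUT_MS;

  // A test (e.g. something in the spirit of TestTimelineClient) could call this
  // with ~10 ms so connection attempts to an unreachable endpoint give up quickly.
  public void setSocketTimeoutMs(int timeoutMs) {
    this.socketTimeoutMs = timeoutMs;
  }

  public HttpURLConnection configure(HttpURLConnection conn) throws IOException {
    conn.setConnectTimeout(socketTimeoutMs);
    conn.setReadTimeout(socketTimeoutMs);
    return conn;
  }
}
{code}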






[jira] [Commented] (YARN-11643) Skip unnecessary pre-check in Multi Node Placement

2024-01-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17804825#comment-17804825
 ] 

ASF GitHub Bot commented on YARN-11643:
---

hadoop-yetus commented on PR #6426:
URL: https://github.com/apache/hadoop/pull/6426#issuecomment-1883508019

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 20s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | -1 :x: |  test4tests  |   0m  0s |  |  The patch doesn't appear to include 
any new or modified tests. Please justify why no new tests are needed for this 
patch. Also please list what manual steps were performed to verify this patch.  
|
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  31m 18s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 31s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   0m 26s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   0m 30s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 34s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 35s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   0m 31s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   1m 14s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  19m 54s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 25s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 26s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |   0m 26s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 24s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   0m 24s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 20s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 24s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 27s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   0m 26s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   1m  4s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  19m 53s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  | 577m  3s | 
[/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6426/1/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt)
 |  hadoop-yarn-server-resourcemanager in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 22s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 658m 39s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerWithMultiResourceTypes
 |
   |   | hadoop.yarn.server.resourcemanager.TestResourceTrackerService |
   |   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerNodeLabelUpdate
 |
   |   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestAMAllocatedToNonExclusivePartition
 |
   |   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerMultiNodes
 |
   |   | hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter |
   |   | hadoop.yarn.server.resourcemanager.TestCapacitySchedulerMetrics |
   |   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestAbsoluteResourceWithAutoQueue
 |
   |   | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestLeafQueue |
   |   | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps |
   |   | 
hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesSchedulerActivitiesWithMultiNodesEnabled
 |
   |   | hadoop.yarn.server.resourcemanager.scheduler

[jira] [Assigned] (YARN-11639) ConcurrentModificationException and NPE in PriorityUtilizationQueueOrderingPolicy

2024-01-09 Thread Ferenc Erdelyi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenc Erdelyi reassigned YARN-11639:
-

Assignee: Ferenc Erdelyi

> ConcurrentModificationException and NPE in 
> PriorityUtilizationQueueOrderingPolicy
> -
>
> Key: YARN-11639
> URL: https://issues.apache.org/jira/browse/YARN-11639
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Ferenc Erdelyi
>Assignee: Ferenc Erdelyi
>Priority: Major
>
> When dynamic queue creation is enabled in weight mode and the queue deletion 
> policy runs concurrently with PriorityQueueResourcesForSorting, the RM stops 
> assigning resources because of either a ConcurrentModificationException or an 
> NPE in PriorityUtilizationQueueOrderingPolicy.
> The NPE issue was reproduced in Java 8 and Java 11 environments:
> {code:java}
> ... INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Removing queue: root.dyn.PmvkMgrEBQppu
> 2024-01-02 17:00:59,399 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[Thread-11,5,main] threw an Exception.
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy$PriorityQueueResourcesForSorting.<init>(PriorityUtilizationQueueOrderingPolicy.java:225)
>   at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
>   at 
> java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654)
>   at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
>   at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
>   at 
> java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
>   at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>   at 
> java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:260)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:1100)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:1124)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1724)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1659)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1816)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1562)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:558)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:605)
> {code}
> The ConcurrentModificationException was observed in a Java 8 environment, but 
> has not been reproduced yet:
> {code:java}
> 2023-10-27 02:50:37,584 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler:Thread 
> Thread[Thread-15,5, main] threw an Exception.
> java.util.ConcurrentModificationException
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1388)
> at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
> at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
> at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterat