[
https://issues.apache.org/jira/browse/YARN-11277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17636460#comment-17636460
]
ASF GitHub Bot commented on YARN-11277:
---------------------------------------
aajisaka commented on code in PR #4797:
URL: https://github.com/apache/hadoop/pull/4797#discussion_r1027616899
##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/loghandler/NonAggregatingLogHandler.java:
##########
@@ -149,47 +156,52 @@ private void recover() throws IOException {
@Override
public void handle(LogHandlerEvent event) {
switch (event.getType()) {
- case APPLICATION_STARTED:
- LogHandlerAppStartedEvent appStartedEvent =
- (LogHandlerAppStartedEvent) event;
- this.appOwners.put(appStartedEvent.getApplicationId(),
- appStartedEvent.getUser());
- this.dispatcher.getEventHandler().handle(
- new ApplicationEvent(appStartedEvent.getApplicationId(),
- ApplicationEventType.APPLICATION_LOG_HANDLING_INITED));
+ case APPLICATION_STARTED:
+ LogHandlerAppStartedEvent appStartedEvent =
+ (LogHandlerAppStartedEvent) event;
+ this.appOwners.put(appStartedEvent.getApplicationId(),
+ appStartedEvent.getUser());
+ this.dispatcher.getEventHandler().handle(
+ new ApplicationEvent(appStartedEvent.getApplicationId(),
+ ApplicationEventType.APPLICATION_LOG_HANDLING_INITED));
+ break;
+ case CONTAINER_FINISHED:
+ // Ignore
+ break;
+ case APPLICATION_FINISHED:
+ LogHandlerAppFinishedEvent appFinishedEvent =
+ (LogHandlerAppFinishedEvent) event;
+ ApplicationId appId = appFinishedEvent.getApplicationId();
+ String user = appOwners.remove(appId);
+ if (user == null) {
+ LOG.error("Unable to locate user for " + appId);
+ // send LOG_HANDLING_FAILED out
+ NonAggregatingLogHandler.this.dispatcher.getEventHandler().handle(
+ new ApplicationEvent(appId,
+ ApplicationEventType.APPLICATION_LOG_HANDLING_FAILED));
break;
- case CONTAINER_FINISHED:
- // Ignore
- break;
- case APPLICATION_FINISHED:
- LogHandlerAppFinishedEvent appFinishedEvent =
- (LogHandlerAppFinishedEvent) event;
- ApplicationId appId = appFinishedEvent.getApplicationId();
+ }
+ LogDeleterRunnable logDeleter = new LogDeleterRunnable(user, appId);
+ long appLogSize = calculateSizeOfAppLogs(user, appId);
+ long deletionTimestamp = System.currentTimeMillis()
+ + this.deleteDelaySeconds * 1000;
+ LogDeleterProto deleterProto = LogDeleterProto.newBuilder()
+ .setUser(user)
+ .setDeletionTime(deletionTimestamp)
+ .build();
+ try {
+ stateStore.storeLogDeleter(appId, deleterProto);
+ } catch (IOException e) {
+ LOG.error("Unable to record log deleter state", e);
+ }
+ // delete no delay if log size exceed deleteThreshold
+ if (enableTriggerDeleteBySize && appLogSize >= deleteThreshold) {
Review Comment:
Hi @leixm, thank you for the update.
1. Can we calculate the size of the application log directory only when the
feature is enabled?
2. Can we use `sched.schedule(logDeleter, 0, TimeUnit.SECONDS);` to delete
the files in the background?
The code would look something like this:
```java
try {
  boolean logDeleterStarted = false;
  if (enableTriggerDeleteBySize) {
    final long appLogSize = calculateSizeOfAppLogs(user, appId);
    if (appLogSize >= threshold) {
      ...
      sched.schedule(logDeleter, 0, TimeUnit.SECONDS);
      logDeleterStarted = true;
    }
  }
  if (!logDeleterStarted) {
    sched.schedule(logDeleter, this.deleteDelaySeconds, TimeUnit.SECONDS);
  }
} catch (RejectedExecutionException e) {
  logDeleter.run();
}
```
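For reference, the same control flow can be exercised outside the NodeManager. The following is a minimal, self-contained sketch, not the actual `NonAggregatingLogHandler` code: the class `SizeTriggeredDeletionSketch`, the `Outcome` enum, and the `LongSupplier` standing in for `calculateSizeOfAppLogs` are hypothetical names introduced only for illustration. It assumes the scheduler is a plain `ScheduledThreadPoolExecutor`.

```java
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** Illustrative sketch only; all names here are hypothetical. */
public class SizeTriggeredDeletionSketch {

  /** Which path was taken, returned so callers and tests can observe it. */
  public enum Outcome { IMMEDIATE, DELAYED, RAN_INLINE }

  public static Outcome scheduleDeletion(
      ScheduledThreadPoolExecutor sched,
      Runnable logDeleter,
      boolean enableTriggerDeleteBySize,
      LongSupplier appLogSize,        // stand-in for calculateSizeOfAppLogs
      long deleteThreshold,
      long deleteDelaySeconds) {
    try {
      // Short-circuit evaluation: the (potentially expensive) size
      // calculation runs only when the feature is enabled.
      if (enableTriggerDeleteBySize
          && appLogSize.getAsLong() >= deleteThreshold) {
        // Zero delay still runs on the scheduler thread, i.e. in background.
        sched.schedule(logDeleter, 0, TimeUnit.SECONDS);
        return Outcome.IMMEDIATE;
      }
      sched.schedule(logDeleter, deleteDelaySeconds, TimeUnit.SECONDS);
      return Outcome.DELAYED;
    } catch (RejectedExecutionException e) {
      // The scheduler refused the task (e.g. it was shut down),
      // so fall back to deleting on the calling thread.
      logDeleter.run();
      return Outcome.RAN_INLINE;
    }
  }
}
```

Passing the size calculation as a supplier keeps point 1 explicit: with the feature disabled, the supplier is never invoked.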
> trigger deletion of log-dir by size for NonAggregatingLogHandler
> ----------------------------------------------------------------
>
> Key: YARN-11277
> URL: https://issues.apache.org/jira/browse/YARN-11277
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: nodemanager
> Affects Versions: 3.4.0
> Reporter: Xianming Lei
> Priority: Minor
> Labels: pull-request-available
>
> In our YARN cluster, the log files of some containers grow too large, which
> causes the NodeManager to frequently switch to the unhealthy state. For logs
> that are too large, we can consider deleting them immediately instead of
> waiting for yarn.nodemanager.log.retain-seconds to elapse.
> Cluster environment:
> # 8k+ nodes
> # 500k+ apps / day
> Configuration:
> # yarn.nodemanager.log.retain-seconds=3days
> # yarn.log-aggregation-enable=false
>
>