[jira] [Updated] (YARN-11188) Only files belong to the first file controller are removed even if multiple log aggregation file controllers are configured

Shilun Fan (Jira) Sat, 27 Jan 2024 21:19:05 -0800


     [ 
https://issues.apache.org/jira/browse/YARN-11188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Shilun Fan updated YARN-11188:
------------------------------
     Target Version/s: 3.4.0
    Affects Version/s: 3.4.0

> Only files belong to the first file controller are removed even if multiple 
> log aggregation file controllers are configured
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-11188
>                 URL: https://issues.apache.org/jira/browse/YARN-11188
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: log-aggregation
>    Affects Versions: 3.4.0
>            Reporter: Szilard Nemeth
>            Assignee: Szilard Nemeth
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.0
>
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Log aggregation can be configured to have a comma-separated list of file 
> controllers.
> The current behaviour only removes files that belong to the first file 
> controller.
> This can be problematic. 
> For example, if a user configures IFile as the file controller and later 
> changes the configuration to specify multiple file controllers (e.g. 
> value = TFile,IFile), then only the first controller is considered and only 
> the files belonging to that controller are removed. In this case, files 
> written by the TFile controller are removed and the files created with the 
> IFile controller are kept.
> This behaviour should be changed so that the files of all configured file 
> controllers are removed if multiple file controllers are enabled.
> h2. CODE PATH
> ----
> 1. 
> [AggregatedLogDeletionService$LogDeletionTask#run|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L82-L108]:
>  
> Let's understand what this method does.
> 1.1 An important bit is to see how the value of the field called 
> 'retentionMillis' is set. In the constructor of LogDeletionTask, there's an 
> incoming parameter called 'retentionSecs' that is just multiplied by 1000 to 
> have a millisecond value.
> Let's see where 'retentionSecs' is coming from.
> 1.2 
> [AggregatedLogDeletionService#scheduleLogDeletionTask|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L258-L283]
>  that sets the value of retentionSecs.
> The config key for this value is 'yarn.log-aggregation.retain-seconds'.
> The javadoc says: "How long to wait before deleting aggregated logs, -1 
> disables. Be careful set this too small and you will spam the name node."
> 1.3 Going back to 
> [https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L82-L108],
>  the 'cutOffMillis' value is computed by getting the current time in millis 
> minus the retentionMillis.
> 1.4 The main point of this method is to iterate over the files in the remote 
> root log dir (field called 'remoteRootLogDir') and to check if it is a 
> directory. If so, a new Path is created with that particular directory ([code 
> link|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L90-L96]).
> One more important thing to mention: There's a field called 'suffix' that is 
> added to the remote root log dir path.
> Let's check how the 'remoteRootLogDir' and 'suffix' fields get their values, 
> as this is crucial to understanding how the log dirs are deleted.
> 1.5 remoteRootLogDir is set in the constructor of LogDeletionTask, 
> [here|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L77].
> The value is returned by calling fileController.getRemoteRootLogDir().
> The LogAggregationFileControllerFactory creates the instance of 
> LogAggregationFileController.
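
The retention arithmetic from steps 1.1 to 1.3 can be sketched as follows. This is a minimal illustration, not the Hadoop code itself: the class and method names are hypothetical, and only the "seconds times 1000" conversion and the "now minus retention" cutoff come from the description above.

```java
// Hypothetical helper, not part of Hadoop: mirrors the arithmetic of steps 1.1-1.3.
class RetentionCutoffSketch {

  /** The configured retention is in seconds; the task works with milliseconds. */
  static long toRetentionMillis(long retentionSecs) {
    return retentionSecs * 1000L;
  }

  /** Anything last modified before the returned instant is older than the retention period. */
  static long cutoffMillis(long nowMillis, long retentionMillis) {
    return nowMillis - retentionMillis;
  }

  public static void main(String[] args) {
    long retentionMillis = toRetentionMillis(120); // e.g. a 2-minute retention
    long cutoff = cutoffMillis(System.currentTimeMillis(), retentionMillis);
    System.out.println("cutoff timestamp (ms): " + cutoff);
  }
}
```
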
> ----
> *The process of determining the log aggregation file controller is quite 
> messy, let me describe this in detail.*
> *There are 2 types of file controllers: LogAggregationIndexedFileController 
> and LogAggregationTFileController*
> *There's a testcase called 
> [TestLogAggregationFileControllerFactory#testLogAggregationFileControllerFactory|#testLogAggregationFileControllerFactory]
>  that shows how the LogAggregationFileControllerFactory is configured.*
> 2.1 First, some important configs:
> 2.1.1 Generic config key for the log aggregation file controller class: 
> yarn.log-aggregation.file-controller.<controllerName>.class
> An example real-world config key: 
> yarn.log-aggregation.file-controller.IFile.class
> An example real-world config value: LogAggregationFileController.class
> 2.1.2 Generic config key for the log aggregation file controller's remote app 
> log dir: 
> yarn.log-aggregation.<controllerName>.remote-app-log-dir
> An example real-world config key: 
> yarn.log-aggregation.IFile.remote-app-log-dir
> An example real-world config value: /tmp/logs/IFile/
> 2.1.3 Generic config key for the log aggregation file controller's remote app 
> log dir suffix: 
> yarn.log-aggregation.<controllerName>.remote-app-log-dir-suffix
> An example real-world config key: 
> yarn.log-aggregation.IFile.remote-app-log-dir-suffix
> An example real-world config value: IFile
> 2.1.4 There's one more config called 'yarn.log-aggregation.file-formats', 
> that can store a comma separated list of file controllers.
> An example value: IFile,TFile
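
As a rough illustration of the key patterns listed in 2.1.1 to 2.1.4, the per-controller keys can be derived from the controller name. The helper below is hypothetical (it is not a Hadoop API); only the key patterns themselves are taken from the list above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Hadoop API: builds the per-controller config keys from 2.1.
class FileControllerKeysSketch {

  static String classKey(String controllerName) {
    return "yarn.log-aggregation.file-controller." + controllerName + ".class";
  }

  static String remoteDirKey(String controllerName) {
    return "yarn.log-aggregation." + controllerName + ".remote-app-log-dir";
  }

  static String suffixKey(String controllerName) {
    return "yarn.log-aggregation." + controllerName + ".remote-app-log-dir-suffix";
  }

  /** Splits a 'yarn.log-aggregation.file-formats' style value into controller names. */
  static List<String> parseFileFormats(String value) {
    List<String> names = new ArrayList<>();
    for (String part : value.split(",")) {
      String trimmed = part.trim();
      if (!trimmed.isEmpty()) {
        names.add(trimmed);
      }
    }
    return names;
  }
}
```
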
> 2.2 Let's examine how the [LogAggregationFileControllerFactory's 
> constructor|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L63-L80]
>  works.
> 2.2.1 There's [an 
> iteration|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L69]
>  over file controllers.
> 2.2.2 
> The remote app log dir per file controller is [read from the 
> config|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L196-L216]
> An example for a config key: yarn.log-aggregation.IFile.remote-app-log-dir
> An example real-world value of this config: /tmp/logs/IFile/
> 2.2.3 If the specified remote app log dir is null or empty, the remote dir 
> for the particular file controller falls back to the NM's log dir.
> The log dir is either specified by the config 
> 'yarn.nodemanager.remote-app-log-dir' or falls back to the default path 
> '/tmp/logs'.
> This logic is implemented 
> [here|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L208-L215]
> 2.2.4 Next, the remote app log dir suffix is read 
> [here|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L225-L232].
> Example config key: yarn.log-aggregation.IFile.remote-app-log-dir-suffix
> An example real-world config value: IFile
> If the suffix is null or empty, it is read from the config key 
> 'yarn.nodemanager.remote-app-log-dir-suffix'; if that is not specified 
> either, the default suffix will be 'logs'.
> 2.2.5 Now that we know the remoteDir (/tmp/logs/IFile/) and the suffix 
> (IFile), we just concatenate them and add a hyphen in between, so the final 
> value will be: target/app-logs/IFile/-IFile [TODO]
> 2.2.6 The rest of the method reads the log aggregation file controller's 
> class name and initializes the controller. This is implemented 
> [here|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L82-L95].
> An example config key for the class: 
> 'yarn.log-aggregation.file-controller.IFile.class'
> An example value of this config: 
> "org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController"
> 2.2.7 Next, the controller is created by creating a new instance of the class 
> with reflection.
> 2.2.8 An important bit is to [initialize the 
> controller|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L77]
> 2.2.9 The initialize method [is implemented in 
> LogAggregationFileController|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileController.java#L121-L140],
>  which is an abstract base class for the file controllers.
> 2.2.10 The remote root log dir + the suffix [is 
> read|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileController.java#L136-L137]
>  by the same config logic as described above.
> 2.2.11 As a final step, the controller instance is [added to the factory's 
> controllers 
> list|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L78]
> 2.2.3 Now we know how the LogAggregationFileControllerFactory works and how 
> it reads the config to create and store the File controller instances.
> Let's jump back to the constructor of 
> org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService.LogDeletionTask#LogDeletionTask.
> The file controller is determined by calling the 'getFileControllerForWrite' 
> method on the LogAggregationFileControllerFactory instance, 
> [here|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L75].
> 2.2.4 [The 
> method|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L128]
>  is quite simple: it just returns the first element of the list. So if 
> multiple log aggregation file controllers were instantiated during the 
> initialization (as per the config), the first instance is always returned 
> here.
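
The factory behaviour described in section 2 can be modelled with a small stand-in. This is a simplified sketch under stated assumptions, not the Hadoop implementation: a plain Map replaces Hadoop's Configuration, the class and method names are hypothetical, the "TFile" default for the file-formats key is an assumption of this sketch, and reflection-based instantiation is left out. It only shows the fallback chain for the remote dir and suffix, and that the "for write" lookup always returns the first configured controller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Simplified stand-in for the factory described above; all names are hypothetical.
class ControllerFactorySketch {

  /** Minimal stand-in for a file controller: a name, a remote root dir and a suffix. */
  static final class Controller {
    final String name;
    final String remoteRootLogDir;
    final String suffix;

    Controller(String name, String remoteRootLogDir, String suffix) {
      this.name = name;
      this.remoteRootLogDir = remoteRootLogDir;
      this.suffix = suffix;
    }
  }

  private final List<Controller> controllers = new ArrayList<>();

  ControllerFactorySketch(Map<String, String> conf) {
    // Assumption of this sketch: fall back to "TFile" when no formats are configured.
    String formats = conf.getOrDefault("yarn.log-aggregation.file-formats", "TFile");
    for (String raw : formats.split(",")) {
      String name = raw.trim();
      if (name.isEmpty()) {
        continue;
      }
      // Per-controller value first, then the NM-level key, then the default.
      String dir = firstNonEmpty(
          conf.get("yarn.log-aggregation." + name + ".remote-app-log-dir"),
          conf.get("yarn.nodemanager.remote-app-log-dir"),
          "/tmp/logs");
      String suffix = firstNonEmpty(
          conf.get("yarn.log-aggregation." + name + ".remote-app-log-dir-suffix"),
          conf.get("yarn.nodemanager.remote-app-log-dir-suffix"),
          "logs");
      controllers.add(new Controller(name, dir, suffix));
    }
  }

  private static String firstNonEmpty(String... candidates) {
    for (String candidate : candidates) {
      if (candidate != null && !candidate.isEmpty()) {
        return candidate;
      }
    }
    throw new IllegalArgumentException("no non-empty candidate");
  }

  /** Mirrors the behaviour described in the text: always the first configured controller. */
  Controller getFileControllerForWrite() {
    return controllers.get(0);
  }

  /** Hypothetical accessor exposing every configured controller. */
  List<Controller> getConfiguredControllers() {
    return controllers;
  }
}
```

With a config of file-formats = TFile,IFile, the sketch holds two controllers, but getFileControllerForWrite() only ever exposes the TFile one, which is the root of the problem described in this issue.
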
> ----
> *We need to jump back to steps 1.4 and 1.5, where the files are being listed 
> with the help of the abstract FileSystem implementation 
> [here|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L93-L97].*
> *So we know how the values for 'remoteRootLogDir' and 'suffix' are set as 
> described in detail above.*
> ----
> 1.6 Let's see what the deleteOldLogDirsFrom method does since this is the 
> main call of the loop that lists the log dirs.
> [The 
> method|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L110-L122]
>  is very simple: it accepts a Path as a parameter (which we know is a 
> directory), lists the dirs in this main directory, iterates over them and 
> [calls 
> deleteAppDirLogs|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L120].
> 1.7 The [deleteAppDirLogs 
> method|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L124-L165]
>  is quite messy again.
> 1.7.1 Parameters are: 
> cutOffMillis: The 'cutOffMillis' value is computed by getting the current 
> time in millis minus the retentionMillis that is coming from the 
> configuration.
> If it's set to 2 minutes, the calculated time will be NOW-2 minutes in 
> milliseconds.
> fs: The abstract FileSystem implementation
> rmClient: Not important for us right now
> appDir: The directory to clean up
> 1.7.2 The whole method only does anything useful if the directory's 
> modification time < cutOffMillis. In practice, this means that only dirs 
> last modified before the cutoff time (i.e. older than the retention 
> period) will be touched / deleted.
> 1.7.3 If the app is not terminated, we list the directory and try to remove 
> the log files. Only log files whose modification time is earlier than the 
> cutoff time are deleted.
> [This is the 
> logic|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L133-L152]
>  that implements this.
> 1.7.4 [The other part of the if 
> condition|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L152-L160]
>  tries to delete the log dir, but only if 'shouldDeleteLogDir' returns 
> true.
> 1.7.5 Let's check the method 
> [AggregatedLogDeletionService.LogDeletionTask#shouldDeleteLogDir|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L167-L182]:
>  
> This is basically the same logic as the other retention period based logic 
> that I described above.
> We set shouldDelete to true by default, then set it to false only if the 
> modification date of the dir itself is later than the timestamp that is 
> defined by the retention period.
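
The decision logic of steps 1.7.2 to 1.7.5 can be sketched without a real FileSystem. Everything below is a hypothetical model, not the Hadoop code: entries are plain (path, modification time) pairs, and the sketch assumes that shouldDeleteLogDir compares the modification times of the entries inside the directory against the cutoff, which is an assumption on top of the description above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, filesystem-free model of the decision logic in steps 1.7.2-1.7.5.
class AppDirCleanupSketch {

  /** A file or directory reduced to the two attributes the logic looks at. */
  static final class Entry {
    final String path;
    final long modificationTime;

    Entry(String path, long modificationTime) {
      this.path = path;
      this.modificationTime = modificationTime;
    }
  }

  /** Assumed check: true by default, false as soon as one entry is newer than the cutoff. */
  static boolean shouldDeleteLogDir(List<Entry> logFiles, long cutoffMillis) {
    boolean shouldDelete = true;
    for (Entry file : logFiles) {
      if (file.modificationTime >= cutoffMillis) {
        shouldDelete = false;
        break;
      }
    }
    return shouldDelete;
  }

  /** Returns the paths that would be deleted for one application directory. */
  static List<String> pathsToDelete(Entry appDir, List<Entry> logFiles,
      boolean appTerminated, long cutoffMillis) {
    List<String> result = new ArrayList<>();
    if (appDir.modificationTime >= cutoffMillis) {
      return result; // directory modified within the retention period: nothing to do
    }
    if (!appTerminated) {
      // App still running: remove only the individual files that are old enough.
      for (Entry file : logFiles) {
        if (file.modificationTime < cutoffMillis) {
          result.add(file.path);
        }
      }
    } else if (shouldDeleteLogDir(logFiles, cutoffMillis)) {
      // App finished and nothing recent inside: remove the whole directory.
      result.add(appDir.path);
    }
    return result;
  }
}
```
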
> ----
> h2. CONCLUSION
> *We just checked the implementation of how the log aggregation file 
> controllers are instantiated and configured.*
> *Just by reading the code + the logic, I think reading / parsing the 
> configuration is okay.*
> *What really bothers me is how the file controller instance is getting 
> created by the factory (step 2.2.3).*
> *If multiple log aggregation file controllers (TFile + IFile) are configured, 
> the 0th (first) item is always picked by the factory. This results in the 
> incorrect behaviour that only one controller's files are cleaned up.*
> *As the 
> [AggregatedLogDeletionService#scheduleLogDeletionTask|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L258-L277]
>  method creates the LogDeletionTask instance only once and schedules it at a 
> fixed rate with the help of a Timer, there's no distinction between log 
> aggregation file controllers at this level of abstraction, meaning that only 
> the LogAggregationFileControllerFactory could return different file 
> controllers.*
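
To make the conclusion concrete, the sketch below contrasts the behaviour described above (one deletion task built from the first controller) with one possible direction, namely building a task per configured controller. This is only an illustration under assumptions: the types and method names are hypothetical, and it does not claim to be the change that was eventually made for this issue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these types are Hadoop classes.
class PerControllerDeletionSketch {

  /** Minimal stand-in for a configured file controller. */
  static final class Controller {
    final String name;
    final String remoteRootLogDir;
    final String suffix;

    Controller(String name, String remoteRootLogDir, String suffix) {
      this.name = name;
      this.remoteRootLogDir = remoteRootLogDir;
      this.suffix = suffix;
    }
  }

  /** Minimal stand-in for a deletion task bound to one controller's directories. */
  static final class DeletionTask {
    final String remoteRootLogDir;
    final String suffix;
    final long retentionMillis;

    DeletionTask(Controller controller, long retentionSecs) {
      this.remoteRootLogDir = controller.remoteRootLogDir;
      this.suffix = controller.suffix;
      this.retentionMillis = retentionSecs * 1000L;
    }
  }

  /** Behaviour as described: a single task, built from the first controller only. */
  static List<DeletionTask> tasksForFirstControllerOnly(List<Controller> controllers,
      long retentionSecs) {
    List<DeletionTask> tasks = new ArrayList<>();
    tasks.add(new DeletionTask(controllers.get(0), retentionSecs));
    return tasks;
  }

  /** Possible alternative: one task per configured controller. */
  static List<DeletionTask> tasksForAllControllers(List<Controller> controllers,
      long retentionSecs) {
    List<DeletionTask> tasks = new ArrayList<>();
    for (Controller controller : controllers) {
      tasks.add(new DeletionTask(controller, retentionSecs));
    }
    return tasks;
  }
}
```
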



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
