[ https://issues.apache.org/jira/browse/YARN-11188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shilun Fan updated YARN-11188:
------------------------------
    Target Version/s: 3.4.0
    Affects Version/s: 3.4.0

> Only files belong to the first file controller are removed even if multiple log aggregation file controllers are configured
> ---------------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-11188
> URL: https://issues.apache.org/jira/browse/YARN-11188
> Project: Hadoop YARN
> Issue Type: Bug
> Components: log-aggregation
> Affects Versions: 3.4.0
> Reporter: Szilard Nemeth
> Assignee: Szilard Nemeth
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0
>
> Time Spent: 1h 40m
> Remaining Estimate: 0h
>
> Log aggregation can be configured to have a comma-separated list of file controllers.
> The current behaviour only removes files that belong to the first file controller.
> This can be problematic.
> For example, if a user configures IFile as the file controller and later changes the configuration to specify multiple file controllers (e.g. value = TFile,IFile), then only the first controller is considered and only the files belonging to that controller are removed. In this case, files written by the TFile controller are removed and the files created with the IFile controller are kept.
> This behaviour should be changed so that the files of all configured file controllers are removed when multiple file controllers are enabled.
> h2. CODE PATH
> ----
> 1. [AggregatedLogDeletionService$LogDeletionTask#run|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L82-L108]:
>
> Let's understand what this method does.
> 1.1 An important bit is to see how the value of the field called 'retentionMillis' is set.
> In the constructor of LogDeletionTask, there's an incoming parameter called 'retentionSecs' that is just multiplied by 1000 to have a millisecond value.
> Let's see where 'retentionSecs' is coming from.
> 1.2 [AggregatedLogDeletionService#scheduleLogDeletionTask|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L258-L283] is the method that sets the value of retentionSecs.
> The config key for this value is 'yarn.log-aggregation.retain-seconds'.
> The javadoc says: "How long to wait before deleting aggregated logs, -1 disables. Be careful set this too small and you will spam the name node."
> 1.3 Going back to [https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L82-L108], the 'cutOffMillis' value is computed as the current time in millis minus the retentionMillis.
> 1.4 The main point of this method is to iterate over the files in the remote root log dir (field called 'remoteRootLogDir') and to check whether each one is a directory. If so, a new Path is created with that particular directory ([code link|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L90-L96]).
> One more important thing to mention: there's a field called 'suffix' that is added to the remote root log dir path.
> Let's check how the 'remoteRootLogDir' and 'suffix' fields get their values, as this is crucial to understanding how the log dirs are deleted.
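The arithmetic from steps 1.1 and 1.3 can be sketched in isolation. This is a simplified model that uses only the JDK, not the Hadoop source; the class name RetentionCutoffSketch and the helper isOlderThanCutoff are made up for illustration.

```java
/** Simplified model of steps 1.1 and 1.3; not the Hadoop source. */
final class RetentionCutoffSketch {

  /** Step 1.1: retentionSecs is multiplied by 1000 to get milliseconds. */
  static long toRetentionMillis(long retentionSecs) {
    return retentionSecs * 1000L;
  }

  /** Step 1.3: cutoff = current time in millis minus the retention. */
  static long cutoffMillis(long nowMillis, long retentionMillis) {
    return nowMillis - retentionMillis;
  }

  /** Hypothetical helper: true if the modification time lies before the cutoff. */
  static boolean isOlderThanCutoff(long modifiedMillis, long cutoffMillis) {
    return modifiedMillis < cutoffMillis;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    // Retention of 120 seconds (2 minutes), chosen only as an example value.
    long cutoff = cutoffMillis(now, toRetentionMillis(120L));
    // Modified 5 minutes ago: older than the cutoff.
    System.out.println(isOlderThanCutoff(now - 300_000L, cutoff));
    // Modified 1 minute ago: still within the retention period.
    System.out.println(isOlderThanCutoff(now - 60_000L, cutoff));
  }
}
```

With a 2-minute retention, something modified 5 minutes ago is a deletion candidate, while something modified 1 minute ago is not.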
> 1.5 remoteRootLogDir is set in the constructor of LogDeletionTask, [here|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L77].
> The value is returned by calling fileController.getRemoteRootLogDir().
> The LogAggregationFileControllerFactory creates the instance of LogAggregationFileController.
> ----
> *The process of determining the log aggregation file controller is quite messy, let me describe this in detail.*
> *There are 2 types of file controllers: LogAggregationIndexedFileController and LogAggregationTFileController*
> *There's a testcase called [TestLogAggregationFileControllerFactory#testLogAggregationFileControllerFactory|#testLogAggregationFileControllerFactory] that shows how the LogAggregationFileControllerFactory is configured.*
> 2.1 First, some important configs:
> 2.1.1 Generic config key for the log aggregation file controller class: yarn.log-aggregation.file-controller.<controllerName>.class
> An example real-world config key: yarn.log-aggregation.file-controller.IFile.class
> An example real-world config value: LogAggregationFileController.class
> 2.1.2 Generic config key for the log aggregation file controller's remote app log dir: yarn.log-aggregation.<controllerName>.remote-app-log-dir
> An example real-world config key: yarn.log-aggregation.IFile.remote-app-log-dir
> An example real-world config value: /tmp/logs/IFile/
> 2.1.3 Generic config key for the log aggregation file controller's remote app log dir suffix: yarn.log-aggregation.<controllerName>.remote-app-log-dir-suffix
> An example real-world config key: yarn.log-aggregation.IFile.remote-app-log-dir-suffix
> An example real-world config value: IFile
> 2.1.4 There's one more config called 'yarn.log-aggregation.file-formats', that can store a comma separated list of file controllers.
> An example value: IFile,TFile
> 2.2 Let's examine how the [LogAggregationFileControllerFactory's constructor|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L63-L80] works.
> 2.2.1 There's [an iteration|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L69] over file controllers.
> 2.2.2 The remote app log dir per file controller is [read from the config|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L196-L216].
> An example for a config key: yarn.log-aggregation.IFile.remote-app-log-dir
> An example real-world value of this config: /tmp/logs/IFile/
> 2.2.3 If the specified remote app log dir is null or empty, the remote dir for the particular file controller falls back to the NM's log dir.
> The log dir is either specified by the config 'yarn.nodemanager.remote-app-log-dir' or falls back to the default path '/tmp/logs'.
> This logic is implemented [here|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L208-L215].
> 2.2.4 Next, the remote app log dir suffix is read [here|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L225-L232].
> Example config key: yarn.log-aggregation.IFile.remote-app-log-dir-suffix
> An example real-world config value: IFile
> If the suffix is null or empty, it is read from the config key 'yarn.nodemanager.remote-app-log-dir-suffix'; if that is not specified either, the default suffix is 'logs'.
> 2.2.5 Now that we know the remoteDir (/tmp/logs/IFile/) and the suffix (IFile), we just concatenate them and add a hyphen in between, so the final value will be: target/app-logs/IFile/-IFile [TODO]
> 2.2.6 The rest of the method reads the log aggregation file controller's class name and initializes the controller. This is implemented [here|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L82-L95].
> An example config key for the class: 'yarn.log-aggregation.file-controller.IFile.class'
> An example value of this config: "org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController"
> 2.2.7 Next, the controller is created by creating a new instance of the class with reflection.
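The fallback rules for the remote dir and the suffix (steps 2.2.2 to 2.2.4) can be sketched as follows. This is a simplified model, not the factory's actual code: a plain Map stands in for the Hadoop Configuration object, and the class and method names are hypothetical. The config keys and defaults are the ones quoted above.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model of the config fallbacks in steps 2.2.2-2.2.4; not the Hadoop source. */
final class RemoteLogDirResolutionSketch {

  static final String NM_REMOTE_DIR_KEY = "yarn.nodemanager.remote-app-log-dir";
  static final String NM_SUFFIX_KEY = "yarn.nodemanager.remote-app-log-dir-suffix";
  static final String DEFAULT_REMOTE_DIR = "/tmp/logs";
  static final String DEFAULT_SUFFIX = "logs";

  /** Returns the first candidate that is neither null nor empty. */
  static String firstNonEmpty(String... candidates) {
    for (String candidate : candidates) {
      if (candidate != null && !candidate.isEmpty()) {
        return candidate;
      }
    }
    throw new IllegalArgumentException("no non-empty candidate");
  }

  /** Controller-specific dir, then the NM-level dir, then the default path. */
  static String resolveRemoteDir(Map<String, String> conf, String controllerName) {
    return firstNonEmpty(
        conf.get("yarn.log-aggregation." + controllerName + ".remote-app-log-dir"),
        conf.get(NM_REMOTE_DIR_KEY),
        DEFAULT_REMOTE_DIR);
  }

  /** Controller-specific suffix, then the NM-level suffix, then the default suffix. */
  static String resolveSuffix(Map<String, String> conf, String controllerName) {
    return firstNonEmpty(
        conf.get("yarn.log-aggregation." + controllerName + ".remote-app-log-dir-suffix"),
        conf.get(NM_SUFFIX_KEY),
        DEFAULT_SUFFIX);
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put("yarn.log-aggregation.IFile.remote-app-log-dir", "/tmp/logs/IFile/");
    conf.put("yarn.log-aggregation.IFile.remote-app-log-dir-suffix", "IFile");
    // IFile has its own settings; TFile has none and falls back to the defaults.
    System.out.println(resolveRemoteDir(conf, "IFile") + " / " + resolveSuffix(conf, "IFile"));
    System.out.println(resolveRemoteDir(conf, "TFile") + " / " + resolveSuffix(conf, "TFile"));
  }
}
```

How the dir and the suffix are combined afterwards (step 2.2.5) is deliberately left out, since that step is still marked as a TODO above.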
> 2.2.8 An important bit is to [initialize the controller|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L77]
> 2.2.9 The initialize method [is implemented in LogAggregationFileController|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileController.java#L121-L140], which is an abstract base class for the file controllers.
> 2.2.10 The remote root log dir + the suffix [is read|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileController.java#L136-L137] by the same config logic as described above.
> 2.2.11 As a final step, the controller instance is [added to the factory's controllers list|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L78]
> 2.2.3 Now we know how the LogAggregationFileControllerFactory works and how it reads the config to create and store the File controller instances.
> Let's jump back to the constructor of org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService.LogDeletionTask#LogDeletionTask.
> The file controller is determined by calling the 'getFileControllerForWrite' method on the LogAggregationFileControllerFactory instance, [here|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L75].
> 2.2.4 [The method|https://github.com/apache/hadoop/blob/c9a174a260577f6c0ff6ef1594eea1cb19d63012/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L128] is quite simple: it just returns the first element from the list, so if multiple log aggregation file controllers were instantiated during the initialization (as per the config), the first instance is always returned here.
> ----
> *We need to jump back to steps 1.4 and 1.5, where the files are being listed with the help of the abstract FileSystem implementation [here|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L93-L97].*
> *So we know how the values for 'remoteRootLogDir' and 'suffix' are set as described in detail above.*
> ----
> 1.6 Let's see what the deleteOldLogDirsFrom method does since this is the main call of the loop that lists the log dirs.
> [The method|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L110-L122] is very simple: it accepts a Path as a parameter (which we know is a directory), lists the dirs in this main directory, iterates over them and [calls deleteAppDirLogs|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L120].
> 1.7 The [deleteAppDirLogs method|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L124-L165] is quite messy again.
> 1.7.1 Parameters are:
> cutOffMillis: The 'cutOffMillis' value is computed as the current time in millis minus the retentionMillis that is coming from the configuration. If the retention is set to 2 minutes, the calculated time will be NOW-2 minutes in milliseconds.
> fs: The abstract FileSystem implementation
> rmClient: Not important for us right now
> appDir: The directory to clean up
> 1.7.2 The whole method only does anything useful if the directory's modification time < cutOffMillis. What this means in practice is that only the dirs that were last modified before the retention cutoff will be touched / deleted.
> 1.7.3 If the app is not terminated, we list the directory and try to remove the log files. Only the log files whose modification time is earlier than the retention cutoff will be deleted.
> [This is the logic|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L133-L152] that implements this.
> 1.7.4 [The other part of the if condition|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L152-L160] tries to delete the log dir, but first checks that 'shouldDeleteLogDir' returns true.
> 1.7.5 Let's check the method [AggregatedLogDeletionService.LogDeletionTask#shouldDeleteLogDir|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L167-L182]:
>
> This is basically the same logic as the other retention period based logic that I described above.
> We set shouldDelete to true by default, then set it to false only if the modification date of the dir itself is later than the timestamp that is defined by the retention period.
> ----
> h2. CONCLUSION
> *We just checked the implementation of how the log aggregation file controllers are instantiated and configured.*
> *Just by reading the code + the logic, I think reading / parsing the configuration is okay.*
> *What really bothers me is how the file controller instance is getting created by the factory (step 2.2.3).*
> *If multiple log aggregation file controllers (TFile + IFile) are configured, the factory always picks the 0th (first) item.
> This results in the incorrect behaviour that only one controller's files are cleaned up.*
> *As the [AggregatedLogDeletionService#scheduleLogDeletionTask|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogDeletionService.java#L258-L277] method just creates the LogDeletionTask instance once and schedules it at a fixed rate with the help of a Timer, there's no distinction between log aggregation file controllers at this level of abstraction, meaning that only the LogAggregationFileControllerFactory could return different file controllers.*
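The problem described in the conclusion can be illustrated with a toy model. This is not the Hadoop code and not necessarily the fix that was applied for this issue: the Controller class, the method names and the /tmp/logs/TFile and /tmp/logs/IFile paths are all hypothetical. The sketch only contrasts "take the first controller" with one possible alternative, "visit every configured controller".

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the behaviour described in the conclusion; not the Hadoop source. */
final class FirstControllerOnlySketch {

  /** Minimal stand-in for a file controller: a name plus its remote root log dir. */
  static final class Controller {
    final String name;
    final String remoteRootLogDir;

    Controller(String name, String remoteRootLogDir) {
      this.name = name;
      this.remoteRootLogDir = remoteRootLogDir;
    }
  }

  /** Models the described getFileControllerForWrite: always the first list element. */
  static Controller controllerForWrite(List<Controller> configured) {
    return configured.get(0);
  }

  /** Behaviour as described: only the first controller's dir is cleaned. */
  static List<String> dirsCleanedFirstOnly(List<Controller> configured) {
    List<String> dirs = new ArrayList<>();
    dirs.add(controllerForWrite(configured).remoteRootLogDir);
    return dirs;
  }

  /** Hypothetical alternative: visit every configured controller's dir. */
  static List<String> dirsCleanedForAll(List<Controller> configured) {
    List<String> dirs = new ArrayList<>();
    for (Controller controller : configured) {
      dirs.add(controller.remoteRootLogDir);
    }
    return dirs;
  }

  public static void main(String[] args) {
    // Corresponds to a file-formats value of TFile,IFile.
    List<Controller> configured = new ArrayList<>();
    configured.add(new Controller("TFile", "/tmp/logs/TFile"));
    configured.add(new Controller("IFile", "/tmp/logs/IFile"));
    System.out.println(dirsCleanedFirstOnly(configured));
    System.out.println(dirsCleanedForAll(configured));
  }
}
```

With TFile listed first, the first method only ever yields the TFile directory, which matches the example from the description where the IFile files are kept.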