[ https://issues.apache.org/jira/browse/YARN-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598299#comment-14598299 ]
Varun Saxena commented on YARN-3793:
------------------------------------

[~kasha], I think I know what's happening.

When disks go bad (say, because a disk is full), uploading container logs runs into a problem. {{AppLogAggregatorImpl#doContainerLogAggregation}} considers only good log directories for log aggregation. As a result, {{AggregatedLogFormat#getPendingLogFilesToUploadForThisContainer}} returns no log files to upload.
The caller of {{doContainerLogAggregation}} is {{AppLogAggregatorImpl#uploadLogsForContainers}}, which, as shown below, calls {{DeletionService#delete}}. If {{uploadedFilePathsInThisCycle}} is empty *(which it will be if the disks are full)*, both the sub-directory and the base directories end up null. This explains the NPEs being thrown.
When these deletion tasks are stored in the state store, they are stored with nulls as well, which can explain why the problem also shows up on recovery.

{code}
boolean uploadedLogsInThisCycle = false;
for (ContainerId container : pendingContainerInThisCycle) {
  ContainerLogAggregator aggregator = null;
  if (containerLogAggregators.containsKey(container)) {
    aggregator = containerLogAggregators.get(container);
  } else {
    aggregator = new ContainerLogAggregator(container);
    containerLogAggregators.put(container, aggregator);
  }
  Set<Path> uploadedFilePathsInThisCycle =
      aggregator.doContainerLogAggregation(writer, appFinished);
  if (uploadedFilePathsInThisCycle.size() > 0) {
    uploadedLogsInThisCycle = true;
  }
  // Called even when the set is empty, i.e. with a null sub-directory
  // and no base directories.
  this.delService.delete(this.userUgi.getShortUserName(), null,
      uploadedFilePathsInThisCycle
          .toArray(new Path[uploadedFilePathsInThisCycle.size()]));
  ......
}
{code}

Log aggregation should consider full disks as well; otherwise there will be nothing to aggregate when the disks are full. Log aggregation leads to deletion of the local logs anyway.

I verified the occurrence of this issue via {{TestLogAggregationService#testLocalFileDeletionAfterUpload}} by making the good log directories return nothing.
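To make the empty-set case concrete, here is a minimal, self-contained sketch of one possible guard: skip the deletion call when nothing was uploaded in the cycle. This is only an illustration under stated assumptions, not the actual patch. {{RecordingDeletionService}} and {{deleteUploaded}} are hypothetical names, and {{java.nio.file.Path}} stands in for Hadoop's {{Path}} so that the example runs without Hadoop on the classpath.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch only: none of these types are the real Hadoop classes. */
public class EmptyUploadGuardSketch {

    /** Hypothetical stand-in for DeletionService; it only records requests. */
    static class RecordingDeletionService {
        final List<Path[]> requests = new ArrayList<>();

        void delete(String user, Path subDir, Path... baseDirs) {
            requests.add(baseDirs);
        }
    }

    /**
     * Schedules a deletion only when at least one file was uploaded in this
     * cycle. Returns true if a deletion request was made.
     */
    static boolean deleteUploaded(RecordingDeletionService delService,
                                  String user, Set<Path> uploaded) {
        if (uploaded == null || uploaded.isEmpty()) {
            // Nothing was uploaded (e.g. no good log dirs), so there is
            // nothing to delete and no task with empty targets is created.
            return false;
        }
        delService.delete(user, null, uploaded.toArray(new Path[0]));
        return true;
    }

    public static void main(String[] args) {
        RecordingDeletionService delService = new RecordingDeletionService();

        // Case 1: empty set, as when all log dirs are on full disks.
        Set<Path> none = Collections.emptySet();
        System.out.println("scheduled for empty set: "
            + deleteUploaded(delService, "alice", none));

        // Case 2: one uploaded file, deletion is requested as before.
        Set<Path> one = new LinkedHashSet<>();
        one.add(Paths.get("/tmp/logs/container_1/stdout"));
        System.out.println("scheduled for one file: "
            + deleteUploaded(delService, "alice", one));

        System.out.println("requests recorded: " + delService.requests.size());
    }
}
```

A guard like this would only avoid the null deletion task; it does not address the other half of the problem, namely that logs on full disks are never aggregated.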
> Several NPEs when deleting local files on NM recovery
> -----------------------------------------------------
>
>                 Key: YARN-3793
>                 URL: https://issues.apache.org/jira/browse/YARN-3793
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>   Affects Versions: 2.6.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>
> When NM work-preserving restart is enabled, we see several NPEs on recovery.
> These seem to correspond to sub-directories that need to be deleted. I wonder
> if null pointers here mean incorrect tracking of these resources and a
> potential leak. This JIRA is to investigate and fix anything required.
> Logs show:
> {noformat}
> 2015-05-18 07:06:10,225 INFO
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting
> absolute path : null
> 2015-05-18 07:06:10,224 ERROR
> org.apache.hadoop.yarn.server.nodemanager.DeletionService: Exception during
> execution of task in DeletionService
> java.lang.NullPointerException
>         at
> org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:274)
>         at org.apache.hadoop.fs.FileContext.delete(FileContext.java:755)
>         at
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.deleteAsUser(DefaultContainerExecutor.java:458)
>         at
> org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:293)
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)