[ 
https://issues.apache.org/jira/browse/HIVE-27135?focusedWorklogId=853745&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-853745
 ]

ASF GitHub Bot logged work on HIVE-27135:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 29/Mar/23 17:03
            Start Date: 29/Mar/23 17:03
    Worklog Time Spent: 10m 
      Work Description: mdayakar commented on code in PR #4114:
URL: https://github.com/apache/hive/pull/4114#discussion_r1152246387


##########
ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java:
##########
@@ -1538,32 +1538,36 @@ private static HdfsDirSnapshot addToSnapshot(Map<Path, HdfsDirSnapshot> dirToSna
   public static Map<Path, HdfsDirSnapshot> getHdfsDirSnapshots(final FileSystem fs, final Path path)
       throws IOException {
     Map<Path, HdfsDirSnapshot> dirToSnapshots = new HashMap<>();
-    RemoteIterator<LocatedFileStatus> itr = FileUtils.listFiles(fs, path, true, acidHiddenFileFilter);
-    while (itr.hasNext()) {
-      FileStatus fStatus = itr.next();
-      Path fPath = fStatus.getPath();
-      if (fStatus.isDirectory() && acidTempDirFilter.accept(fPath)) {
-        addToSnapshot(dirToSnapshots, fPath);
-      } else {
-        Path parentDirPath = fPath.getParent();
-        if (acidTempDirFilter.accept(parentDirPath)) {
-          while (isChildOfDelta(parentDirPath, path)) {
-            // Some cases there are other directory layers between the delta and the datafiles
-            // (export-import mm table, insert with union all to mm table, skewed tables).
-            // But it does not matter for the AcidState, we just need the deltas and the data files
-            // So build the snapshot with the files inside the delta directory
-            parentDirPath = parentDirPath.getParent();
-          }
-          HdfsDirSnapshot dirSnapshot = addToSnapshot(dirToSnapshots, parentDirPath);
-          // We're not filtering out the metadata file and acid format file,
-          // as they represent parts of a valid snapshot
-          // We're not using the cached values downstream, but we can potentially optimize more in a follow-up task
-          if (fStatus.getPath().toString().contains(MetaDataFile.METADATA_FILE)) {
-            dirSnapshot.addMetadataFile(fStatus);
-          } else if (fStatus.getPath().toString().contains(OrcAcidVersion.ACID_FORMAT)) {
-            dirSnapshot.addOrcAcidFormatFile(fStatus);
-          } else {
-            dirSnapshot.addFile(fStatus);
+    Deque<RemoteIterator<LocatedFileStatus>> stack = new ArrayDeque<>();
+    stack.push(FileUtils.listLocatedStatusIterator(fs, path, acidHiddenFileFilter));
+    while (!stack.isEmpty()) {
+      RemoteIterator<LocatedFileStatus> itr = stack.pop();
+      while (itr.hasNext()) {
+        FileStatus fStatus = itr.next();
+        Path fPath = fStatus.getPath();
+        if (fStatus.isDirectory()) {
+          stack.push(FileUtils.listLocatedStatusIterator(fs, fPath, acidHiddenFileFilter));

Review Comment:
   Here it will not list empty directories; the corresponding if condition was effectively obsolete in the old code as well. Tested with both the old and the modified code: neither adds empty directories to the snapshot.
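
For reference, a minimal, self-contained sketch of the iterative listing pattern used in the hunk above. This is a sketch under assumptions, not the PR's code: it uses the stock Hadoop FileSystem#listLocatedStatus in place of Hive's FileUtils.listLocatedStatusIterator (so no path filter is applied), and the class and method names are illustrative only.

{code:java}
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class IterativeListingSketch {
  // Depth-first walk driven by an explicit stack of per-directory iterators,
  // mirroring the shape of the new getHdfsDirSnapshots() loop.
  public static void walk(FileSystem fs, Path root) throws IOException {
    Deque<RemoteIterator<LocatedFileStatus>> stack = new ArrayDeque<>();
    stack.push(fs.listLocatedStatus(root));
    while (!stack.isEmpty()) {
      RemoteIterator<LocatedFileStatus> itr = stack.pop();
      while (itr.hasNext()) {
        LocatedFileStatus status = itr.next();
        if (status.isDirectory()) {
          // Directories are only descended into, never recorded themselves,
          // so an empty directory contributes no entries; this matches the
          // observation above that neither version adds empty directories.
          stack.push(fs.listLocatedStatus(status.getPath()));
        } else {
          System.out.println("file: " + status.getPath());
        }
      }
    }
  }
}
{code}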

Issue Time Tracking
-------------------

    Worklog Id:     (was: 853745)
    Time Spent: 6.5h  (was: 6h 20m)

> AcidUtils#getHdfsDirSnapshots() throws FNFE when a directory is removed in HDFS
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-27135
>                 URL: https://issues.apache.org/jira/browse/HIVE-27135
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Dayakar M
>            Assignee: Dayakar M
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> AcidUtils#getHdfsDirSnapshots() throws a FileNotFoundException when a directory is removed in HDFS while the HDFS snapshots are being fetched.
> The test code below can be used to reproduce this issue.
> {code:java}
>   @Test
>   public void testShouldNotThrowFNFEWhenHiveStagingDirectoryIsRemovedWhileFetchingHDFSSnapshots()
>       throws Exception {
>     MockFileSystem fs = new MockFileSystem(new HiveConf(),
>         new MockFile("mock:/tbl/part1/.hive-staging_dir/-ext-10002", 500, new byte[0]),
>         new MockFile("mock:/tbl/part2/.hive-staging_dir", 500, new byte[0]),
>         new MockFile("mock:/tbl/part1/_tmp_space.db", 500, new byte[0]),
>         new MockFile("mock:/tbl/part1/delta_1_1/bucket-0000-0000", 500, new byte[0]));
>     Path path = new MockPath(fs, "/tbl");
>     Path stageDir = new MockPath(fs, "mock:/tbl/part1/.hive-staging_dir");
>     FileSystem mockFs = spy(fs);
>     Mockito.doThrow(new FileNotFoundException("")).when(mockFs).listLocatedStatus(eq(stageDir));
>     try {
>       Map<Path, AcidUtils.HdfsDirSnapshot> hdfsDirSnapshots = AcidUtils.getHdfsDirSnapshots(mockFs, path);
>       Assert.assertEquals(1, hdfsDirSnapshots.size());
>     } catch (FileNotFoundException fnf) {
>       fail("Should not throw FileNotFoundException when a directory is removed while fetching HDFSSnapshots");
>     }
>   }
> {code}
> This issue was partially fixed by HIVE-26481, but it is not fixed completely.
> [Here|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java#L1541]
> the FileUtils.listFiles() API is used, which returns a RemoteIterator<LocatedFileStatus>.
> While iterating, whenever the recursive listing reaches a directory it tries to list the
> files inside it; if that directory has been removed by another thread/task in the meantime,
> a FileNotFoundException is thrown. In this case the removed directory is a .hive-staging
> directory, which should have been excluded by the passed filter.
>  
> So here we can use the same logic as in the
> _org.apache.hadoop.hive.ql.io.AcidUtils#getHdfsDirSnapshotsForCleaner()_ API to avoid the
> FileNotFoundException (a sketch of that pattern follows below).
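
For illustration, here is a minimal, self-contained sketch of that non-recursive, FNFE-tolerant listing pattern. This is a sketch under assumptions, not Hive's actual getHdfsDirSnapshotsForCleaner() implementation: it uses the stock Hadoop FileSystem#listLocatedStatus API, and the class name, method name, and filter parameter are illustrative only.

{code:java}
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.fs.RemoteIterator;

public class FnfeTolerantListingSketch {
  // Lists files level by level. Paths rejected by the filter (e.g. .hive-staging
  // directories) are never descended into, and a directory deleted between
  // discovery and listing is skipped instead of failing the whole walk.
  public static List<LocatedFileStatus> listFiles(FileSystem fs, Path root, PathFilter filter)
      throws IOException {
    List<LocatedFileStatus> files = new ArrayList<>();
    Deque<Path> dirs = new ArrayDeque<>();
    dirs.push(root);
    while (!dirs.isEmpty()) {
      Path dir = dirs.pop();
      RemoteIterator<LocatedFileStatus> itr;
      try {
        itr = fs.listLocatedStatus(dir);
      } catch (FileNotFoundException e) {
        // The directory was removed by another thread/task after discovery:
        // skip it rather than propagating the exception.
        continue;
      }
      while (itr.hasNext()) {
        LocatedFileStatus status = itr.next();
        if (!filter.accept(status.getPath())) {
          continue; // hidden/staging paths are excluded before we descend
        }
        if (status.isDirectory()) {
          dirs.push(status.getPath());
        } else {
          files.add(status);
        }
      }
    }
    return files;
  }
}
{code}

Note that with some FileSystem implementations hasNext()/next() can also throw a FileNotFoundException if the directory vanishes mid-iteration (the listing may be fetched in batches), so a complete fix would need to guard those calls as well.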



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
