Vihang Karajgaonkar created HIVE-21040:
------------------------------------------

             Summary: msck does unnecessary file listing at last level of partitions
                 Key: HIVE-21040
                 URL: https://issues.apache.org/jira/browse/HIVE-21040
             Project: Hive
          Issue Type: Improvement
            Reporter: Vihang Karajgaonkar
            Assignee: Vihang Karajgaonkar


Here is the code snippet that {{msck}} runs to list directories:

{noformat}
final Path currentPath = pd.p;
final int currentDepth = pd.depth;
FileStatus[] fileStatuses = fs.listStatus(currentPath, FileUtils.HIDDEN_FILES_PATH_FILTER);
// found no files under a sub-directory under table base path; it is possible that the table
// is empty and hence there are no partition sub-directories created under base path
if (fileStatuses.length == 0 && currentDepth > 0 && currentDepth < partColNames.size()) {
  // since maxDepth is not yet reached, we are missing partition
  // columns in currentPath
  logOrThrowExceptionWithMsg(
      "MSCK is missing partition columns under " + currentPath.toString());
} else {
  // found files under currentPath add them to the queue if it is a directory
  for (FileStatus fileStatus : fileStatuses) {
    if (!fileStatus.isDirectory() && currentDepth < partColNames.size()) {
      // found a file at depth which is less than number of partition keys
      logOrThrowExceptionWithMsg(
          "MSCK finds a file rather than a directory when it searches for "
              + fileStatus.getPath().toString());
    } else if (fileStatus.isDirectory() && currentDepth < partColNames.size()) {
      // found a sub-directory at a depth less than number of partition keys
      // validate if the partition directory name matches with the corresponding
      // partition colName at currentDepth
      Path nextPath = fileStatus.getPath();
      String[] parts = nextPath.getName().split("=");
      if (parts.length != 2) {
        logOrThrowExceptionWithMsg("Invalid partition name " + nextPath);
      } else if (!parts[0].equalsIgnoreCase(partColNames.get(currentDepth))) {
        logOrThrowExceptionWithMsg(
            "Unexpected partition key " + parts[0] + " found at " + nextPath);
      } else {
        // add sub-directory to the work queue if maxDepth is not yet reached
        pendingPaths.add(new PathDepthInfo(nextPath, currentDepth + 1));
      }
    }
  }
  if (currentDepth == partColNames.size()) {
    return currentPath;
  }
}
{noformat}

You can see that when {{currentDepth}} has reached {{maxDepth}} (that is, 
{{partColNames.size()}}), the code still performs an unnecessary listing of the 
files. We can avoid this call by checking {{currentDepth}} first and bailing out 
early.

This can significantly improve the performance of the {{msck}} command, 
especially when each partition contains a lot of files on remote filesystems 
like S3 or ADLS.
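
To illustrate the early bail-out described above, here is a minimal, self-contained sketch. It is not the Hive code and not the actual patch: {{FakeFs}}, {{PathDepth}} and {{findPartitions}} are hypothetical names, and an in-memory map stands in for {{fs.listStatus}} so that listing calls can be counted. The partition-name validation and the file-versus-directory checks from the snippet are left out for brevity.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for the msck directory walk; not Hive code.
class MsckDepthSketch {

    /** Hypothetical counterpart of PathDepthInfo: a path plus its depth below the table root. */
    static final class PathDepth {
        final String path;
        final int depth;

        PathDepth(String path, int depth) {
            this.path = path;
            this.depth = depth;
        }
    }

    /** In-memory stand-in for a file system that counts how often a directory is listed. */
    static final class FakeFs {
        final Map<String, List<String>> children = new HashMap<>();
        int listCalls = 0;

        List<String> list(String dir) {
            listCalls++;
            List<String> result = children.get(dir);
            return result == null ? Collections.<String>emptyList() : result;
        }
    }

    /** Walks the tree breadth-first and returns every path found at maxDepth. */
    static List<String> findPartitions(FakeFs fs, String root, int maxDepth) {
        List<String> partitions = new ArrayList<>();
        Deque<PathDepth> pending = new ArrayDeque<>();
        pending.add(new PathDepth(root, 0));
        while (!pending.isEmpty()) {
            PathDepth pd = pending.poll();
            if (pd.depth == maxDepth) {
                // Early bail-out: this is already a full partition path,
                // so its (possibly very large) contents are never listed.
                partitions.add(pd.path);
                continue;
            }
            // Only directories above the last partition level are listed.
            for (String child : fs.list(pd.path)) {
                pending.add(new PathDepth(child, pd.depth + 1));
            }
        }
        return partitions;
    }

    public static void main(String[] args) {
        FakeFs fs = new FakeFs();
        fs.children.put("/t", Arrays.asList("/t/a=1", "/t/a=2"));
        fs.children.put("/t/a=1", Arrays.asList("/t/a=1/b=x"));
        fs.children.put("/t/a=2", Arrays.asList("/t/a=2/b=y"));
        List<String> parts = findPartitions(fs, "/t", 2);
        System.out.println(parts + " found with " + fs.listCalls + " listing calls");
    }
}
```

With two partition columns and two leaf partitions, the sketch issues three listing calls (the table root and the two first-level directories) and none for the leaf directories. In the snippet above, a path is only queued after its parent's listing has shown it to be a directory, so skipping the listing at the last level should not lose that check.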



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
