Alex Behm has posted comments on this change. Change subject: IMPALA-4172/IMPALA-3653: Improvements to block metadata loading ......................................................................
Patch Set 11: (40 comments) http://gerrit.cloudera.org:8080/#/c/5148/11/fe/src/main/java/org/apache/impala/catalog/DiskIdMapper.java File fe/src/main/java/org/apache/impala/catalog/DiskIdMapper.java: Line 29: * ids. This is internally implemented as a static object to share the storageIDs across Several sentences Starting with "This", sometimes not clear what "This" references. Would be nice to list the reasons in bullet point, something like this. The storage UUID to disk id is global (shared across tables) for the following reasons: 1. Mention consistent scan range to disk thread assignment 2. Mention memory benefit 3. any more? Line 44: // sequential 0-based integer disk id used by the hdfs scanners. This assumes that used by the BE scanners. Line 46: private ConcurrentHashMap<String, Integer> storageIdToInt = storageUuidToDiskId Line 50: // with the corresponding latest 0-based integer ID. Given a storage UUID, we first with -> to Line 51: // look in 'storageIdtoInt' map if it already exists, else we generate a new ID for no need for this second sentence, that's what we do in getDiskId() Line 64: if (Strings.isNullOrEmpty(storageUuid)) return diskId; let's move this check to the caller and require the storageUuid to be non-null and non-empty in here Line 68: // across the cluster, it is possible that we have a good hit rate. it is possible -> we expect to have a good hit rate. Line 72: // At this point, it could be possible due to race, that storageId might've been // Mapping might have been added by another thread that entered the synchronized block first. Line 76: // No mapping exists, create a new disk ID for 'storageId' stoprageUuid Line 80: // Host hasn't been populated yet. Start with a 0-based id. // First disk id of this host. http://gerrit.cloudera.org:8080/#/c/5148/11/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: Line 289: * given directory 'dirPath'. Only blocks corresponding to paths from 'partsByPath' Mention that the block metadata is effectively dropped and re-created. Please provide a brief summary of the steps involved in this function. something like: 1. Clear block metadata of partitions 2. Call HDFS to list all files 3. etc. Line 307: RemoteIterator<LocatedFileStatus> fileStatusIter = fs.listFiles(dirPath, true); move right before the while loop Line 308: // Reset the current state of partitions under dirPath from perPartitionFileDescMap_ Reset -> Clear Line 310: for (Path partDir: partsByPath.keySet()) { Iterate over the entrySet(), you are using an unnecessary get(partDir) below. Line 321: String fileName = fileStatus.getPath().getName(); move closer to where it's used, i.e. right above L336 Line 322: if (!FileSystemUtil.isValidPartitionFile(fileStatus)) continue; isValidDataFile() Line 323: // At this point, file status points to a valid file under the directory. Now we // Find the partition that this file belongs (if any). Line 325: Path partPathDir = fileStatus.getPath().getParent(); move closer to where it's used, i.e. L357 Line 330: // Skip if this file doesn't correspond to any of the existing partitions. Skip if this file does not belong to any known partition. Line 332: LOG.debug("File " + fileStatus.getPath().toString() + " doesn't correspond " + add if to check log level Line 333: " to a valid partition. Skipping metadata load for this file."); to a known partition Line 361: Preconditions.checkState(perPartitionFileDescMap_.containsKey(partPathDirName)); avoid duplicate lookup with get() + Preconditions.checkNotNull() Line 363: // Update the partitions metadata corresponding to this file descriptor partitions' metadata that this file belongs to. Line 371: LOG.warn("Unknown disk id count for filesystem " + fs + ":" + unknownDiskIdCount); add if to check log level Line 374: throw new RuntimeException("Error loading block md for directory " block metadata Line 380: * Loads the disk IDs for BlockLocation 'location' and its corresponding file block. Comment should mention the storage UUID to disk id mapping. Line 386: String[] hosts; Do you know if these are the unique hosts? I'm wondering if storageIds and hosts are guaranteed to have the same length. Line 394: for (int i =0; i < storageIds.length; ++i) { int i = 0; (space after "=") Line 410: for (Path partPath: partsByPath.keySet()) { iterate over Map.Entry, you are doing a get() in the next line Line 412: // For each valid file in the partPath, synthesize the block metadata. For each data file in the partPath... Line 427: // We assume all the partitions corresponding to the same partition path, have For the purpose of synthesizing block metadata, we assume that all partitions with the same location have the same file format. Line 447: perPartitionFileDescMap_.put(partition.getLocation(), fileDescMap); move put() inside the if (fileDescMap == null) block Line 673: * Create HdfsPartition objects corresponding to 'msPartitions' and add them to this Briefly list the high-level steps, in particular that we first create the partition objects based on the HMS partitions, and then in a second step we populate the partitions' file metadata Line 742: // Check if the partition directory is a child dir of tbl directory location. remove, comment only states what the code does Line 752: // Load the block metadata remove comment (doesn't add anything) Line 764: throw new CatalogException("Error loading block location md for partition " + md -> metadata Line 827: public HdfsPartition createPartition(StorageDescriptor storageDescriptor, change this to createAndLoadPartition()? then we can get rid of the flag in createPartition() and this function becomes createPartition() + loadMetadataAndDiskIds() http://gerrit.cloudera.org:8080/#/c/5148/11/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java: Line 276: * Returns true if the file corresponding to 'fileStatus' is a valid file as per is a valid data file Line 278: * directory or a hidden file starting with . or _ or if it is a LZO index file. hidden file is enough, no need to mention specific prefixes Line 291: public static boolean supportsBlockLocationApi(FileSystem fs) { supportsBlockLocations()? The API part is confusing because we are not really referencing a specific API. If you prefer to reference the API, then let's reference a specific method. -- To view, visit http://gerrit.cloudera.org:8080/5148 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie127658172e6e70dae441374530674a4ac9d5d26 Gerrit-PatchSet: 11 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Bharath Vissapragada <bhara...@cloudera.com> Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com> Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com> Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com> Gerrit-HasComments: Yes