Alex Behm has posted comments on this change.

Change subject: IMPALA-4172/IMPALA-3653: Improvements to block metadata loading
......................................................................
Patch Set 4:

(29 comments)

http://gerrit.cloudera.org:8080/#/c/5148/4/fe/src/main/java/org/apache/impala/catalog/DiskIdMapper.java
File fe/src/main/java/org/apache/impala/catalog/DiskIdMapper.java:

Line 37: private static DiskIdMapper diskIdMapper = null;
why not: public static DiskIdMapper INSTANCE = new DiskIdMapper()?

Line 50: private static HashMap<String, Integer> storageIdToInt = Maps.newHashMap();
make members and methods non-static since we are using a singleton

Line 63: public static int getDiskId(String host, String storageId) {
storageUuid

Line 72: if (storageIdToInt.containsKey(storageId)) {
I did some more reading and I'm afraid this "double checked locking" scheme does not work correctly in Java. The readers could see a ConcurrentModificationException or even corrupted data. See for example:
http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html

I think we should make storageIdGenerator a ConcurrentHashMap to address the issue. You can avoid doing two lookups here by doing get() and then checking for null.

http://gerrit.cloudera.org:8080/#/c/5148/4/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 120: import java.io.IOException;
I think these might need cleanup.

Line 351: Map<String, Map<String, FileDescriptor>> newFileDescMap = Maps.newHashMap();
What's the benefit of populating this temporary map and then merging it into the global one? Why does updating the global one directly not work?

Line 382: // Loop over all blocks in the file.
comment doesn't add anything, remove

Line 384: Preconditions.checkNotNull(loc);
Not your change, but this looks weird, why would it be null? Remove?

Line 386: String[] blockHostPorts = loc.getNames();
Let's just use loc.getNames() and loc.getHosts() inside the loop in L403. I don't see any benefit to having these as separate variables and their surrounding comments, do you?
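To make the Line 37/Line 72 suggestions concrete, here is a minimal sketch of an eagerly initialized INSTANCE singleton whose state lives in a ConcurrentHashMap, so the broken double-checked locking (and the containsKey()-then-get() double lookup) disappears. The field names and the host+uuid key format below are simplified assumptions for illustration, not the actual patch:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: names and key encoding are assumptions, not Impala's code.
class DiskIdMapper {
  // Eager initialization avoids lazy double-checked locking entirely.
  public static final DiskIdMapper INSTANCE = new DiskIdMapper();

  private DiskIdMapper() {}

  // Per-host counter handing out the next disk id for that host.
  private final ConcurrentHashMap<String, AtomicInteger> nextDiskId =
      new ConcurrentHashMap<>();

  // Maps "host/storageUuid" to a stable disk id. computeIfAbsent() is an
  // atomic get-or-create, so there is no separate containsKey() + get().
  private final ConcurrentHashMap<String, Integer> storageUuidToDiskId =
      new ConcurrentHashMap<>();

  public int getDiskId(String host, String storageUuid) {
    return storageUuidToDiskId.computeIfAbsent(host + "/" + storageUuid,
        k -> nextDiskId.computeIfAbsent(host, h -> new AtomicInteger(0))
                 .getAndIncrement());
  }
}
```

The point of the sketch is that ConcurrentHashMap gives safe publication of the mapped values to concurrent readers, which the hand-rolled locking scheme does not.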
Line 395: Sets.newHashSet(Arrays.asList(loc.getCachedHosts()));
I think we can get rid of the asList() here

Line 399: // to hostMap_/hostList_. The host ID (index in to the hostList_) for each
update comment, "hostMap_" doesn't seem to exist

Line 413: // Remember the THdfsFileBlocks and corresponding BlockLocations. Once all the
is this comment still accurate?

Line 423: for (List<HdfsPartition> partLists: filteredParts.values()) {
Seems cheaper to loop on newFileDescMap and look up filteredParts

Line 431: numHdfsFiles_ += fileDescs.size();
Why do the number of files and bytes always keep increasing?

Line 458: fileStatus.getModificationTime());
indent 4 (and same issue elsewhere)

Line 468: // All the partitions corresponding to the same partition path, have the same
I don't think this is necessarily true, but you can state that we assume it is so.

Line 483: for (HdfsPartition partition: partitions) {
This function looks much more expensive than before. We loop over all partition paths in L451 and here we loop over all partitions, so that's O(N^2) in the number of partitions.

Line 780: Map<FsKey, FileBlocksInfo> blocksToLoad = Maps.newHashMap();
unused?

Line 784: HashMap<Path, List<HdfsPartition>> filteredParts = Maps.newHashMap();
The var name filteredParts is a little confusing to me. How about partsByPath or something like that. The "filtered" part is a little unclear.

Line 830: if (filteredParts.containsKey(partDir)) filteredParts.get(partDir).add(partition);
use {}
use get() and check for null to avoid multiple lookups

Line 833: // Check if the partition directory's is a child dir of hdfsBaseDir_
typo: directory's -> directory

Line 835: !FileSystemUtil.isChildPath(partDir, getHdfsBaseDirPath())) {
use tblLocation, that's clearer (and in comment above)

Line 836: // Add the partition to the list of paths to load block MD.
Comment only describes the code below, maybe say something like:
// This partition has a custom filesystem location.
// Load its file/block metadata separately by adding it to the list of dirs to load.

Line 868: loadBlockMetadata(fs, location, filteredParts, fileBlocksToLoad);
Having separate loadBlockMetadata() and loadDiskIds() doesn't seem to make much sense. We're spending a lot of effort to populate the temporary fileBlocksToLoad map just so we can maintain the separation (which does not seem to make sense). Is there a reason why we would not want to load the block md and disk ids at the same time (without this tmp map)?

Line 934: boolean isMarkedCached = isMarkedCached_;
looks unused, are you sure nothing got broken here?

Line 953: try {
not your change, but you can do this instead:
for (Expr v: keyValues) v.analyzeNoThrow(null);

Line 1583: boolean isMarkedCached = isMarkedCached_;
looks unused, sure nothing is broken here?

Line 1628: totalHdfsBytes_ += hdfsPart.getSize();
Why can't this stay inside addPartition()? In the loading case the size and number of file descriptors should be 0, so adding that should be harmless.

http://gerrit.cloudera.org:8080/#/c/5148/4/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

Line 423: if (p.isRoot()) return false;
isn't the equals() below sufficient?

--
To view, visit http://gerrit.cloudera.org:8080/5148
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie127658172e6e70dae441374530674a4ac9d5d26
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <bhara...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com>
Gerrit-HasComments: Yes
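P.S. A sketch of the single-lookup insertion pattern suggested for Line 830 (with the partsByPath rename from Line 784). The String types below are simplified stand-ins for Impala's Path and HdfsPartition, purely for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: String stands in for Path/HdfsPartition from the patch.
class PartitionIndex {
  private final Map<String, List<String>> partsByPath = new HashMap<>();

  void add(String partDir, String partition) {
    // Single lookup: get() returns null on a miss, so there is no
    // containsKey() followed by a second get().
    List<String> parts = partsByPath.get(partDir);
    if (parts == null) {
      parts = new ArrayList<>();
      partsByPath.put(partDir, parts);
    }
    parts.add(partition);
  }

  List<String> get(String partDir) { return partsByPath.get(partDir); }
}
```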