Alex Behm has posted comments on this change.

Change subject: IMPALA-4172/IMPALA-3653: Improvements to block metadata loading
......................................................................
Patch Set 4:

(29 comments)

http://gerrit.cloudera.org:8080/#/c/5148/4/fe/src/main/java/org/apache/impala/catalog/DiskIdMapper.java
File fe/src/main/java/org/apache/impala/catalog/DiskIdMapper.java:

Line 37: private static DiskIdMapper diskIdMapper = null;
why not: public static DiskIdMapper INSTANCE = new DiskIdMapper()?

Line 50: private static HashMap<String, Integer> storageIdToInt = Maps.newHashMap();
make members and methods non-static since we are using a singleton

Line 63: public static int getDiskId(String host, String storageId) {
storageUuid

Line 72: if (storageIdToInt.containsKey(storageId)) {
I did some more reading and I'm afraid this "double checked locking" scheme does not work correctly in Java. The readers could see a ConcurrentModificationException or even corrupted data. See for example:
http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html

I think we should make storageIdGenerator a ConcurrentHashMap to address the issue. You can avoid doing two lookups here by doing get() and then checking for null.

http://gerrit.cloudera.org:8080/#/c/5148/4/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 120: import java.io.IOException;
I think these might need cleanup.

Line 351: Map<String, Map<String, FileDescriptor>> newFileDescMap = Maps.newHashMap();
What's the benefit of populating this temporary map and then merging it into the global one? Why does updating the global one directly not work?

Line 382: // Loop over all blocks in the file.
comment doesn't add anything, remove

Line 384: Preconditions.checkNotNull(loc);
Not your change, but this looks weird, why would it be null? Remove?

Line 386: String[] blockHostPorts = loc.getNames();
Let's just use loc.getNames() and loc.getHosts() inside the loop in L403. I don't see any benefit to having these as separate variables and their surrounding comments, do you?
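To make the Line 37/Line 72 suggestions concrete, here is a minimal sketch of an eagerly initialized INSTANCE singleton whose state lives in a ConcurrentHashMap, so the broken double-checked locking (and the containsKey()-then-get() double lookup) disappears. The field names and the host+uuid key format below are simplified assumptions for illustration, not the actual patch:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: names and key encoding are assumptions, not Impala's code.
class DiskIdMapper {
  // Eager initialization avoids lazy double-checked locking entirely.
  public static final DiskIdMapper INSTANCE = new DiskIdMapper();

  private DiskIdMapper() {}

  // Per-host counter handing out the next disk id for that host.
  private final ConcurrentHashMap<String, AtomicInteger> nextDiskId =
      new ConcurrentHashMap<>();

  // Maps "host/storageUuid" to a stable disk id. computeIfAbsent() is an
  // atomic get-or-create, so there is no separate containsKey() + get().
  private final ConcurrentHashMap<String, Integer> storageUuidToDiskId =
      new ConcurrentHashMap<>();

  public int getDiskId(String host, String storageUuid) {
    return storageUuidToDiskId.computeIfAbsent(host + "/" + storageUuid,
        k -> nextDiskId.computeIfAbsent(host, h -> new AtomicInteger(0))
                 .getAndIncrement());
  }
}
```

The point of the sketch is that ConcurrentHashMap gives safe publication of the mapped values to concurrent readers, which the hand-rolled locking scheme does not.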
Line 395: Sets.newHashSet(Arrays.asList(loc.getCachedHosts()));
I think we can get rid of the asList() here

Line 399: // to hostMap_/hostList_. The host ID (index in to the hostList_) for each
update comment, "hostMap_" doesn't seem to exist

Line 413: // Remember the THdfsFileBlocks and corresponding BlockLocations. Once all the
is this comment still accurate?

Line 423: for (List<HdfsPartition> partLists: filteredParts.values()) {
Seems cheaper to loop on newFileDescMap and look up filteredParts

Line 431: numHdfsFiles_ += fileDescs.size();
Why do the number of files and bytes always keep increasing?

Line 458: fileStatus.getModificationTime());
indent 4 (and same issue elsewhere)

Line 468: // All the partitions corresponding to the same partition path, have the same
I don't think this is necessarily true, but you can state that we assume it is so.

Line 483: for (HdfsPartition partition: partitions) {
This function looks much more expensive than before. We loop over all partition paths in L451 and here we loop over all partitions, so that's O(N^2) in the number of partitions.

Line 780: Map<FsKey, FileBlocksInfo> blocksToLoad = Maps.newHashMap();
unused?

Line 784: HashMap<Path, List<HdfsPartition>> filteredParts = Maps.newHashMap();
The var name filteredParts is a little confusing to me. How about partsByPath or something like that. The "filtered" part is a little unclear.

Line 830: if (filteredParts.containsKey(partDir)) filteredParts.get(partDir).add(partition);
use {}
use get() and check for null to avoid multiple lookups

Line 833: // Check if the partition directory's is a child dir of hdfsBaseDir_
typo: directory's -> directory

Line 835: !FileSystemUtil.isChildPath(partDir, getHdfsBaseDirPath())) {
use tblLocation, that's clearer (and in comment above)

Line 836: // Add the partition to the list of paths to load block MD.
Comment only describes the code below, maybe say something like:
// This partition has a custom filesystem location.
// Load its file/block metadata separately by adding it to the list of dirs to load.

Line 868: loadBlockMetadata(fs, location, filteredParts, fileBlocksToLoad);
Having separate loadBlockMetadata() and loadDiskIds() doesn't seem to make much sense. We're spending a lot of effort to populate the temporary fileBlocksToLoad map just so we can maintain the separation (which does not seem to make sense). Is there a reason why we would not want to load the block md and disk ids at the same time (without this tmp map)?

Line 934: boolean isMarkedCached = isMarkedCached_;
looks unused, are you sure nothing got broken here?

Line 953: try {
not your change, but you can do this instead:
for (Expr v: keyValues) v.analyzeNoThrow(null);

Line 1583: boolean isMarkedCached = isMarkedCached_;
looks unused, sure nothing is broken here?

Line 1628: totalHdfsBytes_ += hdfsPart.getSize();
Why can't this stay inside addPartition()? In the loading case the size and number of file descriptors should be 0, so adding that should be harmless.

http://gerrit.cloudera.org:8080/#/c/5148/4/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

Line 423: if (p.isRoot()) return false;
isn't the equals() below sufficient?

--
To view, visit http://gerrit.cloudera.org:8080/5148
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie127658172e6e70dae441374530674a4ac9d5d26
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <bhara...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com>
Gerrit-HasComments: Yes
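P.S. A sketch of the single-lookup insertion pattern suggested for Line 830 (with the partsByPath rename from Line 784). The String types below are simplified stand-ins for Impala's Path and HdfsPartition, purely for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: String stands in for Path/HdfsPartition from the patch.
class PartitionIndex {
  private final Map<String, List<String>> partsByPath = new HashMap<>();

  void add(String partDir, String partition) {
    // Single lookup: get() returns null on a miss, so there is no
    // containsKey() followed by a second get().
    List<String> parts = partsByPath.get(partDir);
    if (parts == null) {
      parts = new ArrayList<>();
      partsByPath.put(partDir, parts);
    }
    parts.add(partition);
  }

  List<String> get(String partDir) { return partsByPath.get(partDir); }
}
```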