[Impala-ASF-CR] IMPALA-5429: Multi threaded block metadata loading

Bharath Vissapragada (Code Review) Fri, 20 Oct 2017 14:10:03 -0700

Bharath Vissapragada has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8235 )


Change subject: IMPALA-5429: Multi threaded block metadata loading
......................................................................


Patch Set 6:

(15 comments)

http://gerrit.cloudera.org:8080/#/c/8235/6/be/src/catalog/catalog.cc
File be/src/catalog/catalog.cc:

http://gerrit.cloudera.org:8080/#/c/8235/6/be/src/catalog/catalog.cc@38
PS6, Line 38: DEFINE_int32(max_hdfs_parts_parallel_load, 5,
> Let's spell out "parts" into "partitions" to avoid confusion.
Done


http://gerrit.cloudera.org:8080/#/c/8235/6/be/src/catalog/catalog.cc@40
PS6, Line 40:     "tables. Due to HDFS architectural limitations, it is 
unlikely to get a linear "
> In the sentence "Due to HDFS architectural..." what does the "it" after the
Done


http://gerrit.cloudera.org:8080/#/c/8235/6/be/src/catalog/catalog.cc@42
PS6, Line 42: DEFINE_int32(max_s3_parts_parallel_load, 20,
> Does ADLS fall into this or the HDFS bucket? What about other filesystems?
Yes, although I haven't tested it specifically with ADLS. I mentioned that in 
the commit message but it looks like a good thing to reflect that in the config 
name too. I changed it too, let me know if you think we should use a better 
name here.


http://gerrit.cloudera.org:8080/#/c/8235/6/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

http://gerrit.cloudera.org:8080/#/c/8235/6/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@216
PS6, Line 216:   private class PathStorageMdLoadingStats {
> Naming seems a little weird. I think "Storage" is too generic and "Block" i
Wfm. Changed it.


http://gerrit.cloudera.org:8080/#/c/8235/6/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@240
PS6, Line 240:     // If set to true, reloads the block metadata only when the 
underlying file
> Please clarify "the underlying file"
Done


http://gerrit.cloudera.org:8080/#/c/8235/6/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@413
PS6, Line 413:         ++loadStats.ignoredFiles;
> I think we should distinguish the hidden file and already-up-to-date file c
Done


http://gerrit.cloudera.org:8080/#/c/8235/6/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@773
PS6, Line 773:    * Returns the thread pool size to load the block metadata of 
this table.
> Command on how we distinguish between HDFS, S3 and other filesystems (using
Done


http://gerrit.cloudera.org:8080/#/c/8235/6/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@793
PS6, Line 793:     return Math.max(Math.min(numPaths, threadPoolSize), 1);
> For my understanding, why is the max() needed here? Is it not be a precondi
For empty tables, numPaths is 0. But yes this should be a preconditions check. 
So I moved it to the caller and added a preconditions check in this method.


http://gerrit.cloudera.org:8080/#/c/8235/6/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@800
PS6, Line 800:    * metadata is reloaded, else it is loaded from scratch.
> Difference between "reloaded" and "loaded from scratch" is not very clear.
Sounds better. Done.


http://gerrit.cloudera.org:8080/#/c/8235/6/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@804
PS6, Line 804:     int numPathsToLoad = partsByPath.keySet().size();
> partsByPath.size()?
Done


http://gerrit.cloudera.org:8080/#/c/8235/6/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@812
PS6, Line 812:       for (Path p: partsByPath.keySet()) {
> The tasks are ordered randomly. I wonder if submitting the tasks in a clust
Will do. I have two other jiras (based on others' comments) to create, so will 
add this to the list.


http://gerrit.cloudera.org:8080/#/c/8235/6/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@821
PS6, Line 821:         } catch (ExecutionException | InterruptedException e) {
> We still consider the load successful if one of these fails. What do you th
Oops, I forgot to wrap it as a TLE and throw. Thanks for pointing it out. My 
opinion is that a partially loaded table is ok, but we should let the user know 
about it. What do you think?


http://gerrit.cloudera.org:8080/#/c/8235/6/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@822
PS6, Line 822:           LOG.error("Encountered an error loading block metadata 
for table: " +
> Let's also dump the path
ExecutionException (ex, IOException in this case) includes the path by default 
and LOG.error() logs the full trace?


http://gerrit.cloudera.org:8080/#/c/8235/6/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@898
PS6, Line 898:    * the newly created HdfsPartitions in parallel.
> remove "the"
Done


http://gerrit.cloudera.org:8080/#/c/8235/6/fe/src/main/java/org/apache/impala/util/ListMap.java
File fe/src/main/java/org/apache/impala/util/ListMap.java:

http://gerrit.cloudera.org:8080/#/c/8235/6/fe/src/main/java/org/apache/impala/util/ListMap.java@38
PS6, Line 38:   private List<T> list_ = Collections.synchronizedList(new 
ArrayList<T>());
> The CC requirements and behavior are not clear to me. Why is it not suffici
My understanding is that it could affect the throughput of getIndex(). Once the 
cache is warmed up with all the host mapping, we don't need it to be 
synchronized anymore? Is my understanding wrong?



--
To view, visit http://gerrit.cloudera.org:8080/8235
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I07eaa7151dfc4d56da8db8c2654bd65d8f808481
Gerrit-Change-Number: 8235
Gerrit-PatchSet: 6
Gerrit-Owner: Bharath Vissapragada <bhara...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogian...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jbapple-imp...@apache.org>
Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <vercego...@cloudera.com>
Gerrit-Comment-Date: Fri, 20 Oct 2017 21:09:38 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5429: Multi threaded block metadata loading

Reply via email to