Paul Rogers has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11158 )

Change subject: IMPALA-7540. Intern most repetitive strings and network 
addresses in catalog
......................................................................


Patch Set 3:

(12 comments)

Bharath, applied your review comments. Please take another look to see if this 
is ready to go.

http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java
File fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java:

http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java@62
PS2, Line 62:    */
> nit: for each method, should we list the set of fields that we selectively
Documentation of this kind tends to get out of date quickly. Looking at the 
code, it seems clear enough what is being interned. Not sure that maintaining a 
list of fields is worth the effort.


http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java@88
PS2, Line 88:   public static void internFieldsInPlace(TTable table) {
> null check for table
Done


http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java@113
PS2, Line 113:    */
> null check for sd.
Done


http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java@131
PS2, Line 131:    * Intern low-cardinality fields in the given FieldSchema.
> null check
Done


http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java@155
PS2, Line 155:  not like 'impala_%' group by PARAM_VALUE order by count(*) desc 
limit 100;
             :       //
             :       // In a large catalog from a production install, these 
represented about 68% of the
             :       // entries.
             :       String val = e.getValue();
             :       if (val.isEmpty() ||
             :           "-1".equals(val) ||
> Can we create a static HashSet out of these (lowercased) and use something
Looks like only five of the values would be in the table. The empty string and 
the prefix string would still be handled specially. Given only five entries, 
the above, though clunky, is simpler than, and probably performs as well as, a 
hash table.


http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java@172
PS2, Line 172: STRING_INTERNER.int
> Maybe move everything to guava interner to be consistent? Native interning
Done


http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java@174
PS2, Line 174:
> Preconditions.checkState(ret.size() == parameters.size())?
Done


http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java@188
PS2, Line 188:
> can we rename it to something like internString() that we can reuse for all
Done


http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java@198
PS2, Line 198:     if (value == null) return null;
> This looks risky if someone updates the thrift def for TNetworkAddress :-)
The risk is only if fields are added. If fields are removed, this code won't 
compile and will thus call attention to itself.
The note says that not all fields are stubbed out. So, yes, will lead to odd 
bugs if someone alters one of these after interning.

But, given the semantics of metadata, we probably should not be altering 
objects cached in metadata anyway because they are used by multiple planner 
threads.


http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/HBaseColumn.java
File fe/src/main/java/org/apache/impala/catalog/HBaseColumn.java:

http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/HBaseColumn.java@35
PS2, Line 35: columnQualifier
> just curious, any reason why not this?
According to this link: 
https://www.dummies.com/programming/big-data/hadoop/column-qualifiers-in-the-hbase-data-model/

"Unlike column families, column qualifiers can be virtually unlimited in 
content, length and number. If you omit the column qualifier, the HBase system 
will assign one for you. Printable characters are not required, so any type and 
number of bytes can be used to create a column qualifier."

So, looks like the qualifier is high cardinality, low reuse.


http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java:

http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java@853
PS2, Line 853: _ = id;
             :     accessLevel_ = accessLevel;
> Shouldn't we do it after we remove the incremental stats ? (or) blacklist i
Done


http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1203
PS2, Line 1203:
> just curious, why not intern everything via Guava?
I suspect that the "null partition key value" is a single value, so we just 
intern that one value here.



--
To view, visit http://gerrit.cloudera.org:8080/11158
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3121aefa4391bcb1477d9dba0a49440d7000d26
Gerrit-Change-Number: 11158
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <prog...@cloudera.com>
Gerrit-Comment-Date: Sat, 26 Jan 2019 05:42:14 +0000
Gerrit-HasComments: Yes

Reply via email to