Paul Rogers has posted comments on this change. ( http://gerrit.cloudera.org:8080/11158 )
Change subject: IMPALA-7540. Intern most repetitive strings and network addresses in catalog ...................................................................... Patch Set 3: (12 comments) Bharath, applied your review comments. Please take another look to see if this is ready to go. http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java File fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java: http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java@62 PS2, Line 62: */ > nit: for each method, should we list the set of fields that we selectively Documentation of this kind tends to get out of date quickly. Looking at the code, it seems clear enough what is being interned. Not sure that maintaining a list of fields is worth the effort. http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java@88 PS2, Line 88: public static void internFieldsInPlace(TTable table) { > null check for table Done http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java@113 PS2, Line 113: */ > null check for sd. Done http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java@131 PS2, Line 131: * Intern low-cardinality fields in the given FieldSchema. > null check Done http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java@155 PS2, Line 155: not like 'impala_%' group by PARAM_VALUE order by count(*) desc limit 100; : // : // In a large catalog from a production install, these represented about 68% of the : // entries. : String val = e.getValue(); : if (val.isEmpty() || : "-1".equals(val) || > Can we create a static HashSet out of these (lowercased) and use something Looks like only five of the values would be in the table. The empty string and the prefix string would still be handled specially. Given only five entries, the above, though clunky, is simpler than, and probably performs as well as, a hash table. http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java@172 PS2, Line 172: STRING_INTERNER.int > Maybe move everything to guava interner to be consistent? Native interning Done http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java@174 PS2, Line 174: > Preconditions.checkState(ret.size() == parameters.size())? Done http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java@188 PS2, Line 188: > can we rename it to something like internString() that we can reuse for all Done http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java@198 PS2, Line 198: if (value == null) return null; > This looks risky if someone updates the thrift def for TNetworkAddress :-) The risk is only if fields are added. If fields are removed, this code won't compile and will thus call attention to itself. The note says that not all fields are stubbed out. So, yes, will lead to odd bugs if someone alters one of these after interning. But, given the semantics of metadata, we probably should not be altering objects cached in metadata anyway because they are used by multiple planner threads. http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/HBaseColumn.java File fe/src/main/java/org/apache/impala/catalog/HBaseColumn.java: http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/HBaseColumn.java@35 PS2, Line 35: columnQualifier > just curious, any reason why not this? According to this link: https://www.dummies.com/programming/big-data/hadoop/column-qualifiers-in-the-hbase-data-model/ "Unlike column families, column qualifiers can be virtually unlimited in content, length and number. If you omit the column qualifier, the HBase system will assign one for you. Printable characters are not required, so any type and number of bytes can be used to create a column qualifier." So, looks like the qualifier is high cardinality, low reuse. http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java: http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java@853 PS2, Line 853: _ = id; : accessLevel_ = accessLevel; > Shouldn't we do it after we remove the incremental stats ? (or) blacklist i Done http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1203 PS2, Line 1203: > just curious, why not intern everything via Guava? I suspect that the "null partition key value" is a single value, so we just intern that one value here. -- To view, visit http://gerrit.cloudera.org:8080/11158 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib3121aefa4391bcb1477d9dba0a49440d7000d26 Gerrit-Change-Number: 11158 Gerrit-PatchSet: 3 Gerrit-Owner: Todd Lipcon <t...@apache.org> Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Paul Rogers <prog...@cloudera.com> Gerrit-Comment-Date: Sat, 26 Jan 2019 05:42:14 +0000 Gerrit-HasComments: Yes