Mihaly Szjatinya has posted comments on this change. ( http://gerrit.cloudera.org:8080/22049 )
Change subject: IMPALA-10319: Support arbitrary encodings on Text files ...................................................................... Patch Set 22: (30 comments) Thanks for review! http://gerrit.cloudera.org:8080/#/c/22049/21//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/22049/21//COMMIT_MSG@7 PS21, Line 7: files > What's the situation now when serialization.encoding is set for a Sequence I've proposed earlier to do Sequence files in a separate issue if that's ok, since even implementation for Text got pretty bulky. Edited this ticket and created a follow-up for Sequence files. http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/exec/hdfs-scanner.h File be/src/exec/hdfs-scanner.h: http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/exec/hdfs-scanner.h@326 PS21, Line 326: > typo: decoded Ack http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/exec/hdfs-text-table-writer.cc File be/src/exec/hdfs-text-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/exec/hdfs-text-table-writer.cc@56 PS21, Line 56: HdfsTextTableWriter::~HdfsTextTableWriter() { > nit: can fit in one line Ack http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/util/char-codec.h File be/src/util/char-codec.h: http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/util/char-codec.h@54 PS21, Line 54: std::string* result > here and in the Handle* functions: Impala has the convention of using point Ack http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/util/char-codec.cc File be/src/util/char-codec.cc: http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/util/char-codec.cc@61 PS21, Line 61: scopedMemTracker > nit: could be scoped_mem_tracker let's leave it like e.g. MemTracker http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/util/char-codec.cc@103 PS21, Line 103: prefix > Is it possible to use partial_symbol to build up the prefix? That way, we c These are really different entities with different scope, but maybe I didn't get what you mean. http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/util/char-codec.cc@113 PS21, Line 113: boost:: > nit: reinterpret_cast instead of c-style casts Ack http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/util/char-codec.cc@159 PS21, Line 159: u > nit: reinterpret_cast instead of c-style casts Ack http://gerrit.cloudera.org:8080/#/c/22049/21/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java File fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java: http://gerrit.cloudera.org:8080/#/c/22049/21/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java@289 PS21, Line 289: .e > It should be .equals() Yes, my bad. http://gerrit.cloudera.org:8080/#/c/22049/21/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java@294 PS21, Line 294: 'serialization.encodin > nit: could be lowercase with apostrophe: 'serialization.encoding'. Ack http://gerrit.cloudera.org:8080/#/c/22049/21/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java@295 PS21, Line 295: %s", > nit: could be simplified as %s Ack http://gerrit.cloudera.org:8080/#/c/22049/21/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java@303 PS21, Line 303: %s %s", > nit: could be simplified as %s / %s Ack http://gerrit.cloudera.org:8080/#/c/22049/21/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java@308 PS21, Line 308: l.getMeta > nit: unnecessary cast Ack http://gerrit.cloudera.org:8080/#/c/22049/21/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java@314 PS21, Line 314: %s %s", tbl > nit: could be simplified as %s / %s Ack http://gerrit.cloudera.org:8080/#/c/22049/21/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java@320 PS21, Line 320: tring.format("Unsupported encoding: %s." > nit: it could use a format-string similarly to the previous exception messa Ack http://gerrit.cloudera.org:8080/#/c/22049/21/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java@318 PS21, Line 318: String encoding = tblProperties.get(serdeConstants.SERIALIZATION_ENCODING); : if (!Charset.isSupported(encoding)) { : throw new AnalysisException(String.format("Unsupported encoding: %s.", encoding)); : } > nit: it could be moved after the first property check at L291 Hmm, it kind of goes from general to more specific. http://gerrit.cloudera.org:8080/#/c/22049/21/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java@331 PS21, Line 331: %s. > nit: could be simplified as %s Ack http://gerrit.cloudera.org:8080/#/c/22049/21/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java@332 PS21, Line 332: Please refer to IMPALA-10319 for more info.", > nit: I'd keep it simpler by skipping "Jira ticket": Please refer to IMPALA- Ack http://gerrit.cloudera.org:8080/#/c/22049/21/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java@337 PS21, Line 337: l.getMeta > nit: unneccessary cast Ack http://gerrit.cloudera.org:8080/#/c/22049/21/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java File fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java: http://gerrit.cloudera.org:8080/#/c/22049/21/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java@21 PS21, Line 21: import java.util.ArrayList; > nit: not used Ack http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py File tests/query_test/test_charcodec.py: http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@59 PS21, Line 59: .format(db, utf8_table)) > it would be nice to also pass the expected row count to the function to ens Agree, done http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@68 PS21, Line 68: sert result.data > general notes on tests: the coverage seems great in general, but I am conce Some of these were already added in earlier patches and then removed. Added all in a separate class. http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@74 PS21, Line 74: def add_test_dimensions(cls): > A ticket could be mentioned that deals with supporting encoding for other f Ack http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@79 PS21, Line 79: encodings = [enc for enc in encodings if enc == 'gbk'] > what is the reason for this line? some comment could be added like "Basic t Ack http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@87 PS21, Line 87: > It could use tempfile instead, no extra cleanup needed with it Ack, this was also addressed by Csaba. http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@109 PS21, Line 109: encoding_name_tbl = encoding_name.replace('-', '') > a comment like '# remove REFRESH when IMPALA-13749 is fixed" would be usefu Ack http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@127 PS21, Line 127: STORED AS TEXTFILE""".format(db, encoded_table)) > Some basic function comment would be nice, e.g. """Write encoded table with Ok, added some comments. http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@134 PS21, Line 134: return encoded_table > I think that it would be better to create a temporary directory that is cle It looks like auto-deleting 'tempfile.TemporaryDirectory' is only available for python 3. 'tempfile.mkdtemp' still requires shutil.rmtree(tmp_dir). We can make a simple context manager class and use it everywhere, if we don't plan to fully switch to python3. http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@195 PS21, Line 195: > The reason for choosing the codecs is that Snappy doesn't support streaming Ack http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@303 PS21, Line 303: PARTITIONED BY (part STRING) > looks duplicated (also see my comment at line 134) Ack, but this particular folder is not temp but table's path. -- To view, visit http://gerrit.cloudera.org:8080/22049 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65 Gerrit-Change-Number: 22049 Gerrit-PatchSet: 22 Gerrit-Owner: Mihaly Szjatinya <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Mihaly Szjatinya <[email protected]> Gerrit-Reviewer: Peter Rozsa <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Wed, 21 May 2025 13:21:29 +0000 Gerrit-HasComments: Yes
