Mihaly Szjatinya has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/22049 )

Change subject: IMPALA-10319: Support arbitrary encodings on Text files
......................................................................


Patch Set 22:

(30 comments)

Thanks for review!

http://gerrit.cloudera.org:8080/#/c/22049/21//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/22049/21//COMMIT_MSG@7
PS21, Line 7: files
> What's the situation now when serialization.encoding is set for a Sequence
I've proposed earlier to do Sequence files in a separate issue if that's ok, 
since even implementation for Text got pretty bulky. Edited this ticket and 
created a follow-up for Sequence files.


http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/exec/hdfs-scanner.h
File be/src/exec/hdfs-scanner.h:

http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/exec/hdfs-scanner.h@326
PS21, Line 326:
> typo: decoded
Ack


http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/exec/hdfs-text-table-writer.cc
File be/src/exec/hdfs-text-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/exec/hdfs-text-table-writer.cc@56
PS21, Line 56: HdfsTextTableWriter::~HdfsTextTableWriter() {
> nit: can fit in one line
Ack


http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/util/char-codec.h
File be/src/util/char-codec.h:

http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/util/char-codec.h@54
PS21, Line 54: std::string* result
> here and in the Handle* functions: Impala has the convention of using point
Ack


http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/util/char-codec.cc
File be/src/util/char-codec.cc:

http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/util/char-codec.cc@61
PS21, Line 61: scopedMemTracker
> nit: could be scoped_mem_tracker
let's leave it like e.g. MemTracker


http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/util/char-codec.cc@103
PS21, Line 103: prefix
> Is it possible to use partial_symbol to build up the prefix? That way, we c
These are really different entities with different scope, but maybe I didn't 
get what you mean.


http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/util/char-codec.cc@113
PS21, Line 113: boost::
> nit: reinterpret_cast instead of c-style casts
Ack


http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/util/char-codec.cc@159
PS21, Line 159: u
> nit: reinterpret_cast instead of c-style casts
Ack


http://gerrit.cloudera.org:8080/#/c/22049/21/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
File 
fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java:

http://gerrit.cloudera.org:8080/#/c/22049/21/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java@289
PS21, Line 289: .e
> It should be .equals()
Yes, my bad.


http://gerrit.cloudera.org:8080/#/c/22049/21/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java@294
PS21, Line 294: 'serialization.encodin
> nit: could be lowercase with apostrophe: 'serialization.encoding'.
Ack


http://gerrit.cloudera.org:8080/#/c/22049/21/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java@295
PS21, Line 295: %s",
> nit: could be simplified as %s
Ack


http://gerrit.cloudera.org:8080/#/c/22049/21/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java@303
PS21, Line 303: %s %s",
> nit: could be simplified as %s / %s
Ack


http://gerrit.cloudera.org:8080/#/c/22049/21/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java@308
PS21, Line 308: l.getMeta
> nit: unnecessary cast
Ack


http://gerrit.cloudera.org:8080/#/c/22049/21/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java@314
PS21, Line 314: %s %s", tbl
> nit: could be simplified as %s / %s
Ack


http://gerrit.cloudera.org:8080/#/c/22049/21/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java@320
PS21, Line 320: tring.format("Unsupported encoding: %s."
> nit: it could use a format-string similarly to the previous exception messa
Ack


http://gerrit.cloudera.org:8080/#/c/22049/21/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java@318
PS21, Line 318: String encoding = 
tblProperties.get(serdeConstants.SERIALIZATION_ENCODING);
              :     if (!Charset.isSupported(encoding)) {
              :       throw new AnalysisException(String.format("Unsupported 
encoding: %s.", encoding));
              :     }
> nit: it could be moved after the first property check at L291
Hmm, it kind of goes from general to more specific.


http://gerrit.cloudera.org:8080/#/c/22049/21/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java@331
PS21, Line 331: %s.
> nit: could be simplified as %s
Ack


http://gerrit.cloudera.org:8080/#/c/22049/21/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java@332
PS21, Line 332: Please refer to IMPALA-10319 for more info.",
> nit: I'd keep it simpler by skipping "Jira ticket": Please refer to IMPALA-
Ack


http://gerrit.cloudera.org:8080/#/c/22049/21/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java@337
PS21, Line 337: l.getMeta
> nit: unneccessary cast
Ack


http://gerrit.cloudera.org:8080/#/c/22049/21/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
File fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java:

http://gerrit.cloudera.org:8080/#/c/22049/21/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java@21
PS21, Line 21: import java.util.ArrayList;
> nit: not used
Ack


http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py
File tests/query_test/test_charcodec.py:

http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@59
PS21, Line 59:         .format(db, utf8_table))
> it would be nice to also pass the expected row count to the function to ens
Agree, done


http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@68
PS21, Line 68: sert result.data
> general notes on tests: the coverage seems great in general, but I am conce
Some of these were already added in earlier patches and then removed. Added all 
in a separate class.


http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@74
PS21, Line 74:   def add_test_dimensions(cls):
> A ticket could be mentioned that deals with supporting encoding for other f
Ack


http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@79
PS21, Line 79:       encodings = [enc for enc in encodings if enc == 'gbk']
> what is the reason for this line? some comment could be added like "Basic t
Ack


http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@87
PS21, Line 87:
> It could use tempfile instead, no extra cleanup needed with it
Ack, this was also addressed by Csaba.


http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@109
PS21, Line 109:     encoding_name_tbl = encoding_name.replace('-', '')
> a comment like '# remove REFRESH when IMPALA-13749 is fixed" would be usefu
Ack


http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@127
PS21, Line 127:         STORED AS TEXTFILE""".format(db, encoded_table))
> Some basic function comment would be nice, e.g. """Write encoded table with
Ok, added some comments.


http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@134
PS21, Line 134:     return encoded_table
> I think that it would be better to create a temporary directory that is cle
It looks like auto-deleting 'tempfile.TemporaryDirectory' is only available for 
python 3. 'tempfile.mkdtemp' still requires shutil.rmtree(tmp_dir).

We can make a simple context manager class and use it everywhere, if we don't 
plan to fully switch to python3.


http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@195
PS21, Line 195:
> The reason for choosing the codecs is that Snappy doesn't support streaming
Ack


http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@303
PS21, Line 303:         PARTITIONED BY (part STRING)
> looks duplicated (also see my comment at line 134)
Ack, but this particular folder is not temp but table's path.



--
To view, visit http://gerrit.cloudera.org:8080/22049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65
Gerrit-Change-Number: 22049
Gerrit-PatchSet: 22
Gerrit-Owner: Mihaly Szjatinya <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Mihaly Szjatinya <[email protected]>
Gerrit-Reviewer: Peter Rozsa <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Wed, 21 May 2025 13:21:29 +0000
Gerrit-HasComments: Yes

Reply via email to