[Impala-ASF-CR] IMPALA-10319: Support arbitrary encodings on Text files

Mihaly Szjatinya (Code Review) Wed, 28 May 2025 15:52:43 -0700

Mihaly Szjatinya has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/22049 )


Change subject: IMPALA-10319: Support arbitrary encodings on Text files
......................................................................


Patch Set 24:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/util/char-codec.cc
File be/src/util/char-codec.cc:

http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/util/char-codec.cc@61
PS21, Line 61: scoped_mem_track
> I meant the variable name could follow the snake-case, as other variable na
Oh, ok, why not.


http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/util/char-codec.cc@103
PS21, Line 103: prefix
> partial_symbol can be used to build up the symbol from the new buffer, as i
This would indeed save an extra vector and a few lines of code, but on the 
other side it would lower function's exception-safety guarantee from strong to 
basic, by:

1. chaning class-level 'partial_symbol_' in case of throw outside funcion.
2. more importantly, changing 'buf_start' in case of throw outside function, 
and thus invalidating it.

This may be functionally ok with the current code outside the function but 
generally it's a good design to keep throwing functions with strong guarantee. 
Besides, it's easier to debug this way.

We could probably still get rid of 'success' flag, but I would keep it for 
consistency with later 'HandleSuffix'.


http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py
File tests/query_test/test_charcodec.py:

http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@72
PS21, Line 72: class TestCharCodecGen(ImpalaTestSuite):
> It seems a bit slow, especially the 32 mins, we could check if there are so
If it's not important to have 'disable_codegen' both True and False of 
exhaustive mode, I've constrained it out to cut time in half.

We might also generate smaller tables.


http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py
File tests/query_test/test_charcodec.py:

http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@362
PS23, Line 362:     # There is no reason to run these tests using all 
dimensions.
> nit: trailing white space
ack


http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@372
PS23, Line 372:     self.execute_query("""CREATE TABLE IF NOT EXISTS {}.{} 
(name STRING)
              :         STORED AS TEXTFILE""".format(db, tbl_name))
              :     if encoding:
              :       self.execute_query("""ALTER TABLE {}.{} SET
              :           SERDEPROPERTIES("serialization.encoding"="{}")"""
> there are a few function in file_utils.py that helps with such tasks, e.g.
I wasn't aware of that function, but unfortunately it copies the entire folder, 
while we have a folder with different files for each table.

And more generally perhaps it would be more consistent with the above tests 
where we do this manually for different reasons.


http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@379
PS23, Line 379:         "charcod
> Does this actually need Hive?
This comes down simply to 'IS_HDFS' check and I thought many tests use this 
just as a basic sanity check, not necessarily only with hive involved. Isn't 
that the case?


http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@386
PS23, Line 386:   def test_precreated_files(self, vector, unique_database):
> Too long line (why didn't jenkins warn about this?)
Sorry for this and similar, just hadn't updated small fixes yet.


http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@395
PS23, Line 395:         vector, db, enc + '_names_none', enc + '_names.txt', 
None)
              :     with pytest.raises(AssertionError) as exc_info:
> There could be a utility function like check_tables_equal that checks this
We can use '_compare_tables' from this file above. Good idea.


http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@400
PS23, Line 400: t
> Does this actually need Hive?
Ack


http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@410
PS23, Line 410:     if enc not in ['gbk', 'shift_jis']: pytest.skip()
> Does this actually need Hive?
Ack


http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@417
PS23, Line 417:   @SkipIfFS.hive
> nit: trailing white space
Ack


http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@432
PS23, Line 432:     """Write table with Impala and read it back with Hive."""
              :     db = unique_database
              :     enc = vector.get_value('charset')
              :
> This may work for tables with a single file, but generally we cannot make a
Interestingly enough. after adding 'order by name' to both queries the 
comparison doesn't work anymore. It's probably due to ordering taking place 
before decoding, which likely deserves a separate discussion.

In this test however we have tiny single files. Or if needed we can do an 
ordering in python here.



--
To view, visit http://gerrit.cloudera.org:8080/22049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65
Gerrit-Change-Number: 22049
Gerrit-PatchSet: 24
Gerrit-Owner: Mihaly Szjatinya <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Mihaly Szjatinya <[email protected]>
Gerrit-Reviewer: Peter Rozsa <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Wed, 28 May 2025 22:52:32 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10319: Support arbitrary encodings on Text files

Reply via email to