[Impala-ASF-CR] WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence files

Mihaly Szjatinya (Code Review) Sun, 15 Dec 2024 15:36:08 -0800

Mihaly Szjatinya has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/22049 )


Change subject: WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence 
files
......................................................................


Patch Set 4:

(3 comments)

Made the following changes:
1. Extracted the decoder into a separate class so it can be reused for writing 
text files and reading Sequence files. Memory / pool management still needs to 
be improved.
2. Moved 'serialization.encoding' property to THdfsStorageDescriptor as per 
partition SerdeProperty.
3. Prohibited multibyte charsets.
4. Adjusted tests. Added test draft for dataload.

What's left:
1. Add encoding for Text files.
2. Add decoding for Sequence files.
3. Improve memory management.
4. Add property ignore option.
5. Extend tests.

http://gerrit.cloudera.org:8080/#/c/22049/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/22049/2//COMMIT_MSG@23
PS2, Line 23:
> We discussed this offline with Mihaly and the current consensus is to imple
Added a check at the analysis stage for whether '\n' is compatible with ASCII. 
We still need to decide whether and how to check an arbitrary 'line.delimiter' 
in these two cases:
1. 'line.delimiter' already set.
2. 'line.delimiter' being set together with the 'serialization.encoding' within 
the same 'alter table' query.
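
To make the idea concrete, here is a rough Python sketch of the kind of check 
meant above. It is only an illustration, not the analysis code in the patch; 
the function name is hypothetical, and Python codec names are assumed to stand 
in for the 'serialization.encoding' values.

```python
def is_ascii_compatible(delimiter: str, encoding: str) -> bool:
    """Return True if `delimiter` has the same byte representation in
    `encoding` as it has in ASCII.

    Hypothetical helper for illustration only.
    """
    try:
        # A non-ASCII delimiter, an unknown charset, or a delimiter the
        # charset cannot represent all count as "not compatible".
        return delimiter.encode(encoding) == delimiter.encode("ascii")
    except (UnicodeEncodeError, LookupError):
        return False


if __name__ == "__main__":
    for charset in ("iso-8859-1", "cp1251", "cp037", "utf-16"):
        print(charset, is_ascii_compatible("\n", charset))
```

The same comparison would apply to a user-supplied 'line.delimiter' in both 
cases listed above. EBCDIC code pages such as cp037 are the kind of charset 
this rejects, since they map line feed to a different byte.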


http://gerrit.cloudera.org:8080/#/c/22049/3/tests/query_test/test_decoding.py
File tests/query_test/test_decoding.py:

http://gerrit.cloudera.org:8080/#/c/22049/3/tests/query_test/test_decoding.py@36
PS3, Line 36: text_dimensio
> It would be nice to also create a table during dataload:
Added dataload from 'functional.alltypes' as a draft. Since we filter out 
multibyte charsets, and all data in 'functional.alltypes' is ASCII compatible, 
this effectively only checks that nothing is broken (which is still useful).

To better check decoding itself, we need to generate large tables with plenty 
of local / special symbols. I've seen multiple designs for that in tests, such 
as using a generator, pre-created files, copying from existing tables, etc. 
What would be the best option here?
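
For the generator option, something along these lines could work. This is a 
minimal Python sketch under my own assumptions: the helper names, the 
two-column "id,text" row layout and the fixed seed are all hypothetical and 
not part of the patch or of the existing test framework.

```python
import random


def high_chars(encoding):
    """Printable characters that bytes 0x80..0xFF decode to in a
    single-byte charset (hypothetical helper)."""
    chars = []
    for b in range(0x80, 0x100):
        try:
            ch = bytes([b]).decode(encoding)
        except UnicodeDecodeError:
            continue  # byte is undefined in this charset
        if ch.isprintable():
            chars.append(ch)
    return chars


def generate_rows(encoding, num_rows, width=16, seed=0):
    """Yield encoded 'id,text\\n' rows whose text column contains only
    non-ASCII characters of the given charset."""
    alphabet = high_chars(encoding)
    if not alphabet:
        raise ValueError("no single-byte non-ASCII characters in " + encoding)
    rng = random.Random(seed)  # fixed seed keeps the data reproducible
    for i in range(num_rows):
        text = "".join(rng.choice(alphabet) for _ in range(width))
        yield f"{i},{text}\n".encode(encoding)


if __name__ == "__main__":
    for row in generate_rows("cp1251", 3):
        print(row)
```

A fixed seed would let the expected query results be checked in alongside the 
test, while the row count can be scaled up to get a table that spans several 
scan ranges.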

To check with partitions, we need to implement writing first, since Hive has a 
bug when inserting into partitions.


http://gerrit.cloudera.org:8080/#/c/22049/3/tests/query_test/test_decoding.py@49
PS3, Line 49:
> These small tables are useful to test encoding, but having a bigger table c
Ack



--
To view, visit http://gerrit.cloudera.org:8080/22049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65
Gerrit-Change-Number: 22049
Gerrit-PatchSet: 4
Gerrit-Owner: Mihaly Szjatinya <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Mihaly Szjatinya <[email protected]>
Gerrit-Comment-Date: Sun, 15 Dec 2024 23:35:52 +0000
Gerrit-HasComments: Yes
