Mihaly Szjatinya has posted comments on this change. ( http://gerrit.cloudera.org:8080/22049 )
Change subject: WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence files
......................................................................


Patch Set 4:

(3 comments)

Made the following changes:
1. Extracted the decoder into a separate class so it can be reused for writing text files and reading Sequence files. Memory / pool management is still to be improved.
2. Moved the 'serialization.encoding' property to THdfsStorageDescriptor as a per-partition SerdeProperty.
3. Prohibited multibyte charsets.
4. Adjusted tests. Added a test draft for dataload.

What's left:
1. Add encoding for Text files.
2. Add decoding for Sequence files.
3. Improve memory management.
4. Add a property ignore option.
5. Extend tests.

http://gerrit.cloudera.org:8080/#/c/22049/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/22049/2//COMMIT_MSG@23
PS2, Line 23:
> We discussed this offline with Mihaly and the current consensus is to imple
Added a check at the analysis stage for whether '\n' is compatible with ASCII.
We should think about whether and how we want to check an arbitrary 'line.delimiter' in these cases:
1. 'line.delimiter' is already set.
2. 'line.delimiter' is set together with 'serialization.encoding' within the same 'alter table' query.


http://gerrit.cloudera.org:8080/#/c/22049/3/tests/query_test/test_decoding.py
File tests/query_test/test_decoding.py:

http://gerrit.cloudera.org:8080/#/c/22049/3/tests/query_test/test_decoding.py@36
PS3, Line 36: text_dimensio
> It would be nice to also create a table during dataload:
Added dataload from 'functional.alltypes' as a draft.
Since we filter out multibyte charsets, and all data in 'functional.alltypes' is ASCII compatible, this effectively only checks that nothing is broken (which is still useful).
To check decoding itself more thoroughly, we need to generate large tables with plenty of local / special symbols. I've seen multiple designs for that in tests, such as using a generator, pre-created files, copying from existing tables, etc.
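
For illustration only, the generator option could look roughly like the sketch below. It is not part of the patch: the helper names (single_byte_specials, generate_rows, encode_rows), the comma field delimiter and the 'cp1250' charset in the usage note are hypothetical choices, and the sketch assumes the target charset is single-byte and encodes '\n' as one byte.

```python
import random


def single_byte_specials(charset):
    """Return printable non-ASCII characters that `charset` maps to one byte."""
    chars = []
    for b in range(0x80, 0x100):
        try:
            ch = bytes([b]).decode(charset)
        except UnicodeDecodeError:
            # Byte is undefined in this charset or is part of a multibyte sequence.
            continue
        if len(ch) == 1 and ch.isprintable():
            chars.append(ch)
    return chars


def generate_rows(charset, num_rows, max_len=20, seed=0):
    """Yield (id, text) rows mixing ASCII with the charset's special symbols."""
    rng = random.Random(seed)  # fixed seed keeps the generated data reproducible
    alphabet = single_byte_specials(charset) + list("abcXYZ019 ")
    for row_id in range(num_rows):
        length = rng.randint(1, max_len)
        yield row_id, "".join(rng.choice(alphabet) for _ in range(length))


def encode_rows(rows, charset, field_delim=",", line_delim="\n"):
    """Encode rows as delimited text in `charset` (hypothetical file layout)."""
    out = bytearray()
    for row_id, text in rows:
        line = f"{row_id}{field_delim}{text}{line_delim}"
        encoded = line.encode(charset)
        # Single-byte assumption: one byte per character, delimiters included.
        if len(encoded) != len(line):
            raise ValueError(f"{charset} is not single-byte for this row")
        out += encoded
    return bytes(out)
```

For example, encode_rows(generate_rows('cp1250', 100000), 'cp1250') would produce the bytes of a text file that can be written to a table location and compared against the same rows stored as UTF-8.
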
What would be the best option here?
To check with partitions, we need to implement writing first, since Hive has a bug inserting partitions.


http://gerrit.cloudera.org:8080/#/c/22049/3/tests/query_test/test_decoding.py@49
PS3, Line 49:
> These small tables are useful to test encoding, but having a bigger table c
Ack


-- 
To view, visit http://gerrit.cloudera.org:8080/22049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65
Gerrit-Change-Number: 22049
Gerrit-PatchSet: 4
Gerrit-Owner: Mihaly Szjatinya <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Mihaly Szjatinya <[email protected]>
Gerrit-Comment-Date: Sun, 15 Dec 2024 23:35:52 +0000
Gerrit-HasComments: Yes
