[Impala-ASF-CR] WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence files

Mihaly Szjatinya (Code Review) Tue, 11 Feb 2025 13:47:16 -0800

Mihaly Szjatinya has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/22049 )


Change subject: WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence 
files
......................................................................


Patch Set 9:

(9 comments)

- Answered everything except mem management (for next patch).

- Added more self-generating tests.

- Removed previous .tests.

http://gerrit.cloudera.org:8080/#/c/22049/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/22049/8//COMMIT_MSG@19
PS8, Line 19:
> typo
ack


http://gerrit.cloudera.org:8080/#/c/22049/8//COMMIT_MSG@33
PS8, Line 33:
> Complex types do not seem relevant here as those are only supported in Parq
ack


http://gerrit.cloudera.org:8080/#/c/22049/8//COMMIT_MSG@35
PS8, Line 35:
> It doesn't seem that critical to me to test per-partition different encodin
I've already implemented it locally itm.


http://gerrit.cloudera.org:8080/#/c/22049/8//COMMIT_MSG@37
PS8, Line 37:
> This could be done by exposing a c++ function to check compatibility which
ack


http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/exec/hdfs-text-table-writer.cc
File be/src/exec/hdfs-text-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/exec/hdfs-text-table-writer.cc@162
PS8, Line 162:     RETURN_IF_ERROR(encoder_->EncodeBuffer(
             :         rowbatch_stringstream_.str(), rowbatch_string));
> We don't needb to care about splitting multibyte characters because flush o
Correct, added comment to the header, describing the function.


http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/exec/text/hdfs-text-scanner.cc
File be/src/exec/text/hdfs-text-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/exec/text/hdfs-text-scanner.cc@546
PS8, Line 546:     
RETURN_IF_ERROR(decoder_->DecodeBuffer(reinterpret_cast<uint8_t**>(&byte_buffer_ptr_),
             :         &byte_buffer_read_size_));
> optional: for me it looks clearer to have +2 output parameters instead of u
This follows the pattern inside Decompress* functions


http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.h
File be/src/util/char-codec.h:

http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.h@47
PS8, Line 47:   /// HdfsTextTableW
> This could be const IMO.

Agree.

> If line delimiter could change, the question of multi byte > characters would 
> come up again, so char may not be enough  > to store it.

There are custom delimiters but it is not allowed to have multi-byte delimiter. 
The only special case is '\n\r', but that only means that '\n\r' is respected 
while delimiter is still '\n'.

https://github.com/apache/Impala/blob/master/fe/src/main/java/org/apache/impala/analysis/TableDef.java#L929


http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc
File be/src/util/char-codec.cc:

http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc@36
PS8, Line 36: DecodeBuffer
> It would be very nice to have some comment that describes the steps of this
Added function description to the header.
Split into smaller functions.


http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc@47
PS8, Line 47:
> can you add DCHECK_LT(partial_symbol_.size(), MAX_SYMBOL?
Agree, added.



--
To view, visit http://gerrit.cloudera.org:8080/22049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65
Gerrit-Change-Number: 22049
Gerrit-PatchSet: 9
Gerrit-Owner: Mihaly Szjatinya <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Mihaly Szjatinya <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Comment-Date: Tue, 11 Feb 2025 21:47:03 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence files

Reply via email to