Mihaly Szjatinya has posted comments on this change. ( http://gerrit.cloudera.org:8080/22049 )
Change subject: WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence files ...................................................................... Patch Set 9: (9 comments) - Answered everything except mem management (for next patch). - Added more self-generating tests. - Removed previous .tests. http://gerrit.cloudera.org:8080/#/c/22049/8//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/22049/8//COMMIT_MSG@19 PS8, Line 19: > typo ack http://gerrit.cloudera.org:8080/#/c/22049/8//COMMIT_MSG@33 PS8, Line 33: > Complex types do not seem relevant here as those are only supported in Parq ack http://gerrit.cloudera.org:8080/#/c/22049/8//COMMIT_MSG@35 PS8, Line 35: > It doesn't seem that critical to me to test per-partition different encodin I've already implemented it locally itm. http://gerrit.cloudera.org:8080/#/c/22049/8//COMMIT_MSG@37 PS8, Line 37: > This could be done by exposing a c++ function to check compatibility which ack http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/exec/hdfs-text-table-writer.cc File be/src/exec/hdfs-text-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/exec/hdfs-text-table-writer.cc@162 PS8, Line 162: RETURN_IF_ERROR(encoder_->EncodeBuffer( : rowbatch_stringstream_.str(), rowbatch_string)); > We don't needb to care about splitting multibyte characters because flush o Correct, added comment to the header, describing the function. http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/exec/text/hdfs-text-scanner.cc File be/src/exec/text/hdfs-text-scanner.cc: http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/exec/text/hdfs-text-scanner.cc@546 PS8, Line 546: RETURN_IF_ERROR(decoder_->DecodeBuffer(reinterpret_cast<uint8_t**>(&byte_buffer_ptr_), : &byte_buffer_read_size_)); > optional: for me it looks clearer to have +2 output parameters instead of u This follows the pattern inside Decompress* functions http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.h File be/src/util/char-codec.h: http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.h@47 PS8, Line 47: /// HdfsTextTableW > This could be const IMO. Agree. > If line delimiter could change, the question of multi byte > characters would > come up again, so char may not be enough > to store it. There are custom delimiters but it is not allowed to have multi-byte delimiter. The only special case is '\n\r', but that only means that '\n\r' is respected while delimiter is still '\n'. https://github.com/apache/Impala/blob/master/fe/src/main/java/org/apache/impala/analysis/TableDef.java#L929 http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc File be/src/util/char-codec.cc: http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc@36 PS8, Line 36: DecodeBuffer > It would be very nice to have some comment that describes the steps of this Added function description to the header. Split into smaller functions. http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc@47 PS8, Line 47: > can you add DCHECK_LT(partial_symbol_.size(), MAX_SYMBOL? Agree, added. -- To view, visit http://gerrit.cloudera.org:8080/22049 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65 Gerrit-Change-Number: 22049 Gerrit-PatchSet: 9 Gerrit-Owner: Mihaly Szjatinya <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Mihaly Szjatinya <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Comment-Date: Tue, 11 Feb 2025 21:47:03 +0000 Gerrit-HasComments: Yes
