Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/23239 )
Change subject: IMPALA-13648: Implement a decoder and an encoder for Byte Stream Split encoding ...................................................................... Patch Set 1: (6 comments) http://gerrit.cloudera.org:8080/#/c/23239/1/be/src/exec/parquet/parquet-byte-stream-split-decoder.cc File be/src/exec/parquet/parquet-byte-stream-split-decoder.cc: http://gerrit.cloudera.org:8080/#/c/23239/1/be/src/exec/parquet/parquet-byte-stream-split-decoder.cc@47 PS1, Line 47: if (num_values < 0) return -1; : if (stride < getByteSize()) return -1; : if (input_buffer_ == nullptr) return -1; These could be all DCHECKs to avoid the extra checks in release mode. http://gerrit.cloudera.org:8080/#/c/23239/1/be/src/exec/parquet/parquet-byte-stream-split-decoder.cc@51 PS1, Line 51: if (num_values == 0) return 0; this could be also covered in a DCHECK - or is it valid to call the function like this? http://gerrit.cloudera.org:8080/#/c/23239/1/be/src/exec/parquet/parquet-byte-stream-split-decoder.cc@72 PS1, Line 72: if (num_values < 0) return -1; : if (input_buffer_ == nullptr) return -1; Could be DCHECKs http://gerrit.cloudera.org:8080/#/c/23239/1/be/src/exec/parquet/parquet-byte-stream-split-encoder.h File be/src/exec/parquet/parquet-byte-stream-split-encoder.h: http://gerrit.cloudera.org:8080/#/c/23239/1/be/src/exec/parquet/parquet-byte-stream-split-encoder.h@65 PS1, Line 65: if (sizeof(T) != getByteSize()) { return false; } : if (output_buffer_ == nullptr) { return false; } Could the be DCHECKs? http://gerrit.cloudera.org:8080/#/c/23239/1/be/src/exec/parquet/parquet-byte-stream-split-encoder.cc File be/src/exec/parquet/parquet-byte-stream-split-encoder.cc: http://gerrit.cloudera.org:8080/#/c/23239/1/be/src/exec/parquet/parquet-byte-stream-split-encoder.cc@28 PS1, Line 28: if (output_buffer_len < 0) { : return Status("The buffer length must not be negative."); : } : if (output_buffer_len % getByteSize() != 0) { : return Status( : strings::Substitute("The buffer length must be divisible by $0.", getByteSize())); : } : if (output_buffer == nullptr) { : return Status("The pointer to the buffer must not be null."); : } IMO these could be DCHECKs and NewPage deosnát reallz need to return Status. http://gerrit.cloudera.org:8080/#/c/23239/1/be/src/exec/parquet/parquet-byte-stream-split-encoder.cc@63 PS1, Line 63: std::vector<uint8_t> byte_list(value_count_ * getByteSize()); This is good for testing, but if we'll actually use it to write Parquet pages then it will be better to use buffers of tracked memory. I think what mainly needed is a static function that does the transformation using two buffers and gets value count / byte size as arguments. If the argument is const, the compiler can inline byte size similar to the current template solution. How I imagine this to be used in the future: 1. values are collected to an array practically as PLAIN encoding (buffer A) ( see https://github.com/apache/impala/blob/59fdd7169a4523a2c4916096d550855e49c8a35a/be/src/exec/parquet/hdfs-parquet-table-writer.cc#L545 ) 2. when finalizing, this is rearranged to to another buffer with the same size using stream split encoding (buffer B) 3. the stream split encoded buffer is compressed to another buffer (buffer C) - without compression stream split encoding is not useful, as plain encoding has the same size and is faster to decode ( https://github.com/apache/impala/blob/59fdd7169a4523a2c4916096d550855e49c8a35a/be/src/exec/parquet/hdfs-parquet-table-writer.cc#L1051 ) As buffer B is temporary, we don't win copying the data back to the original buffer as it happens in this function. -- To view, visit http://gerrit.cloudera.org:8080/23239 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I71755d992536d70a22b8fdbfee1afce3aec81c26 Gerrit-Change-Number: 23239 Gerrit-PatchSet: 1 Gerrit-Owner: Gabriella Gyorgyevics <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Gabriella Gyorgyevics <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]> Gerrit-Comment-Date: Tue, 05 Aug 2025 13:52:36 +0000 Gerrit-HasComments: Yes
