Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/23239 )

Change subject: IMPALA-13648: Implement a decoder and an encoder for Byte 
Stream Split encoding
......................................................................


Patch Set 1:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/23239/1/be/src/exec/parquet/parquet-byte-stream-split-decoder.cc
File be/src/exec/parquet/parquet-byte-stream-split-decoder.cc:

http://gerrit.cloudera.org:8080/#/c/23239/1/be/src/exec/parquet/parquet-byte-stream-split-decoder.cc@47
PS1, Line 47:   if (num_values < 0) return -1;
            :   if (stride < getByteSize()) return -1;
            :   if (input_buffer_ == nullptr) return -1;
These could be all DCHECKs to avoid the extra checks in release mode.


http://gerrit.cloudera.org:8080/#/c/23239/1/be/src/exec/parquet/parquet-byte-stream-split-decoder.cc@51
PS1, Line 51:   if (num_values == 0) return 0;
this could be also covered in a DCHECK - or is it valid to call the function 
like this?


http://gerrit.cloudera.org:8080/#/c/23239/1/be/src/exec/parquet/parquet-byte-stream-split-decoder.cc@72
PS1, Line 72:   if (num_values < 0) return -1;
            :   if (input_buffer_ == nullptr) return -1;
Could be DCHECKs


http://gerrit.cloudera.org:8080/#/c/23239/1/be/src/exec/parquet/parquet-byte-stream-split-encoder.h
File be/src/exec/parquet/parquet-byte-stream-split-encoder.h:

http://gerrit.cloudera.org:8080/#/c/23239/1/be/src/exec/parquet/parquet-byte-stream-split-encoder.h@65
PS1, Line 65:     if (sizeof(T) != getByteSize()) { return false; }
            :     if (output_buffer_ == nullptr) { return false; }
Could the be DCHECKs?


http://gerrit.cloudera.org:8080/#/c/23239/1/be/src/exec/parquet/parquet-byte-stream-split-encoder.cc
File be/src/exec/parquet/parquet-byte-stream-split-encoder.cc:

http://gerrit.cloudera.org:8080/#/c/23239/1/be/src/exec/parquet/parquet-byte-stream-split-encoder.cc@28
PS1, Line 28:   if (output_buffer_len < 0) {
            :     return Status("The buffer length must not be negative.");
            :   }
            :   if (output_buffer_len % getByteSize() != 0) {
            :     return Status(
            :         strings::Substitute("The buffer length must be divisible 
by $0.", getByteSize()));
            :   }
            :   if (output_buffer == nullptr) {
            :     return Status("The pointer to the buffer must not be null.");
            :   }
IMO these could be DCHECKs and NewPage deosnát reallz need to return Status.


http://gerrit.cloudera.org:8080/#/c/23239/1/be/src/exec/parquet/parquet-byte-stream-split-encoder.cc@63
PS1, Line 63:   std::vector<uint8_t> byte_list(value_count_ * getByteSize());
This is good for testing, but if we'll actually use it to write Parquet pages 
then it will be better to use buffers of tracked memory.

I think what mainly needed is a static function that does the transformation 
using two buffers and gets value count / byte size as arguments. If the 
argument is const, the compiler can inline byte size similar to the current 
template solution.

How I imagine this to be used in the future:
1. values are collected to an array practically as PLAIN encoding (buffer A) ( 
see 
https://github.com/apache/impala/blob/59fdd7169a4523a2c4916096d550855e49c8a35a/be/src/exec/parquet/hdfs-parquet-table-writer.cc#L545
 )
2. when finalizing, this is rearranged to to another buffer with the same size 
using stream split encoding (buffer B)
3. the stream split encoded buffer is compressed to another buffer (buffer C) - 
without compression stream split encoding is not useful, as plain encoding has 
the same size and is faster to decode ( 
https://github.com/apache/impala/blob/59fdd7169a4523a2c4916096d550855e49c8a35a/be/src/exec/parquet/hdfs-parquet-table-writer.cc#L1051
 )

As buffer B is temporary, we don't win copying the data back to the original 
buffer as it happens in this function.



--
To view, visit http://gerrit.cloudera.org:8080/23239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I71755d992536d70a22b8fdbfee1afce3aec81c26
Gerrit-Change-Number: 23239
Gerrit-PatchSet: 1
Gerrit-Owner: Gabriella Gyorgyevics <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Gabriella Gyorgyevics <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]>
Gerrit-Comment-Date: Tue, 05 Aug 2025 13:52:36 +0000
Gerrit-HasComments: Yes

Reply via email to