github-actions[bot] commented on code in PR #27383:
URL: https://github.com/apache/doris/pull/27383#discussion_r1405569997
##########
be/src/exec/decompressor.cpp:
##########
@@ -339,66 +341,111 @@ Status Lz4BlockDecompressor::init() {
return Status::OK();
}
+// hadoop-lz4 is not compatible with lz4 CLI.
+// hadoop-lz4 uses a block compression scheme based on LZ4. According to the Hadoop sources
+// (BlockCompressorStream.java and BlockDecompressorStream.java), the input is divided into
+// several large chunks of data. Each chunk records the original length of the current large
+// data chunk, followed by one or more small data blocks, each prefixed by the compressed
+// length of that small data block.
+// example:
+//
+// A large data chunk is divided into three small blocks:
+// OriginData:   | small block1 | small block2 | small block3 |
+// CompressData: <A [B1 compress(small block1)] [B2 compress(small block2)] [B3 compress(small block3)]>
+//
+// A : original (uncompressed) length of the current large data chunk. sizeof(A) = 4 bytes.
+// A = length(small block1) + length(small block2) + length(small block3)
+// Bx : compressed length of small data block x. sizeof(Bx) = 4 bytes.
+// Bx = length(compress(small blockx))
+//
+// the hadoop lz4codec source code can be found here:
+//
https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask/src/main/native/src/codec/Lz4Codec.cc
+//
+// For more details, refer to https://issues.apache.org/jira/browse/HADOOP-12990
Status Lz4BlockDecompressor::decompress(uint8_t* input, size_t input_len,
size_t* input_bytes_read,
uint8_t* output, size_t output_max_len,
size_t* decompressed_len, bool*
stream_end,
size_t* more_input_bytes, size_t*
more_output_bytes) {
- uint8_t* src = input;
- size_t remaining_input_size = input_len;
- int64_t uncompressed_total_len = 0;
- *input_bytes_read = 0;
+ auto* input_ptr = input;
+ auto* output_ptr = output;
- // The hadoop lz4 codec is as:
- // <4 byte big endian uncompressed size>
- // <4 byte big endian compressed size>
- // <lz4 compressed block>
- // ....
- // <4 byte big endian uncompressed size>
- // <4 byte big endian compressed size>
- // <lz4 compressed block>
- //
- // See:
- //
https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask/src/main/native/src/codec/Lz4Codec.cc
- while (remaining_input_size > 0) {
- // Read uncompressed size
- uint32_t uncompressed_block_len = Decompressor::_read_int32(src);
- int64_t remaining_output_size = output_max_len -
uncompressed_total_len;
- if (remaining_output_size < uncompressed_block_len) {
- // Need more output buffer
- *more_output_bytes = uncompressed_block_len -
remaining_output_size;
- break;
+ while (input_len > 0) {
+        // if decompression fails, fall back to the beginning of the large block
+ auto large_block_input_ptr = input_ptr;
Review Comment:
warning: 'auto large_block_input_ptr' can be declared as 'auto
*large_block_input_ptr' [readability-qualified-auto]
```suggestion
auto *large_block_input_ptr = input_ptr;
```
##########
be/src/exec/decompressor.cpp:
##########
@@ -339,66 +341,111 @@
return Status::OK();
}
+// hadoop-lz4 is not compatible with lz4 CLI.
+// hadoop-lz4 uses a block compression scheme based on LZ4. According to the Hadoop sources
+// (BlockCompressorStream.java and BlockDecompressorStream.java), the input is divided into
+// several large chunks of data. Each chunk records the original length of the current large
+// data chunk, followed by one or more small data blocks, each prefixed by the compressed
+// length of that small data block.
+// example:
+//
+// A large data chunk is divided into three small blocks:
+// OriginData:   | small block1 | small block2 | small block3 |
+// CompressData: <A [B1 compress(small block1)] [B2 compress(small block2)] [B3 compress(small block3)]>
+//
+// A : original (uncompressed) length of the current large data chunk. sizeof(A) = 4 bytes.
+// A = length(small block1) + length(small block2) + length(small block3)
+// Bx : compressed length of small data block x. sizeof(Bx) = 4 bytes.
+// Bx = length(compress(small blockx))
+//
+// the hadoop lz4codec source code can be found here:
+//
https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask/src/main/native/src/codec/Lz4Codec.cc
+//
+// For more details, refer to https://issues.apache.org/jira/browse/HADOOP-12990
Status Lz4BlockDecompressor::decompress(uint8_t* input, size_t input_len,
size_t* input_bytes_read,
uint8_t* output, size_t output_max_len,
size_t* decompressed_len, bool*
stream_end,
size_t* more_input_bytes, size_t*
more_output_bytes) {
- uint8_t* src = input;
- size_t remaining_input_size = input_len;
- int64_t uncompressed_total_len = 0;
- *input_bytes_read = 0;
+ auto* input_ptr = input;
+ auto* output_ptr = output;
- // The hadoop lz4 codec is as:
- // <4 byte big endian uncompressed size>
- // <4 byte big endian compressed size>
- // <lz4 compressed block>
- // ....
- // <4 byte big endian uncompressed size>
- // <4 byte big endian compressed size>
- // <lz4 compressed block>
- //
- // See:
- //
https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask/src/main/native/src/codec/Lz4Codec.cc
- while (remaining_input_size > 0) {
- // Read uncompressed size
- uint32_t uncompressed_block_len = Decompressor::_read_int32(src);
- int64_t remaining_output_size = output_max_len -
uncompressed_total_len;
- if (remaining_output_size < uncompressed_block_len) {
- // Need more output buffer
- *more_output_bytes = uncompressed_block_len -
remaining_output_size;
- break;
+ while (input_len > 0) {
+        // if decompression fails, fall back to the beginning of the large block
+ auto large_block_input_ptr = input_ptr;
+ auto large_block_output_ptr = output_ptr;
Review Comment:
warning: 'auto large_block_output_ptr' can be declared as 'auto
*large_block_output_ptr' [readability-qualified-auto]
```suggestion
auto *large_block_output_ptr = output_ptr;
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]