Attila Jeges has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 2:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/6107/1/be/src/exec/read-write-util-test.cc
File be/src/exec/read-write-util-test.cc:

PS1, Line 88:  
> nit: blank space.
Done


http://gerrit.cloudera.org:8080/#/c/6107/1/be/src/exec/read-write-util.h
File be/src/exec/read-write-util.h:

PS1, Line 217: ~val;
> Did you mean to convert this to the absolute value ? If so, you are missing
It should be ~val.


PS1, Line 219: 
             : 
             : inline int64_t ReadWriteUtil::PutVLong(int64_t val, uint8_t* 
buf) {
             :   int64_t num_bytes = VLongRequiredBytes(val);
             : 
             :   if (num_bytes == 1) {
             :     // store the value itself instead of the
> Typo. I meant bit scan from the MSB. __builtin_clzl()
Done


PS1, Line 240:     buf[i] = (val >> (8 * (num_bytes - i - 1))) & 0xFF;
             :   }
> Is this behavior documented somewhere ? Is it part of the sequence file sta
Yes, it is part of the sequence file standard and it can be read back by Hive. 
I've added an E2E test to confirm this.

Sequence file documentation: 
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/SequenceFile.html

The Zero-Compressed Integer format is described here:
https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/io/WritableUtils.html#writeVLong(java.io.DataOutput,
 long)

The details are in the actual source code (WriteVLong() method):
https://hadoop.apache.org/docs/r2.7.2/api/src-html/org/apache/hadoop/io/WritableUtils.html


PS1, Line 246: Util::PutVInt(int32
> Big Endianness ?
Correct. Changed the comment.


http://gerrit.cloudera.org:8080/#/c/6107/1/be/src/util/compress.cc
File be/src/util/compress.cc:

PS1, Line 209:  each preceded with an integer for
> , each preceded with an integer for its size.
Done


PS1, Line 210: // For testing purposes we are going to generate two blocks if 
input_length >= 4K
> That seems a bit inefficient to split into two blocks for input_length == 2
Done,changed the threshold to 4K


PS1, Line 235: ReadWriteUtil::
> The declaration can be moved into the loop body.
Done


http://gerrit.cloudera.org:8080/#/c/6107/1/testdata/workloads/functional-query/queries/QueryTest/seq-writer.test
File testdata/workloads/functional-query/queries/QueryTest/seq-writer.test:

Line 93: SET COMPRESSION_CODEC=NONE;
> what about the other codecs?
This test case runs for 8.5 seconds.

I've added test cases for other codecs with RECORD/BLOCK compression mode.


-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <atti...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <atti...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-HasComments: Yes

Reply via email to