Attila Jeges has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 5:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/6107/4/be/src/exec/hdfs-sequence-table-writer.cc
File be/src/exec/hdfs-sequence-table-writer.cc:

PS4, Line 179: 
             : 
> Is this the reason the old sequence file writer created corrupted files ?
The old sequence file writer had multiple issues:

1. ReadWriteUtil::VLongRequiredBytes() and ReadWriteUtil::PutVLong() were 
broken. As a result, Impala could not read back uncompressed sequence files 
created by Impala (see be/src/exec/read-write-util.h).

2. KEY_CLASS_NAME was missing from the sequence file header. As a result, Hive 
could not read back uncompressed sequence files created by Impala (see 
be/src/exec/hdfs-sequence-table-writer.cc).

3. Keys were missing from record-compressed sequence files. As a result, Hive 
could not read back record-compressed sequence files created by Impala (see 
be/src/exec/hdfs-sequence-table-writer.cc).

4. In some cases the wrong Record-compression flag was written to the sequence 
file header. As a result, Hive could not read back record-compressed sequence 
files created by Impala (see be/src/exec/hdfs-sequence-table-writer.cc).

5. Impala added 'sync_marker' instead of 'neg1_sync_marker' at the beginning of 
blocks in block-compressed sequence files. As a result, Hive could not read 
these files back.

6. Impala created block-compressed files with:
- empty key-lengths block (L176)
- empty keys block (L177)
- empty value-lengths block (L180)

This resulted in invalid block-compressed sequence files that Hive could not 
read back (see HdfsSequenceTableWriter::WriteCompressedBlock()).
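For reference, item 1 concerns Hadoop's variable-length integer encoding. Below 
is a minimal sketch of the encoding that WritableUtils.writeVLong() produces 
(and that ReadWriteUtil::PutVLong() has to match for Hive interoperability); 
the standalone WriteVLong function here is illustrative, not Impala's actual 
signature:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of the Hadoop WritableUtils.writeVLong() encoding:
// - values in [-112, 127] are written as a single raw byte;
// - otherwise the first byte encodes the sign and the byte count
//   (-113..-120 for positive, -121..-128 for negative values),
//   followed by the magnitude in big-endian order (negative values
//   are stored as their one's complement).
std::vector<uint8_t> WriteVLong(int64_t v) {
  std::vector<uint8_t> out;
  if (v >= -112 && v <= 127) {
    out.push_back(static_cast<uint8_t>(v));
    return out;
  }
  int len = -112;
  uint64_t u = static_cast<uint64_t>(v);
  if (v < 0) {
    u = ~u;       // one's complement
    len = -120;
  }
  // One decrement per significant byte of the magnitude.
  for (uint64_t tmp = u; tmp != 0; tmp >>= 8) --len;
  out.push_back(static_cast<uint8_t>(len));
  int num_bytes = (len < -120) ? -(len + 120) : -(len + 112);
  for (int idx = num_bytes; idx != 0; --idx) {
    out.push_back(static_cast<uint8_t>((u >> ((idx - 1) * 8)) & 0xFF));
  }
  return out;
}
```

For example, 127 encodes as the single byte 0x7F, while 128 encodes as the two 
bytes 0x8F 0x80.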


PS4, Line 139: _cast<const uint8_t*>(KEY_CLASS_NAME));
> nit: indent 4. Same below.
Done


Line 191:   record.WriteBytes(output_length, output);
> It seems a bit unfortunate that we don't free the temp buffer (i.e. output)
Added FreeAll to the end of the 'Flush()' function.


PS4, Line 193:   // Output compressed keys block-size & compressed keys block.
             :   // The keys block contains "\0\0\0\0" byte sequence as a key
             :   // for each row (this is what Hive does).
> Does not writing key-lengths block and key block prevent the file from bein
Yes, Hive failed with an exception when I tried that.
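To make the block layout concrete, here is an illustrative sketch of how one 
record block of a block-compressed SequenceFile is laid out per the Hadoop 
format. BuildRecordBlock is a hypothetical helper, not Impala's actual 
WriteCompressedBlock(); the "compressed" sub-blocks are passed in pre-built, 
and block sizes are assumed small enough (<= 127) that the VLong length prefix 
is a single raw byte:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative layout of one record block in a block-compressed SequenceFile.
std::vector<uint8_t> BuildRecordBlock(const std::vector<uint8_t>& sync_marker,
                                      int num_rows,
                                      const std::vector<uint8_t>& key_lengths_block,
                                      const std::vector<uint8_t>& keys_block,
                                      const std::vector<uint8_t>& value_lengths_block,
                                      const std::vector<uint8_t>& values_block) {
  assert(sync_marker.size() == 16);
  std::vector<uint8_t> out;
  // Sync escape: -1 as a 4-byte big-endian int, then the 16-byte sync marker.
  // (Writing the raw sync marker without the 0xFFFFFFFF escape was issue #5.)
  for (int i = 0; i < 4; ++i) out.push_back(0xFF);
  out.insert(out.end(), sync_marker.begin(), sync_marker.end());

  auto append_vlong = [&out](int64_t v) {
    assert(v >= 0 && v <= 127);  // single-byte VLong range only, for brevity
    out.push_back(static_cast<uint8_t>(v));
  };
  auto append_block = [&](const std::vector<uint8_t>& b) {
    append_vlong(static_cast<int64_t>(b.size()));  // compressed block size
    out.insert(out.end(), b.begin(), b.end());     // compressed block data
  };

  append_vlong(num_rows);             // uncompressed record count
  append_block(key_lengths_block);    // must not be empty (issue #6)
  append_block(keys_block);           // e.g. "\0\0\0\0" per row, as Hive writes
  append_block(value_lengths_block);  // must not be empty (issue #6)
  append_block(values_block);
  return out;
}
```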


http://gerrit.cloudera.org:8080/#/c/6107/4/be/src/exec/hdfs-sequence-table-writer.h
File be/src/exec/hdfs-sequence-table-writer.h:

Line 29: 
> Would be good to add the details of the Sequence file's layout as top level
Done


http://gerrit.cloudera.org:8080/#/c/6107/3/be/src/exec/read-write-util.h
File be/src/exec/read-write-util.h:

Line 230: // For more information, see the documentation for 
'WritableUtils.writeVLong()' method:
> DCHECK_GE(num_bytes, 2);
Done


PS3, Line 233: nt64_t num_b
> May also want to state the source of this behavior.
Done


http://gerrit.cloudera.org:8080/#/c/6107/3/be/src/util/compress.cc
File be/src/util/compress.cc:

Line 248:     outp += size;
> DCHECK_LE(outp - out_buffer_,  length);
Done


http://gerrit.cloudera.org:8080/#/c/6107/3/testdata/workloads/functional-query/queries/QueryTest/seq-writer.test
File testdata/workloads/functional-query/queries/QueryTest/seq-writer.test:

Line 212: stored as SEQUENCEFILE;
> May be helpful to also add a test for writing empty file:
I tried this and it doesn't write an empty file; it doesn't create a file at 
all. There's probably no easy way to force the sequence file writer to create 
an empty file.


http://gerrit.cloudera.org:8080/#/c/6107/3/tests/query_test/test_compressed_formats.py
File tests/query_test/test_compressed_formats.py:

PS3, Line 170:       # Read it back in Impala
             :       output = self.client.execute('select count(*) from %s' % table_name)
             :       assert '16541' == output.get_data()
             :       # Read it back in Hive
             :       output = self.run_stmt_in_hive('select count(*) from %s' % table_name)
             :       assert '16541' == output.split('\n')[1]
             : 
             :   def test_avro_writer(self, vector):
             :     self.run_test_case('QueryTest/avro-wri
> Doesn't this duplicate the second test ? May help to test with empty file a
The 2nd test is for record-compressed sequence files, while the 3rd is for 
block-compressed sequence files.
Added tests for files larger than 4KB and smaller than 4KB. I could not figure 
out how to create an empty file.


-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <atti...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <atti...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-HasComments: Yes
