Attila Jeges has posted comments on this change. Change subject: IMPALA-3079: Fix sequence file writer ......................................................................
Patch Set 5: (10 comments) http://gerrit.cloudera.org:8080/#/c/6107/4/be/src/exec/hdfs-sequence-table-writer.cc File be/src/exec/hdfs-sequence-table-writer.cc: PS4, Line 179: : > Is this the reason the old sequence file writer created corrupted files ? The old sequence file writer had multiple issues: 1. ReadWriteUtil::VLongRequiredBytes() and ReadWriteUtil::PutVLong() were broken. As a result, Impala could not read back uncompressed sequence files created by Impala (see be/src/exec/read-write-util.h). 2. KEY_CLASS_NAME was missing from the sequence file header. As a result, Hive could not read back uncompressed sequence files created by Impala (see be/src/exec/hdfs-sequence-table-writer.cc). 3. Keys were missing from record-compressed sequence files. Hive could not read back record-compressed sequence files created by Impala (see be/src/exec/hdfs-sequence-table-writer.cc). 4. In some cases the wrong Record-compression flag was written to the sequence file header. As a result, Hive could not read back record-compressed sequence files created by Impala (see be/src/exec/hdfs-sequence-table-writer.cc). 5.Impala added 'sync_marker' instead of 'neg1_sync_marker' to the beginning of blocks in block-compressed sequence files. Hive could not read these files back. 6. Impala created block-compressed files with: - empty key-lengths block (L176) - empty keys block (L177) - empty value-lengths block (L180) This resulted in invalid block-compressed sequence files that Hive could not read back (see HdfsSequenceTableWriter::WriteCompressedBlock()). PS4, Line 139: _cast<const uint8_t*>(KEY_CLASS_NAME)); > nit: indent 4. Same below. Done Line 191: record.WriteBytes(output_length, output); > It seems a bit unfortunate that we don't free the temp buffer (i.e. output) Added FreeAll to the end of the 'Flush()' function. PS4, Line 193: // Output compressed keys block-size & compressed keys block. : // The keys block contains "\0\0\0\0" byte sequence as a key for each row (this is what : // Hive does). > Does not writing key-lengths block and key block prevent the file from bein Yes, Hive failed with an exception when I tried that. http://gerrit.cloudera.org:8080/#/c/6107/4/be/src/exec/hdfs-sequence-table-writer.h File be/src/exec/hdfs-sequence-table-writer.h: Line 29: > Would be good to add the details of the Sequence file's layout as top level Done http://gerrit.cloudera.org:8080/#/c/6107/3/be/src/exec/read-write-util.h File be/src/exec/read-write-util.h: Line 230: // For more information, see the documentation for 'WritableUtils.writeVLong()' method: > DCHECK_GE(num_bytes, 2); Done PS3, Line 233: nt64_t num_b > May also want to state that the source of this behavior. Done http://gerrit.cloudera.org:8080/#/c/6107/3/be/src/util/compress.cc File be/src/util/compress.cc: Line 248: outp += size; > DCHECK_LE(outp - out_buffer_, length); Done http://gerrit.cloudera.org:8080/#/c/6107/3/testdata/workloads/functional-query/queries/QueryTest/seq-writer.test File testdata/workloads/functional-query/queries/QueryTest/seq-writer.test: Line 212: stored as SEQUENCEFILE; > May be helpful to also add a test for writing empty file: I tried this and it doesn't write an empty file. It doesn't create a file at all. Probably there's no easy way to force the sequence file writer to create an empty-file. http://gerrit.cloudera.org:8080/#/c/6107/3/tests/query_test/test_compressed_formats.py File tests/query_test/test_compressed_formats.py: PS3, Line 170: # Read it back in Impala : output = self.client.execute('select count(*) from %s' % table_name) : assert '16541' == output.get_data() : # Read it back in Hive : output = self.run_stmt_in_hive('select count(*) from %s' % table_name) : assert '16541' == output.split('\n')[1] : : def test_avro_writer(self, vector): : self.run_test_case('QueryTest/avro-wri > Doesn't this duplicate the second test ? May help to test with empty file a The 2nd test is for record-compressed sequence files while the 3rd is for block-compressed seq files. Added tests for files greater than 4K and less than 4K. I could not figure out how to create an empty file. -- To view, visit http://gerrit.cloudera.org:8080/6107 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila Jeges <atti...@cloudera.com> Gerrit-Reviewer: Attila Jeges <atti...@cloudera.com> Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com> Gerrit-Reviewer: Michael Ho <k...@cloudera.com> Gerrit-HasComments: Yes