Adar Dembo has posted comments on this change. Change subject: WIP: KUDU-463. Add checksumming to cfile ......................................................................
Patch Set 13: (28 comments) http://gerrit.cloudera.org:8080/#/c/6630/13/src/kudu/cfile/cfile-test.cc File src/kudu/cfile/cfile-test.cc: Line 244: if (FLAGS_cfile_write_checksums) { Can you make TestCFile a friend of CFileReader (see the FRIEND_TEST macro) and call HasChecksum() instead? Or make HasChecksum() a public method? Line 330: source->Size(&file_size); RETURN_NOT_OK Line 333: source->Read(0, &data); RETURN_NOT_OK Line 843: source->Size(&file_size); Should ASSERT_OK on this too. PS13, Line 847: 254 Just in case one of the file's bytes actually has the value 254, maybe you should do something more drastic like flip one of the bits? That way there's always a "change" to the byte's value. Bonus points: if it's not too slow, add another loop which selects which of 8 bits in the byte to flip. http://gerrit.cloudera.org:8080/#/c/6630/13/src/kudu/cfile/cfile_reader.cc File src/kudu/cfile/cfile_reader.cc: PS13, Line 56: TAG_FLAG(cfile_verify_checksums, experimental); Should probably default to evolving. Line 137: Status CFileReader::ReadMagicAndLength(uint64_t offset, Slice* slice) { Since this function is now just a block read, I think it's net more readable to remove it and do the block read inline wherever it was needed. Line 210: if (header.size() != header_size || checksum.size() != checksum_size) { Do we need this check now that there are no short reads? Line 215: if (FLAGS_cfile_verify_checksums) { If this is false, why bother reading the checksum out of the block at all? Line 226: RETURN_NOT_OK(block_->Read(off, &header)); Would be cleaner to call ReadV() regardless of whether checksums exist or not, and just pass either a single Slice or two Slices as needed. Line 227: if (header.size() != header_size) { Do we need this check now that there are no short reads? PS13, Line 271: // Read both the header and checksum in one call. : // We read the checksum position in case one exists. : // This is done to avoid the need for a follow up read call. Just to be clear, if there are no checksums, the four extra bytes we're reading: 1. Are definitely not checksum data, but, 2. Are guaranteed to exist. Is that right? And AFAICT, since the footer itself tells us whether the cfile has checksums, the only alternative is to do two separate reads, and we think reading an extra 4 bytes is the better trade-off, right? Line 277: if (footer.size() != footer_size || checksum.size() != checksum_size) { No short reads; do we still need this? PS13, Line 282: // This needs to be done before validating the checksum since the : // incompatible_features flag tells us if a checksum exists at all. What's the point of the footer checksum then? It seems extraordinarily rare that we'll find a footer than can serialize properly but fails its checksum test. PS13, Line 436: if (checksum_size > data_size) { : return Status::Corruption("invalid data size"); : } This effectively forbids zero-length block reads, right? Is that an invariant that we already enforce? Line 461: if (block.size() != data_size || checksum.size() != checksum_size) { No short reads... PS13, Line 466: if (FLAGS_cfile_verify_checksums) { : uint32_t expected_checksum = DecodeFixed32(checksum.data()); : uint32_t checksum_value = crc::Crc32c(block.data(), block.size()); : if (PREDICT_FALSE(checksum_value != expected_checksum)) { : return Status::Corruption( : Substitute("Checksum does not match at offset $0 size $1: $2 vs $3", : ptr.offset(), block.size(), checksum_value, expected_checksum)); : } : } I think I've seen this block of code three times; can you factor it out into a helper function? Something like: Status VerifyChecksum(...): if !FLAGS_cfile_verify_checksums: return Status::OK(); <verify the checksum> Line 476: RETURN_NOT_OK(block_->Read(ptr.offset(), &block)); Let's use ReadV() for both cases, and vary whether we pass a one-slice vector or a two-slice vector. Line 477: if (block.size() != data_size) { No short reads... http://gerrit.cloudera.org:8080/#/c/6630/13/src/kudu/cfile/cfile_reader.h File src/kudu/cfile/cfile_reader.h: Line 188: // Returns true if the file has checksums on the header, footer, and data blocks Nit: terminate with a period. Line 189: bool hasChecksums() const; Nit: should be either has_checksums() or HasChecksums() http://gerrit.cloudera.org:8080/#/c/6630/13/src/kudu/cfile/cfile_writer.cc File src/kudu/cfile/cfile_writer.cc: PS13, Line 65: DEFINE_bool(cfile_write_checksums, false, : "Write CRC32 checksums for each block"); : TAG_FLAG(cfile_write_checksums, experimental); 1. Why default to false? 2. Why experimental and not evolving? Line 186: RETURN_NOT_OK_PREPEND(WriteRawData(Slice(checksum_buf, sizeof(checksum_buf))), Rather than two separate calls to WriteRawData(), could we append the checksum to header_str and write out the whole thing in a single call to WriteRawData()? Line 263: // Prepend a footer checksum. Any idea why we're using block_->Append() here and not WriteRawData() as in the header? Seems like using the same code path would be cleaner, but maybe there's a good reason (i.e. not wanting to update off_). Line 272: RETURN_NOT_OK_PREPEND(block_->Append(footer_str), "Couldn't write footer"); Likewise here, could we combine the checksum and footer_str and make just a single Append() call? Line 494: RETURN_NOT_OK(WriteRawData(data)); Hmm, a little surprised that Todd didn't suggest coalescing these into one syscall too, via WriteV(). I'm guessing it's because we expect to read far more than to write. Still, it seems like a WriteV()-abstraction to use here would be overall useful too. Do you have any interest in doing that? Not as part of this patch though. Line 504: RETURN_NOT_OK(WriteRawData(Slice(crc_buf, sizeof(crc_buf)))); Let's also combine the writes here. It's a little trickier since we need to compute the combined crc32 first and then add an additional slice to out_slices, but it should be doable. http://gerrit.cloudera.org:8080/#/c/6630/13/src/kudu/util/crc-test.cc File src/kudu/util/crc-test.cc: PS13, Line 58: 0xa9421b7 Since this is used 3 times now, store it up front in a "const uint64_t kKnownGoodCrc32" or some such. -- To view, visit http://gerrit.cloudera.org:8080/6630 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I6756834cd7f27af258797a3654a95244abeb0976 Gerrit-PatchSet: 13 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Grant Henke <granthe...@gmail.com> Gerrit-Reviewer: Adar Dembo <a...@cloudera.com> Gerrit-Reviewer: Grant Henke <granthe...@gmail.com> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <mpe...@apache.org> Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-HasComments: Yes