Will Berkeley has posted comments on this change. Change subject: KUDU-1398 CFile index blocks can store shortest separating prefix ......................................................................
Patch Set 4: (15 comments) http://gerrit.cloudera.org:8080/#/c/3304/4//COMMIT_MSG Commit Message: PS4, Line 17: compatibility with bloomfile > what's this mean? There was a bug making bloom files not work. The trick is that bloom files call AppendRawBlock directly instead of FinishCurDataBlock. I altered AppendRawBlock to accept the last key as well as the first key, so now a BloomFileWriter maintains its own last key, just like it maintains its own first key. I feel like the change is a little awkward though, so I'd appreciate any ideas on how to improve it. http://gerrit.cloudera.org:8080/#/c/3304/4/src/kudu/cfile/binary_dict_block.cc File src/kudu/cfile/binary_dict_block.cc: Line 131: last_key_.assign_copy(last->data(), last->size()); > instead of copying this every time, could we use GetLast on the code word b Doesn't the dictionary in the block builder map from string to codeword? The last key we want is the last string in the block, not the codeword stored for it, so we can't look it up in the dictionary. I did improve this, though, by looking up in the dict block via GetKeyAtIdx, a method which was added to plain encoding per one of your comments. This means I had to drop CHECK(finished_) from it (but kept it in GetFirstKey and GetLastKey). Does that solution seem reasonable to you? http://gerrit.cloudera.org:8080/#/c/3304/4/src/kudu/cfile/binary_plain_block.cc File src/kudu/cfile/binary_plain_block.cc: Line 132: Status BinaryPlainBlockBuilder::GetLastKey(void *key_void) const { > can we refactor this to share code with GetFirstKey? eg a GetKeyAtIndex(int Done http://gerrit.cloudera.org:8080/#/c/3304/4/src/kudu/cfile/cfile_util.cc File src/kudu/cfile/cfile_util.cc: Line 126: void SeparatingKey(const faststring& left, const faststring& right, faststring *target) { > can you add some unit tests for this function? (eg the examples from the co Done Line 127: DCHECK(memcmp(left.data(), right.data(), min(left.size(), right.size())) <= > I think DCHECK_LE(Slice(left).compare(Slice(right), 0); is probably more re Done Line 130: if (left.size() == 0) { > .empty() Done Line 131: // special case for first block: use the whole key > not sure why this is the case Bloomfiles work with this now; no more special case needed. Line 135: target->assign_copy(right.data(), cpl == right.size() ? cpl : cpl + 1); > hm, this is a slightly different algorithm than the one used in gutil/strin I didn't realize that function existed! Still, it finds a separator in the range left <= * < right, and we require it in the range left < * <= right. I thought about it a bit and I don't see any advantage. http://gerrit.cloudera.org:8080/#/c/3304/4/src/kudu/cfile/cfile_util.h File src/kudu/cfile/cfile_util.h: Line 63: bool deltafile; > I think I'd prefer this be called something like 'optimize_index_keys' or s I'm going to keep it simple for now and just rename, but I'll consider implementing your suggestion for delta keys a little later. Line 110: // Computes the shortest key satisfying left < key < right and places it into target. > should clarify whether 'target' may be the same faststring as 'left' or 'ri Superseded by another change. PS4, Line 111: const faststring& left, const faststring& right > may make sense to use StringPiece or Slice here rather than specifically fa Agreed. Converted to use Slice. Line 111: void SeparatingKey(const faststring& left, const faststring& right, faststring *target); > rename to GetSeparatingKey Done http://gerrit.cloudera.org:8080/#/c/3304/4/src/kudu/cfile/cfile_writer.cc File src/kudu/cfile/cfile_writer.cc: Line 141: last_key_.clear(); > no need (it's initialized to clear) Done Line 437: SeparatingKey(last_key_, tmp_buf_, &sep_buf); > maybe make SeparatingKey actually just shorten tmp_buf_ in place to avoid t Done http://gerrit.cloudera.org:8080/#/c/3304/4/src/kudu/cfile/rle_block.h File src/kudu/cfile/rle_block.h: Line 265: if (count_ > 0) { > usually we put the error case first I'm being consistent with GetFirstKey's (preexisting) implementation. If you like, I'll adjust both of them to handle error case first. -- To view, visit http://gerrit.cloudera.org:8080/3304 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I68ae9146fabd4a19b17d103d118d2d60e28bb315 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Will Berkeley <wdberke...@gmail.com> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-Reviewer: Will Berkeley <wdberke...@gmail.com> Gerrit-HasComments: Yes