Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/20496 )
Change subject: IMPALA-12373: Small String Optimization for StringValue ...................................................................... Patch Set 4: (67 comments) http://gerrit.cloudera.org:8080/#/c/20496/4//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/20496/4//COMMIT_MSG@15 PS4, Line 15: byte Nit: bytes. http://gerrit.cloudera.org:8080/#/c/20496/4//COMMIT_MSG@16 PS4, Line 16: byte Nit: bytes. http://gerrit.cloudera.org:8080/#/c/20496/4//COMMIT_MSG@19 PS4, Line 19: between 'ptr' and 'len' It also means that there is no padding _after_ 'len': if it wasn't packed, 4 bytes would be needed after 'len' so that 'ptr' would be aligned on an 8-byte boundary in an array. http://gerrit.cloudera.org:8080/#/c/20496/4//COMMIT_MSG@20 PS4, Line 20: byte Nit: bytes. http://gerrit.cloudera.org:8080/#/c/20496/4//COMMIT_MSG@40 PS4, Line 40: we can also smallify all strings in a tuple at : once. TODO: The description of this use case is not clear to me. http://gerrit.cloudera.org:8080/#/c/20496/4//COMMIT_MSG@59 PS4, Line 59: 6.5 columns It's not clear to me what 6.5 columns means? Is it that in the 7th column half of the strings are small and the other half are large? http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/codegen/llvm-codegen-test.cc File be/src/codegen/llvm-codegen-test.cc: http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/codegen/llvm-codegen-test.cc@365 PS4, Line 365: llvm::Function* str_setlen_fn = codegen->GetFunction( If we retrieve 'str_ptr_fn' and 'str_len_fn' at the beginning of the function, we could do the same with 'str_setlen_fn', i.e. move it to the beginning. Alternatively we could move 'str_len_fn' to just before its usage on L362, but I usually take the first approach with IR functions. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/analytic-eval-node.cc File be/src/exec/analytic-eval-node.cc: http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/analytic-eval-node.cc@397 PS4, Line 397: StringValue::SimpleString s = sv->ToSimpleString(); Do we need this conversion? Couldn't we just use sv->Len() and sv->Ptr()? http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/hdfs-text-table-writer.cc File be/src/exec/hdfs-text-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/hdfs-text-table-writer.cc@167 PS4, Line 167: StringValue::SimpleString s = str_val->ToSimpleString(); Can't we use str_val->Len() and str_val->Ptr()? http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/kudu/kudu-util-ir.cc File be/src/exec/kudu/kudu-util-ir.cc: http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/kudu/kudu-util-ir.cc@59 PS4, Line 59: StringValue::SimpleString s = sv->ToSimpleString(); Can't we use sv->Len() and sv->Ptr()? http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/kudu/kudu-util-ir.cc@72 PS4, Line 72: StringValue::SimpleString s = sv->ToSimpleString(); Can't we use sv->Len() and sv->Ptr()? http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/orc/hdfs-orc-scanner.cc File be/src/exec/orc/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/orc/hdfs-orc-scanner.cc@1175 PS4, Line 1175: StringValue::SimpleString s = sv->ToSimpleString(); Can't we use sv->Len() and sv->Ptr()? http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/parquet/parquet-column-stats.cc File be/src/exec/parquet/parquet-column-stats.cc: http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/parquet/parquet-column-stats.cc@417 PS4, Line 417: StringValue::SimpleString value_s = value->ToSimpleString(); Can't we use value->Len() and value->Ptr()? http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/parquet/parquet-common.h File be/src/exec/parquet/parquet-common.h: http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/parquet/parquet-common.h@560 PS4, Line 560: StringValue::SimpleString s = v.ToSimpleString(); Can't we use v->Len() and v->Ptr()? http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/parquet/parquet-common.h@573 PS4, Line 573: v->Clear(); Do we call Clear() to ensure the StringValue is not small? Instead we could use Assign() (instead of SetLen() and SetPtr()), that's probably more straightforward and solves the same problem. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/text-converter.inline.h File be/src/exec/text-converter.inline.h: http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/text-converter.inline.h@95 PS4, Line 95: StringValue::SimpleString s = str.ToSimpleString(); Can't we use str->Len() and str->Ptr()? http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/text-converter.inline.h@114 PS4, Line 114: StringValue::SimpleString s = str.ToSimpleString(); Can't we use str->Len() and str->Ptr()? http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/experiments/string-search-sse.h File be/src/experiments/string-search-sse.h: http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/experiments/string-search-sse.h@60 PS4, Line 60: StringValue::SimpleString haystack_s = haystack.ToSimpleString(); Can't we use haystack->Len() and haystack->Ptr()? http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exprs/string-functions-ir.cc File be/src/exprs/string-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exprs/string-functions-ir.cc@525 PS4, Line 525: StringValue::SimpleString haystack_s = haystack.ToSimpleString(); Can't we use haystack->Len() and haystack->Ptr()? http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/buffered-tuple-stream-test.cc File be/src/runtime/buffered-tuple-stream-test.cc: http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/buffered-tuple-stream-test.cc@235 PS4, Line 235: Assign We don't use sv.SetPtr() because it may be a small string? If so we could add a comment to make it clear. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/buffered-tuple-stream.cc File be/src/runtime/buffered-tuple-stream.cc: http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/buffered-tuple-stream.cc@1118 PS4, Line 1118: StringValue::SimpleString s = sv->ToSimpleString(); Can't we use sv->Len() and sv->Ptr()? http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/descriptors.cc File be/src/runtime/descriptors.cc: http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/descriptors.cc@713 PS4, Line 713: any_val->SetPtr(ptr); These pointers will now point into the fixed length part of StringValue if it is a small string. Is it possible that that fixed length memory slot is discarded when we still use the StringVal we created here? http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/descriptors.cc@1112 PS4, Line 1112: CodegenWriteStringOrCollectionToSlot Optional: we could inline this into the caller, strings and collections are no longer handled in the same way (except if we unify them again, see the comment at L1123). http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/descriptors.cc@1123 PS4, Line 1123: As far as I can see, the main difference between CodegenWriteCollectionToSlot() and CodegenWriteStringToSlot() is the way we store the result, i.e. by using builder->CreateStore() or calling IRFunction::STRING_VALUE_ASSIGN. If we created a similar IR function for collections, we could unify the two functions again, with one IF/ELSE deciding which function to use. Alternatively, if you think it would complicate this patch, we should remove the conditionals/comments from this function that reference the string type. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/descriptors.cc@1130 PS4, Line 1130: type.IsStringType() Now it shouldn't be a string here. Alternatively, 'type' shouldn't need to be passed to this function. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/descriptors.cc@1133 PS4, Line 1133: raw_type This shouldn't be needed, we know it's a collection. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/descriptors.cc@1134 PS4, Line 1134: str_or_coll_value Should be renamed to 'coll_value'. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/descriptors.cc@1139 PS4, Line 1139: if (type.IsCollectionType()) { This IF is no longer needed, we know it's a collection type. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/descriptors.cc@1168 PS4, Line 1168: type.IsCollectionType() It shouldn't be a collection type. Alternatively, 'type' shouldn't need to be passed to this function. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/raw-value.inline.h File be/src/runtime/raw-value.inline.h: http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/raw-value.inline.h@227 PS4, Line 227: StringValue::SimpleString s = v->ToSimpleString(); Can't we use v->Len() and v->Ptr()? http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/raw-value.inline.h@352 PS4, Line 352: StringValue::SimpleString s = v->ToSimpleString(); Can't we use v->Len() and v->Ptr()? http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/smallable-string.h File be/src/runtime/smallable-string.h: http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/smallable-string.h@29 PS4, Line 29: length Nit: the length. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/smallable-string.h@33 PS4, Line 33: /// being used. You could mention the differences between big endian and little endian. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/smallable-string.h@38 PS4, Line 38: struct SimpleString { I've commented on most uses of this struct and ToSimpleString() that it may not be necessary. I don't think using the Ptr() and Len() functions instead of the 'ptr' and 'len' members makes the code much harder to read, but if you think it's cleaner if we keep this struct it's ok too. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/smallable-string.h@48 PS4, Line 48: SmallableString(const SmallableString& other) { Could add a comment. Mention that it conserves the "smallness" state of 'other'. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/smallable-string.h@52 PS4, Line 52: SmallableString(char* ptr, int len) { Could add a comment. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/smallable-string.h@60 PS4, Line 60: explicit SmallableString(const std::string& s) : How much do we need the constructor with 'const std::string& s' and 'const char* s'? I'm a bit concerned because of the const_cast. If the source is actually declared const (and not just the reference we get), it will be UB to modify it. It also violates the API. Can't we take them as non-const? http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/smallable-string.h@65 PS4, Line 65: Nit: unneeded space. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/smallable-string.h@90 PS4, Line 90: if (rep.long_rep.len == 0) { return true; } Shouldn't we set the "small bit"? Now we return true but a subsequent call to IsSmall() will return false. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/smallable-string.h@93 PS4, Line 93: memset(this, 0, sizeof(*this)); This memset doesn't seem to be necessary. We're setting 'rep.small_rep.len' and the useful bytes of 'rep.small_rep.buf' below. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/smallable-string.h@97 PS4, Line 97: 0b10000000 Could extract the masks into constexpr static members. We could also create functions for GetSmallLen() and SetSmallLen(). http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/smallable-string.h@133 PS4, Line 133: if (!IsSmall()) { I'd invert the IF, i.e. "if (IsSmall()) ...". It's more difficult to understand negated predicates and the other functions also follow this pattern. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/smallable-string.h@170 PS4, Line 170: int We could use int32_t to indicate that 'len' should be 4 bytes long. If we do, we could also change int to int32_t in the signatures of the accessor functions. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/sorter.cc File be/src/runtime/sorter.cc: http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/sorter.cc@608 PS4, Line 608: if (string_val->IsSmall()) continue; Optional: I think "if (!string_val->IsSmall()) { add length }" would be better than "continue", also for the other functions. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/sorter.cc@751 PS4, Line 751: if (value->IsSmall()) continue; This could go into the branch at L743. That IF could also become "if constexpr". http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/sorter.cc@786 PS4, Line 786: using ptr_type = decltype(value->Ptr()); Now that strings and collections are handled separately we don't need this type, we could use the concrete types in the branches. Alternatively we could create a SetPtr() method for CollectionValues too. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-buffer.h File be/src/runtime/string-buffer.h: http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-buffer.h@45 PS4, Line 45: StringValue::SimpleString s = str->ToSimpleString(); We could use str->Ptr() and str->Len(). http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-search.h File be/src/runtime/string-search.h: http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-search.h@83 PS4, Line 83: if (str == NULL || pattern_ == NULL) { We could use "pattern_->Len() == 0" here. I think it's clearer to check it here even if we use SimpleString later. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-search.h@140 PS4, Line 140: if (str == NULL || pattern_ == NULL) { See L83. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value-ir.cc File be/src/runtime/string-value-ir.cc: http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value-ir.cc@23 PS4, Line 23: IR_ALWAYS_INLINE int StringValue::IrLen() const { return Len(); } Optional: instead of defining new wrapper functions, we could define StringValue::{Len(), Ptr(), etc.} here in the IR file. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.h File be/src/runtime/string-value.h: http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.h@36 PS4, Line 36: /// TODO: rename this to be less confusing with impala_udf::StringVal. Could mention SSO also here. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.h@53 PS4, Line 53: Nit: extra space. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.h@58 PS4, Line 58: Nit: extra space. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.h@78 PS4, Line 78: void SetLen(int len) { return string_impl_.SetLen(len); } Mention that we can only decrease the length if the string is small. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.h@82 PS4, Line 82: void SetPtr(char* ptr) { return string_impl_.SetPtr(ptr); } Mention that we can only call this if the string is not small. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.h@136 PS4, Line 136: *sv = impala_udf::StringVal(reinterpret_cast<uint8_t*>(s.ptr), s.len); We could use sv->Ptr() and sv->Len(). http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.h@170 PS4, Line 170: StringValue::SimpleString s = v.ToSimpleString(); We could use v->Ptr() and v->Len(). http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.cc File be/src/runtime/string-value.cc: http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.cc@38 PS4, Line 38: unsigned Why is it unsigned? http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.cc@50 PS4, Line 50: unsigned Why unsigned? http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.cc@54 PS4, Line 54: int i = len - 1; Not relevant to this patch, but what if len==0? Shouldn't we also return an empty string then? http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.cc@80 PS4, Line 80: unsigned Why unsigned? http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.inline.h File be/src/runtime/string-value.inline.h: http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.inline.h@117 PS4, Line 117: Substring > did you check the call sites? I am concerned about using a temporary String Calling Smallify() on the result would solve this, but that would also smallify the result if 'this' was originally non-small but the result is short enough - do we want to do that? http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.inline.h@122 PS4, Line 122: inline StringValue StringValue::Substring(int start_pos, int new_len) const { Csaba's L117 comment also applies here. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/tuple-ir.cc File be/src/runtime/tuple-ir.cc: http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/tuple-ir.cc@42 PS4, Line 42: str_len We don't really need this variable. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/tuple.cc File be/src/runtime/tuple.cc: http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/tuple.cc@86 PS4, Line 86: for (vector<SlotDescriptor*>::const_iterator slot = desc.string_slots().begin(); Can't we use a range-based for-loop? "for (SlotDescriptor* const slot : desc.string_slots()) { ... }". http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/util/min-max-filter.cc File be/src/util/min-max-filter.cc: http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/util/min-max-filter.cc@504 PS4, Line 504: CopyToBuffer In the function comment in the header we should mention that small strings are not copied. http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/util/min-max-filter.cc@507 PS4, Line 507: if (value->IsSmall()) return; Can 'len' be shorter than 'value->Len()'? If yes, we should also decrease the size for small strings. If not, L515 is probably not needed. -- To view, visit http://gerrit.cloudera.org:8080/20496 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I741c3a5f12ab620b6b64b57d4c89b5f8e056efd3 Gerrit-Change-Number: 20496 Gerrit-PatchSet: 4 Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Michael Smith <michael.sm...@cloudera.com> Gerrit-Comment-Date: Tue, 26 Sep 2023 09:59:38 +0000 Gerrit-HasComments: Yes