Daniel Becker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20496 )

Change subject: IMPALA-12373: Small String Optimization for StringValue
......................................................................


Patch Set 4:

(67 comments)

http://gerrit.cloudera.org:8080/#/c/20496/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20496/4//COMMIT_MSG@15
PS4, Line 15: byte
Nit: bytes.


http://gerrit.cloudera.org:8080/#/c/20496/4//COMMIT_MSG@16
PS4, Line 16: byte
Nit: bytes.


http://gerrit.cloudera.org:8080/#/c/20496/4//COMMIT_MSG@19
PS4, Line 19: between 'ptr' and 'len'
It also means that there is no padding _after_ 'len': if it wasn't packed, 4 
bytes would be needed after 'len' so that 'ptr' would be aligned on an 8-byte 
boundary in an array.


http://gerrit.cloudera.org:8080/#/c/20496/4//COMMIT_MSG@20
PS4, Line 20: byte
Nit: bytes.


http://gerrit.cloudera.org:8080/#/c/20496/4//COMMIT_MSG@40
PS4, Line 40: we can also smallify all strings in a tuple at
            : once.
TODO: The description of this use case is not clear to me.


http://gerrit.cloudera.org:8080/#/c/20496/4//COMMIT_MSG@59
PS4, Line 59: 6.5 columns
It's not clear to me what 6.5 columns means? Is it that in the 7th column half 
of the strings are small and the other half are large?


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/codegen/llvm-codegen-test.cc
File be/src/codegen/llvm-codegen-test.cc:

http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/codegen/llvm-codegen-test.cc@365
PS4, Line 365:   llvm::Function* str_setlen_fn = codegen->GetFunction(
If we retrieve 'str_ptr_fn' and 'str_len_fn' at the beginning of the function, 
we could do the same with 'str_setlen_fn', i.e. move it to the beginning.

Alternatively we could move 'str_len_fn' to just before its usage on L362, but 
I usually take the first approach with IR functions.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/analytic-eval-node.cc
File be/src/exec/analytic-eval-node.cc:

http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/analytic-eval-node.cc@397
PS4, Line 397:     StringValue::SimpleString s = sv->ToSimpleString();
Do we need this conversion? Couldn't we just use sv->Len() and sv->Ptr()?


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/hdfs-text-table-writer.cc
File be/src/exec/hdfs-text-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/hdfs-text-table-writer.cc@167
PS4, Line 167:   StringValue::SimpleString s = str_val->ToSimpleString();
Can't we use str_val->Len() and str_val->Ptr()?


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/kudu/kudu-util-ir.cc
File be/src/exec/kudu/kudu-util-ir.cc:

http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/kudu/kudu-util-ir.cc@59
PS4, Line 59:       StringValue::SimpleString s = sv->ToSimpleString();
Can't we use sv->Len() and sv->Ptr()?


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/kudu/kudu-util-ir.cc@72
PS4, Line 72:       StringValue::SimpleString s = sv->ToSimpleString();
Can't we use sv->Len() and sv->Ptr()?


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/orc/hdfs-orc-scanner.cc
File be/src/exec/orc/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/orc/hdfs-orc-scanner.cc@1175
PS4, Line 1175:       StringValue::SimpleString s = sv->ToSimpleString();
Can't we use sv->Len() and sv->Ptr()?


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/parquet/parquet-column-stats.cc
File be/src/exec/parquet/parquet-column-stats.cc:

http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/parquet/parquet-column-stats.cc@417
PS4, Line 417:   StringValue::SimpleString value_s = value->ToSimpleString();
Can't we use value->Len() and value->Ptr()?


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/parquet/parquet-common.h
File be/src/exec/parquet/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/parquet/parquet-common.h@560
PS4, Line 560:   StringValue::SimpleString s = v.ToSimpleString();
Can't we use v->Len() and v->Ptr()?


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/parquet/parquet-common.h@573
PS4, Line 573:   v->Clear();
Do we call Clear() to ensure the StringValue is not small?
Instead we could use Assign() (instead of  SetLen() and SetPtr()), that's 
probably more straightforward and solves the same problem.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/text-converter.inline.h
File be/src/exec/text-converter.inline.h:

http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/text-converter.inline.h@95
PS4, Line 95:         StringValue::SimpleString s = str.ToSimpleString();
Can't we use str->Len() and str->Ptr()?


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exec/text-converter.inline.h@114
PS4, Line 114:         StringValue::SimpleString s = str.ToSimpleString();
Can't we use str->Len() and str->Ptr()?


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/experiments/string-search-sse.h
File be/src/experiments/string-search-sse.h:

http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/experiments/string-search-sse.h@60
PS4, Line 60:     StringValue::SimpleString haystack_s = 
haystack.ToSimpleString();
Can't we use haystack->Len() and haystack->Ptr()?


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/exprs/string-functions-ir.cc@525
PS4, Line 525:   StringValue::SimpleString haystack_s = 
haystack.ToSimpleString();
Can't we use haystack->Len() and haystack->Ptr()?


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/buffered-tuple-stream-test.cc
File be/src/runtime/buffered-tuple-stream-test.cc:

http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/buffered-tuple-stream-test.cc@235
PS4, Line 235: Assign
We don't use sv.SetPtr() because it may be a small string? If so we could add a 
comment to make it clear.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/buffered-tuple-stream.cc
File be/src/runtime/buffered-tuple-stream.cc:

http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/buffered-tuple-stream.cc@1118
PS4, Line 1118:       StringValue::SimpleString s = sv->ToSimpleString();
Can't we use sv->Len() and sv->Ptr()?


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/descriptors.cc
File be/src/runtime/descriptors.cc:

http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/descriptors.cc@713
PS4, Line 713:       any_val->SetPtr(ptr);
These pointers will now point into the fixed length part of StringValue if it 
is a small string. Is it possible that that fixed length memory slot is 
discarded when we still use the StringVal we created here?


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/descriptors.cc@1112
PS4, Line 1112: CodegenWriteStringOrCollectionToSlot
Optional: we could inline this into the caller, strings and collections are no 
longer handled in the same way (except if we unify them again, see the comment 
at L1123).


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/descriptors.cc@1123
PS4, Line 1123:
As far as I can see, the main difference between CodegenWriteCollectionToSlot() 
and CodegenWriteStringToSlot() is the way we store the result, i.e. by using 
builder->CreateStore() or calling IRFunction::STRING_VALUE_ASSIGN.

If we created a similar IR function for collections, we could unify the two 
functions again, with one IF/ELSE deciding which function to use.
Alternatively, if you think it would complicate this patch, we should remove 
the conditionals/comments from this function that reference the string type.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/descriptors.cc@1130
PS4, Line 1130: type.IsStringType()
Now it shouldn't be a string here. Alternatively, 'type' shouldn't need to be 
passed to this function.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/descriptors.cc@1133
PS4, Line 1133: raw_type
This shouldn't be needed, we know it's a collection.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/descriptors.cc@1134
PS4, Line 1134: str_or_coll_value
Should be renamed to 'coll_value'.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/descriptors.cc@1139
PS4, Line 1139:     if (type.IsCollectionType()) {
This IF is no longer needed, we know it's a collection type.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/descriptors.cc@1168
PS4, Line 1168: type.IsCollectionType()
It shouldn't be a collection type. Alternatively, 'type' shouldn't need to be 
passed to this function.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/raw-value.inline.h
File be/src/runtime/raw-value.inline.h:

http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/raw-value.inline.h@227
PS4, Line 227:     StringValue::SimpleString s = v->ToSimpleString();
Can't we use v->Len() and v->Ptr()?


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/raw-value.inline.h@352
PS4, Line 352:     StringValue::SimpleString s = v->ToSimpleString();
Can't we use v->Len() and v->Ptr()?


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/smallable-string.h
File be/src/runtime/smallable-string.h:

http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/smallable-string.h@29
PS4, Line 29: length
Nit: the length.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/smallable-string.h@33
PS4, Line 33: /// being used.
You could mention the differences between big endian and little endian.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/smallable-string.h@38
PS4, Line 38:   struct SimpleString {
I've commented on most uses of this struct and ToSimpleString() that it may not 
be necessary. I don't think using the Ptr() and Len() functions instead of the 
'ptr' and 'len' members makes the code much harder to read, but if you think 
it's cleaner if we keep this struct it's ok too.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/smallable-string.h@48
PS4, Line 48:   SmallableString(const SmallableString& other) {
Could add a comment. Mention that it conserves the "smallness" state of 'other'.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/smallable-string.h@52
PS4, Line 52:   SmallableString(char* ptr, int len) {
Could add a comment.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/smallable-string.h@60
PS4, Line 60:   explicit SmallableString(const std::string& s) :
How much do we need the constructor with 'const std::string& s' and 'const 
char* s'? I'm a bit concerned because of the const_cast. If the source is 
actually declared const (and not just the reference we get), it will be UB to 
modify it. It also violates the API. Can't we take them as non-const?


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/smallable-string.h@65
PS4, Line 65:
Nit: unneeded space.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/smallable-string.h@90
PS4, Line 90:     if (rep.long_rep.len == 0) { return true; }
Shouldn't we set the "small bit"? Now we return true but a subsequent call to 
IsSmall() will return false.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/smallable-string.h@93
PS4, Line 93:     memset(this, 0, sizeof(*this));
This memset doesn't seem to be necessary. We're setting 'rep.small_rep.len' and 
the useful bytes of 'rep.small_rep.buf' below.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/smallable-string.h@97
PS4, Line 97: 0b10000000
Could extract the masks into constexpr static members.
We could also create functions for GetSmallLen() and SetSmallLen().


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/smallable-string.h@133
PS4, Line 133:     if (!IsSmall()) {
I'd invert the IF, i.e. "if (IsSmall()) ...". It's more difficult to understand 
negated predicates and the other functions also follow this pattern.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/smallable-string.h@170
PS4, Line 170: int
We could use int32_t to indicate that 'len' should be 4 bytes long. If we do, 
we could also change int to int32_t in the signatures of the accessor functions.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/sorter.cc
File be/src/runtime/sorter.cc:

http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/sorter.cc@608
PS4, Line 608:       if (string_val->IsSmall()) continue;
Optional: I think "if (!string_val->IsSmall()) { add length }" would be better 
than "continue", also for the other functions.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/sorter.cc@751
PS4, Line 751:       if (value->IsSmall()) continue;
This could go into the branch at L743. That IF could also become "if constexpr".


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/sorter.cc@786
PS4, Line 786:     using ptr_type = decltype(value->Ptr());
Now that strings and collections are handled separately we don't need this 
type, we could use the concrete types in the branches. Alternatively we could 
create a SetPtr() method for CollectionValues too.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-buffer.h
File be/src/runtime/string-buffer.h:

http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-buffer.h@45
PS4, Line 45:       StringValue::SimpleString s = str->ToSimpleString();
We could use str->Ptr() and str->Len().


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-search.h
File be/src/runtime/string-search.h:

http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-search.h@83
PS4, Line 83:     if (str == NULL || pattern_ == NULL) {
We could use "pattern_->Len() == 0" here. I think it's clearer to check it here 
even if we use SimpleString later.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-search.h@140
PS4, Line 140:     if (str == NULL || pattern_ == NULL) {
See L83.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value-ir.cc
File be/src/runtime/string-value-ir.cc:

http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value-ir.cc@23
PS4, Line 23: IR_ALWAYS_INLINE int StringValue::IrLen() const { return Len(); }
Optional: instead of defining new wrapper functions, we could define 
StringValue::{Len(), Ptr(), etc.} here in the IR file.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.h
File be/src/runtime/string-value.h:

http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.h@36
PS4, Line 36: /// TODO: rename this to be less confusing with 
impala_udf::StringVal.
Could mention SSO also here.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.h@53
PS4, Line 53:
Nit: extra space.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.h@58
PS4, Line 58:
Nit: extra space.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.h@78
PS4, Line 78:   void SetLen(int len) { return string_impl_.SetLen(len); }
Mention that we can only decrease the length if the string is small.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.h@82
PS4, Line 82:   void SetPtr(char* ptr) { return string_impl_.SetPtr(ptr); }
Mention that we can only call this if the string is not small.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.h@136
PS4, Line 136:     *sv = 
impala_udf::StringVal(reinterpret_cast<uint8_t*>(s.ptr), s.len);
We could use sv->Ptr() and sv->Len().


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.h@170
PS4, Line 170:   StringValue::SimpleString s = v.ToSimpleString();
We could use v->Ptr() and v->Len().


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.cc
File be/src/runtime/string-value.cc:

http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.cc@38
PS4, Line 38: unsigned
Why is it unsigned?


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.cc@50
PS4, Line 50: unsigned
Why unsigned?


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.cc@54
PS4, Line 54:   int i = len - 1;
Not relevant to this patch, but what if len==0? Shouldn't we also return an 
empty string then?


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.cc@80
PS4, Line 80: unsigned
Why unsigned?


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.inline.h
File be/src/runtime/string-value.inline.h:

http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.inline.h@117
PS4, Line 117: Substring
> did you check the call sites? I am concerned about using a temporary String
Calling Smallify() on the result would solve this, but that would also smallify 
the result if 'this' was originally non-small but the result is short enough - 
do we want to do that?


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/string-value.inline.h@122
PS4, Line 122: inline StringValue StringValue::Substring(int start_pos, int 
new_len) const {
Csaba's L117 comment also applies here.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/tuple-ir.cc
File be/src/runtime/tuple-ir.cc:

http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/tuple-ir.cc@42
PS4, Line 42: str_len
We don't really need this variable.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/tuple.cc
File be/src/runtime/tuple.cc:

http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/runtime/tuple.cc@86
PS4, Line 86:   for (vector<SlotDescriptor*>::const_iterator slot = 
desc.string_slots().begin();
Can't we use a range-based for-loop?
"for (SlotDescriptor* const slot : desc.string_slots()) { ... }".


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/util/min-max-filter.cc
File be/src/util/min-max-filter.cc:

http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/util/min-max-filter.cc@504
PS4, Line 504: CopyToBuffer
In the function comment in the header we should mention that small strings are 
not copied.


http://gerrit.cloudera.org:8080/#/c/20496/4/be/src/util/min-max-filter.cc@507
PS4, Line 507:   if (value->IsSmall()) return;
Can 'len' be shorter than 'value->Len()'? If yes, we should also decrease the 
size for small strings. If not, L515 is probably not needed.



--
To view, visit http://gerrit.cloudera.org:8080/20496
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I741c3a5f12ab620b6b64b57d4c89b5f8e056efd3
Gerrit-Change-Number: 20496
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Michael Smith <michael.sm...@cloudera.com>
Gerrit-Comment-Date: Tue, 26 Sep 2023 09:59:38 +0000
Gerrit-HasComments: Yes

Reply via email to