wgtmac commented on code in PR #2321:
URL: https://github.com/apache/orc/pull/2321#discussion_r2219569602
##########
c++/src/ColumnWriter.cc:
##########
@@ -991,61 +977,33 @@ namespace orc {
// insert a new string into dictionary, return its insertion order
size_t SortedStringDictionary::insert(const char* str, size_t len) {
size_t index = flatDict_.size();
- auto ret = keyToIndex_.emplace(std::string(str, len), index);
- if (ret.second) {
- flatDict_.emplace_back(ret.first->first.data(), ret.first->first.size(), index);
+
+ auto it = keyToIndex_.find(std::string_view{str, len});
+ if (it != keyToIndex_.end()) {
+ return it->second;
+ } else {
+ flatDict_.emplace_back(str, len);
totalLength_ += len;
+
+ const auto& lastEntry = flatDict_.back();
+ keyToIndex_.emplace(std::string_view{lastEntry.data->data(), lastEntry.data->size()}, index);
+ return index;
}
- return ret.first->second;
}
// write dictionary data & length to output buffer
void SortedStringDictionary::flush(AppendOnlyBufferedStream* dataStream,
RleEncoder* lengthEncoder) const {
- std::sort(flatDict_.begin(), flatDict_.end(), LessThan());
Review Comment:
`For dictionary encodings the dictionary is sorted (in lexicographical order
of bytes in the UTF-8 encodings) and UTF-8 bytes of each unique value are
placed into DICTIONARY_DATA.`
Sorry for chiming in late. According to the
[spec](https://orc.apache.org/specification/ORCv1/), entries in the dictionary
must be sorted, so I think we need to revert this change to comply with the spec.
cc @ffacs
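To illustrate the constraint, here is a minimal sketch of one way to keep insertion-order indices during `insert` while still emitting entries in bytewise order at flush time. It is not the ORC implementation: `SketchDictionary`, `bytewiseLess`, `sortedEntries`, and `remap` are hypothetical names, and the sketch uses owning `std::string` keys rather than the `std::string_view` keys in the diff.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <numeric>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// Compare two strings as sequences of unsigned bytes, which matches
// lexicographical order of the UTF-8 encodings.
inline bool bytewiseLess(const std::string& a, const std::string& b) {
  const std::size_t n = std::min(a.size(), b.size());
  const int cmp = std::memcmp(a.data(), b.data(), n);
  return cmp != 0 ? cmp < 0 : a.size() < b.size();
}

// Hypothetical stand-in for a string dictionary; not the ORC class.
class SketchDictionary {
 public:
  // Returns the insertion-order index of `value`, adding it if it is new.
  std::size_t insert(std::string_view value) {
    auto [it, inserted] =
        keyToIndex_.try_emplace(std::string(value), entries_.size());
    if (inserted) {
      entries_.push_back(it->first);
    }
    return it->second;
  }

  // Returns the entries in bytewise order and fills `remap` so that
  // remap[insertionIndex] == sortedIndex.
  std::vector<std::string> sortedEntries(std::vector<std::size_t>& remap) const {
    // order[sortedPos] is the insertion index of the entry at that position.
    std::vector<std::size_t> order(entries_.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(),
              [this](std::size_t a, std::size_t b) {
                return bytewiseLess(entries_[a], entries_[b]);
              });

    remap.assign(entries_.size(), 0);
    std::vector<std::string> sorted;
    sorted.reserve(entries_.size());
    for (std::size_t pos = 0; pos < order.size(); ++pos) {
      remap[order[pos]] = pos;
      sorted.push_back(entries_[order[pos]]);
    }
    return sorted;
  }

 private:
  std::unordered_map<std::string, std::size_t> keyToIndex_;
  std::vector<std::string> entries_;  // entries in insertion order
};
```

With this shape, row values recorded as insertion indices would be translated through `remap` before being written, so the stored dictionary stays sorted as the spec requires. Whether the writer already performs such a remapping elsewhere is not visible in the quoted hunk.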