[ https://issues.apache.org/jira/browse/IMPALA-12393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757639#comment-17757639 ]
Joe McDonnell commented on IMPALA-12393: ---------------------------------------- I can't get the Parquet writing to produce the issue, so maybe the padding is always zero somehow. > DictEncoder uses inconsistent hash function for TimestampValue > -------------------------------------------------------------- > > Key: IMPALA-12393 > URL: https://issues.apache.org/jira/browse/IMPALA-12393 > Project: IMPALA > Issue Type: Bug > Components: Backend > Affects Versions: Impala 4.3.0 > Reporter: Joe McDonnell > Assignee: Joe McDonnell > Priority: Blocker > > DictEncoder currently uses this hash function for TimestampValue: > {noformat} > template<typename T> > inline uint32_t DictEncoder<T>::Hash(const T& value) const { > return HashUtil::Hash(&value, sizeof(value), 0); > }{noformat} > TimestampValue has some padding, and nothing ensures that the padding is > cleared. This means that identical TimestampValue objects can hash to > different values. > This came up when fixing a Clang-Tidy performance check. This line in > dict-test.cc changed from iterating over values to iterating over const > references. > {noformat} > DictEncoder<InternalType> encoder(&pool, fixed_buffer_byte_size, > &track_encoder); > encoder.UsedbyTest(); > <<<<<< > for (InternalType i: values) encoder.Put(i); > ===== > for (const InternalType& i: values) encoder.Put(i); > >>>>> > bytes_alloc = encoder.DictByteSize(); > EXPECT_EQ(track_encoder.consumption(), bytes_alloc); > EXPECT_EQ(encoder.num_entries(), values_set.size()); <--------{noformat} > The test became flaky, with the encoder.num_entries() being larger than the > values_set.size() for TimestampValue. This happened because the hash values > didn't match even for identical entries and the dictionary would have > multiple copies of the same value. When iterating over a plain non-reference > TimestampValue, each TimestampValue is being copied to a temporary value. > Maybe in this circumstance the padding stays the same between iterations. > It's possible this would come up when writing Parquet data files. > One fix would be to use TimestampValue's Hash function, which ignores the > padding: > {noformat} > template<> > inline uint32_t DictEncoder<TimestampValue>::Hash(const TimestampValue& > value) const { > return value.Hash(); > }{noformat} > -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org