[jira] [Commented] (IMPALA-12393) DictEncoder uses inconsistent hash function for TimestampValue

Joe McDonnell (Jira) Tue, 22 Aug 2023 14:19:16 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-12393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757678#comment-17757678
 ]


Joe McDonnell commented on IMPALA-12393:
----------------------------------------

For normal execution, I think this statement that clears the Tuple means that 
the padding will be consistent. From be/src/runtime/tuple.h:
{noformat}
  void Init(int size) { memset(this, 0, size); }{noformat}
That would mean this is test-only and not a real issue.

> DictEncoder uses inconsistent hash function for TimestampValue
> --------------------------------------------------------------
>
>                 Key: IMPALA-12393
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12393
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 4.3.0
>            Reporter: Joe McDonnell
>            Assignee: Joe McDonnell
>            Priority: Major
>
> DictEncoder currently uses this hash function for TimestampValue:
> {noformat}
> template<typename T>
> inline uint32_t DictEncoder<T>::Hash(const T& value) const {
>   return HashUtil::Hash(&value, sizeof(value), 0);
> }{noformat}
> TimestampValue has some padding, and nothing ensures that the padding is 
> cleared. This means that identical TimestampValue objects can hash to 
> different values.
> This came up when fixing a Clang-Tidy performance check. This line in 
> dict-test.cc changed from iterating over values to iterating over const 
> references.
> {noformat}
>   DictEncoder<InternalType> encoder(&pool, fixed_buffer_byte_size, 
> &track_encoder);
>   encoder.UsedbyTest();
> <<<<<<
>   for (InternalType i: values) encoder.Put(i);
> =====
>   for (const InternalType& i: values) encoder.Put(i);
> >>>>>
>   bytes_alloc = encoder.DictByteSize();
>   EXPECT_EQ(track_encoder.consumption(), bytes_alloc);
>   EXPECT_EQ(encoder.num_entries(), values_set.size()); <--------{noformat}
> The test became flaky, with the encoder.num_entries() being larger than the 
> values_set.size() for TimestampValue. This happened because the hash values 
> didn't match even for identical entries and the dictionary would have 
> multiple copies of the same value. When iterating over a plain non-reference 
> TimestampValue, each TimestampValue is being copied to a temporary value. 
> Maybe in this circumstance the padding stays the same between iterations.
> It's possible this would come up when writing Parquet data files.
> One fix would be to use TimestampValue's Hash function, which ignores the 
> padding:
> {noformat}
> template<>
> inline uint32_t DictEncoder<TimestampValue>::Hash(const TimestampValue& 
> value) const {
>   return value.Hash();
> }{noformat}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org

[jira] [Commented] (IMPALA-12393) DictEncoder uses inconsistent hash function for TimestampValue

Reply via email to