[ 
https://issues.apache.org/jira/browse/IMPALA-12393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757639#comment-17757639
 ] 

Joe McDonnell commented on IMPALA-12393:
----------------------------------------

I can't get the Parquet writing to produce the issue, so maybe the padding is 
always zero somehow.

> DictEncoder uses inconsistent hash function for TimestampValue
> --------------------------------------------------------------
>
>                 Key: IMPALA-12393
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12393
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 4.3.0
>            Reporter: Joe McDonnell
>            Assignee: Joe McDonnell
>            Priority: Blocker
>
> DictEncoder currently uses this hash function for TimestampValue:
> {noformat}
> template<typename T>
> inline uint32_t DictEncoder<T>::Hash(const T& value) const {
>   return HashUtil::Hash(&value, sizeof(value), 0);
> }{noformat}
> TimestampValue has some padding, and nothing ensures that the padding is 
> cleared. This means that identical TimestampValue objects can hash to 
> different values.
> This came up when fixing a Clang-Tidy performance check. This line in 
> dict-test.cc changed from iterating over values to iterating over const 
> references.
> {noformat}
>   DictEncoder<InternalType> encoder(&pool, fixed_buffer_byte_size, 
> &track_encoder);
>   encoder.UsedbyTest();
> <<<<<<
>   for (InternalType i: values) encoder.Put(i);
> =====
>   for (const InternalType& i: values) encoder.Put(i);
> >>>>>
>   bytes_alloc = encoder.DictByteSize();
>   EXPECT_EQ(track_encoder.consumption(), bytes_alloc);
>   EXPECT_EQ(encoder.num_entries(), values_set.size()); <--------{noformat}
> The test became flaky, with the encoder.num_entries() being larger than the 
> values_set.size() for TimestampValue. This happened because the hash values 
> didn't match even for identical entries and the dictionary would have 
> multiple copies of the same value. When iterating over a plain non-reference 
> TimestampValue, each TimestampValue is being copied to a temporary value. 
> Maybe in this circumstance the padding stays the same between iterations.
> It's possible this would come up when writing Parquet data files.
> One fix would be to use TimestampValue's Hash function, which ignores the 
> padding:
> {noformat}
> template<>
> inline uint32_t DictEncoder<TimestampValue>::Hash(const TimestampValue& 
> value) const {
>   return value.Hash();
> }{noformat}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org

Reply via email to