Joe McDonnell created IMPALA-12393:
--------------------------------------

             Summary: DictEncoder uses inconsistent hash function for 
TimestampValue
                 Key: IMPALA-12393
                 URL: https://issues.apache.org/jira/browse/IMPALA-12393
             Project: IMPALA
          Issue Type: Bug
          Components: Backend
    Affects Versions: Impala 4.3.0
            Reporter: Joe McDonnell


DictEncoder currently uses this hash function for TimestampValue:
{noformat}
template<typename T>
inline uint32_t DictEncoder<T>::Hash(const T& value) const {
  return HashUtil::Hash(&value, sizeof(value), 0);
}{noformat}
TimestampValue has some padding, and nothing ensures that the padding is 
cleared. This means that identical TimestampValue objects can hash to different 
values.

This came up when fixing a Clang-Tidy performance check. This line in 
dict-test.cc changed from iterating over values to iterating over const 
references.
{noformat}
  DictEncoder<InternalType> encoder(&pool, fixed_buffer_byte_size, 
&track_encoder);
  encoder.UsedbyTest();
<<<<<<
  for (InternalType i: values) encoder.Put(i);
=====
  for (const InternalType& i: values) encoder.Put(i);
>>>>>
  bytes_alloc = encoder.DictByteSize();
  EXPECT_EQ(track_encoder.consumption(), bytes_alloc);
  EXPECT_EQ(encoder.num_entries(), values_set.size()); <--------{noformat}
The test became flaky, with the encoder.num_entries() being larger than the 
values_set.size() for TimestampValue. This happened because the hash values 
didn't match even for identical entries and the dictionary would have multiple 
copies of the same value. When iterating over a plain non-reference 
TimestampValue, each TimestampValue is being copied to a temporary value. Maybe 
in this circumstance the padding stays the same between iterations.

It's possible this would come up when writing Parquet data files.

One fix would be to use TimestampValue's Hash function, which ignores the 
padding:
{noformat}
template<>
inline uint32_t DictEncoder<TimestampValue>::Hash(const TimestampValue& value) 
const {
  return value.Hash();
}{noformat}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org

Reply via email to