alamb commented on code in PR #9720:
URL: https://github.com/apache/arrow-rs/pull/9720#discussion_r3124684303


##########
parquet/src/encodings/encoding/dict_encoder.rs:
##########
@@ -183,6 +188,214 @@ impl<T: DataType> Encoder<T> for DictEncoder<T> {
     ///
     /// For this encoder, the indices are unencoded bytes (refer to [`Self::write_indices`]).
     fn estimated_memory_size(&self) -> usize {
-        self.interner.storage().size_in_bytes + self.indices.len() * std::mem::size_of::<usize>()
+        self.interner.estimated_memory_size()
+            + self.indices.capacity() * std::mem::size_of::<usize>()
+    }
+}
+
+#[cfg(test)]
+mod tests {

Review Comment:
   Thank you @mzabaluev 
   
   I ran these tests without your code change and they all passed, so I don't think they are covering whatever issue you have found.
   
   ```diff
   andrewlamb@Andrews-MacBook-Pro-3:~/Software/arrow-rs$ git diff
   diff --git a/parquet/src/encodings/encoding/dict_encoder.rs b/parquet/src/encodings/encoding/dict_encoder.rs
   index 37cfdb9ba1..2f32f9c0bb 100644
   --- a/parquet/src/encodings/encoding/dict_encoder.rs
   +++ b/parquet/src/encodings/encoding/dict_encoder.rs
   @@ -64,12 +64,7 @@ impl<T: DataType> Storage for KeyStorage<T> {
        }
   
        fn estimated_memory_size(&self) -> usize {
   -        let uniques_heap_bytes = match T::get_physical_type() {
   -            Type::FIXED_LEN_BYTE_ARRAY => self.type_length * self.uniques.len(),
   -            _ => <Self::Value as ParquetValueType>::variable_length_bytes(&self.uniques)
   -                .unwrap_or(0) as usize,
   -        };
   -        self.uniques.capacity() * std::mem::size_of::<T::T>() + uniques_heap_bytes
   +        self.size_in_bytes + self.uniques.capacity() * std::mem::size_of::<T::T>()
        }
    }
   
   diff --git a/parquet/src/util/interner.rs b/parquet/src/util/interner.rs
   index deae3720d5..34c7d1390f 100644
   --- a/parquet/src/util/interner.rs
   +++ b/parquet/src/util/interner.rs
   @@ -77,7 +77,9 @@ impl<S: Storage> Interner<S> {
        /// Return estimate of the memory used, in bytes
        #[allow(dead_code)] // not used in parquet_derive, so is dead there
        pub fn estimated_memory_size(&self) -> usize {
   -        self.storage.estimated_memory_size() + self.dedup.allocation_size()
   +        self.storage.estimated_memory_size() +
   +            // estimate size of dedup hashmap as just the size of the keys
   +            self.dedup.capacity() * std::mem::size_of::<S::Key>()
        }
   
        /// Returns the storage for this interner
   ```
   
   And then 
   
   ```shell
   andrewlamb@Andrews-MacBook-Pro-3:~/Software/arrow-rs$ cargo test --lib -p parquet -- dict_encoder
       Finished `test` profile [unoptimized + debuginfo] target(s) in 0.10s
        Running unittests src/lib.rs (target/debug/deps/parquet-d5e640393e7492a1)
   
   running 6 tests
   test encodings::encoding::dict_encoder::tests::test_estimated_memory_size_primitive_with_duplicates ... ok
   test encodings::encoding::dict_encoder::tests::test_estimated_memory_size_primitive_all_distinct ... ok
   test encodings::encoding::dict_encoder::tests::test_estimated_memory_size_fixed_len_byte_array_with_duplicates ... ok
   test encodings::encoding::dict_encoder::tests::test_estimated_memory_size_byte_array_with_duplicates ... ok
   test encodings::encoding::dict_encoder::tests::test_estimated_memory_size_fixed_len_byte_array_all_distinct ... ok
   test encodings::encoding::dict_encoder::tests::test_estimated_memory_size_byte_array_all_distinct ... ok
   
   test result: ok. 6 passed; 0 failed; 0 ignored; 0 measured; 1051 filtered out; finished in 0.00s
   
   andrewlamb@Andrews-MacBook-Pro-3:~/Software/arrow-rs$
   ```
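
   For context, the distinction the two implementations differ on is capacity-based vs. length-based accounting of a `Vec`'s memory. A minimal standalone sketch (hypothetical helper names, not arrow-rs code) shows why a test must pin down the exact estimate to tell them apart:

   ```rust
   // Hypothetical helpers (not arrow-rs APIs) contrasting two ways of
   // accounting for a Vec's memory: reserved capacity vs. occupied length.

   fn estimated_bytes_by_capacity(v: &Vec<usize>) -> usize {
       // Counts all reserved slots, matching what the allocator actually holds.
       v.capacity() * std::mem::size_of::<usize>()
   }

   fn estimated_bytes_by_len(v: &Vec<usize>) -> usize {
       // Counts only occupied slots; undercounts after a large up-front reserve.
       v.len() * std::mem::size_of::<usize>()
   }

   fn main() {
       let mut v: Vec<usize> = Vec::with_capacity(1024);
       v.push(1);
       // With one element in a 1024-slot allocation the two estimates diverge,
       // so only an assertion on the exact value distinguishes the two.
       assert!(estimated_bytes_by_capacity(&v) >= 1024 * std::mem::size_of::<usize>());
       assert_eq!(estimated_bytes_by_len(&v), std::mem::size_of::<usize>());
   }
   ```

   A test that only checks the estimate is nonzero or within a loose range passes under either accounting, which would explain the tests succeeding with and without the change.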



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
