alamb commented on code in PR #9720:
URL: https://github.com/apache/arrow-rs/pull/9720#discussion_r3124684303
##########
parquet/src/encodings/encoding/dict_encoder.rs:
##########
@@ -183,6 +188,214 @@ impl<T: DataType> Encoder<T> for DictEncoder<T> {
///
/// For this encoder, the indices are unencoded bytes (refer to
[`Self::write_indices`]).
fn estimated_memory_size(&self) -> usize {
- self.interner.storage().size_in_bytes + self.indices.len() *
std::mem::size_of::<usize>()
+ self.interner.estimated_memory_size()
+ + self.indices.capacity() * std::mem::size_of::<usize>()
+ }
+}
+
+#[cfg(test)]
+mod tests {
Review Comment:
Thank you @mzabaluev
I ran these tests without your code change and they all passed. Thus I don't
think the are covering whatever issue you have found
```diff
andrewlamb@Andrews-MacBook-Pro-3:~/Software/arrow-rs$ git diff
diff --git a/parquet/src/encodings/encoding/dict_encoder.rs
b/parquet/src/encodings/encoding/dict_encoder.rs
index 37cfdb9ba1..2f32f9c0bb 100644
--- a/parquet/src/encodings/encoding/dict_encoder.rs
+++ b/parquet/src/encodings/encoding/dict_encoder.rs
@@ -64,12 +64,7 @@ impl<T: DataType> Storage for KeyStorage<T> {
}
fn estimated_memory_size(&self) -> usize {
- let uniques_heap_bytes = match T::get_physical_type() {
- Type::FIXED_LEN_BYTE_ARRAY => self.type_length *
self.uniques.len(),
- _ => <Self::Value as
ParquetValueType>::variable_length_bytes(&self.uniques)
- .unwrap_or(0) as usize,
- };
- self.uniques.capacity() * std::mem::size_of::<T::T>() +
uniques_heap_bytes
+ self.size_in_bytes + self.uniques.capacity() *
std::mem::size_of::<T::T>()
}
}
diff --git a/parquet/src/util/interner.rs b/parquet/src/util/interner.rs
index deae3720d5..34c7d1390f 100644
--- a/parquet/src/util/interner.rs
+++ b/parquet/src/util/interner.rs
@@ -77,7 +77,9 @@ impl<S: Storage> Interner<S> {
/// Return estimate of the memory used, in bytes
#[allow(dead_code)] // not used in parquet_derive, so is dead there
pub fn estimated_memory_size(&self) -> usize {
- self.storage.estimated_memory_size() + self.dedup.allocation_size()
+ self.storage.estimated_memory_size() +
+ // estimate size of dedup hashmap as just th size of the keys
+ self.dedup.capacity() + std::mem::size_of::<S::Key>()
}
/// Returns the storage for this interner
```
And then
```shell
andrewlamb@Andrews-MacBook-Pro-3:~/Software/arrow-rs$ cargo test --lib -p
parquet -- dict_encoder
Finished `test` profile [unoptimized + debuginfo] target(s) in 0.10s
Running unittests src/lib.rs
(target/debug/deps/parquet-d5e640393e7492a1)
running 6 tests
test
encodings::encoding::dict_encoder::tests::test_estimated_memory_size_primitive_with_duplicates
... ok
test
encodings::encoding::dict_encoder::tests::test_estimated_memory_size_primitive_all_distinct
... ok
test
encodings::encoding::dict_encoder::tests::test_estimated_memory_size_fixed_len_byte_array_with_duplicates
... ok
test
encodings::encoding::dict_encoder::tests::test_estimated_memory_size_byte_array_with_duplicates
... ok
test
encodings::encoding::dict_encoder::tests::test_estimated_memory_size_fixed_len_byte_array_all_distinct
... ok
test
encodings::encoding::dict_encoder::tests::test_estimated_memory_size_byte_array_all_distinct
... ok
test result: ok. 6 passed; 0 failed; 0 ignored; 0 measured; 1051 filtered
out; finished in 0.00s
andrewlamb@Andrews-MacBook-Pro-3:~/Software/arrow-rs$
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]