Re: [PR] Add a crate for HeapSize trait [arrow-rs]
alchemist51 commented on code in PR #9138: URL: https://github.com/apache/arrow-rs/pull/9138#discussion_r2751559818 ## arrow-memory-size/README.md: ## @@ -0,0 +1,231 @@ + + +# `arrow-memory-size` + +[](https://crates.io/crates/arrow-memory-size) +[](https://docs.rs/arrow-memory-size/latest/arrow_memory_size/) + +Memory size estimation utilities for [Apache Arrow]. + +This crate provides the `HeapSize` trait for calculating heap memory usage of data structures. + +[Apache Arrow]: https://arrow.apache.org/ + +## Why This Crate? + +Several memory size estimation crates exist in the Rust ecosystem ([deepsize], [get-size2], etc.), but none has emerged as a clear standard. Rather than take a dependency on any of them, this crate provides a minimal `HeapSize` trait with a small API surface that can be implemented across the Arrow ecosystem. + +Key motivations: + +- **Minimal API**: Just two methods (`heap_size()` and `total_size()`) Review Comment: That makes sense, thanks @adriangb! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add a crate for HeapSize trait [arrow-rs]
adriangb commented on code in PR #9138:
URL: https://github.com/apache/arrow-rs/pull/9138#discussion_r2751328890
##
arrow-array/src/heap_size.rs:
##
@@ -0,0 +1,138 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! [`HeapSize`] implementations for arrow-array types
+
+use arrow_memory_size::HeapSize;
+
+use crate::Array;
+use crate::types::{ArrowDictionaryKeyType, ArrowPrimitiveType,
RunEndIndexType};
+use crate::{
+BinaryArray, BinaryViewArray, BooleanArray, DictionaryArray,
FixedSizeBinaryArray,
+FixedSizeListArray, LargeBinaryArray, LargeListArray, LargeListViewArray,
LargeStringArray,
+ListArray, ListViewArray, MapArray, NullArray, PrimitiveArray, RunArray,
StringArray,
+StringViewArray, StructArray, UnionArray,
+};
+
+// Note: A blanket implementation `impl HeapSize for T` would be
ideal,
+// but is not possible due to Rust's orphan rules (E0210) since HeapSize is
defined
+// in a separate crate.
+//
+// Note: HeapSize cannot be implemented for ArrayRef (Arc) here due
to
+// Rust's orphan rules. Use array.get_buffer_memory_size() directly instead.
+
+/// Implements HeapSize for array types that delegate to
get_buffer_memory_size()
+macro_rules! impl_heap_size {
+($($ty:ty),*) => {
+$(
+impl HeapSize for $ty {
+fn heap_size(&self) -> usize {
+self.get_buffer_memory_size()
+}
+}
+)*
+};
+}
+
+impl_heap_size!(
+BooleanArray,
+NullArray,
+StringArray,
+LargeStringArray,
+BinaryArray,
+LargeBinaryArray,
+StringViewArray,
+BinaryViewArray,
+FixedSizeBinaryArray,
+ListArray,
+LargeListArray,
+ListViewArray,
+LargeListViewArray,
+FixedSizeListArray,
+StructArray,
+MapArray,
+UnionArray
+);
+
+impl HeapSize for PrimitiveArray {
+fn heap_size(&self) -> usize {
+self.get_buffer_memory_size()
+}
+}
+
+impl HeapSize for DictionaryArray {
+fn heap_size(&self) -> usize {
+self.get_buffer_memory_size()
+}
+}
+
+impl HeapSize for RunArray {
+fn heap_size(&self) -> usize {
+self.get_buffer_memory_size()
+}
+}
+
+#[cfg(test)]
+mod tests {
+use super::*;
+use crate::Int32Array;
+
+#[test]
+fn test_primitive_array_heap_size() {
+let array = Int32Array::from(vec![1, 2, 3, 4, 5]);
+let size = array.heap_size();
+assert!(size >= 5 * std::mem::size_of::());
Review Comment:
Changed to be exact comparisons
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add a crate for HeapSize trait [arrow-rs]
adriangb commented on code in PR #9138: URL: https://github.com/apache/arrow-rs/pull/9138#discussion_r2751328438 ## arrow-memory-size/README.md: ## @@ -0,0 +1,231 @@ + + +# `arrow-memory-size` + +[](https://crates.io/crates/arrow-memory-size) +[](https://docs.rs/arrow-memory-size/latest/arrow_memory_size/) + +Memory size estimation utilities for [Apache Arrow]. + +This crate provides the `HeapSize` trait for calculating heap memory usage of data structures. + +[Apache Arrow]: https://arrow.apache.org/ + +## Why This Crate? + +Several memory size estimation crates exist in the Rust ecosystem ([deepsize], [get-size2], etc.), but none has emerged as a clear standard. Rather than take a dependency on any of them, this crate provides a minimal `HeapSize` trait with a small API surface that can be implemented across the Arrow ecosystem. + +Key motivations: + +- **Minimal API**: Just two methods (`heap_size()` and `total_size()`) Review Comment: Here you go: https://github.com/apache/arrow-rs/pull/9138/changes#r2751327162. The ignore option is useful e.g. for caches, shared references where you know / want to semantically treat them as owned by someone else. E.g. `Vec>` should still count the `Vec` and pointers, but not the `T`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add a crate for HeapSize trait [arrow-rs]
adriangb commented on code in PR #9138:
URL: https://github.com/apache/arrow-rs/pull/9138#discussion_r2751327162
##
arrow-memory-size/src/lib.rs:
##
@@ -0,0 +1,688 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Memory size estimation utilities for Apache Arrow
+//!
+//! This crate provides the [`HeapSize`] trait for calculating heap memory
usage
+//! of data structures, with implementations for standard library types.
+//!
+//! For Arrow type implementations, see:
+//! - [`arrow-buffer`](https://docs.rs/arrow-buffer) for buffer types
+//! - [`arrow-array`](https://docs.rs/arrow-array) for array types
+//!
+//! # Example
+//!
+//! ```
+//! use arrow_memory_size::HeapSize;
+//!
+//! let v: Vec = vec!["hello".to_string(), "world".to_string()];
+//! let heap_bytes = v.heap_size();
+//! let total_bytes = v.total_size();
+//! ```
+
+#![doc(
+html_logo_url =
"https://raw.githubusercontent.com/apache/arrow-rs/refs/heads/main/docs/source/_static/images/Arrow-logo_hex_black-txt_transparent-bg.svg";,
+html_favicon_url =
"https://raw.githubusercontent.com/apache/arrow-rs/refs/heads/main/docs/source/_static/images/Arrow-logo_hex_black-txt_transparent-bg.svg";
+)]
+#![cfg_attr(docsrs, feature(doc_cfg))]
+#![warn(missing_docs)]
+
+use std::collections::{BTreeMap, BTreeSet, HashMap, HashSet};
+use std::sync::{Arc, Mutex, RwLock};
+
+/// Trait for calculating the heap memory size of a value.
+///
+/// This trait provides methods for calculating how much heap memory
+/// a data structure has allocated. This is useful for memory tracking,
+/// cache management, and debugging memory usage.
+///
+/// # Semantics
+///
+/// - [`heap_size`](HeapSize::heap_size): Returns only the bytes allocated on
the heap
+/// by this value, not including the size of the value itself.
+/// - [`total_size`](HeapSize::total_size): Returns the total memory footprint
including
+/// both the stack size of the value and its heap allocations.
+///
+/// # Example
+///
+/// ```
+/// use arrow_memory_size::HeapSize;
+///
+/// let s = String::from("hello");
+/// assert!(s.heap_size() >= 5); // At least 5 bytes for "hello"
+/// assert!(s.total_size() >= s.heap_size() + std::mem::size_of::());
+/// ```
+pub trait HeapSize {
+/// Return the size of any bytes allocated on the heap by this object,
+/// including heap memory in nested structures.
+///
+/// Note that the size of the type itself is not included in the result --
+/// instead, that size is added by the caller (e.g. container) or via
+/// [`total_size`](HeapSize::total_size).
+fn heap_size(&self) -> usize;
+
+/// Return the total size of this object including heap allocations
+/// and the size of the object itself.
+fn total_size(&self) -> usize {
Review Comment:
Here is `total_size()`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add a crate for HeapSize trait [arrow-rs]
alchemist51 commented on code in PR #9138:
URL: https://github.com/apache/arrow-rs/pull/9138#discussion_r2740878515
##
arrow-memory-size-derive/src/lib.rs:
##
@@ -0,0 +1,457 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Derive macro for the [`HeapSize`] trait.
+//!
+//! This crate provides a `#[derive(HeapSize)]` macro that automatically
+//! implements the `HeapSize` trait for structs and enums.
+//!
+//! # Example
+//!
+//! ```rust,ignore
+//! use arrow_memory_size::HeapSize;
+//! use arrow_memory_size_derive::HeapSize;
+//!
+//! #[derive(HeapSize)]
+//! struct MyStruct {
+//! name: String,
+//! data: Vec,
+//! count: i32,
+//! }
+//! ```
+//!
+//! # Field Attributes
+//!
+//! The derive macro supports several attributes to customize behavior:
+//!
+//! ## `#[heap_size(ignore)]`
+//!
+//! Skip this field entirely (contributes 0 to heap size).
+//!
+//! ```rust,ignore
+//! #[derive(HeapSize)]
+//! struct MyStruct {
+//! data: Vec,
+//! #[heap_size(ignore)]
+//! cached_hash: u64, // Not counted
+//! }
+//! ```
+//!
+//! ## `#[heap_size(size = N)]`
+//!
+//! Use a constant value instead of calling `heap_size()`.
+//!
+//! ```rust,ignore
+//! #[derive(HeapSize)]
+//! struct MyStruct {
+//! #[heap_size(size = 1024)]
+//! fixed_buffer: *const u8, // Known to be 1KB
+//! }
+//! ```
+//!
+//! ## `#[heap_size(size_fn = path)]`
+//!
+//! Call a custom function to compute the heap size.
+//! The function must have signature `fn(&FieldType) -> usize`.
+//!
+//! ```rust,ignore
+//! fn custom_size(data: &ExternalType) -> usize {
+//! data.len() * 8
+//! }
+//!
+//! #[derive(HeapSize)]
+//! struct MyStruct {
+//! #[heap_size(size_fn = custom_size)]
Review Comment:
Loved the interface to add custom size for custom objects!
##
arrow-array/src/heap_size.rs:
##
@@ -0,0 +1,138 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! [`HeapSize`] implementations for arrow-array types
+
+use arrow_memory_size::HeapSize;
+
+use crate::Array;
+use crate::types::{ArrowDictionaryKeyType, ArrowPrimitiveType,
RunEndIndexType};
+use crate::{
+BinaryArray, BinaryViewArray, BooleanArray, DictionaryArray,
FixedSizeBinaryArray,
+FixedSizeListArray, LargeBinaryArray, LargeListArray, LargeListViewArray,
LargeStringArray,
+ListArray, ListViewArray, MapArray, NullArray, PrimitiveArray, RunArray,
StringArray,
+StringViewArray, StructArray, UnionArray,
+};
+
+// Note: A blanket implementation `impl HeapSize for T` would be
ideal,
+// but is not possible due to Rust's orphan rules (E0210) since HeapSize is
defined
+// in a separate crate.
+//
+// Note: HeapSize cannot be implemented for ArrayRef (Arc) here due
to
+// Rust's orphan rules. Use array.get_buffer_memory_size() directly instead.
+
+/// Implements HeapSize for array types that delegate to
get_buffer_memory_size()
+macro_rules! impl_heap_size {
+($($ty:ty),*) => {
+$(
+impl HeapSize for $ty {
+fn heap_size(&self) -> usize {
+self.get_buffer_memory_size()
+}
+}
+)*
+};
+}
+
+impl_heap_size!(
+BooleanArray,
+NullArray,
+StringArray,
+LargeStringArray,
+BinaryArray,
+LargeBinaryArray,
+StringViewArray,
+BinaryViewArray,
+FixedSizeBinaryArray,
+ListArray,
+LargeListArray,
+ListViewArray,
+LargeListViewArray,
+
Re: [PR] Add a crate for HeapSize trait [arrow-rs]
alamb commented on PR #9138: URL: https://github.com/apache/arrow-rs/pull/9138#issuecomment-3740554793 I haven't had a chance to look through it (2k lines is a lot!) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add a crate for HeapSize trait [arrow-rs]
alamb commented on PR #9138: URL: https://github.com/apache/arrow-rs/pull/9138#issuecomment-3740553915 What do we think about just adding the new crate in DataFusion? Or is there more to this PR that I am missing? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add a crate for HeapSize trait [arrow-rs]
adriangb commented on PR #9138: URL: https://github.com/apache/arrow-rs/pull/9138#issuecomment-3740012566 I agree we can move this into `arrow-buffer` and then make a trait in DataFusion if that sounds better. We still a different crate for the macros though 😞 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add a crate for HeapSize trait [arrow-rs]
adriangb commented on code in PR #9138:
URL: https://github.com/apache/arrow-rs/pull/9138#discussion_r2683483197
##
Cargo.toml:
##
@@ -94,6 +96,8 @@ arrow-csv = { version = "57.2.0", path = "./arrow-csv" }
arrow-data = { version = "57.2.0", path = "./arrow-data" }
arrow-ipc = { version = "57.2.0", path = "./arrow-ipc" }
arrow-json = { version = "57.2.0", path = "./arrow-json" }
+arrow-memory-size = { version = "57.2.0", path = "./arrow-memory-size" }
Review Comment:
I'm happy to just move it there instead of a new crate, certainly easier
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add a crate for HeapSize trait [arrow-rs]
alamb commented on code in PR #9138:
URL: https://github.com/apache/arrow-rs/pull/9138#discussion_r2683277745
##
Cargo.toml:
##
@@ -94,6 +96,8 @@ arrow-csv = { version = "57.2.0", path = "./arrow-csv" }
arrow-data = { version = "57.2.0", path = "./arrow-data" }
arrow-ipc = { version = "57.2.0", path = "./arrow-ipc" }
arrow-json = { version = "57.2.0", path = "./arrow-json" }
+arrow-memory-size = { version = "57.2.0", path = "./arrow-memory-size" }
Review Comment:
If we are going to do this I would suggest we put it in `arrow-buffer` or
something
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add a crate for HeapSize trait [arrow-rs]
alamb commented on PR #9138:
URL: https://github.com/apache/arrow-rs/pull/9138#issuecomment-3739741636
I am not sure we need a new crate just for the HeapSize trait in a
downstream crate (DataFUsion)
What about making a trait like this in DataFusion common:
```rust
pub trait DFHeapSize {
...
}
```
And then providing a blanket implementation for `DFHeapSize` for everything
that implements `HeapSize`
```rust
impl for DFHeapSize {
...
}
```
That way we can use the arrow implementations, but we don't have to make a
whole new crate etc
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add a crate for HeapSize trait [arrow-rs]
adriangb commented on code in PR #9138:
URL: https://github.com/apache/arrow-rs/pull/9138#discussion_r2679000625
##
arrow-array/src/heap_size.rs:
##
@@ -0,0 +1,220 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! [`HeapSize`] implementations for arrow-array types
+
+use arrow_memory_size::HeapSize;
+
+use crate::Array;
+use crate::types::{ArrowDictionaryKeyType, ArrowPrimitiveType,
RunEndIndexType};
+use crate::{
+BinaryArray, BinaryViewArray, BooleanArray, DictionaryArray,
FixedSizeBinaryArray,
+FixedSizeListArray, LargeBinaryArray, LargeListArray, LargeListViewArray,
LargeStringArray,
+ListArray, ListViewArray, MapArray, NullArray, PrimitiveArray, RunArray,
StringArray,
+StringViewArray, StructArray, UnionArray,
+};
+
+// Note: HeapSize cannot be implemented for ArrayRef (Arc) here due
to
+// Rust's orphan rules. Use array.get_buffer_memory_size() directly instead.
+
+//
=
+// Primitive and Boolean Arrays
+//
=
+
+impl HeapSize for PrimitiveArray {
+fn heap_size(&self) -> usize {
+self.get_buffer_memory_size()
+}
+}
+
+impl HeapSize for BooleanArray {
+fn heap_size(&self) -> usize {
+self.get_buffer_memory_size()
+}
+}
+
+impl HeapSize for NullArray {
+fn heap_size(&self) -> usize {
+// NullArray has no buffers
+0
+}
+}
+
+//
=
+// String and Binary Arrays
+//
=
+
+impl HeapSize for StringArray {
+fn heap_size(&self) -> usize {
+self.get_buffer_memory_size()
+}
+}
Review Comment:
This could probably be a blanket implementation:
```rust
impl HeapSize for T
where
T: Array
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
