Re: [PR] Optimize coalesce kernel for StringView (10-50% faster) [arrow-rs]
alamb commented on PR #7650: URL: https://github.com/apache/arrow-rs/pull/7650#issuecomment-2991634517 Thanks again for all the help @Dandandan -- I'll try and make a few more PRs to this kernel for primitive arrays as well as optimizing the filtering shortlt -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Optimize coalesce kernel for StringView (10-50% faster) [arrow-rs]
alamb commented on code in PR #7650:
URL: https://github.com/apache/arrow-rs/pull/7650#discussion_r2158986179
##
arrow-array/src/array/byte_view_array.rs:
##
@@ -479,6 +479,32 @@ impl GenericByteViewArray {
builder.finish()
}
+/// Returns the total number of bytes used by all non inlined views in all
+/// buffers.
+///
+/// Note this does not account for views that point at the same underlying
+/// data in buffers
+///
+/// For example, if the array has three strings views:
+/// * View with length = 9 (inlined)
+/// * View with length = 32 (non inlined)
+/// * View with length = 16 (non inlined)
+///
+/// Then this method would report 48
+pub fn total_buffer_bytes_used(&self) -> usize {
+self.views()
+.iter()
+.map(|v| {
+let len = (*v as u32) as usize;
+if len > 12 {
Review Comment:
This is a good idea and I will do it in a follow on PR
##
arrow-select/src/coalesce.rs:
##
@@ -212,32 +221,57 @@ impl BatchCoalescer {
/// assert_eq!(completed_batch, expected_batch);
/// ```
pub fn push_batch(&mut self, batch: RecordBatch) -> Result<(), ArrowError>
{
-if batch.num_rows() == 0 {
-// If the batch is empty, we don't need to do anything
+let (_schema, arrays, mut num_rows) = batch.into_parts();
Review Comment:
This is the core logic here -- to break up the RecordBatch and incrementally
copy rows into the target output.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Optimize coalesce kernel for StringView (10-50% faster) [arrow-rs]
alamb merged PR #7650: URL: https://github.com/apache/arrow-rs/pull/7650 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Optimize coalesce kernel for StringView (10-50% faster) [arrow-rs]
Dandandan commented on code in PR #7650:
URL: https://github.com/apache/arrow-rs/pull/7650#discussion_r2157733581
##
arrow-array/src/array/byte_view_array.rs:
##
@@ -479,6 +479,32 @@ impl GenericByteViewArray {
builder.finish()
}
+/// Returns the total number of bytes used by all non inlined views in all
+/// buffers.
+///
+/// Note this does not account for views that point at the same underlying
+/// data in buffers
+///
+/// For example, if the array has three strings views:
+/// * View with length = 9 (inlined)
+/// * View with length = 32 (non inlined)
+/// * View with length = 16 (non inlined)
+///
+/// Then this method would report 48
+pub fn total_buffer_bytes_used(&self) -> usize {
+self.views()
+.iter()
+.map(|v| {
+let len = (*v as u32) as usize;
+if len > 12 {
Review Comment:
We probably should set this as constant somewhere and use it
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Optimize coalesce kernel for StringView (10-50% faster) [arrow-rs]
alamb commented on PR #7650: URL: https://github.com/apache/arrow-rs/pull/7650#issuecomment-2987727569 I think this one is ready for review now -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
