Re: [PR] Add `BooleanBufferBuilder::extend_trusted_len` [arrow-rs]
alamb commented on PR #9137:
URL: https://github.com/apache/arrow-rs/pull/9137#issuecomment-3734899785
> I see your point but I think I don't see it being possible to unify them
("copy bits from bitmap to another", vs "write a iterator of bools"). I guess
the "copy bits" should be faster than iterator-based for large buffers.
Yeah -- I think you are right. I will keep thinking about the "right" set
of bit manipulation primitives (and hopefully get them centralized into
MutableBuffer).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add `BooleanBufferBuilder::extend_trusted_len` [arrow-rs]
Dandandan merged PR #9137: URL: https://github.com/apache/arrow-rs/pull/9137 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add `BooleanBufferBuilder::extend_trusted_len` [arrow-rs]
Dandandan commented on PR #9137:
URL: https://github.com/apache/arrow-rs/pull/9137#issuecomment-3734529496
> I know I am harping on it a lot, but I can't help shake the feeling that
this is so similar to
[set_bits](https://github.com/apache/arrow-rs/blob/28cf02db5500d9aa1a8effecc9622b331a5a69fe/arrow-buffer/src/util/bit_mask.rs#L28).
That takes an existing set of bits, I realize, but the core use case is to
extend an existing mutable buffer
I see your point but I think I don't see it being possible to unify them
("copy bits from bitmap to another", vs "write a iterator of bools"). I guess
the "copy bits" should be faster than iterator-based for large buffers.
```
/// This will sets all bits on `write_data` in the range
`[offset_write..offset_write+len]`
/// to be equal to the bits in `data` in the range
`[offset_read..offset_read+len]`
/// returns the number of `0` bits `data[offset_read..offset_read+len]`
/// `offset_write`, `offset_read`, and `len` are in terms of bits
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add `BooleanBufferBuilder::extend_trusted_len` [arrow-rs]
Dandandan commented on code in PR #9137:
URL: https://github.com/apache/arrow-rs/pull/9137#discussion_r2679502471
##
arrow-buffer/src/buffer/mutable.rs:
##
@@ -623,6 +623,139 @@ impl MutableBuffer {
buffer
}
+/// Advances the buffer by `additional` bits without initializing the new
bytes.
+///
+/// # Safety
+/// Callers must ensure that all newly added bits are written before the
buffer is read.
+#[inline]
+unsafe fn advance_uninit(&mut self, additional: usize) {
+let new_len = self.len + additional;
+let new_len_bytes = bit_util::ceil(new_len, 8);
+if new_len_bytes > self.len() {
+self.reserve(new_len_bytes - self.len());
+// SAFETY: caller will initialize all newly exposed bytes
+unsafe { self.set_len(new_len_bytes) };
+}
+self.len = new_len;
Review Comment:
Done
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add `BooleanBufferBuilder::extend_trusted_len` [arrow-rs]
Dandandan commented on code in PR #9137:
URL: https://github.com/apache/arrow-rs/pull/9137#discussion_r2679500665
##
arrow-buffer/src/buffer/mutable.rs:
##
@@ -623,6 +623,139 @@ impl MutableBuffer {
buffer
}
+/// Advances the buffer by `additional` bits without initializing the new
bytes.
+///
+/// # Safety
+/// Callers must ensure that all newly added bits are written before the
buffer is read.
+#[inline]
+unsafe fn advance_uninit(&mut self, additional: usize) {
+let new_len = self.len + additional;
+let new_len_bytes = bit_util::ceil(new_len, 8);
+if new_len_bytes > self.len() {
+self.reserve(new_len_bytes - self.len());
+// SAFETY: caller will initialize all newly exposed bytes
+unsafe { self.set_len(new_len_bytes) };
+}
+self.len = new_len;
Review Comment:
Thanks I already pushed a fix in
`https://github.com/apache/arrow-rs/pull/9137/changes/88603765b81895d2412aba31c3b6e8196e789c31
I will probably steal your tests :)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add `BooleanBufferBuilder::extend_trusted_len` [arrow-rs]
alamb commented on code in PR #9137:
URL: https://github.com/apache/arrow-rs/pull/9137#discussion_r2679461906
##
arrow-buffer/src/buffer/mutable.rs:
##
@@ -623,6 +623,139 @@ impl MutableBuffer {
buffer
}
+/// Advances the buffer by `additional` bits without initializing the new
bytes.
+///
+/// # Safety
+/// Callers must ensure that all newly added bits are written before the
buffer is read.
+#[inline]
+unsafe fn advance_uninit(&mut self, additional: usize) {
+let new_len = self.len + additional;
+let new_len_bytes = bit_util::ceil(new_len, 8);
+if new_len_bytes > self.len() {
+self.reserve(new_len_bytes - self.len());
+// SAFETY: caller will initialize all newly exposed bytes
+unsafe { self.set_len(new_len_bytes) };
Review Comment:
I almost wonder if it would make sense to put this call to `self.reserve` in
`MutableBuffer::set_len`
##
arrow-buffer/src/buffer/mutable.rs:
##
@@ -623,6 +623,139 @@ impl MutableBuffer {
buffer
}
+/// Advances the buffer by `additional` bits without initializing the new
bytes.
+///
+/// # Safety
+/// Callers must ensure that all newly added bits are written before the
buffer is read.
+#[inline]
+unsafe fn advance_uninit(&mut self, additional: usize) {
+let new_len = self.len + additional;
+let new_len_bytes = bit_util::ceil(new_len, 8);
+if new_len_bytes > self.len() {
+self.reserve(new_len_bytes - self.len());
+// SAFETY: caller will initialize all newly exposed bytes
+unsafe { self.set_len(new_len_bytes) };
+}
+self.len = new_len;
+}
+
+/// Extends this builder with boolean values.
+///
+/// This requires `iter` to report an exact size via `size_hint`.
+///
+/// # Safety
+/// Callers must ensure that `iter` reports an exact size via `size_hint`.
+#[inline]
+pub unsafe fn extend_bool_trusted_len>(
+&mut self,
+iter: I,
+offset: usize,
+) {
+let (lower, upper) = iter.size_hint();
+let len = upper.expect("Iterator must have exact size_hint");
+assert_eq!(lower, len, "Iterator must have exact size_hint");
+
+if len == 0 {
+return;
+}
+
+let start_len = offset;
+let end_bit = start_len + len;
+
+// SAFETY: we will initialize all newly exposed bytes before they are
read
+unsafe { self.advance_uninit(len) };
+let slice = self.as_slice_mut();
+
+let mut iter = iter;
Review Comment:
nit you could avoid this by just declaring the parameter `mut` I think
```diff
--- a/arrow-buffer/src/buffer/mutable.rs
+++ b/arrow-buffer/src/buffer/mutable.rs
@@ -648,7 +648,7 @@ impl MutableBuffer {
#[inline]
pub unsafe fn extend_bool_trusted_len>(
&mut self,
-iter: I,
+mut iter: I,
offset: usize,
) {
let (lower, upper) = iter.size_hint();
```
##
arrow-buffer/src/builder/boolean.rs:
##
@@ -259,6 +259,20 @@ impl BooleanBufferBuilder {
pub fn finish_cloned(&self) -> BooleanBuffer {
BooleanBuffer::new(Buffer::from_slice_ref(self.as_slice()), 0,
self.len)
}
+
+/// Extends the builder from a trusted length iterator of booleans.
+/// # Safety
+/// Callers must ensure that `iter` reports an exact size via `size_hint`.
+///
+#[inline]
+pub unsafe fn extend_trusted_len(&mut self, iterator: I)
+where
+I: Iterator,
+{
+let len = iterator.size_hint().0;
+unsafe { self.buffer.extend_bool_trusted_len(iterator, self.len) };
Review Comment:
It might make it easier to understand the correctness if
`extend_bool_trusted_len` returned the number of bits that were appended rather
than having to get it from the iterator
##
arrow-buffer/src/builder/boolean.rs:
##
@@ -526,4 +540,65 @@ mod tests {
assert_eq!(buf.len(), buf2.inner().len());
assert_eq!(buf.as_slice(), buf2.values());
}
+
+#[test]
+fn test_extend() {
Review Comment:
I recommend adding some explicit tests for the mutable buffer API directly
as well.
##
arrow-buffer/src/buffer/mutable.rs:
##
@@ -623,6 +623,139 @@ impl MutableBuffer {
buffer
}
+/// Advances the buffer by `additional` bits without initializing the new
bytes.
+///
+/// # Safety
+/// Callers must ensure that all newly added bits are written before the
buffer is read.
+#[inline]
+unsafe fn advance_uninit(&mut self, additional: usize) {
+let new_len = self.len + additional;
+let new_len_bytes = bit_util::ceil(new_len, 8);
+if new_len_bytes > self.len() {
+self.reserve(new_len_bytes - self.len());
+// SAFETY: caller will
Re: [PR] Add `BooleanBufferBuilder::extend_trusted_len` [arrow-rs]
alamb commented on code in PR #9137:
URL: https://github.com/apache/arrow-rs/pull/9137#discussion_r2679461520
##
arrow-buffer/src/buffer/mutable.rs:
##
@@ -623,6 +623,139 @@ impl MutableBuffer {
buffer
}
+/// Advances the buffer by `additional` bits without initializing the new
bytes.
+///
+/// # Safety
+/// Callers must ensure that all newly added bits are written before the
buffer is read.
+#[inline]
+unsafe fn advance_uninit(&mut self, additional: usize) {
+let new_len = self.len + additional;
+let new_len_bytes = bit_util::ceil(new_len, 8);
+if new_len_bytes > self.len() {
+self.reserve(new_len_bytes - self.len());
+// SAFETY: caller will initialize all newly exposed bytes
+unsafe { self.set_len(new_len_bytes) };
+}
+self.len = new_len;
Review Comment:
Some thoughts reading this: It is pretty similar to `set_len` except:
`set_len` doesn't change capacity and is in terms of bytes
Some readability suggestions:
1. Make it work with bytes (not bits) and move the byte len calculation to
`extend_bool_trusted_iter`
2. Potentially just use `set_len` (maybe adding the reserve calculation)
Since this is not a public API I don't think it is a big deal and/or we can
do this a follow on PR
##
arrow-buffer/src/buffer/mutable.rs:
##
@@ -623,6 +623,139 @@ impl MutableBuffer {
buffer
}
+/// Advances the buffer by `additional` bits without initializing the new
bytes.
+///
+/// # Safety
+/// Callers must ensure that all newly added bits are written before the
buffer is read.
+#[inline]
+unsafe fn advance_uninit(&mut self, additional: usize) {
+let new_len = self.len + additional;
+let new_len_bytes = bit_util::ceil(new_len, 8);
+if new_len_bytes > self.len() {
+self.reserve(new_len_bytes - self.len());
+// SAFETY: caller will initialize all newly exposed bytes
+unsafe { self.set_len(new_len_bytes) };
+}
+self.len = new_len;
Review Comment:
Some thoughts reading this: It is pretty similar to `set_len` except:
`set_len` doesn't change capacity and is in terms of bytes
Some readability suggestions:
1. Make it work with bytes (not bits) and move the byte len calculation to
`extend_bool_trusted_iter`
2. Potentially just use `set_len` (maybe adding the reserve calculation)
Since this is not a public API I don't think it is a big deal and/or we can
do this a follow on PR
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add `BooleanBufferBuilder::extend_trusted_len` [arrow-rs]
Dandandan commented on code in PR #9137:
URL: https://github.com/apache/arrow-rs/pull/9137#discussion_r2679403039
##
arrow-buffer/src/builder/boolean.rs:
##
@@ -259,6 +259,135 @@ impl BooleanBufferBuilder {
pub fn finish_cloned(&self) -> BooleanBuffer {
BooleanBuffer::new(Buffer::from_slice_ref(self.as_slice()), 0,
self.len)
}
+
+/// Advances the buffer by `additional` bits without initializing the new
bytes.
+///
+/// # Safety
+/// Callers must ensure that all newly added bits are written before the
buffer is read.
+#[inline]
+unsafe fn advance_uninit(&mut self, additional: usize) {
+let new_len = self.len + additional;
+let new_len_bytes = bit_util::ceil(new_len, 8);
+if new_len_bytes > self.buffer.len() {
+self.buffer.reserve(new_len_bytes - self.buffer.len());
+// SAFETY: caller will initialize all newly exposed bytes
+unsafe { self.buffer.set_len(new_len_bytes) };
+}
+self.len = new_len;
+}
+
+/// Extends this builder with boolean values.
+///
+/// This requires `iter` to report an exact size via `size_hint`.
+///
+/// # Safety
+/// Callers must ensure that `iter` reports an exact size via `size_hint`.
+#[inline]
+pub unsafe fn extend_trusted_len>(&mut self,
iter: I) {
Review Comment:
Done
##
arrow-buffer/src/builder/boolean.rs:
##
@@ -259,6 +259,135 @@ impl BooleanBufferBuilder {
pub fn finish_cloned(&self) -> BooleanBuffer {
BooleanBuffer::new(Buffer::from_slice_ref(self.as_slice()), 0,
self.len)
}
+
+/// Advances the buffer by `additional` bits without initializing the new
bytes.
Review Comment:
Done
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add `BooleanBufferBuilder::extend_trusted_len` [arrow-rs]
alamb commented on code in PR #9137:
URL: https://github.com/apache/arrow-rs/pull/9137#discussion_r2678872270
##
arrow-buffer/src/builder/boolean.rs:
##
@@ -259,6 +259,135 @@ impl BooleanBufferBuilder {
pub fn finish_cloned(&self) -> BooleanBuffer {
BooleanBuffer::new(Buffer::from_slice_ref(self.as_slice()), 0,
self.len)
}
+
+/// Advances the buffer by `additional` bits without initializing the new
bytes.
+///
+/// # Safety
+/// Callers must ensure that all newly added bits are written before the
buffer is read.
+#[inline]
+unsafe fn advance_uninit(&mut self, additional: usize) {
+let new_len = self.len + additional;
+let new_len_bytes = bit_util::ceil(new_len, 8);
+if new_len_bytes > self.buffer.len() {
+self.buffer.reserve(new_len_bytes - self.buffer.len());
+// SAFETY: caller will initialize all newly exposed bytes
+unsafe { self.buffer.set_len(new_len_bytes) };
+}
+self.len = new_len;
+}
+
+/// Extends this builder with boolean values.
+///
+/// This requires `iter` to report an exact size via `size_hint`.
+///
+/// # Safety
+/// Callers must ensure that `iter` reports an exact size via `size_hint`.
+#[inline]
+pub unsafe fn extend_trusted_len>(&mut self,
iter: I) {
+let (lower, upper) = iter.size_hint();
+let len = upper.expect("Iterator must have exact size_hint");
+assert_eq!(lower, len, "Iterator must have exact size_hint");
+
+if len == 0 {
+return;
+}
+
+let start_len = self.len;
+let end_bit = start_len + len;
+
+// SAFETY: we will initialize all newly exposed bytes before they are
read
+unsafe { self.advance_uninit(len) };
+let slice = self.buffer.as_slice_mut();
+
+let mut iter = iter;
+let mut bit_idx = start_len;
+
+// Unaligned prefix: advance to the next 64-bit boundary
+let misalignment = bit_idx & 63;
+let prefix_bits = if misalignment == 0 {
+0
+} else {
+(64 - misalignment).min(end_bit - bit_idx)
+};
+
+if prefix_bits != 0 {
+let byte_start = bit_idx / 8;
+let byte_end = bit_util::ceil(bit_idx + prefix_bits, 8);
+let bit_offset = bit_idx % 8;
+
+// Clear any newly-visible bits in the existing partial byte
+if bit_offset != 0 {
+let keep_mask = (1u8 << bit_offset).wrapping_sub(1);
+slice[byte_start] &= keep_mask;
+}
+
+// Zero any new bytes we will partially fill in this prefix
+let zero_from = if bit_offset == 0 {
+byte_start
+} else {
+byte_start + 1
+};
+if byte_end > zero_from {
+slice[zero_from..byte_end].fill(0);
+}
+
+for _ in 0..prefix_bits {
+let v = iter.next().unwrap();
+if v {
+let byte_idx = bit_idx / 8;
+let bit = bit_idx % 8;
+slice[byte_idx] |= 1 << bit;
+}
+bit_idx += 1;
+}
+}
+
+if bit_idx < end_bit {
+// Aligned middle: write u64 chunks
+debug_assert_eq!(bit_idx & 63, 0);
+let remaining_bits = end_bit - bit_idx;
+let chunks = remaining_bits / 64;
+
+let words_start = bit_idx / 8;
+let words_end = words_start + chunks * 8;
+for dst in slice[words_start..words_end].chunks_exact_mut(8) {
Review Comment:
makes sense
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add `BooleanBufferBuilder::extend_trusted_len` [arrow-rs]
Dandandan commented on code in PR #9137:
URL: https://github.com/apache/arrow-rs/pull/9137#discussion_r2678772049
##
arrow-buffer/src/builder/boolean.rs:
##
@@ -259,6 +259,135 @@ impl BooleanBufferBuilder {
pub fn finish_cloned(&self) -> BooleanBuffer {
BooleanBuffer::new(Buffer::from_slice_ref(self.as_slice()), 0,
self.len)
}
+
+/// Advances the buffer by `additional` bits without initializing the new
bytes.
Review Comment:
Hmm yeah seems reasonable I think
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add `BooleanBufferBuilder::extend_trusted_len` [arrow-rs]
Dandandan commented on code in PR #9137:
URL: https://github.com/apache/arrow-rs/pull/9137#discussion_r2678771615
##
arrow-buffer/src/builder/boolean.rs:
##
@@ -259,6 +259,135 @@ impl BooleanBufferBuilder {
pub fn finish_cloned(&self) -> BooleanBuffer {
BooleanBuffer::new(Buffer::from_slice_ref(self.as_slice()), 0,
self.len)
}
+
+/// Advances the buffer by `additional` bits without initializing the new
bytes.
+///
+/// # Safety
+/// Callers must ensure that all newly added bits are written before the
buffer is read.
+#[inline]
+unsafe fn advance_uninit(&mut self, additional: usize) {
+let new_len = self.len + additional;
+let new_len_bytes = bit_util::ceil(new_len, 8);
+if new_len_bytes > self.buffer.len() {
+self.buffer.reserve(new_len_bytes - self.buffer.len());
+// SAFETY: caller will initialize all newly exposed bytes
+unsafe { self.buffer.set_len(new_len_bytes) };
+}
+self.len = new_len;
+}
+
+/// Extends this builder with boolean values.
+///
+/// This requires `iter` to report an exact size via `size_hint`.
+///
+/// # Safety
+/// Callers must ensure that `iter` reports an exact size via `size_hint`.
+#[inline]
+pub unsafe fn extend_trusted_len>(&mut self,
iter: I) {
+let (lower, upper) = iter.size_hint();
+let len = upper.expect("Iterator must have exact size_hint");
+assert_eq!(lower, len, "Iterator must have exact size_hint");
+
+if len == 0 {
+return;
+}
+
+let start_len = self.len;
+let end_bit = start_len + len;
+
+// SAFETY: we will initialize all newly exposed bytes before they are
read
+unsafe { self.advance_uninit(len) };
+let slice = self.buffer.as_slice_mut();
+
+let mut iter = iter;
+let mut bit_idx = start_len;
+
+// Unaligned prefix: advance to the next 64-bit boundary
+let misalignment = bit_idx & 63;
+let prefix_bits = if misalignment == 0 {
+0
+} else {
+(64 - misalignment).min(end_bit - bit_idx)
+};
+
+if prefix_bits != 0 {
+let byte_start = bit_idx / 8;
+let byte_end = bit_util::ceil(bit_idx + prefix_bits, 8);
+let bit_offset = bit_idx % 8;
+
+// Clear any newly-visible bits in the existing partial byte
+if bit_offset != 0 {
+let keep_mask = (1u8 << bit_offset).wrapping_sub(1);
+slice[byte_start] &= keep_mask;
+}
+
+// Zero any new bytes we will partially fill in this prefix
+let zero_from = if bit_offset == 0 {
+byte_start
+} else {
+byte_start + 1
+};
+if byte_end > zero_from {
+slice[zero_from..byte_end].fill(0);
+}
+
+for _ in 0..prefix_bits {
+let v = iter.next().unwrap();
+if v {
+let byte_idx = bit_idx / 8;
+let bit = bit_idx % 8;
+slice[byte_idx] |= 1 << bit;
+}
+bit_idx += 1;
+}
+}
+
+if bit_idx < end_bit {
+// Aligned middle: write u64 chunks
+debug_assert_eq!(bit_idx & 63, 0);
+let remaining_bits = end_bit - bit_idx;
+let chunks = remaining_bits / 64;
+
+let words_start = bit_idx / 8;
+let words_end = words_start + chunks * 8;
+for dst in slice[words_start..words_end].chunks_exact_mut(8) {
Review Comment:
`1.31.0` it says. Only `const` seems unstable...
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add `BooleanBufferBuilder::extend_trusted_len` [arrow-rs]
alamb commented on code in PR #9137:
URL: https://github.com/apache/arrow-rs/pull/9137#discussion_r2678747639
##
arrow-buffer/src/builder/boolean.rs:
##
@@ -526,4 +655,65 @@ mod tests {
assert_eq!(buf.len(), buf2.inner().len());
assert_eq!(buf.as_slice(), buf2.values());
}
+
+#[test]
+fn test_extend() {
+let mut builder = BooleanBufferBuilder::new(0);
+let bools = vec![true, false, true, true, false, true, true, true,
false];
+unsafe { builder.extend_trusted_len(bools.clone().into_iter()) };
+assert_eq!(builder.len(), 9);
+let finished = builder.finish();
+for (i, v) in bools.into_iter().enumerate() {
+assert_eq!(finished.value(i), v);
+}
+
+// Test > 64 bits
+let mut builder = BooleanBufferBuilder::new(0);
+let bools: Vec<_> = (0..100).map(|i| i % 3 == 0 || i % 7 ==
0).collect();
+unsafe { builder.extend_trusted_len(bools.clone().into_iter()) };
+assert_eq!(builder.len(), 100);
+let finished = builder.finish();
+for (i, v) in bools.into_iter().enumerate() {
+assert_eq!(finished.value(i), v, "at index {}", i);
+}
+}
+
+#[test]
+fn test_extend_misaligned() {
+// Test misaligned start
+for offset in 1..65 {
+let mut builder = BooleanBufferBuilder::new(0);
+builder.append_n(offset, false);
+
+let bools: Vec<_> = (0..100).map(|i| i % 3 == 0 || i % 7 ==
0).collect();
+unsafe { builder.extend_trusted_len(bools.clone().into_iter()) };
+assert_eq!(builder.len(), offset + 100);
+
+let finished = builder.finish();
+for i in 0..offset {
+assert!(!finished.value(i));
+}
+for (i, v) in bools.into_iter().enumerate() {
+assert_eq!(finished.value(offset + i), v, "at index {}",
offset + i);
+}
+}
+}
+
+#[test]
+fn test_extend_misaligned_end() {
+for len in 1..130 {
+let mut builder = BooleanBufferBuilder::new(0);
+let mut bools: Vec<_> = (0..len).map(|i| i % 2 == 0).collect();
+unsafe { builder.extend_trusted_len(bools.clone().into_iter()) };
+unsafe { builder.extend_trusted_len(bools.clone().into_iter()) };
+let copy = bools.clone();
+bools.extend(copy);
+assert_eq!(builder.len(), 2 * len);
+
+let finished = builder.finish();
+for (i, &v) in bools.iter().enumerate() {
+assert_eq!(finished.value(i), v, "at index {} for len {}", i,
len);
+}
+}
+}
Review Comment:
Update -- I actually verified you have basically already done this
```shell
cargo llvm-cov --html test -p arrow-buffer
```
https://github.com/user-attachments/assets/e99b7b89-3f9b-40cf-a38e-7e9cb553ecb4";
/>
So my analyss is that the testing is good
##
arrow-buffer/src/builder/boolean.rs:
##
@@ -526,4 +655,65 @@ mod tests {
assert_eq!(buf.len(), buf2.inner().len());
assert_eq!(buf.as_slice(), buf2.values());
}
+
+#[test]
+fn test_extend() {
+let mut builder = BooleanBufferBuilder::new(0);
+let bools = vec![true, false, true, true, false, true, true, true,
false];
+unsafe { builder.extend_trusted_len(bools.clone().into_iter()) };
+assert_eq!(builder.len(), 9);
+let finished = builder.finish();
+for (i, v) in bools.into_iter().enumerate() {
+assert_eq!(finished.value(i), v);
+}
+
+// Test > 64 bits
+let mut builder = BooleanBufferBuilder::new(0);
+let bools: Vec<_> = (0..100).map(|i| i % 3 == 0 || i % 7 ==
0).collect();
+unsafe { builder.extend_trusted_len(bools.clone().into_iter()) };
+assert_eq!(builder.len(), 100);
+let finished = builder.finish();
+for (i, v) in bools.into_iter().enumerate() {
+assert_eq!(finished.value(i), v, "at index {}", i);
+}
+}
+
+#[test]
+fn test_extend_misaligned() {
+// Test misaligned start
+for offset in 1..65 {
+let mut builder = BooleanBufferBuilder::new(0);
+builder.append_n(offset, false);
+
+let bools: Vec<_> = (0..100).map(|i| i % 3 == 0 || i % 7 ==
0).collect();
+unsafe { builder.extend_trusted_len(bools.clone().into_iter()) };
+assert_eq!(builder.len(), offset + 100);
+
+let finished = builder.finish();
+for i in 0..offset {
+assert!(!finished.value(i));
+}
+for (i, v) in bools.into_iter().enumerate() {
+assert_eq!(finished.value(offset + i), v, "at index {}",
offset + i);
+}
+}
+}
+
+#[test]
+fn test_extend_misaligned_end() {
+for len in 1..130 {
+
Re: [PR] Add `BooleanBufferBuilder::extend_trusted_len` [arrow-rs]
alamb commented on code in PR #9137:
URL: https://github.com/apache/arrow-rs/pull/9137#discussion_r2678743814
##
arrow-buffer/src/builder/boolean.rs:
##
@@ -259,6 +259,135 @@ impl BooleanBufferBuilder {
pub fn finish_cloned(&self) -> BooleanBuffer {
BooleanBuffer::new(Buffer::from_slice_ref(self.as_slice()), 0,
self.len)
}
+
+/// Advances the buffer by `additional` bits without initializing the new
bytes.
+///
+/// # Safety
+/// Callers must ensure that all newly added bits are written before the
buffer is read.
+#[inline]
+unsafe fn advance_uninit(&mut self, additional: usize) {
+let new_len = self.len + additional;
+let new_len_bytes = bit_util::ceil(new_len, 8);
+if new_len_bytes > self.buffer.len() {
+self.buffer.reserve(new_len_bytes - self.buffer.len());
+// SAFETY: caller will initialize all newly exposed bytes
+unsafe { self.buffer.set_len(new_len_bytes) };
+}
+self.len = new_len;
+}
+
+/// Extends this builder with boolean values.
+///
+/// This requires `iter` to report an exact size via `size_hint`.
+///
+/// # Safety
+/// Callers must ensure that `iter` reports an exact size via `size_hint`.
+#[inline]
+pub unsafe fn extend_trusted_len>(&mut self,
iter: I) {
+let (lower, upper) = iter.size_hint();
+let len = upper.expect("Iterator must have exact size_hint");
+assert_eq!(lower, len, "Iterator must have exact size_hint");
+
+if len == 0 {
+return;
+}
+
+let start_len = self.len;
+let end_bit = start_len + len;
+
+// SAFETY: we will initialize all newly exposed bytes before they are
read
+unsafe { self.advance_uninit(len) };
+let slice = self.buffer.as_slice_mut();
+
+let mut iter = iter;
+let mut bit_idx = start_len;
+
+// Unaligned prefix: advance to the next 64-bit boundary
+let misalignment = bit_idx & 63;
+let prefix_bits = if misalignment == 0 {
+0
+} else {
+(64 - misalignment).min(end_bit - bit_idx)
+};
+
+if prefix_bits != 0 {
+let byte_start = bit_idx / 8;
+let byte_end = bit_util::ceil(bit_idx + prefix_bits, 8);
+let bit_offset = bit_idx % 8;
+
+// Clear any newly-visible bits in the existing partial byte
+if bit_offset != 0 {
+let keep_mask = (1u8 << bit_offset).wrapping_sub(1);
+slice[byte_start] &= keep_mask;
+}
+
+// Zero any new bytes we will partially fill in this prefix
+let zero_from = if bit_offset == 0 {
+byte_start
+} else {
+byte_start + 1
+};
+if byte_end > zero_from {
+slice[zero_from..byte_end].fill(0);
+}
+
+for _ in 0..prefix_bits {
+let v = iter.next().unwrap();
+if v {
+let byte_idx = bit_idx / 8;
+let bit = bit_idx % 8;
+slice[byte_idx] |= 1 << bit;
+}
+bit_idx += 1;
+}
+}
+
+if bit_idx < end_bit {
+// Aligned middle: write u64 chunks
+debug_assert_eq!(bit_idx & 63, 0);
+let remaining_bits = end_bit - bit_idx;
+let chunks = remaining_bits / 64;
+
+let words_start = bit_idx / 8;
+let words_end = words_start + chunks * 8;
+for dst in slice[words_start..words_end].chunks_exact_mut(8) {
Review Comment:
But the MSRV check passes, so I must be confused
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add `BooleanBufferBuilder::extend_trusted_len` [arrow-rs]
alamb commented on PR #9137: URL: https://github.com/apache/arrow-rs/pull/9137#issuecomment-3732983204 FYI @jhorstmann as I think you are also into low level bit twiddling and perhaps have some more suggestions for this one -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add `BooleanBufferBuilder::extend_trusted_len` [arrow-rs]
alamb commented on code in PR #9137:
URL: https://github.com/apache/arrow-rs/pull/9137#discussion_r2678735476
##
arrow-buffer/src/builder/boolean.rs:
##
@@ -259,6 +259,135 @@ impl BooleanBufferBuilder {
pub fn finish_cloned(&self) -> BooleanBuffer {
BooleanBuffer::new(Buffer::from_slice_ref(self.as_slice()), 0,
self.len)
}
+
+/// Advances the buffer by `additional` bits without initializing the new
bytes.
+///
+/// # Safety
+/// Callers must ensure that all newly added bits are written before the
buffer is read.
+#[inline]
+unsafe fn advance_uninit(&mut self, additional: usize) {
+let new_len = self.len + additional;
+let new_len_bytes = bit_util::ceil(new_len, 8);
+if new_len_bytes > self.buffer.len() {
+self.buffer.reserve(new_len_bytes - self.buffer.len());
+// SAFETY: caller will initialize all newly exposed bytes
+unsafe { self.buffer.set_len(new_len_bytes) };
+}
+self.len = new_len;
+}
+
+/// Extends this builder with boolean values.
+///
+/// This requires `iter` to report an exact size via `size_hint`.
+///
+/// # Safety
+/// Callers must ensure that `iter` reports an exact size via `size_hint`.
+#[inline]
+pub unsafe fn extend_trusted_len>(&mut self,
iter: I) {
Review Comment:
So the idea would be that we would have
`MutableBuffer::extend_bool_trusted_len` or something that is in parallel to
collect_bool
Then this method would just call the mutable buffer version
##
arrow-buffer/src/builder/boolean.rs:
##
@@ -526,4 +655,65 @@ mod tests {
assert_eq!(buf.len(), buf2.inner().len());
assert_eq!(buf.as_slice(), buf2.values());
}
+
+#[test]
+fn test_extend() {
+let mut builder = BooleanBufferBuilder::new(0);
+let bools = vec![true, false, true, true, false, true, true, true,
false];
+unsafe { builder.extend_trusted_len(bools.clone().into_iter()) };
+assert_eq!(builder.len(), 9);
+let finished = builder.finish();
+for (i, v) in bools.into_iter().enumerate() {
+assert_eq!(finished.value(i), v);
+}
+
+// Test > 64 bits
+let mut builder = BooleanBufferBuilder::new(0);
+let bools: Vec<_> = (0..100).map(|i| i % 3 == 0 || i % 7 ==
0).collect();
+unsafe { builder.extend_trusted_len(bools.clone().into_iter()) };
+assert_eq!(builder.len(), 100);
+let finished = builder.finish();
+for (i, v) in bools.into_iter().enumerate() {
+assert_eq!(finished.value(i), v, "at index {}", i);
+}
+}
+
+#[test]
+fn test_extend_misaligned() {
+// Test misaligned start
+for offset in 1..65 {
+let mut builder = BooleanBufferBuilder::new(0);
+builder.append_n(offset, false);
+
+let bools: Vec<_> = (0..100).map(|i| i % 3 == 0 || i % 7 ==
0).collect();
+unsafe { builder.extend_trusted_len(bools.clone().into_iter()) };
+assert_eq!(builder.len(), offset + 100);
+
+let finished = builder.finish();
+for i in 0..offset {
+assert!(!finished.value(i));
+}
+for (i, v) in bools.into_iter().enumerate() {
+assert_eq!(finished.value(offset + i), v, "at index {}",
offset + i);
+}
+}
+}
+
+#[test]
+fn test_extend_misaligned_end() {
+for len in 1..130 {
+let mut builder = BooleanBufferBuilder::new(0);
+let mut bools: Vec<_> = (0..len).map(|i| i % 2 == 0).collect();
+unsafe { builder.extend_trusted_len(bools.clone().into_iter()) };
+unsafe { builder.extend_trusted_len(bools.clone().into_iter()) };
+let copy = bools.clone();
+bools.extend(copy);
+assert_eq!(builder.len(), 2 * len);
+
+let finished = builder.finish();
+for (i, &v) in bools.iter().enumerate() {
+assert_eq!(finished.value(i), v, "at index {} for len {}", i,
len);
+}
+}
+}
Review Comment:
For this API I think we should also implement a "fuzz" test to ensure all
the paths are covered
Something like
https://github.com/apache/arrow-rs/blob/96637fc8b928a94de53bbec3501337c0ecfbf936/arrow-buffer/src/buffer/boolean.rs#L780-L799
where we use both the extend iterator and push it bit by bit and verify the
output is the same
##
arrow-buffer/src/builder/boolean.rs:
##
@@ -259,6 +259,135 @@ impl BooleanBufferBuilder {
pub fn finish_cloned(&self) -> BooleanBuffer {
BooleanBuffer::new(Buffer::from_slice_ref(self.as_slice()), 0,
self.len)
}
+
+/// Advances the buffer by `additional` bits without initializing the new
bytes.
Review Comment:
What do you think about putting this code in MutableBuffer rather than
B
Re: [PR] Add `BooleanBufferBuilder::extend_trusted_len` [arrow-rs]
alamb commented on PR #9137: URL: https://github.com/apache/arrow-rs/pull/9137#issuecomment-3732876572 Yes, I have seen this before too -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
