jorgecarleitao commented on a change in pull request #9588:
URL: https://github.com/apache/arrow/pull/9588#discussion_r584293489
##########
File path: rust/arrow/src/array/array_binary.rs
##########
@@ -258,6 +258,8 @@ where
}
}
+ // calculate actual data_len, which may be different from the
iterator's upper bound
+ let data_len = offsets.len() - 1;
Review comment:
I think that @alamb is right (my comment is actually wrong).
@yordan-pavlov , the Iterator spec recommends, does not require, that the
iterator reports a correct length. Consumer that lead to undefined behavior
from an incorrect `size_hint` are the causers of said undefined behavior.
The only case where consumers can trust the iterators' length is when the
interator implement `unsafe trait TrustedLen`. Unfortunately, `TrustedLen` is
still in unstable. For that reason, we have been exposing `unsafe
Buffer::from_trusted_len_iter` and the like for those cases.
The issue here is that to correctly declare a FromIterator in this case, we
must use `BooleanBufferBuilder` for the null bitmap, which is significantly
slower than a `MutableBuffer`. This is why I added that `panic!` there, even
though it is unsound. This PR is just surfacing this problem where the current
implementation is `unsound` (because the iterator can be larger than
`upper_bound`).
Since we have been seeing people prefer the `FromIter` over the `Builder`
APIs, my suggestion is that we use `BooleanBufferBuilder`, like we are doing
for `PrimitiveArray`, and offer (in a future PR), a `from_trusted_len_iter` for
string arrays.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]