jorgecarleitao commented on a change in pull request #9588:
URL: https://github.com/apache/arrow/pull/9588#discussion_r584293489



##########
File path: rust/arrow/src/array/array_binary.rs
##########
@@ -258,6 +258,8 @@ where
             }
         }
 
+        // calculate actual data_len, which may be different from the 
iterator's upper bound
+        let data_len = offsets.len() - 1;

Review comment:
       I think that @alamb is right (my comment is actually wrong). 
@yordan-pavlov , the Iterator spec recommends, does not require, that the 
iterator reports a correct length. Consumer that lead to undefined behavior 
from an incorrect `size_hint` are the causers of said undefined behavior.
   
   The only case where consumers can trust the iterators' length is when the 
interator implement `unsafe trait TrustedLen`. Unfortunately, `TrustedLen` is 
still in unstable. For that reason, we have been exposing `unsafe 
Buffer::from_trusted_len_iter` and the like for those cases.
   
   The issue here is that to correctly declare a FromIterator in this case, we 
must use `BooleanBufferBuilder` for the null bitmap, which is significantly 
slower than a `MutableBuffer`. This is why I added that `panic!` there, even 
though it is unsound. This PR is just surfacing this problem where the current 
implementation is `unsound` (because the iterator can be larger or smaller than 
`upper_bound`).
   
   Since we have been seeing people prefer the `FromIter` over the `Builder` 
APIs, my suggestion is that we use `BooleanBufferBuilder`, like we are doing 
for `PrimitiveArray`, and offer (in a future PR), a `from_trusted_len_iter` for 
string arrays.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to