tyrelr commented on a change in pull request #9215: URL: https://github.com/apache/arrow/pull/9215#discussion_r562999845
########## File path: rust/arrow/src/array/array_primitive.rs ########## @@ -86,13 +86,9 @@ impl<T: ArrowPrimitiveType> PrimitiveArray<T> { } /// Returns the primitive value at index `i`. - /// - /// Note this doesn't do any bound checking, for performance reason. - /// # Safety - /// caller must ensure that the passed in offset is less than the array len() + #[inline] pub fn value(&self, i: usize) -> T::Native { - let offset = i + self.offset(); - unsafe { *self.raw_values.as_ptr().add(offset) } + self.values()[i] Review comment: #9291 is good progress towards eliminating use of the function. And certainly, we could 'split' the macro for primitives as a quick fix to get rid of the call to the function. I've been experimenting with an alternative approach that might be a bit more flexible to multiple use cases, described at the bottom of this comment. I am quite torn about whether I think value should or should not be in the interface. # Reasons to drop value(i) -> T::Native I think that even if value(i) was dropped from the PrimitiveArray impl's, efficient random access to items without a bounds check can still be achieved through ```unsafe{*primitive_array.values().get_unchecked(i)}``` (the extra * because get_unchecked() returns a ref to the value). I'm not sure I have any example code or measurements to demonstrate it on hand, but I am certain I saw the silently-unsafe implementation```x.values().iter().zip(y.values().iter())``` did (slightly) outperform ```(0..x.len()).map(|i|{x.value(i),y.value(i)}```. I believe it was when I was playing with non-simd arithmetic kernels.... So that is the root of my hesitancy, is I'm worried it doesn't actually escape any overhead, and unintentionally lead people away from a more reliable/performant way. IF there is a context where unsafe{x.value(i)} beats the performance of unsafe{*x.values().get_unchecked(i)} # Reasons to keep value(i) -> T::Native All other array implementations have value functions as far as I recall, so it is a nice 'consistency'. In the back of my mind, the biggest argument to keep value(i) is for api consistency... so long term, a 'trait' may be the place where it might fit best? Very roughly, I'm thinking: ``` trait TypedArrowArray : ArrowArray { type RefType; fn is_valid(i:usize) -> bool; //bounds check unsafe fn is_valid(i:usize) -> bool; //no bounds check fn value(i:usize) -> RefType; //bounds check unsafe fn value_unchecked(i:usize) -> RefType; //no bounds checked fn iter() -> impl Iterator<Option<RefType>>; fn iter_values() -> impl Iterator<RefType>; } impl <T: ArrowPrimitiveType> TypedArrowArray<&T::Native> for PrimitiveArray<T> { ... } impl TypedArrowArray<ArrayRef> for GenericListArray<T> { ... } //and similar for string/binary. ... I am not sure whether struct arrays could fit... Dictionary would not give access to 'keys', only to the values referenced by each key? Union would require some kind of RefType that can downcast into the actual value? ``` Of course, I am uncertain how much overhead the 'standarization' such a trait impl implies would bring... would any kernels actually benefit from using generic implementations against such an api, or will they always go down to the concrete type to squeeze little short-cuts out that don't fit in the generic interface? I'm unsure, so very (very, very) slowly experimenting... # Summary So in short, my thoughts are: * I think that leaving value(i) safety consideration out of this PR makes sense. I've rebased to drop that out - although I did leave the additional values() test code. * Marking it unsafe in the near future is absolutely better than being silently-unsafe. The argument that adding bounds-checks could silently impact external users is reasonable, taking unsafe has the larger 'warning' so that the change isn't missed. * Longer term, the options of deprecating it, or explicitly moving it into an trait impl are both contenders in my mind... but neither option is directly relevant to this PR. Let me know if that seems reasonable. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org