[GitHub] [arrow] nevi-me commented on a change in pull request #7365: ARROW-9007: [Rust] Support appending array data to builders
nevi-me commented on a change in pull request #7365: URL: https://github.com/apache/arrow/pull/7365#discussion_r436354577 ## File path: rust/arrow/src/array/builder.rs ## @@ -577,6 +632,81 @@ where self } +/// Appends data from other arrays into the builder +/// +/// This is most useful when concatenating arrays of the same type into a builder. +fn append_data(&mut self, data: &[ArrayDataRef]) -> Result<()> { +if !check_array_data_type(&self.data_type(), data) { +return Err(ArrowError::InvalidArgumentError( +"Cannot append data to builder if data types are different".to_string(), +)); +} +// determine the latest offset on the builder +let mut cum_offset = if self.offsets_builder.len() == 0 { +0 +} else { +// peek into buffer to get last appended offset +let buffer = self.offsets_builder.buffer.data(); +let len = self.offsets_builder.len(); +let (start, end) = ((len - 1) * 4, len * 4); +let slice = &buffer[start..end]; +i32::from_le_bytes(slice.try_into().unwrap()) +}; +for array in data { +if array.child_data().len() != 1 { Review comment: > But I do think we should make sure error handling behavior is consistent, i.e. invalid input should not lead to partial append to the array. If we are to use ArrayDataRef as input type, then I feel like we do need to have custom validation logic for each array type as you mentioned. Thanks, I agree. Changes made in my latest commit. > The idea of using ArrayRef as input type is interesting. It does look like a simpler interface for end users and could simplify the error handling logic. The preferred way to convert `&[ArrayRef]` to `ArrayRef` will be the `concat` kernel that you've added. concat might then forego its current validation, and potentially be like: ```rust pub fn concat(array_list: &[ArrayRef]) -> Result { // get data type from first element // create builder for data type (this'll have to cater for structs and lists) // pass `ArrayDataRef`to builder // finish builder and return `ArrayRef` } ``` > What's the downside of using ArrayRef here compared to ArrayDataRef? `ArrayDataRef` is more flexible. If someone is creating Arrow data from raw data, there currently isn't much flexibility for them, especially when working with nested data structures. It might be more convenient to then create `ArrayData` instead of going all the way to create an array only to append it to a builder. Constructing an `ArrayRef` to append is an extra step and at worst requires going through `arrow::utils::make_array(data: ArrayRef)`. The upside of `ArrayRef` is skipping the validation checks, though I wonder what cost the checks result in. We can wait for other reviewers' opinions on their necessity. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nevi-me commented on a change in pull request #7365: ARROW-9007: [Rust] Support appending array data to builders
nevi-me commented on a change in pull request #7365: URL: https://github.com/apache/arrow/pull/7365#discussion_r436308395 ## File path: rust/arrow/src/array/builder.rs ## @@ -841,12 +1048,91 @@ impl ArrayBuilder for StringBuilder { } } +// Helper function for appending Binary and Utf8 data +fn append_binary_data( +builder: &mut ListBuilder, +data_type: &DataType, +data: &[ArrayDataRef], +) -> Result<()> { +if !check_array_data_type(data_type, data) { +return Err(ArrowError::InvalidArgumentError( +"Cannot append data to builder if data types are different".to_string(), +)); +} +for array in data { +// convert string to List to reuse list's cast +let int_data = &array.buffers()[1]; +let int_data = Arc::new(ArrayData::new( +DataType::UInt8, +int_data.len(), +None, +None, +0, +vec![int_data.clone()], +vec![], +)) as ArrayDataRef; +let list_data = Arc::new(ArrayData::new( +DataType::List(Box::new(DataType::UInt8)), +array.len(), +None, +array.null_buffer().map(|buf| buf.clone()), +array.offset(), +vec![(&array.buffers()[0]).clone()], +vec![int_data], +)); +builder.append_data(&[list_data])?; +} +Ok(()) +} + impl ArrayBuilder for FixedSizeBinaryBuilder { /// Returns the builder as a non-mutable `Any` reference. fn as_any(&self) -> &Any { self } +/// Appends data from other arrays into the builder +/// +/// This is most useful when concatenating arrays of the same type into a builder. +fn append_data(&mut self, data: &[ArrayDataRef]) -> Result<()> { +if !check_array_data_type(&self.data_type(), data) { +return Err(ArrowError::InvalidArgumentError( +"Cannot append data to builder if data types are different".to_string(), +)); +} +for array in data { +// convert string to FixedSizeList to reuse list's append +let int_data = &array.buffers()[0]; +let int_data = Arc::new(ArrayData::new( +DataType::UInt8, +int_data.len(), +None, +None, +0, +vec![int_data.clone()], +vec![], +)) as ArrayDataRef; +let list_data = Arc::new(ArrayData::new( +DataType::FixedSizeList(Box::new(DataType::UInt8), self.builder.list_len), Review comment: I don't understand you entirely here, mind clarifying? Here I'm reconstructing the data as a FSList from FSBinary, otherwise I'd get an error about data types not being the same. I used list_len directly because i32 is a Copy type. I can change to builder.value_length() This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nevi-me commented on a change in pull request #7365: ARROW-9007: [Rust] Support appending array data to builders
nevi-me commented on a change in pull request #7365: URL: https://github.com/apache/arrow/pull/7365#discussion_r436307832 ## File path: rust/arrow/src/array/builder.rs ## @@ -577,6 +632,81 @@ where self } +/// Appends data from other arrays into the builder +/// +/// This is most useful when concatenating arrays of the same type into a builder. +fn append_data(&mut self, data: &[ArrayDataRef]) -> Result<()> { +if !check_array_data_type(&self.data_type(), data) { +return Err(ArrowError::InvalidArgumentError( +"Cannot append data to builder if data types are different".to_string(), +)); +} +// determine the latest offset on the builder +let mut cum_offset = if self.offsets_builder.len() == 0 { +0 +} else { +// peek into buffer to get last appended offset +let buffer = self.offsets_builder.buffer.data(); +let len = self.offsets_builder.len(); +let (start, end) = ((len - 1) * 4, len * 4); +let slice = &buffer[start..end]; +i32::from_le_bytes(slice.try_into().unwrap()) +}; +for array in data { +if array.child_data().len() != 1 { Review comment: The validation is only for data type, so we'd have to make a call on whether passing array data that's invalid should be undefined behaviour. If we passed in ArrayRef, we'd be certain that data is valid, but otherwise nothing stops someone from manually constructing ArrayDataRef incorrectly and passing it in. The validation check here at least give the user feedback, otherwise it would be a generic bounds error. I could alternatively customise the validation for different types, with potential allocation for both value and bitmap builders for primitive arrays. It becomes a slippery slope for lists and structs because those can be deeply nested. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nevi-me commented on a change in pull request #7365: ARROW-9007: [Rust] Support appending array data to builders
nevi-me commented on a change in pull request #7365: URL: https://github.com/apache/arrow/pull/7365#discussion_r436257140 ## File path: rust/arrow/src/array/builder.rs ## @@ -1018,6 +1271,50 @@ impl ArrayBuilder for StructBuilder { self.len } +/// Appends data from other arrays into the builder +/// +/// This is most useful when concatenating arrays of the same type into a builder. +fn append_data(&mut self, data: &[ArrayDataRef]) -> Result<()> { +for array in data { +if let DataType::Struct(fields) = array.data_type() { +if &self.fields != fields { +return Err(ArrowError::InvalidArgumentError( +"Struct arrays are not the same".to_string(), +)); +} +let len = array.len(); +if len == 0 { +continue; +} +let offset = array.offset(); +let results: Result> = self +.field_builders +.iter_mut() +.zip(array.child_data()) +.map(|(builder, child_data)| { +// slice child_data to account for offsets +let child_array = make_array(child_data.clone()); +let sliced = child_array.slice(offset, len); +builder.append_data(&[sliced.data()]) +}) +.collect(); +results?; Review comment: Thanks, I've changed it This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nevi-me commented on a change in pull request #7365: ARROW-9007: [Rust] Support appending array data to builders
nevi-me commented on a change in pull request #7365: URL: https://github.com/apache/arrow/pull/7365#discussion_r436254430 ## File path: rust/arrow/src/array/builder.rs ## @@ -450,6 +455,44 @@ impl ArrayBuilder for PrimitiveBuilder { self.values_builder.len } +/// Appends data from other arrays into the builder +/// +/// This is most useful when concatenating arrays of the same type into a builder. +fn append_data(&mut self, data: &[ArrayDataRef]) -> Result<()> { +let mul = T::get_bit_width() / 8; +for array in data { +if array.data_type() != &T::get_data_type() { Review comment: I considered this, I ideally wanted to leave the responsibility to the user, as we'd provide convenience methods like concat, but since this is public, I should check data types first, then return early if there's a mismatch. I'll update This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org