[GitHub] [arrow] houqp commented on a change in pull request #7365: ARROW-9007: [Rust] Support appending array data to builders
houqp commented on a change in pull request #7365: URL: https://github.com/apache/arrow/pull/7365#discussion_r436336846 ## File path: rust/arrow/src/array/builder.rs ## @@ -577,6 +632,81 @@ where self } +/// Appends data from other arrays into the builder +/// +/// This is most useful when concatenating arrays of the same type into a builder. +fn append_data(&mut self, data: &[ArrayDataRef]) -> Result<()> { +if !check_array_data_type(&self.data_type(), data) { +return Err(ArrowError::InvalidArgumentError( +"Cannot append data to builder if data types are different".to_string(), +)); +} +// determine the latest offset on the builder +let mut cum_offset = if self.offsets_builder.len() == 0 { +0 +} else { +// peek into buffer to get last appended offset +let buffer = self.offsets_builder.buffer.data(); +let len = self.offsets_builder.len(); +let (start, end) = ((len - 1) * 4, len * 4); +let slice = &buffer[start..end]; +i32::from_le_bytes(slice.try_into().unwrap()) +}; +for array in data { +if array.child_data().len() != 1 { Review comment: I think we can leave the memory allocation performance optimization to future PRs given that number of arrays in `data: &[ArrayDataRef]` should not be too large. But I do think we should make sure error handling behavior is consistent, i.e. invalid input should not lead to partial append to the array. If we are to use ArrayDataRef as input type, then I feel like we do need to have custom validation logic for each array type as you mentioned. The idea of using `ArrayRef` as input type is interesting. It does look like a simpler interface for end users and could simplify the error handling logic. What's the downside of using `ArrayRef` here compared to `ArrayDataRef`? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] houqp commented on a change in pull request #7365: ARROW-9007: [Rust] Support appending array data to builders
houqp commented on a change in pull request #7365: URL: https://github.com/apache/arrow/pull/7365#discussion_r436335986 ## File path: rust/arrow/src/array/builder.rs ## @@ -841,12 +1048,91 @@ impl ArrayBuilder for StringBuilder { } } +// Helper function for appending Binary and Utf8 data +fn append_binary_data( +builder: &mut ListBuilder, +data_type: &DataType, +data: &[ArrayDataRef], +) -> Result<()> { +if !check_array_data_type(data_type, data) { +return Err(ArrowError::InvalidArgumentError( +"Cannot append data to builder if data types are different".to_string(), +)); +} +for array in data { +// convert string to List to reuse list's cast +let int_data = &array.buffers()[1]; +let int_data = Arc::new(ArrayData::new( +DataType::UInt8, +int_data.len(), +None, +None, +0, +vec![int_data.clone()], +vec![], +)) as ArrayDataRef; +let list_data = Arc::new(ArrayData::new( +DataType::List(Box::new(DataType::UInt8)), +array.len(), +None, +array.null_buffer().map(|buf| buf.clone()), +array.offset(), +vec![(&array.buffers()[0]).clone()], +vec![int_data], +)); +builder.append_data(&[list_data])?; +} +Ok(()) +} + impl ArrayBuilder for FixedSizeBinaryBuilder { /// Returns the builder as a non-mutable `Any` reference. fn as_any(&self) -> &Any { self } +/// Appends data from other arrays into the builder +/// +/// This is most useful when concatenating arrays of the same type into a builder. +fn append_data(&mut self, data: &[ArrayDataRef]) -> Result<()> { +if !check_array_data_type(&self.data_type(), data) { +return Err(ArrowError::InvalidArgumentError( +"Cannot append data to builder if data types are different".to_string(), +)); +} +for array in data { +// convert string to FixedSizeList to reuse list's append +let int_data = &array.buffers()[0]; +let int_data = Arc::new(ArrayData::new( +DataType::UInt8, +int_data.len(), +None, +None, +0, +vec![int_data.clone()], +vec![], +)) as ArrayDataRef; +let list_data = Arc::new(ArrayData::new( +DataType::FixedSizeList(Box::new(DataType::UInt8), self.builder.list_len), Review comment: Ha, good point, I forgot byte_width is part of the data type signature :+1: with regards to `value_length()` v.s. `list_len`, after thinking more about it, I think you are right to use list_len here since these are all internal module implementation. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] houqp commented on a change in pull request #7365: ARROW-9007: [Rust] Support appending array data to builders
houqp commented on a change in pull request #7365: URL: https://github.com/apache/arrow/pull/7365#discussion_r436303006 ## File path: rust/arrow/src/array/builder.rs ## @@ -841,12 +1048,91 @@ impl ArrayBuilder for StringBuilder { } } +// Helper function for appending Binary and Utf8 data +fn append_binary_data( +builder: &mut ListBuilder, +data_type: &DataType, +data: &[ArrayDataRef], +) -> Result<()> { +if !check_array_data_type(data_type, data) { +return Err(ArrowError::InvalidArgumentError( +"Cannot append data to builder if data types are different".to_string(), +)); +} +for array in data { +// convert string to List to reuse list's cast +let int_data = &array.buffers()[1]; +let int_data = Arc::new(ArrayData::new( +DataType::UInt8, +int_data.len(), +None, +None, +0, +vec![int_data.clone()], +vec![], +)) as ArrayDataRef; +let list_data = Arc::new(ArrayData::new( +DataType::List(Box::new(DataType::UInt8)), +array.len(), +None, +array.null_buffer().map(|buf| buf.clone()), +array.offset(), +vec![(&array.buffers()[0]).clone()], +vec![int_data], +)); +builder.append_data(&[list_data])?; +} +Ok(()) +} + impl ArrayBuilder for FixedSizeBinaryBuilder { /// Returns the builder as a non-mutable `Any` reference. fn as_any(&self) -> &Any { self } +/// Appends data from other arrays into the builder +/// +/// This is most useful when concatenating arrays of the same type into a builder. +fn append_data(&mut self, data: &[ArrayDataRef]) -> Result<()> { +if !check_array_data_type(&self.data_type(), data) { +return Err(ArrowError::InvalidArgumentError( +"Cannot append data to builder if data types are different".to_string(), +)); +} +for array in data { +// convert string to FixedSizeList to reuse list's append +let int_data = &array.buffers()[0]; +let int_data = Arc::new(ArrayData::new( +DataType::UInt8, +int_data.len(), +None, +None, +0, +vec![int_data.clone()], +vec![], +)) as ArrayDataRef; +let list_data = Arc::new(ArrayData::new( +DataType::FixedSizeList(Box::new(DataType::UInt8), self.builder.list_len), Review comment: do we need to validate list_len for each ArrayData as well? Also I recommend calling value_length() method to get list_len value instead or remove value_length entirely and use list_len directly everywhere. ## File path: rust/arrow/src/array/builder.rs ## @@ -1018,6 +1304,48 @@ impl ArrayBuilder for StructBuilder { self.len } +/// Appends data from other arrays into the builder +/// +/// This is most useful when concatenating arrays of the same type into a builder. +fn append_data(&mut self, data: &[ArrayDataRef]) -> Result<()> { +if !check_array_data_type(&self.data_type(), data) { +return Err(ArrowError::InvalidArgumentError( +"Cannot append data to builder if data types are different".to_string(), +)); +} +for array in data { +let len = array.len(); +if len == 0 { +continue; +} +let offset = array.offset(); +for (builder, child_data) in self +.field_builders +.iter_mut() +.zip(array.child_data().iter()) +{ +// slice child_data to account for offsets +let child_array = make_array(child_data.clone()); +let sliced = child_array.slice(offset, len); +builder.append_data(&[sliced.data()])?; +} +// append array length +self.len += len; +for i in 0..len { +// account for offset as `ArrayData` does not +self.bitmap_builder.append(array.is_valid(offset + i))?; Review comment: looks like we missed the reserve call before the loop here. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] houqp commented on a change in pull request #7365: ARROW-9007: [Rust] Support appending array data to builders
houqp commented on a change in pull request #7365: URL: https://github.com/apache/arrow/pull/7365#discussion_r436301311 ## File path: rust/arrow/src/array/builder.rs ## @@ -577,6 +632,81 @@ where self } +/// Appends data from other arrays into the builder +/// +/// This is most useful when concatenating arrays of the same type into a builder. +fn append_data(&mut self, data: &[ArrayDataRef]) -> Result<()> { +if !check_array_data_type(&self.data_type(), data) { +return Err(ArrowError::InvalidArgumentError( +"Cannot append data to builder if data types are different".to_string(), +)); +} +// determine the latest offset on the builder +let mut cum_offset = if self.offsets_builder.len() == 0 { +0 +} else { +// peek into buffer to get last appended offset +let buffer = self.offsets_builder.buffer.data(); +let len = self.offsets_builder.len(); +let (start, end) = ((len - 1) * 4, len * 4); +let slice = &buffer[start..end]; +i32::from_le_bytes(slice.try_into().unwrap()) +}; +for array in data { +if array.child_data().len() != 1 { Review comment: If we are already doing validation before mutating data (which I think it's the right behavior), it's best to move all validation logic into the initial validation loop. Another thing we can move into the initial loop is stats gathering for optimization purpose. For example, we can calculate aggregated element count in the loop. Then before we enter the mutation loop, call reserve method on various builders to allocate memory in one go. This way, we don't have to trigger multiple memory reallocation in the mutation loop. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] houqp commented on a change in pull request #7365: ARROW-9007: [Rust] Support appending array data to builders
houqp commented on a change in pull request #7365: URL: https://github.com/apache/arrow/pull/7365#discussion_r436301311 ## File path: rust/arrow/src/array/builder.rs ## @@ -577,6 +632,81 @@ where self } +/// Appends data from other arrays into the builder +/// +/// This is most useful when concatenating arrays of the same type into a builder. +fn append_data(&mut self, data: &[ArrayDataRef]) -> Result<()> { +if !check_array_data_type(&self.data_type(), data) { +return Err(ArrowError::InvalidArgumentError( +"Cannot append data to builder if data types are different".to_string(), +)); +} +// determine the latest offset on the builder +let mut cum_offset = if self.offsets_builder.len() == 0 { +0 +} else { +// peek into buffer to get last appended offset +let buffer = self.offsets_builder.buffer.data(); +let len = self.offsets_builder.len(); +let (start, end) = ((len - 1) * 4, len * 4); +let slice = &buffer[start..end]; +i32::from_le_bytes(slice.try_into().unwrap()) +}; +for array in data { +if array.child_data().len() != 1 { Review comment: If we are already doing validation before mutating data (which I think it's the right behavior), it's best to move all validation logic into the initial validation loop. Another thing we can move into the initial loop is stats gathering for optimization purpose. For example, we can calculate aggregated element count in the loop. Then before we enter the mutation loop, call reserve method on various builders to allocate memory in one go. This way, we don't have to trigger multiple memory reallocation in the mutation loop, especially for bitmap_builder. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] houqp commented on a change in pull request #7365: ARROW-9007: [Rust] Support appending array data to builders
houqp commented on a change in pull request #7365: URL: https://github.com/apache/arrow/pull/7365#discussion_r436250767 ## File path: rust/arrow/src/array/builder.rs ## @@ -577,6 +620,78 @@ where self } +/// Appends data from other arrays into the builder +/// +/// This is most useful when concatenating arrays of the same type into a builder. +fn append_data(&mut self, data: &[ArrayDataRef]) -> Result<()> { +// determine the latest offset on the builder +let mut cum_offset = if self.offsets_builder.len() == 0 { +0 +} else { +// peek into buffer to get last appended offset +let buffer = self.offsets_builder.buffer.data(); +let len = self.offsets_builder.len(); +let (start, end) = ((len - 1) * 4, len * 4); +let slice = &buffer[start..end]; +i32::from_le_bytes(slice.try_into().unwrap()) +}; +for array in data { +if let DataType::List(_) = array.data_type() { +if array.child_data().len() != 1 { +return Err(ArrowError::InvalidArgumentError( +"When appending list arrays, data must contain 1 child_data element" +.to_string(), +)); +} +let len = array.len(); +if len == 0 { +continue; +} +let offset = array.offset(); + +// `typed_data` is unsafe, however this call is safe as `ListArray` has i32 offsets +unsafe { Review comment: looks like scope of this unsafe block can be reduced? ## File path: rust/arrow/src/array/builder.rs ## @@ -450,6 +455,44 @@ impl ArrayBuilder for PrimitiveBuilder { self.values_builder.len } +/// Appends data from other arrays into the builder +/// +/// This is most useful when concatenating arrays of the same type into a builder. +fn append_data(&mut self, data: &[ArrayDataRef]) -> Result<()> { +let mul = T::get_bit_width() / 8; +for array in data { +if array.data_type() != &T::get_data_type() { Review comment: if one of the array data has the wrong type, the append will be partially applied, is this intentional? ## File path: rust/arrow/src/array/builder.rs ## @@ -1018,6 +1271,50 @@ impl ArrayBuilder for StructBuilder { self.len } +/// Appends data from other arrays into the builder +/// +/// This is most useful when concatenating arrays of the same type into a builder. +fn append_data(&mut self, data: &[ArrayDataRef]) -> Result<()> { +for array in data { +if let DataType::Struct(fields) = array.data_type() { +if &self.fields != fields { +return Err(ArrowError::InvalidArgumentError( +"Struct arrays are not the same".to_string(), Review comment: nitpick, perhaps `have different fields` would be a better wording here. ## File path: rust/arrow/src/array/builder.rs ## @@ -1018,6 +1271,50 @@ impl ArrayBuilder for StructBuilder { self.len } +/// Appends data from other arrays into the builder +/// +/// This is most useful when concatenating arrays of the same type into a builder. +fn append_data(&mut self, data: &[ArrayDataRef]) -> Result<()> { +for array in data { +if let DataType::Struct(fields) = array.data_type() { +if &self.fields != fields { +return Err(ArrowError::InvalidArgumentError( +"Struct arrays are not the same".to_string(), +)); +} +let len = array.len(); +if len == 0 { +continue; +} +let offset = array.offset(); +let results: Result> = self +.field_builders +.iter_mut() +.zip(array.child_data()) +.map(|(builder, child_data)| { +// slice child_data to account for offsets +let child_array = make_array(child_data.clone()); +let sliced = child_array.slice(offset, len); +builder.append_data(&[sliced.data()]) +}) +.collect(); +results?; Review comment: minor, if results is not used, then it's better to use a for loop over zipped iterators above to avoid creating a temporary vector at the end. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries ab