[GitHub] [arrow] nevi-me commented on a change in pull request #7365: ARROW-9007: [Rust] Support appending array data to builders

2020-06-07 Thread GitBox


nevi-me commented on a change in pull request #7365:
URL: https://github.com/apache/arrow/pull/7365#discussion_r436354577



##
File path: rust/arrow/src/array/builder.rs
##
@@ -577,6 +632,81 @@ where
 self
 }
 
+/// Appends data from other arrays into the builder
+///
+/// This is most useful when concatenating arrays of the same type into a 
builder.
+fn append_data(&mut self, data: &[ArrayDataRef]) -> Result<()> {
+if !check_array_data_type(&self.data_type(), data) {
+return Err(ArrowError::InvalidArgumentError(
+"Cannot append data to builder if data types are 
different".to_string(),
+));
+}
+// determine the latest offset on the builder
+let mut cum_offset = if self.offsets_builder.len() == 0 {
+0
+} else {
+// peek into buffer to get last appended offset
+let buffer = self.offsets_builder.buffer.data();
+let len = self.offsets_builder.len();
+let (start, end) = ((len - 1) * 4, len * 4);
+let slice = &buffer[start..end];
+i32::from_le_bytes(slice.try_into().unwrap())
+};
+for array in data {
+if array.child_data().len() != 1 {

Review comment:
   > But I do think we should make sure error handling behavior is 
consistent, i.e. invalid input should not lead to partial append to the array. 
If we are to use ArrayDataRef as input type, then I feel like we do need to 
have custom validation logic for each array type as you mentioned.
   
   Thanks, I agree. Changes made in my latest commit.
   
   > The idea of using ArrayRef as input type is interesting. It does look like 
a simpler interface for end users and could simplify the error handling logic. 
   
   The preferred way to convert `&[ArrayRef]` to `ArrayRef` will be the 
`concat` kernel that you've added. concat might then forego its current 
validation, and potentially be like:
   
   ```rust
   pub fn concat(array_list: &[ArrayRef]) -> Result {
   // get data type from first element
   // create builder for data type (this'll have to cater for structs and 
lists)
   // pass `ArrayDataRef`to builder
   // finish builder and return `ArrayRef`
   }
   ```
   
   > What's the downside of using ArrayRef here compared to ArrayDataRef?
   
   `ArrayDataRef` is more flexible. If someone is creating Arrow data from raw 
data, there currently isn't much flexibility for them, especially when working 
with nested data structures. It might be more convenient to then create 
`ArrayData` instead of going all the way to create an array only to append it 
to a builder.
   Constructing an `ArrayRef` to append is an extra step and at worst requires 
going through `arrow::utils::make_array(data: ArrayRef)`.
   
   The upside of `ArrayRef` is skipping the validation checks, though I wonder 
what cost the checks result in. We can wait for other reviewers' opinions on 
their necessity.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nevi-me commented on a change in pull request #7365: ARROW-9007: [Rust] Support appending array data to builders

2020-06-06 Thread GitBox


nevi-me commented on a change in pull request #7365:
URL: https://github.com/apache/arrow/pull/7365#discussion_r436308395



##
File path: rust/arrow/src/array/builder.rs
##
@@ -841,12 +1048,91 @@ impl ArrayBuilder for StringBuilder {
 }
 }
 
+// Helper function for appending Binary and Utf8 data
+fn append_binary_data(
+builder: &mut ListBuilder,
+data_type: &DataType,
+data: &[ArrayDataRef],
+) -> Result<()> {
+if !check_array_data_type(data_type, data) {
+return Err(ArrowError::InvalidArgumentError(
+"Cannot append data to builder if data types are 
different".to_string(),
+));
+}
+for array in data {
+// convert string to List to reuse list's cast
+let int_data = &array.buffers()[1];
+let int_data = Arc::new(ArrayData::new(
+DataType::UInt8,
+int_data.len(),
+None,
+None,
+0,
+vec![int_data.clone()],
+vec![],
+)) as ArrayDataRef;
+let list_data = Arc::new(ArrayData::new(
+DataType::List(Box::new(DataType::UInt8)),
+array.len(),
+None,
+array.null_buffer().map(|buf| buf.clone()),
+array.offset(),
+vec![(&array.buffers()[0]).clone()],
+vec![int_data],
+));
+builder.append_data(&[list_data])?;
+}
+Ok(())
+}
+
 impl ArrayBuilder for FixedSizeBinaryBuilder {
 /// Returns the builder as a non-mutable `Any` reference.
 fn as_any(&self) -> &Any {
 self
 }
 
+/// Appends data from other arrays into the builder
+///
+/// This is most useful when concatenating arrays of the same type into a 
builder.
+fn append_data(&mut self, data: &[ArrayDataRef]) -> Result<()> {
+if !check_array_data_type(&self.data_type(), data) {
+return Err(ArrowError::InvalidArgumentError(
+"Cannot append data to builder if data types are 
different".to_string(),
+));
+}
+for array in data {
+// convert string to FixedSizeList to reuse list's append
+let int_data = &array.buffers()[0];
+let int_data = Arc::new(ArrayData::new(
+DataType::UInt8,
+int_data.len(),
+None,
+None,
+0,
+vec![int_data.clone()],
+vec![],
+)) as ArrayDataRef;
+let list_data = Arc::new(ArrayData::new(
+DataType::FixedSizeList(Box::new(DataType::UInt8), 
self.builder.list_len),

Review comment:
   I don't understand you entirely here, mind clarifying?
   Here I'm reconstructing the data as a FSList from FSBinary, 
otherwise I'd get an error about data types not being the same.
   
   I used list_len directly because i32 is a Copy type. I can change to 
builder.value_length()





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nevi-me commented on a change in pull request #7365: ARROW-9007: [Rust] Support appending array data to builders

2020-06-06 Thread GitBox


nevi-me commented on a change in pull request #7365:
URL: https://github.com/apache/arrow/pull/7365#discussion_r436307832



##
File path: rust/arrow/src/array/builder.rs
##
@@ -577,6 +632,81 @@ where
 self
 }
 
+/// Appends data from other arrays into the builder
+///
+/// This is most useful when concatenating arrays of the same type into a 
builder.
+fn append_data(&mut self, data: &[ArrayDataRef]) -> Result<()> {
+if !check_array_data_type(&self.data_type(), data) {
+return Err(ArrowError::InvalidArgumentError(
+"Cannot append data to builder if data types are 
different".to_string(),
+));
+}
+// determine the latest offset on the builder
+let mut cum_offset = if self.offsets_builder.len() == 0 {
+0
+} else {
+// peek into buffer to get last appended offset
+let buffer = self.offsets_builder.buffer.data();
+let len = self.offsets_builder.len();
+let (start, end) = ((len - 1) * 4, len * 4);
+let slice = &buffer[start..end];
+i32::from_le_bytes(slice.try_into().unwrap())
+};
+for array in data {
+if array.child_data().len() != 1 {

Review comment:
   The validation is only for data type, so we'd have to make a call on 
whether passing array data that's invalid should be undefined behaviour. If we 
passed in ArrayRef, we'd be certain that data is valid, but otherwise nothing 
stops someone from manually constructing ArrayDataRef incorrectly and passing 
it in. The validation check here at least give the user feedback, otherwise it 
would be a generic bounds error.
   
   I could alternatively customise the validation for different types, with 
potential allocation for both value and bitmap builders for primitive arrays. 
It becomes a slippery slope for lists and structs because those can be deeply 
nested.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nevi-me commented on a change in pull request #7365: ARROW-9007: [Rust] Support appending array data to builders

2020-06-06 Thread GitBox


nevi-me commented on a change in pull request #7365:
URL: https://github.com/apache/arrow/pull/7365#discussion_r436257140



##
File path: rust/arrow/src/array/builder.rs
##
@@ -1018,6 +1271,50 @@ impl ArrayBuilder for StructBuilder {
 self.len
 }
 
+/// Appends data from other arrays into the builder
+///
+/// This is most useful when concatenating arrays of the same type into a 
builder.
+fn append_data(&mut self, data: &[ArrayDataRef]) -> Result<()> {
+for array in data {
+if let DataType::Struct(fields) = array.data_type() {
+if &self.fields != fields {
+return Err(ArrowError::InvalidArgumentError(
+"Struct arrays are not the same".to_string(),
+));
+}
+let len = array.len();
+if len == 0 {
+continue;
+}
+let offset = array.offset();
+let results: Result> = self
+.field_builders
+.iter_mut()
+.zip(array.child_data())
+.map(|(builder, child_data)| {
+// slice child_data to account for offsets
+let child_array = make_array(child_data.clone());
+let sliced = child_array.slice(offset, len);
+builder.append_data(&[sliced.data()])
+})
+.collect();
+results?;

Review comment:
   Thanks, I've changed it





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nevi-me commented on a change in pull request #7365: ARROW-9007: [Rust] Support appending array data to builders

2020-06-06 Thread GitBox


nevi-me commented on a change in pull request #7365:
URL: https://github.com/apache/arrow/pull/7365#discussion_r436254430



##
File path: rust/arrow/src/array/builder.rs
##
@@ -450,6 +455,44 @@ impl ArrayBuilder for 
PrimitiveBuilder {
 self.values_builder.len
 }
 
+/// Appends data from other arrays into the builder
+///
+/// This is most useful when concatenating arrays of the same type into a 
builder.
+fn append_data(&mut self, data: &[ArrayDataRef]) -> Result<()> {
+let mul = T::get_bit_width() / 8;
+for array in data {
+if array.data_type() != &T::get_data_type() {

Review comment:
   I considered this, I ideally wanted to leave the responsibility to the 
user, as we'd provide convenience methods like concat, but since this is 
public, I should check data types first, then return early if there's a 
mismatch. I'll update





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org