nevi-me commented on a change in pull request #381:
URL: https://github.com/apache/arrow-rs/pull/381#discussion_r642516947
##########
File path: parquet/src/arrow/arrow_writer.rs
##########
@@ -87,17 +92,31 @@ impl<W: 'static + ParquetWriter> ArrowWriter<W> {
"Record batch schema does not match writer schema".to_string(),
));
}
- // compute the definition and repetition levels of the batch
- let batch_level = LevelInfo::new_from_batch(batch);
- let mut row_group_writer = self.writer.next_row_group()?;
- for (array, field) in batch.columns().iter().zip(batch.schema().fields()) {
- let mut levels = batch_level.calculate_array_levels(array, field);
- // Reverse levels as we pop() them when writing arrays
- levels.reverse();
- write_leaves(&mut row_group_writer, array, &mut levels)?;
+ // Track the number of rows being written in the batch.
+ // We currently do not have a way of slicing nested arrays, thus we
+ // track this manually.
+ let num_rows = batch.num_rows();
+ let batches = (num_rows + self.max_row_group_size - 1) / self.max_row_group_size;
Review comment:
Yes, that's a fair observation. I think it's a bit tricky because we
would need to get the other 5 records from the next batch.
If we passed all batches at once, we would be able to segment them into
row groups of equal size.
This is something we can think about, as I think it's a valid expectation
from a user.
I can check whether we can keep row groups open, so that when the next
batch comes in, we take the remaining 5 records from it.
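
As a rough illustration of that idea, here is a minimal sketch of keeping a row
group "open" across batches so that every group except the last is filled to
`max_row_group_size`. The `BufferedWriter` type and its methods are hypothetical
and only track row counts; this is not the actual `ArrowWriter` API, just the
bookkeeping the comment is describing.

```rust
// Hypothetical sketch: buffer leftover rows so the next batch can top up the
// currently open row group to `max_row_group_size`.
struct BufferedWriter {
    max_row_group_size: usize,
    buffered_rows: usize,       // rows accepted but not yet flushed to a row group
    closed_groups: Vec<usize>,  // sizes of the row groups written so far
}

impl BufferedWriter {
    fn new(max_row_group_size: usize) -> Self {
        Self { max_row_group_size, buffered_rows: 0, closed_groups: Vec::new() }
    }

    /// Accept a batch of `num_rows`; close a row group every time the buffer
    /// reaches `max_row_group_size`, keeping the remainder open for the next batch.
    fn write_batch(&mut self, num_rows: usize) {
        self.buffered_rows += num_rows;
        while self.buffered_rows >= self.max_row_group_size {
            self.closed_groups.push(self.max_row_group_size);
            self.buffered_rows -= self.max_row_group_size;
        }
    }

    /// Flush whatever is left as a final, possibly smaller, row group.
    fn close(&mut self) {
        if self.buffered_rows > 0 {
            self.closed_groups.push(self.buffered_rows);
            self.buffered_rows = 0;
        }
    }
}

fn main() {
    // Two batches of 100 rows with a max row group size of 105:
    // the open group takes 5 rows from the second batch,
    // producing groups of [105, 95] instead of [100, 100].
    let mut w = BufferedWriter::new(105);
    w.write_batch(100);
    w.write_batch(100);
    w.close();
    assert_eq!(w.closed_groups, vec![105, 95]);
    println!("row group sizes: {:?}", w.closed_groups);
}
```

The trade-off is that the writer would have to hold (or slice) the tail of each
batch until the group is filled or the writer is closed, which is exactly the
nested-array slicing limitation mentioned in the diff above.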