wecharyu commented on PR #48468: URL: https://github.com/apache/arrow/pull/48468#issuecomment-4012592452
> Well, the already compressed page data is sufficient to get a lower bound estimate.

The input batch size is uncertain. If we used the compressed page data already written to decide whether to flush a new row group, we would need to probe an appropriate batch size on each write; otherwise, writing the entire batch at once could cause the row group size to exceed `max_row_group_bytes` by a large margin. That would make things more complicated. Conversely, estimating the remaining number of rows from the total values written so far appears to be a more concise approach. It is similar to how arrow-rs uses `get_estimated_total_bytes` to split batches: https://github.com/apache/arrow-rs/blob/5ba451531efd2e98de38f6a8443aad605b6b5cc5/parquet/src/arrow/arrow_writer/mod.rs#L354-L380
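The row-estimation idea described above can be sketched as follows. This is a hypothetical illustration, not the PR's actual implementation: `rows_until_flush` and `split_batch` are invented names, and the logic simply derives a per-row byte estimate from what has been written so far and uses it to cap how many rows of the incoming batch go into the current row group.

```python
def rows_until_flush(bytes_written, rows_written, max_row_group_bytes):
    """Estimate how many more rows fit before reaching max_row_group_bytes.

    Uses the average bytes per row observed so far as the estimator.
    Returns None when no rows have been written yet (no basis to estimate).
    """
    if rows_written == 0:
        return None
    avg_bytes_per_row = bytes_written / rows_written
    remaining_bytes = max_row_group_bytes - bytes_written
    return max(0, int(remaining_bytes // avg_bytes_per_row))


def split_batch(batch_len, bytes_written, rows_written, max_row_group_bytes):
    """Return (rows to write into the current row group, rows deferred).

    Deferred rows would start a new row group, avoiding a large overshoot
    of max_row_group_bytes when a whole batch is written at once.
    """
    budget = rows_until_flush(bytes_written, rows_written, max_row_group_bytes)
    if budget is None or budget >= batch_len:
        return batch_len, 0
    return budget, batch_len - budget
```

For example, with 900 bytes written across 100 rows (9 bytes/row on average) and a 1000-byte limit, a 50-row batch would be split into 11 rows now and 39 rows deferred to the next row group.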
