rynewang opened a new pull request, #48718: URL: https://github.com/apache/arrow/pull/48718
### Rationale for this change

Fixes https://github.com/apache/arrow/issues/36889

When writing CSV from a table whose first batch is empty, the header is written twice:

```python
import io
import pyarrow as pa
from pyarrow.csv import write_csv

table = pa.table({"col1": ["a", "b", "c"]})
combined = pa.concat_tables([table.schema.empty_table(), table])
buf = io.BytesIO()
write_csv(combined, buf)
# Result: "col1"\n"col1"\n"a"\n"b"\n"c"\n  <-- header appears twice
```

### What changes are included in this PR?

The bug happens because:

1. The header is written to `data_buffer_` and flushed during `CSVWriterImpl` initialization.
2. `TranslateMinimalBatch` returns early for empty batches without modifying `data_buffer_`.
3. The `WriteTable`/`WriteRecordBatch` loop then writes `data_buffer_`, which still contains the stale header.

The fix clears the buffer (resizes it to 0) when `TranslateMinimalBatch` encounters an empty batch, so the subsequent write outputs nothing.

### Are these changes tested?

Yes. Added C++ tests in `writer_test.cc` and Python tests in `test_csv.py`:

- Empty batch at start of table
- Empty batch in middle of table

### Are there any user-facing changes?

No API changes. This is a bug fix that prevents duplicate headers when writing CSV from tables with empty batches.
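The three steps listed under "The bug happens because" can be modelled in a few lines of plain Python. This is only an illustrative sketch, not Arrow's C++ code: `ToyCsvWriter`, `render`, and the `clear_on_empty` flag are hypothetical names invented here, and the string attribute `data_buffer` merely stands in for `data_buffer_`.

```python
class ToyCsvWriter:
    """Toy model of the buffer handling described in this PR (hypothetical, not Arrow code)."""

    def __init__(self, column_names, clear_on_empty):
        self.sink = []  # stands in for the output stream
        self.clear_on_empty = clear_on_empty
        # Step 1: the header is rendered into the buffer and flushed immediately.
        self.data_buffer = ",".join(f'"{name}"' for name in column_names) + "\n"
        self._flush()

    def _flush(self):
        # Writes whatever the buffer currently holds; the buffer is not reset here.
        self.sink.append(self.data_buffer)

    def _translate_batch(self, rows):
        if not rows:
            # Step 2: early return for an empty batch.
            if self.clear_on_empty:
                self.data_buffer = ""  # the fix: drop the stale contents
            return
        self.data_buffer = "".join(
            ",".join(f'"{value}"' for value in row) + "\n" for row in rows
        )

    def write_batch(self, rows):
        self._translate_batch(rows)
        # Step 3: the write loop flushes the buffer after every batch.
        self._flush()

    def getvalue(self):
        return "".join(self.sink)


def render(batches, clear_on_empty):
    writer = ToyCsvWriter(["col1"], clear_on_empty)
    for batch in batches:
        writer.write_batch(batch)
    return writer.getvalue()


batches = [[], [["a"], ["b"], ["c"]]]  # empty batch first, then three rows
buggy = render(batches, clear_on_empty=False)
fixed = render(batches, clear_on_empty=True)
print(repr(buggy))  # '"col1"\n"col1"\n"a"\n"b"\n"c"\n'
print(repr(fixed))  # '"col1"\n"a"\n"b"\n"c"\n'
```

With `clear_on_empty=False` the model reproduces the duplicated header from the example in the rationale. With `clear_on_empty=True` the empty batch contributes nothing to the output, which is the behaviour the fix aims for.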
