[PR] GH-36889: [C++][Python] Fix duplicate CSV header when first batch is empty [arrow]

via GitHub Sat, 03 Jan 2026 15:13:09 -0800


rynewang opened a new pull request, #48718:
URL: https://github.com/apache/arrow/pull/48718


   ### Rationale for this change
   
   Fixes https://github.com/apache/arrow/issues/36889
   
   When writing CSV from a table where the first batch is empty, the header gets written twice:
   
   ```python
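   import io
   import pyarrow as pa
   from pyarrow.csv import write_csv
   buf = io.BytesIO()  # in-memory sink for the CSV output
   table = pa.table({"col1": ["a", "b", "c"]})
   combined = pa.concat_tables([table.schema.empty_table(), table])
   write_csv(combined, buf)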
   # Result: "col1"\n"col1"\n"a"\n"b"\n"c"\n  <-- header appears twice
   ```
   
   ### What changes are included in this PR?
   
   The bug happens because:
   1. Header is written to `data_buffer_` and flushed during `CSVWriterImpl` initialization
   2. `TranslateMinimalBatch` returns early for empty batches without modifying `data_buffer_`
   3. The `WriteTable`/`WriteRecordBatch` loop then writes `data_buffer_` which still contains the stale header
   
   The fix clears the buffer (resize to 0) when encountering an empty batch in `TranslateMinimalBatch`, so the subsequent write outputs nothing.
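
The stale-buffer mechanism can be illustrated with a small toy model in plain Python. This is only a sketch, not the Arrow C++ code: `ToyCsvWriter`, `render`, and the `clear_on_empty` flag are hypothetical names invented for this illustration, and the model assumes a single string column.

```python
class ToyCsvWriter:
    """Toy model of a CSV writer that reuses one scratch buffer.

    Hypothetical illustration only; names and structure are not Arrow's.
    """

    def __init__(self, column, clear_on_empty):
        self.sink = []  # stands in for the output stream
        self.clear_on_empty = clear_on_empty
        # The header is rendered into the shared buffer and flushed once.
        self.buffer = f'"{column}"\n'
        self.sink.append(self.buffer)

    def _translate(self, rows):
        if not rows:
            if self.clear_on_empty:
                self.buffer = ""  # the fix: drop stale contents
            return  # early return for an empty batch
        self.buffer = "".join(f'"{value}"\n' for value in rows)

    def write_batch(self, rows):
        self._translate(rows)
        self.sink.append(self.buffer)  # the buffer is flushed unconditionally

    def output(self):
        return "".join(self.sink)


def render(batches, clear_on_empty):
    """Write all batches through a fresh toy writer and return the text."""
    writer = ToyCsvWriter("col1", clear_on_empty)
    for rows in batches:
        writer.write_batch(rows)
    return writer.output()


batches = [[], ["a", "b", "c"]]  # empty first batch, then the data
buggy = render(batches, clear_on_empty=False)
fixed = render(batches, clear_on_empty=True)
print(repr(buggy))
print(repr(fixed))
```

With `clear_on_empty=False` the empty batch leaves the header in the buffer, so the toy writer flushes it a second time; with `clear_on_empty=True` the empty batch contributes nothing to the output.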
   
   ### Are these changes tested?
   
   Yes. Added C++ tests in `writer_test.cc` and Python tests in `test_csv.py`:
   - Empty batch at start of table
   - Empty batch in middle of table
   
   ### Are there any user-facing changes?
   
   No API changes. This is a bug fix that prevents duplicate headers when writing CSV from tables with empty batches.


