[jira] [Comment Edited] (CASSANDRA-21381) CSV COPY TO corrupts control characters (newline, null byte, etc.) in text values

Brad Schoening (Jira) Wed, 27 May 2026 20:32:36 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18083997#comment-18083997
 ]


Brad Schoening edited comment on CASSANDRA-21381 at 5/28/26 3:31 AM:
---------------------------------------------------------------------

[~jensg] [~arvindk12] 
CSV format as standardized and defined in [RFC 
4180|https://datatracker.ietf.org/doc/html/rfc4180] doesn't support control 
characters. It excludes them by omission in the BNF, defining TEXTDATA as only 
*printable* characters:
{code:sh}
CR = %x0D
LF = %x0A
TEXTDATA =  %x20-21 / %x23-2B / %x2D-7E
{code}

If not strictly RFC 4180, what definition of TEXTDATA should CQLSH COPY support 
for CSV? The CSV exports from CQLSH must stay close to the standard so they can 
be consumed as widely as possible — Java, C, Python, Excel, and so on. Allowing 
arbitrary text (control characters, embedded nulls, etc.) would make our output 
less portable and push the burden of handling it onto every downstream tool.

If you need to export and then re-import binary data (round-trip), you probably 
shouldn't be using CSV format. It's not designed to support binary data.

Control characters don't round-trip safely through the tools that actually 
consume CSV. CSV's whole reason for existing is interchange — it's the lowest 
common denominator between spreadsheets, databases, and scripts. A form feed, 
vertical tab, or escape character in a field will be silently mangled, 
reinterpreted, or rendered invisibly differently by Excel vs. a text editor vs. 
a terminal vs. a parser. There's no agreed-upon way to represent them, so the 
safe choice for an interchange format is to not allow them at all. RFC 4180 
bakes that reasoning in at the grammar level.


was (Author: bschoeni):
[~jensg] [~arvindk12] 
CSV format as standardized and defined in [RFC 
4180|https://datatracker.ietf.org/doc/html/rfc4180] doesn't support control 
characters. It excludes them by omission in the BNF, defining TEXTDATA as only 
*printable* characters:
{code:sh}
CR = %x0D
LF = %x0A
TEXTDATA =  %x20-21 / %x23-2B / %x2D-7E
{code}

If not strictly RFC 4180, what definition of TEXTDATA should CQLSH COPY support 
for CSV? The CSV exports from CQLSH should stay close to the standard so they 
can be consumed as widely as possible — Java, C, Python, Excel, and so on. 
Allowing arbitrary text (control characters, embedded nulls, etc.) would make 
our output less portable and push the burden of handling it onto every 
downstream tool.

If you need to export and then re-import binary data (round-trip), you probably 
shouldn't be using CSV format. It's not designed to support binary data.

Control characters don't round-trip safely through the tools that actually 
consume CSV. CSV's whole reason for existing is interchange — it's the lowest 
common denominator between spreadsheets, databases, and scripts. A form feed, 
vertical tab, or escape character in a field will be silently mangled, 
reinterpreted, or rendered invisibly differently by Excel vs. a text editor vs. 
a terminal vs. a parser. There's no agreed-upon way to represent them, so the 
safe choice for an interchange format is to not allow them at all. RFC 4180 
bakes that reasoning in at the grammar level.

> CSV COPY TO corrupts control characters (newline, null byte, etc.) in text 
> values
> ---------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-21381
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21381
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: CQL/Interpreter
>            Reporter: Jens Geyer
>            Assignee: Arvind Kandpal
>            Priority: Normal
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> h2. Problem
> During COPY TO, control characters in text column values are replaced with 
> their Python repr() notation by 
> UNICODE_CONTROLCHARS_RE.sub(_show_control_chars, ...) in 
> {{format_value_text}} ({{pylib/cqlshlib/formatting.py}}).
> Examples:
> * A stored newline (0x0A) becomes the two-character sequence {{\n}} in the 
> CSV; after COPY FROM it is stored as {{\n}} (backslash + n) -- a different 
> value.
> * A null byte (0x00) becomes {{\x00}} (six characters).
> The regex {{UNICODE_CONTROLCHARS_RE = re.compile(r"[\x00-\x1f\x7f-\xa0]")}} 
> matches all ASCII control characters (0x00-0x1F: newline, tab, carriage 
> return, BEL, etc.) and Latin-1 supplement control characters (0x7F-0xA0).
> This substitution is correct for terminal display of SELECT results (where 
> invisible characters need a human-readable representation). It is incorrect 
> in the *CSV export path*, where {{csv.writer}} handles control characters 
> correctly via field quoting -- no pre-processing is needed.
> h2. Affected Versions
> All Cassandra versions with {{format_value_text}} containing the 
> {{UNICODE_CONTROLCHARS_RE}} substitution (at minimum 3.x through trunk).
> h2. Root Cause
> {{format_value_text}} is shared between the terminal display path (SELECT 
> output) and the CSV export path (COPY TO). The {{UNICODE_CONTROLCHARS_RE}} 
> substitution converts control characters to their Python repr-string for 
> display, but this transformation is *not reversible* via the CSV import path.
> This bug is *independent of, but in the same function as*, the 
> backslash-doubling bug fixed in CASSANDRA-21131. Applying the CASSANDRA-21131 
> patch does NOT fix this issue.
> h2. Expected Fix
> In the CSV export path, skip the {{UNICODE_CONTROLCHARS_RE.sub(...)}} call. 
> An {{escape_control_chars}} parameter (analogous to the {{escape_backslash}} 
> parameter introduced by CASSANDRA-21131) can conditionally suppress the 
> substitution when calling {{format_value_text}} from the CSV export path.
> h2. Related
> CASSANDRA-21131 -- backslash-doubling bug in the same code path, already 
> patched.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Comment Edited] (CASSANDRA-21381) CSV COPY TO corrupts control characters (newline, null byte, etc.) in text values

Reply via email to