[
https://issues.apache.org/jira/browse/CASSANDRA-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18083997#comment-18083997
]
Brad Schoening edited comment on CASSANDRA-21381 at 5/28/26 3:31 AM:
---------------------------------------------------------------------
[~jensg] [~arvindk12]
CSV format as standardized and defined in [RFC
4180|https://datatracker.ietf.org/doc/html/rfc4180] doesn't support control
characters. It excludes them by omission in the BNF, defining TEXTDATA as only
*printable* characters:
{code:sh}
CR = %x0D
LF = %x0A
TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
{code}
If not strictly RFC 4180, what definition of TEXTDATA should CQLSH COPY support
for CSV? The CSV exports from CQLSH must stay close to the standard so they can
be consumed as widely as possible — Java, C, Python, Excel, and so on. Allowing
arbitrary text (control characters, embedded nulls, etc.) would make our output
less portable and push the burden of handling it onto every downstream tool.
If you need to export and then re-import binary data (round-trip), you probably
shouldn't be using CSV format. It's not designed to support binary data.
Control characters don't round-trip safely through the tools that actually
consume CSV. CSV's whole reason for existing is interchange — it's the lowest
common denominator between spreadsheets, databases, and scripts. A form feed,
vertical tab, or escape character in a field will be silently mangled,
reinterpreted, or rendered invisibly differently by Excel vs. a text editor vs.
a terminal vs. a parser. There's no agreed-upon way to represent them, so the
safe choice for an interchange format is to not allow them at all. RFC 4180
bakes that reasoning in at the grammar level.
was (Author: bschoeni):
[~jensg] [~arvindk12]
CSV format as standardized and defined in [RFC
4180|https://datatracker.ietf.org/doc/html/rfc4180] doesn't support control
characters. It excludes them by omission in the BNF, defining TEXTDATA as only
*printable* characters:
{code:sh}
CR = %x0D
LF = %x0A
TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
{code}
If not strictly RFC 4180, what definition of TEXTDATA should CQLSH COPY support
for CSV? The CSV exports from CQLSH should stay close to the standard so they
can be consumed as widely as possible — Java, C, Python, Excel, and so on.
Allowing arbitrary text (control characters, embedded nulls, etc.) would make
our output less portable and push the burden of handling it onto every
downstream tool.
If you need to export and then re-import binary data (round-trip), you probably
shouldn't be using CSV format. It's not designed to support binary data.
Control characters don't round-trip safely through the tools that actually
consume CSV. CSV's whole reason for existing is interchange — it's the lowest
common denominator between spreadsheets, databases, and scripts. A form feed,
vertical tab, or escape character in a field will be silently mangled,
reinterpreted, or rendered invisibly differently by Excel vs. a text editor vs.
a terminal vs. a parser. There's no agreed-upon way to represent them, so the
safe choice for an interchange format is to not allow them at all. RFC 4180
bakes that reasoning in at the grammar level.
> CSV COPY TO corrupts control characters (newline, null byte, etc.) in text
> values
> ---------------------------------------------------------------------------------
>
> Key: CASSANDRA-21381
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21381
> Project: Apache Cassandra
> Issue Type: Bug
> Components: CQL/Interpreter
> Reporter: Jens Geyer
> Assignee: Arvind Kandpal
> Priority: Normal
> Time Spent: 1h 40m
> Remaining Estimate: 0h
>
> h2. Problem
> During COPY TO, control characters in text column values are replaced with
> their Python repr() notation by
> UNICODE_CONTROLCHARS_RE.sub(_show_control_chars, ...) in
> {{format_value_text}} ({{pylib/cqlshlib/formatting.py}}).
> Examples:
> * A stored newline (0x0A) becomes the two-character sequence {{\n}} in the
> CSV; after COPY FROM it is stored as {{\n}} (backslash + n) -- a different
> value.
> * A null byte (0x00) becomes {{\x00}} (six characters).
> The regex {{UNICODE_CONTROLCHARS_RE = re.compile(r"[\x00-\x1f\x7f-\xa0]")}}
> matches all ASCII control characters (0x00-0x1F: newline, tab, carriage
> return, BEL, etc.) and Latin-1 supplement control characters (0x7F-0xA0).
> This substitution is correct for terminal display of SELECT results (where
> invisible characters need a human-readable representation). It is incorrect
> in the *CSV export path*, where {{csv.writer}} handles control characters
> correctly via field quoting -- no pre-processing is needed.
> h2. Affected Versions
> All Cassandra versions with {{format_value_text}} containing the
> {{UNICODE_CONTROLCHARS_RE}} substitution (at minimum 3.x through trunk).
> h2. Root Cause
> {{format_value_text}} is shared between the terminal display path (SELECT
> output) and the CSV export path (COPY TO). The {{UNICODE_CONTROLCHARS_RE}}
> substitution converts control characters to their Python repr-string for
> display, but this transformation is *not reversible* via the CSV import path.
> This bug is *independent of, but in the same function as*, the
> backslash-doubling bug fixed in CASSANDRA-21131. Applying the CASSANDRA-21131
> patch does NOT fix this issue.
> h2. Expected Fix
> In the CSV export path, skip the {{UNICODE_CONTROLCHARS_RE.sub(...)}} call.
> An {{escape_control_chars}} parameter (analogous to the {{escape_backslash}}
> parameter introduced by CASSANDRA-21131) can conditionally suppress the
> substitution when calling {{format_value_text}} from the CSV export path.
> h2. Related
> CASSANDRA-21131 -- backslash-doubling bug in the same code path, already
> patched.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]