Ruiqi Dong created CSV-329:
------------------------------
Summary: CSVParser with trackBytes=true throws on multi-character
delimiters containing supplementary Unicode characters
Key: CSV-329
URL: https://issues.apache.org/jira/browse/CSV-329
Project: Commons CSV
Issue Type: Bug
Reporter: Ruiqi Dong
*Summary*
With byte tracking enabled, parsing fails when a multi-character delimiter
contains a supplementary Unicode character such as an emoji.
The same input parses correctly when byte tracking is disabled. The failure
is caused by `ExtendedBufferedReader.read(char[], ...)` updating `lastChar`
before computing the encoded byte length of the read buffer. When a surrogate
pair is read into the delimiter lookahead buffer, `lastChar` already holds the
buffer's last character (the low surrogate itself) by the time the low
surrogate is checked against it, so the pairing check fails and the
byte-length calculation throws `CharacterCodingException`.
*Affected code*
File: `src/main/java/org/apache/commons/csv/Lexer.java`
{code:java}
boolean isDelimiter(final int ch) throws IOException {
...
reader.peek(delimiterBuf);
...
final int count = reader.read(delimiterBuf, 0, delimiterBuf.length);
isLastTokenDelimiter = count != EOF;
return isLastTokenDelimiter;
} {code}
File: `src/main/java/org/apache/commons/csv/ExtendedBufferedReader.java`
{code:java}
public int read(final char[] buf, final int offset, final int length) throws IOException {
...
if (len > 0) {
...
lastChar = buf[offset + len - 1];
} else if (len == EOF) {
lastChar = EOF;
}
if (encoder != null) {
this.bytesRead += getEncodedCharLength(buf, offset, len);
}
position += len;
return len;
} {code}
`getEncodedCharLength(...)` relies on the previous `lastChar` to pair a low
surrogate:
{code:java}
if (Character.isSurrogatePair(lChar, cChar)) {
return encoder.encode(CharBuffer.wrap(new char[] { lChar, cChar })).limit();
}
throw new CharacterCodingException(); {code}
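To make the failing check concrete, here is a minimal standalone sketch of the pairing test in isolation. It does not use Commons CSV; the class and method names (`SurrogatePairCheckDemo`, `pairsWith`, `utf8Length`) are hypothetical and exist only for illustration. It assumes that, at the time of the check, `lastChar` already holds the low surrogate, as described in the summary.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it does not use or reproduce Commons CSV code.
final class SurrogatePairCheckDemo {

    // U+1F600 (the emoji in the reproducer) as a UTF-16 surrogate pair.
    static final char HIGH = '\uD83D';
    static final char LOW = '\uDE00';

    /** Mirrors the pairing test: can current be paired with the remembered lastChar? */
    static boolean pairsWith(final int lastChar, final char current) {
        return Character.isSurrogatePair((char) lastChar, current);
    }

    /** UTF-8 byte length of the given chars; throws if they are not well-formed UTF-16. */
    static int utf8Length(final char... chars) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(chars)).limit();
    }

    public static void main(final String[] args) throws CharacterCodingException {
        // lastChar still holds the high surrogate: the pair is recognized.
        System.out.println(pairsWith(HIGH, LOW)); // true
        // lastChar was already overwritten with the low surrogate: no pair.
        System.out.println(pairsWith(LOW, LOW)); // false
        // A correctly paired U+1F600 occupies 4 bytes in UTF-8.
        System.out.println(utf8Length(HIGH, LOW)); // 4
    }
}
```

The second call corresponds to the situation reported here: the low surrogate is compared with itself, the pair test fails, and the code falls through to `throw new CharacterCodingException()`.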
*Reproducer*
Add this test to `src/test/java/org/apache/commons/csv/CSVParserTest.java`:
{code:java}
@Test
void testTrackBytesWithSupplementaryCharacterInMultiCharacterDelimiter() throws IOException {
final String delimiter = "x😀";
final String code = "ax😀b\n";
final CSVFormat format = CSVFormat.DEFAULT.builder().setDelimiter(delimiter).get();
try (CSVParser parser = CSVParser.builder()
.setReader(new StringReader(code))
.setFormat(format)
.setCharset(UTF_8)
.setTrackBytes(true)
.get()) {
final CSVRecord record = parser.nextRecord();
assertNotNull(record);
assertEquals("a", record.get(0));
assertEquals("b", record.get(1));
}
} {code}
Run:
{code:java}
mvn -q -Dtest=org.apache.commons.csv.CSVParserTest#testTrackBytesWithSupplementaryCharacterInMultiCharacterDelimiter test {code}
*Observed behavior*
The test errors:
{code:java}
java.nio.charset.CharacterCodingException
    at org.apache.commons.csv.ExtendedBufferedReader.getEncodedCharLength(ExtendedBufferedReader.java:156)
    at org.apache.commons.csv.ExtendedBufferedReader.read(ExtendedBufferedReader.java:237)
    at org.apache.commons.io.input.UnsynchronizedBufferedReader.peek(UnsynchronizedBufferedReader.java:236)
    at org.apache.commons.csv.Lexer.isDelimiter(Lexer.java:156){code}
*Expected behavior*
Byte tracking should not change whether a valid CSV input can be parsed. The
record should parse as two fields:
{code:java}
a
b {code}
This is a metadata-tracking side effect that changes parser correctness.
Enabling byte tracking should add byte-position metadata, not make valid input
fail when delimiter lookahead reads a surrogate pair.
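One possible direction, offered only as a sketch and not as the project's actual fix: compute the byte length of the buffer using the value `lastChar` had before the read, and pair surrogates inside the buffer with a local "previous" variable. The class `BufferByteLengthSketch` and the method `encodedLength` below are hypothetical names, and the code is standalone (it does not touch `ExtendedBufferedReader`).

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; names and structure are assumptions, not the Commons CSV fix.
final class BufferByteLengthSketch {

    /**
     * Encoded byte length of buf[offset, offset + len).
     * previousChar is the char read before this buffer, or -1 if there was none;
     * it is only consulted when the buffer starts with a low surrogate.
     */
    static int encodedLength(final CharsetEncoder encoder, final int previousChar,
            final char[] buf, final int offset, final int len) throws CharacterCodingException {
        int total = 0;
        int prev = previousChar; // local copy: never affected by later state updates
        for (int i = offset; i < offset + len; i++) {
            final char current = buf[i];
            if (!Character.isSurrogate(current)) {
                total += encoder.encode(CharBuffer.wrap(buf, i, 1)).limit();
            } else if (Character.isLowSurrogate(current)) {
                if (prev < 0 || !Character.isSurrogatePair((char) prev, current)) {
                    throw new CharacterCodingException();
                }
                total += encoder.encode(CharBuffer.wrap(new char[] { (char) prev, current })).limit();
            }
            // A high surrogate adds nothing yet; it is counted with its low surrogate.
            prev = current;
        }
        return total;
    }

    public static void main(final String[] args) throws CharacterCodingException {
        final CharsetEncoder utf8 = StandardCharsets.UTF_8.newEncoder();
        // Lookahead buffer holding U+1F600; 'x' stands for the char consumed before it.
        final char[] lookahead = "\uD83D\uDE00".toCharArray();
        System.out.println(encodedLength(utf8, 'x', lookahead, 0, lookahead.length)); // 4
    }
}
```

In a reader, the caller would pass the old `lastChar` into such a method and update `lastChar` only afterwards. Whether that ordering also suits the `peek` path shown in the stack trace is not something this sketch verifies.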