Ruiqi Dong created CSV-329:
------------------------------

             Summary: CSVParser with trackBytes=true throws on multi-character 
delimiters containing supplementary Unicode characters
                 Key: CSV-329
                 URL: https://issues.apache.org/jira/browse/CSV-329
             Project: Commons CSV
          Issue Type: Bug
            Reporter: Ruiqi Dong


*Summary*
With byte tracking enabled, parsing fails when a multi-character delimiter 
contains a supplementary Unicode character such as an emoji.

The parser handles the same delimiter when byte tracking is disabled. The 
failure is caused by `ExtendedBufferedReader.read(char[], ...)` updating 
`lastChar` before it computes the encoded byte length of the characters just 
read. When the delimiter lookahead buffer ends with a surrogate pair, 
`lastChar` already holds the low surrogate by the time that low surrogate is 
checked, so the pair test compares the low surrogate with itself, fails, and 
the byte-length calculation throws `CharacterCodingException`.
 
 
*Affected code*
File: `src/main/java/org/apache/commons/csv/Lexer.java`
{code:java}
boolean isDelimiter(final int ch) throws IOException {
    ...
    reader.peek(delimiterBuf);
    ...
    final int count = reader.read(delimiterBuf, 0, delimiterBuf.length);
    isLastTokenDelimiter = count != EOF;
    return isLastTokenDelimiter;
} {code}
File: `src/main/java/org/apache/commons/csv/ExtendedBufferedReader.java`
{code:java}
public int read(final char[] buf, final int offset, final int length) throws IOException {
    ...
    if (len > 0) {
        ...
        lastChar = buf[offset + len - 1];
    } else if (len == EOF) {
        lastChar = EOF;
    }
    if (encoder != null) {
        this.bytesRead += getEncodedCharLength(buf, offset, len);
    }
    position += len;
    return len;
} {code}
`getEncodedCharLength(...)` relies on the previous `lastChar` to pair a low 
surrogate:
{code:java}
if (Character.isSurrogatePair(lChar, cChar)) {
    return encoder.encode(CharBuffer.wrap(new char[] { lChar, cChar })).limit();
}
throw new CharacterCodingException(); {code}
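The effect of the ordering can be shown in isolation. The following is a 
minimal standalone sketch, not the Commons CSV source: the class 
`SurrogatePairingSketch` and its helpers `pairedLength` and `describe` are 
hypothetical names that only model the pairing check quoted above, assuming a 
UTF-8 encoder.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Standalone model of the pairing check; not the Commons CSV source.
final class SurrogatePairingSketch {

    // Hypothetical helper: encoded length of a low surrogate cChar, given
    // the char lChar that the reader remembers as its previous character.
    static int pairedLength(final CharsetEncoder encoder, final char lChar, final char cChar)
            throws CharacterCodingException {
        if (Character.isSurrogatePair(lChar, cChar)) {
            return encoder.encode(CharBuffer.wrap(new char[] { lChar, cChar })).limit();
        }
        throw new CharacterCodingException();
    }

    // Hypothetical helper: reports the outcome as text so both cases can be printed.
    static String describe(final char lChar, final char cChar) {
        final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        try {
            return pairedLength(encoder, lChar, cChar) + " bytes";
        } catch (final CharacterCodingException e) {
            return "CharacterCodingException";
        }
    }

    public static void main(final String[] args) {
        final String emoji = "\uD83D\uDE00"; // U+1F600, the emoji from the reproducer
        final char high = emoji.charAt(0);
        final char low = emoji.charAt(1);
        // lastChar still holds the high surrogate: the pair is recognized.
        System.out.println(describe(high, low)); // 4 bytes
        // lastChar was already overwritten with the low surrogate: no pair.
        System.out.println(describe(low, low)); // CharacterCodingException
    }
}
```

The second call models the state described in this report: once `lastChar` 
has been set to the last character of the buffer, the low surrogate is paired 
with itself and the check cannot succeed.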
*Reproducer*
Add this test to `src/test/java/org/apache/commons/csv/CSVParserTest.java`:
{code:java}
@Test
void testTrackBytesWithSupplementaryCharacterInMultiCharacterDelimiter() throws IOException {
    final String delimiter = "x😀";
    final String code = "ax😀b\n";
    final CSVFormat format = CSVFormat.DEFAULT.builder().setDelimiter(delimiter).get();
    try (CSVParser parser = CSVParser.builder()
            .setReader(new StringReader(code))
            .setFormat(format)
            .setCharset(UTF_8)
            .setTrackBytes(true)
            .get()) {
        final CSVRecord record = parser.nextRecord();
        assertNotNull(record);
        assertEquals("a", record.get(0));
        assertEquals("b", record.get(1));
    }
} {code}
Run:
{code:bash}
mvn -q -Dtest=org.apache.commons.csv.CSVParserTest#testTrackBytesWithSupplementaryCharacterInMultiCharacterDelimiter test {code}
*Observed behavior*
The test fails with an error:
{code:java}
java.nio.charset.CharacterCodingException
    at org.apache.commons.csv.ExtendedBufferedReader.getEncodedCharLength(ExtendedBufferedReader.java:156)
    at org.apache.commons.csv.ExtendedBufferedReader.read(ExtendedBufferedReader.java:237)
    at org.apache.commons.io.input.UnsynchronizedBufferedReader.peek(UnsynchronizedBufferedReader.java:236)
    at org.apache.commons.csv.Lexer.isDelimiter(Lexer.java:156) {code}
*Expected behavior* 
Byte tracking should not change whether a valid CSV input can be parsed. The 
record should parse as two fields:
{code:java}
a
b {code}
This is a metadata-tracking side effect that changes parser correctness. 
Enabling byte tracking should add byte-position metadata, not make valid input 
fail when delimiter lookahead reads a surrogate pair.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
