[
https://issues.apache.org/jira/browse/CSV-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gary D. Gregory resolved CSV-329.
---------------------------------
Fix Version/s: 1.15.0
Resolution: Fixed
> CSVParser with trackBytes=true throws on multi-character delimiters
> containing supplementary Unicode characters
> ---------------------------------------------------------------------------------------------------------------
>
> Key: CSV-329
> URL: https://issues.apache.org/jira/browse/CSV-329
> Project: Commons CSV
> Issue Type: Bug
> Reporter: Ruiqi Dong
> Priority: Minor
> Fix For: 1.15.0
>
>
> *Summary*
> With byte tracking enabled, parsing fails when a multi-character delimiter
> contains a supplementary Unicode character such as an emoji.
> The parser can handle the delimiter when byte tracking is disabled. The
> failure is caused by `ExtendedBufferedReader.read(char[], ...)` updating
> `lastChar` before computing the encoded byte length of the read buffer. For a
> surrogate pair read into the delimiter lookahead buffer, the low surrogate is
> checked against an already-updated `lastChar`, so byte-length calculation
> throws `CharacterCodingException`.
>
>
> *Affected code*
> File: `src/main/java/org/apache/commons/csv/Lexer.java`
> {code:java}
> boolean isDelimiter(final int ch) throws IOException {
>     ...
>     reader.peek(delimiterBuf);
>     ...
>     final int count = reader.read(delimiterBuf, 0, delimiterBuf.length);
>     isLastTokenDelimiter = count != EOF;
>     return isLastTokenDelimiter;
> } {code}
> File: `src/main/java/org/apache/commons/csv/ExtendedBufferedReader.java`
> {code:java}
> public int read(final char[] buf, final int offset, final int length) throws IOException {
>     ...
>     if (len > 0) {
>         ...
>         lastChar = buf[offset + len - 1];
>     } else if (len == EOF) {
>         lastChar = EOF;
>     }
>     if (encoder != null) {
>         this.bytesRead += getEncodedCharLength(buf, offset, len);
>     }
>     position += len;
>     return len;
> } {code}
> `getEncodedCharLength(...)` relies on the previous `lastChar` to pair a low
> surrogate:
> {code:java}
> if (Character.isSurrogatePair(lChar, cChar)) {
>     return encoder.encode(CharBuffer.wrap(new char[] { lChar, cChar })).limit();
> }
> throw new CharacterCodingException(); {code}
> *Reproducer*
> Add this test to `src/test/java/org/apache/commons/csv/CSVParserTest.java`:
> {code:java}
> @Test
> void testTrackBytesWithSupplementaryCharacterInMultiCharacterDelimiter() throws IOException {
>     final String delimiter = "x😀";
>     final String code = "ax😀b\n";
>     final CSVFormat format = CSVFormat.DEFAULT.builder().setDelimiter(delimiter).get();
>     try (CSVParser parser = CSVParser.builder()
>             .setReader(new StringReader(code))
>             .setFormat(format)
>             .setCharset(UTF_8)
>             .setTrackBytes(true)
>             .get()) {
>         final CSVRecord record = parser.nextRecord();
>         assertNotNull(record);
>         assertEquals("a", record.get(0));
>         assertEquals("b", record.get(1));
>     }
> } {code}
> Run:
> {code}
> mvn -q -Dtest=org.apache.commons.csv.CSVParserTest#testTrackBytesWithSupplementaryCharacterInMultiCharacterDelimiter test {code}
> *Observed behavior*
> The test errors:
> {code:java}
> java.nio.charset.CharacterCodingException
>     at org.apache.commons.csv.ExtendedBufferedReader.getEncodedCharLength(ExtendedBufferedReader.java:156)
>     at org.apache.commons.csv.ExtendedBufferedReader.read(ExtendedBufferedReader.java:237)
>     at org.apache.commons.io.input.UnsynchronizedBufferedReader.peek(UnsynchronizedBufferedReader.java:236)
>     at org.apache.commons.csv.Lexer.isDelimiter(Lexer.java:156){code}
> *Expected behavior*
> Byte tracking should not change whether a valid CSV input can be parsed. The
> record should parse as two fields:
> {code:java}
> a
> b {code}
> This is a metadata-tracking side effect that changes parser correctness.
> Enabling byte tracking should add byte-position metadata, not make valid
> input fail when delimiter lookahead reads a surrogate pair.
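The ordering problem described under *Affected code* can be modeled outside Commons CSV with only the JDK. The sketch below is a simplified, hypothetical model and not the library's source: the class `SurrogateOrderSketch` and its methods are invented for illustration, and the per-character rule is assumed to follow the surrogate-pair check quoted in the report. It compares counting bytes after the "last char" has already been set to the final char of the buffer with counting while carrying the previous char forward.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical model of the ordering issue; not the Commons CSV implementation.
class SurrogateOrderSketch {

    private static final CharsetEncoder ENCODER = StandardCharsets.UTF_8.newEncoder();

    // Bytes contributed by cur, given the char that was read just before it.
    static int encodedLength(final int prev, final char cur) throws CharacterCodingException {
        if (!Character.isSurrogate(cur)) {
            return ENCODER.encode(CharBuffer.wrap(new char[] { cur })).limit();
        }
        if (Character.isHighSurrogate(cur)) {
            return 0; // counted together with the low surrogate that follows
        }
        if (Character.isSurrogatePair((char) prev, cur)) {
            return ENCODER.encode(CharBuffer.wrap(new char[] { (char) prev, cur })).limit();
        }
        throw new CharacterCodingException();
    }

    // Assumed model of the reported ordering: lastChar is set to the final
    // char of the buffer before any byte lengths are computed.
    static int lengthWithEarlyUpdate(final char[] buf) throws CharacterCodingException {
        final int lastChar = buf[buf.length - 1]; // updated too early
        int bytes = 0;
        for (final char c : buf) {
            bytes += encodedLength(lastChar, c); // low surrogate is paired with itself
        }
        return bytes;
    }

    // One possible alternative ordering: carry the previous char forward while counting.
    static int lengthTrackingPrev(final char[] buf, final int lastCharBeforeRead)
            throws CharacterCodingException {
        int prev = lastCharBeforeRead;
        int bytes = 0;
        for (final char c : buf) {
            bytes += encodedLength(prev, c);
            prev = c;
        }
        return bytes;
    }

    public static void main(final String[] args) throws CharacterCodingException {
        final char[] delimiter = "x\uD83D\uDE00".toCharArray(); // 'x' followed by U+1F600
        System.out.println(lengthTrackingPrev(delimiter, -1)); // 1 byte + 4 bytes = 5
        try {
            lengthWithEarlyUpdate(delimiter);
        } catch (final CharacterCodingException e) {
            System.out.println("early update failed: " + e);
        }
    }
}
```

In this model the early update makes the low surrogate compare against itself, so the pair check fails even though the input is valid UTF-16. Whether the 1.15.0 fix reorders the update or takes another approach is not stated in this message.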
--
This message was sent by Atlassian Jira
(v8.20.10#820010)