Ruiqi Dong created CSV-323:
------------------------------

             Summary: ExtendedBufferedReader byte tracking leads to an 
incorrect CSVRecord.getBytePosition()
                 Key: CSV-323
                 URL: https://issues.apache.org/jira/browse/CSV-323
             Project: Commons CSV
          Issue Type: Bug
          Components: Build
    Affects Versions: 1.14.1
            Reporter: Ruiqi Dong


*Summary*
ExtendedBufferedReader maintains internal byte-tracking state, and CSVParser 
later uses that state when it creates CSVRecord instances.

In the scenario reproduced below, the byte position of the second record is 
reported incorrectly. With the input "aa[|]bb\ncc[|]dd\n" and the delimiter 
"[|]", the second record starts at byte offset 8, but the parser reports byte 
offset 6.
 
*Affected Code*
Files:
 * src/main/java/org/apache/commons/csv/ExtendedBufferedReader.java
 * src/main/java/org/apache/commons/csv/Lexer.java
 * src/main/java/org/apache/commons/csv/CSVParser.java
{code:java}
@Override
public int read(final char[] buf, final int offset, final int length) throws IOException {
    if (length == 0) {
        return 0;
    }
    final int len = super.read(buf, offset, length);
    if (len > 0) {
        ...
        lastChar = buf[offset + len - 1];
    } else if (len == EOF) {
        lastChar = EOF;
    }
    position += len;
    return len;
} {code}
{code:java}
boolean isDelimiter(final int ch) throws IOException {
    ...
    final int count = reader.read(delimiterBuf, 0, delimiterBuf.length);
    isLastTokenDelimiter = count != EOF;
    return isLastTokenDelimiter;
} {code}
{code:java}
final long startBytePosition = lexer.getBytesRead() + characterOffset;
...
result = new CSVRecord(this, recordList.toArray(Constants.EMPTY_STRING_ARRAY),
        Objects.toString(sb, null), recordNumber, startCharPosition, startBytePosition); {code}
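As a rough illustration of how the excerpts above could combine to produce a result that is two bytes short, here is a simplified stand-alone model. It is not the Commons CSV code: the class ByteTrackingModel and all of its methods are hypothetical, and it assumes that the elided part of the bulk read(char[], int, int) does not update the byte counter, which the excerpt does not show. Under that assumption, the two trailing delimiter characters "|]" that are consumed in bulk go uncounted, which would match the reported offset 6 against the expected 8.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical, simplified model of a byte-tracking reader; not the Commons CSV implementation. */
final class ByteTrackingModel {

    private final Reader in;
    private long bytesRead;

    ByteTrackingModel(final String text) {
        this.in = new StringReader(text);
    }

    /** Single-character read: adds the encoded length of the character to the byte counter. */
    int read() throws IOException {
        final int ch = in.read();
        if (ch != -1) {
            // Adequate for the ASCII input used here; surrogate pairs are not handled.
            bytesRead += String.valueOf((char) ch).getBytes(StandardCharsets.UTF_8).length;
        }
        return ch;
    }

    /** Bulk read: ASSUMED (for this model only) to leave the byte counter untouched. */
    int read(final char[] buf, final int offset, final int length) throws IOException {
        return in.read(buf, offset, length);
    }

    long getBytesRead() {
        return bytesRead;
    }

    /** Reads up to and including the first '\n' and returns the bytes the model has counted. */
    static long bytesCountedThroughFirstRecord(final String input, final String delimiter) {
        final ByteTrackingModel reader = new ByteTrackingModel(input);
        final char[] delimiterTail = new char[delimiter.length() - 1];
        try {
            int ch;
            while ((ch = reader.read()) != -1) {
                if (ch == delimiter.charAt(0)) {
                    // Simplification: no look-ahead, the first delimiter char always starts a delimiter.
                    // The rest of the delimiter is consumed in bulk and therefore not counted.
                    reader.read(delimiterTail, 0, delimiterTail.length);
                } else if (ch == '\n') {
                    break;
                }
            }
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
        return reader.getBytesRead();
    }

    public static void main(final String[] args) {
        // The first record "aa[|]bb\n" is 8 bytes, but the model counts only 6.
        System.out.println(bytesCountedThroughFirstRecord("aa[|]bb\ncc[|]dd\n", "[|]")); // prints 6
    }
}
```

With a single-character delimiter such as "," the model counts every byte, so the discrepancy in this sketch only appears for multi-character delimiters.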
*Reproducer*
Add the following test to 
src/test/java/org/apache/commons/csv/CSVParserTest.java:
{code:java}
@Test
void testBytePositionWithTrackBytesAndMultiCharacterDelimiter() throws IOException {
    final String code = "aa[|]bb\ncc[|]dd\n";
    final CSVFormat format = CSVFormat.DEFAULT.builder().setDelimiter("[|]").get();
    try (CSVParser parser = CSVParser.builder()
            .setReader(new StringReader(code))
            .setFormat(format)
            .setCharset(StandardCharsets.UTF_8)
            .setTrackBytes(true)
            .get()) {
        final Iterator<CSVRecord> it = parser.iterator();
        final CSVRecord first = it.next();
        final CSVRecord second = it.next();

        assertEquals(0, first.getBytePosition());
        assertEquals(8, second.getBytePosition());
    }
}{code}
Run:
{code:java}
mvn -q -Dtest=org.apache.commons.csv.CSVParserTest#testBytePositionWithTrackBytesAndMultiCharacterDelimiter test {code}
Expected behavior:
 # the first record should start at byte offset 0
 # the second record should start at byte offset 8

because the prefix "aa[|]bb\n" is exactly 8 ASCII bytes long.
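The expected offsets can be double-checked independently of the parser. The following is a minimal sketch, not part of Commons CSV: the class ExpectedRecordOffsets and its method are hypothetical helpers, and it assumes that records are terminated by "\n" and contain no quoted fields with embedded line breaks.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that computes expected record start offsets in UTF-8 bytes. */
final class ExpectedRecordOffsets {

    /** Assumes LF-terminated records and no embedded line breaks inside fields. */
    static List<Long> recordStartOffsets(final String input) {
        final List<Long> offsets = new ArrayList<>();
        long bytes = 0;
        int start = 0;
        while (start < input.length()) {
            offsets.add(bytes);
            final int lf = input.indexOf('\n', start);
            // Include the line terminator in the record's byte length.
            final int end = lf < 0 ? input.length() : lf + 1;
            bytes += input.substring(start, end).getBytes(StandardCharsets.UTF_8).length;
            start = end;
        }
        return offsets;
    }

    public static void main(final String[] args) {
        System.out.println(recordStartOffsets("aa[|]bb\ncc[|]dd\n")); // prints [0, 8]
    }
}
```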
In the tested scenario:
 # byte tracking is explicitly enabled,
 # the parser successfully returns both records,
 # but the second record receives the wrong byte offset.

So the record-position metadata is not reliable in this case.
 
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
