Ruiqi Dong created CSV-325:
------------------------------

             Summary: CSVParser applies characterOffset to bytePosition, which 
breaks getBytePosition() after multi-byte prefixes
                 Key: CSV-325
                 URL: https://issues.apache.org/jira/browse/CSV-325
             Project: Commons CSV
          Issue Type: Bug
          Components: Parser
            Reporter: Ruiqi Dong


*Summary*
When CSVParser.Builder#setTrackBytes(true) is enabled and parsing starts from 
the middle of a larger source, CSVParser adds characterOffset to both the 
character position and the byte position. That is only correct when every 
character in the skipped prefix encodes to a single byte. If the skipped prefix 
contains multi-byte UTF-8 characters, CSVRecord.getBytePosition() returns a 
value that is too small.
*Affected code*
File: src/main/java/org/apache/commons/csv/CSVParser.java
{code:java}
final long startCharPosition = lexer.getCharacterPosition() + characterOffset;
final long startBytePosition = lexer.getBytesRead() + characterOffset; {code}
File: src/main/java/org/apache/commons/csv/CSVRecord.java
{code:java}
/**
 * Returns the starting position of this record in the source stream, measured in bytes.
 */
public long getBytePosition() {
    return bytePosition;
} {code}
 
*Reproducer* 
Add the following test to 
src/test/java/org/apache/commons/csv/CSVParserTest.java:
{code:java}
@Test
void testGetBytePositionWithCharacterOffsetAndMultiBytePrefix() throws Exception {
    final String code = "é,x\nb,c\n";
    final long recordOffset = code.indexOf('b');
    final long expectedByteOffset = "é,x\n".getBytes(UTF_8).length;

    try (CSVParser parser = CSVParser.builder()
            .setReader(new StringReader(code.substring((int) recordOffset)))
            .setFormat(CSVFormat.DEFAULT)
            .setCharset(UTF_8)
            .setTrackBytes(true)
            .setCharacterOffset(recordOffset)
            .setRecordNumber(2)
            .get()) {
        final CSVRecord record = parser.nextRecord();
        assertNotNull(record);
        assertEquals(recordOffset, record.getCharacterPosition());
        assertEquals(expectedByteOffset, record.getBytePosition());
    }
}{code}
Run:
{code:java}
mvn -q -Dtest=org.apache.commons.csv.CSVParserTest#testGetBytePositionWithCharacterOffsetAndMultiBytePrefix test {code}
Observed behavior:
{code:java}
expected: <5> but was: <4> {code}
The first record prefix is "é,x\n":
 * character length: 4
 * UTF-8 byte length: 5

getCharacterPosition() correctly reports 4, but getBytePosition() also reports 
4, even though the record starts at byte offset 5.
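
The two lengths can be checked independently of Commons CSV. The following is a minimal sketch using only the JDK; the class name PrefixLengths and its helper methods are made up for illustration and are not part of the library. The "é" is written as the escape \u00e9 so the result does not depend on source file encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Commons CSV: compares the two ways of
// measuring the skipped prefix from the reproducer.
class PrefixLengths {

    /** Length in UTF-16 chars, which is the unit characterOffset is expressed in. */
    static int charLength(final String prefix) {
        return prefix.length();
    }

    /** Length in bytes once the prefix is encoded as UTF-8. */
    static int utf8Length(final String prefix) {
        return prefix.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(final String[] args) {
        final String prefix = "\u00e9,x\n"; // "é,x\n"
        System.out.println("chars: " + charLength(prefix)); // prints "chars: 4"
        System.out.println("bytes: " + utf8Length(prefix)); // prints "bytes: 5"
    }
}
```

The one-unit gap comes from "é" (U+00E9), which is one char but two bytes in UTF-8.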
 
Expected behavior:
If byte tracking is enabled, CSVRecord.getBytePosition() should report the true 
byte offset in the source stream. For the reproducer above, the record "b,c" 
should start at byte offset 5, not 4.
 
characterOffset and byte offset are not interchangeable once the skipped prefix 
can contain multi-byte characters. The current implementation:
 * correctly treats characterOffset as a character-space adjustment
 * incorrectly reuses the same value as a byte-space adjustment



As a result, getBytePosition() becomes unreliable for resumed parsing over 
UTF-8 or other variable-width encodings.

I think CSVParser needs either a separate byte-offset input, or it should 
avoid applying characterOffset to byte positions when no true byte offset is 
available.
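
To illustrate the first option, here is a rough sketch of keeping the two offsets separate. It is not a patch against CSVParser: the class ResumeOffsets and all of its members are hypothetical names, and it assumes the caller still has the skipped prefix (or its byte length) available when resuming.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Commons CSV code: carries a character offset and a
// byte offset as two independent values instead of reusing one for both.
final class ResumeOffsets {

    final long characterOffset; // skipped prefix length in chars
    final long byteOffset;      // skipped prefix length in encoded bytes

    private ResumeOffsets(final long characterOffset, final long byteOffset) {
        this.characterOffset = characterOffset;
        this.byteOffset = byteOffset;
    }

    /** Derives both offsets from the skipped prefix and the source charset. */
    static ResumeOffsets ofSkippedPrefix(final String prefix, final Charset charset) {
        return new ResumeOffsets(prefix.length(), prefix.getBytes(charset).length);
    }

    /** Character position of a record: relative position plus the character offset. */
    long recordCharPosition(final long relativeCharPosition) {
        return relativeCharPosition + characterOffset;
    }

    /** Byte position of a record: relative bytes read plus the byte offset. */
    long recordBytePosition(final long relativeBytesRead) {
        return relativeBytesRead + byteOffset;
    }

    public static void main(final String[] args) {
        final ResumeOffsets offsets = ofSkippedPrefix("\u00e9,x\n", StandardCharsets.UTF_8);
        // The record "b,c" is the first thing read after resuming, so both
        // relative positions are 0.
        System.out.println(offsets.recordCharPosition(0)); // prints 4
        System.out.println(offsets.recordBytePosition(0)); // prints 5
    }
}
```

With the offsets kept apart, the reproducer's expectations (character position 4, byte position 5) would both hold. How such a byte offset would be exposed on CSVParser.Builder is left open here.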
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
