[
https://issues.apache.org/jira/browse/CSV-325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gary D. Gregory updated CSV-325:
--------------------------------
Assignee: Gary D. Gregory
> CSVParser applies characterOffset to bytePosition, which breaks
> getBytePosition() for multi-byte prefixes
> ---------------------------------------------------------------------------------------------------------
>
> Key: CSV-325
> URL: https://issues.apache.org/jira/browse/CSV-325
> Project: Commons CSV
> Issue Type: Bug
> Components: Parser
> Reporter: Ruiqi Dong
> Assignee: Gary D. Gregory
> Priority: Major
>
> *Summary*
> When CSVParser.Builder#setTrackBytes(true) is enabled, and parsing starts
> from the middle of a larger source, CSVParser adds characterOffset to both
> the character position and the byte position. That is only correct for
> single-byte prefixes. If the skipped prefix contains multi-byte UTF-8
> characters, the value returned by CSVRecord.getBytePosition() is too small.
> *Affected code*
> File: src/main/java/org/apache/commons/csv/CSVParser.java
> {code:java}
> final long startCharPosition = lexer.getCharacterPosition() + characterOffset;
> final long startBytePosition = lexer.getBytesRead() + characterOffset; {code}
> File: src/main/java/org/apache/commons/csv/CSVRecord.java
> {code:java}
> /**
>  * Returns the starting position of this record in the source stream,
>  * measured in bytes.
>  */
> public long getBytePosition() {
>     return bytePosition;
> } {code}
>
> *Reproducer*
> Add the following test to
> src/test/java/org/apache/commons/csv/CSVParserTest.java:
> {code:java}
> @Test
> void testGetBytePositionWithCharacterOffsetAndMultiBytePrefix() throws Exception {
>     final String code = "é,x\nb,c\n";
>     // Character index of the second record ("b,c") within the full source.
>     final long recordOffset = code.indexOf('b');
>     // UTF-8 byte length of the skipped first record; "é" takes two bytes.
>     final long expectedByteOffset = "é,x\n".getBytes(UTF_8).length;
>     // Parse only the tail of the source, as if resuming at the second record.
>     try (CSVParser parser = CSVParser.builder()
>             .setReader(new StringReader(code.substring((int) recordOffset)))
>             .setFormat(CSVFormat.DEFAULT)
>             .setCharset(UTF_8)
>             .setTrackBytes(true)
>             .setCharacterOffset(recordOffset)
>             .setRecordNumber(2)
>             .get()) {
>         final CSVRecord record = parser.nextRecord();
>         assertNotNull(record);
>         assertEquals(recordOffset, record.getCharacterPosition());
>         assertEquals(expectedByteOffset, record.getBytePosition());
>     }
> }{code}
> Run:
> {code:java}
> mvn -q -Dtest=org.apache.commons.csv.CSVParserTest#testGetBytePositionWithCharacterOffsetAndMultiBytePrefix test {code}
> Observed behavior:
> {code:java}
> expected: <5> but was: <4> {code}
> The first record prefix is "é,x\n":
> * character length: 4
> * UTF-8 byte length: 5
> getCharacterPosition() correctly reports 4, but getBytePosition() also
> reports 4, even though the record starts at byte offset 5.
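
The two lengths quoted above can be checked without Commons CSV. The following is a minimal standalone sketch using only the JDK; the class and method names (PrefixLengths, charLength, utf8ByteLength) are made up for illustration and do not appear in the project.

```java
import java.nio.charset.StandardCharsets;

public class PrefixLengths {

    /** Number of Java chars (UTF-16 code units) in the prefix. */
    static int charLength(String prefix) {
        return prefix.length();
    }

    /** Number of bytes the prefix occupies when encoded as UTF-8. */
    static int utf8ByteLength(String prefix) {
        return prefix.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // "\u00e9" is "é"; written as an escape so the source file encoding does not matter.
        String prefix = "\u00e9,x\n";
        System.out.println(charLength(prefix));     // prints 4
        System.out.println(utf8ByteLength(prefix)); // prints 5
    }
}
```

The one-byte difference comes from "é" (U+00E9), which UTF-8 encodes as two bytes.
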
>
> Expected behavior:
> If byte tracking is enabled, CSVRecord.getBytePosition() should report the
> true byte offset in the source stream. For the reproducer above, the record
> "b,c" should start at byte offset 5, not 4.
>
> characterOffset and byte offset are not interchangeable once the skipped
> prefix can contain multi-byte characters. The current implementation:
> * correctly treats characterOffset as a character-space adjustment
> * incorrectly reuses the same value as a byte-space adjustment
> As a result, getBytePosition() becomes unreliable for resumed parsing over
> UTF-8 or other variable-width encodings.
> I think CSVParser likely needs a separate byte-offset input, or it needs to
> avoid applying characterOffset to byte positions when no true byte offset is
> available.
>
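
To illustrate the "separate byte-offset input" idea from the report, here is a rough sketch of how a caller could derive such a value when it still has the skipped prefix at hand. It is not a proposed patch: byteOffsetOf and ResumeOffsets are hypothetical names, nothing here is part of the Commons CSV API, and the sketch assumes that the character offset does not split a surrogate pair and that the charset writes no byte-order mark when encoding (the JDK's UTF_16 charset does, for example).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ResumeOffsets {

    /**
     * Hypothetical helper (not part of Commons CSV): the number of bytes that
     * the first characterOffset chars of source occupy in the given charset.
     */
    static long byteOffsetOf(String source, int characterOffset, Charset charset) {
        return source.substring(0, characterOffset).getBytes(charset).length;
    }

    public static void main(String[] args) {
        String source = "\u00e9,x\nb,c\n";          // "é,x\nb,c\n" from the reproducer
        int characterOffset = source.indexOf('b');  // 4

        // Variable-width encoding: byte offset differs from character offset.
        System.out.println(byteOffsetOf(source, characterOffset, StandardCharsets.UTF_8));      // prints 5

        // Single-byte encoding: the two offsets coincide, which is the only
        // case where reusing characterOffset for bytes gives the right answer.
        System.out.println(byteOffsetOf(source, characterOffset, StandardCharsets.ISO_8859_1)); // prints 4
    }
}
```

A caller resuming from a byte-oriented source (for example after a seek) would normally already know the byte offset, so the helper above is mainly useful for tests like the reproducer.
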