[ 
https://issues.apache.org/jira/browse/CSV-325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruiqi Dong closed CSV-325.
--------------------------

> CSVParser applies characterOffset to bytePosition, breaking getBytePosition() 
> for multi-byte prefixes
> -----------------------------------------------------------------------------------------------------
>
>                 Key: CSV-325
>                 URL: https://issues.apache.org/jira/browse/CSV-325
>             Project: Commons CSV
>          Issue Type: Bug
>          Components: Parser
>            Reporter: Ruiqi Dong
>            Assignee: Gary D. Gregory
>            Priority: Major
>             Fix For: 1.15.0
>
>
> *Summary*
> When CSVParser.Builder#setTrackBytes(true) is enabled, and parsing starts 
> from the middle of a larger source, CSVParser adds characterOffset to both 
> the character position and the byte position. That is only correct for 
> single-byte prefixes. If the skipped prefix contains multi-byte UTF-8 
> characters, CSVRecord.getBytePosition() is too small.
> *Affected code*
> File: src/main/java/org/apache/commons/csv/CSVParser.java
> {code:java}
> final long startCharPosition = lexer.getCharacterPosition() + characterOffset;
> final long startBytePosition = lexer.getBytesRead() + characterOffset; {code}
> File: src/main/java/org/apache/commons/csv/CSVRecord.java
> {code:java}
> /**
>  * Returns the starting position of this record in the source stream, measured in bytes.
>  */
> public long getBytePosition() {
>     return bytePosition;
> } {code}
>  
> *Reproducer* 
> Add the following test to 
> src/test/java/org/apache/commons/csv/CSVParserTest.java:
> {code:java}
> @Test
> void testGetBytePositionWithCharacterOffsetAndMultiBytePrefix() throws Exception {
>     final String code = "é,x\nb,c\n";
>     final long recordOffset = code.indexOf('b');
>     final long expectedByteOffset = "é,x\n".getBytes(UTF_8).length;
>     try (CSVParser parser = CSVParser.builder()
>             .setReader(new StringReader(code.substring((int) recordOffset)))
>             .setFormat(CSVFormat.DEFAULT)
>             .setCharset(UTF_8)
>             .setTrackBytes(true)
>             .setCharacterOffset(recordOffset)
>             .setRecordNumber(2)
>             .get()) {
>         final CSVRecord record = parser.nextRecord();
>         assertNotNull(record);
>         assertEquals(recordOffset, record.getCharacterPosition());
>         assertEquals(expectedByteOffset, record.getBytePosition());
>     }
> }{code}
> Run:
> {code:java}
> mvn -q -Dtest=org.apache.commons.csv.CSVParserTest#testGetBytePositionWithCharacterOffsetAndMultiBytePrefix test {code}
> Observed behavior:
> {code:java}
> expected: <5> but was: <4> {code}
> The first record prefix is "é,x\n":
>  * character length: 4
>  * UTF-8 byte length: 5
> getCharacterPosition() correctly reports 4, but getBytePosition() also 
> reports 4, even though the record starts at byte offset 5.
>  
> Expected behavior:
> If byte tracking is enabled, CSVRecord.getBytePosition() should report the 
> true byte offset in the source stream. For the reproducer above, the record 
> "b,c" should start at byte offset 5, not 4.
>  
> characterOffset and byte offset are not interchangeable once the skipped 
> prefix can contain multi-byte characters. The current implementation:
>  * correctly treats characterOffset as a character-space adjustment
>  * incorrectly reuses the same value as a byte-space adjustment
> As a result, getBytePosition() becomes unreliable for resumed parsing over 
> UTF-8 or other variable-width encodings.
> I think CSVParser likely needs a separate byte-offset input, or it needs to 
> avoid applying characterOffset to byte positions when no true byte offset is 
> available.
>  
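
To make the character/byte mismatch described in the report concrete, here is a
minimal sketch of how a caller could compute the byte offset that corresponds
to a character offset when the skipped prefix is still available as a String.
The class ByteOffsets and the method byteOffsetOf are hypothetical names used
only for illustration; they are not part of Commons CSV and do not describe how
the issue was fixed. The sketch assumes the character offset does not fall in
the middle of a surrogate pair.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of Commons CSV.
final class ByteOffsets {

    private ByteOffsets() {
    }

    /**
     * Returns how many bytes the first charOffset chars of source occupy
     * when encoded with charset.
     * Assumption: charOffset does not split a surrogate pair.
     */
    static long byteOffsetOf(final String source, final int charOffset, final Charset charset) {
        return source.substring(0, charOffset).getBytes(charset).length;
    }

    public static void main(final String[] args) {
        // Same input as the reproducer; \u00e9 is 'é', written as an escape
        // so the result does not depend on the source file encoding.
        final String code = "\u00e9,x\nb,c\n";
        final int recordOffset = code.indexOf('b');

        // The character offset and the UTF-8 byte offset differ because
        // 'é' takes two bytes in UTF-8.
        System.out.println("character offset:       " + recordOffset);
        System.out.println("UTF-8 byte offset:      "
                + byteOffsetOf(code, recordOffset, StandardCharsets.UTF_8));
        // In a single-byte encoding the two offsets coincide.
        System.out.println("ISO-8859-1 byte offset: "
                + byteOffsetOf(code, recordOffset, StandardCharsets.ISO_8859_1));
    }
}
```

For the reproducer's input, the character offset is 4 and the UTF-8 byte offset
is 5, which matches the expected value in the failing assertion. The two values
agree only when every skipped character encodes to a single byte.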



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
