[ 
https://issues.apache.org/jira/browse/CSV-323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary D. Gregory updated CSV-323:
--------------------------------
    External issue URL: https://github.com/apache/commons-csv/pull/601

> ExtendedBufferedReader byte tracking leads to an incorrect 
> CSVRecord.getBytePosition()
> --------------------------------------------------------------------------------------
>
>                 Key: CSV-323
>                 URL: https://issues.apache.org/jira/browse/CSV-323
>             Project: Commons CSV
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 1.14.1
>            Reporter: Ruiqi Dong
>            Priority: Major
>
> *Summary*
> ExtendedBufferedReader maintains internal byte-tracking state, and CSVParser 
> later uses that state when it creates CSVRecord instances.
> In the tested scenario below, the byte position of the second record is 
> reported incorrectly. With the input 'aa[|]bb\ncc[|]dd\n' and the delimiter 
> "[|]", the second record starts at byte offset 8, but the parser reports byte 
> offset 6.
>  
> *Affected Code*
> Files:
>  * src/main/java/org/apache/commons/csv/ExtendedBufferedReader.java
>  * src/main/java/org/apache/commons/csv/Lexer.java
>  * src/main/java/org/apache/commons/csv/CSVParser.java
> {code:java}
> @Override
> public int read(final char[] buf, final int offset, final int length) throws IOException {
>     if (length == 0) {
>         return 0;
>     }
>     final int len = super.read(buf, offset, length);
>     if (len > 0) {
>         ...
>         lastChar = buf[offset + len - 1];
>     } else if (len == EOF) {
>         lastChar = EOF;
>     }
>     position += len;
>     return len;
> } {code}
> {code:java}
> boolean isDelimiter(final int ch) throws IOException {
>     ...
>     final int count = reader.read(delimiterBuf, 0, delimiterBuf.length);
>     isLastTokenDelimiter = count != EOF;
>     return isLastTokenDelimiter;
> } {code}
> {code:java}
> final long startBytePosition = lexer.getBytesRead() + characterOffset;
> ...
> result = new CSVRecord(this, recordList.toArray(Constants.EMPTY_STRING_ARRAY),
>         Objects.toString(sb, null), recordNumber, startCharPosition, startBytePosition); {code}
> *Reproducer*
> Add the following test to 
> src/test/java/org/apache/commons/csv/CSVParserTest.java:
> {code:java}
> @Test
> void testBytePositionWithTrackBytesAndMultiCharacterDelimiter() throws IOException {
>     final String code = "aa[|]bb\ncc[|]dd\n";
>     final CSVFormat format = CSVFormat.DEFAULT.builder().setDelimiter("[|]").get();
>     try (CSVParser parser = CSVParser.builder()
>             .setReader(new StringReader(code))
>             .setFormat(format)
>             .setCharset(StandardCharsets.UTF_8)
>             .setTrackBytes(true)
>             .get()) {
>         final Iterator<CSVRecord> it = parser.iterator();
>         final CSVRecord first = it.next();
>         final CSVRecord second = it.next();
>         assertEquals(0, first.getBytePosition());
>         assertEquals(8, second.getBytePosition());
>     }
> }{code}
> Run:
> {code:java}
> mvn -q -Dtest=org.apache.commons.csv.CSVParserTest#testBytePositionWithTrackBytesAndMultiCharacterDelimiter test {code}
>  
> Expected behavior:
>  # the first record starts at byte offset `0`
>  # the second record should start at byte offset `8`
> because the prefix "aa[|]bb\n" is exactly 8 ASCII bytes long.
>  
> In the tested scenario: 
>  # byte tracking is explicitly enabled,
>  # the parser successfully returns both records,
>  # but the second record receives the wrong byte offset (6 instead of 8).
> So the record-position metadata is not reliable in this case.
>  
>  
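
The expected offsets quoted in the report can be checked independently of the parser. The sketch below is only a sanity check, not Commons CSV code: the class ExpectedByteOffsets and the helper recordStartOffsets are hypothetical names, and it assumes LF-terminated records, UTF-8 encoding, and no quoted fields containing line breaks. It also prints the gap between the expected and reported offsets next to the byte length of the delimiter characters after the first one, which may help when narrowing down where bytes go uncounted.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ExpectedByteOffsets {

    // Hypothetical helper, not part of Commons CSV: returns the UTF-8 byte
    // offset at which each LF-terminated line of the input starts.
    // Assumes no quoted fields with embedded line breaks.
    static List<Long> recordStartOffsets(final String input) {
        final List<Long> offsets = new ArrayList<>();
        long bytes = 0;
        int lineStart = 0;
        while (lineStart < input.length()) {
            offsets.add(bytes);
            final int lf = input.indexOf('\n', lineStart);
            // Include the LF itself; the last line may lack a terminator.
            final int lineEnd = lf < 0 ? input.length() : lf + 1;
            bytes += input.substring(lineStart, lineEnd)
                    .getBytes(StandardCharsets.UTF_8).length;
            lineStart = lineEnd;
        }
        return offsets;
    }

    public static void main(final String[] args) {
        final String input = "aa[|]bb\ncc[|]dd\n";
        final List<Long> expected = recordStartOffsets(input);
        System.out.println("expected record starts: " + expected);

        // 6 is the value the report says the parser returns for record 2.
        final long reported = 6;
        final long gap = expected.get(1) - reported;
        // Bytes in the delimiter after its first character ("|]").
        final int delimiterTailBytes =
                "[|]".substring(1).getBytes(StandardCharsets.UTF_8).length;
        System.out.println("gap = " + gap
                + ", delimiter tail bytes = " + delimiterTailBytes);
    }
}
```

For the input in the report, both printed values in the second line are 2. That matches the single delimiter in the first record, but it is an observation about the numbers only and does not by itself establish the cause.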



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
