Gary D. Gregory updated CSV-323:
--------------------------------
External issue URL: https://github.com/apache/commons-csv/pull/601
> ExtendedBufferedReader byte tracking leads to an incorrect
> CSVRecord.getBytePosition()
> --------------------------------------------------------------------------------------
>
> Key: CSV-323
> URL: https://issues.apache.org/jira/browse/CSV-323
> Project: Commons CSV
> Issue Type: Bug
> Components: Build
> Affects Versions: 1.14.1
> Reporter: Ruiqi Dong
> Priority: Major
>
> *Summary*
> ExtendedBufferedReader maintains internal byte-tracking state, and CSVParser
> later uses that state when it creates CSVRecord instances.
> In the tested scenario below, the byte position of the second record is
> reported incorrectly. With the input 'aa[|]bb\ncc[|]dd\n' and the delimiter
> "[|]", the second record starts at byte offset 8, but the parser reports byte
> offset 6.
>
> *Affected Code*
> Files:
> * src/main/java/org/apache/commons/csv/ExtendedBufferedReader.java
> * src/main/java/org/apache/commons/csv/Lexer.java
> * src/main/java/org/apache/commons/csv/CSVParser.java
> {code:java}
> @Override
> public int read(final char[] buf, final int offset, final int length) throws IOException {
>     if (length == 0) {
>         return 0;
>     }
>     final int len = super.read(buf, offset, length);
>     if (len > 0) {
>         ...
>         lastChar = buf[offset + len - 1];
>     } else if (len == EOF) {
>         lastChar = EOF;
>     }
>     position += len;
>     return len;
> }
> {code}
> {code:java}
> boolean isDelimiter(final int ch) throws IOException {
>     ...
>     final int count = reader.read(delimiterBuf, 0, delimiterBuf.length);
>     isLastTokenDelimiter = count != EOF;
>     return isLastTokenDelimiter;
> }
> {code}
> {code:java}
> final long startBytePosition = lexer.getBytesRead() + characterOffset;
> ...
> result = new CSVRecord(this, recordList.toArray(Constants.EMPTY_STRING_ARRAY),
>         Objects.toString(sb, null), recordNumber, startCharPosition, startBytePosition);
> {code}
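
One plausible reading of the three snippets above, which the report does not state outright, is that the single-character read path counts bytes while the bulk read(char[], int, int) path used for the delimiter lookahead only advances position. The sketch below is a simplified, hypothetical model of that situation. ByteTrackingModel, CountingReader, and bytesAfterFirstRecord are names invented for illustration and are not Commons CSV code. The model assumes ASCII input (one char is one byte) and assumes that every '[' starts the delimiter "[|]".

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class ByteTrackingModel {

    /** Hypothetical stand-in for a byte-tracking reader; ASCII only, so 1 char == 1 byte. */
    static final class CountingReader {
        private final Reader in;
        private final boolean countBulkReads;
        long bytesRead;

        CountingReader(final Reader in, final boolean countBulkReads) {
            this.in = in;
            this.countBulkReads = countBulkReads;
        }

        /** Single-character read: always counted. */
        int read() throws IOException {
            final int c = in.read();
            if (c != -1) {
                bytesRead++;
            }
            return c;
        }

        /** Bulk read: counted only when countBulkReads is true. */
        int read(final char[] buf, final int off, final int len) throws IOException {
            final int n = in.read(buf, off, len);
            if (n > 0 && countBulkReads) {
                bytesRead += n;
            }
            return n;
        }
    }

    /** Consumes the first LF-terminated record and returns the byte count seen so far. */
    static long bytesAfterFirstRecord(final String input, final boolean countBulkReads) {
        final CountingReader reader = new CountingReader(new StringReader(input), countBulkReads);
        final char[] delimiterRest = new char[2]; // the "|]" that follows '['
        try {
            int c;
            while ((c = reader.read()) != -1 && c != '\n') {
                if (c == '[') {
                    // Simplification: swallow the rest of the delimiter with one bulk read.
                    reader.read(delimiterRest, 0, delimiterRest.length);
                }
            }
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
        return reader.bytesRead;
    }

    public static void main(final String[] args) {
        final String input = "aa[|]bb\ncc[|]dd\n";
        System.out.println(bytesAfterFirstRecord(input, false)); // prints 6
        System.out.println(bytesAfterFirstRecord(input, true));  // prints 8
    }
}
```

In this model the two characters consumed by the bulk read account for the whole gap between 6 and 8, which matches the numbers in the report. Whether the real reader behaves the same way depends on the code elided by "..." in the snippets.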
> *Reproducer*
> Add the following test to
> src/test/java/org/apache/commons/csv/CSVParserTest.java:
> {code:java}
> @Test
> void testBytePositionWithTrackBytesAndMultiCharacterDelimiter() throws IOException {
>     final String code = "aa[|]bb\ncc[|]dd\n";
>     final CSVFormat format = CSVFormat.DEFAULT.builder().setDelimiter("[|]").get();
>     try (CSVParser parser = CSVParser.builder()
>             .setReader(new StringReader(code))
>             .setFormat(format)
>             .setCharset(StandardCharsets.UTF_8)
>             .setTrackBytes(true)
>             .get()) {
>         final Iterator<CSVRecord> it = parser.iterator();
>         final CSVRecord first = it.next();
>         final CSVRecord second = it.next();
>         assertEquals(0, first.getBytePosition());
>         assertEquals(8, second.getBytePosition());
>     }
> }
> {code}
> Run:
> {code:bash}
> mvn -q -Dtest=org.apache.commons.csv.CSVParserTest#testBytePositionWithTrackBytesAndMultiCharacterDelimiter test
> {code}
>
> Expected behavior:
> # the first record starts at byte offset `0`
> # the second record should start at byte offset `8`
> because the prefix "aa[|]bb\n" is exactly 8 ASCII bytes long.
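
The expected offsets can be checked independently of the parser. The sketch below uses only the JDK; the class ExpectedByteOffsets and its method recordStartOffsets are hypothetical names, and it assumes that every record ends with a single LF and that the input is encoded as UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class ExpectedByteOffsets {

    /** Returns the UTF-8 byte offset at which each LF-terminated line of the input starts. */
    static List<Long> recordStartOffsets(final String input) {
        final List<Long> offsets = new ArrayList<>();
        long bytes = 0;
        int lineStart = 0;
        while (lineStart < input.length()) {
            offsets.add(bytes);
            final int lf = input.indexOf('\n', lineStart);
            final int lineEnd = lf < 0 ? input.length() : lf + 1; // include the LF itself
            bytes += input.substring(lineStart, lineEnd).getBytes(StandardCharsets.UTF_8).length;
            lineStart = lineEnd;
        }
        return offsets;
    }

    public static void main(final String[] args) {
        System.out.println(recordStartOffsets("aa[|]bb\ncc[|]dd\n")); // prints [0, 8]
    }
}
```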
>
> Actual behavior in the tested scenario:
> # byte tracking is explicitly enabled,
> # the parser successfully returns both records,
> # but the second record is reported at byte offset 6 instead of 8.
> So the record-position metadata is not reliable in this case.
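
To illustrate why the wrong offset matters, assume a consumer uses the reported byte position to seek into the encoded input and resume reading at a record boundary. The report does not describe such a consumer; SeekByOffset and tailFrom below are hypothetical names, and the sketch uses only the JDK.

```java
import java.nio.charset.StandardCharsets;

final class SeekByOffset {

    /** Decodes everything from the given byte offset to the end of the UTF-8 encoded input. */
    static String tailFrom(final String input, final int byteOffset) {
        final byte[] bytes = input.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, byteOffset, bytes.length - byteOffset, StandardCharsets.UTF_8);
    }

    public static void main(final String[] args) {
        final String input = "aa[|]bb\ncc[|]dd\n";
        // Offset 8 lands on the start of the second record.
        System.out.println(tailFrom(input, 8).equals("cc[|]dd\n"));    // prints true
        // Offset 6 lands inside the first record, on its last field.
        System.out.println(tailFrom(input, 6).equals("b\ncc[|]dd\n")); // prints true
    }
}
```

With offset 6 the resumed read starts in the middle of the field "bb", so a parser restarted there would see a spurious one-field record before the real second record.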
>
>