Ruiqi Dong created CSV-323:
------------------------------
Summary: ExtendedBufferedReader byte tracking leads to an
incorrect CSVRecord.getBytePosition()
Key: CSV-323
URL: https://issues.apache.org/jira/browse/CSV-323
Project: Commons CSV
Issue Type: Bug
Components: Build
Affects Versions: 1.14.1
Reporter: Ruiqi Dong
*Summary*
ExtendedBufferedReader maintains internal byte-tracking state, and CSVParser
later uses that state when it creates CSVRecord instances.
In the tested scenario below, the byte position of the second record is
reported incorrectly. With the input "aa[|]bb\ncc[|]dd\n" and the delimiter
"[|]", the second record starts at byte offset 8, but the parser reports byte
offset 6.
*Affected Code*
Files:
* src/main/java/org/apache/commons/csv/ExtendedBufferedReader.java
* src/main/java/org/apache/commons/csv/Lexer.java
* src/main/java/org/apache/commons/csv/CSVParser.java
{code:java}
@Override
public int read(final char[] buf, final int offset, final int length) throws IOException {
    if (length == 0) {
        return 0;
    }
    final int len = super.read(buf, offset, length);
    if (len > 0) {
        ...
        lastChar = buf[offset + len - 1];
    } else if (len == EOF) {
        lastChar = EOF;
    }
    position += len;
    return len;
}
{code}
{code:java}
boolean isDelimiter(final int ch) throws IOException {
    ...
    final int count = reader.read(delimiterBuf, 0, delimiterBuf.length);
    isLastTokenDelimiter = count != EOF;
    return isLastTokenDelimiter;
}
{code}
{code:java}
final long startBytePosition = lexer.getBytesRead() + characterOffset;
...
result = new CSVRecord(this, recordList.toArray(Constants.EMPTY_STRING_ARRAY),
        Objects.toString(sb, null), recordNumber, startCharPosition,
        startBytePosition);
{code}
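The excerpts above elide part of read(char[], int, int), so the following is only a toy model, not the Commons CSV implementation. It assumes (unconfirmed here) that bytes are counted on the single-character read path but not on the bulk read path used to consume the rest of a multi-character delimiter. The class ByteCountModel and its method names are hypothetical. Under that assumption, the model reproduces the reported offset of 6 for the sample input:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical model for illustration only; NOT the Commons CSV classes.
class ByteCountModel {
    private final Reader in;
    private long bytesRead;

    ByteCountModel(final String text) {
        this.in = new StringReader(text);
    }

    // Single-character path: counts one byte per character (ASCII input assumed).
    int read() throws IOException {
        final int ch = in.read();
        if (ch != -1) {
            bytesRead++;
        }
        return ch;
    }

    // Bulk path. ASSUMPTION: the byte counter is not advanced here.
    int read(final char[] buf, final int offset, final int length) throws IOException {
        return in.read(buf, offset, length);
    }

    long getBytesRead() {
        return bytesRead;
    }

    // Reads up to and including the first '\n' and returns the byte counter.
    // Simplification: any character equal to the delimiter's first character
    // is treated as the start of a delimiter, and the remainder is consumed
    // with one bulk read without checking that it matches.
    static long bytesBeforeSecondRecord(final String input, final String delimiter) {
        try {
            final ByteCountModel reader = new ByteCountModel(input);
            final char[] rest = new char[delimiter.length() - 1];
            int ch;
            while ((ch = reader.read()) != -1) {
                if (ch == delimiter.charAt(0)) {
                    reader.read(rest, 0, rest.length);
                } else if (ch == '\n') {
                    break;
                }
            }
            return reader.getBytesRead();
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(final String[] args) {
        // prints 6: the two characters "|]" read in bulk are never counted
        System.out.println(bytesBeforeSecondRecord("aa[|]bb\ncc[|]dd\n", "[|]"));
    }
}
```

If this model matches the real behavior, the gap between 8 and 6 equals the two delimiter characters consumed by the bulk read before the second record begins.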
*Reproducer*
Add the following test to
src/test/java/org/apache/commons/csv/CSVParserTest.java:
{code:java}
@Test
void testBytePositionWithTrackBytesAndMultiCharacterDelimiter() throws IOException {
    final String code = "aa[|]bb\ncc[|]dd\n";
    final CSVFormat format = CSVFormat.DEFAULT.builder().setDelimiter("[|]").get();
    try (CSVParser parser = CSVParser.builder()
            .setReader(new StringReader(code))
            .setFormat(format)
            .setCharset(StandardCharsets.UTF_8)
            .setTrackBytes(true)
            .get()) {
        final Iterator<CSVRecord> it = parser.iterator();
        final CSVRecord first = it.next();
        final CSVRecord second = it.next();
        // The first record starts at the beginning of the input.
        assertEquals(0, first.getBytePosition());
        // "aa[|]bb\n" is 8 ASCII bytes, so the second record starts at offset 8.
        assertEquals(8, second.getBytePosition());
    }
}
{code}
Run:
{code}
mvn -q -Dtest=org.apache.commons.csv.CSVParserTest#testBytePositionWithTrackBytesAndMultiCharacterDelimiter test
{code}
Expected behavior:
# the first record starts at byte offset `0`
# the second record should start at byte offset `8`
because the prefix "aa[|]bb\n" is exactly 8 ASCII bytes long.
In the tested scenario:
# byte tracking is explicitly enabled,
# the parser successfully returns both records,
# but the second record receives the wrong byte offset.
So the record-position metadata is not reliable in this case.
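The expected offset of 8 can be checked independently of the parser. The sketch below only encodes the text that precedes the second record and counts its UTF-8 bytes; the class name ExpectedByteOffsets and the helper utf8Length are hypothetical names used for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; it does not use Commons CSV.
class ExpectedByteOffsets {

    // Number of bytes the given text occupies when encoded as UTF-8.
    static long utf8Length(final String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(final String[] args) {
        final String input = "aa[|]bb\ncc[|]dd\n";
        // The second record begins right after the first line terminator.
        final int secondRecordStart = input.indexOf('\n') + 1;
        final String prefix = input.substring(0, secondRecordStart);
        // prints 8
        System.out.println(utf8Length(prefix));
    }
}
```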
--
This message was sent by Atlassian Jira
(v8.20.10#820010)