[ 
https://issues.apache.org/jira/browse/CSV-324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary D. Gregory updated CSV-324:
--------------------------------
    Affects Version/s:     (was: 1.14.1)

> Lexer.isDelimiter() accepts a partial multi-character delimiter at EOF
> ----------------------------------------------------------------------
>
>                 Key: CSV-324
>                 URL: https://issues.apache.org/jira/browse/CSV-324
>             Project: Commons CSV
>          Issue Type: Bug
>          Components: Build
>            Reporter: Ruiqi Dong
>            Assignee: Gary D. Gregory
>            Priority: Minor
>             Fix For: 1.14.2
>
>
> *Summary*
> In the tested scenario below, a truncated multi-character delimiter at EOF is 
> treated as a real delimiter. The relevant code path appears to be 
> Lexer.isDelimiter(), which reports a match whenever the final read of the 
> delimiter suffix returns anything other than EOF, instead of requiring that 
> the entire suffix was actually read.
> *Affected code*
> File: src/main/java/org/apache/commons/csv/Lexer.java
> {code:java}
> private final char[] delimiter;
> private final char[] delimiterBuf;
> boolean isDelimiter(final int ch) throws IOException {
>     isLastTokenDelimiter = false;
>     if (ch != delimiter[0]) {
>         return false;
>     }
>     if (delimiter.length == 1) {
>         isLastTokenDelimiter = true;
>         return true;
>     }
>     reader.peek(delimiterBuf);
>     for (int i = 0; i < delimiterBuf.length; i++) {
>         if (delimiterBuf[i] != delimiter[i + 1]) {
>             return false;
>         }
>     }
>     final int count = reader.read(delimiterBuf, 0, delimiterBuf.length);
>     isLastTokenDelimiter = count != EOF;
>     return isLastTokenDelimiter;
> }{code}
> *Reproducer*
> Add the following test to src/test/java/org/apache/commons/csv/LexerTest.java:
> {code:java}
> @Test
> void testPartialMultiCharacterDelimiterAtEOFIsNotConsumed() throws IOException {
>     final CSVFormat format = CSVFormat.DEFAULT.builder().setDelimiter("[|]").get();
>     try (Lexer lexer = createLexer("a[|]b[|", format)) {
>         assertNextToken(TOKEN, "a", lexer);
>         assertNextToken(EOF, "b[|", lexer);
>     }
> } {code}
> Run:
> {code:java}
> mvn -q -Dtest=org.apache.commons.csv.LexerTest#testPartialMultiCharacterDelimiterAtEOFIsNotConsumed test {code}
> Observed behavior:
> {code:java}
> LexerTest.testPartialMultiCharacterDelimiterAtEOFIsNotConsumed:237
> expected: <EOF> but was: <TOKEN> {code}
> In other words, the trailing "[|" is not preserved as data. Instead, it is 
> treated as a delimiter and produces an extra token boundary.
> Expected behavior:
> The trailing "[|" should remain part of the final token because the full 
> delimiter "[|]" was not present.
>  
> The lexer is recognizing an incomplete delimiter as a complete field 
> separator, which changes the parsed token stream.
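
For illustration only, the expected behavior described above can be sketched outside of Commons CSV. This is a minimal stand-alone example, not the project's fix: the class DelimiterSketch and the methods isFullDelimiter and tokens are hypothetical names, and it uses a plain java.io.BufferedReader with mark/reset instead of the Lexer's own reader. The assumption is that a delimiter should count only when every character of its suffix was actually read and matched.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-alone sketch; this is not the Commons CSV Lexer. */
final class DelimiterSketch {

    /**
     * Returns true only if ch plus the following characters spell the whole
     * delimiter. The suffix is consumed on a match and left unread otherwise.
     */
    static boolean isFullDelimiter(final BufferedReader reader, final int ch, final char[] delimiter)
            throws IOException {
        if (ch != delimiter[0]) {
            return false;
        }
        final int suffixLength = delimiter.length - 1;
        if (suffixLength == 0) {
            return true;
        }
        // Fresh buffer per call, so no characters from an earlier match linger.
        final char[] buf = new char[suffixLength];
        reader.mark(delimiter.length);
        int total = 0;
        while (total < suffixLength) {
            final int n = reader.read(buf, total, suffixLength - total);
            if (n == -1) {
                break; // EOF reached before the whole suffix was available
            }
            total += n;
        }
        // Require the complete suffix, not merely a non-EOF read.
        final boolean match = total == suffixLength
                && Arrays.equals(buf, Arrays.copyOfRange(delimiter, 1, delimiter.length));
        if (!match) {
            reader.reset(); // hand the looked-ahead characters back as data
        }
        return match;
    }

    /** Splits input on a multi-character delimiter using the check above. */
    static List<String> tokens(final String input, final String delimiter) throws IOException {
        final char[] delim = delimiter.toCharArray();
        final List<String> result = new ArrayList<>();
        final StringBuilder current = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(input))) {
            int ch;
            while ((ch = reader.read()) != -1) {
                if (isFullDelimiter(reader, ch, delim)) {
                    result.add(current.toString());
                    current.setLength(0);
                } else {
                    current.append((char) ch);
                }
            }
        }
        result.add(current.toString());
        return result;
    }
}
```

With this sketch, tokens("a[|]b[|", "[|]") yields the two fields a and b[|, which mirrors the token stream the reproducer expects. Allocating the look-ahead buffer per call is a deliberate simplification; a reused buffer would additionally need the number of characters actually read to be checked before comparing.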



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
