Vadim created NIFI-5525:
---------------------------

             Summary: CSVRecordReader fails with StringIndexOutOfBoundsException when field is a double quote
                 Key: NIFI-5525
                 URL: https://issues.apache.org/jira/browse/NIFI-5525
             Project: Apache NiFi
          Issue Type: Bug
          Components: Core Framework
    Affects Versions: 1.7.1
            Reporter: Vadim

*Bug description:*

When parsing a CSV file in RFC 4180 format in which one of the fields consists of a single double-quote character, CSVRecordReader fails with the following exception:

{quote}
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(String.java:1967)
    at org.apache.nifi.csv.AbstractCSVRecordReader.convert(AbstractCSVRecordReader.java:82)
    at org.apache.nifi.csv.CSVRecordReader.nextRecord(CSVRecordReader.java:102)
    at org.apache.nifi.serialization.RecordReader.nextRecord(RecordReader.java:50)
    at org.apache.nifi.csv.TestCSVRecordReader.testQuote(TestCSVRecordReader.java:610)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
    at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
    at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
    at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
    at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
{quote}

Note that, according to RFC 4180:

{quote}
If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.
{quote}

[https://tools.ietf.org/html/rfc4180#page-2]

A field whose value is a single double-quote character is therefore encoded as """" (four double-quote characters): an opening quote, the escaped quote pair, and a closing quote.

*How to reproduce:*

Add the following method to TestCSVRecordReader.java and run the test:

{code:java}
@Test
public void testQuote() throws IOException, MalformedRecordException {
    final CSVFormat format = CSVFormat.RFC4180.withFirstRecordAsHeader().withTrim().withQuote('"');
    // Header line "name", followed by one record whose only field is a single double quote, encoded as """"
    final String text = "\"name\"\n\"\"\"\"";

    final List<RecordField> fields = new ArrayList<>();
    fields.add(new RecordField("name", RecordFieldType.STRING.getDataType()));
    final RecordSchema schema = new SimpleRecordSchema(fields);

    try (final InputStream bais = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
         final CSVRecordReader reader = new CSVRecordReader(bais, Mockito.mock(ComponentLog.class), schema, format, true, false,
                 RecordFieldType.DATE.getDefaultFormat(), RecordFieldType.TIME.getDefaultFormat(),
                 RecordFieldType.TIMESTAMP.getDefaultFormat(), StandardCharsets.UTF_8.name())) {

        final Record record = reader.nextRecord();
        final String name = (String) record.getValue("name");
        // The decoded field value should be exactly one double-quote character
        assertEquals("\"", name);
    }
}
{code}
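
For illustration only, here is a standalone sketch of the RFC 4180 encoding described above and of one way a substring call can fail on this input. The top frames of the trace show String.substring being called from AbstractCSVRecordReader.convert, but the actual statement at that line is not quoted in this report, so stripNaive below is an assumption about the kind of quote-stripping logic that would produce this exception, not a copy of the NiFi code. All class and method names here (Rfc4180QuoteSketch, encode, stripNaive, stripGuarded) are hypothetical.

```java
class Rfc4180QuoteSketch {

    /** Encodes a value as a quoted RFC 4180 field: every embedded double quote is doubled. */
    public static String encode(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    /**
     * Hypothetical naive stripping of enclosing quotes. It assumes that a value which both
     * starts and ends with a double quote is at least two characters long. For the
     * one-character value consisting of a single double quote, both checks match the same
     * character and substring(1, 0) is called, which throws.
     */
    public static String stripNaive(String value) {
        if (value.startsWith("\"") && value.endsWith("\"")) {
            return value.substring(1, value.length() - 1);
        }
        return value;
    }

    /** Same logic with a length guard, so a lone double quote is returned unchanged. */
    public static String stripGuarded(String value) {
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            return value.substring(1, value.length() - 1);
        }
        return value;
    }

    public static void main(String[] args) {
        // A field whose value is one double quote is written as four double-quote characters.
        System.out.println(encode("\""));

        // After the CSV parser has decoded the field, the value is a single double quote.
        final String decoded = "\"";
        try {
            stripNaive(decoded);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("naive stripping failed: " + e.getMessage());
        }

        // The guarded variant leaves the one-character value intact.
        System.out.println(stripGuarded(decoded));
    }
}
```

If the logic at AbstractCSVRecordReader.java:82 resembles stripNaive, a length check along the lines of stripGuarded would be one possible way to avoid the exception. Whether quotes should be stripped at all after the parser has already decoded the field is a separate question that this sketch does not try to answer.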